Course: Data Science Studio.

Python pandas Machine Learning Clustering Linear Regression ENVS615
January 20, 2020

Lecturer: Dani Arribas-Bel

I am a Senior Lecturer in Geographic Data Science at the Department of Geography and Planning, and a member of the Geographic Data Science Lab, at the University of Liverpool (UK), where I direct the MSc in Geographic Data Science.

Data Science Papers

Arribas-Bel, D. (2014). Accidental, open and everywhere: Emerging data sources for the understanding of cities.

Lazer & Radford (2017). Data ex Machina: Introduction to Big Data.

Davenport and Patil (2012). Data Scientist: The Sexiest Job of the 21st Century.

Donoho, D. (2015). 50 Years of Data Science.

Tukey, J. (1962). The Future of Data Analysis.

Snow, C.P. (1959, …, 2002). The Two Cultures. Cambridge: University Press.

First mention of data science: Peter Naur (1974). Concise Survey of Computer Methods.

What was Bell Labs?

For a long stretch of the twentieth century, Bell Labs was the most innovative scientific organization in the world.

Gertner (2012). The Idea Factory.

R vs. Python.

Be a ninja in one. Be able to survive in the other.

Open Science

Computational Notebooks, Open-Source Packages, and Reproducible Platforms: Jupyter, Python, Docker.

pandas

Developed by Wes McKinney while working for an investment management firm.

Now an open source, BSD-licensed library “providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language”.

“For users of the R language for statistical computing, the DataFrame name will be familiar as the object was named after the similar R data.frame object.”

McKinney, W (2018). Python For Data Analysis: p.5

Reading & Manipulating Tabular Data

import pandas

db = pandas.read_csv("...")
db.head()
db.info()
db.describe()
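# set_index returns a new DataFrame; db itself keeps its original index
db.set_index("id")

# Column names (an attribute, not a method)
db.columns

# Columns
db["Price"].head()

# Full slice of one dimension
db.loc[0:]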

# Conditional queries
db.loc[db["neighbourhood_cleansed"] == "Observatoire", 
       ["neighbourhood_cleansed", "Price"]]\
  .head()
  
db.loc[db["Price"] < 100, ["id", "neighbourhood_cleansed"]].head()

db.loc[(db["Price"] < 100) & \
       (db["bathrooms"] >= 8),
       :]
       
# Conditional filters
fltr = db["bathrooms"] > 8
fltr.head()
db[fltr]

# Concatenated queries
db.loc[(db["Price"] < 100) & \
       (db["bathrooms"] >= 8),
       :]

# Unique       
db.neighbourhood_cleansed.unique()

# New columns
db["more_beds_than_accommodates"] = db["beds"] > db["accommodates"]
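
To try the queries above without the original file, here is a minimal sketch on a made-up table. The `toy` name and every value in it are hypothetical; only the column names are reused from the notes.

```python
import pandas

# Hypothetical toy table; values are invented for illustration only
toy = pandas.DataFrame({
    "id": [1, 2, 3, 4],
    "neighbourhood_cleansed": ["Observatoire", "Louvre", "Observatoire", "Temple"],
    "Price": [80, 150, 95, 60],
    "bathrooms": [1, 2, 8, 9],
})

# Boolean mask: one True/False value per row
cheap = toy["Price"] < 100

# Combine masks with & (and) or | (or); wrap each condition in parentheses
cheap_many_baths = toy.loc[cheap & (toy["bathrooms"] >= 8), ["id", "Price"]]

# set_index returns a new DataFrame; toy keeps "id" as an ordinary column
indexed = toy.set_index("id")
```

Because `set_index` does not modify `toy`, later queries that select the `"id"` column keep working.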


Visualization

“matplotlib is the LaTeX of visualization”

~ Declarative (ggplot2) vs. Imperative (matplotlib) Approaches to Visualization.

ggplot2 and the Grammar of Graphics, as implemented by [Hadley Wickham](https://byrneslab.net/classes/biol607/readings/wickham_layered-grammar.pdf)

Seaborn.

import seaborn as sns

Because of the West Wing.

Check out the Seaborn tutorial.

The limits of 1-to-1 mapping.

The box-plot and the violin-plot.
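
A minimal sketch of the two plots side by side, assuming seaborn and matplotlib are installed. The data are randomly generated and the `group` and `value` column names are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data: two groups with different centres and spreads
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "group": np.repeat(["a", "b"], 100),
    "value": np.concatenate([rng.normal(0, 1, 100), rng.normal(2, 0.5, 100)]),
})

fig, axes = plt.subplots(1, 2, figsize=(8, 3), sharey=True)
# Box-plot: median, quartiles and whiskers
sns.boxplot(data=toy, x="group", y="value", ax=axes[0])
# Violin-plot: the same summary plus a density estimate of the full shape
sns.violinplot(data=toy, x="group", y="value", ax=axes[1])
```

The box-plot compresses each group to a handful of summary numbers, while the violin-plot also shows the shape of the distribution.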

Deceitful visualisations. ‘How Charts Lie’. Australia temperature? https://www.bbc.co.uk/news/world-australia-50585968

matplotlib first principles

Unsupervised Machine Learning = Clustering.

Clustering Algorithms

K-Means Algorithm ~ Andrew Ng
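
A minimal sketch of the K-Means idea (Lloyd's algorithm) in plain NumPy, assuming Euclidean distance and random initial centroids. The `k_means` function and the two blobs are hypothetical illustrations, not the implementation used in class.

```python
import numpy as np

def k_means(points, k, n_iter=100, seed=0):
    """Alternate between assigning points and updating centroids."""
    rng = np.random.default_rng(seed)
    # Start from k distinct observations picked at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Euclidean distance from every point to every centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its members (keep it if empty)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated hypothetical blobs
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[5, 5], scale=0.5, size=(50, 2))
points = np.vstack([blob_a, blob_b])
labels, centroids = k_means(points, k=2)
```

The result depends on the starting centroids, which is one reason to fix the random seed when clustering.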

Recommendation Algorithms and Clustering:

- Geodemographics
- Amazon
- Spotify
- Cambridge Analytica

“Machine learning is about two things: 1. Clustering, and 2. Predicting”

code environment hygiene

pseudo-random number generator - The Mersenne Twister
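
Python's built-in `random` module uses the Mersenne Twister, so it makes a convenient sketch of why seeding matters for reproducibility. The seed value 1234 is arbitrary.

```python
import random

random.seed(1234)  # fix the starting state of the generator
first = [random.random() for _ in range(3)]

random.seed(1234)  # same seed, same starting state
second = [random.random() for _ in range(3)]

# Same seed gives the same "random" sequence
same_sequence = first == second
```

NumPy's legacy `RandomState` is also a Mersenne Twister, whereas `numpy.random.default_rng` uses a different generator (PCG64); in both cases, fixing the seed makes results repeatable.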

Euclidean metrics and multi-dimensional space ~ consider also Manhattan distance metric https://en.wikipedia.org/wiki/Taxicab_geometry
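
A minimal sketch comparing the two metrics, using only the standard library. The function names and the two points are hypothetical.

```python
import math

def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Taxicab distance: sum of the absolute differences along each axis."""
    return sum(abs(x - y) for x, y in zip(a, b))

p, q = (0, 0), (3, 4)
d_euclid = euclidean(p, q)     # 5.0
d_manhattan = manhattan(p, q)  # 7
```

Both functions work unchanged for points with any number of dimensions, as long as the two points have the same length.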

Reducing dimensionality using Principal Component Analysis.
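
A minimal sketch of PCA through the singular value decomposition of centred data, in plain NumPy. The data are randomly generated, with two of the three variables deliberately correlated; all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 observations, 3 variables, two strongly correlated
x = rng.normal(size=200)
data = np.column_stack([
    x,
    2 * x + rng.normal(scale=0.1, size=200),
    rng.normal(size=200),
])

# Centre each variable, then decompose; rows of vt are the principal axes
centred = data - data.mean(axis=0)
u, s, vt = np.linalg.svd(centred, full_matrices=False)

# Share of total variance captured by each component
explained = s ** 2 / np.sum(s ** 2)

# Project onto the first two components: 3 columns become 2
scores = centred @ vt[:2].T
```

Because two variables carry nearly the same information, the first component should capture most of the variance and little is lost by dropping the last one.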

Principal Component Analysis and:

- Latent Variable Model
- Structural Equation Modelling


Standardized multi-dimensional data removes (some of) the problem of units.
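
A minimal sketch of standardisation (z-scores) with NumPy. The two columns stand for variables in different units, for example price and floor area; the values are invented.

```python
import numpy as np

# Hypothetical data: column 0 in euros, column 1 in square metres
data = np.array([
    [100.0, 20.0],
    [250.0, 45.0],
    [400.0, 70.0],
    [150.0, 25.0],
])

# Subtract each column's mean and divide by its standard deviation
z = (data - data.mean(axis=0)) / data.std(axis=0)
```

After the transformation every column has mean 0 and standard deviation 1, so no variable dominates a distance calculation just because of its units.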

Tidy Data. ~ ‘Long-Form’. Every observation a row; every variable a column. Hadley Wickham and the Tidyverse.
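
A minimal sketch of moving from a wide table to long form with `pandas.DataFrame.melt`. The table, its values and the `year` and `value` column names are hypothetical.

```python
import pandas as pd

# Wide form: one column per year
wide = pd.DataFrame({
    "city": ["Liverpool", "Paris"],
    "2018": [10, 20],
    "2019": [11, 22],
})

# Long form: one row per (city, year) observation
long = wide.melt(id_vars="city", var_name="year", value_name="value")
```

In the long table the year is a variable in its own column rather than being encoded in the column names.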

Leo Breiman: invented the Random Forest

Statistical Modeling: The Two Cultures

Wired Magazine 2008. The End of Theory: The Data Deluge Makes the Scientific Method Obsolete

Machine Learning: An Applied Econometric Approach

~ inference plus prediction

Names for the gradient and the constant: beta and alpha (y = alpha + beta x), or m and c (y = mx + c).
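
A minimal sketch showing that the two naming conventions describe the same line, assuming a simple one-variable least-squares fit. The data are simulated with a hypothetical true constant of 3 and gradient of 2.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
# Hypothetical process: constant 3, gradient 2, plus noise
y = 3.0 + 2.0 * x + rng.normal(scale=0.5, size=x.size)

# Econometrics naming: y = alpha + beta * x
beta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
alpha = y.mean() - beta * x.mean()

# School naming: y = m * x + c (polyfit returns the highest power first)
m, c = np.polyfit(x, y, 1)
```

Here `beta` matches `m` and `alpha` matches `c` up to floating-point error; only the labels differ.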

‘hedonic models’

Social Scientists like to have a Theory before they model Data.

“There’s no such thing as a free lunch”

“As we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns—the ones we don’t know we don’t know.” Donald Rumsfeld

Andrew Gelman


‘Overfitting & Cross-Validation’

Overfitting is bad. It happens when your model fits itself to the random idiosyncrasies of your sample, rather than the fundamental generative properties of the data.

Linear models are attractive because they are less prone to overfitting. They are innately parsimonious and simple, although adding many predictors can still make them overfit.
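
A minimal sketch of overfitting and k-fold cross-validation in NumPy, comparing a straight line with a degree-9 polynomial on data simulated from a linear process. The helper names `train_mse` and `cv_mse` and all parameter values are hypothetical.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=30)
# Hypothetical linear process plus noise
y = 1.0 + 2.0 * x + rng.normal(scale=0.3, size=30)

def train_mse(x, y, degree):
    """Mean squared error on the same data used for fitting."""
    model = Polynomial.fit(x, y, degree)
    return float(np.mean((y - model(x)) ** 2))

def cv_mse(x, y, degree, k=5):
    """Mean squared error on held-out folds, averaged over k folds."""
    folds = np.array_split(np.arange(len(x)), k)
    errors = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(x)), held_out)
        model = Polynomial.fit(x[train], y[train], degree)
        errors.append(np.mean((y[held_out] - model(x[held_out])) ** 2))
    return float(np.mean(errors))

in_sample = {d: train_mse(x, y, d) for d in (1, 9)}
out_of_sample = {d: cv_mse(x, y, d) for d in (1, 9)}
```

The flexible model always looks at least as good in-sample, since it can chase the noise. On held-out folds it will typically do worse than the straight line, which is the signal cross-validation is designed to reveal.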

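Contrast scikit-learn with statsmodels: the former is focussed on prediction, the latter on inference (p-values).

~ scikit-learn reports mean squared error as a negative score (`neg_mean_squared_error`), because its scorers follow a higher-is-better convention: see discussion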
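
A minimal sketch of the sign convention, assuming scikit-learn is installed. The data are simulated and the coefficients are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
# Hypothetical linear process plus noise
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=100)

# Scores are negated so that "higher is better" holds for every scorer
scores = cross_val_score(
    LinearRegression(), X, y, cv=5, scoring="neg_mean_squared_error"
)

# Flip the sign to recover the usual mean squared error per fold
mse_per_fold = -scores
```

Flipping the sign gives one ordinary mean squared error per fold, which can then be averaged as usual.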