January 20, 2020
Lecturer: Dani Arribas-Bel
I am a Senior Lecturer in Geographic Data Science at the Department of Geography and Planning, and a member of the Geographic Data Science Lab, at the University of Liverpool (UK), where I direct the MSc in Geographic Data Science.
Data Science Papers
Arribas-Bel, D. (2014). Accidental, open and everywhere: Emerging data sources for the understanding of cities.
Lazer & Radford (2017). Data ex Machina: Introduction to Big Data.
Davenport & Patil (2012). Data Scientist: The Sexiest Job of the 21st Century.
Donoho, D. (2015). 50 Years of Data Science.
Snow, C.P. (1959, …, 2002). The Two Cultures. Cambridge: University Press.
First mention of data science: Peter Naur (1974). Concise Survey of Computer Methods.
What was Bell Labs?
For a long stretch of the twentieth century, Bell Labs was the most innovative scientific organization in the world.
R vs. Python.
Be a ninja in one. Be able to survive in the other.
Computational Notebooks, Open-Source Packages, and Reproducible Platforms: Jupyter, Python, Docker.
Developed by Wes McKinney while working for an investment management firm.
Now an open source, BSD-licensed library “providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language”.
“For users of the R language for statistical computing, the DataFrame name will be familiar as the object was named after the similar R data.frame object.”
Reading & Manipulating Tabular Data
    import pandas
    db = pandas.read_csv("...")
    db.head()
    db.info()
    db.describe()
    db.set_index("id")  # returns a new DataFrame; db itself is unchanged
    db.columns  # an attribute, not a method
    # Columns
    db["Price"].head()
    # Full slice of one dimension
    db.loc[0:]
    # Conditional queries
    db.loc[db["neighbourhood_cleansed"] == "Observatoire",
           ["neighbourhood_cleansed", "Price"]]\
      .head()
    db.loc[db["Price"] < 100, ["id", "neighbourhood_cleansed"]].head()
    db.loc[(db["Price"] < 100) &
           (db["bathrooms"] >= 8), :]
    # Conditional filters
    fltr = db["bathrooms"] > 8
    fltr.head()
    db[fltr]
    # Concatenated queries
    db.loc[(db["Price"] < 100) &
           (db["bathrooms"] >= 8), :]
    # Unique
    db.neighbourhood_cleansed.unique()
    # New columns
    db["more_beds_than_accomodates"] = db["beds"] > db["accommodates"]
~ Declarative (ggplot2) vs. Imperative (matplotlib) Approaches to Visualization.
ggplot2 and the Grammar of Graphics, as implemented by [Hadley Wickham](https://byrneslab.net/classes/biol607/readings/wickham_layered-grammar.pdf).
import seaborn as sns
“Check out the Seaborn tutorial”
The limits of 1-to-1 mapping.
The box-plot and the violin-plot.
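As a rough illustration of what a box-plot encodes, the sketch below computes the quartiles and the usual 1.5 × IQR whisker fences with the standard library. The prices are made up for the example, and the 1.5 × IQR rule is the common Tukey convention rather than something these notes prescribe. A violin-plot would add an estimate of the full density on top of this summary.

```python
import statistics

# Hypothetical nightly prices, invented for illustration only.
prices = [35, 40, 42, 45, 50, 55, 60, 70, 250]

# The three quartiles: the lower edge, middle line and upper edge of the box.
q1, median, q3 = statistics.quantiles(prices, n=4)
iqr = q3 - q1  # interquartile range: the height of the box

# Points beyond 1.5 * IQR from the box are usually drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [p for p in prices if p < lower_fence or p > upper_fence]

print("Q1, median, Q3:", q1, median, q3)
print("outliers:", outliers)
```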
Deceitful visualisations. ‘How Charts Lie’. Australia temperature? https://www.bbc.co.uk/news/world-australia-50585968
matplotlib first principles
Unsupervised Machine Learning = Clustering.
Recommendation Algorithms and Clustering.
- Cambridge Analytica
“Machine learning is about two things: 1. Clustering, and 2. Predicting”
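To make the clustering half of that statement concrete, here is a minimal sketch of k-means (Lloyd's algorithm) in plain Python. The points, the choice of starting centroids and the function name `kmeans` are all assumptions made for the example; in practice one would more likely reach for a library implementation.

```python
import math

def kmeans(points, centroids, n_iter=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    centroids = list(centroids)  # work on a copy of the starting centroids
    for _ in range(n_iter):
        # Assignment step: label each point with its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

# Two made-up, well-separated groups of points.
points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 8.5), (8.5, 9)]
labels, centres = kmeans(points, [points[0], points[3]])
print(labels)
print(centres)
```

No outcome variable is supplied anywhere: the groups come only from distances between observations, which is what makes the method unsupervised.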
code environment hygiene
- The Mersenne Twister
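Python's built-in `random` module uses the Mersenne Twister as its generator, so fixing the seed makes "random" results repeatable. The sketch below shows the idea; the seed value 1234 is arbitrary.

```python
import random

random.seed(1234)  # fix the generator's starting state
first = [random.random() for _ in range(3)]

random.seed(1234)  # same seed, so the same sequence follows
second = [random.random() for _ in range(3)]

print(first == second)

# A private generator avoids interfering with other code that
# relies on the module-level random state.
rng = random.Random(1234)
third = [rng.random() for _ in range(3)]
print(first == third)
```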
Reducing dimensionality using Principal Component Analysis.
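For two variables, the first principal component can be written in closed form from the covariance matrix, which makes for a compact sketch. The data and the function name `first_principal_axis` are assumptions for illustration, and the code assumes the data are not all identical. With more dimensions one would use a library eigendecomposition instead.

```python
import math

def first_principal_axis(xs, ys):
    """Direction of greatest variance for 2-D data, and its share of variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Entries of the 2x2 sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Largest eigenvalue of that matrix, in closed form.
    largest = (sxx + syy) / 2 + math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    # Angle of the matching eigenvector.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    axis = (math.cos(theta), math.sin(theta))
    share = largest / (sxx + syy)  # fraction of total variance retained
    return axis, share

# Made-up data lying close to the line y = x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 1.9, 3.2, 3.8, 5.0]
axis, share = first_principal_axis(xs, ys)
print(axis, share)
```

Projecting each centred point onto `axis` replaces two columns with one while keeping most of the variance.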
Principal Component Analysis and
Standardized multi-dimensional data removes (some of) the problem of units.
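A small sketch of why standardising helps with units: the same heights recorded in centimetres and in metres produce identical z-scores. The values and the helper name `standardise` are made up for the example.

```python
import statistics

def standardise(values):
    """Return z-scores: subtract the mean, divide by the standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [(v - mean) / sd for v in values]

heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
heights_m = [h / 100 for h in heights_cm]

z_cm = standardise(heights_cm)
z_m = standardise(heights_m)
print(z_cm)
print(z_m)
```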
Tidy Data. ~ ‘Long-Form’. Every observation a row; every variable a column. Hadley Wickham and the Tidyverse.
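The reshape from 'wide' to 'long' form can be sketched in plain Python. The table below is invented for illustration. In pandas the equivalent operation is `DataFrame.melt`.

```python
# Hypothetical wide table: one row per city, one column per year.
wide = [
    {"city": "Liverpool", "2019": 10, "2020": 12},
    {"city": "Paris", "2019": 20, "2020": 19},
]

# Long form: one row per (city, year) observation, one column per variable.
tidy = [
    {"city": row["city"], "year": year, "value": row[year]}
    for row in wide
    for year in ("2019", "2020")
]

for record in tidy:
    print(record)
```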
~ inference plus prediction
names of gradients and constants: beta and alpha, or m plus c
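Whatever the names, the two numbers are the same: the constant (alpha, or c) is the intercept and the gradient (beta, or m) is the slope of y = alpha + beta * x. A minimal least-squares sketch, with made-up data and a hypothetical helper `fit_line`:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = alpha + beta * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    beta = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))
    # Intercept: the fitted line passes through the point of means.
    alpha = my - beta * mx
    return alpha, beta

# Points that lie exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
alpha, beta = fit_line(xs, ys)
print(alpha, beta)
```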
Social Scientists like to have a Theory before they model Data.
“There’s no such thing as a free lunch”
“As we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns—the ones we don’t know we don’t know.” Donald Rumsfeld
‘Overfitting & Cross-Validation’
Overfitting is bad. It happens when your model fits itself to the random idiosyncrasies of your sample, rather than the fundamental generative properties of the data.
Linear Models are good because they are less prone to overfitting. They are innately parsimonious and simple.
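One way to see the difference is to compare in-sample error with cross-validated error. The sketch below sets a least-squares line against a model that simply memorises the training data (1-nearest-neighbour). The synthetic data, the seed, the interleaved fold scheme and all function names are assumptions for the example, not the course's own code.

```python
import random

def mse(model, xs, ys):
    """Mean squared error of a prediction function on (xs, ys)."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit_line(xs, ys):
    """Least-squares line; returns a prediction function."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    beta = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))
    alpha = my - beta * mx
    return lambda x: alpha + beta * x

def fit_memoriser(xs, ys):
    """1-nearest-neighbour: repeats the y of the closest training x."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def cross_val_mse(fit, xs, ys, k=5):
    """Average held-out MSE over k interleaved folds."""
    scores = []
    for fold in range(k):
        train = [i for i in range(len(xs)) if i % k != fold]
        test = [i for i in range(len(xs)) if i % k == fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(mse(model, [xs[i] for i in test], [ys[i] for i in test]))
    return sum(scores) / k

rng = random.Random(0)  # seeded so the run is repeatable
xs = [float(i) for i in range(30)]
ys = [2 * x + 1 + rng.gauss(0, 3) for x in xs]  # synthetic: a line plus noise

for name, fit in [("line", fit_line), ("memoriser", fit_memoriser)]:
    in_sample = mse(fit(xs, ys), xs, ys)
    held_out = cross_val_mse(fit, xs, ys)
    print(f"{name}: in-sample MSE {in_sample:.2f}, cross-validated MSE {held_out:.2f}")
```

The memoriser scores a perfect zero on the data it was trained on, which says nothing about new data. The cross-validated figure is the one to compare across models.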
former focussed on prediction, the latter on p-values.
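~ scikit-learn reports mean squared error as a negative number when it is used as a cross-validation score (`neg_mean_squared_error`), so that a higher score is always better: see discussion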