##### January 20, 2020

## Lecturer: Dani Arribas-Bel

I am a Senior Lecturer in Geographic Data Science at the Department of Geography and Planning, and a member of the Geographic Data Science Lab, at the University of Liverpool (UK), where I direct the MSc in Geographic Data Science.

## Data Science Papers

Arribas-Bel, D. (2014). Accidental, open and everywhere: Emerging data sources for the understanding of cities.

Lazer & Radford (2017). Data ex Machina: Introduction to Big Data.

Davenport & Patil (2012). Data Scientist: The Sexiest Job of the 21st Century.

Donoho, D. (2015). 50 Years of Data Science.

Tukey, J. (1962). The Future of Data Analysis.

Snow, C.P. (1959, …, 2002). The Two Cultures. Cambridge: University Press.

First mention of data *science*: Peter Naur (1975). Concise Survey of Computer Methods.

## What was Bell Labs?

For a long stretch of the twentieth century, Bell Labs was the most innovative scientific organization in the world.

Gertner (2012). The Idea Factory.

### R vs. Python.

Be a ninja in one. Be able to survive in the other.

## Open Science

Computational Notebooks, Open-Source Packages, and Reproducible Platforms: Jupyter, Python, Docker.

`pandas`

Developed by Wes McKinney while working for an investment management firm.

Now an open source, BSD-licensed library “providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language”.

“For users of the R language for statistical computing, the DataFrame name will be familiar as the object was named after the similar R `data.frame` object.”

McKinney, W. (2018). Python for Data Analysis, p. 5.

## Reading & Manipulating Tabular Data

```
import pandas

# Load the table (path elided in the original notes)
db = pandas.read_csv("...")

# Quick look at the data
db.head()
db.info()
db.describe()
# set_index returns a new DataFrame; db itself keeps "id" as a column
db.set_index("id")
# columns is an attribute, not a method
db.columns

# Columns
db["Price"].head()

# Full slice of one dimension
db.loc[0:]

# Conditional queries
db.loc[
    db["neighbourhood_cleansed"] == "Observatoire",
    ["neighbourhood_cleansed", "Price"],
].head()
db.loc[db["Price"] < 100, ["id", "neighbourhood_cleansed"]].head()

# Conditional filters: a boolean Series used as a mask
fltr = db["bathrooms"] > 8
fltr.head()
db[fltr]

# Concatenated queries: combine conditions with & (and) or | (or),
# wrapping each condition in parentheses
db.loc[(db["Price"] < 100) & (db["bathrooms"] >= 8), :]

# Unique
db.neighbourhood_cleansed.unique()

# New columns
db["more_beds_than_accomodates"] = db["beds"] > db["accommodates"]
```

## Visualization

“matplotlib is the latex of visualization”

~ Declarative (ggplot2) vs. Imperative (matplotlib) Approaches to Visualization.

ggplot2 and the Grammar of Graphics, as implemented by [Hadley Wickham](https://byrneslab.net/classes/biol607/readings/wickham_layered-grammar.pdf)

### Seaborn.

```
import seaborn as sns
```

Because of the West Wing.

“Check out the Seaborn tutorial”

The limits of 1-to-1 mapping.

The box-plot and the violin-plot.
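A minimal sketch of how the two plots differ, assuming made-up data (the `group` and `value` column names and the output filename are hypothetical, not from the lecture). A box plot summarises quartiles, while a violin plot draws a density estimate, so the two humps in group "b" should only be visible in the violin.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
# Hypothetical data: group "a" is unimodal, group "b" is bimodal
a = rng.normal(0, 1, 200)
b = np.concatenate([rng.normal(-2, 0.5, 100), rng.normal(2, 0.5, 100)])
df = pd.DataFrame({
    "group": ["a"] * 200 + ["b"] * 200,
    "value": np.concatenate([a, b]),
})

# Same data, two encodings, side by side on a shared y-axis
fig, (ax_box, ax_violin) = plt.subplots(1, 2, figsize=(8, 4), sharey=True)
sns.boxplot(data=df, x="group", y="value", ax=ax_box)
sns.violinplot(data=df, x="group", y="value", ax=ax_violin)
ax_box.set_title("Box plot")
ax_violin.set_title("Violin plot")
fig.savefig("box_vs_violin.png")
```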

Deceitful visualisations. ‘How Charts Lie’. Australia temperature? https://www.bbc.co.uk/news/world-australia-50585968

matplotlib first principles

## Unsupervised Machine Learning = Clustering.

Recommendation Algorithms and Clustering.

- Geodemographics
- Amazon
- Spotify
- Cambridge Analytica

“Machine learning is about two things: 1. Clustering, and 2. Predicting”

code environment hygiene

pseudo-random number generator

- The Mersenne Twister
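A small sketch of why seeding matters for reproducibility. Python's standard-library `random` module uses the Mersenne Twister; the seed value below is arbitrary. Note that NumPy's newer `default_rng` uses a different generator (PCG64), so the two are not interchangeable.

```python
import random

# Seed the Mersenne Twister so the "random" draws can be repeated
random.seed(12345)  # arbitrary seed, chosen for illustration
first_run = [random.random() for _ in range(3)]

random.seed(12345)  # re-seeding restarts the same sequence
second_run = [random.random() for _ in range(3)]

print(first_run == second_run)  # True: same seed, same draws
```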

Euclidean metrics and multi-dimensional space ~ consider also Manhattan distance metric https://en.wikipedia.org/wiki/Taxicab_geometry
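As a quick illustration of the two metrics, here is a sketch with two made-up points in three-dimensional attribute space (the values are hypothetical):

```python
import numpy as np

# Two hypothetical observations in three-dimensional attribute space
p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 6.0, 3.0])

diff = p - q
euclidean = np.sqrt((diff ** 2).sum())  # straight-line distance
manhattan = np.abs(diff).sum()          # sum of per-axis distances

print(euclidean)  # 5.0
print(manhattan)  # 7.0
```

The Manhattan distance is never smaller than the Euclidean one, and the gap grows when the difference is spread across several dimensions.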

Reducing dimensionality using Principal Component Analysis.

Principal Component Analysis and

Standardized multi-dimensional data removes (some of) the problem of units.
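A sketch of both steps on made-up data, assuming three hypothetical variables measured in very different units. It is hand-rolled with NumPy (z-scores, then a singular value decomposition) rather than taken from the lecture; scikit-learn's `StandardScaler` and `PCA` classes cover the same ground.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical table: 100 observations, 3 variables on very different scales
height_m = rng.normal(1.7, 0.1, 100)
income = 20000 + 30000 * height_m + rng.normal(0, 5000, 100)
age = rng.normal(40, 12, 100)
X = np.column_stack([height_m, income, age])

# Standardize: zero mean and unit variance per column (z-scores)
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via singular value decomposition of the standardized data
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
explained = S ** 2 / (S ** 2).sum()  # share of variance per component
scores = Z @ Vt[:2].T                # project onto the first two components

print(explained.round(3))
print(scores.shape)  # (100, 2)
```

Without the standardization step, `income` would dominate the first component simply because its numbers are larger.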

Tidy Data. ~ ‘Long-Form’. every observation a row; every variable a column. Hadley Wickham and the Tidyverse.
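A minimal sketch of reshaping a wide table into long form with `pandas`, assuming a made-up table with one column per year (the city names and values are hypothetical):

```python
import pandas as pd

# Hypothetical wide table: one column per year
wide = pd.DataFrame({
    "city": ["Liverpool", "Paris"],
    "2018": [10, 20],
    "2019": [11, 22],
})

# melt() stacks the year columns into (year, value) pairs,
# giving one row per city-year observation
long = wide.melt(id_vars="city", var_name="year", value_name="value")
print(long)
```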

Leo Breiman: invented the Random Forest

Statistical Modeling: The Two Cultures

Wired Magazine 2008. The End of Theory: The Data Deluge Makes the Scientific Method Obsolete

Machine Learning: An Applied Econometric Approach

~ inference plus prediction

names of gradients and constants: beta and alpha, or m plus c
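The two notations describe the same straight line: y = alpha + beta * x in econometrics, y = m * x + c in school algebra. A small sketch with made-up points that lie exactly on a line, using NumPy's `polyfit`:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x  # exact line: intercept 1, slope 2

# polyfit returns coefficients from the highest degree down: [slope, intercept]
beta, alpha = np.polyfit(x, y, deg=1)  # m and c in school notation

print(round(beta, 6), round(alpha, 6))  # 2.0 1.0
```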

Social Scientists like to have a Theory before they model Data.

“There’s no such thing as a free lunch”

“As we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns—the ones we don’t know we don’t know.” Donald Rumsfeld

## ‘Overfitting & Cross-Validation’

Overfitting is bad. It happens when your model fits itself to the random idiosyncrasies of your sample, rather than the fundamental generative properties of the data.

Linear models are good because they are less prone to overfitting. They are innately parsimonious and simple.
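A sketch of the idea on simulated data, assuming a hypothetical generative process (a straight line plus noise). A flexible polynomial always fits the training sample at least as well as a straight line, but it typically does worse on fresh data drawn from the same process.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_sample(n):
    # Hypothetical generative process: a straight line plus noise
    x = rng.uniform(0, 1, n)
    y = 1.0 + 2.0 * x + rng.normal(0, 0.3, n)
    return x, y

x_train, y_train = make_sample(12)
x_test, y_test = make_sample(200)

def mse(coefs, x, y):
    # Mean squared error of a fitted polynomial on the sample (x, y)
    return float(np.mean((np.polyval(coefs, x) - y) ** 2))

results = {}
for degree in (1, 6):
    coefs = np.polyfit(x_train, y_train, deg=degree)
    results[degree] = {
        "train": mse(coefs, x_train, y_train),
        "test": mse(coefs, x_test, y_test),
    }
    print(degree, results[degree])
```

Comparing the `train` and `test` entries for each degree is the essence of validation: only the error on data the model has not seen says anything about overfitting.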

Contrast `scikit-learn` with `statsmodels`: the former is focussed on prediction, the latter on p-values.

~ scikit-learn returns mean squared error as a negative number (`neg_mean_squared_error`), because its scorers follow a higher-is-better convention: see discussion
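A short sketch of that sign convention in cross-validation, assuming simulated data (the coefficients and noise level are made up). The scores come back non-positive, and flipping the sign recovers the ordinary mean squared error per fold.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
# Hypothetical data: one predictor, linear signal plus noise
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 + 0.5 * X[:, 0] + rng.normal(0, 1.0, 100)

# Scorers follow "higher is better", so MSE is reported with its sign flipped
scores = cross_val_score(
    LinearRegression(), X, y, cv=5, scoring="neg_mean_squared_error"
)
mse_per_fold = -scores  # flip the sign back to get ordinary MSE

print(scores)        # five non-positive numbers
print(mse_per_fold)  # five non-negative numbers
```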