Course: Data Mining & Visualization.

Python Data Mining Machine Learning COMP527
January 27, 2020

Last Year’s Notes

Lecturer: Shagufta Scanlon

~ which is to say that she is standing at the front reading through the notes from the previous course lecturer.

Lectures were cancelled at the last minute in the first and third weeks.



Course Summary


Knowledge discovery [= ‘data mining’] is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. Given a set of facts (data) F, a language L, and some measure of certainty C, we define a pattern as a statement S in L that describes relationships among a subset FS of F with a certainty c, such that S is simpler (in some sense) than the enumeration of all facts in FS. A pattern that is interesting (according to a user-imposed interest measure) and certain enough (again according to the user’s criteria) is called knowledge. The output of a program that monitors the set of facts in a database and produces patterns in this sense is discovered knowledge.

(Piatetsky-Shapiro et al., 1992)

…the automated or convenient extraction of patterns representing knowledge implicitly stored or captured in large databases, data warehouses, the Web, … or data streams (Han, page xxi)

…the process of discovering patterns in data. The process must be automatic or (more usually) semiautomatic. The patterns discovered must be meaningful… (Witten, page 5)


Playing Go

Recognizing cat videos


Two Main Goals in Data Mining



Linear Algebra

Vectors. Matrices. Vector Arithmetic: Addition, Hadamard Product, Inner-Product = Dot Product, Outer-Product. Matrix Arithmetic: Addition, Multiplication. Transpose and Inverse. Determinant. Trace. Linear Independence. Rank. Eigenstuff: eigenvectors, eigenvalues, eigendecomposition. Vector Calculus. Differentiation: Product Rule, Quotient Rule, Sum Rule, Chain Rule. Partial Derivatives: Definition, Jacobian, Gradient Vector. Multivariate Chain Rule. Useful Identities.
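A quick NumPy sketch of several of the operations listed above (the vectors and matrix here are made up for illustration):

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

print(a + b)           # vector addition: [4. 6.]
print(a * b)           # Hadamard (element-wise) product: [3. 8.]
print(a @ b)           # inner (dot) product: 11.0
print(np.outer(a, b))  # outer product: a 2x2 matrix

M = np.array([[2.0, 0.0],
              [0.0, 3.0]])
print(M.T)                # transpose
print(np.linalg.inv(M))   # inverse
print(np.linalg.det(M))   # determinant: 6.0
print(np.trace(M))        # trace: 5.0

vals, vecs = np.linalg.eig(M)  # eigendecomposition
print(vals)                    # eigenvalues of a diagonal matrix are its diagonal entries
```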

Jaccard Coefficient

See Wikipedia: J(A, B) = |A ∩ B| / |A ∪ B|, the size of the intersection divided by the size of the union of two sets.

Jupyter Notebook

def jaccard_coefficient(s1, s2):
    """Find the Jaccard coefficient of two sentences.

    Args:
        s1 (str): First sentence.
        s2 (str): Second sentence.

    Returns:
        float: |A ∩ B| / |A ∪ B| over the sets of lower-cased words.
    """
    A = set(s1.lower().split())
    B = set(s2.lower().split())
    jaccard = len(A.intersection(B)) / len(A.union(B))
    print(f'Jaccard coefficient of "{s1}" and "{s2}" is {jaccard}.')
    return jaccard

# e.g. jaccard_coefficient("the cat sat", "the cat ran") returns 2/4 = 0.5

Data Representation


Types of Data


We could represent a given sentence as a set (or vector) of the words it contains, i.e. its features:


“features are attributes of data points that we can use to represent the data points”


  1. Scaling.
  2. Gaussian Normalization.
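A minimal sketch of both steps, assuming a single numeric feature column (the values are invented):

```python
import numpy as np

# Hypothetical feature column (values chosen for illustration).
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# 1. Scaling (min-max): map values into the range [0, 1].
x_scaled = (x - x.min()) / (x.max() - x.min())
print(x_scaled)  # [0.   0.25 0.5  0.75 1.  ]

# 2. Gaussian normalization (z-score): zero mean, unit standard deviation.
x_norm = (x - x.mean()) / x.std()
print(x_norm)
```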





Missing Values


  1. Ignore the data point.

  2. Re-measure.

  3. Set a ‘missing’ constant, e.g. 0.

  4. Replace with the mean.

  5. Predict missing values.

  6. Accept missing values.
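Strategy 4 (replace with the mean) can be sketched with NumPy, assuming missing values are encoded as NaN; the data is made up:

```python
import numpy as np

# Hypothetical feature with missing entries marked as NaN.
x = np.array([3.0, np.nan, 5.0, 7.0, np.nan])

# Replace each missing value with the mean of the observed values.
mean = np.nanmean(x)  # mean over non-missing values: (3 + 5 + 7) / 3 = 5.0
x_imputed = np.where(np.isnan(x), mean, x)
print(x_imputed)  # [3. 5. 5. 7. 5.]
```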

Noisy Values


  1. Manual inspection and removal.

  2. Clustering and outlier detection.

  3. Linear regression and outlier removal.

  4. Ignore values below given frequency threshold.
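One simple way to flag noisy values, in the spirit of approaches 2 and 3 above, is a z-score threshold; this is a simpler stand-in for the clustering or regression methods listed, with made-up data:

```python
import numpy as np

# Five plausible measurements plus one noisy value (50.0).
data = np.array([10.0, 11.0, 9.0, 10.5, 9.5, 50.0])

# Standardize, then keep points within 2 standard deviations of the mean.
z = (data - data.mean()) / data.std()
clean = data[np.abs(z) < 2.0]
print(clean)  # [10.  11.   9.  10.5  9.5]
```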


Redundant values.

Repeated data points.
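Repeated data points can be dropped while keeping the first occurrence; a plain-Python sketch (the points are hypothetical):

```python
# Data points as tuples so they are hashable.
points = [(1, 2), (3, 4), (1, 2), (5, 6), (3, 4)]

seen = set()
unique = []
for p in points:
    if p not in seen:  # keep the first occurrence, drop later repeats
        seen.add(p)
        unique.append(p)

print(unique)  # [(1, 2), (3, 4), (5, 6)]
```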

Over-fitting and Under-fitting.