Brief Introduction to This Course

3 Modules

  1. Clustering - 4 Lectures
    • Representative-based
    • Density-based
    • Hierarchical and Subspace
    • Outlier Detection
  2. Graph Mining - 5 Lectures
    • Spectral Theory and Clustering
    • Community Detection
    • Link Analysis
    • Similarities and Graph Embeddings
    • Graph Convolutional Networks
  3. Pattern Mining - 3 Lectures
    • Frequent Subgraph Mining
    • Frequent Items and Association Rules
    • Sequence Mining
    • Similarities and Stream Mining

Three hand-ins, one per module (⚠️ graded 10% each).

Oral exam 70%

  • A number of topics covering the 3 modules
  • 20 min in total
    • 10 min presentation
    • 8 min questions
    • 2 min grading

Introduction to Data Mining

Why data mining?

Data explosion problem. Desideratum: Useful knowledge!

Solution: Data mining and data warehousing, i.e., extraction of interesting knowledge from data in large databases.

Definition of Data Mining

  • Given lots of data
  • Discover patterns and models that are:
    • Valid: hold on new data with some certainty
    • Useful: should be possible to act on the pattern
    • Unexpected: non-obvious to the system
    • Understandable: humans should be able to interpret the pattern

What’s not data mining?

  • deductive query processing
  • expert system or small ML/statistical programs

To extract knowledge, data needs to be:

  1. stored - systems
  2. managed - databases and data management
  3. analyzed - this class

✅ Data Mining ≈ Big Data ≈ Predictive Analytics ≈ Data Science

Applications

Database Analysis and Decision Support

Market analysis and management

Risk analysis and management

Fraud detection and management

Other Applications

Text mining and web analysis

Intelligent query answering

The KDD (knowledge discovery in databases) process:

  1. Data cleaning and integration of databases/information repositories
  2. Selection, projection, and transformation into a data warehouse
  3. Task-relevant data is fed into data mining methods, yielding patterns
  4. Patterns, after visualization and evaluation, yield knowledge

Basic Data Mining Tasks

Clustering

Association Rules (Frequent Pattern Mining)

Outlier Detection

Other methods

  • Concept Characterization and Discrimination
  • Sequential Patterns
  • Trends and Analysis of Changes
  • Methods for special data types e.g. Spatial data mining, web mining

Clustering

❗Class labels are unknown.

Task: Group objects into sub-groups (clusters)

  • Similarity function to measure similarity between objects (distance)
  • Objective: maximize intra-class similarity and minimize inter-class similarity
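
This objective can be sketched with a minimal k-means loop (illustrative only: the data, initialization, and iteration cap below are made up, and real implementations use smarter initialization such as k-means++):

```python
import numpy as np

def kmeans(X, centroids, iters=100):
    """Minimal k-means: alternate nearest-centroid assignment and mean update."""
    for _ in range(iters):
        # assignment step: each point joins its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each centroid moves to the mean of its cluster
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(len(centroids))])
        if np.allclose(new, centroids):  # converged
            break
        centroids = new
    return labels, centroids

# two synthetic, well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(5.0, 0.1, (20, 2))])
# naive init: one point from each end of the data
labels, _ = kmeans(X, centroids=X[[0, -1]].copy())
```

Minimizing the distance to the assigned centroid is exactly the "maximize intra-class similarity" objective when similarity is inverse distance.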

Applications

  • Customer profiling/segmentation
  • Document or image collections
  • Web access patterns

Compare: Classification

❗ Class labels are known.

  • Task is to distinguish classes
  • predict class membership for new objects

Supervised vs Unsupervised

Supervised: classification - known labels

Unsupervised: clustering - unknown labels

Outlier Detection

What is an outlier?

An object that deviates so much from the rest of the data set as to arouse suspicion that it was generated by a different mechanism.

Application

  • Credit card fraud detection
  • Telecom fraud detection
  • Customer segmentation
  • Medical analysis

(Mostly unsupervised, but can also be supervised or semi-supervised)
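
A minimal unsupervised sketch of this idea: flag values that lie far from the sample mean in standard-deviation units (the data and threshold are invented; since mean and std are themselves sensitive to outliers, robust variants use median-based statistics instead):

```python
import numpy as np

def zscore_outliers(x, thresh):
    """Flag values more than `thresh` standard deviations from the mean."""
    z = (x - x.mean()) / x.std()
    return np.abs(z) > thresh

# hypothetical sensor readings with one suspicious value
x = np.array([10.0, 11.0, 9.5, 10.2, 10.8, 9.9, 10.1, 50.0])
mask = zscore_outliers(x, thresh=2.0)
```

Note how the single extreme value inflates the std so much that a stricter threshold (e.g. 3) would miss it; this "masking" effect is a classic argument for robust estimators.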

Graph Mining

Graph: Structure that models connections between data objects

Data mining on graphs finds additional types of patterns

Association Rules

Association rule mining: find frequent patterns, associations, correlations, or causal structures among sets of items in transactions or other information repositories.


  • Rule form: Body → Head

  • Applications: market-basket analysis, cross-marketing, catalog design

  • Transaction database

  • Associations:

    buys(butter) → buys(milk), buys(butter) → buys(sugar)
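
A toy sketch of support and confidence for such rules, over an invented transaction database (item names echo the example above):

```python
transactions = [
    {"butter", "milk", "bread"},
    {"butter", "milk"},
    {"butter", "sugar"},
    {"milk", "bread"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(body, head):
    """Confidence of body -> head: support(body ∪ head) / support(body)."""
    return support(body | head) / support(body)

# rule buys(butter) -> buys(milk)
s = support({"butter", "milk"})       # 2 of 4 transactions -> 0.5
c = confidence({"butter"}, {"milk"})  # 2 of the 3 butter transactions -> 2/3
```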

Data exploration: explore data attributes individually or extract key characteristics of a data sample

Large part of data mining is data pre-processing 😔

Bonferroni’s principle: Roughly, if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap


Thousands of patterns: not all of them are interesting 😔

A pattern is interesting if it is easily understood by humans, valid on test data, potentially useful, novel, or validates a hypothesis that a user seeks to confirm

Objective vs. subjective interestingness measures

  • Objective: based on statistics and structures of patterns, e.g., support, confidence etc.
  • Subjective: based on users’ beliefs about the data, e.g., unexpectedness, novelty, actionability, etc.

Statistics for Exploratory Data Analysis

Why are we suddenly reviewing probability and statistics... remember the dread of being ruled by the "Probability Theory and Mathematical Statistics" course in sophomore year?

An example dataset has some attributes

Select two attributes for visualization, e.g., using PCA.
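
One way to sketch this projection with plain NumPy (random data here; in practice a library routine such as sklearn.decomposition.PCA would be used): center the data and keep the top-2 right singular vectors.

```python
import numpy as np

def pca_2d(X):
    """Project rows of X onto the top-2 principal components via SVD."""
    Xc = X - X.mean(axis=0)                        # center each attribute
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                           # 2-D coordinates

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))   # invented dataset: 50 objects, 5 attributes
Y = pca_2d(X)                  # 50 objects, 2 derived attributes to plot
```

By construction, the first derived attribute captures at least as much variance as the second.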

All of the following is undergraduate- or high-school-level mathematics:

Random variables

Mean

Sample mean

Median

Mode

Robustness: A statistic is robust if it is not affected by extreme values
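
A quick numeric illustration (invented data): one extreme value drags the mean far away, while the median barely moves.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
x_out = np.append(x, 1000.0)   # add a single extreme value

mean_before, mean_after = x.mean(), x_out.mean()              # 3.0 -> ~169.17
median_before, median_after = np.median(x), np.median(x_out)  # 3.0 -> 3.5
```

So the median is robust in the sense defined above; the mean is not.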

Covariance: used to judge whether two random variables are correlated

Correlation: (linear) dependence

Covariance Matrix

Data Normalization (z-score): $\frac{x-\mu}{\sigma}\sim\mathcal{N}(0,1)$ (exactly standard normal when $x\sim\mathcal{N}(\mu,\sigma^2)$)
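
A small NumPy sketch of the covariance matrix and z-score normalization (random data; note that np.cov defaults to the unbiased n-1 estimator):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))  # 1000 objects, 3 attributes

# covariance matrix of the three attributes (rows = observations)
C = np.cov(X, rowvar=False)

# z-score normalization of the first attribute
x = X[:, 0]
z = (x - x.mean()) / x.std()   # mean 0, standard deviation 1 afterwards
```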

Chi-Square ($\chi^2$) Independence Test (high-school math, right? :happy:)
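
A hand-rolled sketch of the $\chi^2$ statistic on an invented 2×2 contingency table (in practice scipy.stats.chi2_contingency also returns the p-value):

```python
import numpy as np

def chi_square_stat(table):
    """Chi-square statistic for independence on a contingency table."""
    table = np.asarray(table, dtype=float)
    row = table.sum(axis=1, keepdims=True)
    col = table.sum(axis=0, keepdims=True)
    expected = row @ col / table.sum()   # expected counts under independence
    return ((table - expected) ** 2 / expected).sum()

# hypothetical 2x2 table: rows = group, columns = yes/no
stat = chi_square_stat([[30, 10], [20, 40]])
```

A large statistic (relative to the $\chi^2$ distribution with the appropriate degrees of freedom) is evidence against independence.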

Distance and similarities ($d$ dimensions, $s$ common values)

  • Euclidean distance: $\sqrt{2(d-s)}$

  • Hamming distance: $d-s$

  • Cosine similarity: $\cos\theta=\frac{x_i^Tx_j}{\|x_i\|\,\|x_j\|}=\frac{s}{d}$

  • Jaccard coefficient: $\frac{s}{2d-s}$

  • A smaller distance means the objects are more similar.

    Properties of a distance metric:

    1. positive semidefiniteness (non-negativity)
    2. definiteness: distance is zero iff the two objects are equal
    3. symmetry
    4. triangle inequality
  • Minkowski distance: $\left(\sum_k|x_{ik}-x_{jk}|^p\right)^{\frac{1}{p}}$
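
Assuming the formulas above refer to one-hot encoded categorical data ($d$ attributes, $s$ of which agree), they can be verified numerically; the attribute values and encoding below are made up:

```python
import numpy as np

# d = 4 categorical attributes with 3 possible values each; s = 2 agree
x_cat = [0, 1, 2, 1]
y_cat = [0, 1, 0, 2]
d, s = 4, 2

def one_hot(vals, n_values=3):
    """Concatenate one-hot codes of each attribute into one vector."""
    v = np.zeros((len(vals), n_values))
    v[np.arange(len(vals)), vals] = 1
    return v.ravel()

x, y = one_hot(x_cat), one_hot(y_cat)

euclid  = float(np.linalg.norm(x - y))                  # sqrt(2(d-s)) = 2
hamming = d - sum(a == b for a, b in zip(x_cat, y_cat)) # d - s = 2
cosine  = float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))  # s/d = 0.5
jaccard = float((x * y).sum() / ((x + y) > 0).sum())    # s/(2d-s) = 1/3
```

Each mismatched attribute flips two one-hot bits, which is where the factor 2 in the Euclidean formula comes from.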

Discretization

  • equal-width intervals: split the value range into bins of equal length
  • equal-frequency intervals: split into bins holding equally many points (quantiles)
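
A sketch of both binning schemes with NumPy (invented data; the single large value shows how equal-width bins can end up badly skewed while quantile bins stay balanced):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100], dtype=float)

# equal-width: 3 bins of equal length over [min, max]
width_edges = np.linspace(x.min(), x.max(), 4)
width_bins = np.digitize(x, width_edges[1:-1])

# equal-frequency: 3 bins holding (roughly) the same number of points
freq_edges = np.quantile(x, [0, 1/3, 2/3, 1])
freq_bins = np.digitize(x, freq_edges[1:-1])
```

Here equal-width puts eight of nine points in the first bin (the outlier stretches the range), while equal-frequency yields three points per bin.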