数据挖掘 1 Introduction
Brief Introduction to This Course
3 Modules
- Clustering - 4 Lectures
- Representative-based
- Density-based
- Hierarchical and Subspace
- Outlier Detection
- Graph Mining - 5 Lectures
- Spectral Theory and Clustering
- Community Detection
- Link Analysis
- Similarities and Graph Embeddings
- Graph Convolutional Networks
- Pattern Mining - 3 Lectures
- Frequent Subgraph Mining
- Frequent Items and Association Rules
- Sequence Mining
- Similarities and Stream Mining
Three hand-ins (⚠️Graded 10% each) for each topic.
Oral exam 70%
- A number of topics covering the 3 modules
- 20 min in total
- 10 min presentation
- 8 min questions
- 2 min grading
Introduction to Data Mining
Why data mining?
Data explosion problem. Desideratum: Useful knowledge!
Solution: Data mining and data warehousing, i.e., extraction of interesting knowledge from data in large databases.
Definition of Data Mining
- Given lots of data
- Discover patterns and models that are:
- Valid: hold on new data with some certainty
- Useful: should be possible to act on the item
- Unexpected: non-obvious to the system
- Understandable: humans should be able to interpret the pattern
What’s not data mining?
- deductive query processing
- expert system or small ML/statistical programs
To extract the knowledge data needs to be:
- stored - systems
- managed - databases and data management
- analyzed - this class
✅ Data Mining ≈ Big Data ≈ Predictive Analytics ≈ Data Science
Applications
Database Analysis and Decision Support
Market analysis and management
Risk analysis and management
Fraud detection and management
Other Applications
Text mining and web analysis
Intelligent query answering
The KDD (knowledge discovery in databases) process:
- Data cleaning/integration in databases/info repositories
- Selections/projections/transformations in data warehouse
- task relevant data fed into data mining methods, gains patterns
- Pattens after visualization and evaluation, gains knowledge.
Basic Data Mining Tasks
Clustering
Association Rules (Frequent Pattern Mining)
Outlier Detection
Other methods
- Concept Characterization and Discrimination
- Sequential Patterns
- Trends and Analysis of Changes
- Methods for special data types e.g. Spatial data mining, web mining
Clustering
❗Class labels are unknown.
Task: Group objects into sub-groups (clusters)
- Similarity function to measure similarity between objects (distance)
- Objective: maximize intra-class similarity and minimize interclass similarity
Applications
- Customer profiling/segmentation
- Document or image collections
- Web access patterns
Compare: Classification
❗ Class labels are known.
- Task is to distinguish classes
- predict class membership for new objects
Supervised vs Unsupervised
Supervised: classification - known labels
Unsupervised: clustering - unknown label
Outlier Detection
What is outlier?
An object that deviates so much from the rest of the data set as to arouse suspicion that it was generated by a different mechanism.
Application
- Credit card fraud detection
- Telecom fraud detection
- Customer segmentation
- Medical analysis
(Mostly unsupervised, but can also be supervised or semi-supervised)
Graph Mining
Graph: Structure that models connections between data objects
Data mining on graphs finds additional types of patterns
Association Rules
Association rule mining: find frequent patterns, associations, correlations, or causal structures among sets of items in transactions or other information repositories.
查找交易或其他信息存储库中的项目集之间的频繁模式、关联、相关性或因果结构。
Rule form: Body → Head
Applications: market-basket analysis, cross-marketing, catalog design
Transaction database
Associations:
buys(butter) → buys(milk), buys(butter) → buys(sugar)
Data exploration: explore data attributes individually or extract key characteristics of a data sample
Large part of data mining is data pre-processing 😔
Bonferroni’s principle: Roughly, if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap
粗略地说,如果在更多地方寻找有趣的模式,但超出了数据量所能支持的范围,那一定会发现垃圾。
Thousand of patterns: not all of them are interesting 😔
A pattern is interesting if it is easily understood by humans, valid on test data, potentially useful, novel or validates some hypothesis that a user seeks to confirm
Objective vs. subjective interestingness measures
- Objective: based on statistics and structures of patterns, e.g., support, confidence etc.
- Subjective: based on users’ belief in the data, e.g., unexpectedness, novelty, action-ability etc.
Statistic for Exploratory Data Analysis
怎么突然开始复习概率统计的知识了……回想起本科大二被概率论与数理统计这门课支配的恐惧了吗?
An example dataset has some attributes
Selecting two attributes for visualization using PCA.
以下内容均是本科或高中的数学知识:
Random variables
Mean
Sample mean
Median
Mode 众数
Robustness: A statistic is robust if it is not affected by extreme values
Covariance 协方差:判断两个随机变量是否相关
Correlation (线性)相关性
Covariance Matrix
Data Normalization
Chi-Square Independence Test 高中数学是吧:happy:
Distance and similarities ( dimensions, common values)
Eucledian distance:
Hamming distance:
Cosine similarity:
Jaccard Coefficient:
如果距离越小,说明越相似
距离的几个性质
- positive semidefinite 非负性
- definite 距离为零,两值相等
- symmetry 对称
- triangle inequality 三角不等式
Minkowski distance
Discretization 离散化
- equal-width intervals 等距分为
- equal-frequency intervals 等概率分位