Data Mining 11: Frequent Subgraph Mining
A labeled graph is a triple $G = (V, E, \ell)$ where $\ell: V \cup E \to \Sigma$ is a labeling function assigning a label from a set of labels $\Sigma$ to each node and each edge.
Definition of frequency: it is encoded into a support measure that returns the number of occurrences of the subgraph in the graph.
FSM in Graph Collections
A graph collection, aka a graph database, is a set of labeled graphs $\mathcal{G} = \{G_1, \dots, G_n\}$.
Since a subgraph is uniquely identified by the nodes and edges of a certain graph, we need to relax the definition so that it counts the occurrences of that subgraph across different graphs.
A graph $G_1 = (V_1, E_1, \ell_1)$ is isomorphic to a graph $G_2 = (V_2, E_2, \ell_2)$ if there exists a bijective function $\phi: V_1 \to V_2$ such that if $v \in V_1$ then $\phi(v) \in V_2$ and $\ell_1(v) = \ell_2(\phi(v))$, and if $(u, v) \in E_1$ then $(\phi(u), \phi(v)) \in E_2$ and $\ell_1(u, v) = \ell_2(\phi(u), \phi(v))$.
If a graph $G_1$ is isomorphic to a subgraph of another graph $G_2$, we say $G_1$ is subgraph isomorphic to $G_2$. Subgraph isomorphism is NP-complete, whereas graph isomorphism is in NP but not known to be NP-complete.
Denote by $G_1 \sqsubseteq G_2$ a subgraph isomorphism from $G_1$ to $G_2$.
Given a collection of graphs $\mathcal{G}$ and a graph $S$, the support set $\mathcal{G}_S$ is the subset of $\mathcal{G}$ such that $S$ is subgraph isomorphic to every graph in it, i.e., $\mathcal{G}_S = \{G \in \mathcal{G} \mid S \sqsubseteq G\}$. The support of $S$ is the proportion $\sigma(S) = |\mathcal{G}_S| / |\mathcal{G}|$.
The Frequent Subgraph Mining (FSM) problem in a collection of graphs $\mathcal{G}$ finds all the subgraphs having support at least $\tau$, i.e., $\{S \mid \sigma(S) \geq \tau\}$.
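As a concrete illustration of these definitions, here is a minimal brute-force sketch (all names are my own, not from any library): a labeled graph is a pair of node-label and edge-label maps, subgraph isomorphism is tested by trying every injective node mapping, and the support is the fraction of graphs in the collection that contain the pattern. This is exponential and only viable for tiny graphs.

```python
from itertools import permutations

# Toy representation: a labeled graph is (node_labels, edges) where
# node_labels maps node -> label and edges maps (u, v) -> edge label
# (undirected: both (u, v) and (v, u) are stored).

def make_graph(node_labels, edge_list):
    edges = {}
    for u, v, lab in edge_list:
        edges[(u, v)] = lab
        edges[(v, u)] = lab
    return node_labels, edges

def subgraph_isomorphic(pattern, graph):
    """Brute-force check that `pattern` is subgraph isomorphic to `graph`."""
    p_labels, p_edges = pattern
    g_labels, g_edges = graph
    p_nodes = list(p_labels)
    # Try every injective mapping phi of pattern nodes into graph nodes.
    for image in permutations(g_labels, len(p_nodes)):
        phi = dict(zip(p_nodes, image))
        if any(p_labels[v] != g_labels[phi[v]] for v in p_nodes):
            continue  # node labels must be preserved
        if all((phi[u], phi[v]) in g_edges and g_edges[(phi[u], phi[v])] == lab
               for (u, v), lab in p_edges.items()):
            return True  # edges and edge labels preserved too
    return False

def support(pattern, collection):
    """sigma(S): fraction of graphs that contain the pattern."""
    hits = sum(subgraph_isomorphic(pattern, g) for g in collection)
    return hits / len(collection)
```

For example, a single labeled edge A-x-B is contained in a graph with that edge but not in one whose second node is labeled C, giving support 1/2 on a two-graph collection.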
Downward Closure Property
- Downward closure is the property under which a support measure decreases by expanding a pattern. In particular, in our case, $\sigma(S_1) \geq \sigma(S_2)$ if $S_1 \sqsubseteq S_2$.
The downward closure property leads to two different families of approaches: apriori-based approaches and pattern-growth approaches.
Apriori-based Approaches
These approaches stem from the observation that any subgraph of a frequent subgraph must be frequent as well.
FSG is one of the first methods that utilizes the apriori principle. The algorithm builds on the idea that a frequent $k$-subgraph is built on top of frequent $(k-1)$-subgraphs, where $k$ indicates the number of edges.
The algorithm simply starts enumerating frequent 1-subgraphs and 2-subgraphs and stores them into the sets $F^1$ and $F^2$, respectively. In general, $F^k$ is the set of frequent $k$-subgraphs. FSG then builds larger $(k+1)$-subgraphs by merging the $k$-subgraphs obtained in the previous iteration.
Pseudo code
- $F^1 \leftarrow$ frequent 1-subgraphs with counts
- $F^2 \leftarrow$ frequent 2-subgraphs with counts
- $k \leftarrow 3$
- while $F^{k-1} \neq \emptyset$ do
  - candidate_generation: get $C^k$, the set of candidate $k$-subgraphs, from $F^{k-1}$
  - candidate_pruning: for a candidate to be frequent, each of its $(k-1)$-subgraphs should be frequent
  - frequency_pruning: scan $\mathcal{G}$, count occurrences of the subgraphs in $C^k$, and keep in $F^k$ those with support at least $\tau$
  - $k \leftarrow k + 1$
- return $\bigcup_k F^k$
Each of the three steps runs isomorphism checks to understand whether some candidate is isomorphic to any of the other candidates. Moreover, in order to compute the support of a subgraph, the algorithm runs subgraph isomorphism against all the graphs in $\mathcal{G}$. FSG employs heuristics to reduce the number of isomorphism tests at each iteration.
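The level-wise loop can be sketched as follows. To keep the sketch runnable without isomorphism tests, a "pattern" here is simply a set of edge labels and containment is label-set inclusion — a deliberate simplification of FSG, not the actual algorithm; all names are illustrative.

```python
from itertools import combinations

# Simplified level-wise (apriori) loop in the spirit of FSG. A graph is
# modeled as a set of edge labels; a graph "contains" a pattern when the
# pattern's labels all appear among its edges.

def fsg_like(collection, tau):
    n = len(collection)
    def freq(pattern):
        return sum(pattern <= g for g in collection) / n
    # F^1: frequent single edge labels
    all_labels = set().union(*collection)
    F = {frozenset([l]) for l in all_labels if freq(frozenset([l])) >= tau}
    result = set(F)
    k = 1
    while F:
        # candidate_generation: merge patterns sharing k-1 labels
        C = {a | b for a in F for b in F if len(a | b) == k + 1}
        # candidate_pruning: every k-subpattern must be frequent
        C = {c for c in C
             if all(frozenset(s) in F for s in combinations(c, k))}
        # frequency_pruning: scan the collection
        F = {c for c in C if freq(c) >= tau}
        result |= F
        k += 1
    return result
```

On the collection $\{a,b,c\}, \{a,b\}, \{a,c\}$ with $\tau = 2/3$, the loop returns the three single labels plus $\{a,b\}$ and $\{a,c\}$, while $\{b,c\}$ and $\{a,b,c\}$ are pruned.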
Candidate Generation
Generate candidates by merging the frequent $k$-subgraphs. The main idea is to first detect cores to generate the candidates easily. A core is the maximum common subgraph of two subgraphs. Although generating the core requires running subgraph isomorphism, in practice
- one core is common to many subgraphs.
- the subgraphs are typically small.
Candidate pruning
In order to prune the candidates we can check whether every subgraph of a candidate is also frequent. In theory, this requires performing isomorphism tests for each of the candidate's subgraphs.
In order to hash a graph, we require a signature, i.e., a numeric representation, that is unique for each graph. This is possible through canonical labeling, which finds a unique representation of a graph in a smart manner. In particular, FSG employs a canonical scheme that reorders the columns and the rows of the adjacency matrix. The canonical code is the string obtained by reading the sequence of node and edge labels in the columns of the matrix that is lexicographically smallest among all those obtained by reordering; this code is unique, i.e., two graphs are isomorphic iff they have the same canonical code.
It seems that this method effectively solves graph isomorphism in polynomial time. However, this is not the case, as the canonical labeling is obtained by a potentially exponential number of reorderings of the adjacency matrix. By reordering, we are computing automorphisms over the graph. In practice, not all the permutations lead to a valid matrix, and many can be pruned by simple heuristics.
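A brute-force version of this canonical scheme can be sketched as follows (my own simplified variant: it reads the node labels in permuted order, then the upper triangle of the permuted adjacency matrix, and keeps the lexicographically smallest string). The loop over all permutations makes the exponential cost explicit.

```python
from itertools import permutations

# Brute-force canonical code: try every ordering of the adjacency matrix
# and keep the lexicographically smallest label string. Exponential in
# the number of nodes, which is why real implementations prune early.

def canonical_code(node_labels, edges):
    """node_labels: list of labels; edges[i][j]: edge label, '' if absent."""
    n = len(node_labels)
    best = None
    for perm in permutations(range(n)):
        # node labels in permuted order, then the upper triangle of the
        # permuted adjacency matrix, column by column
        code = [node_labels[p] for p in perm]
        for j in range(n):
            for i in range(j):
                code.append(edges[perm[i]][perm[j]])
        code = ''.join(code)
        if best is None or code < best:
            best = code
    return best
```

Two isomorphic labelings of the same graph (e.g. an A-x-B edge with the nodes listed in either order) produce the same canonical code, which is exactly what makes the code usable as a hash key.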
Frequency Counting
Frequency counting requires scanning the data to find which graphs contain a subgraph. Most of the subgraph isomorphism tests can be avoided with a simple shortcut.
Let $\mathcal{G}_a$ and $\mathcal{G}_b$ be the sets of graphs in the collection that contain the subgraphs $a$ and $b$, respectively. If we merge the two patterns into a candidate $c$, it follows immediately that $\mathcal{G}_c \subseteq \mathcal{G}_a \cap \mathcal{G}_b$. As such, all the graphs outside $\mathcal{G}_a \cap \mathcal{G}_b$ can be ignored, as they will not contribute to the support.
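The shortcut can be sketched as follows, with support sets kept as sets of graph ids; `contains` stands for the expensive subgraph isomorphism test, and all names are hypothetical.

```python
# Support-set shortcut for frequency counting: when pattern c is obtained
# by merging patterns a and b, any graph containing c must contain both a
# and b, so only graphs in the intersection of the two support sets need
# a (costly) subgraph isomorphism test.

def candidate_graphs(support_set_a, support_set_b):
    """Graph ids that can possibly contain the merged pattern."""
    return support_set_a & support_set_b

def support_of_merged(support_set_a, support_set_b, contains, n_graphs):
    """`contains(gid)` runs the expensive subgraph isomorphism test."""
    hits = sum(1 for gid in candidate_graphs(support_set_a, support_set_b)
               if contains(gid))
    return hits / n_graphs
```

With $\mathcal{G}_a = \{1,2,3\}$ and $\mathcal{G}_b = \{2,3,4\}$, only graphs 2 and 3 are ever tested, regardless of the collection size.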
👍
- Finds all frequent subgraphs.
- Easy to include heuristics to reduce the computation time.
👎
- FSG generates the same subgraph a large number of times.
- The canonical labeling scheme is quadratic in the size of the subgraph.
- FSG requires several isomorphism and subgraph isomorphism tests.
Pattern-growth Approaches
Pattern-growth approaches follow a different philosophy: start with empty subgraphs and progressively expand them by adding one edge at a time. The pattern always expands in a certain order, so as to reduce redundancy as much as possible.
gSpan is the most famous algorithm in the pattern-growth family.
gSpan organizes the patterns in a prefix-tree, a data structure for alphabetically ordered strings that stores string prefixes as nodes of a tree. However, graphs have neither a predefined order of traversal nor a unique way of enumerating the nodes.
To create a prefix-tree of non-repeating patterns gSpan:
- Defines a traversal over the nodes in the graphs
- Identifies in advance which nodes of a subgraph to expand
- Defines a total ordering over the subgraphs, so as to detect duplicate patterns.
Node Traversal
gSpan traverses the subgraphs in a DFS manner, meaning that the next node to visit after $v$ is the first unvisited node in the list of neighbors $N(v)$. A DFS visit defines a spanning tree over the nodes in the subgraph. The edges in the spanning tree are called forward edges; the remaining edges are called backward edges.
During the DFS, a counter assigns a unique identifier to each node; the counter represents the discovery time of the node.
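A small sketch of this traversal, assuming an undirected graph given as adjacency lists (names are my own): nodes receive discovery times from a counter, tree edges are collected as forward edges, and non-tree edges to ancestors as backward edges.

```python
# DFS over an undirected graph assigning discovery times; the edges of
# the resulting spanning tree are the forward edges, the remaining edges
# (pointing back to an ancestor) are the backward edges.

def dfs_classify(adj, root):
    """adj: dict node -> ordered list of neighbors.
    Returns (discovery_time, forward_edges, backward_edges)."""
    time, forward, backward = {root: 0}, [], []
    def visit(u, parent):
        for v in adj[u]:
            if v not in time:
                time[v] = len(time)        # discovery counter
                forward.append((u, v))     # spanning-tree (forward) edge
                visit(v, u)
            elif v != parent and time[v] < time[u]:
                backward.append((u, v))    # edge back to an ancestor
    visit(root, None)
    return time, forward, backward
```

On a triangle 0-1-2 started at node 0, the tree edges are (0,1) and (1,2), and the closing edge (2,0) is classified as backward.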
Rightmost Expansion
Since a DFS defines a specific traversal and the order of the visit is marked by a counter, there is only one way to add an edge to the subgraph, namely expanding from the rightmost path. The rightmost path is the path from the root of the spanning tree to the leaf with the highest identifier.
Total Ordering
The main innovation of gSpan is to treat canonical labeling as a DFS-visit and thus ensure a reduced number of operations and redundant subgraphs.
DFS Codes
A DFS code is a representation of a graph that is comparable with other DFS codes. Such a comparison acts as a total ordering, for which a minimum exists. The minimum DFS-code is a canonical labeling of a graph.
Definition of DFS-edge. It is a tuple $(i, j, \ell_i, \ell_{(i,j)}, \ell_j)$ where $i, j$ are identifiers given by the DFS traversal order, and $\ell_i, \ell_{(i,j)}, \ell_j$ are the labels of node $i$, edge $(i, j)$ and node $j$.
A DFS-code for a graph $G$ is a sequence $\langle e_1, \dots, e_m \rangle$ where each $e_k$ is a DFS-edge and $m = |E|$.
Valid DFS-codes. There is a precedence order on DFS-edges corresponding to the DFS traversal. Given two DFS-edges $a$, connecting nodes with discovery times $(i_a, j_a)$, and $b$, connecting nodes with discovery times $(i_b, j_b)$, a DFS-code is valid if it complies with:
- $i_a = i_b \wedge j_a < j_b \Rightarrow a \prec b$ (backward edges ordering)
- $i_a < j_a \wedge j_a = i_b \Rightarrow a \prec b$ (forward edges ordering)
- $a \prec b \wedge b \prec c \Rightarrow a \prec c$ (transitive property)
Total ordering and minimum codes. A graph can have multiple valid DFS-codes depending on the DFS traversal order. gSpan further defines a total ordering over the DFS-codes so as to define a minimum DFS-code for a graph.
The minimum DFS-code is unique and a canonical labeling of the graph. This implies that two graphs are isomorphic iff they have the same minimum DFS-code.
Let $\alpha = \langle a_1, \dots, a_m \rangle$ and $\beta = \langle b_1, \dots, b_n \rangle$ be two DFS-codes. We say $\alpha \leq \beta$ iff either of the following conditions is true:
- $\exists t, 1 \leq t \leq \min(m, n)$, s.t. $a_k = b_k$ for $k < t$ and $a_t \prec_e b_t$
- $a_k = b_k$ for $1 \leq k \leq m$ and $m \leq n$
Simply speaking, $\alpha \leq \beta$ if either 1. $\alpha$ and $\beta$ are equal up to some point and the next pair of corresponding DFS-edges is such that $a_t \prec_e b_t$, or 2. $\alpha$ and $\beta$ are the same in terms of DFS-edges but $\beta$ contains more edges, i.e., $\alpha$ is a prefix of $\beta$.
We need to define $\prec_e$: let $a = (i_a, j_a, \dots)$ and $b = (i_b, j_b, \dots)$ be two DFS-edges. We say $a \prec_e b$ if
- Both are forward edges
  - $i_b < i_a$, OR
  - $i_a = i_b$ and the labels of $a$ are lexicographically less than the labels of $b$, in order of tuple.
- Both are backward edges
  - $j_a < j_b$, OR
  - $j_a = j_b$ and the edge label of $a$ is lexicographically less than the one of $b$.
- $a$ is backward and $b$ is forward.
The above enforces a total order on DFS-codes and, as such, min-DFS($G$) of a graph $G$ is a canonical labeling of $G$.
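The edge comparison can be sketched as follows, directly encoding the three cases above (as reconstructed here, so treat the exact tie-breaking as an assumption); DFS-edges are tuples $(i, j, \ell_i, \ell_{(i,j)}, \ell_j)$.

```python
# Comparison of two DFS-edges under the ordering rules above. An edge
# (i, j, ...) is forward when i < j and backward when i > j.

def edge_less(a, b):
    ia, ja = a[0], a[1]
    ib, jb = b[0], b[1]
    a_fwd, b_fwd = ia < ja, ib < jb
    if a_fwd and b_fwd:
        # deeper source first; ties broken by the label tuple
        return ib < ia or (ia == ib and a[2:] < b[2:])
    if not a_fwd and not b_fwd:
        # earlier target first; ties broken by the edge label
        return ja < jb or (ja == jb and a[3] < b[3])
    return not a_fwd  # backward edges precede forward edges
```

For instance, a backward edge always precedes a forward edge, and of two forward edges the one leaving the deeper node on the rightmost path comes first.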
DFS-code tree. Because the codes have a precedence order, they can also be organized into a DFS-code tree, which is a prefix tree: if min-DFS($G_1$) is a prefix of min-DFS($G_2$), then $G_1$ is a parent of $G_2$.
Given the parent-child relationship defined above, we can construct a DFS-code tree. An interesting property of the DFS-code tree is that an in-order traversal follows the DFS lexicographic order.
gSpan algorithm
Input: Collection of graphs $\mathcal{G}$, minimum support $\tau$
Output: Frequent subgraph set $F$
- Let $F^1$ be the frequent one-edge subgraphs in $\mathcal{G}$, represented by their DFS codes
- Sort $F^1$ in DFS lexicographic order
- $F \leftarrow F^1$
- for each $p \in F^1$ do
  - GSpanExtend($\mathcal{G}$, $p$, $\tau$, $F$)
- return $F$
GSpanExtend($\mathcal{G}$, $p$, $\tau$, $F$)
- // checking minimality of a DFS-code is computationally expensive
- if $p$ is not minimal then return
- else
  - $F \leftarrow F \cup \{p\}$
  - for each $p'$ in the rightmost expansions of $p$ do
    - if $\sigma(p') \geq \tau$ then
      - GSpanExtend($\mathcal{G}$, $p'$, $\tau$, $F$)
👍
- Finds all frequent subgraphs.
- Generates less redundant subgraphs.
👎
- Works only in collections of graphs.
- Generally slow for low support thresholds $\tau$.
Frequent Subgraph Mining on a Single Graph
If the support is defined as the number of times a specific subgraph appears, the downward closure property does not hold anymore, as bigger subgraphs can have a larger support.
We define a support measure to be admissible if, for any graph $G$ and any subgraph $S_2$ obtained by extending a subgraph $S_1$, the support of $S_2$ is not larger than the support of $S_1$. In the last few years, a number of admissible support measures have appeared. These include:
- Maximum independent set support (MIS): Based on maximum number of non-overlapping matches
- Harmful overlap support (HO): Based on number of patterns for which no multi-node subgraph is identical
- Minimum image-based support (MNI): Based on the number of times a node in the pattern is mapped into a distinct node in the graph.
MIS Support
The maximum independent set support counts how many times a pattern maps to a subgraph that does not overlap with any other subgraph. The computation of MIS is as follows:
- Construct an overlap graph $G_O = (V_O, E_O)$, in which the set of nodes $V_O$ is the set of matches of a pattern into a graph and $E_O = \{(f_1, f_2) \mid f_1, f_2 \in V_O,\ f_1 \neq f_2,\ f_1 \sqcap f_2 \neq \emptyset\}$, i.e., $G_O$ has an edge between each pair of overlapping matches.
- Compute the MIS support as the size of the maximum independent set of nodes in the overlap graph.
Recall that finding the maximum independent set is NP-hard.
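A brute-force sketch of the MIS support, with each match given as the set of graph nodes it uses (representation and names are my own); it searches subsets from largest to smallest, which makes the NP-hardness tangible.

```python
from itertools import combinations

# Brute-force MIS support: build the overlap relation over the matches
# of a pattern and return the size of the largest set of pairwise
# non-overlapping matches. Exponential, as expected for NP-hard MIS.

def mis_support(matches):
    """matches: list of sets of graph nodes, one set per match."""
    n = len(matches)
    overlap = [[bool(matches[i] & matches[j]) for j in range(n)]
               for i in range(n)]
    for size in range(n, 0, -1):
        for subset in combinations(range(n), size):
            # independent set: no edge (overlap) between any two members
            if all(not overlap[i][j] for i, j in combinations(subset, 2)):
                return size
    return 0
```

For matches $\{1,2\}, \{2,3\}, \{4,5\}$ the first two overlap on node 2, so at most two matches are pairwise disjoint and the MIS support is 2.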
Harmful Overlap Support
The HO support is less restrictive than the MIS support, as it considers as overlapping only those matchings of a pattern that share one or more edges. Two matching subgraphs that are connected only through a node are not considered overlapping.
MNI Support
The minimum image-based support is even less restrictive as it considers the minimum number of times a node of the pattern matches a node in the graph.
Definition of MNI. Let $\{f_1, \dots, f_m\}$ be the set of isomorphisms of a pattern $S$ in a graph $G$. Let $F(v) = |\{f_i(v) : 1 \leq i \leq m\}|$ be the number of distinct nodes of $G$ that the node $v$ of $S$ is mapped to by the functions $f_1, \dots, f_m$. The MNI support is $\sigma_{MNI}(S) = \min_{v \in V_S} F(v)$.
Since MNI considers one node at a time, given the matchings of the pattern in the graph, the MNI support can be computed in polynomial time.
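A minimal sketch of the MNI computation, assuming the matches are given as dictionaries from pattern nodes to graph nodes (my own representation):

```python
# MNI support: for each pattern node, count the distinct graph nodes it
# is mapped to across all matches, then take the minimum over the
# pattern nodes. Linear in the number of matches times pattern size.

def mni_support(matches):
    """matches: list of dicts, each mapping pattern node -> graph node."""
    if not matches:
        return 0
    pattern_nodes = matches[0].keys()
    return min(len({m[v] for m in matches}) for v in pattern_nodes)
```

With three matches mapping pattern node `a` to graph nodes $\{1, 4\}$ and `b` to $\{2, 3\}$, each pattern node has two distinct images, so the MNI support is 2 even though there are three matches.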
Approaches for Large Graphs
With an anti-monotone support measure, we can apply any previous algorithm, such as gSpan, just changing how the support is computed and leaving the rest unchanged. Such approaches are called grow-and-store:
- Add all frequent edges to the list of frequent subgraphs
- Extend frequent subgraphs to larger candidates
- Compute candidate frequency (find all occurrences)
- Repeat from step 2 until no more frequent subgraphs are found
One of the methods that exploits and optimizes gSpan is GraMi.