A labeled graph is a triple $G=(V,E,\ell)$ where $\ell:V\cup E\to L$ is a labeling function assigning a label from a set of labels $L$ to each node and each edge.

Definition of frequency: frequency is encoded into a support measure that returns the number of occurrences of a subgraph in the graph.

FSM in Graph Collections

A graph collection, aka a graph database, is a set $\mathcal D=\{G_1,G_2,\dots,G_n\}$ of labeled graphs $G_i=(V_i,E_i,\ell_i)$, $i\in[1,n]$.

Since a subgraph is uniquely identified by the nodes and edges of a specific graph, we need to relax the definition so that it counts the occurrences of that subgraph across different graphs.

A graph $G_1=(V_1,E_1,\ell_1)$ is isomorphic to a graph $G_2=(V_2,E_2,\ell_2)$ if there exists a bijective function $f:V_1\to V_2$ such that for every $u\in V_1$, $\ell_1(u)=\ell_2(f(u))$, and $(u,v)\in E_1$ iff $(f(u),f(v))\in E_2$ with $\ell_1((u,v))=\ell_2((f(u),f(v)))$.

If a graph $G_1$ is isomorphic to a subgraph of another graph $G$, we say $G_1$ is subgraph isomorphic to $G$. Subgraph isomorphism is NP-complete, whereas graph isomorphism is in NP but not known to be NP-complete.

Notation: $G\sqsubseteq G_1$ denotes a subgraph isomorphism from $G$ to $G_1$.

Given a collection of graphs $\mathcal D=\{G_1,G_2,\dots,G_n\}$ and a graph $G$, the support set $\mathcal D_G$ is the subset of $\mathcal D$ such that $G$ is subgraph isomorphic to every graph $G'\in\mathcal D_G$, i.e., $\mathcal D_G=\{G'\in\mathcal D\mid G\sqsubseteq G'\}$. The support $\sigma(G)$ of $G$ is the proportion $\sigma(G)=\frac{|\mathcal D_G|}{|\mathcal D|}$.

The Frequent Subgraph Mining (FSM) problem in a collection of graphs $\mathcal D=\{G_1,G_2,\dots,G_n\}$ finds all the subgraphs $G$ having support at least $\mathit{minsup}$, i.e., $\sigma(G)\ge\mathit{minsup}$.
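
The support definition above can be illustrated with a toy brute-force computation (illustration only, not an efficient miner). The graph encoding is an assumption of this sketch: a graph is a pair `(node_labels, edge_labels)` with `node_labels: node -> label` and undirected edges stored as `frozenset({u, v}) -> label`.

```python
from itertools import permutations

def subgraph_isomorphic(p_nodes, p_edges, g_nodes, g_edges):
    """True iff the pattern maps injectively into the graph with matching labels."""
    pn = list(p_nodes)
    for image in permutations(g_nodes, len(pn)):   # all injective node mappings
        f = dict(zip(pn, image))
        if any(p_nodes[u] != g_nodes[f[u]] for u in pn):
            continue                               # node labels must match
        if all(g_edges.get(frozenset({f[u], f[v]})) == lbl
               for (u, v), lbl in ((tuple(e), l) for e, l in p_edges.items())):
            return True                            # every pattern edge matched
    return False

def support(p_nodes, p_edges, collection):
    """sigma(G) = |D_G| / |D| over a list of (node_labels, edge_labels) graphs."""
    hits = sum(subgraph_isomorphic(p_nodes, p_edges, gn, ge)
               for gn, ge in collection)
    return hits / len(collection)

# Pattern A -x- B occurs in G1 but not in G2, so its support is 1/2
P_nodes, P_edges = {0: 'A', 1: 'B'}, {frozenset({0, 1}): 'x'}
G1 = ({0: 'A', 1: 'B', 2: 'C'}, {frozenset({0, 1}): 'x', frozenset({1, 2}): 'y'})
G2 = ({0: 'A', 1: 'C'}, {frozenset({0, 1}): 'x'})
```

The exponential cost of `permutations` is exactly why the algorithms below invest so much in pruning isomorphism tests.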

Downward Closure Property

  • Downward closure is the property under which a support measure does not increase when a pattern is expanded. In our case, $\sigma(G)\le\sigma(G')$ if $G'\sqsubseteq G$.

The downward closure property leads to two different families of approaches: apriori-based approaches and pattern-growth approaches.

Apriori-based Approaches

These approaches stem from the observation that any subgraph of a frequent subgraph must be frequent as well.

FSG is one of the first methods that utilizes the apriori principle. The algorithm builds on the idea that a frequent $k$-subgraph is built on top of frequent $(k-1)$-subgraphs, where $k$ indicates the number of edges.

The algorithm starts by enumerating frequent $1$-subgraphs and $2$-subgraphs and storing them into the sets $F_1$ and $F_2$, respectively.

Similarly, $F_k$ is the set of frequent $k$-subgraphs. FSG then builds larger $k$-subgraphs by merging the subgraphs obtained in the previous iteration.

Pseudo code

  1. $F_1\gets$ frequent $1$-subgraphs with counts
  2. $F_2\gets$ frequent $2$-subgraphs with counts
  3. $k\gets 3$
  4. while $F_{k-1}\neq\emptyset$ do
    • candidate_generation: get $C_k$, the set of candidate $k$-subgraphs, from $F_{k-1}$
    • candidate_pruning: for a candidate to be frequent, each of its $(k-1)$-subgraphs must be frequent
    • frequency_pruning: scan $\mathcal D$, count occurrences of subgraphs in $C_k$
    • $F_k\gets\{c\in C_k\mid\sigma(c)\ge\mathit{minsup}\}$
    • $k\gets k+1$
  5. return $\bigcup_{i=1}^k F_i$

Each of the three steps runs isomorphism checks to determine whether a candidate is isomorphic to any of the other candidates. Moreover, in order to compute the support of a subgraph, the algorithm runs subgraph isomorphism on all the candidate graphs in $C_k$. FSG employs heuristics to reduce the number of isomorphism tests at each iteration.

Candidate Generation

Generate candidates by merging the frequent $(k-1)$-subgraphs. The main idea is to first detect cores in order to generate the candidates easily. A core is the maximum common subgraph of two subgraphs. Although generating the core requires running subgraph isomorphism, in practice

  1. one core is common to many subgraphs.
  2. the subgraphs are typically small.

Candidate pruning

In order to prune the candidates, we check whether every subgraph of a candidate is also frequent. In theory, this requires performing isomorphism tests for each of the candidate's subgraphs.

In order to hash a graph, we require a signature, i.e., a numeric representation, that is unique for each graph. This is possible through canonical labeling, which finds a unique representation of a graph in a smart manner. In particular, FSG employs a canonical scheme that reorders the columns and rows of the adjacency matrix. The canonical code is the lexicographically smallest string obtained by reading the sequence of node and edge labels along the columns of the matrix, over all reorderings; this code is unique, i.e., $code(G)=\min\{code(M)\mid M \text{ is an adjacency matrix of } G\}$.
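
The idea of a permutation-based canonical code can be sketched as follows (an illustrative toy, not FSG's exact serialization): try every node ordering, serialize node labels plus the upper triangle of the edge-label matrix, and keep the lexicographically smallest string.

```python
from itertools import permutations

def canonical_code(node_labels, edge_labels):
    """node_labels: node -> label; edge_labels: frozenset({u, v}) -> label."""
    nodes = list(node_labels)
    best = None
    for order in permutations(nodes):          # all reorderings of the matrix
        parts = [node_labels[v] for v in order]
        for i in range(len(order)):
            for j in range(i + 1, len(order)):  # upper triangle, '_' = no edge
                parts.append(edge_labels.get(frozenset({order[i], order[j]}), '_'))
        code = ''.join(parts)
        if best is None or code < best:
            best = code                        # keep the smallest string
    return best

# Two isomorphic labeled paths A -x- B -y- C with different node ids
# receive the same canonical code
g1 = canonical_code({0: 'A', 1: 'B', 2: 'C'},
                    {frozenset({0, 1}): 'x', frozenset({1, 2}): 'y'})
g2 = canonical_code({7: 'C', 8: 'A', 9: 'B'},
                    {frozenset({8, 9}): 'x', frozenset({9, 7}): 'y'})
```

The `permutations` loop is the potentially exponential cost discussed next.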

It seems that this method effectively solves graph isomorphism in $\mathcal O(1)$. However, this is not the case, as the canonical labeling is obtained from a potentially exponential number of reorderings of the adjacency matrix. By reordering, we are computing automorphisms of the graph. In practice, not all permutations lead to a valid matrix, and many can be pruned by simple heuristics.

Frequency Counting

Frequency counting requires scanning the data to find which graphs contain a subgraph. Most of the subgraph isomorphism tests can be avoided with a simple shortcut.

Let $TID(G')$ and $TID(G'')$ be the sets of graphs in the collection $\mathcal D$ that contain the subgraphs $G'$ and $G''$, respectively. If we merge the two patterns, it follows immediately that $TID(merge(G',G''))\subseteq TID(G')\cap TID(G'')$. As such, all the graphs in $(TID(G')\cup TID(G''))\setminus(TID(G')\cap TID(G''))$ can be ignored, as they cannot contribute to the support.
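
The shortcut is plain set arithmetic; the TID sets below are hypothetical graph ids:

```python
# Hypothetical TID sets: ids of the graphs in D containing each parent pattern
tid_a = {0, 1, 3, 4}   # TID(G')
tid_b = {1, 2, 4}      # TID(G'')

# The merged candidate can only occur where both parents occur, so only
# the intersection needs an actual subgraph-isomorphism test
to_test = tid_a & tid_b              # graphs worth scanning
skipped = (tid_a | tid_b) - to_test  # graphs that can be ignored
```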

👍

  • Finds all frequent subgraphs.
  • Easy to include heuristics to reduce the computation time.

👎

  • FSG generates the same subgraph a large number of times.
  • The canonical labeling scheme is quadratic in the size of the subgraph.
  • FSG requires several isomorphism and subgraph isomorphism tests.

Pattern-growth Approaches

Pattern-growth approaches follow a different philosophy.

They start with empty subgraphs and progressively expand them by adding one edge at a time. The pattern is always expanded in a fixed order, so as to reduce redundancy as much as possible.

gSpan is the most famous algorithm in the pattern-growth family.

gSpan organizes the patterns in a prefix tree, a data structure for alphabetically ordered strings that stores string prefixes as nodes of a tree. However, graphs have neither a predefined order of traversal nor a unique way of enumerating the nodes.

To create a prefix-tree of non-repeating patterns gSpan:

  1. Defines a traversal over the nodes in the graphs
  2. Identifies in advance which nodes of a subgraph to expand
  3. Defines a total ordering over the subgraphs, so as to detect duplicate patterns.

Node Traversal

gSpan traverses the subgraphs in a DFS manner, meaning that the next node $v_j$ to visit after $v_i$ is the first in the list of neighbors $N(v_i)$. A DFS visit defines a spanning tree over the nodes in the subgraph. The edges in the spanning tree are called forward edges; the remaining edges are called backward edges.

During the DFS, a counter assigns a unique identifier to each node. The counter represents the discovery time of the node.

Rightmost Expansion

Since a DFS defines a specific traversal and the order of the visit is marked by a counter, there is only one way to add an edge to the subgraph, namely expanding from the rightmost path. The rightmost path is the path from the root of the spanning tree to the leaf with the highest identifier.
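
Discovery times and the rightmost path can be sketched on a toy graph (the adjacency lists and node names below are illustrative assumptions):

```python
def dfs(adj, root):
    """Return discovery-time ids and spanning-tree parents of a DFS from root."""
    ids, parent = {}, {root: None}
    def visit(v):
        ids[v] = len(ids)          # the counter: discovery time of v
        for u in adj[v]:           # neighbor order fixes the traversal
            if u not in ids:
                parent[u] = v      # (v, u) is a forward edge
                visit(u)
    visit(root)
    return ids, parent

def rightmost_path(ids, parent):
    """Path from the root to the node with the highest discovery id."""
    v = max(ids, key=ids.get)
    path = []
    while v is not None:
        path.append(v)
        v = parent[v]
    return path[::-1]

adj = {'r': ['a', 'b'], 'a': ['r', 'c'], 'c': ['a'], 'b': ['r']}
ids, parent = dfs(adj, 'r')   # visits r, a, c, then backtracks to b
```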

Total Ordering

The main innovation of gSpan is to treat canonical labeling as a DFS visit, thus reducing both the number of operations and the number of redundant subgraphs.

DFS Codes

A DFS code is a representation of a graph that is comparable with other DFS codes. Such comparisons form a total ordering for which a minimum exists. The minimum DFS-code is a canonical labeling of a graph.

Definition of DFS-edge: a tuple $(i,j,\ell(i),\ell((i,j)),\ell(j))$ where $i,j$ are identifiers given by the DFS traversal order, and $\ell(i),\ell((i,j)),\ell(j)$ are the labels of node $i$, edge $(i,j)$, and node $j$.

A DFS-code for a graph $G=(V,E,\ell)$ is a sequence $\langle e_1,\dots,e_t\rangle$ where each $e_i$ is a DFS-edge and $t=|E|$.

Valid DFS-codes. A precedence order $\prec$ on DFS-edges corresponds to the DFS traversal. Given two DFS-edges $e_1$, connecting nodes with discovery times $(i_1,j_1)$, and $e_2$, connecting nodes with discovery times $(i_2,j_2)$, a DFS-code is valid if it complies with:

  1. $i_1=i_2$ and $j_1<j_2\Rightarrow e_1\prec e_2$ (backward edge ordering)
  2. $i_1<j_1$ and $j_1=i_2\Rightarrow e_1\prec e_2$ (forward edge ordering)
  3. $e_1\prec e_2$ and $e_2\prec e_3\Rightarrow e_1\prec e_3$ (transitivity)

Total ordering and minimum codes. A graph can have multiple valid DFS-codes, depending on the DFS traversal order. gSpan further defines a total ordering over the DFS-codes, so as to define a minimum DFS-code for each graph.

The minimum DFS-code is unique and is a canonical labeling of the graph. This implies that two graphs are isomorphic iff they have the same minimum DFS-code.

Let $\alpha=\langle a_0,a_1,\dots,a_{n-1}\rangle$ and $\beta=\langle b_0,b_1,\dots,b_{m-1}\rangle$ be two DFS-codes. We say $\alpha\prec\beta$ iff either of the following conditions holds:

  • $\exists t$, $0\le t\le\min(n,m)$, s.t. $a_k=b_k$ for $k<t$ and $a_t\prec_e b_t$
  • $a_k=b_k$ for $0\le k<n$ and $n<m$

Simply speaking, $\alpha\prec\beta$ if either 1. $\alpha$ and $\beta$ are equal up to some point $t$ and the next pair of corresponding DFS-edges satisfies $a_t\prec_e b_t$, or 2. $\alpha$ and $\beta$ agree on all of $\alpha$'s DFS-edges but $\beta$ contains more edges (i.e., $\alpha$ is a proper prefix of $\beta$).

We need to define $\prec_e$. Let $a_t=(i_a,j_a,L_{i_a},L_{i_a,j_a},L_{j_a})$ and $b_t=(i_b,j_b,L_{i_b},L_{i_b,j_b},L_{j_b})$ be two DFS-edges. We say $a_t\prec_e b_t$ if

  1. Both are forward edges and
    • $i_b<i_a$, OR
    • $i_b=i_a$ and the labels of $a$ are lexicographically less than the labels of $b$, in tuple order.
  2. Both are backward edges and
    • $j_a<j_b$, OR
    • $j_a=j_b$ and the edge label of $a$ is lexicographically less than that of $b$.
  3. $a_t$ is a backward edge and $b_t$ is a forward edge.

The above enforces a total order on DFS-codes and, as such, min-DFS($G$) of a graph $G$ is a canonical labeling of $G$.
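
The rules for $\prec_e$ can be sketched as a comparator; a DFS-edge is modeled as a tuple `(i, j, Li, Lij, Lj)` and is forward iff `i < j` (encoding details are assumptions of this sketch):

```python
def prec_e(a, b):
    """True iff DFS-edge a precedes DFS-edge b under the rules above."""
    a_fwd, b_fwd = a[0] < a[1], b[0] < b[1]     # forward iff i < j
    if a_fwd and b_fwd:                         # rule 1: both forward
        return b[0] < a[0] or (b[0] == a[0] and a[2:] < b[2:])
    if not a_fwd and not b_fwd:                 # rule 2: both backward
        return a[1] < b[1] or (a[1] == b[1] and a[3] < b[3])
    return (not a_fwd) and b_fwd                # rule 3: backward before forward
```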

DFS-code tree. Because the codes have a precedence order, the patterns can also be organized into a DFS-code tree, which is a prefix tree. If min-DFS($G_1$)$=\langle a_0,a_1,\dots,a_n\rangle$ and min-DFS($G_2$)$=\langle a_0,a_1,\dots,a_n,b\rangle$, then $G_1$ is the parent of $G_2$, since min-DFS($G_1$) is a prefix of min-DFS($G_2$).

Given the parent-child relationship, we can construct a DFS-code tree. An interesting property of the DFS-code tree is that a pre-order traversal follows the DFS lexicographic order.
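
The prefix relation maps directly onto a trie; the DFS-edges `'e0'`, `'e1'`, `'e2'` below are placeholder names:

```python
def insert(trie, code):
    """Insert a DFS-code (a tuple of DFS-edges) into a nested-dict trie."""
    node = trie
    for edge in code:                 # one trie level per DFS-edge
        node = node.setdefault(edge, {})
    return node

trie = {}
min_dfs_g1 = ('e0', 'e1')             # min-DFS(G1)
min_dfs_g2 = ('e0', 'e1', 'e2')       # min-DFS(G2) extends it by one edge
insert(trie, min_dfs_g1)
insert(trie, min_dfs_g2)
# G1's trie node is the parent of G2's, mirroring the prefix relation
```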

gSpan algorithm

Input: collection of graphs $\mathcal D$, minimum support $\mathit{minsup}$

Output: frequent subgraph set $S$

  1. Let $S$ be the set of frequent one-edge subgraphs in $\mathcal D$, represented by their DFS codes
  2. Sort $S$ in lexicographic order
  3. $N\gets S$
  4. for each $n\in N$ do
    • GSpanExtend($\mathcal D,n,\mathit{minsup},S$)
  5. return $S$

GSpanExtend($\mathcal D,n,\mathit{minsup},S$) // checking minimality of a DFS-code is computationally expensive

  1. if $n$ is not minimal then return

  2. else

    • $S\gets S\cup\{n\}$

    • for each rightmost expansion $e$ of $n$ do

      • if $\sigma(e)\ge\mathit{minsup}$ then

        GSpanExtend($\mathcal D,e,\mathit{minsup},S$)

👍

  • Finds all frequent subgraphs.
  • Generates less redundant subgraphs.

👎

  • Works only in collections of graphs.
  • Generally slow for low $\mathit{minsup}$.

Frequent Subgraph Mining on a Single Graph

If the support is defined as the number of times a specific subgraph appears, the downward closure property no longer holds, as bigger subgraphs can have a larger support.

We define a support measure to be admissible if for any graph $P$ and any subgraph $Q\sqsubseteq P$, the support of $P$ is not larger than the support of $Q$. In the last few years, a number of admissible support measures have appeared. These include:

  • Maximum independent set support (MIS): based on the maximum number of non-overlapping matches
  • Harmful overlap support (HO): based on the number of matches for which no multi-node subgraph is identical
  • Minimum image-based support (MNI): based on the number of times a node in the pattern is mapped to a distinct node in the graph

MIS Support

The maximum independent set support $\sigma_{MIS}$ counts how many times a pattern maps to a subgraph that does not overlap with any other match. The computation of MIS is as follows:

  1. Construct an overlap graph $G_O=(V_O,E_O)$, in which the set of nodes $V_O$ is the set of matches of a pattern $P$ in a graph $G$, and $E_O=\{(f_1,f_2)\mid f_1,f_2\in V_O,\ f_1\neq f_2,\ f_1\sqcap f_2\neq\emptyset\}$, i.e., $E_O$ has an edge between each pair of overlapping matches.
  2. Compute $\sigma_{MIS}$ as the size of the maximum independent set of nodes in the overlap graph.

Recall that finding the maximum independent set is NP-hard.

Harmful Overlap Support

The HO support is less restrictive than the MIS support, as it considers as overlapping only those matches of a pattern that share one or more edges. Two matches that are connected only through a node are not considered overlapping.

MNI Support

The minimum image-based support is even less restrictive as it considers the minimum number of times a node of the pattern matches a node in the graph.

Definition of MNI. Let $f_1,\dots,f_m$ be the set of isomorphisms of a pattern $P=\langle V_P,E_P,\ell_P\rangle$ in a graph $G$. Let $F(v)=|\{f_1(v),\dots,f_m(v)\}|$ be the number of distinct mappings of a node $v\in V_P$ to a node in $G$ by the functions $f_1,\dots,f_m$. The MNI support $\sigma_{MNI}(P)$ of $P$ in $G$ is

$\sigma_{MNI}(P)=\min\{F(v)\mid v\in V_P\}$

Since MNI considers one node at a time, given the matches of the pattern in the graph, the MNI support can be computed in polynomial time.
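
The MNI definition reduces to a few set operations; the matches below are hypothetical mappings of pattern nodes `'u'`, `'v'` to graph nodes:

```python
# Each match f maps every pattern node to a graph node
matches = [
    {'u': 1, 'v': 2},
    {'u': 1, 'v': 3},
    {'u': 4, 'v': 2},
]
pattern_nodes = ['u', 'v']

# F(v): number of distinct graph nodes each pattern node is mapped to
F = {v: len({f[v] for f in matches}) for v in pattern_nodes}
mni = min(F.values())   # sigma_MNI = minimum over the pattern nodes
```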

Approaches for Large Graphs

With an admissible (anti-monotone) support measure, we can apply any of the previous algorithms, such as gSpan, changing only how the support is computed and leaving the rest unchanged. Such approaches are called grow-and-store:

  • Add all frequent edges to the list of frequent subgraphs
  • Extend frequent subgraphs to larger candidates
  • Compute candidate frequency (find all occurrences)
  • Repeat step 2 until no more frequent subgraphs are found

One of the methods that exploits and optimizes gSpan is GraMi.