Data Mining 7: Community Detection
Community detection finds sets of nodes that are densely connected inside the community and sparsely connected to the rest of the graph.
Community: a group of individuals sharing common interests.
A community is a subset of the nodes $C \subseteq V$. Non-overlapping communities, or partitions, are pairwise disjoint sets, i.e., $C_i \cap C_j = \emptyset$ for each $i \neq j$. Overlapping communities are not necessarily pairwise disjoint.
Non-overlapping Community Detection
Disjoint communities are also known as partitions.
Two common algorithms for non-overlapping community detection are Spectral clustering and Modularity optimization.
Modularity Optimization
Modularity is a measure of cohesiveness within communities.
Modularity compares the edge density inside communities with the density expected under a null model. The null model assumes that nodes form connections at random with any other node.
$$Q \propto \sum_{c \in C} \big[ (\text{number of edges within community } c) - (\text{expected number of edges within community } c) \big]$$
The main questions are how to quantify the edges within a community and what constitutes a good null model.
Edges within a community. To count how many edges belong to the same community, we sum the adjacency entries within each community and normalize by the total number of edges: $\frac{1}{2m}\sum_{c \in C}\sum_{i,j \in c} A_{ij}$.
Since we consider all ordered pairs of nodes, we normalize by twice the number of edges, $2m$, as we count both $(i,j)$ and $(j,i)$.
Modularity’s null model. Given a graph $G$ with $n$ vertices and $m$ edges, construct a rewired graph $G'$. The rewired graph has the same degree distribution, but the connections among nodes appear at random. By preserving the degrees, we ensure that we do not compare the graph with a completely random one.
Nodes with large degree represent celebrities; altering their degree might end up turning a star into an ordinary person.
If nodes are allowed to form a connection with any other node as long as the degrees are preserved, the probability that, given one endpoint of an edge at node $i$, the other endpoint is node $j$ is $\frac{k_j}{2m}$,
where $k_i, k_j$ are the degrees of $i$ and $j$ respectively. Since node $i$ has $k_i$ endpoints, the expected number of edges between nodes $i$ and $j$ is $\frac{k_i k_j}{2m}$. As such, the expected number of edges in $G'$ is $\frac{1}{2}\sum_i\sum_j \frac{k_i k_j}{2m} = \frac{1}{2}\cdot\frac{(2m)^2}{2m} = m$, the same as in $G$.
Modularity. Putting it all together, the modularity of a partition $C$ of $G$ is
$$Q(G, C) = \frac{1}{2m}\sum_{c \in C}\sum_{i,j \in c}\left(A_{ij} - \frac{k_i k_j}{2m}\right).$$
This value lies in $[-1, 1]$. A modularity close to $1$ indicates a good partition, while a negative value indicates a partition in which connected nodes are separated. A strong community structure usually corresponds to values between $0.3$ and $0.7$.
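As a sanity check on the formula, here is a minimal NumPy sketch (the function name `modularity` and the toy graph are illustrative, not part of the lecture) that evaluates $Q$ directly from an adjacency matrix and a community label vector.

```python
import numpy as np

def modularity(A, labels):
    """Q of a partition: (1/2m) * sum over same-community pairs of (A_ij - k_i k_j / 2m)."""
    k = A.sum(axis=1)                          # node degrees
    two_m = k.sum()                            # 2m = sum of degrees
    B = A - np.outer(k, k) / two_m             # B_ij = A_ij - k_i k_j / 2m
    same = labels[:, None] == labels[None, :]  # 1 if i and j share a community
    return (B * same).sum() / two_m

# Toy example: two triangles joined by a single edge.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1
print(modularity(A, np.array([0, 0, 0, 1, 1, 1])))  # 5/14 ≈ 0.357
```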
Direct modularity optimization is NP-hard. The Louvain method is a greedy algorithm that runs in $O(n \log n)$. It works in a bottom-up fashion by progressively merging the two communities that bring the largest increase in modularity.
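In practice Louvain is rarely implemented from scratch; the snippet below is a usage sketch assuming NetworkX ≥ 2.8, which ships `louvain_communities` (older versions need the separate python-louvain package instead).

```python
import networkx as nx
from networkx.algorithms import community

G = nx.karate_club_graph()                         # classic toy graph
parts = community.louvain_communities(G, seed=42)  # list of node sets (a partition)
print(len(parts), community.modularity(G, parts))  # e.g. 4 communities, Q ≈ 0.44
```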
Spectral Modularity
A spectral method for modularity maximization.
First, consider a bi-partition of the graph, i.e., dividing the graph into two communities $C_1, C_2$. Let $\mathbf{s} \in \{-1, +1\}^n$ be a vector that indicates whether a node belongs to one or the other community. An entry is $s_i = +1$ if $i \in C_1$ and $s_i = -1$ if $i \in C_2$.
We can rewrite the equation for $Q$ as
$$Q = \frac{1}{4m}\sum_{i,j}\left(A_{ij} - \frac{k_i k_j}{2m}\right)(s_i s_j + 1),$$
where $\frac{1}{2}(s_i s_j + 1) = 1$ if $i$ and $j$ belong to the same community and $0$ otherwise.
Notice that $\sum_{i,j}\left(A_{ij} - \frac{k_i k_j}{2m}\right) = 0$. Thus
$$Q = \frac{1}{4m}\sum_{i,j}\left(A_{ij} - \frac{k_i k_j}{2m}\right)s_i s_j = \frac{1}{4m}\,\mathbf{s}^\top B\,\mathbf{s},$$
where $B$ with entries $B_{ij} = A_{ij} - \frac{k_i k_j}{2m}$ is referred to as the modularity matrix.
The objective is very close to spectral clustering; however, in this case we maximize the quantity. Since $B$ is symmetric, the variational characterization of eigenvalues tells us that the solution of the relaxed optimization is the eigenvector corresponding to the largest eigenvalue of the modularity matrix.
In order to compute the partition, we relax the vector $\mathbf{s}$ to take any real values subject to $\mathbf{s}^\top\mathbf{s} = n$. Our objective (omitting $\frac{1}{4m}$) becomes
$$\max_{\mathbf{s}\,:\,\mathbf{s}^\top\mathbf{s} = n}\ \mathbf{s}^\top B\,\mathbf{s}, \qquad \text{whose stationarity condition is } B\mathbf{s} = \lambda\mathbf{s}.$$
The last equality is an eigenvalue-eigenvector equation, where $\lambda$ is an eigenvalue with eigenvector $\mathbf{s}$. If we substitute $B\mathbf{s} = \lambda\mathbf{s}$ into the objective, we obtain
$$\mathbf{s}^\top B\,\mathbf{s} = \lambda\,\mathbf{s}^\top\mathbf{s} = \lambda n.$$
In order to maximize the objective, we thus select the largest eigenvalue $\lambda_{\max}$ and its corresponding eigenvector $\mathbf{u}$.
Finally, the partition of the graph is obtained by rounding the relaxed solution:
$$s_i = \begin{cases} +1 & \text{if } u_i \geq 0,\\ -1 & \text{otherwise.} \end{cases}$$
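A minimal NumPy sketch of the whole bi-partition step, following the derivation above (the function name `spectral_bipartition` is illustrative):

```python
import numpy as np

def spectral_bipartition(A):
    """Split a graph in two using the leading eigenvector of the modularity matrix B."""
    k = A.sum(axis=1)
    two_m = k.sum()
    B = A - np.outer(k, k) / two_m      # modularity matrix
    _, eigvecs = np.linalg.eigh(B)      # B is symmetric; eigenvalues in ascending order
    u = eigvecs[:, -1]                  # eigenvector of the largest eigenvalue
    s = np.where(u >= 0, 1, -1)         # round the relaxed solution back to {-1, +1}
    Q = s @ B @ s / (2 * two_m)         # Q = s^T B s / 4m
    return s, Q
```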
Detecting multiple communities. Similar to spectral clustering, we may be tempted to generalize spectral modularity to multiple communities by recursive bi-partitioning. This method, however, assumes that the graph is cut into two parts and that the edges between the two partitions are removed. Unfortunately, removing edges also changes the degrees of the nodes, and in turn the null model that forms the matrix $B$, so we would end up maximizing the wrong quantity. As a remedy, we consider the change in modularity after splitting a community $c$ into two parts,
$$\Delta Q = \frac{1}{4m}\,\mathbf{s}^\top B^{(c)}\,\mathbf{s}, \qquad B^{(c)}_{ij} = B_{ij} - \delta_{ij}\sum_{k \in c} B_{ik},$$
where $\delta_{ij} = 1$ if $i = j$ and $0$ otherwise is called the Kronecker delta, and the sum runs over the nodes $k$ of community $c$.
The above is a generalized spectral method, in which we repeatedly find the leading eigenvector of the modified matrix $B^{(c)}$. It also provides a convenient stopping criterion: once the increase in modularity $\Delta Q$ becomes negative, no further split will increase the modularity and the method can stop.
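A small sketch of the splitting step under these formulas (the helper name `split_gain` is hypothetical): it restricts $B$ to a community, builds $B^{(c)}$, and returns the gain $\Delta Q$ together with the proposed split; one stops splitting as soon as the gain is no longer positive.

```python
import numpy as np

def split_gain(B, nodes, two_m):
    """Propose a split of community `nodes` and return (delta_Q, s).

    B:      full modularity matrix of the graph
    nodes:  indices of the nodes in community c
    two_m:  2m, twice the number of edges of the original graph
    """
    Bc = B[np.ix_(nodes, nodes)].copy()
    Bc[np.diag_indices_from(Bc)] -= Bc.sum(axis=1)  # B^(c)_ij = B_ij - delta_ij * sum_k B_ik
    _, vecs = np.linalg.eigh(Bc)
    s = np.where(vecs[:, -1] >= 0, 1, -1)           # leading eigenvector, rounded to ±1
    delta_Q = s @ Bc @ s / (2 * two_m)              # ΔQ = s^T B^(c) s / 4m
    return delta_Q, s                               # keep the split only if delta_Q > 0
```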
Pros and Cons
👍
- Clusters can have arbitrary shape and size, i.e., clusters are not restricted to have convex shapes.
- Efficient on common real-world graphs.
- Robust to noise.
👎
- Inefficient on dense graphs, where the number of edges approaches $O(n^2)$ in the worst case.
- Resolution limit: small communities tend to be merged when the total number of edges $m$ is large.
Overlapping Community Detection
Nodes can be shared among communities. This setting is harder than the non-overlapping one.
Clique Percolation
Clique percolation detects overlapping communities by exploiting the definition of a clique.
A clique is a fully connected subgraph. In other words, a clique is a set of nodes, each of which has an edge to every other node in the clique.
A $k$-clique is a complete subgraph on $k$ nodes.
A $k$-clique is adjacent to another $k$-clique if it shares $k - 1$ vertices. In clique percolation, we define a community as the union of adjacent $k$-cliques. Given such a definition, it is easy to formulate our first algorithm, CFinder.
The CFinder algorithm starts with a $k$-clique defining a community, where $k$ is a user parameter, and expands the community until there are no more adjacent cliques to add.
Pseudo Code
Input: Graph $G$, size of cliques $k$
- Start with a $k$-clique
- Roll the clique over adjacent $k$-cliques
- A $k$-clique community is the largest subgraph obtained by the union of all adjacent $k$-cliques
- Other $k$-cliques that cannot be reached correspond to other clique communities
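The pseudo-code can be turned into a brute-force Python sketch (names and the toy graph are illustrative); it enumerates $k$-cliques by testing every $k$-subset of nodes, so it is only meant for small graphs.

```python
from itertools import combinations

def clique_communities(nodes, edges, k):
    adj = {u: set() for u in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    # All k-cliques: k-subsets of nodes in which every pair is connected.
    cliques = [frozenset(c) for c in combinations(nodes, k)
               if all(v in adj[u] for u, v in combinations(c, 2))]

    # "Roll" over adjacent cliques (sharing k-1 nodes) with a union-find,
    # so each connected group of cliques becomes one community.
    parent = list(range(len(cliques)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)

    groups = {}
    for i, c in enumerate(cliques):
        groups.setdefault(find(i), set()).update(c)   # union of adjacent k-cliques
    return list(groups.values())

# Two triangles sharing node 2 are NOT adjacent for k = 3 (they share only 1 < k-1 nodes),
# so they form two overlapping communities: [{0, 1, 2}, {2, 3, 4}]
print(clique_communities([0, 1, 2, 3, 4],
                         [(0, 1), (0, 2), (1, 2), (2, 3), (2, 4), (3, 4)], 3))
```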
Pros and Cons
👍
- Easy to implement and understand
- Finding $k$-cliques is polynomial for a fixed $k$
- Communities are easily explainable
👎
- Requires the size of the cliques $k$ as input
- Rigid communities: the results depend on the existence of the cliques.
AGM/BigCLAM
These methods instead take a probabilistic approach.
AGM and BigCLAM are two different models but with very similar premises.
Intuition: the graph is a sample from an unknown distribution in which we flip a coin for the existence of each edge.
The existence probability is determined by the community structure: the more communities two nodes share, the higher the probability.
Given a graph $G$ with communities $C_1, \dots, C_K$, each of them associated with a probability $p_c$, and a membership set $M_u$ for each node $u$, the probability of existence of an edge $(u, v)$ in this model is
$$P(u, v) = 1 - \prod_{c \in M_u \cap M_v} (1 - p_c).$$
The above probability ensures that an edge can exist only if the nodes share at least one community; it is an instance of a noisy-OR. If the nodes share no community, we assign a fixed small probability $\varepsilon$.
Our model assumes that each edge is generated independently of any other edge. As such, given the parameters $\Theta$ (the communities, their probabilities $p_c$, and the memberships $M_u$), the likelihood that our graph $G$ is sampled by the AGM model is
$$P(G \mid \Theta) = \prod_{(u,v) \in E} P(u, v) \prod_{(u,v) \notin E} \big(1 - P(u, v)\big).$$
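To make the noisy-OR and the likelihood concrete, here is a small Python sketch under the above definitions (helper names such as `agm_edge_prob` are hypothetical).

```python
from itertools import combinations
from math import log

def agm_edge_prob(u, v, memberships, p, eps=1e-4):
    """Noisy-OR edge probability: 1 - prod over shared communities of (1 - p_c)."""
    shared = memberships[u] & memberships[v]
    if not shared:
        return eps                          # nodes with no common community
    prob_no_edge = 1.0
    for c in shared:
        prob_no_edge *= (1.0 - p[c])
    return 1.0 - prob_no_edge

def agm_log_likelihood(nodes, edges, memberships, p, eps=1e-4):
    """log P(G | Theta): every node pair is an independent coin flip."""
    edge_set = {frozenset(e) for e in edges}
    ll = 0.0
    for u, v in combinations(nodes, 2):
        q = agm_edge_prob(u, v, memberships, p, eps)
        ll += log(q) if frozenset((u, v)) in edge_set else log(1.0 - q)
    return ll

# Example: communities A = {0, 1, 2} and B = {2, 3}; node 2 belongs to both.
memberships = {0: {"A"}, 1: {"A"}, 2: {"A", "B"}, 3: {"B"}}
print(agm_log_likelihood([0, 1, 2, 3], [(0, 1), (0, 2), (1, 2), (2, 3)],
                         memberships, {"A": 0.9, "B": 0.8}))
```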
AGM is a rather flexible model and can express a variety of community structures, such as non-overlapping and hierarchical communities.
The model assumes that we already know the parameters in order to compute the likelihood of a network.
By maximizing this likelihood we can find the parameters of the model. But how do we find such a model? Finding $\Theta$ means finding a community membership for each node, and this task is hard.
From AGM to BigCLAM. AGM model is hard to optimize. BigCLAM drops some of the requirements of AGM to attain efficiency.
Under BigCLAM, each node $u$ belongs to each community $c$ with some strength $F_{uc} \geq 0$. If $F_{uc} = 0$, node $u$ is not a member of community $c$. The probability that an edge between $u$ and $v$ exists because both belong to community $c$ is
$$P_c(u, v) = 1 - \exp(-F_{uc} F_{vc}).$$
This means that the probability that an edge exists grows with the product of the strengths. If one node does not belong to the community ($F_{uc} = 0$), then $P_c(u, v) = 0$. The probability that at least one common community links the two nodes is
$$P(u, v) = 1 - \prod_{c} \big(1 - P_c(u, v)\big) = 1 - \exp(-F_u F_v^\top),$$
where $F_u$ is the $u$-th row vector of the membership matrix $F$.
To fit the model to the data, i.e., to compute $F$, we use a maximum likelihood approach:
$$\max_{F \geq 0} \prod_{(u,v) \in E} P(u, v) \prod_{(u,v) \notin E} \big(1 - P(u, v)\big).$$
Instead of maximizing the likelihood, we maximize the log-likelihood
$$\ell(F) = \sum_{(u,v) \in E} \log\big(1 - \exp(-F_u F_v^\top)\big) - \sum_{(u,v) \notin E} F_u F_v^\top.$$
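A direct, unoptimized sketch of this log-likelihood (illustrative; it loops over all node pairs and is therefore quadratic in the number of nodes):

```python
import numpy as np

def bigclam_log_likelihood(F, edges):
    """l(F) = sum_{(u,v) in E} log(1 - exp(-F_u·F_v)) - sum_{(u,v) not in E} F_u·F_v."""
    n = F.shape[0]
    edge_set = {frozenset(e) for e in edges}
    ll = 0.0
    for u in range(n):
        for v in range(u + 1, n):
            dot = float(F[u] @ F[v])
            if frozenset((u, v)) in edge_set:
                ll += np.log(1.0 - np.exp(-dot) + 1e-10)  # small offset avoids log(0)
            else:
                ll -= dot
    return ll
```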
Since the matrix $F$ is real, the model can be optimized with coordinate ascent over its rows $F_u$.
Efficiency optimization. The gradient of $\ell$ with respect to a row $F_u$ is
$$\nabla \ell(F_u) = \sum_{v \in \mathcal{N}(u)} F_v \,\frac{\exp(-F_u F_v^\top)}{1 - \exp(-F_u F_v^\top)} \;-\; \sum_{v \notin \mathcal{N}(u)} F_v,$$
and by caching, each iteration can be computed in linear time.
The factor $\sum_{v \notin \mathcal{N}(u)} F_v$ can be written as
$$\sum_{v \notin \mathcal{N}(u)} F_v = \sum_{v} F_v - F_u - \sum_{v \in \mathcal{N}(u)} F_v,$$
which runs in linear time as $\sum_v F_v$ can be cached in advance.
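A sketch of the cached gradient under these assumptions (the helper name and the update shown in the comment are illustrative):

```python
import numpy as np

def bigclam_gradient(F, u, neighbors_u, F_total):
    """Gradient of l(F) w.r.t. the row F_u, with F_total = sum_v F_v cached in advance.

    The non-neighbor term sum_{v not in N(u)} F_v is rewritten as
    F_total - F_u - sum_{v in N(u)} F_v, so the cost is linear in the degree of u.
    """
    grad = np.zeros_like(F[u])
    neigh_sum = np.zeros_like(F[u])
    for v in neighbors_u:
        dot = float(F[u] @ F[v])
        grad += F[v] * (np.exp(-dot) / (1.0 - np.exp(-dot) + 1e-10))
        neigh_sum += F[v]
    grad -= F_total - F[u] - neigh_sum
    return grad

# One projected coordinate-ascent step for node u (F must stay non-negative);
# F_total has to be refreshed after F[u] changes:
# F[u] = np.maximum(0.0, F[u] + lr * bigclam_gradient(F, u, neighbors[u], F_total))
```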
Pros and Cons
👍
- Detect overlapping and non-overlapping communities
- Cluster can have arbitrary shapes
- Nearly linear complexity
- Flexible model.
👎
- Requires the number of communities upfront
- Hard to choose the threshold on the strengths for determining the membership of each node.