Data Mining 9 Graph Embedding
Main problem with graphs: lack of predefined node ordering.
Graphs are non-Euclidean: the distance between two nodes does not depend on any coordinates of the nodes.
The original idea of graph embedding: turn the graph into a set of multi-dimensional points in which Euclidean distances represent similarities among the nodes.
Two families of embedding methods:
- Linear embeddings: a linear transformation of the edge (adjacency) information.
- Random-walk embeddings: preserve the probability that a node is reachable from another node.
Linear Embeddings
Simplest form of graph embeddings.
Let $z_u \in \mathbb{R}^d$ be the embedding vector of node $u$.
The similarity between vectors $z_u$ and $z_v$ is the dot product $z_u^\top z_v$. If two nodes are "close" under some definition of similarity in the graph, they should also be similar in the embedding space.
To define embedding vectors, we need
- An encoder from nodes to the embedding space.
- A similarity measure between nodes in the graph.
Then we need to define an objective (loss) function such that the embeddings are the parameters being optimized. Depending on the loss function, we can then use the appropriate optimizer to find the embeddings.
Shallow encoding. We define the embedding matrix $Z \in \mathbb{R}^{|V| \times d}$ in which row $v$ is the embedding $z_v$ of node $v$. Given $Z$, we can encode each node into its corresponding embedding by a lookup on the embedding matrix. Let $\mathbb{1}_v$ be the indicator vector that has a $1$ in position $v$ and $0$ otherwise. Then $z_v = Z^\top \mathbb{1}_v$.
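A minimal sketch of the lookup, assuming the row-wise embedding matrix defined above (the sizes and values are arbitrary placeholders):

```python
import numpy as np

num_nodes, d = 5, 3                 # toy sizes, for illustration only
Z = np.random.rand(num_nodes, d)    # embedding matrix: row v is the embedding z_v

v = 2
one_hot = np.zeros(num_nodes)
one_hot[v] = 1.0

z_v = Z.T @ one_hot                 # lookup via the indicator vector
assert np.allclose(z_v, Z[v])       # ...which is just reading row v of Z
```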
Node similarity. Intuitively, check whether two nodes are connected, how far apart they are in shortest-path distance, whether one is reachable from the other by a random walk, or simply whether they share any common information.
Adjacency-based Similarity
The simplest: two nodes are similar if they share an edge between them.
Adjacency-based similarity implies that the dot product between the embedding vectors should approximate the entries of the adjacency matrix $A$: $z_u^\top z_v \approx A_{u,v}$.
We then find embeddings that minimize the distance between the entries $A_{u,v}$ and the dot products of the embedding vectors $z_u^\top z_v$:
$$\mathcal{L} = \sum_{(u,v) \in V \times V} \left( z_u^\top z_v - A_{u,v} \right)^2,$$
which corresponds to the minimization of the Frobenius norm $\|A - Z Z^\top\|_F^2$ between $A$ and the matrix $Z Z^\top$.
In order to minimize the objective, we use gradient descent or a matrix decomposition (such as SVD or QR decomposition) of $A$.
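A small sketch of the gradient-descent route on a toy graph (the graph, dimension, learning rate, and iteration count are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy undirected graph as an adjacency matrix (illustrative only).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

d = 2                                   # embedding dimension
Z = rng.normal(scale=0.1, size=(A.shape[0], d))

lr = 0.01
for _ in range(2000):
    R = Z @ Z.T - A                     # reconstruction residual
    grad = 4 * (R @ Z)                  # gradient of ||A - Z Z^T||_F^2 w.r.t. Z
    Z -= lr * grad

print(np.round(Z @ Z.T, 2))             # should roughly approximate the entries of A
```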
Pros and Cons
👍
- The similarity function is straightforward
- Many methods for solving the optimization problem.
👎
- $O(|V|^2)$ running time to consider all pairs, or $O(|E|)$ if we sum only over the non-zero entries.
- $O(|V| \cdot d)$ parameters, one $d$-dimensional vector for each node.
- Only considers direct, local connections through the adjacency matrix.
Multi-hop Similarity
Consider connections at distance $k$ from a node to remedy the strict requirements of adjacency-based similarity.
The $k$-th power $A^k$ of the adjacency matrix captures $k$-hop connections. Indeed, the entry $(u,v)$ of $A^k$ represents the number of paths of length $k$ from node $u$ to node $v$.
Similarly to the case above, the loss function minimizes the difference between the entries of the $k$-th power of the adjacency matrix and the dot products of the embedding vectors of nodes $u$ and $v$:
$$\mathcal{L} = \sum_{(u,v) \in V \times V} \left( z_u^\top z_v - A^k_{u,v} \right)^2.$$
Again, this corresponds to minimizing $\|A^k - Z Z^\top\|_F^2$. In practice, to avoid numerical explosion, GraRep uses a log-transformed probabilistic adjacency matrix
$$\tilde{A}^k_{u,v} = \log\!\left(\frac{A^k_{u,v}}{d_u}\right) - \log(\alpha),$$
where $d_u$ is the degree of node $u$ and $\alpha$ is a constant shift. GraRep trains embeddings with different powers $k$ and concatenates the resulting embedding vectors.
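A rough sketch of this pipeline, assuming the log-transform reconstructed above (the defaults for d, k_max, and alpha, and the clipping of negative entries, are illustrative choices rather than the exact GraRep recipe):

```python
import numpy as np

def grarep_like_embedding(A, d=2, k_max=3, alpha=None):
    """Embed each power of the probabilistic adjacency matrix and concatenate."""
    n = A.shape[0]
    alpha = alpha if alpha is not None else 1.0 / n   # constant shift (illustrative default)
    deg = A.sum(axis=1, keepdims=True)
    P = A / np.maximum(deg, 1.0)                      # probabilistic (row-normalised) adjacency
    blocks, Pk = [], np.eye(n)
    for _ in range(k_max):
        Pk = Pk @ P                                   # k-step transition probabilities
        X = np.log(np.maximum(Pk, 1e-12)) - np.log(alpha)
        X = np.maximum(X, 0.0)                        # keep only positive entries (common choice)
        U, S, _ = np.linalg.svd(X)
        blocks.append(U[:, :d] * np.sqrt(S[:d]))      # d-dimensional block for this hop
    return np.concatenate(blocks, axis=1)             # d * k_max dimensions per node
```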
A more flexible option to capture multi-hop similarities relies on the overlap between node neighborhoods, using overlap functions such as the Jaccard similarity or the Adamic-Adar score, which is $\sum_{w \in N(u) \cap N(v)} \frac{1}{\log d_w}$. The loss function becomes
$$\mathcal{L} = \sum_{(u,v) \in V \times V} \left( z_u^\top z_v - S_{u,v} \right)^2,$$
where $S_{u,v}$ is the chosen neighborhood-overlap similarity.
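As a concrete example, the full Adamic-Adar similarity matrix can be computed directly from the adjacency matrix; the toy graph below is an illustrative placeholder:

```python
import numpy as np

# Toy graph; every node has degree >= 2 so that 1 / log(deg) is well defined.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)

deg = A.sum(axis=1)
# Entry (u, v) sums 1 / log(d_w) over the common neighbours w of u and v,
# which is exactly the Adamic-Adar score.
AA = A @ np.diag(1.0 / np.log(deg)) @ A
np.fill_diagonal(AA, 0.0)                 # self-similarity is not used
print(np.round(AA, 2))
```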
One drawback of linear embeddings is that, in their current form, they require computing the pairwise similarities for all nodes, taking $O(|V|^2)$ time in the worst case. For large graphs, a quadratic complexity is not feasible. Moreover, we need to define the similarity beforehand.
Neural Embeddings
An alternative to the pairwise computation exploits the efficiency of sampling from probability distributions.
Two flavours: random-walk based embeddings and non-linear similarity embeddings.
Random Walk Embeddings
Let $P_R(v \mid u)$ be the probability that node $u$ and node $v$ co-occur on a random walk over the graph. We can estimate the probability of visiting node $v$ from node $u$ using random walks or methods like Personalized PageRank. If we sample a large enough number of random walks, the process converges to the PageRank.
Random walks are efficient: they do not require traversing the entire graph, they can be easily simulated, and they converge to probability distributions.
Assume we have a strategy $R$ to sample random walks. We can define our embedding algorithm (see the sketch after these steps):
- Run short fixed-length random walks starting from each node $u$ using strategy $R$.
- For each node $u$, collect $N_R(u)$, the multiset of nodes visited on random walks starting from $u$.
- Optimize the embeddings according to
$$\max_{Z} \sum_{u \in V} \log P\!\left(N_R(u) \mid z_u\right).$$
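A minimal sketch of the sampling step, using uniform random walks as the strategy R (the walk length, number of walks, and toy graph are arbitrary choices):

```python
import random
from collections import defaultdict

def sample_neighborhoods(adj, walk_len=5, walks_per_node=10, seed=0):
    """adj maps each node to the list of its neighbours.
    Returns N_R(u): the multiset (list) of nodes visited on walks started at u."""
    rng = random.Random(seed)
    neighborhoods = defaultdict(list)
    for u in adj:
        for _ in range(walks_per_node):
            cur = u
            for _ in range(walk_len):
                if not adj[cur]:
                    break                          # dead end: stop this walk early
                cur = rng.choice(adj[cur])         # uniform strategy R
                neighborhoods[u].append(cur)
    return neighborhoods

# Toy graph for illustration.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
N_R = sample_neighborhoods(adj)
```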
The optimization finds embeddings that maximize the likelihood of random-walk co-occurrences. We can parametrize $P(v \mid z_u)$ using a softmax:
$$P(v \mid z_u) = \frac{\exp(z_u^\top z_v)}{\sum_{n \in V} \exp(z_u^\top z_n)}, \qquad \mathcal{L} = \sum_{u \in V} \sum_{v \in N_R(u)} -\log \frac{\exp(z_u^\top z_v)}{\sum_{n \in V} \exp(z_u^\top z_n)}.$$
The above loss can be optimized using gradient descent. However, the nested sum takes $O(|V|^2)$, and the real bottleneck is the denominator of the probability, which sums over all nodes. One strategy to reduce the complexity is to employ a negative sampling approach.
Negative sampling treats the problem as a classification problem and splits the fraction into two parts:
$$\log \frac{\exp(z_u^\top z_v)}{\sum_{n \in V} \exp(z_u^\top z_n)} \approx \log \sigma(z_u^\top z_v) + \sum_{i=1}^{k} \log \sigma(-z_u^\top z_{n_i}), \qquad n_i \sim P_V,$$
where $n_i$ is a node sampled from a negative distribution $P_V$ over the nodes and $\sigma$ is the sigmoid function. A typical choice of negative distribution is to sample negative nodes proportionally to the degree of each node. The more negatives $k$ we sample, the more robust the approximation; a higher $k$ also corresponds to a higher prior on negative events.
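A sketch of this objective for a single co-occurring pair, assuming embeddings stored row-wise in a matrix Z and a degree-proportional negative distribution (the helper name and the default k are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(Z, u, v, degrees, k=5):
    """Negative log-likelihood of the co-occurring pair (u, v) with k negative samples."""
    p_neg = degrees / degrees.sum()                        # negatives sampled proportional to degree
    negatives = rng.choice(len(degrees), size=k, p=p_neg)
    pos = np.log(sigmoid(Z[u] @ Z[v]))                     # attract the observed pair
    neg = np.log(sigmoid(-Z[u] @ Z[negatives].T)).sum()    # repel the sampled negatives
    return -(pos + neg)
```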
Pros and Cons
👍
- Expressive and non-linear
- Scales up with sampling techniques
👎
- Similarity is fixed to random walks
General Similarities
Another approach to graph embedding is to generalize the similarity-based approaches to any similarity we can sample from. An appealing choice of similarity is Personalized PageRank (PPR). For a given restart probability $\alpha$ and restart vector $e_u$ for node $u$, the PPR vector is
$$\pi_u = \alpha\, e_u + (1 - \alpha)\, \pi_u D^{-1} A,$$
where $D$ is the diagonal degree matrix, from which we obtain the PPR matrix $\Pi \in \mathbb{R}^{|V| \times |V|}$,
where each row $u$ represents the PPR vector $\pi_u$.
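A sketch of computing the PPR matrix by power iteration, following the recursion above (alpha and the iteration count are illustrative; the graph is assumed to have no isolated nodes):

```python
import numpy as np

def ppr_matrix(A, alpha=0.15, iters=100):
    """Row u of the result approximates the PPR vector pi_u."""
    P = A / A.sum(axis=1, keepdims=True)      # random-walk transition matrix D^{-1} A
    n = A.shape[0]
    Pi = np.eye(n)                            # start each pi_u at its restart vector e_u
    for _ in range(iters):
        Pi = alpha * np.eye(n) + (1 - alpha) * Pi @ P
    return Pi
```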
One choice to embed $\Pi$ is to use SVD; however, this embedding method is again linear. We instead consider each row $\pi_u$ separately, as the probability distribution of reaching any node from node $u$. This is the approach of VERSE. VERSE computes the embedding matrix $Z$ so as to preserve the distribution of the similarities node by node.
VERSE optimizes the KL-divergence between the similarity distribution $\pi_u$ of node $u$ in the graph and the corresponding distribution $q_u$ of node $u$ in the embedding space, where the KL-divergence between distributions $p$ and $q$ is
$$KL(p \,\|\, q) = \sum_{i} p(i) \log \frac{p(i)}{q(i)}.$$
Typically, with the KL-divergence the distribution $p$ is fixed and given as input, while the distribution $q$ is the one to learn. If that is the case, in order to learn $q$, notice that
$$KL(p \,\|\, q) = \sum_{i} p(i) \log p(i) - \sum_{i} p(i) \log q(i),$$
i.e., the negative entropy of $p$ plus the cross-entropy of $p$ and $q$.
The first term, the (negative) entropy of $p$, is constant when we minimize over $q$. So it suffices to reduce the objective to the cross-entropy $-\sum_i p(i) \log q(i)$.
For each node $u$ we can parametrize $q_u$ as the one-to-all similarity in the embedding space, given by the softmax of the dot products of the embeddings:
$$q_u(v) = \frac{\exp(z_u^\top z_v)}{\sum_{n \in V} \exp(z_u^\top z_n)}.$$
Notice that, in vectorized form $q_u = \operatorname{softmax}(Z z_u)$, the numerator multiplies $z_u$ by the entire embedding matrix $Z$. The final loss becomes
$$\mathcal{L} = -\sum_{u \in V} \sum_{v \in V} \pi_u(v) \log q_u(v).$$
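A sketch of this objective for a single node u, with the full softmax written out explicitly (Pi is assumed to be a precomputed similarity matrix such as the PPR matrix; summing over all nodes gives the loss above, and in practice the full softmax is avoided via sampling, as noted next):

```python
import numpy as np

def verse_like_loss(Z, Pi, u):
    """Cross-entropy between the input distribution pi_u and softmax(Z @ z_u)."""
    scores = Z @ Z[u]                              # numerator: z_u against the whole matrix Z
    scores -= scores.max()                         # for numerical stability
    q_u = np.exp(scores) / np.exp(scores).sum()    # one-to-all similarity in embedding space
    return -np.sum(Pi[u] * np.log(q_u + 1e-12))    # cross-entropy against pi_u
```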
The loss $\mathcal{L}$ can benefit from a strategy similar to negative sampling. Another alternative is to use Noise Contrastive Estimation (NCE), which changes the loss function into a sampled classification objective.