Data Mining 10 Graph Neural Networks
GNN - graph neural networks
Intuition: how to propagate information over the neighbors in a non-linear manner.
Graph Embeddings as Matrix Factorization
Linear embeddings correspond to the SVD of the similarity matrix chosen in the objective.
For adjacency-based similarity, the best linear $k$-dimensional embedding corresponds to the projection of the data onto the directions associated with the top-$k$ singular values of the similarity matrix: $\mathbf Z=\mathbf U_k\,\boldsymbol\Sigma_k^{1/2}$, where $\mathbf A\approx\mathbf U_k\boldsymbol\Sigma_k\mathbf V_k^\top$ is the truncated SVD.
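For concreteness, a minimal numpy sketch of this linear embedding on a toy adjacency matrix (the graph and the value of $k$ are made up for illustration):

```python
import numpy as np

# Toy symmetric adjacency matrix of a 4-node undirected graph (illustrative).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

k = 2
U, S, Vt = np.linalg.svd(A)        # SVD of the similarity (here: adjacency) matrix
Z = U[:, :k] * np.sqrt(S[:k])      # k-dimensional linear embedding from the top-k directions
# Z @ Z.T is a rank-k reconstruction of the similarity matrix
```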
Question: does there exist a matrix that, once factorized, would be approximately equivalent to running gradient descent over the non-linear objective?
NetMF
DeepWalk is one of the most popular random-walk embedding methods.
A random-walk approach samples random walks using a certain strategy.
A random-walk embedding method optimizes the likelihood that a node $v$ is reachable by a random walk starting from node $u$.
DeepWalk’s strategy is to generate random walks of length $L$ and then generate pairs of co-occurring nodes within a window of size $T$. The co-occurrences are stored in a multi-set $\mathcal D$.
Pseudo-code (a Python sketch of this sampling follows the list):
- for each walk $n = 1, 2, \dots, N$ do
    - pick $w_1^n$ according to a probability distribution $P(w_1)$
    - generate a vertex sequence $(w_1^n, \dots, w_L^n)$ of length $L$ by random walk on the network $G$
    - for each $j = 1, 2, \dots, L-T$ do
        - for each $r = 1, \dots, T$ do
            - add vertex-context pair $(w_j^n, w_{j+r}^n)$ to multi-set $\mathcal D$
            - add vertex-context pair $(w_{j+r}^n, w_j^n)$ to multi-set $\mathcal D$
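A minimal Python sketch of the sampling procedure above (the toy graph and the values of N, L, T are illustrative; for simplicity the first vertex is drawn uniformly rather than from the degree-proportional prior discussed next):

```python
import random
from collections import Counter

graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}   # adjacency lists of a toy graph
N, L, T = 100, 10, 2                                    # number of walks, walk length, window size

D = Counter()                                           # multi-set of (vertex, context) pairs
for _ in range(N):
    walk = [random.choice(list(graph))]                 # first vertex (uniform prior here)
    for _ in range(L - 1):
        walk.append(random.choice(graph[walk[-1]]))     # uniform random-walk step
    for j in range(L - T):
        for r in range(1, T + 1):
            D[(walk[j], walk[j + r])] += 1              # forward vertex-context pair
            D[(walk[j + r], walk[j])] += 1              # backward vertex-context pair
```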
The first vertex of the random walk is sampled from a prior distribution that, in undirected, connected, non-bipartite graphs, corresponds to $P(w_1=i)=\frac{d_i}{\mathrm{vol}(G)}$.
This process, and the corresponding transformation through the softmax, corresponds to a similarity matrix. We can express the similarity between two nodes $i$ and $j$ as
$$\log\left(\frac{\text{(\# of RW co-occurrences of }i\text{ and }j)\cdot\text{(\# of node/context pairs)}}{k\cdot\text{(\# of occurrences of }i)\cdot\text{(\# of occurrences of }j)}\right)=\log\left(\frac{\#(i,j)\,|\mathcal D|}{k\,\#(i)\,\#(j)}\right)$$
where $k$ is the number of negative samples.
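As a quick illustration, a small Python sketch that evaluates this quantity from raw co-occurrence counts (the toy multi-set D and k = 1 are made up for the example):

```python
import math
from collections import Counter

# Toy multi-set of (vertex, context) co-occurrence counts, e.g. produced as above.
D = Counter({(0, 1): 8, (1, 0): 8, (0, 2): 4, (2, 0): 4, (1, 2): 6, (2, 1): 6})

def log_similarity(D, i, j, k=1):
    total = sum(D.values())                               # |D|: number of vertex/context pairs
    cnt_i = sum(c for (v, _), c in D.items() if v == i)   # #(i): occurrences of i as vertex
    cnt_j = sum(c for (_, u), c in D.items() if u == j)   # #(j): occurrences of j as context
    return math.log(D[(i, j)] * total / (k * cnt_i * cnt_j))

print(log_similarity(D, 0, 1))
```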
To understand this formula:
First, partition the multi-set $\mathcal D$ of node pairs co-occurring in the walks into two sub-multi-sets $\mathcal D_{r\to}$ and $\mathcal D_{r\leftarrow}$, according to whether the context node occurs $r$ positions after or before the other node in the walk.
Then observe that
$$\frac{\#(i,j)}{|\mathcal D|}=\frac{1}{2T}\sum_{r=1}^{T}\left(\frac{\#(i,j)_{r\to}}{|\mathcal D_{r\to}|}+\frac{\#(i,j)_{r\leftarrow}}{|\mathcal D_{r\leftarrow}|}\right),$$
where the latter observation comes from $|\mathcal D_{r\to}|=|\mathcal D_{r\leftarrow}|=\frac{|\mathcal D|}{2T}$.
Then observe what happens as the length of the random walk $L\to\infty$. We first define the random-walk matrix $\mathbf P=\mathbf D^{-1}\mathbf A$. The volume of a graph, $\mathrm{vol}(G)$, is the sum of the entries of the adjacency matrix $\mathbf A$. There are three important lemmas:
$$\frac{\#(i,j)_{r\to}}{|\mathcal D_{r\to}|}\xrightarrow{\;p\;}\frac{d_i}{\mathrm{vol}(G)}(\mathbf P^r)_{i,j}
\qquad\text{and}\qquad
\frac{\#(i,j)_{r\leftarrow}}{|\mathcal D_{r\leftarrow}|}\xrightarrow{\;p\;}\frac{d_j}{\mathrm{vol}(G)}(\mathbf P^r)_{j,i}$$
$$\frac{\#(i,j)}{|\mathcal D|}\xrightarrow{\;p\;}\frac{1}{2T}\sum_{r=1}^{T}\left(\frac{d_i}{\mathrm{vol}(G)}(\mathbf P^r)_{i,j}+\frac{d_j}{\mathrm{vol}(G)}(\mathbf P^r)_{j,i}\right)$$
$$\frac{\#(i,j)\,|\mathcal D|}{\#(i)\,\#(j)}\xrightarrow{\;p\;}\frac{\mathrm{vol}(G)}{2T}\left(\frac{1}{d_j}\sum_{r=1}^{T}(\mathbf P^r)_{i,j}+\frac{1}{d_i}\sum_{r=1}^{T}(\mathbf P^r)_{j,i}\right)$$
Putting it all together, we obtain that DeepWalk with infinite-length random walks converges to (i.e., implicitly factorizes) the matrix
$$\log\left(\frac{\mathrm{vol}(G)}{k\,T}\left(\sum_{r=1}^{T}\mathbf P^{r}\right)\mathbf D^{-1}\right).$$
Notice that the internal sum closely resembles a random-walk iteration, since the matrix power $\mathbf P^r$ encodes the probability of reaching a node in $r$ steps of a random walk on the graph. By factorizing such a similarity matrix with SVD we obtain the desired result: embeddings that approximate those of a random-walk approach.
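A minimal numpy sketch of this closed form (toy graph; T, k, and the embedding dimension are illustrative). The element-wise log of the matrix is factorized with SVD, which is essentially what NetMF does:

```python
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)   # toy undirected graph
d = A.sum(axis=1)                            # node degrees
vol = A.sum()                                # vol(G)
P = A / d[:, None]                           # random-walk matrix D^{-1} A
T, k, dim = 10, 1, 2                         # window, negative samples, embedding size

S = sum(np.linalg.matrix_power(P, r) for r in range(1, T + 1))
M = (vol / (k * T)) * S @ np.diag(1.0 / d)   # matrix implicitly factorized by DeepWalk
logM = np.log(np.maximum(M, 1.0))            # truncated element-wise log (avoids log 0)
U, sigma, Vt = np.linalg.svd(logM)
Z = U[:, :dim] * np.sqrt(sigma[:dim])        # embeddings approximating DeepWalk's
```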
Graph Neural Network
abbr. GNN
An architecture that allows nodes to propagate information to their neighbors in a non-linear manner. We have an attributed graph $G$ with a matrix of node features $\mathbf X$. (Such a matrix encodes node attributes such as labels, profiles, or characteristics.)
The main ingredient of a GNN is the aggregation function, which takes as input the features from the neighbors of a node $v$ and produces an embedding $\mathbf z_v$.
Neighborhood Aggregation
Main idea: generate node embeddings based on the local neighborhood of each node, using neural networks to represent the complex relationships of a graph.
Intuitively, think of the neighborhood of a node as the computation graph of a neural network.
The embeddings at layer $k$ are an aggregation of the embeddings at layer $k-1$.
How to aggregate information across the layers? A simple approach: average neighbor information and apply a neural network.
The embeddings at the initial layer are the attributes, $\mathbf h_v^{0}=\mathbf x_v$. The $k$-th layer embedding of node $v$ is
$$\mathbf h_v^{k}=\sigma\left(\mathbf W_k\sum_{u\in N(v)}\frac{\mathbf h_u^{k-1}}{|N(v)|}+\mathbf B_k\,\mathbf h_v^{k-1}\right),\qquad k=1,\dots,K.$$
$\sigma$ is a non-linear activation function. The sum is the average of each neighbor’s embedding computed at layer $k-1$.
Loss function. $\mathbf W_k$ and $\mathbf B_k$ are the trainable weight matrices. After $K$ layers of neighborhood aggregation, we get an output embedding $\mathbf z_v=\mathbf h_v^{K}$ for each node. The loss function needs to be differentiable so that the parameters can be trained.
We optimize the parameters of the model through stochastic gradient descent (SGD).
Rewriting the model in matrix notation:
$$\mathbf H^{(k)}=\sigma\left(\tilde{\mathbf A}\,\mathbf H^{(k-1)}\mathbf W_k^\top+\mathbf H^{(k-1)}\mathbf B_k^\top\right),$$
where $\tilde{\mathbf A}=\mathbf D^{-1}\mathbf A$. To optimize the parameters, a cross-entropy loss function would work in a supervised setting w.r.t. the nodes’ labels.
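A minimal numpy sketch of one such layer in matrix form (toy graph; the feature and embedding dimensions and the random W and B are illustrative):

```python
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)    # toy graph
X = np.random.randn(4, 3)                    # node feature matrix, one row per node
A_tilde = np.diag(1.0 / A.sum(axis=1)) @ A   # row-normalized adjacency: average over neighbors

W = np.random.randn(2, 3) * 0.1              # trainable neighbor-aggregation weights
B = np.random.randn(2, 3) * 0.1              # trainable self-embedding weights

H = X                                        # layer-0 embeddings are the attributes
H = np.maximum(A_tilde @ H @ W.T + H @ B.T, 0.0)   # one layer: ReLU(Ã H Wᵀ + H Bᵀ)
```

In a supervised setting, the final embeddings H would be fed into a softmax classifier and W, B trained with SGD against the cross-entropy loss over the node labels, as described above.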
An important property of these models is the ability to share parameters, which allows generalization to unseen nodes, edges, or graphs.
Depending on the loss function and the embedding matrix, the GNN can learn node embeddings, graph embeddings, or clusters.
Graph Convolutional Networks
abbr. GCN
A different neighborhood aggregation.
The main idea is that a node with a large number of connections should be less important. There is no need for the bias term and the simple per-neighbor normalization: self-loops and a symmetric per-neighbor normalization $1/\sqrt{|N(u)|\,|N(v)|}$ are used instead.
Normalization empirically attains better results and allows better parameter sharing. We can efficiently compute the embeddings using sparse batch operations:
$$\mathbf H^{(k+1)}=\sigma\left(\mathbf D^{-1/2}\,\tilde{\mathbf A}\,\mathbf D^{-1/2}\,\mathbf H^{(k)}\,\mathbf W_k\right),$$
where $\tilde{\mathbf A}=\mathbf A+\mathbf I$ and $\mathbf D$ is the diagonal degree matrix of $\tilde{\mathbf A}$. The time complexity to propagate the information is linear in the number of edges, $O(|E|)$.
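A minimal scipy/numpy sketch of this propagation rule with sparse operations (toy graph, random weights):

```python
import numpy as np
import scipy.sparse as sp

A = sp.csr_matrix(np.array([[0, 1, 1, 0],
                            [1, 0, 1, 1],
                            [1, 1, 0, 1],
                            [0, 1, 1, 0]], dtype=float))
A_tilde = A + sp.eye(A.shape[0])                      # add self-loops: Ã = A + I
deg = np.asarray(A_tilde.sum(axis=1)).ravel()
D_inv_sqrt = sp.diags(1.0 / np.sqrt(deg))             # D^{-1/2}
A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt             # symmetrically normalized adjacency

X = np.random.randn(4, 3)                             # node features
W = np.random.randn(3, 2) * 0.1                       # layer weight matrix
H = np.maximum(A_hat @ X @ W, 0.0)                    # one GCN layer with ReLU
```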
GraphSAGE
Instead of a fixed aggregation, GraphSAGE uses any differentiable function that maps a set of vectors to a single vector. So, if we use a generalized aggregation, we get
$$\mathbf h_v^{k}=\sigma\left(\left[\mathbf W_k\cdot\mathrm{AGG}\left(\{\mathbf h_u^{k-1},\ \forall u\in N(v)\}\right),\ \mathbf B_k\,\mathbf h_v^{k-1}\right]\right).$$
Concatenate the self embedding with the aggregated neighbor embedding, and use a generalized aggregation function $\mathrm{AGG}$.
Some variants (a sketch of all three aggregators follows the list):
Mean. Take a weighted average of the neighbors: $\mathrm{AGG}=\sum_{u\in N(v)}\frac{\mathbf h_u^{k-1}}{|N(v)|}$
Pool. Transform the neighbor vectors and apply a symmetric vector function: $\mathrm{AGG}=\gamma\left(\{\mathbf Q\,\mathbf h_u^{k-1},\ \forall u\in N(v)\}\right)$, where $\gamma$ is an element-wise mean/max.
LSTM. Apply an LSTM to the reshuffled neighbors: $\mathrm{AGG}=\mathrm{LSTM}\left(\left[\mathbf h_u^{k-1},\ \forall u\in\pi(N(v))\right]\right)$, where $\pi$ is a random permutation.
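A small numpy sketch of the three aggregators applied to the embeddings of one node's neighbors (the neighbor matrix, Q, and the dimensions are illustrative; the LSTM variant only shows the reshuffling step, the recurrent cell itself is omitted):

```python
import numpy as np

h_neigh = np.random.randn(5, 4)              # layer-(k-1) embeddings of 5 neighbors, dim 4

# Mean: average of the neighbors' embeddings
agg_mean = h_neigh.mean(axis=0)

# Pool: transform each neighbor with Q, then apply an element-wise symmetric function (max)
Q = np.random.randn(4, 4) * 0.1
agg_pool = (h_neigh @ Q).max(axis=0)

# LSTM: feed the neighbors in a random order to a sequence model (cell omitted here)
shuffled = h_neigh[np.random.permutation(len(h_neigh))]
```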