GNN - graph neural networks

Intuition: how to propagate information across a node's neighbors in a non-linear manner.

Graph Embeddings as Matrix Factorization

Linear embeddings correspond to the SVD of the similarity matrix chosen in the objective.

For adjacency-based similarity, the best linear $d$-dimensional embedding corresponds to the projection of the dataset onto the $d$ directions associated with the top-$d$ singular values:

$$\text{if } \operatorname{SVD}(\mathbf A)=\mathbf U\Sigma\mathbf V^\top\ \text{ then the embedding matrix is }\ \mathbf Z=\mathbf U_d\Sigma_d$$
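
A minimal numerical sketch of this recipe on a toy adjacency matrix (the graph and the embedding dimension are illustrative choices):

```python
import numpy as np

# Adjacency matrix of a small undirected graph (a 4-cycle), used as the similarity matrix
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)

d = 2                                   # embedding dimension
U, sigma, Vt = np.linalg.svd(A)         # A = U Sigma V^T
Z = U[:, :d] * sigma[:d]                # Z = U_d Sigma_d: top-d singular directions, scaled
print(Z)                                # one d-dimensional embedding per node
```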

Question: does there exist a matrix that, once factorized, would be approximately equivalent to running gradient descent on the non-linear objective?

NetMF

NetMF answers this question for DeepWalk, one of the most popular random-walk embedding methods.

A random-walk approach samples random walks using a certain strategy $R$.

A random-walk embedding method optimizes the likelihood that a node $j$ is reached by a random walk starting from node $i$.

DeepWalk's strategy is to generate random walks of length $L$ and then generate pairs of co-occurring nodes within a window of size $T$. The co-occurrences are stored in a multiset $\mathcal D$.

Pseudo-code (a runnable sketch follows the list):

  • for each $n=1,2,\cdots,N$ do
    • pick $w_1^n$ according to probability distribution $p(w_1)$
    • generate vertex sequence $(w_1^n,\cdots,w_L^n)$ of length $L$ by random walk on network $G$
    • for each $j=1,2,\cdots,L-T$ do
      • for each $r=1,\cdots,T$ do
        • add vertex-context pair $(w_j^n,w_{j+r}^n)$ to multiset $\mathcal D$
        • add vertex-context pair $(w_{j+r}^n,w_j^n)$ to multiset $\mathcal D$
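
A minimal Python sketch of this procedure (the adjacency-dict representation and the function name are my own; the start vertex is sampled proportionally to its degree, as discussed next):

```python
import random

def deepwalk_pairs(G, N, L, T, seed=0):
    """Build the multiset D of (vertex, context) pairs following the pseudo-code above:
    N walks of length L, co-occurrence window of size T.

    G : adjacency dict {node: [neighbors]} of an undirected graph
    """
    rng = random.Random(seed)
    nodes = list(G)
    degrees = [len(G[v]) for v in nodes]      # p(w_1) proportional to the degree d_i
    D = []
    for _ in range(N):
        walk = [rng.choices(nodes, weights=degrees, k=1)[0]]
        for _ in range(L - 1):
            walk.append(rng.choice(G[walk[-1]]))      # uniform random neighbor
        for j in range(L - T):
            for r in range(1, T + 1):
                D.append((walk[j], walk[j + r]))      # (w_j, w_{j+r})
                D.append((walk[j + r], walk[j]))      # (w_{j+r}, w_j)
    return D

# Toy usage on a triangle graph
G = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
D = deepwalk_pairs(G, N=10, L=8, T=2)
print(len(D))   # 2 * N * (L - T) * T pairs
```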

The first vertex of the random walk, $w_1^n$, is sampled from a prior distribution $p(w_1)$ that for undirected, connected, non-bipartite graphs corresponds to $p(i)=d_i/2|E|$.

This process, together with the softmax transformation applied to it, implicitly defines a similarity matrix. We can express the similarity between two nodes $i,j$ as

$$\log\left(\frac{(\#\text{ of walks in which }i,j\text{ co-occur})\cdot(\#\text{ of node/context pairs})}{k\cdot(\#\text{ of occurrences of }i)\cdot(\#\text{ of occurrences of }j)}\right)=\log\left(\frac{\#(i,j)\,|\mathcal D|}{k\,\#(i)\,\#(j)}\right)$$

where $k$ is the number of negative samples.
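
A small sketch of how one entry of this matrix is computed from the empirical counts in $\mathcal D$ (the toy multiset below is made up; by construction $\mathcal D$ contains both orientations of every pair):

```python
import math
from collections import Counter

# Toy multiset D of (vertex, context) pairs, e.g. produced by the sampling sketch above
D = [(0, 1), (1, 0), (1, 2), (2, 1), (0, 1), (1, 0), (0, 2), (2, 0)]
k = 1                                                 # number of negative samples

pair_counts = Counter(D)                              # #(i, j)
vertex_counts = Counter(w for w, _ in D)              # #(i): occurrences of i as a vertex
context_counts = Counter(c for _, c in D)             # #(j): occurrences of j as a context

def similarity(i, j):
    """log( #(i,j) |D| / (k #(i) #(j)) ): how much more often i, j co-occur than by chance."""
    return math.log(pair_counts[(i, j)] * len(D) / (k * vertex_counts[i] * context_counts[j]))

print(similarity(0, 1))                               # ~0.575 for this toy D
```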

To understand this formula:

  • First, partition the multiset $\mathcal D$ of pairs of nodes co-occurring in walks into two sub-multisets $\mathcal D_{r\rightarrow}$ and $\mathcal D_{r\leftarrow}$, according to whether the context appears $r$ steps after or before the vertex:

    $$\mathcal D_{r\rightarrow}=\{(w,c):(w,c)\in\mathcal D,\ w=w_j^n,\ c=w_{j+r}^n\}\qquad \mathcal D_{r\leftarrow}=\{(w,c):(w,c)\in\mathcal D,\ w=w_{j+r}^n,\ c=w_j^n\}$$

  • Then observe that

    $$\log\left(\frac{\#(i,j)\,|\mathcal D|}{k\,\#(i)\,\#(j)}\right)=\log\left(\frac{\#(i,j)/|\mathcal D|}{k\,\bigl(\#(i)/|\mathcal D|\bigr)\bigl(\#(j)/|\mathcal D|\bigr)}\right)$$

    $$\frac{\#(i,j)}{|\mathcal D|}=\frac{1}{2T}\sum_{r=1}^T\left(\frac{\#(i,j)_{r\rightarrow}}{|\mathcal D_{r\rightarrow}|}+\frac{\#(i,j)_{r\leftarrow}}{|\mathcal D_{r\leftarrow}|}\right)$$

    where the latter observation comes from $|\mathcal D_{r\rightarrow}|/|\mathcal D|=|\mathcal D_{r\leftarrow}|/|\mathcal D|=\frac{1}{2T}$.

  • Then observe what happens as the length of the random walk $L\rightarrow\infty$. We first define the random-walk matrix $\mathbf P=\mathbf\Delta^{-1}\mathbf A$. The volume of a graph $G$ is the sum of the entries of the adjacency matrix, $\operatorname{vol}(G)=\sum_i\sum_j a_{i,j}$. There are three important lemmas:

    $$\frac{\#(i,j)_{r\rightarrow}}{|\mathcal D_{r\rightarrow}|}\stackrel{p}{\rightarrow}\frac{d_i}{\operatorname{vol}(G)}(\mathbf P^r)_{i,j}\qquad \frac{\#(i,j)_{r\leftarrow}}{|\mathcal D_{r\leftarrow}|}\stackrel{p}{\rightarrow}\frac{d_j}{\operatorname{vol}(G)}(\mathbf P^r)_{j,i}$$

    $$\frac{\#(i,j)}{|\mathcal D|}\stackrel{p}{\rightarrow}\frac{1}{2T}\sum_{r=1}^T\left(\frac{d_i}{\operatorname{vol}(G)}(\mathbf P^r)_{i,j}+\frac{d_j}{\operatorname{vol}(G)}(\mathbf P^r)_{j,i}\right)$$

    $$\frac{\#(i,j)\,|\mathcal D|}{\#(i)\,\#(j)}\stackrel{p}{\rightarrow}\frac{\operatorname{vol}(G)}{2T}\left(\frac{1}{d_j}\sum_{r=1}^T(\mathbf P^r)_{i,j}+\frac{1}{d_i}\sum_{r=1}^T(\mathbf P^r)_{j,i}\right)$$

  • Putting it all together, we obtain that DeepWalk with infinite-length random walks converges to the similarity matrix

    $$\log\left(\frac{\operatorname{vol}(G)}{k}\left(\frac{1}{T}\sum_{r=1}^T(\mathbf\Delta^{-1}\mathbf A)^r\right)\mathbf\Delta^{-1}\right)$$

    Notice that the inner sum closely resembles a random-walk iteration, since the $r$-th power of the matrix $\mathbf\Delta^{-1}\mathbf A$ encodes the probability of reaching each node in $r$ steps of a random walk on the graph. By factorizing this similarity matrix with SVD we obtain embeddings that approximate those of the random-walk approach (see the sketch below).
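
A minimal numerical sketch of this closed form, NetMF-style (names are my own; entries below 1 are clipped before the log, a common truncation, and the factorization follows the $\mathbf Z=\mathbf U_d\Sigma_d$ recipe above):

```python
import numpy as np

def netmf_embedding(A, T=10, k=5, dim=2):
    """Approximate DeepWalk embeddings by factorizing
    log( vol(G)/(k*T) * sum_{r=1..T} P^r * Delta^{-1} ).

    A   : dense symmetric adjacency matrix (no isolated nodes)
    T   : window size, k : number of negative samples, dim : embedding dimension
    """
    deg = A.sum(axis=1)                       # node degrees
    vol = deg.sum()                           # vol(G) = sum of all entries of A
    D_inv = np.diag(1.0 / deg)                # Delta^{-1}
    P = D_inv @ A                             # random-walk matrix
    # Sum of the first T powers of P (probability of reaching a node in r steps)
    S = np.zeros_like(A, dtype=float)
    P_r = np.eye(A.shape[0])
    for _ in range(T):
        P_r = P_r @ P
        S += P_r
    M = (vol / (k * T)) * S @ D_inv
    log_M = np.log(np.maximum(M, 1.0))        # truncated log: clip entries below 1
    U, sigma, _ = np.linalg.svd(log_M)        # factorize the similarity with SVD
    return U[:, :dim] * sigma[:dim]           # Z = U_d Sigma_d

# Toy usage: a 4-cycle
A = np.array([[0,1,0,1],[1,0,1,0],[0,1,0,1],[1,0,1,0]], dtype=float)
Z = netmf_embedding(A, T=3, k=1, dim=2)
print(Z.shape)   # (4, 2)
```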

Graph Neural Network

abbr. GNN

An architecture that allows nodes to propagate information to their neighbors in a non-linear manner. We have an attributed graph $G=(V,E,\mathbf X)$ with a matrix of node features $\mathbf X\in\mathbb R^{m\times n}$ (such a matrix encodes node attributes such as labels, profiles, or other characteristics).

The main ingredient of a GNN is the aggregation function, which takes as input the features of the neighbors of a node $i$ and produces an embedding $\mathbf h_i$.

Neighborhood Aggregation

Main idea: Generate node embeddings based on local neighborhoods of each node, using neural networks to represent the complex relationships of a graph.

Intuitively, think of each node's neighborhood as the computation graph of a neural network.

The embeddings at layer $l$ are an aggregation of the embeddings at layer $l-1$.

How to aggregate information across the layers? A simple approach: average neighbor information and apply a neural network.

The embeddings at the initial layer $0$ are the node attributes, $\mathbf z_i^0=\mathbf x_i$. The $l$-th layer embedding of node $i$ is

$$\mathbf z_i^l=\sigma\left(\mathbf W_l\sum_{j\in N(i)}\frac{\mathbf z_j^{l-1}}{|N(i)|}+\mathbf B_l\mathbf z_i^{l-1}\right),\quad\forall l>0$$

$\sigma$ is a non-linear activation function. The sum computes the average of the neighbors' embeddings from layer $l-1$.
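
A minimal per-node sketch of this aggregation rule (toy path graph, ReLU as $\sigma$, identity weights; all names are illustrative):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def layer_embedding(i, Z_prev, neighbors, W, B):
    """One neighborhood-aggregation step for node i:
    z_i^l = sigma(W_l * mean_{j in N(i)} z_j^{l-1} + B_l * z_i^{l-1})."""
    avg = np.mean([Z_prev[j] for j in neighbors[i]], axis=0)   # average neighbor embeddings
    return relu(W @ avg + B @ Z_prev[i])

# Toy usage: 3 nodes on a path 0-1-2, 2-dimensional attributes
neighbors = {0: [1], 1: [0, 2], 2: [1]}
Z0 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # layer-0 embeddings = node attributes
W1, B1 = np.eye(2), np.eye(2)                          # toy weight matrices
print(layer_embedding(1, Z0, neighbors, W1, B1))       # [1.  1.5]
```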

Loss function. $\mathbf W,\mathbf B$ are the trainable weight matrices. After $k$ layers of neighborhood aggregation, we obtain an output embedding for each node. The loss function needs to be non-linear and differentiable.

We optimize the parameters of the model through stochastic gradient descent (SGD).

Rewriting the model in matrix notation:

$$\mathbf Z^{l+1}=\sigma\left(\mathbf Z^l\mathbf W^l+\tilde{\mathbf A}\mathbf Z^l\mathbf W_1^l\right)$$

where $\tilde{\mathbf A}=\mathbf\Delta^{-1/2}\mathbf A\mathbf\Delta^{-1/2}$. To optimize the parameters, a cross-entropy loss function works in a supervised setting w.r.t. the node labels.
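
A matrix-form sketch of this layer together with a supervised cross-entropy loss (toy graph and random weights; ReLU stands in for $\sigma$, and the SGD update itself is omitted):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def normalize(A):
    """A_tilde = Delta^{-1/2} A Delta^{-1/2} (symmetric normalization, no isolated nodes)."""
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gnn_layer(Z, A_norm, W_self, W_neigh):
    """Matrix form of one layer: Z^{l+1} = sigma(Z W_self + A_norm Z W_neigh)."""
    return relu(Z @ W_self + A_norm @ Z @ W_neigh)

def cross_entropy(logits, labels):
    """Mean negative log-likelihood of the true node labels after a row-wise softmax."""
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

# Toy usage: 4-cycle graph, 2 input features, 2 classes
A = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], dtype=float)
Z0 = np.random.randn(4, 2)
W_self, W_neigh = np.random.randn(2, 2), np.random.randn(2, 2)
logits = gnn_layer(Z0, normalize(A), W_self, W_neigh)
print(cross_entropy(logits, np.array([0, 1, 0, 1])))
```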

An important property of these models is the ability to share the parameters $\mathbf W,\mathbf B$ across nodes, which allows generalization to unseen nodes, edges, or graphs.

Depending on the loss function and the embedding matrix, the GNN can learn node embeddings, graph embeddings, or clusters.

Graph Convolutional Networks

abbr. GCN

A different neighborhood aggregation:

$$\mathbf z_i^l=\sigma\left(\mathbf W_l\sum_{j\in N(i)\cup\{i\}}\frac{\mathbf z_j^{l-1}}{\sqrt{|N(i)|\,|N(j)|}}\right).$$

The main idea is that a node with a large number of connections should be less important. There is no need for the bias term $\mathbf B$, nor for the per-neighbor normalization $1/|N(i)|$, which is replaced by the symmetric normalization.

This normalization empirically attains better results and better parameter sharing. We can efficiently compute the embeddings using sparse batch operations:

$$\mathbf Z^{l+1}=\sigma\left(\mathbf\Delta^{-1/2}\tilde{\mathbf A}\mathbf\Delta^{-1/2}\mathbf Z^l\mathbf W_l\right)$$

where $\tilde{\mathbf A}=\mathbf A+\mathbf I$ and $\mathbf\Delta$ is the diagonal degree matrix. The time complexity of propagating the information is $\mathcal O(|E|)$.
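
A dense-matrix sketch of this propagation rule (for exposition only; a real implementation would use sparse operations, and ReLU stands in for $\sigma$):

```python
import numpy as np

def gcn_layer(Z, A, W):
    """GCN propagation: sigma(Delta^{-1/2} (A + I) Delta^{-1/2} Z W), self-loops included."""
    A_hat = A + np.eye(A.shape[0])                   # A_tilde = A + I
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))    # diagonal of Delta^{-1/2}
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ Z @ W, 0.0)           # ReLU as the non-linearity

# Toy usage: 4-cycle graph, 2-dimensional features
A = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], dtype=float)
Z1 = gcn_layer(np.random.randn(4, 2), A, np.random.randn(2, 2))
print(Z1.shape)   # (4, 2)
```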

GraphSAGE

Instead of a fixed aggregation, GraphSAGE uses any differentiable function that maps a set of vectors to a single vector. With a generalized aggregation we get

$$\mathbf z_i^l=\sigma\left(\left[\mathbf W_l\cdot \operatorname{AGG}\left(\{\mathbf z_j^{l-1},\forall j\in N(i)\}\right),\ \mathbf B_l\mathbf z_i^{l-1}\right]\right)$$

which concatenates the node's own embedding with the aggregated neighbor embedding, and uses a generalized aggregation $\operatorname{AGG}$.

Some variants (sketched in code after the list):

  • Mean. Take a weighted average of the neighbors:

    $$\operatorname{AGG}=\sum_{j\in N(i)}\frac{\mathbf z_j^{l-1}}{|N(i)|}$$

  • Pool. Transform the neighbor vectors and apply a symmetric vector function:

    $$\operatorname{AGG}=\gamma\left(\{\mathbf Q\mathbf z_j^{l-1},\forall j\in N(i)\}\right)$$

    where $\gamma$ is an element-wise mean/max.

  • LSTM. Apply an LSTM to the reshuffled neighbors $\pi(N(i))$:

    $$\operatorname{AGG}=\operatorname{LSTM}\left([\mathbf z_j^{l-1},\forall j\in\pi(N(i))]\right)$$
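
A minimal sketch of the GraphSAGE update with the mean and pool aggregators (toy path graph, ReLU as $\sigma$, identity weights; the LSTM variant is omitted; all names are illustrative):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def agg_mean(neigh_embs):
    """Mean aggregator: average of the neighbors' embeddings."""
    return np.mean(neigh_embs, axis=0)

def agg_pool(neigh_embs, Q):
    """Pool aggregator: transform each neighbor with Q, then take an element-wise max."""
    return np.max(neigh_embs @ Q.T, axis=0)

def graphsage_layer(i, Z_prev, neighbors, W, B, agg):
    """z_i^l = sigma([ W * AGG({z_j^{l-1}}), B * z_i^{l-1} ]) with concatenation."""
    neigh_embs = np.stack([Z_prev[j] for j in neighbors[i]])
    return relu(np.concatenate([W @ agg(neigh_embs), B @ Z_prev[i]]))

# Toy usage: path graph 0-1-2 with 2-dimensional embeddings
neighbors = {0: [1], 1: [0, 2], 2: [1]}
Z0 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
W, B, Q = np.eye(2), np.eye(2), np.eye(2)
print(graphsage_layer(1, Z0, neighbors, W, B, agg_mean))                   # mean variant
print(graphsage_layer(1, Z0, neighbors, W, B, lambda e: agg_pool(e, Q)))   # pool variant
```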