This is one of the two projects I undertake in 24 Fall semester, which is related to data mining.

Supervisor: Assoc. Prof. Davide Mottin, PhD Student Maximilian Egger.

Generally speaking, they have already developed a new method, ReliK, for measuring knowledge graph embeddings, but they want to speed it up with GPUs.

So my project is focusing on this. I wish to speed them up with some tolerance on small accuracy loss.

I need to first get familiar with the paper, ReliK: A Reliability Measure for Knowledge Graph Embedding. Here is some digest and notes from it.

Abstract

Object of ReliK: To assess a priori how well a knowledge graph embedding (KGE) will perform on a downstream task and in a specific part of the knowledge graph.

ReliK aka Reliability measure of KGEs.

  • It relies only on KGE embedding scores.
  • Task independent
  • KGE independent

Correlates well with some downstream tasks:

  • tail/relation prediction, triple classification
  • rule mining, question answering.

Introduction

KGE: generate a vector representation for entities and relationships of a KG.

Motivation: KEG method depends on which downstream task. We want to unified, global measure to quantify the reliability of KGE method.

Contributions: ReliK, principled measure that quantifies the reliability of how a KEG will perform will on a certain downstream task in a specific part of the KG without executing that task or training KG!!!

ReliK is based on the relative ranking of existing KG triples w.r.t. non-existing one.

Preliminaries

Knowledge Graph

KG K:E,R,F\mathcal K:\langle\mathcal E,\mathcal R,\mathcal F\rangle is a triple consisting of a set E\mathcal E of nn entities, a set R\mathcal R of relationships, and a set FE×R×E\mathcal F\subset\mathcal E\times\mathcal R\times\mathcal E of mm facts.

A fact is a triple xhrt=(h,r,t)x_{hrt}=(h,r,t). head, relation, tail.

We say xhrtx_{hrt} is positive if it actually exists in the KG xhrtFx_{hrt}\in\mathcal F, negative otherwise xhrt∉Fx_{hrt}\not\in\mathcal F.

Knowledge Graph Embedding

It is a representation of entities and relationships in a dd-dimensional space, dEd\ll\vert\mathcal E\vert. The Rd\mathbb R^d or Cd\mathbb C^d space.

The embedding score: function s:E×R×ERs:\mathcal E\times\mathcal R\times\mathcal E\to\mathbb R, which quantifies how like a triple xhrtx_{hrt} exists in K\mathcal K based on the embedding of its head, relationship, and tail.

  • Intuitively, the higher s(xhrt)s(x_{hrt}), the more likely the existence of xhrtx_{hrt} in K\mathcal K.

Key Reliability

Desideratum

  1. Embedding-agnostic. It is independent of the specific KGE method.
  2. Learning-free. No further KGE training.
  3. Task-agnostic. It is independent of downstream task.
  4. Locality. A good predictor of KGE performance locally to a given triple.

ReliK Measure

To fulfill 1. 2.: Treat embedding score as a black-box function that provides only an indication of the actual existence of a triple.

Use position of a triple via a ranking based on embedding score: Assume the relative ranking of a positive triple to be higher than that of negative.

For 4. locality, we assume that a local approach that consider triples relative to a neighborhood is more appropriate.

Definition

For a triple xhrt=(h,r,t)x_{hrt}=(h,r,t) we compute the neighbor set N(h)\mathcal N^-(h) of all possible negative triples, that is triples with head hh that do not exist in K\mathcal K.

Similarly, we define N(t)\mathcal N^-(t) for tail.

Head rank: rankH(xhrt)={xN(h):s(x)>s(xhrt)}+1rank_H(x_{hrt})=|\{x\in\mathcal N^-(h):s(x)\gt s(x_{hrt})\}|+1

Similarly, we define rankT(xhrt)rank_T(x_{hrt}) for tail.

ReliK(xhrt)=12(1rankH(xhrt)+1rankT(xhrt)).ReliK(x_{hrt})=\frac{1}{2}\left(\frac{1}{rank_H(x_{hrt})}+\frac{1}{rank_T(x_{hrt})}\right).

ReliK can easily be extended from single triples to sub-graphs by computing the average reliability among the triples in the subgraph.

We define the ReliK score of a set SFS\subseteq\mathcal F of triples as

ReliK(S)=1SxhrtSReliK(xhrt).ReliK(S)=\frac{1}{S}\sum_{x_{hrt}\in S}ReliK(x_{hrt}).

ReliK(0,1]ReliK\in(0,1]. The higher the ReliKReliK, the better reliability.

The lower the head-rank and tail-rank, the better the ranking of xhrtx_{hrt} induced by the underlying embedding scores.

Efficiently Computing ReliK

Computing original ReliK requires Ω(CR)\Omega(|\mathcal C|\cdot|\mathcal R|), as it needs to scan the entire negative neighborhood of the target triple.

Computationally TOO HEAVY!!!

We thus propose approximate version of ReliK with efficiency.

Main intuition of approximation is that the precise ranking of many potential triples is not needed.

What it matters is just the number of those triples that exhibit a higher embedding score than the target triple.

Lower bound:

  • ReliKLBReliK_{LB}, a lower bound to ReliKReliK by counting the negative triples in the sample that have a lower embedding score.

Approximate

  • ReliKApxReliK_{Apx}, estimation of ReliKReliK by evaluating the fraction of triples in the sample that have a higher score than the triple under consideration and scaling this fraction to the total number of negative triples.

Let SHS_H be a random subset of kk elements selected without replacement independently and uniformly at random from negative neighborhood N(h)\mathcal N^-(h) of the head hh of a triple xhrtx_{hrt}.

rankHS(xhrt)={xSH:s(x)>s(xhrt)}+1,rank_H^S(x_{hrt})=\vert\{x\in S_H:s(x)\gt s(x_{hrt})\}\vert+1,

to be the rank of the score s(xhrt)s(x_{hrt}) the KGE assigns to xhrtx_{hrt}, among all the triples in the sample.

Similarly, we compute rankTSrank_T^S for tail’s neighborhood N(t)\mathcal N^-(t).

Lower Bound Estimator

rankH(xhrt)rankHS(xhrt)+N(h)SHrankT(xhrt)rankTS(xhrt)+N(h)STReliKLB(xhrt)=12(1rankHS(xhrt)+N(h)SH+1rankTS(xhrt)+N(t)ST),rank_H(x_{hrt})\le rank_H^S(x_{hrt})+|\mathcal N^-(h)|-|S_H|\\ rank_T(x_{hrt})\le rank_T^S(x_{hrt})+|\mathcal N^-(h)|-|S_T|\\ \Rarr ReliK_{LB}(x_{hrt})=\frac{1}{2}\left(\frac{1}{rank_H^S(x_{hrt})+|\mathcal N^-(h)|-|S_H|}+\frac{1}{rank_T^S(x_{hrt})+|\mathcal N^-(t)|-|S_T|}\right),

It holds that ReliKLB(xhrt)ReliK(xhrt)ReliK_{LB}(x_{hrt})\le ReliK(x_{hrt}).

Upper Bound Estimator

Theoretical analysis is omitted here.

ReliKApx(xhrt)=12(1rankHS(xhrt)N(h)SH+1rankTS(xhrt)N(t)ST).ReliK_{Apx}(x_{hrt})=\frac{1}{2}\left(\frac{1}{rank_H^S(x_{hrt})\frac{\mathcal N^-(h)}{|S_H|}}+\frac{1}{rank_T^S(x_{hrt})\frac{\mathcal N^-(t)}{|S_T|}}\right).

Pseudo Code

Input: KG K:E,R,F\mathcal K:\langle\mathcal E,\mathcal R,\mathcal F\rangle, triple xhrt=(h,r,t)Fx_{hrt}=(h,r,t)\in\mathcal F, embedding score function s:E×R×ERs:\mathcal E\times\mathcal R\times\mathcal E\to\mathbb R, sample size kNk\in\mathbb N

Output: ReliKLB(xhrt)ReliK_{LB}(x_{hrt}) or ReliKApx(xhrt)ReliK_{Apx}(x_{hrt})

  1. SHS_H\gets sample kk triples from N(h)\mathcal N^-(h); STS_T\gets sample kk triples from N(t)N^-(t).

  2. rankH1rank_H\gets1; rankT1rank_T\gets1

  3. for xhrtSHSTx_{h'r't'}\in S_H\cup S_T do

    • If s(xhrt)s(xhrt)s(x_{hrt})\le s(x_{h'r't'}) then

      • if h=hh'=h then

        rankHrankH+1rank_H\gets rank_H+1

      • if t=tt'=t then

        rankTrankT+1rank_T\gets rank_T+1

  4. return 12(1rankH+N(h)SH+1rankT+N(t)ST)\frac{1}{2}\left(\frac{1}{rank_H+|\mathcal N^-(h)|-|S_H|}+\frac{1}{rank_T+|\mathcal N^-(t)|-|S_T|}\right) for ReliKLBReliK_{LB}

    or 12(1rankHN(h)SH+1rankTN(t)ST)\frac{1}{2}\left(\frac{1}{rank_H\frac{\mathcal N^-(h)}{|S_H|}}+\frac{1}{rank_T\frac{\mathcal N^-(t)}{|S_T|}}\right) for ReliKApxReliK_{Apx}

Time Complexity

O(k)\mathcal O(k).

When k20%nk\approx 20\%\cdot n, 2.5X faster, mean square error = 0.002.

Experimental Evaluation

omitted