Efficient ReliK 1: Paper Reading
This is one of the two projects I am undertaking in the Fall 2024 semester; it is related to data mining.
Supervisors: Assoc. Prof. Davide Mottin and PhD student Maximilian Egger.
Generally speaking, they have already developed a new method, ReliK, for assessing knowledge graph embeddings, but they want to speed it up with GPUs.
My project focuses on exactly this: I aim to speed ReliK up, with some tolerance for a small loss of accuracy.
I first need to get familiar with the paper, ReliK: A Reliability Measure for Knowledge Graph Embeddings. Below is a digest with my notes.
Abstract
Objective of ReliK: to assess a priori how well a knowledge graph embedding (KGE) will perform on a downstream task and in a specific part of the knowledge graph.
ReliK is short for Reliability measure for KGEs.
- It relies only on KGE embedding scores.
- Task independent
- KGE independent
Correlates well with some downstream tasks:
- tail/relation prediction, triple classification
- rule mining, question answering.
Introduction
KGE: generates a vector representation for the entities and relationships of a KG.
Motivation: how well a KGE method works depends on the downstream task. We want a unified, global measure to quantify the reliability of a KGE method.
Contributions: ReliK, a principled measure that quantifies how reliably a KGE will perform on a certain downstream task in a specific part of the KG, without executing that task or retraining the KGE!!!
ReliK is based on the relative ranking of existing KG triples w.r.t. non-existing ones.
Preliminaries
Knowledge Graph
A KG $G = (E, R, F)$ is a triple consisting of a set of entities $E$, a set of relationships $R$, and a set of facts $F \subseteq E \times R \times E$.
A fact is a triple $(h, r, t)$: head $h \in E$, relation $r \in R$, tail $t \in E$.
We say $(h, r, t)$ is positive if it actually exists in the KG, i.e., $(h, r, t) \in F$, and negative otherwise, i.e., $(h, r, t) \notin F$.
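To make the notation concrete, here is a tiny toy KG in Python (my own illustrative example, not from the paper), with the sets $E$, $R$, $F$ represented literally:

```python
# Toy knowledge graph, purely illustrative (entities/relations are made up).
E = {"Alice", "Bob", "Copenhagen"}                # entities
R = {"knows", "lives_in"}                         # relationships
F = {("Alice", "knows", "Bob"),                   # facts: (head, relation, tail)
     ("Alice", "lives_in", "Copenhagen")}

triple = ("Bob", "lives_in", "Copenhagen")
print("positive" if triple in F else "negative")  # -> negative
```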
Knowledge Graph Embedding
It is a representation of entities and relationships in a $d$-dimensional space, typically $\mathbb{R}^d$ or $\mathbb{C}^d$.
The embedding score is a function $s : E \times R \times E \to \mathbb{R}$ that quantifies how likely a triple $(h, r, t)$ is to exist in $G$, based on the embeddings of its head, relationship, and tail.
- Intuitively, the higher $s(h, r, t)$, the more likely the existence of $(h, r, t)$ in $G$.
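ReliK only needs such a score function as a black box. As a hedged illustration, here is a TransE-style score, $s(h, r, t) = -\lVert \mathbf{h} + \mathbf{r} - \mathbf{t} \rVert$, over random untrained embeddings; any real KGE scorer could be swapped in (the entity and relation names reuse the toy KG above):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50  # embedding dimension (illustrative choice)

# Untrained toy embeddings; a real KGE would learn these vectors.
entity_emb = {e: rng.normal(size=d) for e in ["Alice", "Bob", "Copenhagen"]}
relation_emb = {r: rng.normal(size=d) for r in ["knows", "lives_in"]}

def score(h, r, t):
    """TransE-style embedding score: higher means the triple looks more plausible."""
    return -np.linalg.norm(entity_emb[h] + relation_emb[r] - entity_emb[t])

print(score("Alice", "knows", "Bob"))
```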
Key Reliability
Desiderata
1. Embedding-agnostic. It is independent of the specific KGE method.
2. Learning-free. No further KGE training is required.
3. Task-agnostic. It is independent of the downstream task.
4. Locality. It should be a good predictor of KGE performance locally around a given triple.
ReliK Measure
To fulfill desiderata 1 and 2: treat the embedding score as a black-box function that provides only an indication of the actual existence of a triple.
Use the position of a triple in a ranking based on embedding scores: we expect the relative ranking of a positive triple to be higher than that of a negative one.
For desideratum 4 (locality), we assume that a local approach that considers triples relative to a neighborhood is more appropriate.
Definition
For a triple $(h, r, t)$ we compute the head negative neighborhood $\mathcal{N}_h(h, r, t) = \{(h, r', t') \mid r' \in R,\, t' \in E,\, (h, r', t') \notin F\}$, the set of all possible negative triples with head $h$ that do not exist in $G$.
Similarly, we define $\mathcal{N}_t(h, r, t)$ for the tail.
Head rank: $\mathrm{rank}_h(h, r, t) = 1 + |\{x \in \mathcal{N}_h(h, r, t) : s(x) > s(h, r, t)\}|$.
Similarly, we define the tail rank $\mathrm{rank}_t(h, r, t)$ over $\mathcal{N}_t(h, r, t)$.
For a single triple, $\mathrm{ReliK}(h, r, t) = \frac{1}{2}\left(\frac{1}{\mathrm{rank}_h(h, r, t)} + \frac{1}{\mathrm{rank}_t(h, r, t)}\right)$.
ReliK can easily be extended from single triples to sub-graphs by computing the average reliability among the triples in the subgraph.
We define the ReliK score of a set of triples $S$ as $\mathrm{ReliK}(S) = \frac{1}{|S|} \sum_{(h, r, t) \in S} \mathrm{ReliK}(h, r, t)$. The higher the ReliK score, the better the reliability.
The lower the head rank and tail rank, the better the ranking of $(h, r, t)$ induced by the underlying embedding scores.
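A brute-force sketch of single-triple ReliK as reconstructed above (reusing the toy `E`, `R`, `F` and `score` from the earlier snippets; the helper names are mine, not the paper's):

```python
def negative_neighborhoods(E, R, F, h, r, t):
    """All negative triples sharing the head, and all sharing the tail."""
    N_h = [(h, r2, t2) for r2 in R for t2 in E if (h, r2, t2) not in F]
    N_t = [(h2, r2, t) for h2 in E for r2 in R if (h2, r2, t) not in F]
    return N_h, N_t

def relik_exact(E, R, F, score, h, r, t):
    """ReliK(h,r,t) = 0.5 * (1/head_rank + 1/tail_rank), scanning every negative."""
    N_h, N_t = negative_neighborhoods(E, R, F, h, r, t)
    s_target = score(h, r, t)
    head_rank = 1 + sum(score(*x) > s_target for x in N_h)
    tail_rank = 1 + sum(score(*x) > s_target for x in N_t)
    return 0.5 * (1 / head_rank + 1 / tail_rank)

print(relik_exact(E, R, F, score, "Alice", "knows", "Bob"))
```

Even on this toy graph the nested loops hint at the cost: the neighborhoods grow with $|E| \cdot |R|$, which is exactly what the next section addresses.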
Efficiently Computing ReliK
Computing the original ReliK requires $O(|E| \cdot |R|)$ embedding-score evaluations per triple, as it needs to scan the entire negative neighborhood of the target triple.
Computationally TOO HEAVY!!!
The paper thus proposes approximate versions of ReliK that can be computed efficiently.
The main intuition behind the approximation is that the precise ranking of many potential triples is not needed.
What matters is just the number of those triples that exhibit a higher embedding score than the target triple.
Lower bound:
- $\mathrm{ReliK}^{lb}$, a lower bound on $\mathrm{ReliK}$, obtained by counting the negative triples in the sample that have a lower embedding score.
Approximate:
- $\mathrm{ReliK}^{apx}$, an estimate of $\mathrm{ReliK}$, obtained by evaluating the fraction of triples in the sample that have a higher score than the triple under consideration and scaling this fraction to the total number of negative triples.
Let $S_h$ be a random subset of $k$ elements selected without replacement, independently and uniformly at random, from the negative neighborhood $\mathcal{N}_h(h, r, t)$ of the head of a triple $(h, r, t)$.
Define $\mathrm{rank}^{S}_h(h, r, t)$ to be the rank of the score the KGE assigns to $(h, r, t)$ among all the triples in the sample $S_h$.
Similarly, we compute $\mathrm{rank}^{S}_t(h, r, t)$ for a sample $S_t$ from the tail's neighborhood $\mathcal{N}_t(h, r, t)$.
Lower Bound Estimator
It holds that $\mathrm{ReliK}^{lb}(h, r, t) \le \mathrm{ReliK}(h, r, t)$.
Upper Bound Estimator
Theoretical analysis is omitted here.
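Written out explicitly, my reading of the two estimators from the descriptions above is the following (this is my reconstruction from the prose; the paper's exact formulas and tie-handling may differ). For the head side, with a sample $S_h$ of size $k$:

$$\widehat{\mathrm{rank}}^{apx}_h(h, r, t) = 1 + \frac{|\mathcal{N}_h|}{k}\,\bigl|\{x \in S_h : s(x) > s(h, r, t)\}\bigr|, \qquad \widehat{\mathrm{rank}}^{ub}_h(h, r, t) = |\mathcal{N}_h| + 1 - \bigl|\{x \in S_h : s(x) < s(h, r, t)\}\bigr|,$$

and analogously for the tail. Then

$$\mathrm{ReliK}^{apx}(h, r, t) = \frac{1}{2}\left(\frac{1}{\widehat{\mathrm{rank}}^{apx}_h} + \frac{1}{\widehat{\mathrm{rank}}^{apx}_t}\right), \qquad \mathrm{ReliK}^{lb}(h, r, t) = \frac{1}{2}\left(\frac{1}{\widehat{\mathrm{rank}}^{ub}_h} + \frac{1}{\widehat{\mathrm{rank}}^{ub}_t}\right).$$

The pessimistic rank $\widehat{\mathrm{rank}}^{ub}$ assumes every unsampled negative outranks the target, which is what makes $\mathrm{ReliK}^{lb}$ a guaranteed lower bound.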
Pseudo Code
Input: KG $G$, triple $(h, r, t)$, embedding score function $s$, sample size $k$
Output: $\mathrm{ReliK}^{apx}(h, r, t)$ or $\mathrm{ReliK}^{lb}(h, r, t)$
Sample $k$ triples $S_h$ from $\mathcal{N}_h(h, r, t)$; sample $k$ triples $S_t$ from $\mathcal{N}_t(h, r, t)$.
$c_h \leftarrow 0$; $c_t \leftarrow 0$
for $x \in S_h \cup S_t$ do
  if $s(x) > s(h, r, t)$ then
    if $x \in S_h$ then $c_h \leftarrow c_h + 1$
    if $x \in S_t$ then $c_t \leftarrow c_t + 1$
return $\mathrm{ReliK}^{apx}(h, r, t)$, computed from $c_h$, $c_t$, $|\mathcal{N}_h|$, $|\mathcal{N}_t|$, and $k$, for the approximation,
or $\mathrm{ReliK}^{lb}(h, r, t)$ for the lower bound.
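Below is a runnable Python sketch of this sampling scheme (my own implementation of the idea, reusing the toy `E`, `R`, `F` and `score` from above and the hedged estimator formulas from the previous subsection, so it is not necessarily the paper's exact algorithm). For clarity it still materializes the negative neighborhoods; only the score evaluations are sampled, which is where the savings come from when scoring is the expensive step.

```python
import random

def relik_sampled(E, R, F, score, h, r, t, k, rng=random):
    """Estimate ReliK(h,r,t) from k sampled head-negatives and k tail-negatives.

    Returns (approx, lower_bound): the scaled-rank approximation and a
    pessimistic lower bound assuming all unsampled negatives outrank the triple.
    """
    N_h = [(h, r2, t2) for r2 in R for t2 in E if (h, r2, t2) not in F]
    N_t = [(h2, r2, t) for h2 in E for r2 in R if (h2, r2, t) not in F]
    S_h = rng.sample(N_h, min(k, len(N_h)))
    S_t = rng.sample(N_t, min(k, len(N_t)))

    s_target = score(h, r, t)
    higher_h = sum(score(*x) > s_target for x in S_h)
    higher_t = sum(score(*x) > s_target for x in S_t)

    # Approximation: scale the sampled counts up to the full neighborhood size.
    rank_h_apx = 1 + len(N_h) / len(S_h) * higher_h
    rank_t_apx = 1 + len(N_t) / len(S_t) * higher_t
    approx = 0.5 * (1 / rank_h_apx + 1 / rank_t_apx)

    # Lower bound: only sampled lower-scoring negatives are assumed to rank below.
    rank_h_ub = len(N_h) + 1 - sum(score(*x) < s_target for x in S_h)
    rank_t_ub = len(N_t) + 1 - sum(score(*x) < s_target for x in S_t)
    lower = 0.5 * (1 / rank_h_ub + 1 / rank_t_ub)
    return approx, lower

print(relik_sampled(E, R, F, score, "Alice", "knows", "Bob", k=3))
```

On the toy graph this buys nothing, but on a real KG only roughly $2k$ score calls are made per triple instead of scanning the full neighborhoods.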
Time Complexity
$O(k)$ embedding-score evaluations per triple, instead of the $O(|E| \cdot |R|)$ required by the exact computation.
With a suitable sample size $k$, the approximation is about 2.5x faster, with a mean squared error of 0.002.
Experimental Evaluation
omitted