Data Mining 8: Link Analysis
PageRank is a fundamental tool in graph mining, with connections to Markov processes and linear algebra.
Random Walk
PageRank
An algorithm that assigns an importance score to each web page.
It follows these principles:
- Each link's vote is proportional to the importance of its source page.
- If node $i$ with importance $r_i$ has $d_i$ out-links, each of $i$'s neighbors gets $r_i / d_i$ votes.
- Page $j$'s own importance $r_j$ is the sum of the votes on its in-links.
PageRank can be simply expressed as the sum of the scores of the neighbors:
$$r_j = \sum_{i \to j} \frac{r_i}{d_i}$$
Let $M_{ji}$ be the probability that from $i$ we end up at $i$'s neighbor $j$, and let $d_i$ be the out-degree of node $i$.
The PageRank of a page is the weighted sum of the PageRank of the pages that link to it:
$$r = M r$$
where $M$ is the matrix of the probabilities of landing on page $j$ from $i$. $M_{ji}$ can be computed as $M_{ji} = 1/d_i$ if $i \to j$, and $M_{ji} = 0$ otherwise.
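As a concrete sketch, $M$ can be built from an adjacency list; the 3-node graph below is an illustrative assumption, not one from the lecture.

```python
# Hedged sketch: building the column-stochastic matrix M from an
# adjacency list, with M[j][i] = 1/d_i if i -> j and 0 otherwise.
# The 3-node graph is an illustrative assumption.
out_links = {0: [1, 2], 1: [2], 2: [0]}   # page i -> pages it links to
n = 3
M = [[0.0] * n for _ in range(n)]
for i, targets in out_links.items():
    d_i = len(targets)                    # out-degree of page i
    for j in targets:
        M[j][i] = 1.0 / d_i               # probability of landing on j from i
```

Each column of $M$ sums to $1$, i.e. $M$ is column-stochastic.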
The Power Iteration Method
Think of $r = Mr$ as an iterative process that updates the previous value $r^{(t)}$ at iteration $t$ with the new $r^{(t+1)} = M r^{(t)}$. This is the idea of the Power Iteration method.
The PageRank vector is initially uniform, with $r_i^{(0)} = 1/n$ for all nodes. The iteration continues until convergence, $\|r^{(t+1)} - r^{(t)}\| < \varepsilon$. Note that the choice of stopping criterion is arbitrary: the norm can easily be substituted with a simple sum of absolute differences (the $L_1$ norm).
Pseudo Code of Power Iteration
Input: Transition matrix $M$, number of nodes $n$
Output: PageRank $r$
$r^{(0)} \leftarrow [1/n, \dots, 1/n]^T$; $t \leftarrow 0$
repeat
- $r^{(t+1)} \leftarrow M r^{(t)}$
- $t \leftarrow t + 1$
until $\|r^{(t)} - r^{(t-1)}\| < \varepsilon$
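The pseudocode above can be sketched in plain Python; the 3-node graph and the tolerance are illustrative assumptions.

```python
# A minimal Power Iteration sketch on a hypothetical 3-node graph.
n = 3
# Column-stochastic M: M[j][i] = 1/d_i if i -> j, else 0.
# Graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
M = [
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
]
r = [1.0 / n] * n                # uniform initial vector r^(0)
eps = 1e-10
while True:
    # r^(t+1) = M r^(t)
    r_new = [sum(M[j][i] * r[i] for i in range(n)) for j in range(n)]
    done = sum(abs(a - b) for a, b in zip(r_new, r)) < eps
    r = r_new
    if done:
        break
# r now approximates the dominant eigenvector of M (here [0.4, 0.2, 0.4]).
```

The stopping test here is the sum of absolute differences, i.e. the $L_1$ norm mentioned above.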
Dominant eigenvector. PageRank $r$ is an eigenvector of $M$. In fact, $r$ corresponds to the eigenvalue $\lambda = 1$: $Mr = r = 1 \cdot r$. Furthermore, the PageRank is the dominant (principal) eigenvector of the matrix $M$.
The Power Iteration method converges to the dominant eigenvector by the following theorem:
Theorem. The sequence $M r^{(0)}, M^2 r^{(0)}, \dots, M^k r^{(0)}, \dots$ approaches the dominant eigenvector of $M$.
Random Walk Interpretation. A characterization of PageRank. This assumes that at any point in time $t$ an internet surfer is on page $i$, and at time $t+1$ the surfer follows an out-link from $i$ uniformly at random. Once the surfer lands on a different page, they repeat the process of selecting a linked page uniformly at random.
Let $p(t)$ be a vector whose $i$-th coordinate is the probability that the surfer is at page $i$ at time $t$. Then $p(t)$ is a probability distribution over the pages. At time $t+1$ the surfer will reach any of the linked pages with probability $p(t+1) = M p(t)$. At convergence, $p(t+1) = M p(t) = p(t)$; then $p(t)$ is a stationary distribution of the random walk.
The PageRank $r$ satisfies $r = Mr$, so $r$ is a stationary distribution for the random walk. A central result from the theory of random walks says that graphs that satisfy certain conditions (strong connectivity and aperiodicity) have a unique stationary distribution, and the walk eventually converges to it regardless of the initial distribution $p(0)$.
Convergence of PageRank. Does the iteration always converge? The answer is no: it is possible to design a counterexample, e.g. two nodes that link only to each other, where the importance keeps oscillating between them and never settles.
Other problems with PageRank.
- Sinks (dead ends): nodes without outgoing edges; importance leaks out of the graph through them.
- Spider trap: a group of nodes with no outgoing edges toward the rest of the graph (all their out-links stay within the group); the walk eventually gets stuck and the group absorbs all the importance.
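The sink problem can be seen on a tiny hypothetical 2-node graph: the sink's column in $M$ sums to $0$, so the total importance shrinks at every step.

```python
# Illustrative sketch of a sink: node 1 has no out-links, so its column
# in M is all zeros and PageRank mass leaks out at every iteration.
n = 2
M = [
    [0.0, 0.0],
    [1.0, 0.0],   # 0 -> 1; column 1 (the sink) sums to 0, not 1
]
r = [0.5, 0.5]
for _ in range(3):
    r = [sum(M[j][i] * r[i] for i in range(n)) for j in range(n)]
# All the importance has leaked away: r is now [0.0, 0.0].
```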
Pros and Cons
👍
- Fast to compute with Power Iteration: $O(k|E|)$, where $k$ is the number of iterations and $|E|$ the number of edges.
- It provides a measure of relevance that does not depend on the content of a page.
👎
- It measures generic popularity of a page, but is biased against topic-specific authorities.
- Susceptible to link spam.
- Uses a single measure of importance.
Solving PageRank Issues
Using teleportation.
Teleportation is the process under which, with some probability, the surfer jumps to a random node instead of following a link.
Process:
At each iteration we toss a (biased) coin.
With probability $\beta$ we continue surfing by choosing uniformly at random among the current node's out-neighbors; otherwise, with probability $1-\beta$, we select a random node to teleport to.
Equivalently, written in matrix notation:
$$A = \beta M + (1-\beta)\frac{1}{n}\mathbf{1}\mathbf{1}^T, \qquad r = A r$$
Observation: $\beta$ controls a trade-off. If it is close to $0$, the PageRank converges to the uniform distribution $1/n$; if it is close to $1$, PageRank might converge slowly, like the original version. A typical choice is $\beta \in [0.8, 0.9]$ (often $0.85$).
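A sketch of teleportation fixing a spider trap, on a hypothetical graph where node 2 links only to itself (graph and iteration count are assumptions):

```python
# Hedged sketch of PageRank with teleportation; node 2 is a spider trap.
n = 3
beta = 0.85                       # typical damping choice
# Graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 2.
M = [
    [0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 1.0],
]
r = [1.0 / n] * n
for _ in range(100):
    # r = beta * M r + (1 - beta) * (1/n) * 1
    r = [beta * sum(M[j][i] * r[i] for i in range(n)) + (1 - beta) / n
         for j in range(n)]
# The trap still ranks highest, but teleportation keeps every node's
# score strictly positive instead of letting the trap absorb everything.
```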
Personalized PageRank
The teleports above land equally on all nodes. However, in some applications we might want to teleport only to a small number of nodes; we do so by substituting the uniform vector $\frac{1}{n}\mathbf{1}$ with a generic probability vector $p$, obtaining
$$r = \beta M r + (1-\beta) p$$
This equation is called Personalized PageRank. This kind of PageRank corresponds to Random Walks with Restart,
where $M$ is a stochastic matrix, as the sum of each column is $1$, and $p$ is the restart vector, which is also stochastic (its entries sum to $1$).
Topic-specific PageRank
Topic-specific PageRank computes the Personalized PageRank for a group of nodes $S$. These nodes are selected among the pages that share the same topic.
If we cluster the pages by words we might find different communities. One way to restrict to one of these communities is to run PPR on the set $S$ of initial pages with the same words. We can compute PPR for each topic set $S$ and get a different vector $r_S$. We update the transition matrix with a probability of teleporting to the nodes in the set $S$:
$$A_{ji} = \begin{cases} \beta M_{ji} + (1-\beta)/|S| & \text{if } j \in S \\ \beta M_{ji} & \text{otherwise} \end{cases}$$
As we teleport uniformly to any node in $S$, we can simply use the personalization vector $p$ and define
$$p_j = \begin{cases} 1/|S| & \text{if } j \in S \\ 0 & \text{otherwise} \end{cases}$$
In this way, the rest of the PPR definition remains the same besides $p$.
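A sketch of Personalized PageRank as a random walk with restart, with the restart vector concentrated on a topic set $S$; the cycle graph, $S = \{0\}$, and $\beta$ are illustrative assumptions.

```python
# Hedged sketch of Personalized PageRank / random walk with restart.
n = 4
beta = 0.85
# Graph: a directed cycle 0 -> 1 -> 2 -> 3 -> 0.
M = [
    [0.0, 0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
S = {0}                                   # teleport only into the topic set
p = [1.0 / len(S) if i in S else 0.0 for i in range(n)]
r = p[:]
for _ in range(200):
    # r = beta * M r + (1 - beta) * p
    r = [beta * sum(M[j][i] * r[i] for i in range(n)) + (1 - beta) * p[j]
         for j in range(n)]
# Scores decay with distance from S: r[0] > r[1] > r[2] > r[3].
```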
TrustRank
Teleportation solves a good deal of problems, but PageRank may still be biased towards pages that spread false information.
TrustRank tries to solve this issue with the use of trusted pages.
Main idea: the principle of approximate isolation under which it is unlikely that a good page points to a bad page.
Two problems to solve:
- how to select the set of trusted pages;
- how to decide whether a page is spam or a good one.
Selecting a subset of trusted pages. In order to select a list of trusted pages we can use different approaches:
- Sample a set of pages and ask a set of volunteers to evaluate whether the pages are good or bad. Expensive!!!
- Select the pages with the highest PageRank, assuming that bad pages cannot reach very high PageRank values.
- Select a set of trusted domains.
Detecting spam pages. One way to detect spam pages is to define a threshold value and mark all pages below that threshold as spam. However, finding the right threshold is difficult.
Another solution is to compute the spam mass as the deviation of TrustRank from PageRank. The spam mass is the fraction of a page's PageRank that comes from spam pages. Since in practice we do not know which pages are spam, we compute an estimate instead.
Let $r_p$ be the PageRank of page $p$, and let $r_p^+$ be the PageRank of $p$ with teleports only into trusted pages (its TrustRank). Then the spam mass of $p$ is $\frac{r_p^-}{r_p}$, where $r_p^- = r_p - r_p^+$. Pages with high spam mass are considered spam.
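A minimal sketch of the spam-mass estimate; the numeric PageRank/TrustRank values below are illustrative assumptions, not computed from a real graph.

```python
# Sketch: r_p is a page's PageRank, r_plus its TrustRank (PageRank with
# teleports only into trusted pages).
def spam_mass(r_p, r_plus):
    """Fraction of p's PageRank not explained by trusted pages."""
    return (r_p - r_plus) / r_p

low = spam_mass(0.010, 0.009)    # TrustRank close to PageRank: likely honest
high = spam_mass(0.010, 0.001)   # almost no trusted support: likely spam
```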
Link Farms
One problem for PageRank: Link farms.
A link farm is built around a target page $t$, usually controlled by a malicious spammer, that is connected to a set of good and trusted pages. The spammer creates a number $m$ of fictitious pages that link from and to $t$ with the purpose of artificially increasing $t$'s PageRank so that it appears higher in search engine results. But how many such pages does $t$ need?
Let $y$ be the PageRank of $t$ and $x$ the PageRank contribution of the "honest" pages that link to $t$. The PageRank of any of the $m$ fake pages is
$$\beta \frac{y}{m} + \frac{1-\beta}{n}$$
The PageRank $y$ is the sum of the contributions of the honest pages and the fake pages:
$$y = x + \beta m \left(\beta \frac{y}{m} + \frac{1-\beta}{n}\right) + \frac{1-\beta}{n} \approx x + \beta^2 y + \frac{\beta(1-\beta)m}{n}$$
Solving for $y$:
$$y = \frac{x}{1-\beta^2} + c\,\frac{m}{n}, \qquad c = \frac{\beta}{1+\beta}$$
If $\beta = 0.85$, the quantity $\frac{1}{1-\beta^2} \approx 3.6$. As such, for each honest page's contribution the malicious page is rewarded $3.6$ times the honest page's PageRank. This effect can be amplified further with a large $m$.
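A quick numeric check of the amplification factor; the values of $x$, $m$, and $n$ are illustrative assumptions.

```python
# Numeric check of the link-farm formula y = x / (1 - beta^2) + c * m / n
# with c = beta / (1 + beta).
beta = 0.85
amplification = 1.0 / (1.0 - beta ** 2)   # ~3.6x reward per honest page
c = beta / (1.0 + beta)
x = 0.001                                 # honest PageRank contribution
m, n = 1_000, 1_000_000                   # farm size, total pages
y = amplification * x + c * m / n         # PageRank of the target page
```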
HITS: Hubs and Authorities
Hyperlink-Induced Topic Search (HITS) is a measure of the importance of pages and documents, similar to PageRank.
Main idea: in order to find a good newspaper, we can find experts that point to the newspaper. Every link becomes a vote.
In HITS, every page has 2 scores: a hub score and an authority score. The hub score is the sum of the scores of the authorities the hub points to; the authority score is the sum of the votes coming from hubs.
HITS uses the principle of repeated improvement, which iterates over hubs and authorities in turn.
Each page starts with hub score $1$. The iteration works in an alternating manner: after the authorities have collected their votes, the hubs sum the scores from the authorities. Once again the computation of the scores is a mutually recursive definition: a good hub links to many good authorities, and a good authority is linked from many good hubs. We denote the hub and authority scores of page $i$ as $h_i$ and $a_i$.
Pseudo Code HITS
Initialize $a_i = h_i = 1/\sqrt{n}$ for all $i$
repeat
- $a_i \leftarrow \sum_{j \to i} h_j$
- $h_i \leftarrow \sum_{i \to j} a_j$
- Normalize $a$ and $h$
until $\|a^{(t+1)} - a^{(t)}\| < \varepsilon$ and $\|h^{(t+1)} - h^{(t)}\| < \varepsilon$
HITS converges to a single stable point. We can see that HITS indeed converges by considering the authority vector $a$, the hub vector $h$, and the adjacency matrix $A$ ($A_{ij} = 1$ if $i \to j$). In this way we can rewrite the updates as $a = A^T h$ and $h = A a$. For two consecutive HITS iterations we obtain
$$a^{(t+1)} = A^T A\, a^{(t)}, \qquad h^{(t+1)} = A A^T h^{(t)}$$
and repeat until the difference between two consecutive iterations is less than $\varepsilon$. From the two equations above it is easy to see that HITS converges to the principal eigenvector of $A^T A$ for $a$ and of $A A^T$ for $h$.
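The alternating updates above can be sketched in a few lines of Python; the 3-node graph and the fixed iteration count are illustrative assumptions.

```python
# A minimal HITS sketch; A is the adjacency matrix (A[i][j] = 1 iff i -> j).
n = 3
# Graph: 0 -> 2, 1 -> 2, 2 -> 0. Node 2 is the strong authority;
# nodes 0 and 1 act as hubs pointing at it.
A = [
    [0, 0, 1],
    [0, 0, 1],
    [1, 0, 0],
]
h = [1.0] * n
a = [1.0] * n
for _ in range(50):
    # authority score: sum of hub scores over in-links
    a = [sum(A[j][i] * h[j] for j in range(n)) for i in range(n)]
    # hub score: sum of authority scores over out-links
    h = [sum(A[i][j] * a[j] for j in range(n)) for i in range(n)]
    # normalize both vectors to unit Euclidean length
    na = sum(x * x for x in a) ** 0.5
    nh = sum(x * x for x in h) ** 0.5
    a = [x / na for x in a]
    h = [x / nh for x in h]
# a approaches the principal eigenvector of A^T A, h that of A A^T.
```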