# Markov Chains & Google's Page Rank

Before the inception of Google in the late 1990s, the results obtained from the typical search engine left one to sift through large amounts of irrelevant web pages the just happened to match the search text. Google search listings always seemed to deliver more pertinent results up front. The genius behind the world’s most dominant search engine is its PageRank algorithm, which quantitatively values the relative importance of each webpage. This allows Google to rank the pages, and consequently present the most relevant and useful ones first. Page and Brin describe PageRank a model of user behavior. They begin with the assumption that a random web surfer who is given a random Web page begins clicking on links, never hitting the “back” button. Whenever he gets bored, he jumps to another random page. The probability that this random surfer visits a specific page is its PageRank. At each page, the algorithm also analyses the probability that the random surfer will get bored and jump to another random page. The unsystematic process that illustrates the surfer’s behavior is known as a Markov Chain. The intrinsic characteristics of Markov’s theorem imply that regardless of the starting point, the probability that our random web surfer lands on a specific page is the same. Calculating PageRank

The searchable web currently has an immense number of nodes (pages) and edges (links). Pages can have both forward links and backlinks. Google takes advantage of this link structure to produce a global ranking of each webpage’s importance. It can be generally implied that a site with a high number of back links is quite important. However, PageRank implements a more sophisticated method for link counting and weighing. In order to elaborate, we will consider the following simplified web with only four pages:

In web of n pages, where each page is indexed by an interger k, where 1 ≤ k ≤ n, an arrow from on page A to page B indicates a link from A to B. Xk will be used to denote the importance score of page k in the web. All scores are non-negative (the least possible importance score is 0), and Xj ≥ Xk indicates that page j has a more important ranking than page k. We could approach arriving at a value for Xk by counting the back links for page k. Considering the figure here at the left, we have x1 = 2, x2 = 1, x3 = 3, and x4 = 2. Page 3 ranks the highest. This approach has one very large omission. It does not consider the relative importance of each page. A link from an important page should heighten page k’s rank more than a link from an unimportant page. In order to put this idea into practice, let’s try to calculate the score of page j as the sum of the scores of all the pages linked to it. Because of these pages interconnectivity, this approach is self-referential. While the score of page 1 would be the sum of the relative importance of pages 3 and 4, we can see that 3 and 4’s importance depends on 1’s. So that a single page does not have an inflated influence by simply linking to a lot of pages, the following strategy is put into place. If page j contains Nj links, and one of these links is to page k, then page k’s rank is improved by xj/nj, instead of by xj. This way, each page only gets a single vote that is weighted by its score that is evenly divided among the entirety of its outgoing links. It is assumed that a link from a page to itself is not counted. If Lk is the set of page k’s backlinks, then Lk {1,2,...,n} denotes the set of pages with a link to page k. We can thus quantify the rank of k with the following formula:

If we apply this formula to the four paged web depicted in the picture above we can calculate the rank of page 1 thusly: x1 = x3/1 + x4/2, because pages 3 and 4 have back links to page 1, page 3 has only one link, and page 4 contains 2, which splits its vote in half. Following this scheme, x2 = x1/3, x3 = x1/3 + x2/2 + x4/2, and x4 = x1/3 + x2/2. These linear...

Please join StudyMode to read the full document