Behind the Search Engines: PageRank Algorithm
Information retrieval had been challenging since the web was growing faster. The number of new users who are inexperienced about the web search or surfing were increasing too. The search engines back then had many limitations such as inefficient keyword matching algorithm leading to low quality matches, most of these search engines had high quality human-maintained indices but were quite expensive to maintain and improve with slow performance and could not cover diverse topics of index. Also, most of the systems were not able to handle millions of queries per day by the users. Somewhere there was a need for a large-scale search engine which would be highly efficient, scalable and user friendly too.
Lawrence Page and Sergey Brin, PhD students from Stanford University proposed a prototype called Google to effectively index search engines and produce better search results than the existing systems back then. Google got its name from googol or 10 raised to 100 which means very large.
We all know that Google is now one of the most efficient search engines present and has established a billion-dollar business but what really makes Google so special? What is the underlying algorithm behind it? Let’s understand this here.
Google search engine makes use of one of the important algorithm called as PageRank that helps it produce high precision results. But first let’s be aware of a concept called as citation graph which will help to understand PageRank algorithm in a better way.
Citation or Link graph of web:
It’s basically a network of citations/links between web pages or websites, thus representing connections between different web resources based on how they are cited by other web pages. These graphs help the search engine to map the internet, as called as Link Graphs. They reveal attributes about websites on the internet. In a link graph, each website is represented as a node, and a hyperlink between two websites is represented as a directed edge(link) between the nodes.
Mathematically, a link graph can be represented as a directed graph, such as if page A has a hyperlink referring to page B, then an edge will be connected to nodes represented Page A and Page B. Eventually the adjacency matrix X will be XAB = 1 for page A and XAB = 0 for page B.
They can be used to identify the topics a website is about, so the similar sites would be linked to each other. Thus, determining site relevance. The link graphs can also help determine spam websites. While spam sites may link to normal non-spam sites, normal sites do not tend to link to spam sites, isolating spam sites into their corners of the link graph.
PageRank Algorithm:
Back then, most of the existing search engines were hardly using link graphs within their system. For Google, 518 million of these hyperlinks of map was created. These maps were further helping rapid calculation of web pages rank that will related to people’s idea of importance. Eventually, prioritizing the results of web keyword searches. A PageRank score of 0 means lowest quality pages and pagerank score of 10 means most established page.
So, a link from one website to another acted as a vote of trust and authority. Thus, more links that point to a page (which acts as a vote), the more the page can be trusted, and it should be ranked higher. But PageRank algorithm extends itself to this idea by normalizing by the number of links on a page. It means a link cannot be a vote, the authority of that page is also considered.
Mathematically, we can represent this relationship as follows:
PR(i) = (1 - d) + d * (PR(j1)/L(j1) + PR(j2)/L(j2) + ... + PR(jk)/L(jk))
d = damping factor that represents the probability of a user randomly clicking a link on webpage rather than following a link from another webpage. So, d = 0.85 PR(j1), PR(j2), ..., PR(jk) are the PageRank scores of the webpages that link to webpage i, and L(j1), L(j2), ..., L(jk) are the number of outgoing links from those webpages.
The equation essentially says that the PageRank score of a webpage is a combination of the damping factor (1 - d) and the sum of the PageRank scores of the webpages that link to it, divided by the number of outgoing links from those webpages. This calculation is performed iteratively until a stable PageRank score is obtained for each webpage. By using this algorithm, Google is able to rank webpages based on their importance or popularity, which helps determine the order in which webpages are displayed in search results.
Google retires the PageRank Toolbar
Previously, SEOs could see the PageRank score of any webpage via the Google Toolbar. Eventually, SEOs really became obsessed with the PageRank becoming the most focus tactics to rank the websites higher rather than creating great content and user experience. Also, the PageRank score was easy to manipulate. Eventually, Google retired the PageRank toolbar in 2016.
References:
- Everything You Need to Know About Google PageRank
- Page Rank Algorithm and Implementation
- The Anatomy of a Large-Scale Hypertextual Web Search Engine
Thanks for reading!