Even if you believe that internet search has moved on from PageRank, there is no denying that it has long been a pervasive concept in the IT industry.
Every Search Engine Optimization (SEO) pro should have a good grasp of what PageRank was and what it still is today.
Created by Google's founders Larry Page and Sergey Brin, PageRank is an algorithm based on the combined relative strengths of all the hyperlinks on the web.
Most people in SEO argue that the name was based on Larry Page’s surname, whilst others suggest “Page” refers to a web page. Both positions are likely true, and the overlap was probably intentional.
When Larry Page and Sergey Brin were at Stanford University, they wrote a paper entitled “The PageRank Citation Ranking: Bringing Order to the Web.”
Published in 1999, the paper demonstrates a relatively simple algorithm for evaluating the strength of web pages.
The research paper went on to become a patent in the United States (but not in the European Union, where mathematical formulas were not patentable at the time).
Leland Stanford Junior University owns the patent and has assigned it to Google. The patent is currently due to expire in 2027.
During their time at Stanford in the late 1990s, both Sergey Brin and Larry Page were looking at information retrieval methods.
At that time, using web links to work out how “important” each web page was relative to another was a revolutionary way to order web pages. It was computationally difficult but by no means impossible.
The idea quickly turned into Google, which at that time was a minnow in the world of search engines.
There was so much institutional belief in Google’s approach from some parties that the business initially launched its search engine with no ability to earn revenue from it.
And while Google (known at the time as “BackRub”) was the search engine, PageRank was the algorithm it used to rank pages in the search engine results pages (SERPs).
One of the challenges of PageRank was that the math, whilst simple, needed to be iteratively processed. The calculation runs multiple times, over every web page and every link on the Internet. At the turn of the millennium, this math took several days to process.
The Google search engine results pages moved up and down during that time. These changes were often erratic, as new PageRank scores were being calculated for every web page.
This was known as the “Google Dance,” and it notoriously stopped SEO pros of the day in their tracks every time Google started its monthly update.
(The Google Dance later became the name of an annual party that Google ran for SEO experts at its headquarters in Mountain View.)
A later iteration of PageRank introduced the idea of a “trusted seed” set to start the algorithm, rather than giving every web page on the Internet the same initial value.
A further refinement, commonly known as the “reasonable surfer” model, suggests that the PageRank of a web page might not be shared evenly with the pages it links out to. Instead, the relative value of each link could be weighted by how likely a user is to click on it.
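As a rough illustration of the weighted sharing described above, the sketch below splits one page's score across its links in proportion to an assumed click likelihood. The function name, the link labels, and the weights are all invented for the example; nothing here reflects how Google actually estimates click probability.

```python
def distribute_weighted(score, click_weights):
    """Split a page's score across its links in proportion to each link's weight.

    click_weights is a hypothetical mapping of link label -> relative
    likelihood of a click (any positive numbers; they need not sum to 1).
    """
    total = sum(click_weights.values())
    return {link: score * weight / total for link, weight in click_weights.items()}


# Hypothetical page with three links of differing prominence.
shares = distribute_weighted(
    1.0,
    {"body-link": 6.0, "main-nav": 3.0, "footer": 1.0},
)
```

With equal weights this reduces to the even split used in the original algorithm, so the weighted version can be read as a generalization rather than a replacement.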
Google’s algorithm was initially believed internally to be “unspam-able,” since the importance of a web page was dictated not just by its content but also by a sort of “voting system” generated by links to the page.
Google’s confidence did not last, however.
PageRank started to become problematic as the backlink industry grew. So Google withdrew it from public view but continued to rely on it for its own ranking algorithms.
The PageRank Toolbar was withdrawn by 2016, and eventually, all public access to PageRank was curtailed. But by this time, Majestic (an SEO tool), in particular, had been able to correlate its own calculations quite well with PageRank.
Google spent many years encouraging SEO professionals away from manipulating links, both through its “Google Guidelines” documentation and through advice from its spam team, headed up by Matt Cutts until January 2017.
Google’s algorithms were also changing during this time.
Google was relying less on PageRank and, following its 2010 purchase of Metaweb Technologies and its proprietary knowledge graph (called “Freebase”), started to index the world’s information in different ways.
Google was initially so proud of its algorithm that it was happy to publicly share the result of its calculation with anyone who wanted to see it.
The most notable representation was a toolbar extension for browsers like Firefox, which showed a score between 0 and 10 for every web page on the Internet.
In truth, PageRank has a much wider range of scores, but 0-10 gave SEO pros and consumers an instant way to assess the importance of any page on the Internet.
The PageRank Toolbar made the algorithm extremely visible, which also came with complications. In particular, it made clear that links were the easiest way to “game” Google search.
The more links (or, more accurately, the better the links), the better a page could rank in Google’s search engine results pages (SERPs) for any targeted keyword.
This meant that a secondary market formed, buying and selling links valued on the PageRank of the URL where the link was sold.
This problem was exacerbated when Yahoo launched a free tool called Yahoo Site Explorer, which allowed anyone to start finding the links into any given web page.
Later, two tools, Moz and Majestic, built on the free option by creating their own indexes of the web and separately evaluating links.
Many other search engines relied heavily on analyzing the content on each web page individually. These methods had little ability to tell the difference between an influential web page and one simply written with random (or manipulative) text.
This meant that the retrieval methods of other search engines were extremely easy for SEO pros to manipulate.
Google’s PageRank algorithm, then, was revolutionary.
Combined with a relatively simple concept of “nGrams” to help establish relevancy, Google found a winning formula.
It soon overtook the main incumbents of the day, such as AltaVista and Inktomi (which powered MSN, amongst others).
By operating at a web page level, Google also found a much more scalable solution than the “directory” based approach adopted by Yahoo and later by DMOZ, the multilingual open-content directory of World Wide Web links. DMOZ (also called the Open Directory Project) was nevertheless able to provide Google initially with an open-source directory of its own.
The formula for PageRank comes in a number of forms but can be explained in just a few sentences.
Initially, every page on the Internet is given an estimated PageRank score. This could be any number. Historically, PageRank was presented to the public as a score between 0 and 10, but in practice, the estimates do not have to start in this kind of range.
The PageRank for that page is then divided by the number of links out of the page, resulting in a smaller fraction.
That fraction is then distributed to each of the linked pages, and the same is done for every other page on the Internet.
Then, for the next iteration of the algorithm, the new estimate of PageRank for each page is the sum of all the fractions passed on by the pages that link into it.
The formula also contains a “damping factor,” which was described as the chance that a person surfing the web might stop surfing altogether.
Before each subsequent iteration of the algorithm starts, the proposed new PageRank is reduced by the damping factor.
This process is repeated until the PageRank scores reach a settled equilibrium. The resulting numbers were then generally transposed into a more recognizable range of 0 to 10 for convenience.
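The steps above can be sketched in a few lines of Python. This is a minimal illustration, not Google's implementation: it uses the commonly published form of the formula, assumes a damping factor of 0.85, starts every page at the same value, and assumes every page has at least one outbound link and only links to pages in the same toy graph. The function name and the three-page web are made up for the example.

```python
def pagerank(links, damping=0.85, tol=1e-9, max_iter=200):
    """Iteratively estimate PageRank for a small link graph.

    links maps each page to the list of pages it links out to.
    Assumes every page has at least one outbound link.
    """
    pages = list(links)
    n = len(pages)
    # Give every page the same starting estimate.
    scores = {page: 1.0 / n for page in pages}
    for _ in range(max_iter):
        # Baseline every page receives regardless of inbound links.
        new_scores = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            # Divide the page's score by its number of outbound links...
            share = scores[page] / len(outlinks)
            for target in outlinks:
                # ...and pass the damped fraction to each linked page.
                new_scores[target] += damping * share
        # Stop once the scores have settled.
        change = sum(abs(new_scores[p] - scores[p]) for p in pages)
        scores = new_scores
        if change < tol:
            break
    return scores


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_web)
```

In this toy graph, C ends up with the highest score because it receives links from both other pages, and B the lowest because its only inbound link carries half of A's score.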
If a web page does not link out to any other web page, then the formula will not reach an equilibrium.
In this event, therefore, the page's PageRank would be distributed amongst every page on the Internet. In this way, even a web page with no incoming links could get some PageRank, but it would not accumulate enough to be significant.
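One way to picture this handling of pages with no outbound links is the sketch below, which shares the score of any such page evenly across every page in the graph on each iteration. As before, the damping factor of 0.85, the fixed iteration count, the function name, and the toy graph are assumptions made for illustration only.

```python
def pagerank_with_dangling(links, damping=0.85, iterations=100):
    """PageRank sketch in which pages without outbound links share their
    score evenly across every page in the graph."""
    pages = list(links)
    n = len(pages)
    scores = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Total score currently held by pages that link to nothing.
        dangling = sum(scores[p] for p in pages if not links[p])
        # Every page gets the baseline plus an equal slice of that score.
        base = (1.0 - damping) / n + damping * dangling / n
        new_scores = {page: base for page in pages}
        for page, outlinks in links.items():
            for target in outlinks:
                new_scores[target] += damping * scores[page] / len(outlinks)
        scores = new_scores
    return scores


# Hypothetical graph: B has no outbound links, C has no inbound links.
toy_web = {"A": ["B"], "B": [], "C": ["A"]}
ranks = pagerank_with_dangling(toy_web)
```

Here C has no inbound links at all, yet it still ends up with a small non-zero score, while the total across all pages stays constant from one iteration to the next.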
Another less documented challenge is that newer web pages, whilst potentially more important than older ones, will have a lower PageRank. This means that over time, old content can have a disproportionately high PageRank.
If a web page starts with a value of 5 and has 10 links out, then every page it links to is given 0.5 PageRank (less the damping factor).
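The arithmetic in that example can be checked directly. The snippet below assumes a damping factor of 0.85 and reads “less the damping factor” as multiplying each share by that factor; both are assumptions, since the text does not fix a value.

```python
page_score = 5.0        # starting value of the linking page
outbound_links = 10     # number of links out of the page
damping = 0.85          # assumed damping factor, not stated in the text

# Share passed along each link before damping: 5 / 10 = 0.5
share_per_link = page_score / outbound_links

# Share after damping is applied: 0.85 * 0.5 = 0.425
damped_share = damping * share_per_link

print(share_per_link, damped_share)
```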
In this way, PageRank flows around the Internet between iterations.
As new web pages come onto the Internet, they start with only a tiny amount of PageRank. But as other pages start to link to them, their PageRank increases over time.
Although public access to PageRank was removed in 2016, it is believed the score is still available to engineers within Google.
A leak of the factors used by Russian search engine Yandex showed that PageRank remained a factor that it could use.
Google engineers have suggested that the original form of PageRank was replaced with a newer approximation that requires less processing power to calculate. Whilst the formula is less important in how Google ranks web pages, it remains a constant for each web page.
And regardless of what other algorithms Google might choose to call upon, PageRank likely remains embedded in many of the search giant’s systems to this day.