Google rankings have a huge impact on internet usage, both for content providers and consumers, so I think it's a good idea to understand how that first page of results is selected. A top listing on Google can make or break a company, but it can also determine which articles you read on a topic, which shapes your thoughts. In other words, I think this is important stuff to understand.
Google's recent algorithm update (codenamed "Jagger") was a pretty significant change which deserves a bit of explanation. To understand what Jagger does (and why), you need to understand how Google's ranking system works; here's a hugely simplified history of the Google algorithm:
- Google's original PageRank system was based on the theory that links to a page (URL) were votes, and the more votes a page got the more relevant it was. Along with the text in the page, the text used in links to the page counted as keywords for that page, which led to abuses like the dreaded Google Bomb. The name PageRank comes from Google founder Larry Page, not the term Web Page, by the way.
- Search Engine Optimizers (and casino sites / under-clothed photo sites / etc. - I'm quite unfairly going to refer to the whole bunch as SEO's here) figured out how to game the system by exchanging links, setting up phony sites, etc.
- Blogs got big. Bloggers love to link back and forth in their posts as well as their comment signatures. Google's PageRank system went nuts over the link love, and suddenly blogs had disproportionate Google Juice.
- SEO's started comment spamming - using programs or cheap labor to post phony comments on popular blogs with links back to their site to get more PageRank votes.
- Google released the "Florida" algorithm update in November 2003 to de-juice comment spam. The "nofollow" link attribute followed in early 2005; it tells spiders not to credit a link toward the target's PageRank.
- SEO's set up phony blog sites which automatically stole content from other sites and linked to the sites they wanted to promote. Even to the human eye, some of these sites weren't obvious frauds.
- Google rolled out the "Jagger" algorithm, designed specifically to make PageRank much harder to game. Jagger was rolled out and tweaked over several months, ending around November 2005.
Jagger's main change is the switch from the elegant but overly trusting PageRank system to the more realistically cynical TrustRank system, which is designed to only count votes from sites it trusts.
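To see why plain PageRank is "overly trusting", here is a minimal sketch of the link-as-vote idea. It is a toy model, not Google's actual implementation: the function name `toy_pagerank`, the page names, and the link graph are all made up for illustration, and the damping factor of 0.85 is just the conventional textbook value.

```python
def toy_pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank. `links` maps each page to the list of pages it links to."""
    # Collect every page that appears as a source or a target.
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a small baseline, no matter who links to it.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its vote over everyone.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical link graph: two blogs and a news site, plus a small link farm.
example_links = {
    "blog_a": ["news"],
    "blog_b": ["news"],
    "news": ["blog_a"],
    "farm1": ["casino"],
    "farm2": ["casino"],
    "farm3": ["casino"],
    "casino": [],
}

ranks = toy_pagerank(example_links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page:8s} {score:.3f}")
```

In this toy graph the three farm pages lift "casino" above "blog_b", even though nobody reputable links to it. Every vote counts the same regardless of who casts it, which is exactly the weakness the link farms exploited.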
TrustRank imitates human behavior - if a stranger on a train recommends a movie, I'm going to value it a lot less than a recommendation from a close friend or movie critic, both of whom have earned my trust, either by how long I've known them or by their reputation.
Trust comes from two sources - site age and links from trusted sources. In my movie recommendation analogy above, site age is the close friend who has gained trust through the age of the relationship, whereas trusted sources are sites that have been granted a position of authority by links from a small seed group of trusted sites.
Another way to look at this is from the point of view of a content publisher with a new site. At first, your links will be untrusted and will not contribute to the PageRank of the pages they link to. The site has to undergo an aging delay before it is considered authoritative, which has led to discussion of the "Sandbox" (or the "Trustbox"). The idea is that new sites are sandboxed so they can't mess up the rankings until they've proven themselves, at which time they can participate in PageRank voting.
Don't make the mistake of assuming there's a simple penalty box / sandbox with a set time delay before a site is trusted - that would be nothing more than an inconvenience to SEO's. There's a lively discussion on this, which can be summarized by saying it's not a Sandbox, it's a Trustbox. The point is that you don't just accrue trust through seniority, you earn it by links from trusted sources.
There are two ways to gain trust and escape the Trustbox:
- Acquire links from highly trusted sources (the "movie critic recommendation")
- Acquire links from somewhat trusted sources and let them age (the "friend recommendation")
This has different implications for different types of sites. An online store or product, for instance, might need to take proactive measures to encourage linking to it. For a blog or content site, however, the best advice seems to be pretty common sense: continually produce interesting and relevant content so other blogs and content sites will link to your content.
In both cases, the relevancy of the link is important as well, which makes sense from both an accuracy and an "anti-gaming" point of view.
The system started with a very small number - around 200 - of trusted "seed" sites, which were closely reviewed by experts. Links from these sites propagated trust through the system.
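Here is a rough sketch of how trust might flow outward from a seed set. It follows the seed-biased idea from the TrustRank paper listed in the references below, but it is only an illustration: how Google actually applies trust isn't public, site age isn't modeled at all, and the function name `toy_trustrank`, the page names, and the damping value are my own assumptions.

```python
def toy_trustrank(links, seeds, damping=0.85, iterations=50):
    """Toy trust propagation. `links` maps each page to the pages it links to;
    `seeds` is the hand-reviewed set of trusted starting pages."""
    pages = set(links) | {t for targets in links.values() for t in targets} | set(seeds)
    # Only seed pages receive baseline trust; everyone else starts at zero.
    seed_share = {page: (1.0 / len(seeds) if page in seeds else 0.0) for page in pages}
    trust = dict(seed_share)

    for _ in range(iterations):
        new_trust = {page: (1.0 - damping) * seed_share[page] for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if not targets:
                continue  # a dead-end page simply passes no trust along
            # Trust is split among outgoing links and fades with each hop.
            share = damping * trust[page] / len(targets)
            for target in targets:
                new_trust[target] += share
        trust = new_trust
    return trust


# Hypothetical web: one reviewed seed, a chain of blogs, and a link farm.
web = {
    "seed_university": ["established_blog"],
    "established_blog": ["new_blog"],
    "farm1": ["casino"],
    "farm2": ["casino"],
    "casino": ["farm1", "farm2"],
}

trust = toy_trustrank(web, seeds={"seed_university"})
for page, score in sorted(trust.items(), key=lambda item: -item[1]):
    print(f"{page:16s} {score:.4f}")
```

The farm pages can link to each other all day long, but since no trusted page links into that cluster, "casino" ends up with a trust score of zero. The new blog, on the other hand, picks up some trust second-hand through the established blog.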
Tailrank has applied a similar seed / trust model to rank and categorize RSS / ATOM feeds. Unlike Google, their rankings are personalized to your interests if you upload your feed subscription OPML. Read more here.
New sites get a 2 to 4 week grace period, or they'd never have a chance. This is a great time to be gaining trusted links so you don't see a steep decline in rankings (and traffic) after this time is up.
References / Further reading:
- Combating Web Spam with TrustRank (PDF) - the research paper which originally proposed the idea. Interestingly, it was Yahoo - not Google - that was involved in this paper.
- A Look at Google TrustRank & Trustbox - short overview
- The Trustbox Revisited - Stuntdbl - There is no Sandbox! Also some nice links at the end.
- Google SEO : Sandbox, TrustRank, Jagger Update - Good info on how to work with the new system
- On the Google Jagger Algo Update - Some history as well as a more in-depth listing of other changes in the algorithm, such as penalization for "irrational use of css" and "duplicate content on multiple domains". That second one seems to penalize crossposting...
I think this was one of the earlier applications of the concept of allowing user behavior to define the system rather than the opposite. How Web 2.0.