Wednesday, January 4, 2012

Error 404? Retrieve Deleted Web Pages With These Browser Extensions

What do you do when you arrive at a webpage that says Error 404 – Not Found? Do you click your tongue and close the page? Wait, not so fast. There is a good possibility that the deleted page is cached somewhere on the web. You just have to know where to look.

Error 404 page of Lightpost Creative

The best place to start is Google or, for that matter, any major search engine. When search engines crawl websites to index pages, they save a copy of each page on their own servers. That is part of why Google can return results faster than you can blink: it searches its own stored copies rather than the live web. The cached copy is accessible from the search results page. The link used to be in plain sight next to each result; now it sits under Instant Preview. Either way, clicking it retrieves the copy from Google’s servers even if the website owner has deleted the original. This works not only for Google, but for Bing too.

However, not all pages have a cached copy. A page published within the last few minutes or hours usually hasn’t been crawled yet, so no cached copy exists. Pages deleted a long time ago might not have one either: Googlebot periodically re-crawls sites, and pages that are missing or deleted are eventually removed from the cache as well.

For really old pages, the ideal place to look is the Internet Archive’s Wayback Machine.

The Wayback Machine, according to the description found on Wikipedia, “is a digital time capsule created by the Internet Archive non-profit organization, based in San Francisco, California. It is maintained with content from Alexa Internet. The service enables users to see archived versions of web pages across time, which the Archive calls a three dimensional index.”

The Wayback Machine’s cache is not as vast as Google’s, but it indexes most moderate to large websites. One major drawback has been that snapshots could take 6 months or more to become available after they were archived, and in some cases 24 months or longer. However, things have started to look better since the site went through a major redesign last year: snapshots are now available within a few hours to a few days. The frequency of snapshots varies, though, so not every update to a tracked website is recorded, and gaps of several weeks or even years sometimes occur between snapshots.
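
If you would rather check for a snapshot from a script than from the browser, the Internet Archive offers a small JSON availability endpoint. The sketch below is not from the original post. It assumes the endpoint lives at archive.org/wayback/available and that it returns an "archived_snapshots" object with a "closest" entry. The function names are made up for illustration.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint of the Wayback Machine availability API.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(page_url, timestamp=None):
    """Build the request URL that asks whether page_url has a snapshot."""
    params = {"url": page_url}
    if timestamp:
        # Optional YYYYMMDD[hhmmss] value; the API looks for the snapshot
        # closest to this moment.
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response_text):
    """Return the closest snapshot's URL from a JSON response, or None."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None
```

To use it, fetch the URL returned by availability_query() with any HTTP client (urllib.request.urlopen from the standard library would do) and pass the response body to closest_snapshot(). A result of None means the Wayback Machine has no copy of the page.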

Obviously, you want easier access to these services, and the Web Cache extension for Chrome provides it.

Web Cache lets you quickly search for the missing page in a number of places, including the Internet Archive Wayback Machine, Google Cache, Yahoo Cache, Bing Cache, CoralCDN, Gigablast and WebCite.
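
The idea behind such an extension is simple: turn the dead address into a lookup link for each cache and try them in turn. Here is a rough sketch of that idea, not the extension’s actual code. The two URL patterns are assumptions based on the public lookup addresses these services used around the time of writing, and they may no longer resolve. All names, and the is_reachable callback, are hypothetical.

```python
from urllib.parse import quote

# Assumed lookup patterns; {url} is replaced with the dead page's address.
CACHE_TEMPLATES = [
    ("Google Cache", "https://webcache.googleusercontent.com/search?q=cache:{url}"),
    ("Wayback Machine", "https://web.archive.org/web/{url}"),
]


def candidate_links(page_url, templates=CACHE_TEMPLATES):
    """Return (service name, lookup link) pairs for page_url, in order."""
    # Keep ':' and '/' readable, escape everything else (spaces, '?', '&').
    encoded = quote(page_url, safe=":/")
    return [(name, template.format(url=encoded)) for name, template in templates]


def first_hit(page_url, is_reachable, templates=CACHE_TEMPLATES):
    """Try each cache in order and return the first (name, link) that works.

    is_reachable is a caller-supplied function that takes a link and returns
    True when the cached copy can actually be fetched.
    """
    for name, link in candidate_links(page_url, templates):
        if is_reachable(link):
            return name, link
    return None
```

Passing the reachability check in as a function keeps the sketch free of network code: in real use it would issue an HTTP request and check for a successful status, while for a quick trial any function that returns True or False will do.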

Among the supported services, the only ones worth using are Google Cache, Yahoo Cache, Bing Cache and the Wayback Machine. CoralCDN is a content distribution network (CDN), and I don’t know how that helps. Gigablast is an obscure search engine that is no good, and WebCite is a Wayback Machine type of service but with limited reach.

For Firefox, there is just one add-on, Gcache+, that lets you search Google’s cache for the unavailable page. Another one, Resurrect Pages, supports the same set of web caches as the Chrome extension, but it works only for pages that couldn’t be reached due to a network problem or an offline server, not for missing (error 404) pages. Yet another, ErrorZilla Mod, suffers from the same limitation.

There is nothing for Internet Explorer or Opera.

2 comments:

  1. Thanks for your research, and sharing it. love this blog!! it's like the type of lifehacker posts that I love the most, all put into one place! and unique content usually not found on lh either!! :).

    You have a new subscriber :).

  2. What to do when none of these options work? Is there any way to view/retrieve a web page that has been removed from google cache & isn't on wayback machine, the internet archive?

