Google Indexes Scanned Documents

October 31st, 2008 at 12:00 am

Google is now indexing scanned documents. In the past, scanned documents were simply pictures to Google’s spiders: the crawlers could not read the contents of these files and relied on file names and tags to index them. The same was true of scanned pages saved in the popular Portable Document Format (PDF) from Adobe. PDF recently became an ISO standard, however, and more and more documents are stored and uploaded to the internet this way.

Google has now announced that it is able to index these documents:

"We are now able to perform OCR on any scanned documents that we find stored in Adobe’s PDF format. This Optical Character Recognition (OCR) technology lets us convert a picture (of a thousand words) into a thousand words — words that can be searched and indexed, so that these valuable documents are more easily found. This is a small but important step forward in our mission of making all the world’s information accessible and useful."

This is an important step. OCR is neither rocket science nor new technology, but Google had considerable difficulty getting its robots to read scanned images reliably. Scanned documents are often riddled with imperfections such as coffee stains and other marks. Google can now read through these imperfections and index scanned documents properly.