Bring the past to the internet - Digitization of Print

charan puvvala

A year ago I worked with a university, assisting in the digitization of a select collection. This involved careful planning, lots of dirty work, one software developer, proofreading, and a five-person team working for 45 days to complete it. The end result was immensely satisfying. The 320 digitized books became available to the public at the university's Digital Library. Previously, one had to obtain permission from a hierarchy of people to access these books, because of their fragile condition and the value they hold. After the digitization, students could easily access them, and the university offloaded the maintenance of the collection to a different place.

The digitization process involved using OCR software to make the content searchable in the digital library portal. The accuracy of the OCR was somewhere around 90-92%, which left a lot of scope for improvement. Proofreading was where most of the budget went, as the proofreaders needed some domain knowledge.
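To make an accuracy figure like this concrete, here is a minimal sketch of one common way to score OCR output against a proofread reference: character-level accuracy based on edit distance. This is an assumption for illustration; I am not claiming it is how the figure above was measured. The function names `edit_distance` and `char_accuracy` and the sample strings are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


def char_accuracy(reference: str, ocr_output: str) -> float:
    """Assumed metric: 1 - (edit distance / reference length), floored at 0."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))


if __name__ == "__main__":
    # Made-up sample: the OCR engine read the letter "i" as the digit "1".
    proofread = "digitization"
    scanned = "digitizat1on"
    print(f"character accuracy: {char_accuracy(proofread, scanned):.3f}")
```

If the 90-92% figure were character-level, a page of about 2,000 characters would contain roughly 160 to 200 errors, which gives a sense of why proofreading took most of the budget.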


With my recent interest in Machine Learning, I am hoping to give OCR a spin with it. Computer Vision is one area that has improved a lot thanks to ML algorithms. For the past two months I have been using Alpaydin's Introduction to Machine Learning as my reference. I hope I can come up with more content in this section.