OCRopus – open source text recognition

Introduction

My girlfriend Pernille Petersen needed to grab a big chunk of text from a magazine and asked me to help her, so she didn’t need to type it in by hand. I knew that I could do this easily with OCR, but hadn’t done so in Linux before.


OCRopus

One of the first Google hits for the search “open source ocr” is OCRopus. According to Wikipedia, “OCRopus is currently developed under the lead of Thomas Breuel from the German Research Centre for Artificial Intelligence in Kaiserslautern, Germany and is sponsored by Google.” It is also quite easy to install on Ubuntu or Debian:

[root@euler ~][23:33]# aptitude install ocropus

Conditioning the scan and running OCRopus

Before actually running ocropus on the screenshot, I knew that it would likely be thrown off by the picture, the ads and the two-column text, so I cropped out the columns and adjusted the levels of the images in GIMP to make it easier for ocropus to recognize the text.

Finally, I saved these two files and ran ocropus on them as shown:

tjansson@euler:~$ ocroscript recognize column1.png > column1.html
tjansson@euler:~$ ocroscript recognize column2.png > column2.html
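
With more than a couple of columns, typing the command once per file gets tedious. The following is a minimal sketch of how the same step could be wrapped in a small shell function. The function name recognize_columns is my own invention, not part of ocropus, and it assumes that ocroscript is on the PATH and that each image should produce an HTML file with the same base name:

```shell
#!/bin/sh
# Hypothetical helper (not part of ocropus): run "ocroscript recognize"
# on every image given as an argument and write the result next to it,
# e.g. column1.png -> column1.html.
recognize_columns() {
    for img in "$@"; do
        # Strip the file extension and append .html for the output name.
        out="${img%.*}.html"
        # Stop at the first image that ocroscript fails on.
        ocroscript recognize "$img" > "$out" || return 1
    done
}
```

After sourcing the function, the two commands above could be replaced by a single call: recognize_columns column1.png column2.png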

If I had to describe Pernille’s performance in one expression it would be “speIIbinding”, she is clearly can/ing out her performing persona and her choice of repertoire was a real showcase of this lovely and expressive player emerging. I was in the audience at theFinal in 2009 and not only was she a worthy winner but one of her many qualities is her ability to engage with an audience and draw their attention to her story-telling; she is an honest performer, expressive, with great technical control and a very good chamber musician too.

Pernille brought with her a group of wonderful musicians with whom she clearly has a great musical affinity: Gunhild Tonder, harpsichord; Thomas Pitt, baroque Cello and Bjarke l\/Iogensen, accordion, to play some of Pernille’s favourite pieces, and also to complement the release of her solo debut CD of Castrucci and Geminiani.

Pernille started the concert on her own playing Principio di Vlrtu,a medieval dance, by heart using extended techniques to add variation to the Rondeau form. This set the mood for the concert; you couIdn’t hear a pin drop, but I wondered if a bit more risk taking and variation with articulation and colours might have added to the listeners’ interest. The same with the Sonata Nona by Uccellini; absolutely beautifully played, completely technically assured, with near perfect tuning but at points felt a little safe. She is a young player at the beginning of her career, which I have no doubt will be a great one, and will for sure develop the confidence to step out of the safety zone. The Uccellini comes in very much on the trail of the Stile Rappresentativo and it needs, in my opinion, a slightly bolder approach, but this in no way takes anything from a beautifully accomplished performance.

Sources

http://xplus3.net/2009/03/31/ocr-with-ocropus-and-tesseract/
http://code.google.com/p/ocropus/
http://www.pernillepetersen.dk/?page_id=224
