Intro to Optical Character Recognition

I’m sure you’ve heard of Optical Character Recognition, or OCR. It’s been around in some form or another since the early 1900s. At its core, it’s simply the process of recognizing printed text in images. Let’s say you’re at work and someone hands you a newspaper article. You need to use paragraphs 4-17 in an email and really don’t want to retype the whole thing. You walk over to a scanner, scan the article, and a PDF of it shows up on your computer. You open it in your PDF reader, click a button to convert it to a “readable PDF”, highlight the text you need, and you’re on your way. The button that converts your PDF into a readable format runs an OCR process: it looks through the pixels in the image you just scanned and tries to recognize characters and words. But when you paste the text into your email, a bunch of it is wrong: “1” has turned into “l” and “rn” has turned into “m”. The problem is that even after more than 100 years of work on this technology, it still hasn’t reached human levels of recognition.

For important business processes, the level of accuracy provided by generic OCR solutions isn’t enough. When an OCR-enabled application is scanning a check, you don’t want it to get the account number wrong. When one is parsing medical documents, you don’t want it to get the name of a medication wrong. The main issue with the generic OCR solutions built into your PDF reader is that they are just that: generic. They’re designed to work with any type of document that you can feed them. Users might be trying to recognize text in scanned newspapers or insurance policies, apartment building rent rolls or legal briefs. The documents may be in pristine condition or they might be terrible scans.

Currently, AI handles narrow tasks very well and broad ones poorly. When it comes to specific business cases for OCR, the broad problem of recognizing text in images can often be constrained to make the resulting solution more accurate. If an OCR solution only has to recognize text in images of a certain quality or format, pre-processing techniques can be used to make the image more legible. If the solution only has to recognize text in documents written about certain topics, post-processing techniques can be used to increase the accuracy of the converted text.

Pre-processing techniques are strategies for cleaning up a document so that it is easier for the OCR algorithm to read. If a document was not aligned properly when it was scanned, it can be de-skewed to make the text horizontal. The document can also be converted to black-and-white, a step called binarization, which makes it easier for the algorithm to tell the text apart from the background. Lines and boxes can be removed to ensure that the algorithm doesn’t interpret a line as being part of a letter. Layout analysis, also known as zoning, can be used to identify columns, paragraphs, and captions in multi-column layouts, so that the algorithm treats each section of text as a distinct block. Character isolation can be used to separate characters that have become connected in the image due to poor scan quality.
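
To make binarization a little more concrete, here is a minimal Python sketch. It assumes the scanned page is already available as a grid of grayscale values from 0 (black) to 255 (white), and it picks the cutoff with Otsu's method, which is one common way to choose a threshold automatically and not necessarily what any particular OCR product does. The function names and the tiny sample page are made up for illustration; a real pipeline would most likely lean on an image library rather than plain lists.

```python
def otsu_threshold(pixels):
    """Pick the gray level that best splits pixels into a dark and a light group."""
    # Histogram of gray levels (assumes integer values from 0 to 255).
    hist = [0] * 256
    for row in pixels:
        for value in row:
            hist[value] += 1

    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_level, best_score = 0, -1.0
    dark_count, dark_sum = 0, 0
    for level in range(256):
        dark_count += hist[level]
        dark_sum += level * hist[level]
        light_count = total - dark_count
        if dark_count == 0:
            continue  # nothing in the dark group yet
        if light_count == 0:
            break  # every pixel is already in the dark group
        dark_mean = dark_sum / dark_count
        light_mean = (sum_all - dark_sum) / light_count
        # Proportional to the between-class variance: large when the two
        # groups are well separated.
        score = dark_count * light_count * (dark_mean - light_mean) ** 2
        if score > best_score:
            best_level, best_score = level, score
    return best_level


def binarize(pixels):
    """Return a grid with 1 for ink (dark) and 0 for paper (light)."""
    threshold = otsu_threshold(pixels)
    return [[1 if value <= threshold else 0 for value in row] for row in pixels]


# Hypothetical 3x4 scan: light paper with a dark, slightly uneven stroke.
page = [
    [200, 210, 40, 205],
    [198, 35, 30, 215],
    [220, 50, 45, 207],
]
for row in binarize(page):
    print(row)
```

The point of choosing the threshold from the image itself, rather than hard-coding something like 128, is that a faded or overexposed scan shifts all the gray values, and a fixed cutoff would then swallow either the text or the background.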

Once a document has gone through pre-processing, it is passed to the character recognition algorithm. The algorithm first isolates each character it has to recognize and then breaks each one down into features such as lines, closed loops, line direction, and line intersections. These features are then compared to vector representations of all the characters the algorithm can detect. To find the closest match, the algorithm uses feature detection techniques that are also common in other computer vision tasks. Many OCR algorithms use a two-pass approach. The first pass finds the best match between each character in the image and the characters in the algorithm’s database, along with the probability that the match is correct. The second pass uses the characters that were matched with high probability to inform the identification of the remaining characters.
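
Here is a toy Python sketch of the two-pass idea. To keep it short, it swaps real feature extraction (lines, loops, intersections) for a much cruder comparison: each character is a tiny 5x3 black-and-white bitmap, and similarity is simply the fraction of pixels that agree. The templates, the sample glyphs, and the 0.75 confidence cutoff are all assumptions made up for this example, so treat it as an illustration of the control flow and not as how any specific OCR engine works.

```python
def similarity(a, b):
    """Fraction of pixels on which two equally sized binary glyphs agree."""
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


def best_match(glyph, templates):
    """Return (label, score) for the closest template bitmap."""
    best_label, best_score = None, -1.0
    for label, bitmaps in templates.items():
        for bitmap in bitmaps:
            score = similarity(glyph, bitmap)
            if score > best_score:
                best_label, best_score = label, score
    return best_label, best_score


def recognize(glyphs, templates, confident=0.75):
    """Two-pass recognition: confident matches help re-read the uncertain ones."""
    # Pass 1: match every glyph against the stock templates.
    first_pass = [best_match(glyph, templates) for glyph in glyphs]

    # Treat confident matches as extra examples of how this particular
    # document draws each character.
    adapted = {label: list(bitmaps) for label, bitmaps in templates.items()}
    for glyph, (label, score) in zip(glyphs, first_pass):
        if score >= confident:
            adapted[label].append(glyph)

    # Pass 2: re-match only the uncertain glyphs, now with the extra examples.
    results = []
    for glyph, (label, score) in zip(glyphs, first_pass):
        if score < confident:
            label, score = best_match(glyph, adapted)
        results.append((label, score))
    return results


# Hypothetical 5x3 bitmaps, written row by row ("1" is ink).
STOCK_TEMPLATES = {
    "I": ["010" "010" "010" "010" "010"],
    "L": ["100" "100" "100" "100" "111"],
}
serif_i = "010" "010" "010" "110" "111"    # this document's I has a heavy base
clean_l = "100" "100" "100" "100" "111"
smudged_i = "110" "110" "110" "110" "111"  # same I, thickened by a bad scan

glyphs = [serif_i, clean_l, smudged_i]
print("first pass :", [best_match(g, STOCK_TEMPLATES)[0] for g in glyphs])
print("second pass:", [label for label, _ in recognize(glyphs, STOCK_TEMPLATES)])
```

In this made-up example the smudged character looks more like the stock “L” than the stock “I” on the first pass, but only weakly. Once the cleaner serif “I” from the same document has been accepted with high confidence, the second pass has a better reference to compare against and the smudged character is read as “I”.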

After the OCR algorithm has identified each of the characters, the output can be run through post-processing algorithms to further increase accuracy. The output can be spell checked and constrained to a list of words that are allowed to appear in the document. The output can also be run through near-neighbor analysis, which detects errors based on how likely words are to appear in close proximity to each other. Grammar detection can also be used to identify the likely part of speech of a word, which can be useful in correcting false positives. Post-processing algorithms can also reconstruct the output in the same format as the input, meaning that if the input had two columns, the output would as well.
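
As a small illustration of constraining output to an allowed word list, here is a Python sketch that only uses the standard library. The lexicon and the table of character confusions are invented for this example (a real deployment would build both from its own documents), and the 0.8 similarity cutoff is an arbitrary assumption. The idea is to first try undoing known OCR mix-ups such as “rn” for “m”, and only then fall back to a general closest-match search with `difflib`.

```python
from difflib import get_close_matches

# Hypothetical word list for a narrow document type.
LEXICON = {"medication", "account", "number", "modern", "patient"}

# Hypothetical (misread, intended) pairs for common OCR confusions.
CONFUSIONS = [("rn", "m"), ("1", "l"), ("0", "o"), ("vv", "w")]


def candidates(word):
    """Yield variants of `word` with one confusion undone at one position."""
    for misread, intended in CONFUSIONS:
        start = 0
        while True:
            index = word.find(misread, start)
            if index == -1:
                break
            yield word[:index] + intended + word[index + len(misread):]
            start = index + 1


def correct_word(word, lexicon=LEXICON):
    """Map an OCR'd word onto the lexicon, or return it unchanged."""
    lower = word.lower()
    if lower in lexicon:
        return lower  # already an allowed word, leave it alone
    for candidate in candidates(lower):
        if candidate in lexicon:
            return candidate
    # Fallback: closest allowed word, if any is similar enough.
    close = get_close_matches(lower, sorted(lexicon), n=1, cutoff=0.8)
    return close[0] if close else word


for raw in ["rnedication", "acc0unt", "modern", "pat1ent", "xyzzy"]:
    print(raw, "->", correct_word(raw))
```

Note that “modern” is left alone even though it contains “rn”: because it is already in the lexicon, the sketch never tries to turn it into “modem”. This is the benefit of a narrow word list, and also its limit, since a word that is misread into a different valid word will slip through and needs something like the near-neighbor or grammar checks described above.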

OCR by itself is very useful in bringing text from the real world into the computerized one, but it is in combination with Natural Language Processing techniques that its utility is truly realized. In a future article we’ll talk about Intelligent OCR, how it works, and how businesses are using it today.

Victor Gebhardt
Vision, Intro