How it Works: Optical Character Recognition

Posted by caroline on March 14, 2013

Optical Character Recognition, or OCR, is a technology that takes images containing text and converts them into editable digital formats. With OCR scanning, you can process books and other typed documents. OCR scanning systems can also handle most images, even those with complex formatting such as pages that mix photos and columns of text.

So how does OCR scanning work? There are several steps involved:

Loading an image from a given source. An OCR scanning system has to be able to read many different image formats: JPEG, PNG, GIF, BMP, TIFF (single-page and multi-page) and PDF are a few of them. Many scanned documents are stored as images inside PDF files, and the only way to get the text out of such files is with OCR scanning.
Next, the system detects the most important image features, including the resolution and the text. OCR algorithms expect a predefined range of font sizes and page colors, so the image is adjusted before it’s processed when necessary.
If an image is off center or has a lot of noise (stray black or white spots), the OCR program can improve the image quality with some quick corrections.
Most OCR scanning algorithms require an image that contains only black and white, which is called a bi-tonal image. Color and gray images are converted into this form in a process called binarization. It’s a very important step, because if it’s skipped the text will not be recognized correctly.
Detecting and removing lines is another thing OCR scanning will take care of. This is a required step to improve the analysis of the page layout. It also improves recognition quality for things such as underlined text and tables on the page.
The next step is called “zoning” or page layout analysis. The OCR system will detect the position and type of important areas on the page.
Detecting text lines and words. This can be difficult because of differences in font sizes and in the amount of space between words.
Analyzing broken characters and combining them. Sometimes a character is broken into several parts, or neighboring characters touch each other, so it is necessary to detect and correct the position of each character.
The most important part of any OCR scan is the recognition of characters. Each character image has to be converted into the actual character it represents. If a character in the image is uncertain, the algorithm will produce more than one candidate for it, and the final character can be selected later.
Some OCR scanners have dictionary support to help improve recognition quality. Certain characters can look very similar, and the dictionary can help decide what a character should be based on the characters around it.
After all of this is said and done, the OCR algorithm will save the end results to the selected output file type, which could be a searchable PDF, DOC, RTF, or TXT. It’s very important to keep the original page in its original format as well.
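
To make the binarization step above a little more concrete, here is a minimal Python sketch. It assumes the page is already a grayscale image stored as rows of pixel values from 0 (black) to 255 (white). It picks the cut-off with Otsu's method, which is one common choice and not necessarily what any particular OCR product uses. The function names are made up for this illustration.

```python
def otsu_threshold(pixels):
    """Return the gray level (0-255) that best separates dark from light pixels.

    `pixels` is a flat list of integers in the range 0..255.
    Otsu's method tries every threshold and keeps the one that maximizes
    the between-class variance of the two resulting groups.
    """
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    weight_dark = 0      # number of pixels at or below the threshold
    sum_dark = 0         # sum of gray levels of those pixels
    best_threshold = 0
    best_variance = -1.0

    for level in range(256):
        weight_dark += histogram[level]
        sum_dark += level * histogram[level]
        if weight_dark == 0:
            continue                 # no dark pixels yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break                    # every pixel is already in the dark group
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_variance = variance
            best_threshold = level
    return best_threshold


def binarize(image):
    """Convert a grayscale image (list of rows) to pure black (0) and white (255)."""
    flat = [value for row in image for value in row]
    threshold = otsu_threshold(flat)
    return [[0 if value <= threshold else 255 for value in row] for row in image]


# Tiny example: three dark "ink" pixels and three light "paper" pixels.
page = [[30, 40, 200],
        [35, 210, 220]]
print(binarize(page))
```

Real systems often use a different threshold for each region of the page, since lighting and paper color can vary across a scan. A single global threshold like this one is only the simplest case.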

When it comes to OCR scanning, every step is important, and if one step fails then the entire process fails. Every single algorithm must work on the widest possible range of images to be effective. To achieve the best results, a good OCR system lets you adjust the parameters of each separate algorithm, as this can sometimes be the only way to improve recognition quality.
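
As a rough illustration of how an adjustable parameter can change the outcome, here is a small Python sketch of the dictionary step described earlier. It assumes the recognizer returns, for every character position, a list of candidate characters with confidence scores. That data layout, the function name `choose_word` and the `min_confidence` parameter are all hypothetical, not taken from any real OCR library.

```python
from itertools import product
from math import prod


def choose_word(candidates, dictionary, min_confidence=0.0):
    """Pick the most plausible word from per-character candidates.

    candidates: one list per character position, each holding
                (character, confidence) pairs.
    dictionary: a set of known lowercase words.
    min_confidence: tunable cut-off; weaker alternatives are discarded,
                    but the best candidate at each position is always kept.
    """
    pruned = []
    for alternatives in candidates:
        ranked = sorted(alternatives, key=lambda alt: alt[1], reverse=True)
        kept = [alt for alt in ranked if alt[1] >= min_confidence] or ranked[:1]
        pruned.append(kept)

    # Fallback: the top-ranked character at every position.
    best_raw = "".join(alts[0][0] for alts in pruned)

    best_word, best_score = None, -1.0
    for combination in product(*pruned):
        word = "".join(char for char, _ in combination)
        score = prod(conf for _, conf in combination)
        if word.lower() in dictionary and score > best_score:
            best_word, best_score = word, score

    return best_word if best_word is not None else best_raw


# The recognizer is unsure whether the second character is "1" or "l".
scan = [[("c", 0.9)], [("1", 0.6), ("l", 0.4)], [("e", 0.9)], [("a", 0.9)], [("r", 0.9)]]
print(choose_word(scan, {"clear"}))                      # dictionary fixes the word
print(choose_word(scan, {"clear"}, min_confidence=0.5))  # "l" is pruned too early
```

With the default setting the dictionary can rescue the word. With a stricter cut-off the correct letter is thrown away before the dictionary ever sees it, which is the kind of trade-off that tuning per-step parameters is meant to manage.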
