Data Capture Accuracy in OCR Systems

One of our most frequently asked questions is how accurate Optical Character Recognition (OCR) systems are. The answer is extremely accurate, but only if the source document is of good quality and is scanned correctly.

When form-handling OCR systems are set up, the software gives the OCR engine considerable help. For example, you can specify the type of data you expect to find in each data field, which considerably narrows down the options available to the OCR engine.

An obvious example of this is a field containing fee amounts. Clearly, there is a restricted palette of available characters in an amount field: the digits, a decimal point and a minus sign, plus a few letters that may denote negatives (“DR”, for example).
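As a rough illustration of that restricted palette, the sketch below checks whether a captured value uses only characters that could appear in an amount field. The names `AMOUNT_CHARS`, `NEGATIVE_MARKERS` and `is_plausible_amount` are hypothetical, and the assumption that “DR” appears as a suffix is ours; a real OCR product would normally apply this kind of restriction inside the engine rather than after the fact.

```python
# Characters allowed in an amount field, as described above.
AMOUNT_CHARS = set("0123456789.-")

# Assumption: a negative marker such as "DR" appears as a suffix.
NEGATIVE_MARKERS = ("DR",)


def is_plausible_amount(text):
    """Return True if text uses only the amount-field character set."""
    text = text.strip()
    for marker in NEGATIVE_MARKERS:
        if text.endswith(marker):
            # Remove the marker and any space before it.
            text = text[: -len(marker)].strip()
            break
    # An empty value is not a valid amount.
    return bool(text) and all(ch in AMOUNT_CHARS for ch in text)
```

A value such as "125.50 DR" passes this check, while a misread such as "l25.5z" is rejected and can be flagged for review.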

Common Mistakes

Common mistakes for OCR engines to make would be an “l” for a “1” or a “z” for a “2”. So by restricting the choices available to the OCR engine for that field, we eliminate the possibility of “l” and “z”. By applying rules, or even translations, to particular fields, it is possible to increase accuracy greatly.
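A translation of this kind can be sketched in a few lines. The table below contains only the two confusions mentioned above, and the function name `correct_numeric_field` is hypothetical; this is an illustration of the idea, not a description of how any particular OCR engine implements it.

```python
# Look-alike characters that cannot be valid in a numeric field,
# mapped to the digit they were most likely meant to be.
LOOKALIKES = str.maketrans({"l": "1", "z": "2"})


def correct_numeric_field(raw):
    """Translate known look-alike letters into digits."""
    return raw.translate(LOOKALIKES)
```

Because the field is known to be numeric, the translation is safe there. The same table applied to a free-text field would corrupt genuine letters, which is why rules like this are attached to particular fields.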

Of course, that is an easy example. If we were looking at a policy number field, we would have a greater palette of available characters and more scope for misreads. But it is usually possible to discern patterns of numbers and to eliminate characters that are never used.

For example, if we had policy numbers in the pattern AA99999-AA (where A is alphabetic and 9 is numeric), we can program the engine to recognise the pattern and translate the characters into the correct format.
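The AA99999-AA pattern lends itself to a position-by-position check. The following is a minimal sketch under our own assumptions: the look-alike tables and the function name `normalise_policy_number` are hypothetical, and the letters are assumed to be upper case. Each character is corrected according to whether its slot should hold a letter or a digit, and the result is accepted only if it matches the full pattern.

```python
import re

MASK = "AA99999-AA"  # A = alphabetic slot, 9 = numeric slot
POLICY_PATTERN = re.compile(r"[A-Z]{2}[0-9]{5}-[A-Z]{2}")

# Hypothetical look-alike tables, applied only in the relevant slots.
TO_DIGIT = {"l": "1", "I": "1", "z": "2", "Z": "2", "O": "0"}
TO_LETTER = {"1": "I", "2": "Z", "0": "O"}


def normalise_policy_number(raw):
    """Return the corrected policy number, or None if it cannot fit."""
    raw = raw.strip()
    if len(raw) != len(MASK):
        return None
    fixed = []
    for ch, slot in zip(raw, MASK):
        if slot == "9":
            fixed.append(TO_DIGIT.get(ch, ch))
        elif slot == "A":
            fixed.append(TO_LETTER.get(ch, ch).upper())
        else:
            fixed.append(ch)  # the literal hyphen
    candidate = "".join(fixed)
    return candidate if POLICY_PATTERN.fullmatch(candidate) else None
```

A misread such as "AB1z345-CD" is corrected to "AB12345-CD", whereas a value that still breaks the pattern after correction returns None so that it can be sent for manual checking rather than silently accepted.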

However, all of this depends on the quality of the document presented. When we see poor results, it is always down to the condition of the original document or the way the document scanner has been set up. So how do we get the best quality documents?

The Best Quality Documents

The best quality PDF is one that has been downloaded from a provider’s website.

These documents will be computer generated. They will not have started life as paper and then been scanned. As a result, the characters will be clean, with good contrast between the characters and the background. The accuracy achieved with these documents will be extremely high. It does not matter if there is colour in these documents; all you need to do is download the PDF and process it.

When the statement originates as paper, and it is scanned, it is crucial that the scanner is set up correctly. If not, it will probably not be possible to process the statement accurately.

The first thing to do is scan only original and unmarked documents. If you mark the paper with a pen, for example by ticking off transactions, the OCR engine will try to interpret your ticks as data.

When you set up the scanner, you should scan at a high dpi (dots per inch) setting; we recommend 300 dpi. The scanner should always be set to black and white, and definitely not colour or greyscale.
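If scan settings are recorded somewhere, these recommendations are easy to check automatically before a document is processed. The sketch below is purely illustrative: the function name `scan_setting_problems` and the mode label "black_and_white" are hypothetical, and how you obtain the dpi and colour mode depends on your scanner software.

```python
RECOMMENDED_DPI = 300
RECOMMENDED_MODE = "black_and_white"  # hypothetical label for the mode


def scan_setting_problems(dpi, mode):
    """Return a list of ways the settings differ from the recommendations."""
    problems = []
    if dpi < RECOMMENDED_DPI:
        problems.append(
            f"Resolution {dpi} dpi is below the recommended {RECOMMENDED_DPI} dpi"
        )
    if mode != RECOMMENDED_MODE:
        problems.append(f"Colour mode '{mode}' should be '{RECOMMENDED_MODE}'")
    return problems
```

An empty list means the scan matches the recommended settings; anything else is a reason to rescan before processing.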

When you magnify a black and white image, you will see sharp characters and, importantly, white space around each character. The white space is crucial: if the characters touch each other, you will get very poor results.

Now open a PDF scanned in colour or greyscale and magnify it. You will see a blurring of each character. For the short-sighted amongst us, it is like trying to read without glasses.

One final tip: don’t destroy the paper until it has been processed. If you need to rescan, you can’t if you have already shredded the paper.

So the good news is that you are the one in control of data quality. If you follow these simple steps, we can set up the system to produce really accurate data, far more accurate than human rekeying can achieve. If you need any help with your document scanning, book a time slot here for a no-obligation chat.