Logic Validation vs. Data Validation
Software can be tested or validated at various levels using a variety of techniques. Two necessary forms of software testing are (1) logic validation and (2) data validation. Logic validation is the process of determining whether the applied logic (in the form of code) exhibits the desired behaviour, whereas data validation is the process of ensuring that the input data supplied to a system is valid and consistent with the data types the system expects. Many tools exist that facilitate and standardize logic validation practices in the industry. The techniques for data validation, however, are particular to each use case.

At Wealthfront, we practice data validation in our continuous deployment model to ensure that each deployed service functions correctly with the new set of data. As described in the linked blog post, the technique employed is customized for the specific use case in question. Another scenario where we use data validation is verifying the contents of document images we receive from external partners. Because these documents are images embedded in PDFs, we cannot use standard PDF parsing techniques to validate their content. Instead, we use a technique called optical character recognition (OCR).

Optical Character Recognition (OCR)
Optical character recognition (OCR) refers to the automated process of translating images of text into machine-encoded text, such as ASCII. It is widely used in commercial applications to store, edit, search and analyze text documents (typewritten or printed). OCR completes in a matter of seconds what would otherwise be a cumbersome manual task. It works by scanning your images, extracting the contained text, splitting the text into characters and then recognizing those characters. It can be trained to recognize a variety of fonts, languages and even handwritten text.

In the open source world, Tesseract is arguably the leading and most accurate OCR engine. Originally developed as a PhD research project at Hewlett-Packard (HP) in the 1980s, Tesseract has been significantly enhanced by Google since it became open source. At Wealthfront, we use Tesseract to do OCR validation on scanned PDF documents. Since Tesseract uses the Leptonica image processing libraries to perform OCR, it only works with image files such as PNGs or TIFFs and cannot work with PDFs directly. It needs to be combined with a PDF interpreter such as Ghostscript, which interprets PostScript and PDF files and can convert them to image files.

To perform OCR in Java code, you need a Java Native Access (JNA) wrapper that simplifies native library access to the Tesseract OCR engine. Tess4J is a JNA wrapper that combines the Tesseract DLLs with Ghostscript to support PDF documents. The following sample Java code takes a scanned PDF document, converts it into PNGs, and then performs OCR using the Tess4J libraries:

    import java.io.File;

    import net.sourceforge.tess4j.Tesseract;
    import net.sourceforge.tess4j.TesseractException;
    import net.sourceforge.tess4j.util.PdfUtilities;

    public class TestOCR {
        public void performOCR() throws TesseractException {
            // Newer Tess4J releases may require `new Tesseract()` instead.
            Tesseract instance = Tesseract.getInstance();
            File pdfDoc = new File("/my/file/location/doc.pdf");
            // Ghostscript renders each page of the PDF to its own PNG file.
            File[] pngImageFiles = PdfUtilities.convertPdf2Png(pdfDoc);
            for (File pngImageFile : pngImageFiles) {
                // Run OCR on one page image and print the recognized text.
                String ocrResult = instance.doOCR(pngImageFile);
                System.out.println(ocrResult);
            }
        }
    }
Ghostscript Performance Enhancements
There are settings that can be tuned to increase the performance of Ghostscript. If you use the default convertPdf2Png method in Tess4J’s PdfUtilities, custom settings cannot be applied. However, you can always write your own wrapper for Ghostscript and calibrate its settings to optimize the performance of your program. Ghostscript suggests enabling multithreaded rendering via -dNumRenderingThreads=n (which renders bands concurrently on multi-core systems) or giving it more memory. In our experiments, however, these options offered little to no improvement for our set of input data.

Output Resolution
The resolution at which you perform the document conversion has a direct impact on Ghostscript performance, with a trade-off in the quality of the output image file. Converting documents at a lower DPI reduces the conversion time but makes the OCR interpretation less accurate, and vice versa. You can specify the output image resolution with the -rres argument, where res is the resolution in DPI (for example, -r300). By default, Ghostscript converts images at 72 DPI, which is quite low. The following table compares runtime at different DPIs, relative to the 72 DPI default:

| Conversion resolution (DPI) | 72 (default) | 100 | 200 | 250 | 300 | 400 |
| --- | --- | --- | --- | --- | --- | --- |
| Runtime increase | baseline | ~1.25X | ~2.2X | ~2.8X | ~3.5X | ~13X |
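
The resolution and threading options discussed above can be passed to Ghostscript from a small custom wrapper. The following is a minimal sketch, not Tess4J’s implementation: the class `GhostscriptCommand`, its method names, the `gs` executable name, the `png16m` output device and the output file pattern are all assumptions chosen for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Tess4J) that shells out to Ghostscript. */
final class GhostscriptCommand {

    /** Builds the argument list for a PDF-to-PNG conversion at the given resolution. */
    static List<String> build(File pdf, File outputDir, int dpi, int renderingThreads) {
        if (dpi <= 0 || renderingThreads <= 0) {
            throw new IllegalArgumentException("dpi and renderingThreads must be positive");
        }
        List<String> args = new ArrayList<>();
        args.add("gs");               // assumes the Ghostscript binary is on the PATH
        args.add("-dNOPAUSE");        // do not prompt between pages
        args.add("-dBATCH");          // exit once the input has been processed
        args.add("-dSAFER");          // restrict file access during interpretation
        args.add("-sDEVICE=png16m");  // 24-bit colour PNG output (assumed device)
        args.add("-r" + dpi);         // output resolution in DPI
        args.add("-dNumRenderingThreads=" + renderingThreads);
        // Ghostscript expands %03d to the page number: page-001.png, page-002.png, ...
        args.add("-sOutputFile=" + new File(outputDir, "page-%03d.png").getPath());
        args.add(pdf.getPath());
        return args;
    }

    /** Runs the conversion and returns Ghostscript's exit code. */
    static int run(List<String> args) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(args).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] argv) {
        // Print the command that would be executed for a 300 DPI conversion.
        List<String> args = build(new File("doc.pdf"), new File("out"), 300, 4);
        System.out.println(String.join(" ", args));
    }
}
```

Keeping command construction separate from execution makes it easy to time the same document at several resolutions by changing only the `dpi` argument.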
Selective Page Conversion
Another useful option is selective page conversion, for use cases where you only want to perform OCR on selected pages of a document. Converting only those pages instead of the entire document significantly reduces runtime, especially for larger documents. You can specify the range of pages you want to convert using the following two options: -dFirstPage=1 -dLastPage=n. Even if the document size is unknown prior to conversion, you can use any PDF reader (such as Apache PDFBox) to retrieve the page count. Converting pages one at a time costs roughly the same per page as converting the entire document, since there isn’t any noticeable overhead associated with Ghostscript initialization. Significant performance improvements from selective page conversion start to kick in for documents over 20 pages. The following table gives a relative comparison of conversion times for different document sizes:

| Pages in PDF | < 5 pages | ~10 pages | ~50 pages | ~100 pages |
| --- | --- | --- | --- | --- |
| Runtime, converting 3 pages individually | ~1 sec | ~1 sec | ~1 sec | ~1 sec |
| Runtime, converting entire document | ~1 sec | ~2 sec | ~11 sec | ~20 sec |
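
When the pages of interest are scattered through a document, one way to drive the -dFirstPage and -dLastPage options is to group the pages into contiguous ranges and convert each range separately. The following is a minimal sketch under that assumption; `PageRanges` and its methods are hypothetical names, and `pageCount` is assumed to have been read beforehand with a PDF reader.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical helper that turns selected pages into Ghostscript page ranges. */
final class PageRanges {

    /**
     * Collapses 1-based page numbers into sorted, contiguous {first, last} ranges.
     * Pages outside 1..pageCount are ignored.
     */
    static List<int[]> collapse(Iterable<Integer> pages, int pageCount) {
        TreeSet<Integer> sorted = new TreeSet<>();
        for (int page : pages) {
            if (page >= 1 && page <= pageCount) {
                sorted.add(page);
            }
        }
        List<int[]> ranges = new ArrayList<>();
        for (int page : sorted) {
            int[] last = ranges.isEmpty() ? null : ranges.get(ranges.size() - 1);
            if (last != null && last[1] + 1 == page) {
                last[1] = page;                       // extend the current range
            } else {
                ranges.add(new int[] {page, page});   // start a new range
            }
        }
        return ranges;
    }

    /** Renders one range as the two Ghostscript options named in the text. */
    static List<String> toGhostscriptOptions(int[] range) {
        return Arrays.asList("-dFirstPage=" + range[0], "-dLastPage=" + range[1]);
    }

    public static void main(String[] argv) {
        for (int[] range : collapse(Arrays.asList(3, 1, 2, 10), 100)) {
            System.out.println(String.join(" ", toGhostscriptOptions(range)));
        }
    }
}
```

Each resulting range maps to one Ghostscript invocation, so adjacent pages share a single run while distant pages avoid converting everything in between.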
Tesseract Performance Enhancements
The next bottleneck is the core Tesseract OCR process, which can also be tuned for performance. One optimization available through the Tess4J wrapper method for OCR (doOCR) is calling it with a Rectangle. The Rectangle bounds the region of the image that needs to be recognized while performing OCR. In our test runs, OCR ran about 4x faster with a Rectangle of dimensions (0, 0, 1000, 1000) than without one.

Using Rectangle
Following are the runtime improvements from sample runs when using Rectangles of different sizes:

| Rectangle dimensions | No Rectangle | (0, 0, 1000, 1000) | (0, 0, 1500, 1500) | (0, 0, 2000, 2000) |
| --- | --- | --- | --- | --- |
| Runtime improvement | baseline | 4X | 2X | 1.4X |
| Runtime (seconds) | 1.6 | 0.4 | 0.7 | 1.1 |
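
Because a Rectangle is expressed in pixels, the same physical area of a page covers a different pixel region at each conversion resolution. The following is a minimal sketch of converting a region given in inches into a pixel Rectangle clamped to the page image; `OcrRegion` and `toPixels` are hypothetical names, and specifying regions in inches is an assumption made for illustration rather than something the measurements above rely on.

```java
import java.awt.Rectangle;

/** Hypothetical helper that computes the pixel region to hand to the OCR call. */
final class OcrRegion {

    /**
     * Converts a region measured in inches from the top-left corner of the page
     * into pixels at the given DPI, keeping only the part inside the image.
     */
    static Rectangle toPixels(double xIn, double yIn, double widthIn, double heightIn,
                              int dpi, int imageWidth, int imageHeight) {
        Rectangle region = new Rectangle(
                (int) Math.round(xIn * dpi),
                (int) Math.round(yIn * dpi),
                (int) Math.round(widthIn * dpi),
                (int) Math.round(heightIn * dpi));
        // Clamp to the bounds of the rendered page image.
        return region.intersection(new Rectangle(0, 0, imageWidth, imageHeight));
    }

    public static void main(String[] argv) {
        // A 5 x 5 inch area on a letter-size page (8.5 x 11 inches) rendered at 200 DPI.
        Rectangle region = toPixels(0, 0, 5, 5, 200, 1700, 2200);
        System.out.println(region);
    }
}
```

The resulting Rectangle would be passed as the second argument to Tess4J’s doOCR(File, Rectangle). A 5 by 5 inch area at 200 DPI corresponds to a 1000 by 1000 pixel rectangle, while the same area at 300 DPI needs 1500 by 1500 pixels.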
Accuracy of Tesseract OCR Process
In terms of accuracy, Tesseract’s OCR is not completely precise and exhibits some variance when interpreting text images into ASCII. Common variances include:
- Misinterpreting letter case: reading uppercase letters as lowercase and vice versa
- Mistaking letters, numbers and symbols that have similar shapes, such as:

| Actual character | OCR-interpreted character |
| --- | --- |
| 0 | O |
| 0 | ° |
| I | \| (vertical bar) |
| I | : |
| 5, 6 or 8 | S |
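
One way to make validation tolerant of the substitutions listed above is to map each character to a representative of its confusion class before comparing the OCR output with the expected text. The following is a minimal sketch of that idea and is not presented as the approach used in production; `OcrText` and its methods are hypothetical names, and the confusion classes are taken only from the examples above.

```java
/** Hypothetical helper for comparing expected text with OCR output. */
final class OcrText {

    /** Maps a character to a representative of its (assumed) confusion class. */
    static char canonical(char c) {
        char upper = Character.toUpperCase(c);  // ignore letter case
        switch (upper) {
            case '0': case 'O': case '\u00B0':  // zero, letter O, degree sign
                return '0';
            case 'I': case '|': case ':':       // letter I, vertical bar, colon
                return 'I';
            case '5': case '6': case '8': case 'S':
                return 'S';
            default:
                return upper;
        }
    }

    /** Applies canonical() to every character of the text. */
    static String canonicalize(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            out.append(canonical(text.charAt(i)));
        }
        return out.toString();
    }

    /** True when the two strings differ only by case or by the confusions above. */
    static boolean looselyMatches(String expected, String ocrOutput) {
        return canonicalize(expected).equals(canonicalize(ocrOutput));
    }

    public static void main(String[] argv) {
        System.out.println(looselyMatches("ID 1005", "|d 1OOS"));
        System.out.println(looselyMatches("ID 1005", "ID 2005"));
    }
}
```

This trades precision for tolerance: strings such as "565" and "SSS" become indistinguishable, so a comparison like this suits fields where such collisions are unlikely or are checked by other means.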

| Conversion resolution (DPI) | < 100 | 150 | 200 | 250 | 300 | 400 |
| --- | --- | --- | --- | --- | --- | --- |
| Failure rate | > 99% | ~51% | ~18% | negligible | negligible | negligible |