In this blog-post, we will build a Python program which performs Optical Character Recognition (OCR) and demonstrate how to leverage it for solving real-world business problems. Let us first understand the problem in brief. OCR refers to the technology which can process and convert printed text from scanned images or documents into raw text that can be manipulated by a machine.

In this blog-post, we will be walking through the following modules:

  1. Installing Tesseract OCR Engine
  2. Running Tesseract: Command Line
  3. Running Tesseract: Python
  4. Running Parallel Instances for Speed-up
  5. Building the Pipeline for a Real-World Application

You can download the samples used in this blog-post from here.

1. Installing Tesseract OCR Engine

Tesseract is a popular open-source project for OCR. You can visit the GitHub repository of Tesseract here. More recently (in 2016), the Tesseract developers implemented LSTM-based deep neural network (DNN) models (Tesseract 4.0) to perform OCR, which are more accurate and faster than the previous conventional models.

Installing Tesseract on Windows is easy with the precompiled binaries found here. You can download and install the beta version exe from the Mannheim University Library page. Do not forget to edit the “path” environment variable and add the Tesseract installation path.

2. Running Tesseract: Command Line

You can see the converted text on the command line by typing the following:
tesseract image_path stdout

To write the output text to a file (Tesseract appends the .txt extension to the output base name, producing result.txt here):
tesseract image_path result

To specify the language model (by default it uses English):
tesseract image_path result -l eng

The page segmentation mode is another parameter. It directs the layout analysis that Tesseract performs on the page. There are 14 modes available, which can be found here. By default, Tesseract fully automates the page segmentation but does not perform orientation and script detection. To specify the parameter, type the following:
tesseract image_path result -l eng --psm 6

Below is an example of a scanned restaurant bill receipt on which OCR is performed.

[Image: scanned bill receipt]

The generated text after OCR is shown below:

Ying Thai Kitchen
2220 Queen Anne AVE N
Seattle WA 98109

Tel.  Fax. 

www. yingthaikitchen.com

Welcome to Ying Thai Kitchen Restaurant,
Order#:17 Table 2
Date: 7/4/2013 7:28 PM

~ Server: Jack (T.4)
44 Ginger Lover $9.50
[Pork] [2**]

Brown Rice $2.00
Total 2 item(s) $11.50
Sales Tax $1.09
Grand Total $12.59
Tip Guide
15%=$1.89, 18%=$2.27, 20%=$2.52 ,

Thank you very much.
Come back again

To OCR multiple images in one run of Tesseract, prepare a text file (savedlist.txt) that has the path to each image:
path/to/1.png
path/to/2.png
path/to/3.tiff

Save it, and then give its name as the input file to Tesseract. The file “output.txt” will contain the text generated from all the files in the list, demarcated by a page separator character.
tesseract savedlist.txt output

3. Running Tesseract: Python

There are a few wrappers built on top of the Tesseract library in Python. Python-tesseract (pytesseract) is a Python wrapper for Google’s Tesseract-OCR. Type the pip command to install the wrapper.

pip install pytesseract

Once you install the wrapper package, you are ready to write Python code for performing OCR. Just note that pytesseract is only a wrapper to access the methods of Tesseract and still requires Tesseract to be installed on the system. We will write a simple Python function def ocr(img_path) to perform OCR. It takes an image path as input, performs OCR, writes the generated text to a .txt file and returns the output file name.

import os
import re
import cv2
import pytesseract

def ocr(img_path):
    out_dir = "ocr_results/"
    # Read the image and run Tesseract on it (English model, page segmentation mode 6)
    img = cv2.imread(img_path)
    text = pytesseract.image_to_string(img, lang='eng', config='--psm 6')
    # Name the output .txt file after the input image file
    out_file = re.sub(r"\.png$", ".txt", os.path.basename(img_path))
    out_path = os.path.join(out_dir, out_file)
    # Write the recognized text to the output file and close it
    with open(out_path, "w") as fd:
        fd.write("%s" % text)
    return out_file
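
A quick usage example (with a hypothetical image path) looks like this; if the Tesseract binary is not on your PATH, you can also point pytesseract to it explicitly:

import os
os.makedirs("ocr_results/", exist_ok=True)  # ocr() writes its output here
# Only needed if tesseract is not on PATH; adjust to your install location:
# pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
print(ocr("test/bill.png"))  # hypothetical path; writes ocr_results/bill.txt and returns 'bill.txt'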

4. Running Parallel Instances for Speed-up

In the previous section, we defined a function which takes an input image path and converts it into readable text. An obvious question of scale comes in when we have to process a large number of images, for example 1 million images. With that in mind, I am penning down some of the ideas which one can try.

  1. Multi-Processing: If the system has 4 physical cores, one can run 4 parallel instances of Tesseract and thus perform OCR on 4 images in parallel.
  2. Multi-page Feature: The multi-page feature of Tesseract is much faster than converting single images sequentially. To speed up the process, one should make a list of image paths and feed it to Tesseract.
  3. Using SSDs or RAM as Disk: If there is a large number of images, this can save a lot of I/O time. SSDs have faster access and loading times.
  4. Running on a Distributed System: Use MPI for Python on a distributed system and scale it as much as you want. It is different from single-machine parallelism as it is not limited to the number of cores of one system, though you may have to bear more cost in terms of hardware (a minimal mpi4py sketch follows this list).
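
For idea 4, here is a minimal sketch using mpi4py (an assumption on my part; the post itself does not ship such a script). Each MPI rank OCRs its own share of the image list by calling the ocr() function from Section 3, and the script is launched with something like mpiexec -n 4 python ocr_mpi.py.

from mpi4py import MPI
import glob

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    # Rank 0 builds the image list and splits it into one chunk per rank
    image_list = glob.glob("test/*.png")
    chunks = [image_list[i::size] for i in range(size)]
else:
    chunks = None

# Every rank receives its own chunk of image paths and processes it independently
my_images = comm.scatter(chunks, root=0)
for img_path in my_images:
    ocr(img_path)  # the ocr() function defined in Section 3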

Basically, in a multi-process setup on a single server with 15-20 cores and SSD storage, one could process 1 million images in a day, although a lot depends on the implementation. Below is a simple implementation that runs multiple instances of Tesseract in parallel over all the cores of the system using the concurrent.futures library.

import os
import glob
import concurrent.futures
import time

# Limit Tesseract's internal OpenMP threading so that each worker process
# uses a single thread and the processes do not oversubscribe the cores
os.environ['OMP_THREAD_LIMIT'] = '1'

def main():
    # Assumes the ocr() function from the previous section is defined in this script
    path = "test"
    if os.path.isdir(path):
        out_dir = "ocr_results/"
        if not os.path.exists(out_dir):
            os.makedirs(out_dir)

        # Run ocr() on every .png in the folder using a pool of worker processes
        with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
            image_list = glob.glob(os.path.join(path, "*.png"))
            for img_path, out_file in zip(image_list, executor.map(ocr, image_list)):
                print(os.path.basename(img_path), ',', out_file, ', processed')

if __name__ == '__main__':
    start = time.time()
    main()
    end = time.time()
    print(end - start)

We call the map function of the ProcessPoolExecutor class, which takes a function and the list of images as input. It distributes the image paths among the worker processes and executes the passed function on each core in parallel.

5. Building the Pipeline for a Real-World Application

The architecture of the OCR application consists of 4 main components. They are described below in sequence:
1. Image Corrector:

  • The idea is to prepare the input image for better text recognition in the OCR component.
  • Rectification of the image (image correction).
  • Removal of borders from the image.
  • Text segmentation and background cleaning.
  • Use of OpenCV and image-processing tools like ImageMagick (a minimal OpenCV sketch follows this list).
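
As a minimal sketch of this component (my own illustration, not the exact pipeline used in the application), the image can be converted to grayscale, denoised and binarized with OpenCV before it is handed to the OCR component:

import cv2

def preprocess(img_path, out_path="cleaned.png"):
    img = cv2.imread(img_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Light median blur to suppress background noise
    blur = cv2.medianBlur(gray, 3)
    # Adaptive thresholding separates dark text from an uneven background
    clean = cv2.adaptiveThreshold(blur, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                  cv2.THRESH_BINARY, 31, 11)
    cv2.imwrite(out_path, clean)
    return out_path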

2. Optical Character Recognizer

  • Implementation of a state-of-the-art technique for OCR.
  • Using the open-source model: Tesseract.
  • It is a DNN based on long short-term memory (LSTM), released in 2016.
  • Training of Tesseract is required for recognizing new fonts or handwritten text.

3. Text Processor & Corrector

  • Implementation of a spell-checker to further improve accuracy.
  • The generated text needs post-processing in order to extract the important fields.
  • Use of regex and text-processing libraries (a field-extraction sketch follows this list).
  • If necessary, we may reconstruct the layout of the text.
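
As a hedged sketch of this step, the fields of the receipt text generated in Section 2 could be pulled out with regular expressions; the patterns below are illustrative and tuned to that particular receipt layout:

import re

def extract_fields(text):
    fields = {}
    # Date in the form "Date: 7/4/2013 7:28 PM"
    m = re.search(r"Date:\s*([\d/]+\s+[\d:]+\s*[AP]M)", text)
    if m:
        fields["date"] = m.group(1)
    # Amounts such as "Sales Tax $1.09" and "Grand Total $12.59"
    m = re.search(r"Sales Tax\s*\$([\d.]+)", text)
    if m:
        fields["sales_tax"] = float(m.group(1))
    m = re.search(r"Grand Total\s*\$([\d.]+)", text)
    if m:
        fields["grand_total"] = float(m.group(1))
    return fields

# e.g. extract_fields(receipt_text)
# -> {'date': '7/4/2013 7:28 PM', 'sales_tax': 1.09, 'grand_total': 12.59}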

4. Data Population & Insight Generation

  • The extracted fields are populated in a database (unstructured to structured data); see the sketch after this list.
  • This augments the features/variables and improves the data quality.
  • Insight generation to help the business.
  • Can be utilized for creating a document exploration system.
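
A minimal sketch of the population step (again my own illustration; the actual application may use a different database) could store the extracted fields in a SQLite table so they can be queried for insights later:

import sqlite3

def populate(db_path, source_file, fields):
    conn = sqlite3.connect(db_path)
    # Create the table on first use, then insert one row per processed receipt
    conn.execute("""CREATE TABLE IF NOT EXISTS receipts
                    (source_file TEXT, bill_date TEXT,
                     sales_tax REAL, grand_total REAL)""")
    conn.execute("INSERT INTO receipts VALUES (?, ?, ?, ?)",
                 (source_file, fields.get("date"),
                  fields.get("sales_tax"), fields.get("grand_total")))
    conn.commit()
    conn.close()

# e.g. populate("receipts.db", "bill.txt", extract_fields(receipt_text))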

A basic architecture of the end-to-end application is given below:

[Architecture diagram: OCR application]

Important Note

There are a few important things to keep in mind while building a Tesseract-based OCR application for solving a business problem.

  • The standard input image formats for Tesseract are “.tiff” and “.png”. It is better to convert all other formats to “.tiff”.
  • Tesseract OCR works best with high-resolution images. It is recommended to convert all images to 300 DPI (use ImageMagick; a small Pillow-based sketch follows this list).
  • By default, Tesseract uses 4 threads for OCR. It is better to limit it to a single thread per image (e.g. via OMP_THREAD_LIMIT=1, as in the code above) since that reduces overhead; one can then run multiple instances in parallel.
  • OCR accuracy is affected by borders and lines in the images, and background cleaning is required for better results. If you are not getting good results with Tesseract, you may want to improve the image quality (look for Fred’s Textcleaner script).
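
For the resolution note above, a small Pillow-based sketch (an assumption on my part; the post itself recommends ImageMagick) re-saves an image as a 300 DPI TIFF. Note that this only tags the DPI metadata; very low-resolution scans may also need upscaling:

from PIL import Image

def to_300dpi_tiff(img_path, out_path):
    img = Image.open(img_path)
    # Save as TIFF and tag it with a resolution of 300 DPI
    img.save(out_path, format="TIFF", dpi=(300, 300))
    return out_path

# e.g. to_300dpi_tiff("bill.jpg", "bill.tiff")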

At the End

Hope it was a pleasant read for all of you. I would encourage readers to reproduce the results demonstrated in the blog-post with the Python scripts. There is still a lot to explore in Tesseract. Further, one can look into:

  1. Detecting the layout of text in images using bounding boxes and their confidence scores. You can look for the other pytesseract functions which give such finer details for OCR (a small sketch follows this list).
  2. Training Tesseract for new text fonts through transfer learning on the LSTM models in order to improve accuracy.
  3. Understanding the LSTM-based Tesseract models and training them from scratch in order to perform handwritten text recognition.
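
For point 1, a small sketch using pytesseract's image_to_data helper (with a hypothetical image name) prints word-level bounding boxes with their confidence values:

import cv2
import pytesseract

img = cv2.imread("bill.png")  # hypothetical image name
data = pytesseract.image_to_data(img, lang='eng', config='--psm 6',
                                 output_type=pytesseract.Output.DICT)
for i, word in enumerate(data['text']):
    # Skip empty tokens and entries with no confidence value
    if word.strip() and float(data['conf'][i]) > 0:
        print(word, data['conf'][i],
              (data['left'][i], data['top'][i], data['width'][i], data['height'][i]))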

To showcase the end-to-end application, I developed a basic Qt desktop application. The video below demonstrates the idea.

Update: Readers can download a backup of the Qt application from the link below. It was old work and I do not have a working app now. You may have to figure out the code in order to reproduce it. Good luck with that.

 

If you liked the post, follow this blog to get updates about upcoming articles. Also, share this article so that it can reach readers who can benefit from it. Please feel free to discuss anything regarding solving such business problems.

Happy machine learning 🙂