How To Use AWS Textract OCR To Pull Text and Data From Documents

Many companies use human workers to do manual data entry on forms, applications, and other physical documents. While this is very accurate, it’s slow and costly. AWS Textract uses machine learning to automate this process.

Why Use AWS Textract?

Textract certainly isn’t the only Optical Character Recognition tool; there are plenty of open source solutions available for free, such as Tesseract OCR. You can read our guide to Tesseract to learn more.

Textract, however, is a lot more than simple OCR as it’s meant for analyzing and extracting data from forms, tables, and other documents. It’s able to pull out important key-value pairs, tables, and other key strings, which makes it actually usable as an interface between scanned documents and a database (though you’ll need to set that automation up yourself).

The other allure is that Textract makes OCR available as a fully managed cloud service. You don’t need to set up your own application servers to run OCR and interpret the output; just configure Textract and send it some documents, and it will output the results.

For companies still doing manual data entry, Textract can save a lot of money, both by reducing the hours spent typing on a keyboard and by batch processing many items at once, which speeds up data entry immensely.

In terms of price, Textract is cheapest for straight up text, like scanning pages of books. For that, it only costs $1.50 per 1000 pages. For analyzing tables, it costs $15.00 per 1000 pages. For key-value pairs, it costs $50.00 per 1000 pages. While that’s not exactly free, it sure beats paying a human to do it manually.
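To get a feel for those numbers, here is a rough sketch of the arithmetic in Python. The prices are the ones quoted above and may be out of date, and the function name and the "text", "tables", and "forms" labels are made up for this example rather than taken from any AWS API.

```python
# Per-1,000-page prices as quoted in this article (assumption: they may
# have changed, so check the current AWS pricing page before budgeting).
PRICE_PER_1000_PAGES = {
    "text": 1.50,     # plain text detection
    "tables": 15.00,  # table analysis
    "forms": 50.00,   # key-value pair analysis
}


def estimate_cost(pages: int, feature: str) -> float:
    """Return the estimated cost in US dollars for `pages` pages."""
    return pages * PRICE_PER_1000_PAGES[feature] / 1000


# 20,000 pages of key-value extraction at $50.00 per 1,000 pages
print(estimate_cost(20_000, "forms"))  # 1000.0
```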

Textract is pretty accurate, but if you’re worried about the machine getting something wrong, AWS has a solution for that as well. You can set up Textract to use Amazon’s Augmented AI workflow, which will automatically refer low-confidence results to humans for review.
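If you just want to see which results Textract is unsure about before setting up a full review workflow, you can check the Confidence score (0 to 100) that Textract attaches to most blocks in its API response. The sketch below is not Augmented AI itself, only a local illustration of the same idea; the function name and the threshold of 90 are arbitrary choices for this example.

```python
def blocks_needing_review(blocks, threshold=90.0):
    """Return the blocks whose Confidence falls below `threshold`.

    Blocks that carry no Confidence field are left out of the review list.
    """
    return [b for b in blocks if b.get("Confidence", 100.0) < threshold]


# Hand-written sample data shaped like Textract blocks (not real output).
sample_blocks = [
    {"Id": "1", "BlockType": "WORD", "Text": "Wages", "Confidence": 99.2},
    {"Id": "2", "BlockType": "WORD", "Text": "4S,000", "Confidence": 61.7},
]

for block in blocks_needing_review(sample_blocks):
    print(block["Text"], block["Confidence"])
```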

Using Textract

Head over to the Textract Management Console, and click “get started.” Using the console manually, you can upload documents with the upload button.

Textract will process it immediately. You’ll quickly see what makes Textract so useful; it knew which pieces of text on this W2 form were important, which ones were part of key-value pairs, which ones were part of tables, and which ones it could throw out.

On the right, you’ll find the output, which displays all the raw strings it found, the key-value pairs, and any tables of data. Note that these aren’t mutually exclusive, as in this case it found key-value pairs that were also parts of tables.
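When you call Textract through its API rather than the console, the same results come back as a flat list of blocks that reference each other by ID. A key is a KEY_VALUE_SET block that points to its value block and to the WORD blocks holding its text. Here is a minimal Python sketch of stitching those back into a dictionary; the function names are our own, the sample data is hand-written, and checkboxes (selection elements) are ignored for brevity.

```python
def get_text(block, blocks_by_id):
    """Join the text of the WORD blocks listed as children of `block`."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = blocks_by_id[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)


def extract_key_values(blocks):
    """Map each key's text to the text of the value block it points to."""
    blocks_by_id = {b["Id"]: b for b in blocks}
    pairs = {}
    for block in blocks:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        value_text = ""
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                value_text = " ".join(
                    get_text(blocks_by_id[i], blocks_by_id) for i in rel["Ids"]
                )
        pairs[get_text(block, blocks_by_id)] = value_text
    return pairs


# Hand-written sample shaped like a Textract response (not real output).
sample_blocks = [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Employee"},
    {"Id": "w2", "BlockType": "WORD", "Text": "name"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Jane"},
]

print(extract_key_values(sample_blocks))
```

A dictionary like this is the kind of thing you could hand to your own database code, which is the part Textract leaves to you.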

You can download the results, and you’ll find a CSV file of all tables and key-value pairs, as well as a text file of the raw text output.

If you want to automate Textract, you’ll need to use the AWS CLI or API. Textract has its own set of commands for working with it from the command line.

You can either serialize the document to base64-encoded document bytes, or upload it to S3 and give Textract a key for where to find it. Then, you can use analyze-document to analyze it:

aws textract analyze-document --document '{"S3Object":{"Bucket":"bucket","Name":"document"}}' --feature-types '["TABLES","FORMS"]'
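If you would rather go through the API from code, the same call might look like this in Python. This is a sketch that assumes the boto3 SDK, which this article doesn't otherwise cover; the helper names are invented for the example, and the client is passed in so you can create it however you normally do. Note that with boto3 you pass raw bytes and the SDK takes care of the base64 encoding.

```python
def document_from_s3(bucket, name):
    """Point Textract at an object that is already in S3."""
    return {"S3Object": {"Bucket": bucket, "Name": name}}


def document_from_file(path):
    """Send a local file's raw bytes inline instead of using S3."""
    with open(path, "rb") as f:
        return {"Bytes": f.read()}


def analyze(client, document):
    """Run the synchronous analysis for tables and forms.

    `client` is assumed to be a boto3 Textract client, for example
    boto3.client("textract").
    """
    return client.analyze_document(
        Document=document,
        FeatureTypes=["TABLES", "FORMS"],
    )
```

With a client in hand, `analyze(client, document_from_s3("bucket", "document"))` would mirror the CLI command above.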

This is a synchronous operation, but you can analyze asynchronously by starting a job with start-document-analysis, which returns a job ID, and then fetching the results manually with get-document-analysis:

aws textract get-document-analysis --job-id df7cf32ebbd2a5de113535fcf4d921926a701b09b4e7d089f3aebadb41e0712b --max-results 1000
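Fetching results manually usually means checking the job status until it finishes, then paging through the results. Below is a rough Python sketch of that loop, again assuming a boto3 Textract client is passed in. The function names and the five-second polling interval are our own choices, and for simplicity any final status other than SUCCEEDED is treated as an error.

```python
import time


def start_analysis(client, bucket, name):
    """Start an asynchronous analysis job and return its job ID."""
    response = client.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": name}},
        FeatureTypes=["TABLES", "FORMS"],
    )
    return response["JobId"]


def fetch_analysis_blocks(client, job_id, poll_seconds=5.0, max_results=1000):
    """Wait for the job to finish, then collect blocks from every page."""
    while True:
        response = client.get_document_analysis(
            JobId=job_id, MaxResults=max_results
        )
        status = response["JobStatus"]
        if status != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)  # job still running, check again later

    if status != "SUCCEEDED":
        raise RuntimeError(f"Textract job {job_id} ended with status {status}")

    blocks = list(response.get("Blocks", []))
    # Large documents are split across pages linked by NextToken.
    while response.get("NextToken"):
        response = client.get_document_analysis(
            JobId=job_id,
            MaxResults=max_results,
            NextToken=response["NextToken"],
        )
        blocks.extend(response.get("Blocks", []))
    return blocks
```

For anything beyond a quick script, you would probably want a timeout on the polling loop so a stuck job can't hang your program.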