Today, Amazon Web Services, Inc. (AWS), an Amazon.com company (NASDAQ: AMZN), announced the general availability of Amazon Textract, a fully managed service that uses machine learning to automatically extract text and data, including from tables and forms, from virtually any document without the need for manual review, custom code, or machine learning experience.
Amazon Textract goes beyond simple optical character recognition (OCR) to identify the contents of fields in forms, information stored in tables, and the context in which the information is presented, such as a name or social security number from a tax form or the product SKU or quantity in a warehouse from an inventory report. The extracted text and data can be easily used to build smart searches on large archives of documents, or can be loaded into a database for use by applications, such as accounting, auditing, and compliance software. Amazon Textract’s API supports multiple image formats like scans, PDFs, and photos, and customers can use it with database and analytics services like Amazon Elasticsearch Service, Amazon DynamoDB, and Amazon Athena and other machine learning services like Amazon Comprehend, Amazon Comprehend Medical, Amazon Translate, and Amazon SageMaker to derive deeper meaning from the extracted text and data. To get started with Amazon Textract, visit https://aws.amazon.com/textract.
Many companies extract text and data from files such as contracts, expense reports, mortgage guarantees, fund prospectuses, tax documents, hospital claims, and patient forms through manual data entry or simple OCR software. This is a time-consuming and often inaccurate process that produces an output requiring extensive post-processing before it can be put in a format that is usable by other applications. That’s because existing OCR technologies are unable to recognize common layouts like forms and tables, and only generate a lengthy and often inaccurate text dump. What organizations want instead is the ability to accurately identify and extract text and data from forms and tables in documents of any format and from a variety of file types and templates. Amazon Textract analyzes virtually any type of document, automatically generating highly accurate text, form, and table data. Amazon Textract identifies text and data from tables and forms in documents - such as line items and totals from a photographed receipt, tax information from a W2, or values from a table in a scanned inventory report - and recognizes a range of document formats, including those specific to financial services, insurance, and healthcare, without requiring any customization or human intervention. Amazon Textract makes it easy for customers to accurately process millions of document pages in just a few hours, significantly lowering document processing costs, and allowing customers to focus on deriving business value from their text and data instead of wasting time and effort on post-processing. Results are delivered via an API that can be easily accessed and used without requiring any machine learning experience.
“The power of Amazon Textract is that it accurately extracts text and structured data from virtually any document with no machine learning experience required. Subsequently, developers can analyze and query the extracted text and data using our database and analytics services like Amazon Elasticsearch Service, Amazon DynamoDB, and Amazon Athena and integrate with other machine learning services like Amazon Comprehend, Amazon Comprehend Medical, Amazon Translate, and Amazon SageMaker to help customers derive deeper meaning from the extracted text and data,” said Swami Sivasubramanian, Vice President, Amazon Machine Learning. “In addition to the integration with other AWS services, the rich partner community developing around Amazon Textract makes it possible for customers to gain real meaning from their file collections, operate more efficiently, improve security compliance, automate data entry, and facilitate faster business decisions.”
Amazon Textract takes scanned files stored in an Amazon S3 bucket, reads them, and returns data in the form of JSON text annotated with the page number, section, form labels, and data types. This data can then be used for a range of applications (e.g., generating smart search indexes, redacting text in a massive collection of forms, creating automated loan approval workflows, using the data for regulatory compliance, and flagging fraud risk for insurance claims). Customers can load the data into business software, such as spreadsheets, databases, and payroll systems, or they can analyze and query the data using Amazon Elasticsearch Service, Amazon DynamoDB, Amazon Redshift, or Amazon Athena. Amazon Textract is available today in US East (Ohio), US East (N. Virginia), US West (Oregon), and EU (Ireland), and will expand to additional regions in the coming year.
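The flow described above (a file in an Amazon S3 bucket goes in, page-annotated JSON comes out) can be sketched in Python with the AWS SDK (boto3). This is a minimal illustration, not an official sample: the bucket name, object key, and helper names (`analyze_s3_document`, `lines_by_page`) are hypothetical, the `SAMPLE_RESPONSE` is a hand-written stand-in that mimics only a few fields of a real response, and the call shape reflects our understanding of boto3's Textract client rather than anything stated in this announcement.

```python
from collections import defaultdict


def analyze_s3_document(bucket, key, region="us-east-1"):
    """Ask Textract to analyze one document stored in S3 (requires AWS credentials)."""
    import boto3  # imported lazily so the parsing helper below works without AWS access

    client = boto3.client("textract", region_name=region)
    return client.analyze_document(
        Document={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["TABLES", "FORMS"],  # request table and form data as well as text
    )


def lines_by_page(response):
    """Group the text of LINE blocks in a Textract-style response by page number."""
    pages = defaultdict(list)
    for block in response.get("Blocks", []):
        if block.get("BlockType") == "LINE":
            # Default to page 1 when no page number is present on the block.
            pages[block.get("Page", 1)].append(block.get("Text", ""))
    return dict(pages)


# Hand-written stand-in for a response (an assumption for illustration only).
SAMPLE_RESPONSE = {
    "Blocks": [
        {"BlockType": "PAGE", "Page": 1},
        {"BlockType": "LINE", "Page": 1, "Text": "Invoice 1001"},
        {"BlockType": "WORD", "Page": 1, "Text": "Invoice"},
        {"BlockType": "LINE", "Page": 1, "Text": "Total: 42.00"},
        {"BlockType": "LINE", "Page": 2, "Text": "Thank you"},
    ]
}

if __name__ == "__main__":
    for page, lines in sorted(lines_by_page(SAMPLE_RESPONSE).items()):
        print(page, lines)
```

With credentials configured, `lines_by_page(analyze_s3_document("my-bucket", "scans/form.png"))` would apply the same grouping to a live response; both the bucket and the key here are placeholders.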
The Globe and Mail is a national icon and Canada’s most recognized media brand. “As a news media company, we rely on many PDF or scanned-source documents such as FOIs (freedom of information requests) that have important information contained in tables that we previously couldn’t access,” said Michael O’Neill, Managing Director of Digital and Data Science at The Globe and Mail. “These documents have been under-utilized because journalists were not able to access them easily or didn’t know they existed. Using Amazon Textract, we are able to extract information from tables in PDFs and easily output that data to CSV and offer easy access to these documents by making them available for search queries by our journalists. This increases efficient access to information for our journalists tenfold.”
Met Office is the UK’s national weather service, and is a world leader in providing weather and climate services. “We hope to use Amazon Textract to digitize millions of historical weather observations from document archives,” said Philip Brohan, Climate Scientist at Met Office. “Making these observations available to science will improve our understanding of climate variability and change.”
PwC helps organizations and individuals create value by delivering quality in assurance, tax, and advisory services. “At PwC, we work to provide our customers with intelligent automation tools that help transform previously manual processes. We’ve integrated Amazon Textract into our solution for the pharmaceutical industry to automate document processing for various FDA forms like MedWatch and CIOMS,” said Siddhartha Bhattacharya of PwC. “Previously, people would manually review, edit, and process these forms, each one taking hours. Amazon Textract has proven to be the most efficient and accurate OCR solution available for these forms, extracting all of the relevant information for review and processing, and reducing time spent from hours down to minutes.”
Healthfirst is a not-for-profit managed care organization and one of the fastest growing health plans in New York with over 1.4M diverse members and a network of more than 35,000 providers and 4,500 employees. “At Healthfirst, we are building data pipelines to turn scanned medical charts into useful clinical information to improve care coordination, drive quality outcomes, and ensure appropriate reimbursement for members under our coverage,” said Steve Prewitt, Chief Analytics Officer at Healthfirst. “We use Amazon Textract and Amazon Comprehend Medical to glean real value from unstructured data sources in an efficient way, resulting in revenue savings 10-20 times more than our usual downstream operation. By scaling up to analyze over 50,000 charts, we can find undocumented diagnoses and refer around 5,000 members for the care management they need.”
Informed, Inc. automates how financial institutions originate loans and open bank accounts. “We have already used Amazon Textract to analyze tens of thousands of loan documents on behalf of financial institutions, and our own software-as-a-service offering has been enhanced by the service, enabling us to identify 95% of the defects in loan application packages and help banks reduce their manual data entry,” said Justin Wickett, Founder and CEO, Informed Inc. “Using Amazon Textract, our software gives financial institutions real-time visibility into an applicant’s income based off of their pay stubs, bank statements, tax returns, and other financial documents. We plan to expand the types of documents we analyze using Amazon Textract in order to enable financial institutions to take advantage of our machine learning models and bring real-time decision-making efficiency to today’s slow and manual process.”
Candor’s mission is to transform the archaic, time-consuming process that burdens the mortgage industry. “We use OCR to extract data from a wide variety of lender-required documents to verify income, assets, property value, and more. Until now, the best OCR solution read one page at the rate of 38.4 seconds, but Amazon Textract achieves this in a fraction of that time,” said Tom Showalter, Founder & CEO of Candor. “We’ve been able to use Textract to accurately read complex, diverse documents such as bank statements, pay stubs, and tax documents without additional training or machine learning expertise, allowing our clients to underwrite and close a loan in days, as opposed to weeks.”
UiPath is a leading Robotic Process Automation vendor providing a complete software platform to help organizations efficiently automate business processes. "Amazon Textract will further differentiate UiPath's robotic process automation platform by enhancing UiPath’s document understanding capabilities, enabling our customers to unlock critical business data from documents, transform that data into actionable business insights, and deliver those insights into line-of-business and operational systems," said Param Kahlon, Chief Product Officer of UiPath.
TeraDact allows customers to transform stored images and paper documents into privacy-compliant, usable digital formats at scale. “Amazon Textract’s smart docs platform feeds TeraDact’s patented redaction services to automatically remove and secure sensitive data. TeraDact customers can permanently remove this data so that it can never be recovered or opt to replace sensitive data with patented tokens which can be recovered by individuals with the appropriate permissions. This is particularly useful in complying with government mandates surrounding individual data privacy such as GDPR,” said Tom Trobridge, COO, TeraDact.
Ripcord’s mission is to digitize and extract knowledge from paper documents using vision-guided robotics, machine learning, and advanced AI. This knowledge automates business processes and workflows. “We’ve had tremendous success utilizing Amazon Textract to augment our advanced entity extraction to benefit many industries and uncover $4 billion in new pay. We look forward to expanding our use of Amazon Textract across financial and government services, healthcare and legal,” said Alex Fielding, CEO of Ripcord.
Blue Prism develops Robotic Process Automation software to provide businesses and organizations with a more agile virtual workforce. “Blue Prism's connected-RPA can automate and perform mission-critical processes, allowing customers the freedom to focus on more creative, meaningful work. By using Amazon Textract, we’ve given our digital workforce another powerful tool for automation. Amazon Textract accurately analyzes data from various document types using machine learning, which enhances the digital transformation journey for our customers. Using additional AWS AI services like Amazon Comprehend and Amazon Rekognition, we can tackle challenges from added secure customer authentication processes to fraud detection capabilities. The intelligence and flexibility of Amazon Textract’s form data extraction can elevate OCR to new levels in industries like financial services, retail, manufacturing and transportation to name a few,” said Dave Moss, CTO and Co-Founder of Blue Prism.