These days, much of the data useful to organisations does not reside in databases. Instead, it exists in the form of unstructured documents such as invoices, receipts and forms. Most business transactions begin, proceed and end with new documents being generated. Organisations therefore need AI systems that can parse, understand and extract relevant information from unstructured documents. This reduces the cost of manual data entry and enables faster decision-making in financial services.
To avoid designing expert rules for each specific document type, some recent work tackles the problem by learning a model that exploits the semantic context of text sequences. We propose to harness the valuable information in both the semantic meaning and the spatial distribution of text in documents.
Overall, the task of structured information extraction from unstructured documents can be divided into the following subtasks:
- Generating manually labelled data by tagging key-value pairs of interest.
- Optical Character Recognition (OCR), which reads text from typed or scanned documents.
- ROI (Region of Interest) Identification for Key-Value pair extraction.
- Table Extraction for line-items information.
The research community and the industry have made varying degrees of progress on these tasks. Let us go through them one by one.
Generating manually labelled data by tagging key-value pairs of interest:
There are several off-the-shelf tools available from both open-source and cloud providers. Some popular open-source tools are labelme, labelImg, CVAT (by Intel), VoTT (by Microsoft) etc.
Apart from these, popular cloud platforms also offer labelling tools: Microsoft Azure includes one as part of its Form Recognizer API, and AWS offers Ground Truth as part of its SageMaker offering.
So, as we can see, both the open-source community and commercial cloud providers have made significant progress on data labelling.
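Whichever tool you pick, the output of this step is essentially the same: each field of interest is tagged with its text value and a bounding box. A minimal illustrative schema is sketched below; the field names, coordinates and file names are hypothetical, and real tools (labelme, CVAT, Ground Truth) each have their own export formats.

```python
import json

# Illustrative annotation for one invoice page: each key-value pair of
# interest is tagged with its text value and a pixel bounding box
# [x0, y0, x1, y1]. All names and coordinates here are made up.
annotation = {
    "document": "invoice_001.png",
    "fields": [
        {"key": "invoice_number", "value": "INV-2021-0042", "bbox": [412, 88, 530, 110]},
        {"key": "invoice_date",   "value": "2021-03-15",    "bbox": [412, 120, 510, 142]},
        {"key": "total_amount",   "value": "1,250.00",      "bbox": [470, 980, 560, 1004]},
    ],
}

# Persist one JSON file per labelled document for the training pipeline.
with open("invoice_001.json", "w") as f:
    json.dump(annotation, f, indent=2)
```

A few hundred documents labelled in this shape are usually enough to start training and evaluating the extraction models discussed below.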
OCR to read the text from scanned/typed documents:
When it comes to OCR, there are again plenty of options available both as open-source projects and managed cloud services. We’ll briefly discuss some of these:
- Tesseract OCR is one of the most popular open-source OCR tools; it is free and easy to use. It is a command-line tool, but the Python wrapper pytesseract and the GUI frontend gImageReader are also available, so you can choose whichever best fits your purposes. Tesseract usually performs well on high-resolution images. Preprocessing operations such as dilation, erosion and Otsu binarisation can further improve its accuracy. It also offers 14 page segmentation modes that you can experiment with to see what works best for your specific use case.
- EasyOCR is another open-source OCR library, supporting 80+ languages. As the name suggests, it is simple and lightweight, and it performs very well on organised text such as invoices, receipts and bills. While Tesseract offers several page segmentation modes to choose from, EasyOCR exposes simple yet fine-grained controls in the form of various horizontal and vertical merge thresholds.
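To make the Tesseract tips concrete, here is a minimal pytesseract sketch combining Otsu binarisation with an explicit page segmentation mode. The image path is hypothetical, and the heavy dependencies (opencv-python, pytesseract) are imported lazily so the config helper can be used on its own:

```python
def tesseract_config(psm: int = 6, oem: int = 3) -> str:
    """Build the Tesseract CLI config string.

    PSM 6 assumes a single uniform block of text; try the other modes
    (0-13) for multi-column or sparse layouts. OEM 3 is the default engine.
    """
    return f"--oem {oem} --psm {psm}"


def ocr_image(path: str, psm: int = 6) -> str:
    # Lazy imports: assumes opencv-python and pytesseract are installed.
    import cv2
    import pytesseract

    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    # Otsu binarisation often sharpens low-contrast scans before OCR.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return pytesseract.image_to_string(binary, config=tesseract_config(psm))
```

Sweeping `psm` over a handful of values on a validation set is a cheap way to find the mode that suits your document layout.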
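EasyOCR's API is similarly compact: `Reader.readtext` returns (bounding box, text, confidence) triples. The sketch below keeps only reasonably confident detections; the confidence threshold and image path are illustrative choices, not prescriptions.

```python
def join_confident(results, min_conf: float = 0.4) -> str:
    """Join the text of (bbox, text, confidence) triples, as returned by
    EasyOCR's readtext, keeping only detections above the threshold."""
    return " ".join(text for _, text, conf in results if conf >= min_conf)


def read_receipt(path: str) -> str:
    # Lazy import: assumes easyocr is installed.
    import easyocr
    reader = easyocr.Reader(["en"], gpu=False)  # English-only, CPU inference
    return join_confident(reader.readtext(path))
```

Filtering on confidence this way trades a little recall for far fewer garbage tokens downstream, which matters when the text feeds a key-value extractor.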
However, most of these open-source tools are trained on generic domains and often struggle to recognise currency symbols and other finance-specific entities. To address this, there are paid invoice-OCR solutions from Veryfi, Nanonets, AWS, GCP, Azure etc.
ROI (Region of Interest) Identification for key-value pair extraction:
Now that you have your labelled data ready and an OCR tool that can extract text from the documents, it’s time to handle the crux of the problem: extracting the relevant key-value pairs. The approaches can be broadly divided into two categories:
- Rule-Based: Rule-based systems consist of pixel-level rules that label entities within a document. They depend heavily on the document format and can achieve very high accuracy on a particular format. However, they lack “intelligence” and therefore do not generalise to multiple document formats. They are a good starting point for an enterprise looking to automate manual data entry, but they are not scalable in the long run.
- ML-Based: ML-based systems use semantic knowledge to label the extracted text, most often via word embeddings, but they frequently miss out on spatial information. Incorporating semantics makes key-value extraction more robust, accurate and generalisable than rule-based approaches; combining it with spatial and visual information works even better.
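A minimal sketch of the "semantics plus layout" idea: represent each OCR token as a word embedding concatenated with its normalised bounding-box coordinates, so a downstream classifier sees both what a token says and where it sits on the page. The `embed` callable is a stand-in for any pretrained word-embedding model; nothing here is a specific library's API.

```python
import numpy as np


def token_features(text, bbox, page_w, page_h, embed):
    """Feature vector for one OCR token.

    text  : token string
    bbox  : (x0, y0, x1, y1) in pixels
    embed : callable mapping a string to a 1-D embedding vector
    Returns the semantic embedding concatenated with the box coordinates
    scaled to [0, 1], so position is comparable across page sizes.
    """
    x0, y0, x1, y1 = bbox
    spatial = np.array([x0 / page_w, y0 / page_h, x1 / page_w, y1 / page_h])
    return np.concatenate([np.asarray(embed(text), dtype=float), spatial])
```

A classifier trained on such features can learn patterns like "an amount near the bottom-right of an invoice is probably the total", which a purely semantic model cannot express.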
Table Extraction for line-items information:
Finally, we need to tackle possibly the trickiest part of the entire solution: line-item extraction. The task is deceptively complex. At first glance, it seems solvable with basic rule-based OCR techniques, but non-standard invoice formats introduce several complications, including multi-line item descriptions and columns split into sub-columns under a common header. Hence, a deep-learning-based solution is necessary. Luckily, a few open-source solutions are available for table detection and extraction. Some of these are:
- CascadeTabNet: an end-to-end approach that solves both table detection and structure recognition from image-based documents with a single Convolutional Neural Network (CNN). It is based on a cascade mask region-based CNN with a High-Resolution Network backbone (Cascade mask R-CNN HRNet), which detects table regions and recognises the structural body cells of the detected tables simultaneously.
- DeepDeSRT: a novel end-to-end system for table understanding in document images. For table detection, it applies transfer learning by fine-tuning a pre-trained Faster R-CNN model. Transfer learning is used again for table structure recognition, by augmenting and fine-tuning an FCN semantic-segmentation model pre-trained on the Pascal VOC 2011 dataset.
- TabStructNet: focuses only on table structure recognition, but proposes a robust pipeline for it. It mimics the way humans perceive a table by combining cell-detection and interaction modules to localise the cells and predict their row and column associations with the other detected cells. Cell detection uses Feature Pyramid Networks (FPN) with Region Proposal Networks (RPN), and the row-column association uses Graph Convolutional Networks.
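To see why these deep models are needed, consider the naive baseline: grouping OCR word boxes into rows purely by vertical proximity. The sketch below (box format `(text, x0, y0, x1, y1)` is an assumption) works for clean single-line rows but breaks on multi-line descriptions and merged cells, which is exactly the gap the models above fill.

```python
def group_rows(words, y_tol=5):
    """Group OCR word boxes into table rows by vertical proximity.

    words: list of (text, x0, y0, x1, y1) tuples. Returns a list of rows,
    each a list of words sorted left to right. A naive baseline only:
    it assumes every cell fits on one visual line.
    """
    rows = []
    for word in sorted(words, key=lambda w: w[2]):  # sort by top-edge y
        # Same row if this word's top is within y_tol of the previous word's.
        if rows and abs(word[2] - rows[-1][-1][2]) <= y_tol:
            rows[-1].append(word)
        else:
            rows.append([word])
    # Within each row, order words left to right by x0.
    return [sorted(row, key=lambda w: w[1]) for row in rows]
```

The moment an item description wraps onto a second line, this heuristic emits a spurious extra row, which is why cell detection and row/column association are learned rather than hand-coded in the systems above.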
Thus, as we have seen, extracting structured information from unstructured documents involves several complicated steps. By discussing each of these steps, we have tried to establish a framework for tackling such problems.
We at MastersIndia are also developing custom solutions to tackle these problems; do get in touch for details!