From fraud detection to the already ubiquitous machine-learning-powered trading strategies, few industries have benefited from the promise of AI as much as finance. After all, finance is a data-rich ecosystem, and much of that data does not need to be labeled or augmented to be useful. Take trading, for example, where minute-by-minute stock information is both readily available and incredibly robust. When you add up the accessibility of data, the importance of financial use cases generally, and the competitive advantage even a small improvement in model performance brings to the bottom line, it makes sense that these institutions would be on the cutting edge of machine learning.
But there’s a particular ML discipline we want to talk about today that we’re seeing more and more of here at Figure Eight. Unlike a lot of financial solutions, this one does require some data augmentation, but the benefits of doing it correctly can be massive. We’re talking about understanding and transcribing every document with machine learning. We’re talking about optical character recognition.
Sometimes referred to as OCR, optical character recognition translates handwritten or printed text, most often from a PDF or other image, into machine-readable text. For a simple example: when you deposit a check at an ATM, you're seeing OCR at work. The ATM scans the check, reads the amount, and double-checks with the user to verify that the amount is in fact correct (which, at this point, it almost always is).
Checks are a pretty simple problem, though. They're uniform (the amount is in the same place, the recipient is in the same place, and so on) and they contain text that's already machine-readable (the account numbers at the bottom left, for example). Most financial institutions deal with a much larger universe of document types, many of them handwritten, scanned, or locked in legacy databases. Manual data entry for these documents is both costly and, frankly, a bit boring. Which is why smart companies are moving to a machine-learning-powered approach.
So how does OCR work? To start with, it's important to understand that there are various strategies depending on the documents in question (and we've seen plenty of those approaches here at Figure Eight; in fact, we have case studies with both Workday and CrowdReason that show how they approach this problem). What we'd like to do is show you three of the main steps generally taken to train smart OCR models. We'll start with one that's often overlooked:
Identify areas of interest
Many OCR workflows start with understanding what's important in individual documents (and what isn't). Typically, this is done by drawing bounding boxes over images.
Take a typical document, say a mortgage form. There’s going to be some boilerplate legal copy on there that never changes and fields that you might not care about. But amounts, due dates, account numbers, signers, and more are in fact important.
It works like this: you surface a document to a human annotator like the ones we have on Figure Eight. You ask them to place a box around the relevant fields (like due date) and they do so. Effectively, you have a subsection of the image boxed off and that box is associated with the field “due date.” With enough of these images and annotations, your OCR algorithm can start to identify where the due date (or any other relevant field) is on a certain form without any human input.
It’s important to note that doing this helps create “fingerprints” of different documents. In other words, your model can learn that certain fields appear in certain places on certain documents, allowing you to simply upload image files and know with confidence that your model can both read the document and identify what kind of document it is in the first place.
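To make the “fingerprint” idea concrete, here’s a minimal sketch of how those box annotations might be represented and averaged across many documents of the same type. The field names, normalized coordinates, and the simple averaging heuristic are all illustrative assumptions, not Figure Eight’s actual data format:

```python
from collections import defaultdict

def average_field_locations(annotated_docs):
    """Average the annotated (x, y, width, height) box for each field
    across many documents of the same type -- a crude layout 'fingerprint'.
    Coordinates are assumed normalized to the page (0.0 to 1.0)."""
    sums = defaultdict(lambda: [0.0, 0.0, 0.0, 0.0])
    counts = defaultdict(int)
    for doc in annotated_docs:
        for field, box in doc.items():
            for i, value in enumerate(box):
                sums[field][i] += value
            counts[field] += 1
    return {field: tuple(total / counts[field] for total in totals)
            for field, totals in sums.items()}
```

With a handful of annotated mortgage forms, for instance, the averaged box for "due_date" tells you roughly where to look on any new instance of that form, before any text is read at all.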
Transcribe those fields
Once you’ve identified where the relevant fields are in a given document, the next step is to actually transcribe them. Some financial institutions will already have models that can do this, of course, but if not, we can help train them. And even if you do, tuning them on individual documents or file types can be an important step to improving accuracy.
This is a fairly simple task on our platform. You surface a certain area to an annotator (say, that boxed due date we mentioned above) and have them transcribe what’s in there. Obviously, handwriting is trickier than typed text, so an OCR model will need to see far more of these annotations to understand the substance of the text, but it’s doable. After all, remember the check you deposited at the ATM.
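Because several annotators typically transcribe the same field, a common quality-control step is to keep only transcriptions that a majority agrees on. The sketch below is a hypothetical version of that aggregation step (the function name and majority rule are our assumptions, not a platform API):

```python
from collections import Counter

def aggregate_transcriptions(judgments):
    """Return the transcription a majority of annotators agree on,
    or None when there's no clear winner (i.e. escalate the field)."""
    counts = Counter(j.strip() for j in judgments)
    best, votes = counts.most_common(1)[0]
    return best if votes > len(judgments) / 2 else None
```

Fields that come back as None can simply be re-queued for more judgments before they’re used as training data.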
Tune and improve your model over time
It’s also worth mentioning here that not all text will be machine readable. Some people just have bad handwriting. Some images are pixelated.
That’s okay. What a smart, well trained model will do is flag images it can’t read or isn’t confident about. Those can go to human annotators to transcribe (or internal services you’re already using) and those judgments can be fed back into the model so it gets smarter. The point is to reduce overall data entry by orders of magnitude. Making sure every document is machine readable is a great goal, but there are a ton of edge cases.
By annotating for fields your model struggles with or the edge cases we mentioned above, you create a virtuous circle for your model. Each annotation trains the model so that model needs less human input to make accurate predictions. When it can’t make one, you annotate and feed that information back to it so it learns from the edge cases or tough problems. This is a fundamental part of the human-in-the-loop approach we’ve championed for years.
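The routing logic at the heart of that loop can be sketched in a few lines. This is a minimal illustration of confidence-threshold routing, assuming the model emits a confidence score per field; the tuple format and the 0.90 threshold are illustrative choices:

```python
def route_predictions(predictions, threshold=0.90):
    """Split model output into auto-accepted results and a human-review
    queue based on model confidence -- the human-in-the-loop split.
    Each prediction is a (field, text, confidence) tuple."""
    accepted, review_queue = [], []
    for field, text, confidence in predictions:
        if confidence >= threshold:
            accepted.append((field, text))
        else:
            review_queue.append((field, text))
    return accepted, review_queue
```

Everything in the review queue goes to annotators, and their corrected transcriptions are fed back in as fresh training data, so the queue shrinks over time.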
If you’re looking to build out your OCR functionality, please don’t hesitate to reach out. We’ve helped Workday understand receipts for their expense reporting solution. We’ve helped CrowdReason understand complicated property tax forms. And we can help you too.