No two property tax forms are exactly the same. Learn how Figure Eight helped CrowdReason fingerprint thousands of documents so they can each be read instantly upon upload.
“We had no ground truth to train with on day 1. But with Figure Eight, we were given instant access to a platform where we could generate high quality ground truth data.”
– Brandon Van Volkenburgh,
CTO & Co-Founder, CrowdReason
CrowdReason is a financial technology services company that optimizes business processes with its proprietary SaaS applications, combining trained software robotic processes with on-demand virtualized labor. The company specializes in property tax, with a suite of tools that includes a service for analyzing and understanding disparate tax documents using machine learning and optical character recognition (OCR) technology.
There are more than 20,000 taxing jurisdictions in the United States, each with its own documents and processes. For businesses, especially ones with many locations across the country or complex tax liability, staying current on those differences and understanding each jurisdiction’s rules is a serious headache. It can mean endless data entry, missed deadlines, penalties, and missed tax reduction opportunities.
CrowdReason’s goal is to help their customers manage it all. But scouring every document individually simply isn’t an option for a company that wants to automate a painful process for their customers. What’s much less painful and far more scalable is reading every document with machine learning. And so they turned to Figure Eight.
Extracting data from unstructured documents has always been a challenge, especially when those documents vary in the subtle but important ways that property tax forms do. The large number of document classes, combined with forms that change from year to year, results in thousands of documents that are never quite alike.
CrowdReason’s solution is to use an OCR model that reads scanned documents and passes them through an automated workflow in which bots make decisions on unstructured content. The bots draw on both preloaded algorithms and human labor to extract important data such as due dates, amounts, and addresses, and tie that data to CrowdReason’s core product offerings. It works like this:
CrowdReason’s in-house algorithms “fingerprint” a document based on extracted features, then check whether that document has been seen before. Much as a person can be identified by a fingerprint, a document can be matched by its unique characteristics. If the document has not been seen before, it is passed in full through the Figure Eight platform, where human workers extract multiple data points in a series of tasks.
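The fingerprint-and-lookup step can be illustrated with a minimal Python sketch. It does not describe CrowdReason’s actual algorithm, which the company has not published here. It assumes the extracted features are simply the printed field labels on a form, and it uses an exact hash match where a production system would more likely use a fuzzy similarity measure. The names `fingerprint`, `FingerprintRegistry`, and the route labels are hypothetical.

```python
import hashlib


def fingerprint(features):
    """Hash a normalized, order-independent set of features (assumed: field labels)."""
    normalized = sorted({f.strip().lower() for f in features})
    return hashlib.sha256("\n".join(normalized).encode("utf-8")).hexdigest()


class FingerprintRegistry:
    """Hypothetical in-memory record of fingerprints that have been seen."""

    def __init__(self):
        self._seen = {}  # fingerprint -> number of documents seen so far

    def route(self, features):
        """Return (fingerprint, route) and record that the document was seen."""
        fp = fingerprint(features)
        count = self._seen.get(fp, 0)
        self._seen[fp] = count + 1
        # A first-time fingerprint goes to human workers for full extraction.
        return fp, ("template" if count else "human_extraction")


registry = FingerprintRegistry()
fp_a, route_a = registry.route(["Parcel Number", "Amount Due", "Due Date"])
fp_b, route_b = registry.route(["due date ", "parcel number", "AMOUNT DUE"])
print(route_a, route_b, fp_a == fp_b)
```

The second document differs only in label order, case, and whitespace, so it hashes to the same fingerprint and skips full human extraction.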
The data is then aggregated and passed on to the customer. Each extracted data point is also fed into an extraction algorithm that optimizes a bounding box around the location where that data point appears in documents with a given fingerprint. Because a data point sometimes appears in multiple locations, CrowdReason lets a machine determine the optimal bounding boxes rather than relying on worker-defined ones.
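One way to picture a machine-determined bounding box is the following sketch, which is an illustration under stated assumptions rather than CrowdReason’s method. It assumes that, for each document of a fingerprint, the human-extracted value has already been located in the OCR output, possibly at several positions. It then groups those positions on a coarse grid and keeps the location supported by the most documents. The function name `best_bounding_box`, the `(x0, y0, x1, y1)` box format, and the `cell` size are all hypothetical.

```python
from collections import defaultdict


def best_bounding_box(candidates_per_doc, cell=50):
    """Pick the page location that most documents agree on.

    candidates_per_doc: one list per document, each holding (x0, y0, x1, y1)
    boxes where the human-extracted value was found in that document's OCR.
    Returns the union of the boxes in the best-supported grid cell, or None.
    """
    clusters = defaultdict(list)  # grid cell -> [(doc_index, box), ...]
    for doc_idx, boxes in enumerate(candidates_per_doc):
        for box in boxes:
            cx = (box[0] + box[2]) / 2
            cy = (box[1] + box[3]) / 2
            # Coarse grid clustering: a simplification that can split
            # neighbors which happen to straddle a cell boundary.
            clusters[(int(cx // cell), int(cy // cell))].append((doc_idx, box))
    if not clusters:
        return None

    def support(item):
        # Number of distinct documents that contain the value in this cell.
        return len({doc for doc, _ in item[1]})

    _, members = max(clusters.items(), key=support)
    boxes = [box for _, box in members]
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


docs = [
    [(100, 100, 180, 120), (400, 700, 480, 720)],  # value appears twice
    [(104, 102, 184, 122)],
    [(98, 99, 178, 119), (402, 702, 482, 722)],
]
print(best_bounding_box(docs))  # (98, 99, 184, 122)
```

Here the value appears near the top of the page in all three documents and near the bottom in only two, so the top location wins and its boxes are merged into one region.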
Over time, as multiple documents with the same fingerprint are extracted, template and bounding-box performance automatically improves to the point where high-confidence predictions can be made. Once that happens, CrowdReason surfaces only the low-confidence data points to human workers. The ultimate goal is for each document to be recognized and its data extracted without any human input.
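The shift from full human extraction to confidence-based routing might look like the sketch below. It is a simplified illustration, not the company’s implementation. It assumes confidence for a field is estimated from how often the template’s prediction matched the human-extracted value, with Laplace smoothing so that new fields start at 0.5, and it uses an arbitrary threshold of 0.9. The names `FieldConfidence` and `split_by_confidence` are hypothetical.

```python
class FieldConfidence:
    """Hypothetical running confidence for one field of one fingerprint."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def record(self, predicted, human_value):
        """Compare a template prediction against the human-extracted value."""
        self.total += 1
        self.correct += int(predicted == human_value)

    @property
    def value(self):
        # Laplace smoothing: an unseen field starts at 0.5, not 0 or 1.
        return (self.correct + 1) / (self.total + 2)


def split_by_confidence(predictions, threshold=0.9):
    """Split {field: (value, confidence)} into accepted values and review queue."""
    accepted, needs_human = {}, []
    for field, (value, confidence) in predictions.items():
        if confidence >= threshold:
            accepted[field] = value
        else:
            needs_human.append(field)
    return accepted, needs_human


due_date = FieldConfidence()
for _ in range(20):
    due_date.record("2019-01-31", "2019-01-31")  # template matched the human

amount = FieldConfidence()
amount.record("1,200.00", "1,280.00")  # template disagreed with the human

accepted, needs_human = split_by_confidence({
    "due_date": ("2019-01-31", due_date.value),
    "amount": ("1,200.00", amount.value),
})
print(accepted, needs_human)
```

Only the `amount` field is sent to a human worker in this example; `due_date` has enough agreeing observations to be accepted automatically.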
What started out as a 100% manual workflow, with humans extracting data from each and every document, now requires humans for less than 40% of data extraction. In other words, CrowdReason’s models, built with great care and human-in-the-loop labels, understand more than 60% of the data in scanned, uploaded documents.