Data scientists have long struggled to find quality, relevant datasets to get their machine learning projects off the ground. This is especially true for researchers and academics or practitioners in organizations where data clarity and quality haven’t been traditionally prioritized. Simply put, the community needs more and better training data to make the advancements in machine learning we should be making. And that’s why I was so proud to announce the launch of Figure Eight Datasets at Train AI this morning.
Figure Eight Datasets is a free, curated repository of versioned, open-sourced training data. Together, these training sets represent over 10 million human-in-the-loop labels, spanning natural language processing, computer vision, optical character recognition, and more. The initial offering contains:
Open Images Dataset V4 (Bounding Boxes)
A set of 1.7 million images, annotated with bounding boxes for 600 classes of objects, served in collaboration with Google.
Medical Images for Nucleus Segmentation
21,000 nuclei from several different organ types annotated by medical experts.
Transcriptions of 400,000 handwritten names for Optical Character Recognition (OCR).
San Francisco Parking Sign Detection
Parking sign detection and parsing from images of San Francisco streets.
Medical Speech, Transcription and Intent (English)
A collection of audio utterances for common medical symptoms.
Medical Information Extraction
A dataset of relationships between medical terms in PubMed articles, for relation extraction and related Natural Language Processing tasks.
Multilingual Disaster Response Messages
A set of messages related to disaster response, covering multiple languages, suitable for text categorization and related Natural Language Processing tasks.
Swahili Health Translation, Speech, Transcription and Topics
A collection of health-related audio recordings in Swahili created in collaboration with Translators Without Borders and the Red Cross.
What we’re especially proud of is not just the quality and breadth of the datasets we’re releasing today, but their fundamental purpose. Earlier in my career, I worked on machine learning projects for disaster response and medicine, and having sets like these to build on would’ve been a godsend. That’s one of the reasons you’ll see datasets related to those exact fields in our new library. But really, we’re still just barely scratching the surface of the good that machine learning can do for the world. Most people know ML powers search algorithms, entertainment recognition, logistics, gaming, and on and on, but it has the capacity to do so much more. It can save lives. It can change medicine. These are the kinds of datasets we’re featuring now and the kind we’ll be releasing in the future. We hope you enjoy them.
We also released a pair of new features, including our first foray into video annotation. Right now, our Video Object Tracking and Smart Bounding Box Annotation features are in private beta, but they’re the exact kind of features machine learning teams need to provide real value in their day-to-day work.
Video Object Tracking lets you annotate an object in a single video frame and have that annotation persist across the rest of the video. Essentially, that means annotating an object once and having it tracked across hundreds or thousands of frames, instead of the painstaking work of splitting video into stills and annotating each frame. This functionality is essential if you’re annotating video at scale, and with YouTube alone accounting for 1 billion hours of video consumed daily, video-based machine learning projects are going to grow exponentially over the coming years.
Our Smart Bounding Box Annotation feature allows machine learning teams to leverage the power of deep learning to accurately identify objects in computer vision applications. This new capability actually comprises a pair of new features: Predictive Bounding Boxes and Intelligent Bounding Box Aggregation. The former greatly reduces the human effort needed to identify, label, and draw boxes around objects in images: our deep learning model predictively draws those boxes, leaving annotators to simply confirm, adjust, or remove them. The latter, Intelligent Bounding Box Aggregation, addresses how best to determine the “correct” bounding box when multiple annotators box the same object with slight differences. Honed and optimized over millions of past bounding box annotations on the Figure Eight platform, Intelligent Bounding Box Aggregation addresses this problem using deep learning computer vision combined with expertise in quality control for human annotation. Together, these features increase both accuracy and throughput in bounding box annotations on our platform.
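To make the aggregation problem concrete, here is a minimal sketch of a much simpler baseline: taking the coordinate-wise median of several annotators' boxes and scoring each annotator against that consensus with intersection-over-union (IoU). This is not how Intelligent Bounding Box Aggregation works (that feature uses a deep learning model, as described above); the function names and the `(x_min, y_min, x_max, y_max)` box format are assumptions made purely for illustration.

```python
from statistics import median
from typing import List, Tuple

# Hypothetical box format for this sketch: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    overlap_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    overlap_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = overlap_w * overlap_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def aggregate_boxes(boxes: List[Box]) -> Box:
    """Naive consensus box: the median of each coordinate across annotators.

    A simple baseline only, not the deep learning approach described in the post.
    """
    if not boxes:
        raise ValueError("need at least one box to aggregate")
    return tuple(median(box[i] for box in boxes) for i in range(4))


def agreement(boxes: List[Box], consensus: Box) -> List[float]:
    """IoU of each annotator's box with the consensus box."""
    return [iou(box, consensus) for box in boxes]


if __name__ == "__main__":
    # Three annotators boxing the same object with slight differences.
    annotations = [(10, 10, 50, 50), (12, 11, 52, 49), (8, 9, 48, 51)]
    consensus = aggregate_boxes(annotations)
    print("consensus box:", consensus)
    print("per-annotator IoU:", agreement(annotations, consensus))
```

A median is used rather than a mean so that one careless annotation does not drag the consensus box far off the object, and the per-annotator IoU scores give a rough signal of who disagreed with the group.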
If you’d like to get some more information about this private beta, head to our signup page. And please remember to check out our brand new Figure Eight Datasets. We’d love to hear how you use them in your research.