Transcriptions of 400,000 handwritten names for Optical Character Recognition (OCR).
This dataset consists of more than four hundred thousand handwritten names collected through charity projects to support disadvantaged children around the world.
Optical Character Recognition (OCR) uses image processing to convert characters in scanned documents into digital form. It typically performs well on machine-printed fonts. However, recognizing handwritten characters remains a difficult challenge for machines because of the huge variation in individual writing styles.
There are 206,799 first names and 207,024 surnames, for 413,823 names in total. The data was divided into a training set (331,059), a testing set (41,382), and a validation set (41,382).
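As a quick sanity check on the figures above, the short Python sketch below adds up the stated counts and works out what share of the data each split holds. The variable and function names are illustrative only and are not part of the dataset.

```python
# Counts as stated in the dataset description.
first_names = 206_799
surnames = 207_024
splits = {"training": 331_059, "testing": 41_382, "validation": 41_382}

total = first_names + surnames

# The split sizes should account for every name exactly once.
assert sum(splits.values()) == total


def split_shares(counts):
    """Return each split's share of the total as a percentage (2 decimals)."""
    n = sum(counts.values())
    return {name: round(100 * size / n, 2) for name, size in counts.items()}


shares = split_shares(splits)
print(total)   # 413823
print(shares)  # {'training': 80.0, 'testing': 10.0, 'validation': 10.0}
```

In other words, the stated counts are consistent with each other and correspond to roughly an 80/10/10 split.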
Labels for all images, created via human-in-the-loop annotation on the Figure Eight platform, are also provided, enabling you to extend the dataset with your own data.
Below, you’ll find a link to the Figure Eight template used to transcribe these handwritten names. The “Duplicate Job” button above will take you to a template that follows this exact workflow.
The input data in this job is hundreds of thousands of images of handwritten names. In the “Data” tab above, you’ll find the transcribed images broken up into test, training, and validation sets.