Generating Training Data

Posted in Uncategorized at 11:26 am by danvk

As you may remember from a previous post, I’ve been doing some work with a collection of old images. The first problem was to write a program to find the individual photos in images like these:


This is easy for humans, but hard for a machine!

A key part of developing any heuristic algorithm like this one is to get some training data. You find the correct answer by hand for some fraction of the data, then judge your program by seeing how its results compare to the “golden” data.

For my photo detection project, the golden data might look like this:


You could generate this sort of data by hand using a photo inspector and a text editor. But it would be tremendously tedious. You wouldn’t want to do this for 100 images unless you were being paid. For a personal project, it’s a non-starter.

A little bit of usability work here can go a long way. For this project, I built a simple web tool using my localturk service. It’s a tool which helps you step through repetitive tasks using a web browser. It exposes the exact same API as Amazon’s Mechanical Turk: CSV input + HTML TemplateCSV output. But it runs on your own machine and you do the work yourself. No external turkers or exchange of money involved.

My tool shows the original image on the page and asks you to drag rectangles across the individual photos. You can resize or move the rectangles after you draw them:


Click the image to try this tool in your browser.

localturk records the responses in a a CSV output file which you can use as your golden data. The whole process is done visually in your browser.

I’d estimate I spent maybe an hour creating that template and half an hour stepping through the 100 photos. This may or may not compare favorably to the photo inspector and text editor process I described above, but it was certainly more enjoyable.

In his Machine Learning class, Andrew Ng says that you should ask “how hard would it be to get 10x more training data?” With this fancier system, it would take a few hours. Or I could upload 1,000 tasks to Mechanical Turk and trade time for money.

How do other ML people generate small amounts of training data?