danvk.org » programming

Finding Pictures in Pictures

danvk — Sun, 10 Feb 2013 06:11:42 +0000

Over the past month, I’ve been working with imagery from the NYPL’s Milstein Collection. Astute readers may have some guesses why. The images look something like this one:

There are two photos in this picture! They’re on cards set against a brown background. Other pictures in the Milstein gallery have one or three photos, with or without a white border:

To make something akin to OldSF, I’d need to write a program to find and extract each of the photos embedded in these pictures. It’s incredibly easy for our eyes to pick out the embedded photos, but this is deceptive. We’re really good at this sort of thing! Teaching a computer to do it makes you realize how non-trivial the problem is.

I started by converting the images to grayscale and running edge detection:

The white lines indicate places where there was an “edge” in the original image. It’s an impressive effect—almost like you hired someone to sketch the image. The details on the stoops are particularly cool:

The interesting bit for us isn’t the lines inside the photo so much as the white box around it. Running an edge detection algorithm brings it into stark relief. There are a number of image processing algorithms to detect lines, for example the Hough Transform or scikit-image’s probabilistic_hough. I’ve never been able to get these to work, however, and this ultimately proved to be a dead end.

A simple algorithm often works much better than high-powered computer vision algorithms like edge detection and the Hough Transform. In this case, I realized that there was, in fact, a much simpler way to do things.

The images are always on brown paper. So why not find the brown paper and call everything else the photos? To do this, I found the median color in each image, blurred it and called everything within an RMSE of 20 “brown”. I colored the brown pixels black and the non-brown pixels white. This left me with an image like this:

Now this is progress! The rectangles stand out clearly. Now it’s a matter of teaching the computer to find them.
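
The thresholding step can be sketched in a few lines of Python. This is a minimal illustration rather than the code behind the post: it assumes the image is a list of rows of (r, g, b) tuples, takes the median per channel, measures RMSE across the three channels, and leaves out the blur. The name brown_mask and its threshold parameter are made up for the example.

```python
from statistics import median


def brown_mask(pixels, threshold=20):
    """Return rows of 0 (background, "brown") and 1 (everything else).

    pixels is a list of rows, each row a list of (r, g, b) tuples.
    """
    flat = [p for row in pixels for p in row]
    # Assumption: the "median color" is the per-channel median.
    background = tuple(median(p[c] for p in flat) for c in range(3))

    def rmse(p):
        # Root-mean-square distance to the background over the 3 channels.
        return (sum((p[c] - background[c]) ** 2 for c in range(3)) / 3) ** 0.5

    return [[0 if rmse(p) <= threshold else 1 for p in row] for row in pixels]
```

Because the background color is measured from each image rather than hard-coded, the same threshold can be reused when the lighting shifts between scans.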

To do this, I used the following algorithm:

1. Pick a random white pixel, (x, y) (statistically, this is likely to be in a photo).
2. Call this a 1×1 rectangle.
3. Extend the rectangle out in all directions, so long as you keep adding new white pixels.
4. If this rectangle is larger than 100×100, record it as a photo.
5. Color the rectangle black.
6. If <90% of the image is black, go back to step 1.
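
Here is a rough Python sketch of those steps, not the linked source code. It assumes the black/white image is a list of rows of 0s and 1s (1 = white), and it reads "keep adding new white pixels" as "the new row or column contains at least one white pixel". The function name find_photos and its parameters are hypothetical.

```python
import random


def find_photos(mask, min_size=100, stop_black_fraction=0.9, rng=None):
    """Return (x1, y1, x2, y2) rectangles (inclusive) found in a 0/1 mask."""
    rng = rng or random.Random()
    height, width = len(mask), len(mask[0])
    grid = [list(row) for row in mask]  # work on a copy
    photos = []

    def has_white(x1, y1, x2, y2):
        return any(grid[y][x]
                   for y in range(y1, y2 + 1) for x in range(x1, x2 + 1))

    while True:
        white = [(x, y) for y in range(height) for x in range(width)
                 if grid[y][x]]
        black_fraction = 1 - len(white) / (width * height)
        if not white or black_fraction >= stop_black_fraction:
            break
        # Steps 1 and 2: a random white pixel is a 1x1 rectangle.
        x1, y1 = rng.choice(white)
        x2, y2 = x1, y1
        # Step 3: push each side out while the new strip adds white pixels.
        grew = True
        while grew:
            grew = False
            if x1 > 0 and has_white(x1 - 1, y1, x1 - 1, y2):
                x1 -= 1
                grew = True
            if x2 < width - 1 and has_white(x2 + 1, y1, x2 + 1, y2):
                x2 += 1
                grew = True
            if y1 > 0 and has_white(x1, y1 - 1, x2, y1 - 1):
                y1 -= 1
                grew = True
            if y2 < height - 1 and has_white(x1, y2 + 1, x2, y2 + 1):
                y2 += 1
                grew = True
        # Step 4: keep only rectangles bigger than min_size on both sides.
        if x2 - x1 + 1 > min_size and y2 - y1 + 1 > min_size:
            photos.append((x1, y1, x2, y2))
        # Step 5: color the rectangle black so it is not found again.
        for y in range(y1, y2 + 1):
            for x in range(x1, x2 + 1):
                grid[y][x] = 0
    return photos
```

Every pass blackens at least the seed pixel, so the loop always terminates. Rescanning the whole grid for white pixels on each pass is wasteful, but it keeps the sketch short.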

Eventually this should find all the photos. Here are the results on the original photo from the top of the post:

The red rectangles are those found by the algorithm. This has a few nice properties:

It naturally generalizes to images with 1, 2, 3, 4, etc. photos.
It still works well when the photos are slightly rotated.
It works for any background color (lighting conditions vary for each image).

There’s still some tweaking to do, but I’m really happy with how this algorithm has performed! You can find the source code here.

Developing the OldSF Slideshow

danvk — Tue, 22 Jan 2013 02:21:10 +0000

If you head over to oldsf.org, you’ll find a sleek new UI and a brand new slideshow feature. Here’s the before/after:

Locations like the Sutro Baths can have hundreds of photos. The slideshow lets you flip through them quickly.

As so often happens, what looked simple at first became more and more complex as I implemented it. Here’s how that process went for the OldSF update.

It started with Raven’s mock of the feature:

I started by looking for a JavaScript library that could do most of the heavy lifting for me. Raven’s mock shows a single big image in the center with bits of the previous and next images visible on either side. After finding lots of “slideshow” libraries that weren’t quite right, I realized that what I really wanted was called a “carousel”, not a “slideshow”. After making this conceptual breakthrough, I quickly settled on jCarousel.

The slideshow/carousel was barely in place before I ran into a new problem: the images kept shifting out of place! The issue is that jCarousel lays out all the images in your slideshow in a big long line, like so:

Most of the images are off-screen. To change the “active” image, jCarousel slides the whole strip to the left or right. To save bandwidth, I try not to load images that you’ll never see. An image only gets loaded when it appears on the screen. Before it loaded, I had no idea what its width was. The long strip of images really looked like this:

If an image turned out to be wider than expected, then the browser would push all the later images farther to the right, like so:

This wreaked havoc on the carousel’s layout. It could budge the center image all the way off the screen!

At first, I told jCarousel to redo its layout whenever a new image loaded. This mostly prevented the budging, but it had a nasty side effect. Images typically get loaded when you scroll through the slideshow. This scrolling is animated. But if jCarousel redid the layout, then the animation would suddenly stop and the motion would look very janky. My first thought was to prevent the relayout during animations, and this is what I did for our initial “launch”. But a few days later, Raven told me that she wasn’t consistently seeing the correct image when she copy/pasted links to the slideshow. The layout issues weren’t gone!

The source of all this complexity was the images in the slideshow with unknown widths. So the cleanest solution was to make them known! I added image width and height to the database and propagated them through to the client. This meant that the layout never had to change when an image loaded. Using the same schematic as above, the carousel looked like this:

With the layout fixed, all the hacks I’d written melted away and the slideshow links turned rock solid.
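
The reason known widths help is that every slide’s position becomes a pure function of data you already have. A small Python sketch of the idea (the real site does this in JavaScript via jCarousel; slide_offsets and strip_shift are hypothetical helpers, and gap is an assumed spacing between slides):

```python
from itertools import accumulate


def slide_offsets(widths, gap=0):
    """Left edge of each slide in the strip, given every slide's width."""
    if not widths:
        return []
    return [0] + list(accumulate(w + gap for w in widths))[:-1]


def strip_shift(widths, active, viewport_width, gap=0):
    """How far to shift the strip so that slide `active` is centered."""
    left = slide_offsets(widths, gap)[active]
    return viewport_width / 2 - (left + widths[active] / 2)
```

Nothing here depends on whether an image has finished loading, so a late-arriving image can no longer push its neighbors around.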

Another surprisingly tricky part involved moving the Google Maps navigation controls. In our new UI, the map uses the full window. The logo, date range selector and right-hand panel “float” above the map. When I first implemented it, I saw the mess you see to the left.

Uh-oh! I assumed this would be easy to fix—just shove the Google Maps navigation down a bit. But Google Maps doesn’t expose any CSS classes for its controls, so this turns out to be quite tricky. With some help from the API docs and this StackOverflow question I learned that the only way to do this is to create a small, invisible custom maps control which shoves all the other controls out of the way. Sheesh. The invisible control is outlined in this image:

Another fun issue came up with Street View. Having Street View on OldSF is quite useful, since it lets you do “now and then” comparisons. But when we went to the full-screen map layout, we ran into this annoyance:

That “x” button in the top right corner is the only way to get out of street view, and it’s covered by the right-hand panel. You’re stuck! The solution here was to find the events corresponding to entering and leaving Street View. When you go into Street View, we hide our UI elements. When you leave, we show them again. Raven has suggested that it’s nice to still see the images, so in the future I may just shove the right-hand panel down a bit or provide my own exit button.

Those were three of the most interesting issues I ran into while creating this new feature. There were many, many more. Nothing is ever so simple as it seems!

Lonely Hangouts

danvk — Mon, 25 Jun 2012 21:35:38 +0000

While working on Puzzle+, my crossword application for Google+ Hangouts, I couldn’t help but notice what a colossal pain it was to develop against the Hangouts API. It has a few things going against it:

Testing your changes requires pushing them to a remote HTTPS server.
Your application is buried in a ton of iframes, which makes the JS console harder to use.
Opening up Google+ Hangouts runs a browser plugin, turns on your camera, and makes your computer nice and toasty-hot.
It’s impossible to test multiplayer scenarios without multiple Google+ accounts and multiple computers (since opening a hangout requires exclusive access to your camera).

To make myself less sad, I developed a small node.js server which emulates the Google+ Hangouts API. This lets you do all your development (both single- and multi-player) locally, without any of the AV overhead that Hangouts usually bring in.

In case anyone else finds themselves in a similar predicament, I’ve released this code as Lonely Hangouts on github.

puzzle+: Crosswords for Google+

danvk — Thu, 17 May 2012 15:48:12 +0000

To solve a crossword with your friends in Google+, click this giant hangout button:

You’ll see something like this:

Click “Hang out” to invite everyone in your circles to help you with the puzzle. If you want to collaborate with just one or two people, click the “x” on “Your Circles” and then click your friends’ names on the right.

You’ll be prompted to either upload a .puz file or play one of the built-in Onion puzzles. You can get a free puzzle from the New York Times by clicking “Play in Across Lite” on this page.

With the puzzle downloaded, drag it into the drop area:

And now you’re off to the races! The big win of doing this in a Google+ hangout is that you get to video chat with your collaborators while you’re solving the puzzle, just like you would in person!

Astute readers will note that puzzle+ is a revival of lmnopuz for Google Shared Spaces, which was a revival of lmnowave (Crosswords for Google Wave), which was in turn a revival of Evan Martin and Dan Erat’s standalone lmnopuz. Hopefully the Google+ Hangouts API will be more long-lived than its predecessors.

Horizontal and Vertical Centering with CSS

danvk — Mon, 14 May 2012 17:04:47 +0000

I recently wanted to center some content both vertically and horizontally on a web page. I did not know in advance how large the content was, and I wanted it to work for any size browser window.

These two articles have everything you need to know about horizontal centering and vertical centering.

The two articles don’t actually combine the techniques, so I’ll do that here.

In the bad old days before CSS, you might accomplish this with tables:

<table width="100%" height="100%">
  <tr>
    <td align="center" valign="middle">
      Content goes here
    </td>
  </tr>
</table>

Simple enough! In the wonderful world of HTML5, you do the same thing by turning divs into tables using CSS. You need no fewer than three divs to pull this off:

<div class="container">
  <div class="middle">
    <div class="inner">
      Content goes here
    </div>
  </div>
</div>

And here’s the CSS:

.container {
  display: table;
  width: 100%;
  height: 100%;
}
.middle {
  display: table-cell;
  vertical-align: middle;
}
.inner {
  display: table;
  margin: 0 auto;
}

A few comments on why this works:

You can only apply vertical-align: middle to an element with display: table-cell. (Hence .middle)
You can only apply display: table-cell to an element inside of another element with display: table. (Hence .container)
Elements with display: block have 100% width by default. Setting display: table has the side effect of shrinking the div to fit its content, while still keeping it a block-level element. This, in turn, enables the margin: 0 auto trick. (Hence .inner)

I believe all three of these divs are genuinely necessary. For the common case that you want to center elements on the entire screen, you can make .container the body tag to get rid of one div.

In the future, this will get slightly easier with display: flexbox, a box model which makes infinitely more sense for layout than the existing CSS model. You can read about how to do horizontal and vertical centering using flexbox here.

Accurate hexadecimal to decimal conversion in JavaScript

danvk — Fri, 20 Jan 2012 23:05:20 +0000

A problem came up at work yesterday: I was creating a web page that received 64-bit hex numbers from one API. But it needed to pass them off to another API that expected decimal numbers.

Usually this would not be a problem — JavaScript has built-in functions for converting between hex and decimal:

parseInt("1234abcd", 16) = 305441741
(305441741).toString(16) = "1234abcd"

Unfortunately, for larger numbers, there’s a big problem lurking:

parseInt("123456789abcdef", 16) = 81985529216486900
(81985529216486900).toString(16) = "123456789abcdf0"

The last two digits are wrong. Why did these functions stop being inverses of one another?

The answer has to do with how JavaScript stores numbers. It uses 64-bit floating point representation for all numbers, even integers. This means that not every integer larger than 2^53 can be represented exactly. You can see this by evaluating:

(Math.pow(2, 53) + 1) - 1 = 9007199254740991

That ends with a 1, so whatever it is, it’s certainly not a power of 2. (It’s off by one).

To solve this problem, I wrote some very simple hex <-> decimal conversion functions which use arbitrary precision arithmetic. In particular, these will work for 64-bit numbers or 128-bit numbers. The code is only about 65 lines, so it’s much more lightweight than a full-fledged library for arbitrary precision arithmetic.

The algorithm is pretty cool. You can see a demo, read an explanation and get the code here:
http://danvk.org/hex2dec.html.
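
To give a flavor of the approach, here is a minimal sketch of hex-to-decimal conversion on digit arrays. It is written in Python purely for illustration and is not the code from the page above. Python integers are already arbitrary precision, so the sketch deliberately keeps every intermediate value small, the way a JavaScript version has to. The name hex_to_dec is hypothetical.

```python
def hex_to_dec(hex_str):
    """Convert a hex string to a decimal string, one digit at a time."""
    digits = [0]  # decimal digits of the result, least significant first
    for ch in hex_str.lower():
        # result = result * 16 + next hex digit, done digit by digit
        carry = int(ch, 16)
        for i in range(len(digits)):
            carry, digits[i] = divmod(digits[i] * 16 + carry, 10)
        while carry:
            carry, d = divmod(carry, 10)
            digits.append(d)
    return "".join(str(d) for d in reversed(digits))
```

No intermediate value ever exceeds 9 × 16 plus a small carry, so nothing comes close to 2^53. For example, hex_to_dec("123456789abcdef") returns "81985529216486895", whereas routing the same value through a 64-bit float rounds it to the neighboring multiple of 16.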

Takeaways from Stanford’s Machine Learning Class

danvk — Tue, 20 Dec 2011 00:04:33 +0000

Over the past two months, I’ve participated in Andrew Ng’s online Stanford Machine Learning class. It’s a very high-level overview of the field with an emphasis on applications and techniques, rather than theory. Since I just finished the last assignment, it’s a fine time to write down my thoughts on the class!

Overall, I’ve learned quite a bit about how ML is used in practice. Some highlights for me:

Gradient descent is a very general optimization technique. If you can calculate a function and its partial derivatives, you can use gradient descent. I was particularly impressed with the way we used it to train Neural Networks. We learned how the networks operated, but had no need to think about how to train them — we just used gradient descent.
There are many advanced “unconstrained optimization” algorithms which can be used as alternatives to gradient descent. These often have the advantage that you don’t need to tune parameters like a learning rate.
Regularization is used almost universally. I’d previously had very negative associations with using high-order polynomial features, since I most often saw them used in examples of overfitting. But I realize now that they are quite reasonable to add if you also make good use of regularization.
The backpropagation algorithm for Neural Networks is really just an efficient way to compute partial derivatives (for use by gradient descent and co).
Learning curves (plots of train/test error as a function of the number of examples) are a great way to figure out how to improve your ML algorithm. For example, if your training and test errors are both high, it means that you’re not overfitting your data set and there’s no point in gathering more data. What it does mean is that you need to add more features (e.g. the polynomial features which I used to fear) in order to increase your performance.
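
To make the first point concrete, here is a bare-bones gradient descent in Python. It is a sketch under simple assumptions (a fixed learning rate and a fixed number of steps), not code from the class. The function and parameter names are made up for the example.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=1000):
    """Minimize a function given only its gradient.

    grad takes a list of parameters and returns the list of partial
    derivatives at that point.
    """
    x = list(start)
    for _ in range(steps):
        g = grad(x)
        # Step downhill along each coordinate.
        x = [xi - learning_rate * gi for xi, gi in zip(x, g)]
    return x


def bowl_gradient(p):
    """Gradient of f(x, y) = (x - 3)^2 + (y + 1)^2, minimized at (3, -1)."""
    return [2 * (p[0] - 3), 2 * (p[1] + 1)]


minimum = gradient_descent(bowl_gradient, [0.0, 0.0])
```

The optimizer never looks inside the function it is minimizing. That is why the same loop works for a neural network once backpropagation supplies the partial derivatives.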

The other takeaway is that, as in many fields, there are many “tricks of the trade” in Machine Learning. These are bits of knowledge that aren’t part of the core theory, but which are still enormously helpful for solving real-world problems.

As an example, consider the last problem in the course: Photo OCR. The problem is to take an image like this:

and extract all the text: “LULA B’s ANTIQUE MALL”, “LULA B’s”, “OPEN” and “Lula B’s”. Initially, this seems quite daunting. Machine Learning is clearly relevant here, but how do you break it down into concrete problems which can be attacked using ML techniques? You don’t know where the text is and you don’t even have a rough idea of the text’s size.

This is where the “tricks” come in. Binary classifiers are the “hammer” of ML. You can write a binary classifier to determine whether a fixed-size rectangle contains text:

Positive examples
Negative examples

You then run this classifier over thousands of different “windows” in the main image. This tells you where all the bits of text are. If you ignore all the non-contiguous areas, you have a pretty good sense of the bounding boxes for the text in the image.
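
The sliding-window step might be sketched like this in Python. The classifier is passed in as a plain function, is_text, which stands in for whatever trained binary classifier you have. The image is assumed to be a list of rows, and all names here are hypothetical.

```python
def sliding_windows(img_w, img_h, win_w, win_h, stride):
    """Yield (left, top, right, bottom) boxes covering the image."""
    for top in range(0, img_h - win_h + 1, stride):
        for left in range(0, img_w - win_w + 1, stride):
            yield (left, top, left + win_w, top + win_h)


def detect(image, is_text, win_w, win_h, stride):
    """Return every window for which the classifier says "text"."""
    height, width = len(image), len(image[0])
    hits = []
    for left, top, right, bottom in sliding_windows(
            width, height, win_w, win_h, stride):
        patch = [row[left:right] for row in image[top:bottom]]
        if is_text(patch):
            hits.append((left, top, right, bottom))
    return hits
```

Since you don’t know the text’s size in advance, in practice you would repeat this for several window sizes and rescale each patch to the fixed size the classifier expects.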

But even given the text boxes, how do you recognize the characters? Time for another trick! We can build a binary classifier to detect a gap between letters in the center of a fixed-size rectangle:

Positive examples
Negative examples

If we slide this along, it will tell us where each character starts and ends. So we can chop the text box up into character boxes. Once we’ve done that, classifying characters in a fixed-size rectangle is another concrete problem which can be tackled with Neural Networks or the like.

In an ML class, you’re presented with this pipeline of ML algorithms for the Photo OCR problem. It makes sense. It reduces the real-world problem into three nice clean, theoretical problems. In the class, you’d likely spend most of your time talking about those three concrete problems. In retrospect, the pipeline seems as natural as could be.

But if you were given the Photo OCR problem in the real world, you might never come up with this breakdown. Unless you knew the trick! And the only way to learn tricks like this is to see them used. And that’s my final takeaway from this practical ML class: familiarity with a vastly larger set of ML tricks.

Java, Ten Years Later

danvk — Sat, 05 Nov 2011 20:33:54 +0000

It’s been almost ten years since I’ve actively used the Java programming language. In the meantime, I’ve mostly used C++. I’ve had to pick up a bit of Java again recently. Here are a few of the things that I found surprising or notable. These are all variants on “that’s changed in the last ten years” or “that’s not how C++ does it.”

The Java compiler enforces what would be conventions in C++.
For example, “public class Foo” has to be in Foo.java. In C++, this would just be a convention. You can declare additional classes without the “public” modifier when you’re playing around with test code and want to use only a single file. Similarly, class foo.Bar needs to be in “foo/Bar.java”.

Java Packages are a more pervasive concept than namespaces in C++.
There’s a “default package”, but using this prevents you from loading classes by name: Class.forName(“Foo”) won’t work, but Class.forName(“package.Foo”) will. Classes in your current package are auto-imported, which surprised me at first. The default visibility for methods/fields in Java is “package private”, which has no analogue in C++.

Java keeps much more type information at runtime than C++ does.
The reflection features (Class.getMethods(), Method.getParameterTypes(), etc.) have no equivalent in C++. This leads to some seemingly-magical behaviors, e.g. naming a method “foo” in a Servlet can cause it to be served at “/foo” without you saying anything else. Not all information is kept though: you can get a list of all packages, but not a list of all classes in a package. You can request a class by its name, but you can’t get a list of all classes. You can get a list of all the method names in a class, but you can’t get a list of all the parameter names in a method.

Java enums are far richer than C/C++ enums.
enums in Java are more like classes: they can have constructors, methods, fields, even per-value method implementations. I really like this. Examples:

public enum Suit {
  CLUB("C"), DIAMOND("D"), HEART("H"), SPADE("S");

  private String shortName;

  private Suit(String shortName) {
    this.shortName = shortName;
  }

  public String toString() {
    return shortName;
  }
}

Java is OK with a two-tier type system.
At its core, C++ is an attempt to put user-defined types on an equal footing with built-in types like int and char. This is in no way a goal of Java, which is quite content to have a two-tier system of primitive and non-primitive types. This means that you can’t do Map<int, int>, for instance. You have to do Map<Integer, Integer>. Autoboxing makes this less painful, but it’s still a wart in the language that you have to be aware of.

One concrete example of this is the “array[index]” notation. In C++, this is also used for maps. There’s no way to do this in Java, and I really miss it. Compare:

map[key] += 1;

map.put(key, 1 + map.get(key));

which has more boilerplate and is more error-prone, since you might accidentally do:

map.put(key, 1 + other_map.get(key));

The designers of Java Generics learned from the chaos of C++ templates.
Generic classes in Java are always templated on types: no more insane error messages. You can even say what interface the type has to implement. And there’s no equivalent of method specialization, a C++ feature which is often misused.

Variables/fields in Java behave more like C++ pointers than C++ values.
This is a particular gotcha for a field. For example, in C++:

class C {
 public:
  C() {
    // foo_ is already constructed and usable here.
  }
 private:
  Foo foo_;
};

But in Java:

class C {
  public C() {
    // foo is null here. We have to do foo = new Foo();
  }
  private Foo foo;
}

Java constructors always require a trailing (), even if they take no parameters.
This is a minor gotcha, but one I find myself running into frequently. It’s “new Foo()” instead of “new Foo” (which is acceptable in C++).

The Java foreach loop is fantastic
Compare

for (String arg : args) { ... }

for (Set<string>::const_iterator it = args.begin(); it != args.end(); ++it) { ... }

The “static {}” construct is nice
This lets you write code to initialize static variables. It has no clear analogue in C++. To use the Suit example above,

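private static HashMap<String, Suit> name_to_suit;
static {
  // Without this line, name_to_suit would still be null below.
  name_to_suit = new HashMap<String, Suit>();
  for (Suit s : Suit.values()) {
    name_to_suit.put(s.toString(), s);
  }
}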

The new features (Generics, enums, autoboxing) that Java has gained in the last ten years make it much more pleasant to use.

Crosscountry Crosswords

danvk — Sun, 27 Mar 2011 21:07:05 +0000

It’s been almost a year since I introduced lmnowave, the collaborative crossword puzzle gadget for Google Wave. A lot has happened in that past year, not least the cancelation of Wave.

First, to clear up some confusion. It’s not “I’m no wave”, it’s “L-M-N-O-Wave”, which is a play on “L-M-N-O-Puz”, aka lmnopuz, the software on which my collaborative crossword system is based. Only a few dozen people ever saw lmnopuz, so no one got the joke. And I realized after releasing it that, by changing ‘puz’ -> ‘wave’, I’d taken away any hint of what my wave gadget actually did. A bad name. Oh well.

In August, Google announced that Wave was canceled. This seemed to be the end of lmnowave. Sure, Wave was still usable. But the life had been sucked out of the project. This was quite disappointing to me, since I’d spent a fair bit of my own time developing the crossword gadget.

Then, in mid-December, Douwe Osinga introduced the oddly-named Google Shared Spaces. It’s an attempt to salvage the Wave gadget code, to let it live outside of Wave.

For lmnopuz, it’s perfect. Here’s the lmnowave shared space. You can use it to collaborate on crosswords with your friends, just like you could with lmnowave. In some ways, it’s even better, since the Wave UI is stripped away and you can focus on your puzzle. To do crosscountry crosswords, my friend and I open up a shared space and call each other on Skype. The combination works really well.

What does the future hold for lmnowave? It’s a bit unclear. I may turn it into a Facebook game, or perhaps use it to learn how to write applications for the Mac App Store.

Enjoy!

Commacopy

danvk — Wed, 09 Mar 2011 15:16:08 +0000

At work, I often see web pages that display large numbers like so:

num-bytes	1,234,567,890
num-entries	123,456,789

Including the commas in the display makes the numbers easier to read. But it does have a downside. Say you want to calculate the average number of bytes per entry. If you copy/paste the numbers above, the commas will prevent most programming languages (e.g. Python or bc) from interpreting them correctly.

My coworker Dan came up with a great solution to this conundrum using CSS. Try copy/pasting these numbers over into the text box:

1234 or 2345
-12345.67
-123456789

The commas don’t copy! Best of both worlds!

You can view source to see how it works, but let’s jump straight to the goodies:

Bookmarklet: commacopy

Unobtrusive JavaScript: commacopy.js

To use the bookmarklet, drag it to your browser’s bookmarks toolbar. If you click it, it will silently convert all numbers containing commas on the current page to the fancy copy/pasteable commas. This should really be a Chrome extension that runs on every page, but I’ll leave that as an exercise for the reader.

To use the unobtrusive JS, make a copy of commacopy.js and include it in your page via: