danvk.org blog

After 20 Years, the Globally Optimal Boggle Board

2025-04-23T00:00:00+00:00

Exciting news! This is the best possible Boggle board:

Boggle is a word search game. You form words by connecting adjacent letters, including along diagonals. Longer words score more points. Good words on this board include STRANGERS and PLASTERING. After you spend three minutes trying to find as many words as you can, you’ll be struck by just how good computers are at this.

Using the ENABLE2K word list, this board has 3,625 points on it coming from 1,045 words. This board has more points than any other. Try any other combination of letters and you’ll get a lower score. While I’ve long suspected this board was the winner, I’ve now proven it via exhaustive search.

Many people have searched for high-scoring boards before, but no one has ever constructed a computational proof that they’ve found the best one. This is a new, first of its kind result for Boggle.

To see why this is interesting, let’s go back to the 1980s.

High-Scoring Boggle and Local Optima

With the release of the Apple II (1977) and IBM PC (1981), computers became accessible to hobbyists, including word game enthusiasts. In 1982, Alan Frank published a short article in Word Ways magazine called High-Scoring Boggle. It’s the earliest work I’ve found on Boggle maximization, and it’s instructive on why this is a hard problem.

Here’s what he wrote:

The article goes on to list the 769 words that add up to 2,047 points. You can browse the words on that board using the fifth edition of OSPD here: gnisetrpseacdbls. (Thanks to the addition of new words, it’s increased to 2,226 points.)

The article doesn’t explain how Alan and Steve came up with this particular board, but I suspect they used a hill climbing procedure. The idea is simple: start with a random board and find its score. Tweak a letter and see if the score improves. If so, keep it. If not, discard the change. Repeat until you stall out. You’ll eventually wind up with a high-scoring board.
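Here is a minimal sketch of that hill climbing loop. It is not Alan and Steve's procedure (their code isn't available), and `toy_score` is a made-up stand-in for a real Boggle scorer so that the example runs on its own. The `max_stalls` cutoff is likewise an assumption about what "stalling out" means.

```python
import random
import string

def hill_climb(board, score, rng, max_stalls=1000):
    """Greedy single-letter hill climbing. Returns (board, score)."""
    best, best_score = board, score(board)
    stalls = 0
    while stalls < max_stalls:
        # Tweak one randomly chosen cell.
        i = rng.randrange(len(best))
        candidate = best[:i] + rng.choice(string.ascii_lowercase) + best[i + 1:]
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score  # keep the change
            stalls = 0
        else:
            stalls += 1                                    # discard it
    return best, best_score

def toy_score(board):
    # Hypothetical scorer: counts vowels. A real run would score Boggle words.
    return sum(ch in "aeiou" for ch in board)

start = "bcdfghjklmnpqrst"
final, final_score = hill_climb(start, toy_score, random.Random(0))
```

Because a change is only kept when it strictly improves the score, the loop can never walk downhill, which is exactly what leaves it exposed to local maxima.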

Writing a Boggle solver and finding this board was a real achievement in 1982. But unfortunately for Alan and Steve, their board is not “the highest-scoring one.” It’s not even close. The board pictured at the top of this article scores 3,736 points using OSPD5, vastly more than theirs.

So what went wrong? It’s hard to say without their code, but I have a hunch. The bane of hill climbing is the local maximum:

It’s easy for a hill climber to find a local max, rather than the global max. Presumably Alan and Steve’s board is locally optimal, and small changes can’t improve it. But there’s still a much better board out there; you just have to descend your small hill first before you can climb a taller one. Local optimization can always fail in this way. You may just be looking in the wrong neighborhood.

Deeper searches

I wrote my first Boggle Solver in 2004 and quickly got interested in using hill climbing and simulated annealing to find high-scoring boards. I wasn’t the only one.

Boggle programs in the 2000s had some major advantages over Alan and Steve’s from 1982. Memory was much cheaper, CPUs were much faster, and the internet made it much easier to get word lists.

This meant that you could do “deeper” searches:

Instead of changing just one cell at a time, expand the search radius by changing 2, 3, or 4.
Instead of tracking just a single best board, track hundreds of high-scoring candidates.
Instead of doing just a handful of hillclimbing runs, do millions.

Using a process like this, I found that, whatever board I started with, I always wound up with one of a handful of high-scoring boards, including our favorite 3,625 pointer.

This suggested that this board might just be the global max. But still, we could be falling into the same trap as before. The true global optimum might just be hard to find this way.

The only way to know for sure is via exhaustive search. And unfortunately, at least at first glance, this seems completely impossible.

The Impossibility of Exhaustive Search

There are an astronomically large number of possible Boggle boards. How many? If any letter can appear in any position, then there are roughly

26^16/8 = 5,451,092,862,428,609,257,472 = 5.45*10^21

possible boards. (The factor of 8 comes from symmetry; not all boards can be rolled with real Boggle dice, but this is within an order of magnitude.)
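The count above is easy to reproduce with Python's arbitrary-precision integers. Dividing by 8 treats every board as having eight distinct rotations and reflections, which is only approximately true, so this is an estimate rather than an exact count.

```python
ALPHABET = 26
CELLS = 16
SYMMETRIES = 8  # rotations and reflections of a square

num_boards = ALPHABET ** CELLS // SYMMETRIES
print(f"{num_boards:,}")    # 5,451,092,862,428,609,257,472
print(f"{num_boards:.2e}")  # 5.45e+21
```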

It’s possible to find all the words on a Boggle board very quickly using a Trie data structure. On my M2 MacBook, I can score around 200,000 boards/sec. Still, at that pace, testing every board would take around 800 million years!
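For readers who haven't seen one, here is a small sketch of a Trie-based Boggle scorer. It is an illustration rather than the solver from the repo: it assumes a row-major board string, ignores the "Qu" die, and all function names are made up. Scoring follows the standard Boggle rules by word length.

```python
def build_trie(words):
    """Nested-dict trie; the key "$" marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = word
    return root

def word_points(word):
    """Standard Boggle scoring by word length."""
    n = len(word)
    if n < 3:
        return 0
    return {3: 1, 4: 1, 5: 2, 6: 3, 7: 5}.get(n, 11)

def neighbors(i, width, height):
    """Indices of the up-to-eight cells adjacent to cell i (row-major)."""
    r, c = divmod(i, width)
    return [
        rr * width + cc
        for rr in range(max(r - 1, 0), min(r + 2, height))
        for cc in range(max(c - 1, 0), min(c + 2, width))
        if (rr, cc) != (r, c)
    ]

def find_words(board, trie, width=4, height=4):
    """Return the set of distinct words that can be traced on the board."""
    found = set()

    def dfs(i, node, used):
        child = node.get(board[i])
        if child is None:      # no word starts with this prefix: prune
            return
        if "$" in child:
            found.add(child["$"])
        for j in neighbors(i, width, height):
            if j not in used:
                dfs(j, child, used | {j})

    for start in range(len(board)):
        dfs(start, trie, {start})
    return found

def score_board(board, trie, width=4, height=4):
    return sum(word_points(w) for w in find_words(board, trie, width, height))

trie = build_trie(["fab", "glop", "knife", "bag"])
board = "abcdefghijklmnop"
print(sorted(find_words(board, trie)))  # ['fab', 'glop', 'knife']
print(score_board(board, trie))         # 4
```

The prefix pruning in `dfs` is what makes this fast: most paths die after two or three letters because no word starts that way.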

Fortunately, there’s a more clever way to structure the search.

Branch and Bound

There are just too many boards to look at each one. Even enumerating all of them would be too slow. Instead, we need to group boards together into a “board class.” Then we can calculate an upper bound on the highest-scoring board in each class. If this upper bound is lower than 3625, we can toss out the entire class without having to test any of the individual boards in it. If not, we need to split the class and try again.
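The loop described above can be sketched generically. This is a toy, not the search from the repo: the board class is a tuple of candidate-letter strings, and `toy_score` and `toy_bound` are invented so the example can be checked against brute force. The only requirement on the bound is that it never underestimates any board in the class.

```python
from itertools import product

def branch_and_bound(board_class, upper_bound, score, cutoff):
    """Return every board in board_class scoring at least `cutoff`.

    board_class: tuple of strings, the letters allowed in each cell.
    upper_bound: function(board_class) -> a value >= score(b) for every
                 board b in the class.
    """
    if upper_bound(board_class) < cutoff:
        return []                      # whole class eliminated at once
    # Find a cell that still has more than one candidate letter.
    for i, letters in enumerate(board_class):
        if len(letters) > 1:
            break
    else:
        board = "".join(board_class)   # the class holds exactly one board
        return [board] if score(board) >= cutoff else []
    # Split on that cell and recurse into each smaller class.
    results = []
    for letter in letters:
        sub = board_class[:i] + (letter,) + board_class[i + 1:]
        results += branch_and_bound(sub, upper_bound, score, cutoff)
    return results

VALUE = {"a": 3, "b": 1, "c": 2, "d": 5}  # made-up letter values

def toy_score(board):
    # Letter values, minus 4 for each adjacent repeated letter.
    repeats = sum(x == y for x, y in zip(board, board[1:]))
    return sum(VALUE[ch] for ch in board) - 4 * repeats

def toy_bound(board_class):
    # Ignores the penalty, so it can only overestimate.
    return sum(max(VALUE[ch] for ch in letters) for letters in board_class)

winners = branch_and_bound(("abcd",) * 4, toy_bound, toy_score, cutoff=16)
brute = ["".join(b) for b in product("abcd", repeat=4)
         if toy_score("".join(b)) >= 16]
print(winners, sorted(winners) == sorted(brute))  # ['adad', 'dada'] True
```

The tighter the bound, the earlier whole classes get thrown away, which is why so much of the real work goes into the bound rather than the loop.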

This technique is known as Branch and Bound, and it was first developed way back in the 1960s. B&B is more of a strategy than a concrete algorithm, and it leaves a lot of details to fill in. The clever bits of applying this approach to Boggle are:

An appropriate way to partition the space of Boggle boards into board classes
An upper bound that’s fast to compute but still reasonably “tight”
A way to split board classes and calculate their upper bounds without repeating work

A “board class” might contain trillions of individual boards. An example would be boards with a particular consonant/vowel pattern. There are roughly 2^16/8=8192 possible consonant/vowel patterns, vastly fewer than the number of boards. And you can imagine that it’s easy to rule out boards with all consonants or all vowels. Other patterns are much harder, though. (My search didn’t exactly use consonants and vowels; this is just an illustration.) For more on board classes, read this post.

The second and third ideas required developing a somewhat novel tree structure tailor-made for Boggle. These “sum/choice” trees make it efficient both to calculate upper bounds and to split board classes. You can see examples of these trees and read about how they work in this post.

If you’d like to learn more about these algorithms and data structures, I’d encourage you to run the code on your own machine, read the work-in-progress paper about this result and methodology, and read some of my previous blog posts.

The results

I developed and tested the search algorithm on 3x3 and 3x4 Boggle, which are much easier problems. Then I ran it on 4x4 Boggle.

Using a 192-core c4 on Google Cloud, it took about 5 days to check around 1 million 4x4 board classes (~23,000 CPU hours). This is about $1,200 of compute. That’s a lot, but it’s not a crazy amount. (Fortunately I had a friend in BigTech with CPUs to spare.)

The result was a list of all the Boggle boards (up to symmetry) that score 3500+ points using the ENABLE2K word list. There were 32 of them. Here are the top five:

plsteaiertnrsges (3625 points)
splseaiertnrsges (3603 points)
gntseaieplrdsees (3593 points)
dresenilstapares (3591 points)
dplcseainrtngies (3591 points)

You can see the rest here. These boards are rich in high-value endings like -ING, -ER, -ED, and -S.

The top boards were all ones that I’d previously found by hillclimbing.

What did we learn about the problem?

Hill Climbing Works. If you search deeply enough, the globally optimal Boggle board can be found via hill climbing and simulated annealing. This doesn’t come as a huge surprise: the space of Boggle boards is “smooth” in that making small changes to one high-scoring board tends to give you another high-scoring board. But this is hand-wavy, and now we know for sure!
This is NP-Hard. Finding the highest-scoring board in a board class is likely an NP-Hard problem. Fortunately, N is small (4x4=16) and the tailor-made code is able to solve this orders of magnitude faster than general ILP solvers.

Questions and Answers

Does this use AI? It’s 2025, and yet this project made very little use of AI. The runtime is classic data structures and algorithms, all CPU and no GPU. GitHub Copilot was helpful for translating parts of the Python prototype to C++ and for small coding tasks.
Can this board be rolled with real Boggle dice? Yes. (See photo for proof!) All the highest-scoring boards can be rolled with both old (pre-1987) and new Boggle dice. My search included all combinations of letters, not just the ones that could actually be rolled.
What are your odds of rolling this board? Vanishingly low! I believe they’re around 1 in 10^19, which is in the ballpark of the number of stars in the universe. You’re better off playing the lottery.
What about the letter Q? One of the Boggle dice has a “Qu” on it, and my search allowed any of the cells to be “Qu”. Not surprisingly, the highest-scoring boards had no Qu on them. For ENABLE2K, the best board I’m aware of containing a Qu is cinglateperssidq (3260 points), where the Qu is a dead cell. The best I know of that actually uses the Qu is gepsnaletiresedq (3199 points), which contains QUEER, QUEEREST, etc.
What about other wordlists? The best board depends on the dictionary you use, though the variations are slight. For example, the best board for the OSPD Scrabble Dictionary is likely splseaiertnrsges (3827 points), which is the second-best board for ENABLE2K (3603 points). The GitHub repo has a breakdown by wordlist. Only the result for ENABLE2K has been proven.
What are the other high-scoring boards? Here’s a complete list of boards with 3500+ points using ENABLE2K. Many of these are one or two letter variations on each other, but some are quite distinct.
Why did this happen now? This could have been done at any point in the last 10–20 years. But it was easier today because of the widespread availability of cloud computing. It also helped that I had some free time to devote to this problem.
Can this be GPU accelerated? People have been asking me about this since 2009. While it’s possible that there’s some version of Boggle that can be GPU accelerated, this isn’t it. The algorithm is too tree-y and branchy. There’s lots of coarse parallelism available, but very little fine-grained parallelism.
What about other (human) languages? I’ve only run this for English, but you’re welcome to try running the code yourself for other languages. I hear Polish Boggle is interesting!

What tools were used?

The code is a mixture of C++ for performance-critical parts and Python for everything else. They’re glued together using pybind11, which I’m a big fan of.

If you’d like to run the code or learn more, check out the GitHub repo.

What if there’s a bug?

I’d cry. 😭 While I’d never rule out the possibility of a bug, there are several reasons to believe that this computational proof is correct:

It matches the highest-scoring boards found by exhaustive search on 2x2 and 2x3 Boggle, where this is feasible.
It matches the highest-scoring boards found by exhaustive search within a single board class for 3x4 Boggle.
It finds all the best boards that I’ve found via hill climbing for 3x3, 3x4 and 4x4 Boggle.
The tree operations preserve an invariant on the score that suggests they are valid.

What’s next?

I have a few more ideas for incremental optimizations. But I’ve been hacking away at this problem for at least three months, and this seems like a good place to stop. I wasn’t sure that 4x4 Boggle would ever be “solved” in this way, and it’s immensely satisfying to knock out a problem that’s been in the back of my mind for nearly 20 years.

I do intend to write a paper explaining what I’ve done more formally, as well as another post with my thoughts on this whole experience. You can find an in-progress draft of the paper in the GitHub repo.

The top-scoring boards for other word lists still need to be proven. Hasbro also sells a 5x5 and 6x6 version of Boggle. These are astronomically harder problems than 4x4 Boggle, and will likely have to wait for another generation of computers and tools. The best board I’ve found via hillclimbing for 5x5 Boggle is sepesdsracietilmanesligdr. The results of this exploration suggest there’s a good chance this is also the global optimum.

Boggle Revisited: Following up on an insight

2025-04-10T00:00:00+00:00

My last Boggle post presented an exciting insight that yielded a 30x speedup. This meant two things:

I could find the globally-optimal 3x4 Boggle board in ~19 hours using three cores on my laptop, rather than 8-9 hours on a 192-core cloud instance.
My cost estimate for finding the globally-optimal 4x4 Boggle board dropped from $500,000→$15,000.

That was still more than I was willing to pay, but it brought me within a factor of 10 of the ultimate goal.

I was able to find another ~10x of optimizations, and I was able to find the best 4x4 board (more on this soon!). What got me there wasn’t an exciting insight. Instead, it was a series of incremental wins that stacked together nicely. This post presents four of them:

Orderly Bound
Lift → Orderly Force / Merge
Inline Child Nodes and Arenas
Variable-depth Switching

Orderly Bound

The last post presented Orderly Trees. Here’s what one looked like for a 2x2 board class:

This represents the board class ab cd ef gh, which contains sixteen possible Boggle boards. The blue node is ab, the green nodes are cd, the orange nodes are ef and the yellow nodes are gh.

To calculate the upper bound for a particular board, you take the branches corresponding to its letters and sum them up. Here’s adeg, for example:

To get the Multiboggle score, you add up all the points on the dark, parenthesized cells. So 3 + 2 + 2 + 1 = 8, which is the correct bound.

In practice, we usually want to partially evaluate boards. So we first try making a choice between “A” and “B” for the first cell, and use the tree to calculate an upper bound for both a cd ef gh and b cd ef gh. If either of these has a bound less than our best known score, we’re done. If not, we need to split the next cell: a c ef gh and a d ef gh, etc.

Previously, I calculated these bounds by traversing the tree independently for each set of forced cells. This has low memory cost, but it does lots of duplicated work. There are many identical calculations in the bounds for ab cd ef gh and a cd ef gh, for example.
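To make the "traverse the tree independently for each set of forced cells" approach concrete, here is a small sketch. The tree encoding (`(points, {cell: {letter: node}})`), the hand-built two-cell tree, and the function names are all assumptions for illustration; they are not the repo's data structures.

```python
def bound(node, forced):
    """Upper bound for a sum node, honoring any forced cells.

    node:   (points, {cell: {letter: child_node}})
    forced: {cell: letter} for cells pinned to a single letter
    """
    points, choices = node
    total = points
    for cell, by_letter in choices.items():
        if cell in forced:
            child = by_letter.get(forced[cell])
            total += bound(child, forced) if child is not None else 0
        else:
            total += max(bound(child, forced) for child in by_letter.values())
    return total

def leaf(points):
    return (points, {})

# Hypothetical orderly tree for a two-cell class: cell 0 in {a,b}, cell 1 in {c,d}.
tree = (0, {
    0: {"a": (1, {1: {"c": leaf(2), "d": leaf(0)}}),
        "b": (0, {1: {"d": leaf(3)}})},
    1: {"c": leaf(1)},
})

def boards_above(tree, board_class, cutoff, forced=None):
    """Yield (board, bound) for boards whose bound is >= cutoff.

    Cells are forced in order, and the whole tree is traversed again
    for every set of forced cells (the duplicated work described above).
    """
    forced = forced or {}
    b = bound(tree, forced)
    if b < cutoff:
        return
    if len(forced) == len(board_class):
        yield "".join(forced[i] for i in range(len(board_class))), b
        return
    cell = len(forced)
    for letter in board_class[cell]:
        yield from boards_above(tree, board_class, cutoff, {**forced, cell: letter})

print(bound(tree, {}))                          # 4
print(bound(tree, {0: "b"}))                    # 4
print(bound(tree, {0: "b", 1: "d"}))            # 3
print(list(boards_above(tree, ["ab", "cd"], 3)))  # [('ac', 4), ('bd', 3)]
```

Note how forcing only cell 0 to `b` still reports 4: the two green choices disagree about cell 1 until it is forced as well.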

The new “orderly bound” algorithm does this more efficiently by taking advantage of the “orderly” structure of the tree. The idea is to maintain stacks of pointers, organized by cell. So there’s one stack of blue pointers, one stack of green pointers, etc. To force the blue cell to be a, you pop it off the stack (there’s only one blue cell) and push the next green and orange choice cells you see onto their stacks. You can keep track of the bound on the board class as you do this, which lets you bail out as early as possible.

Here’s an animation of how that looks for boards in ab cd ef gh with a bound > 7:

The active nodes in the stacks have a * next to them. Sum nodes with parentheses indicate where the algorithm is advancing. This is kind of like a DFS.

This was something like a 10x speedup over the previous system, which translated into an overall 2-3x speedup on 3x4 board classes. I’d initially hoped that this system would let me get rid of the lifting operation entirely, but that didn’t pan out. The best approach was to lift a few times and then run OrderlyBound. Hybrid always wins.

This puzzled me for a while because I mistakenly thought that OrderlyBound was linear. But I eventually realized that, while it only visits each node once in this 2x2 example, it has pretty heinous backtracking behavior for larger boards. Lifting helps to mitigate that.

Lift → Orderly Force / Merge

The sequence for upper-bounding a class of Boggle boards is:

Build an Orderly Tree for that board class.
Do a few lift operations to synchronize choices across subtrees.
Call OrderlyBound.

Creating an “Orderly Bound” that was tailor-made for orderly trees was a big win. So what about a tailor-made lift operation?

While I was able to implement something like this, the bigger win came from reevaluating the decisions that had led me to use the “lift” operation in the first place. Recall from my earlier post that “lifting” pivots a single choice node all the way to the top of the tree.

But if you have two choices for a cell (say a and b), then an alternative is to produce two trees, one with that cell set to a and another with that cell set to b. I call this a “force” operation. In the past, I preferred “lift” to “force” because it let me compress and deduplicate subtrees.

When I switched to orderly trees, however, compression and deduplication stopped being helpful. So I threw them out and switched from “lift” back to “force.” Dropping the fields required for deduplication was a huge RAM savings.

For an orderly tree, forcing a cell winds up being mostly a “merge” operation on subtrees. For example, to force the first cell (blue) to be a, we merge the top subtree, which corresponds to a, and the lower subtree that starts with a green cell, which corresponds to words that don’t use the first cell (namely “GED”, which is apparently a type of fish).

This merge operation winds up being quite efficient to implement using iterators that advance in lockstep.
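As a rough sketch of force-as-merge, here is a dictionary-based version. It only handles forcing the lowest-numbered cell at the root, it uses plain dicts rather than lockstep iterators, and the tree encoding and the small example tree are assumptions made for illustration.

```python
def bound(node):
    """Max/sum bound of a node encoded as (points, {cell: {letter: node}})."""
    points, choices = node
    return points + sum(
        max(bound(child) for child in by_letter.values())
        for by_letter in choices.values()
    )

def merge(a, b):
    """Merge two sum nodes: add points, then combine choices cell by cell."""
    children = {}
    for cell in sorted(a[1].keys() | b[1].keys()):
        la, lb = a[1].get(cell, {}), b[1].get(cell, {})
        children[cell] = {}
        for letter in sorted(la.keys() | lb.keys()):
            if letter in la and letter in lb:
                children[cell][letter] = merge(la[letter], lb[letter])
            else:
                children[cell][letter] = la[letter] if letter in la else lb[letter]
    return (a[0] + b[0], children)

def force_first_cell(root, letter):
    """Tree for the sub-class where the root's first cell is fixed to `letter`."""
    points, choices = root
    first = min(choices)
    # Words that skip the first cell entirely.
    rest = (points, {c: v for c, v in choices.items() if c != first})
    # Words that use the first cell with the chosen letter.
    chosen = choices[first].get(letter, (0, {}))
    return merge(chosen, rest)

def leaf(points):
    return (points, {})

# Hypothetical two-cell tree: cell 0 in {a,b}, cell 1 in {c,d}.
tree = (0, {
    0: {"a": (1, {1: {"c": leaf(2), "d": leaf(0)}}),
        "b": (0, {1: {"d": leaf(3)}})},
    1: {"c": leaf(1)},
})

print(bound(tree))                         # 4
print(bound(force_first_cell(tree, "a")))  # 4
print(bound(force_first_cell(tree, "b")))  # 3
```

Merging is what tightens the bound: once the two subtrees share a single choice node for cell 1, the max is taken over combined totals instead of separately in each subtree.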

Inline Child Nodes and Arenas

Unlike the other optimizations, this one has nothing to do with Boggle. It’s pure C/C++!

The first time I ran my Boggle solver in the cloud, I was surprised by how much memory I used and by how much faster my code ran on my M2 MacBook than in the cloud. One theory was that Apple’s chips have very high memory bandwidth, and this might be a bottleneck for me on the Intel CPUs in the cloud. So I wanted to reduce my memory usage.

The vast majority of memory in the Boggle solver is used to store a tree structure. After removing unnecessary properties from my EvalNode class, it looked like this:

#include <cstdint>
#include <vector>

class EvalNode {
  int8_t letter_;
  int8_t cell_;
  uint16_t points_;
  uint32_t bound_;
  std::vector<EvalNode*> children_;  // three pointers: 24 bytes on 64-bit
};

sizeof(EvalNode) is 32 bytes for this structure on a 64-bit system. I allocate hundreds of millions of these, so saving even a few bytes makes a big difference.

The vast majority of the space is used by the children_ vector. Here’s what the memory layout looks like:

The small fields are organized efficiently into eight bytes. But then the vector takes up the remaining 24 bytes. So how is std::vector implemented? It usually looks something like this:

template <class T>
class vector {
  T* data_;         // points to first element
  T* end_;          // points to one past last element
  T* end_capacity_; // points to one past internal storage
};

I’d always assumed it stored a count, but this three pointer system is clever. It makes it very fast to check whether you’re at the end of a vector, and it frees the implementation from having to care about how big a pointer is.

For us, the gist is that we always store three pointers (24 bytes) directly in the EvalNode structure and then store the child pointers themselves in some other array. In practice most nodes have zero, one or two children. So this winds up being inefficient in a few ways:

The three pointers use a lot of space compared to the typical number of child pointers.
Because the vector allocates lots of small backing arrays, memory may be fragmented and there’s lots of overhead in managing it.
Accessing a child requires going through two pointers.

There’s a classic trick for improving this situation. Instead of using vector<T> children, use T* children[] to store an indeterminate number of child pointers directly in the struct:

class EvalNode {
  int8_t letter_;
  int8_t cell_;
  uint16_t points_;
  uint8_t num_children_;  // new
  uint8_t capacity_;      // new
  uint32_t bound_;
  EvalNode* children_[];  // indeterminate array
};

Now the memory layout looks like this:

sizeof(EvalNode) evaluates to 16 bytes now, but it’s really 16+8*capacity. If you want a node with two children, you allocate 32 bytes and use placement new.
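The sizes quoted above can be sanity-checked without a C++ compiler using Python's `struct` module, which applies the platform's native alignment rules. This assumes the C++ compiler lays the fields out the same way as the platform's C ABI, which is the usual case; the format strings are my transcription of the two field lists.

```python
import struct

PTR = struct.calcsize("P")  # pointer size on this platform

# Old layout: int8, int8, uint16, uint32, then std::vector's three pointers.
old_size = struct.calcsize("@bbHIPPP")

# New layout: int8, int8, uint16, uint8, uint8, uint32, then a trailing
# pointer array. "0P" adds no bytes but pads to pointer alignment.
new_header = struct.calcsize("@bbHBBI0P")

def node_bytes(capacity):
    """Bytes to allocate for a new-style node with room for `capacity` children."""
    return new_header + PTR * capacity

print(old_size, new_header, node_bytes(2))  # 32 16 32 on a typical 64-bit system
```

The fields in the new header only add up to ten bytes (1+1+2+1+1+4); the other six are alignment padding.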

This has some pros and cons. First, the pros:

The structure is smaller. For a zero-child EvalNode, it’s half the bytes. It can store up to two children and still remain smaller than the old structure, not even including the side buffer.
It doesn’t allocate memory in an outside array. All the memory is in the structure itself. This reduces fragmentation and means that accessing a child only requires chasing a single pointer.

The cons:

We have to store the number of children and the capacity of the node. These only take one byte each (nodes have a maximum of 16 children), but this is enough to screw up the alignment of the structure. Six of the sixteen bytes are unused! This is inefficient, but unavoidable without bitpacking.
It’s hard to add capacity to a node. This is also an issue with vectors, but that complexity is hidden from us. To add a child to a node that’s “at capacity,” we have to allocate a new, larger node and copy everything over.

When I implemented this, the pros vastly outweighed the cons. For 4x4 boards, this reduced memory usage by 20-30% and gave me something like a 40% speedup. A big part of this was that I was able to make more effective use of an arena for memory management. With the new structure, destroying a tree just required deallocating a few large buffers. Previously, it required deallocating millions of little backing arrays for children_.

This improved memory use and management made the final optimization possible.

There were two other things I learned from this optimization that I wanted to note:

Long ago, I’d learned to use this trick by putting T* child_[0] as the last property of a struct. This [0] form was never standard and has been obsolete since C99, which standardized the T* child_[] form (a “flexible array member”) for C. In C++ it’s still a compiler extension, but it’s more correct to write T* child_[]. And as of C++11, this saves you from a footgun: for (auto child : child_) is valid (but not what you want) with child_[0] but is a compile error with child_[].
C and C++ compilers are not allowed to reorder properties in a class. So it can pay off to think carefully about alignment and the size of each field.

Update: I later split EvalNode into SumNode and ChoiceNode classes, which let me get both of them down to 8 bytes with no waste.

Variable-depth Switching

To prove that a class of Boggle boards doesn’t contain any individual boards with more than N points, the procedure is:

Build an orderly tree for that class.
Recursively call Orderly Force some number of times to produce lots of subtrees.
Call Orderly Bound on each of those subtrees to get candidate boards with high Multiboggle scores.
Run those candidate boards through a plain old Boggle solver to check their true Boggle score.

The depth at which you switch from Force to Bound is a key choice. It’s ultimately a memory/speed tradeoff. More forcing reduces the exponential backtracking behavior of the Orderly Bound algorithm, but requires more RAM.

Previously, I’d used a fixed depth as the switchover point. I couldn’t set it any higher than depth=4 without running into memory problems. But for really hard 4x4 board classes, I found that higher depths were better.

After all the memory optimizations and the switch from Lift to Force, it became practical to use a variable depth. For harder subtrees, I could force more cells before switching over to OrderlyBound. I used the upper bound on the current subtree as a proxy. If it got within a factor of 2.5x of the best known board, I’d switch over to OrderlyBound.
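The switching rule can be written as a one-line predicate. This is a sketch of the rule as described, not the repo's code: the function name is made up, and the `max_depth` cap is an assumption added to keep RAM in check.

```python
def should_switch_to_bound(upper_bound, best_score, depth,
                           factor=2.5, max_depth=12):
    """Decide whether to stop forcing cells and run the bound search.

    upper_bound: bound on the current subtree
    best_score:  score of the best known board
    depth:       number of cells forced so far
    """
    if depth >= max_depth:   # hypothetical cap on forces to limit RAM
        return True
    return upper_bound <= factor * best_score

# With a best known score of 3625, the threshold is 9062.5.
print(should_switch_to_bound(9000, 3625, depth=3))   # True
print(should_switch_to_bound(20000, 3625, depth=3))  # False
```

Subtrees with a loose bound keep getting forced, so the expensive backtracking search only runs where the bound is already fairly close to the target.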

This didn’t have much of an effect on 3x4 boards, but it was something like a 2-3x speedup for the hardest 4x4 boards without too much of a RAM penalty. In practice, this sometimes used a lot of forces, up to 12. It would have forced all the cells if I let it, but this tended to use too much RAM.

Conclusion

The lesson here is that when you come up with an exciting new idea, it might force you to reevaluate other decisions that you’ve made. If A was faster than B with the old system, B might be faster than A in the new one.

All these optimizations added up to at least a 10x speedup on my laptop. And given the reduced RAM usage, I was hopeful that these improvements would be at least as big on the Intel CPUs in the cloud.

A $1,500 cloud run is much more palatable than a $15,000 cloud run. But more on that in the next post!

Boggle Revisited: A Thrilling Insight and the Power of Algorithms

2025-02-21T00:00:00+00:00

At the end of my last post on Boggle, I’d achieved perhaps a 10x speedup over my 2009 approach and run my code for 8-9 hours on a 192-core machine to definitively prove that, with 1,651 points, this is the highest-scoring 3x4 Boggle board for the ENABLE2K word list:

S L P
I A E
N T R
D E S

I was happy with the work. All that was left was to write one final blog post reflecting back on the process.

Then I had a thrilling flash of insight that made me dive right back in.

How bounds are computed

To understand the insight, we need to talk a bit more about how you compute an upper bound on a class of boards. I wrote about this back in 2009, but let’s recap now using some visualizations.

To keep things small, let’s play 2x2 Boggle. We’ll consider this board class:

0:{a,b} 2:{e,f}
1:{c,d} 3:{g,h}

Each cell can be one of two different letters, so this board class contains 2^4=16 different 2x2 Boggle boards:

a e   a e   a f   a f
c g   c h   c g   c h

a e   a e   a f   a f
d g   d h   d g   d h

b e   b e   b f   b f
c g   c h   c g   c h

b e   b e   b f   b f
d g   d h   d g   d h

Some have zero points (bdfh) while the highest-scoring one has 8 points (adeg).

We’d like to compute an upper bound on the highest-scoring board in this class without enumerating every board in it. To do so, we traverse the board just like we would to find the words on a regular Boggle board. We prune the search using valid prefixes: a path starting with “ac” is worth exploring, but one starting with “hd” is not.

Unlike a regular Boggle board, though, we’ll encounter cells with multiple possible letters. In this case we try each of them and pick the one that leads to the most points.

You can visualize this board traversal as a tree:

There’s a lot of information in this image:

Every word that can be found on any board in the board class appears in a double-outlined node (“cage”, “each”, “age”, “aged”, “ache”, etc.).
Each node tracks an upper bound for its tree, the most points you can find under it.
There are two types of nodes, sum nodes and choice nodes.
- To calculate the bound for a sum node, add the bounds of its children. This models how you can move in any direction from a cell.
- To calculate the bound for a choice node, take the max of its children. This models that, to get a concrete board, you have to choose a single letter for each cell.
The left-most node (ROOT) is a sum node, modeling how you can find a word starting from any cell. Its bound is an upper bound on the board class as a whole.
The upper bound for this board class (14 points) is, indeed, higher than the highest-scoring board in the class (8 points).

Back in 2009, I implemented this using recursive function calls, so that the tree structure was implicit in the traversal order. In 2025, I explicitly allocated this tree in memory so that I could perform operations on it.

We’re mostly interested in the bound, so let’s simplify the diagram by throwing away everything except the bound. Here’s what that looks like:

The colors represent individual cells:

Blue: Cell 0 / Top Left (A or B)
Green: Cell 1 / Bottom Left (C or D)
Orange: Cell 2 / Top Right (E or F)
Brown: Cell 3 / Bottom Right (G or H)

The number is the bound on each node. Nodes with a white background are sum nodes and/or words. You can’t read the individual words off this diagram any more, just the points that they contribute.

To recap how the bounds flow:

White nodes (sum nodes) are the sum of their children, plus any words on this node.
Colored nodes (choice nodes) are the max of their children.

And that is how you calculate an upper bound on a class of boards.
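Here is a sketch of the max/sum bound written as a recursive traversal, in the spirit of the implicit-tree version. The five-word list is only the handful of words named above, not ENABLE2K, so the numbers it produces will not match the diagrams. All names here are illustrative.

```python
def build_trie(words):
    """Nested-dict trie; "$" holds the points for a word ending at that node."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = {3: 1, 4: 1, 5: 2, 6: 3, 7: 5}.get(len(word), 11)
    return root

def upper_bound(board_class, adjacency, trie):
    """Max/sum upper bound on the score of any board in the class.

    board_class: list of strings, the candidate letters for each cell.
    adjacency:   adjacency[i] lists the cells next to cell i.
    """
    def choice(cell, node, used):
        # Choice node: try each letter for this cell and keep the best.
        best = 0
        for letter in board_class[cell]:
            child = node.get(letter)
            if child is None:
                continue                 # not a valid prefix
            # Sum node: points for a word ending here, plus every
            # direction we could move in next.
            total = child.get("$", 0)
            for nxt in adjacency[cell]:
                if nxt not in used:
                    total += choice(nxt, child, used | {nxt})
            best = max(best, total)
        return best

    # ROOT is a sum node: a word may start from any cell.
    return sum(choice(c, trie, {c}) for c in range(len(board_class)))

# On a 2x2 board every cell touches every other cell.
ADJ_2X2 = [[1, 2, 3], [0, 2, 3], [0, 1, 3], [0, 1, 2]]
WORDS = ["age", "aged", "cage", "each", "ache"]  # illustrative list only
trie = build_trie(WORDS)
class_bound = upper_bound(["ab", "cd", "ef", "gh"], ADJ_2X2, trie)
```

Calling `upper_bound` with one letter per cell gives that single board's Multiboggle score, which makes it easy to check that the class bound never falls below any board in the class.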

The upper bound here isn’t precise: 14 is larger than 8. Why is that? One thing that’s striking in the colorful tree is just how jumbled the colors are. There’s green on the left, right and middle. There’s orange everywhere. In terms of the Boggle board, this means we make the same choice (which letter to pick for each cell) many times in many different places in the tree. And we might not make the same choice every time. It’s not possible to find both CAGE and AGED on the same board, because the bottom left cell (1/green) has to be either a C or a D. It can’t be both.

The last post explained how we could tighten the bound by synchronizing these choices. In terms of the tree, this means applying lots of pivot operations to “lift” one of the choices to the top (left). Here’s what the tree looks like after lifting the choice for cell 0 (blue):

Now there’s a single blue choice node on the left, and the bound has gone down. In this case there are fewer nodes after lifting, but that’s not usually the case. The choice for cell 0 (blue) is synchronized now, but the other choices are not. To keep improving the bound, we can lift another choice. Let’s do green:

The bound has dropped again, and the tree is getting more orderly. We can keep going. Cell 2, orange, is next:

It won’t improve the bound but, for completeness, we can also lift cell 3 (brown):

Each lift adds a layer to a “choice pyramid” at the top of the tree, until what we’re left with is just a decision tree. The bound on this is tight, and you can find the board that produces the best score by following the max path down the tree.

Lifting is an effective way to tighten the bound on a class of boards, but it’s computationally expensive and it gobbles up enormous amounts of memory for large trees.

Orderly Trees

I tried to imagine what “lifting” meant in terms of how you traversed the board. What if there were a way to construct the lifted tree directly?

One interesting property of the lifted tree is that it doesn’t encode the letters of a word in the order that they appear in that word. AGED, for example, might be represented more like GEDA after lifting. You could imagine creating an effect like this while traversing the Boggle board by allowing yourself to add letters at the start of the word, in addition to the end, or even the middle.

Once you do this, though, you have to worry about finding the same word twice. To avoid double-counting, you could sort the cells that you used to spell the word.

And that leads us to the blazing flash of insight. We’re free to add the cells in a word in any order we like. So why don’t we always sort the cells before adding a word to the tree? This will naturally organize the tree in a way that synchronizes choices.

I’ve taken to calling these “orderly trees” because of the sorting and because they’re more organized than the trees we’ve been working with before. Here’s the tree for the ab cd ef gh board class we’ve been looking at in this post:

There are a few things to note straightaway:

The tree is much smaller than before.
The colors are much more organized, even without lifting: a single blue node on the left, green nodes mostly on the left, brown nodes always on the right.
Green nodes always appear to the right of blue nodes. Orange is always to the right of green, and brown is always to the right of orange. This reflects how we sort the cells before adding words.
The bound is much tighter than before: 8 vs. 14. In fact, 8 is already the tightest possible bound for this board class since adeg scores 8 points.
There are sum nodes only where it’s possible to “skip” a cell in the order.

This seems great! Surely reducing the bound and shrinking the tree will speed up the breaking process. But by how much? It’s not at all clear whether this is a 2x, 10x or a 100x optimization.

You might object: if we sort the cells, how do we distinguish anagrams like ACHE and EACH? Remember from the last post that we’re really playing Multi-Boggle, where you’re allowed to find the same word twice so long as you find it in a different way. That saves us here. Because ACHE and EACH follow different paths, we add their points twice. They wind up on the same node in the tree, but they do both count.
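Here is a minimal sketch of that insertion rule. It is not the code from the repo: the nested-dict tree, the `add_word` helper and the 2x2 layout are all assumptions made for illustration. The only point is that sorting the (cell, letter) pairs before descending puts ACHE and EACH on the same node, where their points add up.

```python
def add_word(tree, path, points):
    """Insert one word into an orderly tree (hypothetical dict-based layout).

    `path` lists (cell, letter) pairs in spelling order. Sorting them by
    cell index before descending is the whole trick.
    """
    node = tree
    for cell, letter in sorted(path):
        node = node.setdefault("children", {}).setdefault((cell, letter), {})
    node["points"] = node.get("points", 0) + points


# Assumed 2x2 board where every cell touches every other: 0=a, 1=c, 2=h, 3=e.
tree = {}
add_word(tree, [(0, "a"), (1, "c"), (2, "h"), (3, "e")], 1)  # ACHE
add_word(tree, [(3, "e"), (0, "a"), (1, "c"), (2, "h")], 1)  # EACH

leaf = tree
for key in [(0, "a"), (1, "c"), (2, "h"), (3, "e")]:
    leaf = leaf["children"][key]
print(leaf["points"])  # 2: both paths landed on the same node
```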

Orderly Results

This was exciting! A great idea with unknown upside. I implemented it to find out how helpful it was.

The reduction in bounds was enormous and got bigger for larger and harder board classes:

A big 3x3 board: 9,359 → 1,449 points (6x fewer)
Some 3x4 boards:
- 36,134 → 3,858 (9x)
- 51,317 → 4,397 (12x)
- 194,482 → 9,884 (20x)

Surely a 20x reduction in bounds would speed up the breaking process, but by how much? On my test set of 50 3x4 board classes, this took me from 333→29s, an 11.5x speedup. Nice!

Between this and a few other optimizations, I was able to redo the full 3x4 Boggle run from the last post. For that run, I used a 192-core C4 on GCP for 8-9 hours to find all the 3x4 boards with 1600+ points. For this run, I used three cores on my M2 Macbook for about 19 hours to find all the boards with 1500+ points. That’s a 30x speedup! And the Macbook was solving a harder problem.

Why not an even bigger speedup? Because the trees are more “orderly”, there’s less room for lifting to improve the bounds. Lifting is still helpful, especially on more complex boards, just much less so than with the old trees.

What about 4x4 Boggle? I estimate that it’s about 50,000x harder than 3x4 Boggle. So while a 30x speedup is huge, we’re still fighting against a 1,500x headwind. If we were to do the full 4x4 run on GCP, I estimate that orderly trees would reduce the bill from around $500,000 to $15,000. Now that’s an optimization!

Orderly Trees are a great illustration of the power of algorithmic advances: one good idea let me do on my laptop what had previously needed a 192-core cloud machine. And it would save $485,000 on the 4x4 run!

That’s still a bit more than I’m willing to pay (if you feel otherwise, let me know!) but we’re getting closer. A few more insights and the general trend towards lower compute costs might just bring it within reach. Barring any more surprising insights, the next posts will look at the best 3x4 board, what it can tell us about 4x4 Boggle, and will offer my reflections on this process. Update: a few more optimizations and a lot of compute did, in fact, bring it within reach.

As always, you can find all the code for this post in the danvk/hybrid-boggle repo.

Boggle Revisited: New Ideas in 2025

2025-02-13T00:00:00+00:00

Over the past few weeks I’ve revisited a 15-year-old project of mine: trying to find the globally optimal Boggle board. In the last post, I recapped the work I did in 2009 to find the globally-optimal 3x3 Boggle board.

In this post, I’ll present a few optimizations I found in 2025 that add up to something like a 10x speed boost over the 2009 approach. Between better algorithms, faster CPUs, and the widespread availability of large cloud machines, I’ve now been able to find the globally-optimal 3x4 Boggle board.

With an impressive 1,651 points and 600 words, here it is:

S L P
I A E
N T R
D E S

This board is chock full of big words, including REPAINTED and STRAINED.

A real Boggle board, of course, is 4x4 or even 5x5. ~~Sadly, those problems still remain out of reach. The final post in this series will look at ideas on how to tackle them.~~ Update: after an exciting insight, 4x4 turns out to be possible!

The code for this post lives in the danvk/hybrid-boggle repo.

New ideas in 2025

I explored many new ideas for speeding up Boggle solving, but there were five that panned out:

Play a slightly different version of Boggle where you can find the same word as many times as you like.
Build the “evaluation tree” used to calculate the max/no-mark bound explicitly in memory.
Implement “pivot” and “lift” operations on this tree to synchronize letter choices across subtrees.
Aggressively compress and de-duplicate the evaluation tree.
Use three letter classes instead of four.

Multi-Boggle

Looking at the best 3x3 board:

P E R
L A T
D E S

There are two different ways to find the word LATE:

P E R   P E\R
L-A-T   L-A-T
D E/S   D E S

In regular Boggle you’d only get points for finding LATE once. But for our purposes, this will wind up being a global constraint that’s hard to enforce. Instead, we just give you two points for it. We’ll call this “Multi-Boggle”. The score of a board in Multi-Boggle is never lower than its score in regular Boggle, so it’s still an upper bound.

If there are no repeat letters on the board, then the score is the same as in regular Boggle. In other words, while you can find LATE twice, you still can’t find LATTE because there’s only one T on the board.

In practical terms, this means that we’re going to focus solely on the max/no-mark bound and forget about the sum/union bound. The max/no-mark bound for a concrete board (one with a single letter on each cell) is its Multi-Boggle score.
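To make the difference concrete, here is a small sketch that scores the 3x3 board above both ways. The four-word list is a stand-in for ENABLE2K, and `neighbors`, `find_paths` and `score` are illustrative names rather than anything from the repo.

```python
WORDS = {"late", "tale", "teal", "plate"}  # tiny stand-in for ENABLE2K
PREFIXES = {w[:i] for w in WORDS for i in range(1, len(w) + 1)}
POINTS = {3: 1, 4: 1, 5: 2, 6: 3, 7: 5}  # 8+ letters score 11


def neighbors(i, w, h):
    """Indices of the cells adjacent to cell i, diagonals included."""
    x, y = i % w, i // w
    return [ny * w + nx
            for ny in range(max(0, y - 1), min(h, y + 2))
            for nx in range(max(0, x - 1), min(w, x + 2))
            if (nx, ny) != (x, y)]


def find_paths(board, w, h):
    """Return every word found, once per distinct path on the board."""
    found = []

    def visit(i, prefix, used):
        prefix += board[i]
        if prefix not in PREFIXES:
            return
        if prefix in WORDS:
            found.append(prefix)
        for n in neighbors(i, w, h):
            if n not in used:
                visit(n, prefix, used | {n})

    for start in range(len(board)):
        visit(start, "", {start})
    return found


def score(words):
    return sum(POINTS.get(len(word), 11) for word in words)


paths = find_paths("perlatdes", 3, 3)
multi = score(paths)         # Multi-Boggle: every path counts
regular = score(set(paths))  # regular Boggle: each word counts once
print(multi, regular)  # 10 5
```

Each of the four words can end on (or pass through) either E, so every one of them is found twice and the Multi-Boggle score is exactly double the regular score for this word list.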

The Evaluation Tree

Recall from the last post that we started by calculating an upper bound on one large board class:

lnrsy	chkmpt	lnrsy
aeiou	aeiou	aeiou
chkmpt	lnrsy	bdfgjvwxz

and then split it up into five smaller board classes, one with each vowel in the center cell, to reduce the upper bound:

lnrsy	chkmpt	lnrsy
aeiou	a	aeiou
chkmpt	lnrsy	bdfgjvwxz

lnrsy	chkmpt	lnrsy
aeiou	e	aeiou
chkmpt	lnrsy	bdfgjvwxz

...

lnrsy	chkmpt	lnrsy
aeiou	u	aeiou
chkmpt	lnrsy	bdfgjvwxz

The fundamental inefficiency in this approach is that it results in an enormous amount of duplicated work. These five board classes have a lot in common. Every word that doesn’t go through the middle cell is identical. It would be nice if we could avoid repeating that work.

Back in 2009, I implemented the max/no-mark upper bound by recursively searching over the board and the dictionary Trie. This was a natural generalization of the way you score a concrete Boggle board. It didn’t use much memory, but it also didn’t leave much room for improvement.

You can visualize a series of recursive function calls as a tree. The key advance in 2025 is to form this tree explicitly in memory. This is more expensive, but it gives us a lot of options to speed up subsequent steps.

Here’s an example of what one of these trees looks like:

This is visualizing a 2x3 board class, with the cells numbered 0-5:

 T I    0 3
AE .    1 .
 R .    2 .

The top “ROOT” indicates that you have to start on one of the four cells. “CH” ovals indicate that you have to make a choice on a cell, rectangles indicate what those choices are (A or E?) and double-outlined rectangles indicate complete words. You can read words by following a path down the tree. From the left we have: TAR, TIE, TIER, AIT, RAT, RET, REI. (It’s news to me that AIT is a word!)

To get a bound from a tree, you take the sum of your children on rectangular nodes and the max of your children on oval (choice) nodes. Double-outlined boxes indicate how many points they’re worth. In this case the bound at the root is 3, coming from the top branch.
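The sum/max rule can be sketched in a few lines. The `Sum` and `Choice` classes here are assumptions for illustration, not the repo's data structures, and the example encodes only the part of the tree for words that start with T (TAR, TIE, TIER).

```python
from dataclasses import dataclass, field


@dataclass
class Sum:
    """Rectangle: a letter was placed. Add own points and all children."""
    points: int = 0
    children: list = field(default_factory=list)  # Choice nodes


@dataclass
class Choice:
    """Oval: exactly one letter goes on `cell`. Take the best option."""
    cell: int
    options: dict  # letter -> Sum


def bound(node):
    if isinstance(node, Sum):
        return node.points + sum(bound(child) for child in node.children)
    return max(bound(option) for option in node.options.values())


# TAR uses cells 0-1-2; TIE and TIER use cells 0-3-1(-2).
tar = Choice(1, {"a": Sum(children=[Choice(2, {"r": Sum(points=1)})])})
tie = Choice(3, {"i": Sum(children=[
    Choice(1, {"e": Sum(points=1, children=[Choice(2, {"r": Sum(points=1)})])}),
])})
t_subtree = Sum(children=[tar, tie])
print(bound(t_subtree))  # 3
```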

This is a small tree with 30 nodes, but in practice they can be quite large. The tree for the 3x3 board class we’ve been looking at has 520,947 nodes and the 3x4 and 4x4 trees can be much larger.

I actually tried building these trees in 2009, but I abandoned it because I wasn’t seeing a big enough speedup in subsequent steps (scoring boards with a split cell) to justify the cost of building the tree.

What did I miss in 2009? Sadly, I had a TODO that turned out to be critical: rather than pruning out subtrees that don’t lead to points in a second pass, it’s much faster and results in less memory fragmentation if you do it as you build the tree. A 33% speedup becomes a 2x speedup. Maybe if I’d discovered that in 2009, I would have kept going!

The other discovery was that there’s a more important operation to optimize.

“Pivot” and “Lift” operations

Update: A few months later, I wound up doing something a little different than “pivot” and “lift”.

After we found an upper bound on the 3x3 board class, the next operation was to split up the middle cell and consider each of those five (smaller) board classes individually. Now that we have a tree, the question becomes: how do you go from the initial tree to the trees for each of those five board classes?

There’s another way to think about this problem. Why is the max/no-mark bound imprecise? Why doesn’t it get us the score of the best board in the class? Its flaw is that you don’t have to make consistent choices across different subtrees. You can see this by zooming in on the “0=t” subtree from the previous graph:

The bound on this tree is 3 (sum all the words). On the top branch (with a “T-“ prefix), it makes the most sense to choose “A” for cell 1, so that you can spell “TAR.” But on the bottom branch (with a “TI-“ prefix), it makes more sense to choose “E” so that you can spell “TIE” and “TIER.”

Of course, the cell has to be either A or E. It can’t be both. The problem is that these choices happen far apart in the tree, so they’re not synchronized. If we adjusted the tree so that the first thing you did was make a choice for cell 1, then the subtrees would all be synchronized and the bound would go down:

This represents the same tree, except that the choice on cell 1 has been pushed to the left. The bound is now 2, not 3 (you have to pick 1 point from the top branch or 2 points from the bottom branch).
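One way to check the 3 vs. 2 numbers without restructuring anything is to force a letter for cell 1 and re-evaluate the original tree once per letter. This sketch assumes a tuple-based tree encoding and a `forced` parameter that are purely illustrative. It computes the bound the pivoted tree would have. It does not perform the pivot itself.

```python
def bound(node, forced=None):
    """Max/no-mark bound. `forced` maps cell -> letter already decided."""
    forced = forced or {}
    if node[0] == "sum":
        _, points, children = node
        return points + sum(bound(child, forced) for child in children)
    _, cell, options = node
    if cell in forced:
        chosen = options.get(forced[cell])
        return bound(chosen, forced) if chosen is not None else 0
    return max(bound(option, forced) for option in options.values())


def pivoted_bound(node, cell, letters):
    """Bound when `cell` is decided first, as it would be after a pivot."""
    return max(bound(node, {cell: letter}) for letter in letters)


word = ("sum", 1, [])
t_subtree = ("sum", 0, [
    # "T-" prefix: TAR needs cell 1 = a.
    ("choice", 1, {"a": ("sum", 0, [("choice", 2, {"r": word})])}),
    # "TI-" prefix: TIE and TIER need cell 1 = e.
    ("choice", 3, {"i": ("sum", 0, [
        ("choice", 1, {"e": ("sum", 1, [("choice", 2, {"r": word})])}),
    ])}),
])

print(bound(t_subtree))                   # 3: the two branches choose independently
print(pivoted_bound(t_subtree, 1, "ae"))  # 2: cell 1 is A everywhere or E everywhere
```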

What we need is a “pivot” operation to lift a particular choice node up to the top of the tree. You can work out how to do this for each type of node.

First of all, if a subtree doesn’t involve a choice for cell N, then we don’t have to change it. Easy.

For the other two types of nodes (sum and choice), it helps a lot to draw the lift operation. Here’s a sum node (labeled “ROOT”).

We’d like to pivot the tree so that the choice on cell 1 is at the root. Here’s what that looks like:

The tree has gotten bigger and a little more complicated. That’s typical. We’ll be able to improve this, but more on that in a moment.

Here’s a choice node:

Here’s what it looks like after pivoting the choice on 1 to the root:

Again, the tree has gotten more complex. In particular, notice how the 0=c node with three points has been duplicated. This blowup is the cost of pivoting. The payoff is reduced bounds.

If you lift the choice for the middle cell of the 3x3 board all the way to the top of the tree, you’ll wind up with this:

There’s a choice node with five sum nodes below it, and the bound is lower. Now if you lift another cell to the top, you’ll get two layers of choice nodes with sum nodes below them:

I’ve rotated the tree because it’s already getting big. As before, the bound is lower. If you keep doing this, you build up a “pyramid” of choice nodes at the top of the tree. If the bound on any of these nodes drops below the highest score we know about, we can prune it out. This is equivalent to the “stop” condition from the 2009 algorithm, it’s just that we’re doing it in tree form.

This “lift” operation is not cheap. The cost of making choices in an unnatural order is that the tree gets larger. Here’s the tree size if you lift all nine cells of the 3x3 board, along with the bound:

Step	Nodes	Bound
(init)	520,947	9,359
1	702,300	6,979
2	1,315,452	5,334
3	2,527,251	4,069
4	5,158,477	3,047
5	8,395,605	2,318
6	14,889,665	1,774
7	18,719,619	1,373
8	11,205,272	1,037
9	4,143,221	804

The node count goes up slowly, then more rapidly, and then it comes down again as we’re able to prune more subtrees. It increases a lot before it comes back down. For this board, there’s a 36x increase in the number of nodes after 7 lifts. At the end of this, there are only 428 concrete boards (out of 5,625,000 boards in the class) that we need to test.

Does 4 million nodes seem like a lot for only 428 Boggle boards? It is. There are a few important tweaks we can make to keep the tree small as we lift choices.

Compression and De-duping

Update: after I changed how the trees were constructed, compression and de-duping were no longer wins and I dropped them.

Keeping the tree as small as possible is essential for solving Boggle quickly. There are two inefficient patterns we can identify and fix in our trees to keep them compact.

Collapse chains of sum nodes into a single sum node.
Merge sibling choice nodes.

There’s no point in having trees of sum nodes without choice nodes in between them. We may as well add all their points and children to a single, bigger sum node. Here’s a tree we had after lifting through a sum node earlier:

Here’s what it looks like after collapsing sum nodes:

The points from the 0=a and 0=b nodes have moved into their parents, so we can delete those two nodes and the tree gets smaller.
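A sketch of collapsing sum-under-sum chains, under the assumption that nodes are tuples of the form `("sum", points, children)` and `("choice", cell, options)`. The example tree and its point values are made up. What matters is that the node count drops while the bound is unchanged.

```python
def collapse(node):
    """Fold any sum node that sits directly under a sum node into its parent."""
    if node[0] == "choice":
        _, cell, options = node
        return ("choice", cell, {k: collapse(v) for k, v in options.items()})
    _, points, children = node
    kept = []
    for child in map(collapse, children):
        if child[0] == "sum":
            points += child[1]     # absorb the child's points
            kept.extend(child[2])  # and adopt its (choice) children
        else:
            kept.append(child)
    return ("sum", points, kept)


def size(node):
    kids = node[2] if node[0] == "sum" else node[2].values()
    return 1 + sum(size(k) for k in kids)


def bound(node):
    if node[0] == "sum":
        return node[1] + sum(bound(c) for c in node[2])
    return max(bound(o) for o in node[2].values())


tree = ("sum", 0, [
    ("sum", 1, []),
    ("sum", 2, [("choice", 1, {"a": ("sum", 1, [])})]),
    ("choice", 2, {"r": ("sum", 1, [])}),
])
small = collapse(tree)
print(size(tree), size(small))    # 7 5
print(bound(tree), bound(small))  # 5 5
```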

Here are the node counts when you add compression after each pivot:

Step	Nodes
(init)	520,947
1	669,156
2	1,054,515
3	1,726,735
4	2,675,250
5	2,620,720
6	1,420,925
7	301,499
8	39,667
9	9,621

These are considerably better. After 7 lifts, we’ve gone from nearly 19 million nodes to a mere 300,000. The maximum increase is now only 5-6x. Compression on its own is a 2-3x speedup.

Remember how we switched to playing “Multi-Boggle” earlier? This is where that change is crucial. If we had to track individual words, we couldn’t collapse sum nodes because they’d each reflect different words, and we’d need to keep track of that for the sum/union bound. But with Multi-Boggle, we’re free to collapse TIE and TIER into a single word that’s worth 2 points.

Here’s a tree where we can merge sibling choice nodes:

There are two choices for cell 0 that are siblings under the root node. The first is a choice between 0=a and 0=b (0=c nets zero points). The second is a choice between 0=b and 0=c. There’s no reason we should make those choices independently. They’re the same choice, and the tree should reflect that. We can implement that by merging the trees:

There are fewer nodes now and (exciting!) the bound has gone down because we’ve synchronized two choices that were previously made independently. Subtree merging can be done recursively.
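Here is a sketch of recursive sibling merging, assuming a tuple-based tree in which sum nodes have only choice children (that is, after collapsing). The point values are invented to mirror the example: one choice between 0=a and 0=b, another between 0=b and 0=c.

```python
EMPTY = ("sum", 0, [])


def merge_sums(a, b):
    """Combine two sum nodes reached by the same letter choice."""
    return ("sum", a[1] + b[1], merge_choices(a[2] + b[2]))


def merge_choices(choices):
    """Merge sibling choice nodes that decide the same cell."""
    by_cell = {}
    for _, cell, options in choices:
        merged = by_cell.setdefault(cell, {})
        for letter, subtree in options.items():
            merged[letter] = merge_sums(merged.get(letter, EMPTY), subtree)
    return [("choice", cell, options) for cell, options in by_cell.items()]


def bound(node):
    if node[0] == "sum":
        return node[1] + sum(bound(c) for c in node[2])
    return max(bound(o) for o in node[2].values())


def leaf(points):
    return ("sum", points, [])


root = ("sum", 0, [
    ("choice", 0, {"a": leaf(2), "b": leaf(1)}),
    ("choice", 0, {"b": leaf(1), "c": leaf(3)}),
])
merged = merge_sums(EMPTY, root)
print(bound(root), bound(merged))  # 5 3
```

Before merging, the bound takes A from the first choice and C from the second. Afterwards a single letter has to serve both, so the best you can do is C for 3 points.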

The “lift” operation expands the tree because it can duplicate nodes or entire subtrees. Another optimization is to de-duplicate these, and make sure we only ever operate on unique nodes. This can be done by computing a hash for each node. Here’s a tree with structurally identical nodes marked in red:

You can scan over the tree to find the canonical versions of each red subtree. For example, the red “4 CH” in the middle reads “4 CH - 4=r - 5 CH - 5=e (1)”, which is exactly the same as the line right above it.

Step	Unique Nodes
(init)	98,453
1	117,602
2	215,121
3	318,088
4	592,339
5	754,947
6	481,449
7	125,277
8	27,125
9	9,613

Whereas compression is more effective at reducing node counts after many lifts, de-duplication is better at reducing (unique) node counts initially and after fewer lifts. Only processing each unique node once can potentially save us a lot of time.

One way to think about this is that it allows us to memoize the pivot operation. Another is that it turns the tree into a DAG, similar to how you can compress a Trie by turning it into a DAWG/DAFSA (Directed Acyclic Word Graph). Visualizing it as a DAG doesn’t work very well, though: there are just too many crossing lines.
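A sketch of counting unique nodes with a structural key. The post only says a hash is computed per node, so using nested tuples as the key is an assumption here, and it is slower than a real incremental hash. The example duplicates the “4 CH - 4=r - 5 CH - 5=e (1)” subtree under two hypothetical parents.

```python
def canonical(node, seen):
    """Return a hashable structural key for `node` and record it in `seen`."""
    if node[0] == "sum":
        key = ("sum", node[1], tuple(canonical(c, seen) for c in node[2]))
    else:
        key = ("choice", node[1], tuple(sorted(
            (letter, canonical(sub, seen)) for letter, sub in node[2].items())))
    seen.add(key)
    return key


def size(node):
    kids = node[2] if node[0] == "sum" else node[2].values()
    return 1 + sum(size(k) for k in kids)


def four_r_five_e():
    return ("choice", 4, {"r": ("sum", 0, [("choice", 5, {"e": ("sum", 1, [])})])})


tree = ("sum", 0, [("choice", 3, {
    "a": ("sum", 0, [four_r_five_e()]),
    "o": ("sum", 0, [four_r_five_e()]),
})])

unique = set()
canonical(tree, unique)
print(size(tree), len(unique))  # 12 7
```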

Use three letter classes instead of four

The net effect of all these changes is that we’re able to “break” difficult board classes much more efficiently. For example, this 3x4 board class:

lnrsy	aeiou	bdfgjqvwxz
lnrsy	aeiou	aeiou
chkmpt	lnrsy	lnrsy
bdfgjqvwxz	aeiou	aeiou

takes about 14 seconds to break using the 2009 technique but only 3 seconds to break using the tree techniques described above. Lifting five times reduces the bound from 51,639 to 6,695, at which point we can switch over to the 2009 technique.

Compare that with an easier board:

lnrsy	lnrsy	bdfgjqvwxz
aeiou	bdfgjqvwxz	bdfgjqvwxz
lnrsy	aeiou	chkmpt
chkmpt	lnrsy	bdfgjqvwxz

This takes 0.16 seconds with the 2009 technique and 0.12 seconds with the tree technique. It’s a win, but much less of a win. (The five lifts on this one reduce the bound from 8,138 to 1,231.)

If trees are most helpful on hard board classes, maybe we should have more of them? If we use three letter buckets instead of four, it significantly reduces the number of board classes we need to consider:

Four letter buckets: 4^12/4 ≈ 4.2M board classes
Three letter buckets: 3^12/4 ≈ 133k board classes

The board classes with three letter buckets are going to be bigger and harder to break. But with our new tools, these are exactly the sort of boards on which we get the biggest improvement. So long as the average breaking time doesn’t go up by more than a factor of ~32x (4.2M/133k), using three buckets will be a win.

Code Year	Buckets	Pace (s/class)
2009	4	4.53
2009	3	33.336 (7.4x slower)
2025	4	1.175
2025	3	5.321 (4.5x slower)

So three buckets would have been better with the 2009 algorithm, but it’s an even bigger win in 2025. Since 32 / 4.5 ≈ 7, we’d expect this to be around a 7x speedup.
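The arithmetic can be checked with a few lines. The paces are copied from the table. The variable names and the way the speedup is derived (break-even ratio divided by the per-class slowdown) are just one reading of the reasoning above.

```python
CELLS = 12      # 3x4 board
SYMMETRIES = 4  # the divisor used in the class counts above

classes_4 = 4 ** CELLS / SYMMETRIES   # about 4.2M
classes_3 = 3 ** CELLS / SYMMETRIES   # about 133k
break_even = classes_4 / classes_3    # about 32x fewer classes

# Seconds per class, keyed by (code year, buckets).
pace = {(2009, 4): 4.53, (2009, 3): 33.336, (2025, 4): 1.175, (2025, 3): 5.321}

speedup = {}
for year in (2009, 2025):
    slowdown = pace[year, 3] / pace[year, 4]
    speedup[year] = break_even / slowdown
    print(year, round(slowdown, 1), round(speedup[year], 1))
```

For 2025 this comes out at roughly 7x, and for 2009 at a bit over 4x.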

Why not keep going to two classes, or even just one? The cost is memory and reduced parallelism. “Chunkier” board classes require bigger trees, and RAM is a finite resource. Moreover, the fewer letter buckets we use, the more skewed the distribution of breaking times gets. Some board classes remain trivial to break (ones with all consonants, for example), but others are real beasts. On the full 3x4 run, the fastest board was broken in 0.003s whereas the slowest took 2297s. It’s harder to distribute these uneven tasks evenly across many cores or many machines to get the full speedup you expect from distribution. I think using slightly bigger chunks could still help, say two classes (consonant/vowel) in the corners, three on the edges and five in the middle. Update: this did work.

Putting it all together

For each board class, the 2025 algorithm is:

Build the evaluation tree.
“Lift” a few choice cells to the top.
Continue splitting cells without modifying the tree, à la the 2009 approach.

The right number of “lifts” depends on the board and the amount of memory you have available. Harder boards benefit from more lifts, but this takes more memory.

Using this approach, I was able to evaluate all the 3x4 Boggle board classes on a 192-core C4 cloud instance in 8–9 hours, roughly $100 of compute time. The results? There are exactly five boards that score more than 1600 points with the ENABLE2K word list:

srepetaldnis (1651)
srepetaldnic (1614)
srepetaldnib (1613)
sresetaldnib (1607)
sresetaldnip (1607)

The best one is the same one I found through simulated annealing. The others are 1-2 character variations on it. It would have been more exciting if there were a new, never before seen board. But we shouldn’t be too surprised that simulated annealing found the global optimum. After all, it did for 3x3 Boggle. And Boggle is “smooth” in the sense that similar boards tend to have similar scores. It would be hard for a great board to “hide” far away from any other good boards.

Next Steps

This is an exciting result! After 15 years, it is meaningful progress towards the goal of finding the globally-optimal 4x4 Boggle board.

There are still many optimizations that could be made. My 2025 code only wound up being a 3-4x speedup vs. my 2009 code when I ran it on the 192-core machine. This was because I had to dial back a few of the optimizations because I kept running out of memory. So changes that reduce memory usage would likely be the most impactful.

On the other hand, I don’t think there are any tweaks to my current approach that will yield more than a 10x performance improvement. So while I might be able to break 3x4 Boggle more efficiently, it’s not going to make a big dent in 4x4 Boggle. Remember that 50,000x increase in difficulty from earlier. For 4x4 Boggle, we still need a different approach. (Or $500,000 of compute time!)

In the next and final post, I’ll talk about a few ideas that might help to finally crack 4x4 Boggle. I’ll also share some thoughts on pybind11, picking up an old project, and the experience of working on a hard optimization problem. The next post turned out to be something wonderful and entirely unexpected.

You can find all the code for this post in the danvk/hybrid-boggle repo.

Boggle Revisited: Finding the Globally-Optimal 3x4 Boggle Board

2025-02-10T00:00:00+00:00

Over 15 years ago (!) I wrote a series of blog posts about the board game Boggle. Boggle is a word search game created by Hasbro. The goal is to find as many words as possible on a 4x4 grid. You may connect letters in any direction (including diagonals) but you may not use the same letter twice in a word (unless that letter appears twice on the board).

You score points for each word you find, with longer words being worth more points (3-4 letters are 1 point, 5=2 points, 6=3 points, 7=5 points, 8+ letters=11 points).

Boggle is a fun game, but it’s also a fun Computer Science problem. There are three increasingly hard problems to solve as you go down this rabbit hole:

Write a program to find all the words on a Boggle board. This is a classic data structures and algorithms problem, and sometimes even an interview question. What’s wonderful about this problem is that it’s a perfect use for a Trie (aka Prefix Tree), and a counter to the idea that hash tables are always the best answer. You can find many, many Boggle solvers of this sort on the internet. Apparently Jeff Dean is a fan of Boggle, and LLMs can even write these sorts of solvers.
Find high scoring Boggle boards. Once you’ve written a fast solver, a natural question is “what’s the Boggle board with the most points on it?” The usual approach is some variation on simulated annealing or hill climbing. Start with a random board and find all the words on it. Then change a letter or swap two letters and see if it improves things. Repeat until you stall out. This problem is less popular than the first, but you can still find a few posts about it and a Code Golf competition. There’s even a published article from 1982 about it! (Fun fact: they found the wrong board.)
Prove that a Boggle board is the global optimum. If you do enough simulated annealing runs, you’ll see the same few boards pop up again and again. A natural next question is “are these truly the highest-scoring boards?” Are there any high-scoring boards that simulated annealing misses? Proving a global optimum is much harder than finding a few high-scoring boards and, so far as I’m aware, I’m the only person who’s ever spent significant time on this particular problem.
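The hill-climbing loop from the second problem can be sketched as follows. Everything here is a toy: a 2x2 board where all four cells touch, an eight-word list, a five-letter alphabet, and a `max_stalls` cutoff chosen arbitrarily. A real run would plug in a full solver and word list.

```python
import random

WORDS = {"tar", "rat", "art", "tea", "eat", "ate", "rate", "tear"}  # toy list
PREFIXES = {w[:i] for w in WORDS for i in range(1, len(w) + 1)}


def score(board):
    """Score a 2x2 board, counting one point per word path found."""
    def visit(i, prefix, used):
        prefix += board[i]
        if prefix not in PREFIXES:
            return 0
        total = 1 if prefix in WORDS else 0  # every toy word is worth 1 point
        return total + sum(visit(n, prefix, used | {n})
                           for n in range(4) if n not in used)

    return sum(visit(i, "", {i}) for i in range(4))


def hill_climb(score, board, alphabet, rng, max_stalls=200):
    """Change one letter at a time, keeping only strict improvements."""
    best = score(board)
    stalls = 0
    while stalls < max_stalls:
        i = rng.randrange(len(board))
        candidate = board[:i] + rng.choice(alphabet) + board[i + 1:]
        candidate_score = score(candidate)
        if candidate_score > best:
            board, best, stalls = candidate, candidate_score, 0
        else:
            stalls += 1
    return board, best


final, best = hill_climb(score, "tarz", "aertz", random.Random(0))
print(final, best)
```

Because only strict improvements are kept, a start like "zzzz" never moves: no single-letter change produces a word. That is the local-optimum problem in miniature.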

The crowning achievement of my work in 2009 was proving that this was the highest-scoring 3x3 Boggle board (using the ENABLE2K word list), with 545 points:

R T S
E A E
P L D

Now, 15 years later, I’ve been able to prove that this is the best 3x4 board, with 1651 points:

S L P
I A E
N T R
D E S

This post will summarize how I found the highest-scoring 3x3 Boggle board back in 2009, and the next will describe how I extended this to 3x4 in 2025. Alas, 4x4 Boggle still remains out of reach for now. Maybe in 2040?

Why is this a hard/interesting problem?

There are an enormous number of possible Boggle boards. Something like 26^16/8, which is around 5 billion trillion (5*10^21). This is far, far too many to check one by one. I previously estimated that it would take around 2 billion years on a single CPU.

And yet… there’s a lot of structure in this problem that might be exploited to make it tractable. There’s an enormous number of possible optimizations to try, lots of interesting data structures and algorithms to read about and implement, and always the possibility that you’re one insight away from solving this problem.

Boggle is a worthy adversary, and most ideas don’t pan out. But the possibility of achieving such an enormous speedup (2 billion years → a few hours) is what makes this problem exciting to me.

Why pick it back up now?

Fifteen years is a long time! The world has changed a lot since my last Boggle post. Computers have gotten much faster. There have been five new versions of C++. Cloud computing is a thing now. Stack Overflow is a thing. So are LLMs. A cool language called TypeScript came out and I wrote a book on it. I even have an iPhone now!

I’ve gotten in the habit of doing the Advent of Code, a coding competition that’s held every December. It involves lots of data structures and algorithms problems, so it got me in that headspace.

In addition, I’ve long been curious to write code using a mix of C++ and Python: C++ for the performance-critical parts, Python for everything else. Maybe it could be a best-of-both worlds: the speed of C++ with the convenience of Python. I thought Boggle would be a great problem to use as motivation. I wound up using pybind11 and I’m a fan. I’ll have some thoughts to share about it in a future post.

How did I find the optimal 3x3 board in 2009?

Though I didn’t know it at the time, I used branch and bound, a ubiquitous optimization strategy first developed in the 1960s. There were three key ideas.

Board classes

The first idea was to reduce the number of boards by considering whole classes of boards at once. Here’s an example of a class of 3x3 boards:

l, n, r, s, y	c, h, k, m, p, t	l, n, r, s, y
a, e, i, o, u	a, e, i, o, u	a, e, i, o, u
c, h, k, m, p, t	l, n, r, s, y	b, d, f, g, j, v, w, x, z

I’ve divided the alphabet up into four different “buckets.” Instead of having a single letter on each cell, this board has 5-9 possible letters from one of those buckets on each cell. There are 5,062,500 individual boards in this class. The highest-scoring 3x3 board (rts/eae/pld) is one of them, but there are many others.

There are vastly fewer board classes than individual boards. For 3x3 Boggle, using four “buckets” of letters takes us from 26^9/8 = 6x10^11 boards → 4^9/8 = 32,768 board classes. If we can find the highest-scoring board in each class in a reasonable amount of time, this will make the problem of finding the global optimum tractable.
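These counts are easy to reproduce. The division by 8 is taken from the formulas above. Reading it as the eight rotations and reflections of a square board is an assumption on my part.

```python
from math import prod

board_class = [
    "lnrsy", "chkmpt", "lnrsy",
    "aeiou", "aeiou", "aeiou",
    "chkmpt", "lnrsy", "bdfgjvwxz",
]

# One letter is picked independently for each cell.
boards_in_class = prod(len(cell) for cell in board_class)
print(f"{boards_in_class:,}")  # 5,062,500

all_boards = 26 ** 9 / 8   # roughly 6.8e11 individual boards
all_classes = 4 ** 9 / 8   # 32,768 board classes
print(f"{all_boards:.1e}", int(all_classes))
```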

Bound: Upper bounds on board classes

The second insight is that, rather than finding the highest-scoring board in a class, all we really need to do is establish an upper bound on its score. An upper bound is a concept from mathematics: if the highest-scoring board in a class has 500 points, then 500 is an upper bound on the score. So is 600. The upper bound doesn’t need to be achieved by any particular board, it just needs to be greater than or equal to the score of every board.

If the upper bound is less than 545 (the score of the best individual board we found through simulated annealing), then we know there’s no specific board in this class that beats our best board, and we can toss it out without having to score every single board in the class.

As it turns out, establishing an upper bound is much, much easier than finding the best board in a class. I came up with two upper bounds back in 2009:

sum/union: the sum of the points for every word that can be found on any board in the class.
max/no-mark: a bound that takes into account that you have to choose one letter for each cell.

You can read more about how these work in the linked blog posts from 2009. The max/no-mark bound is typically much lower than the sum/union bound, but not always. Usually neither of the upper bounds is low enough. On this board class, for example, the sum/union bound is 106,383 and the max/no-mark bound is 9,359. Those are both much, much larger than 545!

Branch: Repeatedly split board classes

This brings us to the final insight: if the upper bound is too high, you can split up one of the cells to make several smaller classes. For example, if you split up the middle cell in the board class from above, then this single class becomes five classes, one for each choice of vowel in the middle:

lnrsy	chkmpt	lnrsy
aeiou	a	aeiou
chkmpt	lnrsy	bdfgjvwxz

lnrsy	chkmpt	lnrsy
aeiou	e	aeiou
chkmpt	lnrsy	bdfgjvwxz

...

lnrsy	chkmpt	lnrsy
aeiou	u	aeiou
chkmpt	lnrsy	bdfgjvwxz

These are the bounds you get on those five board classes:

Middle Letter	Max/No Mark	Sum/Union
A	6,034	55,146
E	6,979	69,536
I	6,155	58,139
O	5,487	48,315
U	4,424	37,371

Those numbers are all still too high (we want them to get below 545), but they have come down considerably. Choosing “U” for the middle cell brings the bound down the most, while choosing “E” brings it down the least.

Branch and Bound

If we keep splitting up cells, we’ll keep getting more board classes with lower bounds. I didn’t know it in 2009, but this is branch and bound, a ubiquitous approach for solving optimization problems:

Branch: Split up a cell to get smaller board classes (subproblems).
Bound: Calculate an upper bound on the board class.

If you iteratively break cells, your bounds will keep going down. If they drop below 545, you can stop. These recursive breaks form a sort of tree. The branches of the tree with lower scores (like the “U”) will require fewer subsequent breaks and will be shallower than the higher-scoring branches (the “E”). The sum/union bound converges on the true score, so if you break all 9 cells and still have more than 545 points, you’ve found a global optimum.
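Here is a toy version of the whole loop, assuming a 2x2 board where every cell touches every other and an eight-word list. It uses only a max/no-mark style bound, so a fully split board ends up with its path-counting score. With repeated letters that can exceed the regular score, so survivors would still need rescoring. The function names are illustrative and this is not the 2009 code.

```python
WORDS = {"tar", "rat", "art", "tea", "eat", "ate", "rate", "tear"}  # toy list
PREFIXES = {w[:i] for w in WORDS for i in range(1, len(w) + 1)}
NEIGHBORS = {i: [j for j in range(4) if j != i] for i in range(4)}  # 2x2 board


def points(word):
    return {3: 1, 4: 1, 5: 2, 6: 3, 7: 5}.get(len(word), 11)


def upper_bound(cells):
    """Max/no-mark bound: every visit to a cell picks its own best letter."""
    def visit(i, prefix, used):
        best = 0
        for letter in cells[i]:
            word = prefix + letter
            if word not in PREFIXES:
                continue
            total = points(word) if word in WORDS else 0
            total += sum(visit(n, word, used | {n})
                         for n in NEIGHBORS[i] if n not in used)
            best = max(best, total)
        return best

    return sum(visit(i, "", {i}) for i in range(len(cells)))


def branch_and_bound(cells, best_known):
    """Return (board, score) for each board in the class beating best_known."""
    bound = upper_bound(cells)
    if bound <= best_known:
        return []  # Bound: nothing in this class can win.
    for i, letters in enumerate(cells):
        if len(letters) > 1:  # Branch: split the first undecided cell.
            return [hit
                    for letter in letters
                    for hit in branch_and_bound(
                        cells[:i] + [letter] + cells[i + 1:], best_known)]
    return [("".join(cells), bound)]  # one letter per cell: a concrete board


print(branch_and_bound(["t", "ae", "r", "ae"], best_known=7))
# [('tare', 8), ('tera', 8)]
```

The class holds four boards. TARA (6 points) and TERE (0 points) are discarded by the bound, and the two survivors are the only ones that beat 7.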

Back in 2009, I reported that I checked all the 3x3 boards this way in around 6 hours. The board you find via simulated annealing is, in fact, the global optimum. In 2025, I’m able to run the same code on a single core on my laptop in around 40 minutes.

This is something like a 400x speedup vs. scoring all the 3x3 boards individually.

How much harder are 3x4 and 4x4 Boggle?

As you increase the size of the board, the maximization problem gets harder for two reasons:

There are exponentially more boards (and board classes) to consider.
Each board (and board class) has more words and more points on it.

How bad is this?

3x3: each class takes ~80ms to break and there are ~33,000 of them ⇒ ~40 minutes.
3x4: each class takes ~1.6s to break and there are ~4.2M of them (~6.7M seconds in total) ⇒ ~78 days.
4x4: each class takes ~10m to break and there are ~537M of them ⇒ ~10,000 years.

So with the current algorithms, 3x4 Boggle is ~3,000x harder than 3x3 Boggle and 4x4 Boggle is around 50,000 times harder than that.
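A quick back-of-the-envelope check of these estimates. The class counts are assumptions: four letter buckets per cell, divided by 8 for the square boards and by 4 for 3x4. For 3x4 that gives 4^12/4 ≈ 4.2M classes, which is the count that reproduces the ~78 days.

```python
SECONDS_PER_DAY = 86_400

# size -> (seconds to break one class, number of classes)
estimates = {
    "3x3": (0.080, 4 ** 9 / 8),   # ~33,000 classes
    "3x4": (1.6, 4 ** 12 / 4),    # ~4.2M classes
    "4x4": (600, 4 ** 16 / 8),    # ~537M classes at ~10 minutes each
}

days = {size: pace * classes / SECONDS_PER_DAY
        for size, (pace, classes) in estimates.items()}

print(f"3x3: {days['3x3'] * 24 * 60:.0f} minutes")
print(f"3x4: {days['3x4']:.0f} days")
print(f"4x4: {days['4x4'] / 365:,.0f} years")
print(f"4x4 vs 3x4: {days['4x4'] / days['3x4']:,.0f}x")
```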

That’s enough for today. In the next post, I’ll present a few optimizations on the 2009 approach that net us another ~10x speedup. Enough that it’s reasonable to solve 3x4 Boggle on a single beefy cloud machine, but not enough to bring 4x4 Boggle within reach.

Another Decade, Another Webdiff

2024-06-21T00:00:00+00:00

Over the past few weeks, I found the time to work on webdiff, an open source project of mine that I built over a decade ago. I still use it all the time, but hadn’t actively worked on it since 2015. Revisiting an old project is always an interesting experience, and this post presents my reflections on it.

First off, what is webdiff? It’s a diff tool. Rather than running git diff, you run git webdiff and you get a two-column diff UI with syntax highlighting in your browser:

Because it’s running in your browser, you get lots of nice things for free: web fonts, zoom, search. You can also look at diffs between images:

Installation is as simple as

pip install webdiff

Background

The project came out of my experience of leaving Google for the first time back in 2014. I was very accustomed to how software was built inside Google, and there were some tools that I missed. In particular, Google’s code review tools (Mondrian and later Critique) were light years ahead of GitHub’s in 2014. GitHub’s PR Review was quite barebones back then: no two-column diffs, no syntax highlighting. webdiff was my attempt both to improve this, and to learn how to build and publish a tool outside the Google ecosystem.

I spent a good chunk of the summer of 2014 building webdiff, and I was happy with how it turned out. I released it that July and continued to work on it, even giving a PyCon talk about it in 2015.

While I haven’t worked much on webdiff since 2015, I’ve continued to be an active user. My years at Google trained me to look at two-column diffs with syntax highlighting in a browser, and I still much prefer this to looking at diffs in a terminal. GitHub’s Pull Request UI has improved significantly over the years, but I like that I can run git webdiff locally (or on a plane) without having to push anything to GitHub. I also prefer the one-file-at-a-time UI, which matches Mondrian/Critique. VSCode’s diff viewer is another interesting option these days, though I feel some mismatch between diff viewing as an ephemeral process and editing as a more persistent one.

Realizations

Using webdiff over the years, I had a few realizations. One was that I hadn’t understood git very well when I built it in 2014. Most of my experience had been with git5, a git wrapper around Perforce in use at Google at the time. In retrospect, this was an incredibly confusing way to learn git! My understanding of git improved while I worked at Hammerlab in 2014–2015, then took another big step forward in 2016 when I watched the fantastic git from the bits up talk.

The other big realization was that I’d architected webdiff in the wrong way. webdiff takes two directories (before and after, or left and right) and tries to match up the files in them. This is usually straightforward, but there are some tricky edge cases, like a rename+change. The original webdiff matched files up on its own, then calculated diffs for each file. The realization was that git is already really good at this, and that I should rely on it to do all the diff calculations for me. webdiff should display diffs, never calculate them.

A wall

This idea kicked around in the back of my head for a few years, until I had some time to work on it in the fall of 2022. I learned about git diff --no-index, which lets you use git-diff to diff two files or directories outside a git repo. And I learned about git diff --raw, which diffs two directories and matches files between them to produce adds, deletes, renames and changes. This all seemed promising! It even let me play around with flags like git diff -w which tells git diff to ignore whitespace changes.

Then I ran into a wall: if you run git difftool from HEAD, one of the directories it produces will be filled with symlinks to files in your repo. This makes sense: it’s faster to create symlinks to the files than copies. And for webdiff, it meant that you could edit a file, reload the browser window, and see the new diff.

Unfortunately for webdiff, git diff --no-index does not resolve symlinks. This meant that, in order to produce a diff, I had to run git difftool --no-symlinks. This was slower, and it broke an important workflow: reloading the diff after editing a file no longer reflected your changes. This was frustrating, and enough to put me off the project.

A breakthrough

Fast-forward almost two years and I decided to pick up the project again. What had seemed like a fundamental issue in 2022 now just seemed like a nuisance. Before passing the directories to git diff --no-index, I could make a version of the directory that resolved the symlinks. This would let git pair up the files for me. Then I could resolve symlinks before running git diff --no-index to generate diffs for individual files. Elegant, no, but it let me get through the impasse.
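The workaround can be sketched in a few lines of Python. This is an illustration rather than webdiff's actual code: resolve_symlinks and raw_diff are hypothetical helper names, and the only external pieces assumed are shutil.copytree and the git diff --no-index --raw invocation mentioned above.

```python
import os
import shutil
import subprocess
import tempfile


def resolve_symlinks(src_dir):
    """Copy src_dir to a temp directory, replacing symlinks with real files."""
    dst_dir = os.path.join(tempfile.mkdtemp(prefix="webdiff-"), "resolved")
    # symlinks=False makes copytree copy what each link points to, not the link.
    shutil.copytree(src_dir, dst_dir, symlinks=False)
    return dst_dir


def raw_diff(left_dir, right_dir):
    """Ask git to pair up files between two directories outside a repo."""
    result = subprocess.run(
        ["git", "diff", "--no-index", "--raw", left_dir, right_dir],
        capture_output=True,
        text=True,
    )
    # git diff --no-index exits with 1 when the inputs differ, so the return
    # code is deliberately not treated as an error here.
    return result.stdout.splitlines()
```

The copy costs some time and disk space, which is the inelegant part, but the original symlinked directory stays in place so edits still show up on reload.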

Once that was resolved, I was able to cut the first new release of webdiff in years. But once I was in there, I didn’t want to stop. When you look at ten year old code, it’s hard to resist the urge to modernize it. I’ve done quite a bit of that over the past few weeks.

One advantage of stepping away from a project for so long is that you get to skip several generations of tooling. In this case I got to skip straight from Python’s vintage setuptools to poetry for managing dependencies and releases. The Python packaging situation has improved significantly over the past decade. I like poetry, and I like pyproject.toml over setup.py.

Migrating the diff UI from jQuery to React was a real throwback to 2015. It was also a nice reminder of the beauty of React. There was considerable duplication between the code for building the initial diff UI and for filling in additional rows when you clicked a “Show 12 rows” link. Adding “show 10 more” links would have made it even worse.

When I ported the code over to React, the duplication went away. It was easy to show the additional skipped rows in the data model and trust React to render them appropriately.

I’ve added quite a few new features and even started to play around with next-generation tooling like the React Compiler. But after a few weeks, I can tell that I’ve hit the point of diminishing returns. I’d like to go back to being a user again.

What’s next

What would I work on next for webdiff? There’s a long-standing, annoying bug where the terminal process doesn’t quit when you close the diff in your browser. I’d like to try to fix that. And now that the diff UI is fully React-ified, generating it lazily could make it easier to render diffs for large files like lockfiles (or checker.ts). I’d also be excited about a special mode for diffing minified JSON.

My biggest dream for a code review tool would be to have language services available while reviewing code. Google didn’t have this when I worked there, but it might have it now. I don’t think this is a feature webdiff will ever have, but if VS Code gets it right, it might be enough to make me switch.

AlphaGo vs. Lee Sedol

2016-03-15T00:00:00+00:00

I was dimly aware of the ongoing competition between AlphaGo and Lee Sedol, but I hadn’t paid much attention until I saw this chart on reddit:

It’s hard to read “Lee Sedol’s brilliant attack (78th)” and not get curious! This led me into a deep dive on the competition. You can read more about the move or watch a 15 minute summary of the match. The full 6 hour match, including a press conference afterwards, is also online. You can learn a lot about Go from listening to the YouTube commentators. One of them is Michael Redmond, the all-time top-rated American Go player. Even if you don’t understand how to play Go (I barely do), it’s fun to watch the experts react to this move: “Oh, this [is] very creative.”

Lee Sedol’s win in the fourth match is being celebrated as a victory for humankind. But it’s surprising that we’re here at all. AlphaGo, Google DeepMind’s computer Go program, had already won the first three games and hence the best-of-five competition. This is a coming of age moment for the neural net community. Over the past ten years, thanks to the emergence of large data sets and GPUs, the whole field has experienced a renaissance. But most of the great results have been on toy problems like image classification or traditional signal processing problems like speech recognition. Beating an elite Go player for the first time is a marquee result that transcends the field. As long as I’ve worked in software, Go has been the one game at which computers couldn’t compete. Everyone thought that this result was years away.

Last fall, AlphaGo competed against Fan Hui, a lower-ranked Go professional. It beat him 5-0. This was the first time that a computer Go program had defeated a professional. What happened next is suggestive. According to Wired, he began consulting with the DeepMind team:

As he played match after match with AlphaGo over the past five months, he watched the machine improve. But he also watched himself improve. The experience has, quite literally, changed the way he views the game. When he first played the Google machine, he was ranked 633rd in the world. Now, he is up into the 300s. In the months since October, AlphaGo has taught him, a human, to be a better player. He sees things he didn’t see before. And that makes him happy. “So beautiful,” he says. “So beautiful.”

We learn most quickly from our betters: people who can review our work, say what we did well and point out what would have been better. But if you’re the best Go player in the world, who do you learn from? I wouldn’t be surprised if this result prompts humans to discover new styles of play. Computers may get the best of us at Go in the long run, but we’ll get better at it in the process.

The fifth and final game happens tonight. If Lee Sedol wins, you could make the case that he just took a few games to figure out how to get the best of AlphaGo. If not, it means that it took completely brilliant play from the best in the world to beat a computer, and it’s likely to be the last time a human ever pulls this off. The match is being streamed live on YouTube.

Extending the Grid to Add 1,000 Photos to OldNYC

2016-01-19T00:00:00+00:00

I recently added around 1,000 new photos to the map on OldNYC. Read on to find out how!

At its core, OldNYC is based on geocoding: the process of going from textual addresses like “9th Street and Avenue A” to numeric latitudes and longitudes. There’s a bit of a mismatch here. The NYPL photos have 1930s addresses and cross-streets, but geocoders are built to work with contemporary addresses. OldNYC makes an assumption that contemporary geocoders will produce accurate results for these old addresses. For NYC, this is usually a good assumption! The street grid hasn’t changed too much in the past 150 years. But it is an assumption, and it doesn’t always pan out.

Two of the most noticeable problem spots are Stuytown and Park Avenue South:

The lettered Avenues (A, B, C, D) used to continue above 14th street. This was the Gas House district. But in the 1940s, this area was destroyed to make way for the super-blocks of Stuyvesant Town. Intersections like “15th and A” do not exist in the contemporary Manhattan grid and geocoders can’t make sense of them. But there are photos there!

The problem for Park Avenue South is different. Until 1959, it was known as 4th Avenue. So photographs from the 1930s are recorded as being at, for example, “4th Avenue and 17th street”, an intersection which no longer exists. Again, contemporary geocoders can’t make sense of this.

The frustrating thing here is that it’s perfectly obvious where all of these intersections should be. Manhattan has a regular street grid, after all. So I set out to build my own Manhattan street grid geocoder.

To begin with, I gathered lat/lons for every intersection that I could. With some simple logic, this handled the Avenue renaming issue.

My initial idea to geocode unknown intersections was to interpolate on the avenues. For example, to find where the intersection of 18th Street and Avenue A should be, you can assume that the intersections of numbered streets and Avenue A are evenly spaced and then find where the 18th street intersection would fall:

Mathematically, you fit linear regressions from cross-street to latitude and longitude. This feels like it should work but, because the streets aren’t all perfectly spaced, it winds up producing results that don’t quite look right.

While I was playing around with this approach, I realized that I was checking the results using a different technique: continuing the straight lines of the streets until they intersected:

Mathematically, this means that you fit a linear regression to the latitude→longitude mapping for each Street and Avenue. To find an intersection, you find the point where these lines intersect. This works so long as the Streets and Avenues are straight. Fortunately, with a few exceptions like Avenue C and the West Village, they are (r²>0.99).
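Here is a minimal sketch of that line-fitting idea in plain Python. It is not the OldNYC code: fit_line and intersect are hypothetical names, and each Street or Avenue is assumed to be given as a list of (lat, lon) pairs for its known intersections.

```python
def fit_line(points):
    """Least-squares fit of lon = slope * lat + intercept over (lat, lon) pairs."""
    n = len(points)
    mean_lat = sum(lat for lat, _ in points) / n
    mean_lon = sum(lon for _, lon in points) / n
    sxx = sum((lat - mean_lat) ** 2 for lat, _ in points)
    sxy = sum((lat - mean_lat) * (lon - mean_lon) for lat, lon in points)
    slope = sxy / sxx
    return slope, mean_lon - slope * mean_lat


def intersect(street, avenue):
    """Return the (lat, lon) where two fitted lines cross."""
    (a1, b1), (a2, b2) = street, avenue
    # Solve a1 * lat + b1 == a2 * lat + b2; fails only if the lines are parallel.
    lat = (b2 - b1) / (a1 - a2)
    return lat, a1 * lat + b1
```

To geocode an intersection such as “15th and A”, you would fit one line through the known intersections along 15th Street, another through those along Avenue A, and pass both to intersect. Fitting longitude as a function of latitude works here because neither Streets nor Avenues run due east-west in Manhattan.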

This approach produced very good results. The oddities which remained were as likely to be problems with the data as with the geocoder (one image was nonsensically labeled as “25th & D”, which extrapolates to somewhere in the East River).

While Stuytown and Park Avenue South were clear winners, new photos appeared all over the map:

It even helped uptown:

All told, there are about 1,000 new images on the map. Go check them out! And please help transcribe the text on the back of them. My OCR system didn’t run on the new images, so they’re sorely lacking descriptions.

Here are a few favorites:

Looking down Avenue B from 15th to 17th Street. This Avenue no longer exists.

The Everett House hotel at Union Square in 1906. This building no longer exists, but there's a new hotel (the W) at the same location.

My takeaways from NIPS 2015

2015-12-12T00:00:00+00:00

I’ve just wrapped up my trip to NIPS 2015 in Montreal and thought I’d jot down a few things that struck me this year:

Saddle Points vs Local Minima

I heard this point repeated in a talk almost every day. In low-dimensional spaces (i.e. the ones we can visualize) local minima are the major impediment to optimizers reaching the global minimum. But this doesn’t generalize. In high-dimensional spaces, local minima are almost non-existent. Instead, there are saddle points: points which are a minimum in some directions but a maximum in others. Intuitively, this makes sense: in N dimensions, the odds of the curvatures all going the same way at a point are (1/2)^N. As Yoshua Bengio said, “it’s hard to build an n-dimensional wall.” This gives an intuition for why procedures like gradient descent are effective at optimizing the thousands of weights in a neural net: they won’t get stuck in a local optimum. And it gives an intuition for why momentum is helpful: it helps gradient descent escape from saddle points.
Model Compression

The tutorial on Hardware for Deep Learning was less about new hardware and more about how to make your software get the most out of existing hardware. Due to the high cost of uncached, off-chip memory reads, reducing the memory footprint of your models can be a huge performance win. Bill Dally presented a result on model pruning that I found interesting: by iteratively removing small weights from a model and retraining, they were able to remove 90+% of the weights with zero loss of precision. This parallels an observation from transfer learning, that small networks are most effectively trained using the output of larger networks. It would be nice if we could train these smaller networks directly. See the Deep Compression paper.
The importance of canonical data sets / problems

Over and over, talks and posters referenced the same canonical data sets: the MNIST set of handwritten digits, the CIFAR and ImageNet images, the TIMIT speech corpus and the Atari/Arcade Learning Environment (ALE). These have given researchers in their fields a shared problem on which to experiment, compete, collaborate and measure their progress. If you want to push a field forward, build a good challenge problem.
One-shot Learning

There was much high-level talk about how the human brain is very good at learning to perform new tasks quickly. Contrast this with neural nets, which require thousands or millions of training examples to reach human performance. This comparison is somewhat unfair because adult humans have years of experience interacting with the real world from which to draw on. There seems to be a great deal of interest in getting machines to do a better job of transferring general knowledge to specific tasks.
This conference has gotten huge!

I’ve read that registrations have gone up significantly over the last few years. This was palpable at the conference. Many of the workshop rooms were packed to the gills and I watched most of the larger talks in overflow rooms. This is probably a good thing for the ML community. I’m not sure if it’s a good thing for NIPS.

A few smaller bits that struck me:

Highway Networks are cool. They let you train very deep networks, where the depth has to be learned. They found that depth > 20 was not helpful for MNIST, but was for CIFAR.
AlexNet seems to have become a canonical neural net for experimentation. I saw it referenced repeatedly, e.g. on the Deep Compression poster.
As an example of the above, I really enjoyed the Pixels to Voxels talk. Pulkit Agrawal & co showed that the activations of individual layers of AlexNet correlate to activations in regions of the brain. They were able to use this correspondence to learn what some of the mid-level regions of the visual cortex are doing.
We heard several speakers say that “backprop is not biologically plausible.” I assume this is because we don’t consume nearly enough labels for it to be practical at the scale of a human brain?
I asked someone at the NVIDIA booth whether the ML industry is large enough to drive GPU design & sales (as opposed to the game industry). It is. The GPUs designed for ML tend to be more robust than those used in games. They have error correcting codes built-in. In a game, if an arithmetic unit makes a mistake, it’ll be fixed on the next frame. When you’re training a neural net, that mistake can propagate.

I do relatively little machine learning in my day-to-day. NIPS is always a bit over my head, but it’s a good way to rekindle my interest in the field.

Dan writes on HammerLab

2015-10-21T18:03:00+00:00

I haven’t written a substantial blog post on danvk.org since January. Instead, I’ve been writing over on my group’s blog at hammerlab.org.

Here are the posts I’ve written or edited:

SVG→Canvas, the pileup.js Journey (13 Oct 2015)

In which I explain why we changed from using SVG to using canvas for our genome browser (spoiler: it’s performance). This also introduces data-canvas, which compensates for some of the drawbacks of canvas, e.g. difficulty in tracking clicks and writing tests.
Bundling and Distributing Complex ES6 Libraries in an ES5 World (09 Jul 2015)

I helped edit this post, which was written by Arman Aksoy. It outlines our approach to writing ES6 JavaScript while distributing/bundling it for ES5 clients.
Introducing pileup.js, a Browser-based Genome Viewer (19 Jun 2015)

This post introduces pileup.js, the in-browser genome visualizer I’ve been working on for most of the past year.
Testing React Web Apps with Mocha (14 Feb 2015)
Testing React Web Apps with Mocha (Part 2) (21 Feb 2015)

A two-parter in which I explain how we set up testing for our React web app using Mocha, rather than Jest. Since writing this post, I’ve completely changed my mind about how web testing should be done. If your code is intended to be run in the browser, then you should test it in the browser, rather than using Node as these posts suggest.
Faster Pileup Loading with BAI Indices (23 Jan 2015)

BAM is a very widely used bioinformatics file format which stores aligned reads from a genome. These files can be huge, so the BAM Index (BAI) format was created to speed up retrieval. We found that the BAI was also too large for convenient access on the web, so we created an index of the index.
Streaming from HDFS with igv-httpfs (05 Dec 2014)

How we got data from HDFS into our genome viewer of choice by building a small piece of infrastructure. We’ve since stopped using this. Nowadays we have an NFS mount for our HDFS file system and serve from that using nginx.

Launched: OldNYC

2015-06-04T02:24:00+00:00

Two weeks ago I launched my latest side project, OldNYC. It’s a collaboration with NYPL Labs which places around 40,000 historical photos of New York City on a map. Avid readers of this blog know that I’ve been working on this for years.

The response to OldNYC has been completely overwhelming. Hundreds of thousands of people have used the site. Millions of images have been viewed. Users have left nearly a thousand comments and fixed thousands of typos in the OCR’d descriptions.

Rather than say more about the project myself, I’ll let you pick your write-up of choice. There have been many!

New York Times: New York Today: New Views of the Past
Gothamist: This Photo Map Will Bring You Back To Old NYC, Block By Block
The Guardian: New York vintage photo archive lets you track a century of change to your block
Citylab: Mapping the New York That Once Was
Pix11 News (TV): Website gives a virtual tour of old NYC through photo maps
Library Journal: Developer Maps Library Photo Archives
Kottke: Mapping photos of old NYC
Gizmodo: Here Are 40,000 Photos Of Old New York Plotted on a City Map
DNAInfo: See Old Photos of Your Neighborhood With This Interactive Map
West Side Rag: Throwback Thursday: New Tool Lets You See Photos of Your Street Through the Years
Daily Mail: The Big Apple back in time: Forty thousand historic photographs on an interactive map offer slice of life from old New York
Business Insider: This incredible map lets New Yorkers see vintage photos of their street corners
EV Grieve: Immerse yourself in archival photos of NYC
The Awl: Old New York, Mapped
The Week: Check out this amazing archive of historic New York pictures
Fast Company: See What Times Square And Wall Street Looked Like 100 Years Ago
Free Williamsburg: See photos of old Williamsburg thanks to the NY Public Library
PetaPixel: OldSF and OldNYC: Historical Photos Plotted on Maps
Brooklyn Magazine: Here are Thousands of Historical Photos of New York City, All on One Interactive Map
Mental Floss: Tour Old New York on Your Computer

PyCon 2015: Make web development awesome with visual diffing tools

2015-04-12T15:05:00+00:00

Here’s the video of my talk from PyCon 2015, Make web development awesome with visual diffing tools:

Here are the slides for the talk:

The two tools referenced are:

dpxdt for generating screenshots
webdiff for viewing image diffs

I used comparea and dygraphs as sample apps for the demos.

If you enjoyed my talk, you might also enjoy Brett’s talk from a few years ago, The Secret of Safe Continuous Deployment: Perceptual Diffs. It goes into more depth on how screenshots can facilitate rapid deployment.

Training an Ocropus OCR model

2015-01-11T04:58:00+00:00

In the last post, we walked through the steps in the Ocropus OCR pipeline. We extracted text from images like this:

The results using the default model were passable but not great:

O1inton Street, aouth from LIYingston Street.
Auguat S, 1934.
P. L. Sperr.
NO REPODUCTIONS.

Over the larger corpus of images, the error rate was around 10%. The default model has never seen typewriter fonts, nor has it seen ALLCAPS text, both of which figure prominently in this collection. So its poor performance comes as no surprise.

In this post I’ll walk through the process of training an Ocropus model to recognize the typewritten text in this collection. By the end of this post, the performance will be extremely good.

Generating truth data

Ocropus trains its model using supervised learning: it requires images of lines along with correct transcriptions. If you’re trying to recognize a known font, you can generate arbitrary amounts of labeled data (using ocropus-linegen). But in our case, we have to label some images by hand.

This is tedious and involves a lot of typing. Amazon’s Mechanical Turk is a popular way of farming out small tasks like this, but I prefer to do the transcription myself using localturk. It doesn’t take as long as you might think (I typed 800 lines in about an hour and 20 minutes). And it has the benefit of forcing you to look at a large sample of your data, something that’s likely to lead to insights.

(localturk in action)

I used this template for the transcription. Ocropus expects truth data to be in .gt.txt files with the same name as the PNG files for the lines. For example:

book/0001/010001.png
book/0001/010001.gt.txt
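If your transcriptions end up in a spreadsheet or CSV export, a small script can write them out in this layout. This is a sketch, not part of Ocropus: truth_path and write_truth are hypothetical helpers, and the code assumes the truth file shares the line image's name up to the first dot, so that both 010001.png and 010001.bin.png map to 010001.gt.txt. Check that against your Ocropus version.

```python
from pathlib import Path


def truth_path(line_image):
    """Return the .gt.txt path that sits next to a line image."""
    image = Path(line_image)
    # Assumption: everything after the first dot in the file name is extension.
    base = image.name.split(".", 1)[0]
    return image.with_name(base + ".gt.txt")


def write_truth(line_image, text):
    """Write one transcribed line next to its image and return the path."""
    path = truth_path(line_image)
    path.write_text(text.strip() + "\n", encoding="utf-8")
    return path
```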

It’s important that you transcribe lines, not entire pages. I initially transcribed pages and tried to have Ocropus learn on them, but this doesn’t work at all.

Training a model

Ocropus trains a model by learning from its mistakes. It transcribes the text in a line, then adjusts the weights in the Neural Net to compensate for the errors. Then it does this again for the next line, and the next, and so on. When it gets to the last line of labeled data, it starts over again. As it loops through the training data over and over again, the model gets better and better.

ocropus-rtrain -o modelname book*/????/*.bin.png

This produces lots of output like this:

2000 70.56 (1190, 48) 715641b-crop-010002.png
   TRU: u'504-508 West 142nd Street, adjoining and west of Hamilton'
   ALN: u'504-5088 West 422nd Street, adjoining and west of Hammilton'
   OUT: u'3od-iS est 4nd Street, doning nd est of Sarilton'
2001 32.38 (341, 48) 726826b-crop-010003.png
   TRU: u'NO REPRODUCTIONS'
   ALN: u'NO REPRODUCTIONS'
   OUT: u'sO EROCoOri'
...

TRU is the truth data. OUT is the output of the model. ALN is a variant of the model output which is aligned to the truth data. It’s used to adjust the model weights more precisely. It typically looks better than the model output, especially in early iterations. It lets you know that you’re making progress.

Here’s a video that Thomas, the Ocropus developer, put together. It shows the network’s output for a single image as it learns (see the YouTube page for explanations of the different charts):

For my first model, I used 400 of the labeled lines as training data and held out the other 400 as test data. Ocropus saves models to disk every 1000 iterations, so it’s simple to evaluate the model’s performance as it learns:

The error rate starts high (over 50%) but quickly comes down to about 2% after 10,000 iterations, eventually hitting a minimum of 0.96% at 16,000 iterations.

The error rate on the test set is consistently about 3% higher than that on the training set. The best error rate on the test set was 4.20%.

There’s a lot of variation in the error rate. You might expect it to slowly decrease over time, but that’s not at all the case. I’m not quite sure how to interpret this. Does the error rate spike at 17,000 iterations because the model tries to jolt itself out of a local minimum? Is it just randomness?

In any case, it’s important to generate a chart like this. Choosing the wrong model could lead to needlessly bad performance.
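For reference, here is one way to compute a character error rate and pick a checkpoint from it. This is a sketch under my own assumptions rather than how Ocropus measures errors internally: the error rate is taken to be total edit distance divided by the number of characters in the truth data, and all three function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete a character from a
                cur[j - 1] + 1,            # insert a character into a
                prev[j - 1] + (ca != cb),  # substitute, free if they match
            ))
        prev = cur
    return prev[-1]


def error_rate(pairs):
    """Character error rate over (truth, output) pairs."""
    errors = sum(edit_distance(truth, out) for truth, out in pairs)
    chars = sum(len(truth) for truth, _ in pairs)
    return errors / chars


def best_checkpoint(rates):
    """Given {iteration: error rate on held-out lines}, return the best iteration."""
    return min(rates, key=rates.get)
```

Running error_rate over the held-out lines for every saved model gives you the data for the chart, and best_checkpoint picks the model to keep.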

Training with more data

You’d expect that training on more data would yield a better model. So for my next model, I trained on all 800 labeled images (rather than just 400). I didn’t have a test set. Here’s what the error rate looked like:

This doesn’t make much sense to me. The lowest error rate on the 800 training images is 3.59%. But the model from the previous section achieved an error rate of 2.58% on the same data set (average of 0.96% and 4.20%). And it only saw half the data! How is that possible? Maybe this model just had bad luck.

There’s the same pattern as before of occasional spikes in error rate. More disturbing, after around 40,000 iterations, I started seeing lots of FloatingPointErrors. It’s unclear to me exactly what this means. Perhaps the model is diverging?

Here’s another model that I trained for even longer:

It achieves an error rate of 0.89% at iteration 33,000, then spikes to over 15% at 37,000. It eventually gets back down to 0.85% after 53,000 iterations, then starts spiking again. By the time I stopped it, I was again seeing lots of FloatingPointErrors.

The point of all this is that the error rates are quite erratic, so you need to look at them before choosing which model you use!

Training with the default model

So far we’ve built our models from scratch. But you can also build on top of an existing model.

Even though it’s never seen typewriter text or ALLCAPS, the default Ocropus model presumably knows a lot about Latin characters and the relationship between them in English words. And I trust the Ocropus developers to build a good Ocropus model far more than I trust myself.

You train on top of an existing model using the --load option:

ocropus-rtrain --load en-default.pyrnn.gz -o my-model *.png

Here’s what the error rate looks like:

Now we’re getting somewhere: the error rate gets all the way down to 0.277%!

Something interesting happens when you get the error rate significantly below 1%. The “mistakes” that the model makes are quite likely to be errors that you made while transcribing truth data! I noticed that I misspelled some words and even hallucinated new words like “the” into some of the lines.

Even crazier, there were typos in the original images that I subconsciously corrected:

(Look at the second to last word.)

A model with a 0.2% error rate is good enough to produce readable text. For example, here’s what it produces for the image from the last post:

→ Clinton Street, south from Livingston Street.

→ P. L. Sperr.

→ NO REPRODUCTIONS.

→ August 5, 1934.

i.e. it’s perfect. Here’s the output of the Neural Net for the last line:

Compare that to what it was before:

There’s still some ambiguity around 5/S, but it makes the right call. The a vs s error is completely gone.

Conclusions

At this point the model is good enough. If I were to improve it further, I’d either improve my image cropper or incorporate some kind of spell checking as a post-processing step.

The behavior of the models as they’re trained is sometimes inscrutable. Finding a good one involves a lot of trial and error. To avoid flailing, measure your performance constantly and keep a list of ideas to explore. “Train a model starting with the pre-built one” was item #6 on my list of ideas and it took me a while to get around to trying it. But it was the solution!

If you’re feeling lost or frustrated, go generate some more training data. At least you’ll be doing something useful.

At the end of the day, I’m very happy with the OCR model I built. Ocropus has some rough edges, but it’s simple enough that you can usually figure out what’s going on and how to fix problems as they come up. And the results speak for themselves!

Extracting text from an image using Ocropus

2015-01-09T22:45:00+00:00

In the last post, I described a way to crop an image down to just the part containing text. The end product was something like this:

In this post, I’ll explain how to extract text from images like these using the Ocropus OCR library. Plain text has a number of advantages over images of text: you can search it, it can be stored more compactly and it can be reformatted to fit seamlessly into web UIs.

I don’t want to get too bogged down in the details of why I went with Ocropus over its more famous cousin, Tesseract, at least not in this post. The gist is that I found it to be:

more transparent about what it was doing
more hackable
more robust to character segmentation issues

This post is a bit long, but there are lots of pictures to help you get through it. Be strong!

Ocropus

Ocropus (or Ocropy) is a collection of tools for extracting text from scanned images. The basic pipeline looks like this:

I’ll talk about each of these steps in this post. But first, we need to install Ocropus!

Installation

Ocropus uses the Scientific Python stack. To run it, you’ll need scipy, PIL, numpy, OpenCV and matplotlib. Setting this up is a bit of a pain, but you’ll only ever have to do it once (at least until you get a new computer).

On my Mac running Yosemite, I set up brew, then ran:

brew install python
brew install opencv
brew install homebrew/python/scipy

To make this last step work, I had to follow the workaround described in this comment:

cd /usr/local/Cellar/python/2.7.6_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages
rm cv.py cv2.so
ln -s /usr/local/Cellar/opencv/2.4.9/lib/python2.7/site-packages/cv.py cv.py
ln -s /usr/local/Cellar/opencv/2.4.9/lib/python2.7/site-packages/cv2.so cv2.so

Then you can follow the instructions on the ocropy site. You’ll know you have things working when you can run ocropus-nlbin --help.

Binarization

The first step in the Ocropus pipeline is binarization: the conversion of the source image from grayscale to black and white.

There are many ways to do this, some of which you can read about in this presentation. Ocropus uses a form of adaptive thresholding, where the cutoff between light and dark can vary throughout the image. This is important when working with scans from books, where there can be variation in light level over the page.

Also lumped into this step is skew estimation, which tries to rotate the image by small amounts so that the text is truly horizontal. This is done more or less through brute force: Ocropy tries 32 different angles between +/-2° and picks the one which maximizes the variance of the row sums. This works because, when the image is perfectly aligned, there will be huge variance between the rows with text and the blanks in between them. When the image is rotated, these gaps are blended.
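Here's a minimal sketch of the brute-force idea, not Ocropy's actual code. To stay dependency-free it assumes the page is a list of rows of 0/1 values (1 = ink) and approximates each small rotation with a vertical shear, whereas a real implementation would rotate the image with interpolation. The names shear and estimate_skew are made up for illustration.

```python
import math

def row_sum_variance(img):
    """Variance of the per-row ink counts."""
    sums = [sum(row) for row in img]
    mean = sum(sums) / len(sums)
    return sum((s - mean) ** 2 for s in sums) / len(sums)

def shear(img, angle_deg):
    """Approximate a small rotation by shifting each column vertically."""
    h, w = len(img), len(img[0])
    t = math.tan(math.radians(angle_deg))
    out = [[0] * w for _ in range(h)]
    for x in range(w):
        dy = int(round(x * t))
        for y in range(h):
            src = y - dy
            if 0 <= src < h:
                out[y][x] = img[src][x]
    return out

def estimate_skew(img, max_angle=2.0, steps=32):
    """Try evenly spaced angles and keep the one with the sharpest row profile."""
    angles = [-max_angle + 2 * max_angle * i / (steps - 1) for i in range(steps)]
    return max(angles, key=lambda a: row_sum_variance(shear(img, a)))
```

The angle returned is the correction to apply, so a page skewed by +1.5° should come back as roughly -1.5°.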

ocropus-nlbin -n 703662b.crop.png -o book

The -n tells Ocropus to suppress page size checks. We’re giving it a small, cropped image, rather than an image of a full page, so this is necessary.

This command produces two outputs:

book/0001.bin.png: binarized version of the first page (above)
book/0001.nrm.png: a “flattened” version of the image, before binarization. This isn’t very useful.

(The Ocropus convention is to put all intermediate files in a book working directory.)

Segmentation

The next step is to extract the individual lines of text from the image. Again, there are many ways to do this, some of which you can read about in this presentation on segmentation.

Ocropus first estimates the “scale” of your text. It does this by finding connected components in the binarized image (these should mostly be individual letters) and calculating the median of their dimensions. This corresponds to something like the x-height of your font.

Next it tries to find the individual lines of text. The sequence goes something like this:

It removes components which are too big or too small (according to scale). These are unlikely to be letters.
It applies the y-derivative of a Gaussian kernel (p. 42) to detect top and bottom edges of the remaining features. It then blurs this horizontally to blend the tops of letters on the same line together.
The bits between top and bottom edges are the lines.

A picture helps explain this better. Here’s the result of step 2 (the edge detector + horizontal blur):

The white areas are the tops and the black areas are the bottoms.

Here’s another view of the same thing:

Here the blue boxes are components in the binarized image (i.e. letters). The wispy green areas are tops and the red areas are bottoms. I’d never seen a Gaussian kernel used this way before: its derivative is an edge detector.

Here are the detected lines, formed by expanding the areas between tops and bottoms:

It’s interesting that the lines needn’t be simple rectangular regions. In fact, the bottom two components have overlapping y-coordinates. Ocropus applies these regions as masks before extracting rectangular lines:

Here’s the command I used (the g in ocropus-gpageseg stands for “gradient”):

ocropus-gpageseg -n --maxcolseps 0 book/0001.bin.png

The --maxcolseps 0 tells Ocropus that there’s only one column in this image. The -n suppresses size checks, as before.

This has five outputs:

book/0001.pseg.png encodes the segmentation. The color at each pixel indicates which column and line that pixel in the original image belongs to.
book/0001/01000{1,2,3,4}.bin.png are the extracted line images (above).

Character Recognition

After all that prep work, we can finally get to the fun part: character recognition using a Neural Net.

The problem is to perform this mapping:

→ August 5, 1934.

This is challenging because each line will have its own quirks. Maybe binarization produced a darker or lighter image for this line. Maybe skew estimation didn’t work perfectly. Maybe the typewriter had a fresh ribbon and produced thicker letters. Maybe the paper got water on it in storage.

Ocropus uses an LSTM Recurrent Neural Net to learn this mapping. The default model has 48 inputs, 200 nodes in a hidden layer and 249 outputs.

The inputs to the network are columns of pixels. The columns in the image are fed into the network, one at a time, from left to right. The outputs are scores for each possible letter. As the columns for the A in the image above are fed into the net, we’d hope to see a spike from the A output.

Here’s what the output looks like:

The image on the bottom is the output of the network. Columns in the text and the output matrix correspond to one another. Each row in the output corresponds to a different letter, reading alphabetically from top to bottom. Red means a strong response, blue a weaker response. The red streak under the A is a strong response in the A row.

The responses start somewhere around the middle to right half of each letter, once the net has seen enough of it to be confident it’s a match. To extract a transcription, you look for maxima going across the image.
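To make "look for maxima going across the image" concrete, here's a simplified sketch. It is not Ocropus's decoder: it assumes the network output is a list of per-column score lists, treats any column whose best score clears a threshold as part of a "run", and emits the label with the strongest response in each run. The function name and the 0.5 threshold are assumptions.

```python
def decode(outputs, labels, threshold=0.5):
    """outputs: one list of per-class scores for each pixel column.
    labels: the character for each class index."""
    text = []
    run_best = None  # (score, label) of the strongest response in the current run
    for scores in outputs:
        score, idx = max((s, i) for i, s in enumerate(scores))
        if score >= threshold:
            # Inside a run of confident columns: remember the peak.
            if run_best is None or score > run_best[0]:
                run_best = (score, labels[idx])
        elif run_best is not None:
            # The run just ended: emit its peak label.
            text.append(run_best[1])
            run_best = None
    if run_best is not None:
        text.append(run_best[1])
    return "".join(text)
```

One limitation of this sketch: two letters whose responses touch without dipping below the threshold are merged into one.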

In this case, the transcription is Auguat S, 1934.:

It’s interesting to look at the letters that this model gets wrong. For example, the s in August produces the strongest response on the a row. But there’s also a (smaller) response on the correct s row. There’s also considerable ambiguity around the 5, which is transcribed as an S.

My #1 feature request for Ocropus is for it to output more metadata about the character calls. While there might not be enough information in the image to make a clear call between Auguat and August, a post-processing step with a dictionary would clearly prefer the latter.
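As a rough illustration of the dictionary idea, here's a sketch that works on the final text only (so it can't use the per-character scores I'm wishing for). It uses difflib from the Python standard library. The word list and the 0.75 cutoff are placeholders, not tuned values.

```python
import difflib

# Hypothetical vocabulary; a real one would be a full dictionary.
WORDS = ["August", "April", "Street", "south", "from", "Clinton", "Livingston"]

def correct_word(word, vocabulary=WORDS, cutoff=0.75):
    """Return the closest vocabulary word, or the input if nothing is close."""
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word

print(correct_word("Auguat"))  # prints "August"
```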

The transcriptions with the default model are:

→ O1inton Street, aouth from LIYingston Street.
→ P. L. Sperr.
→ NO REPODUCTIONS.
→ Auguat S, 1934.

This is passable, but not great. The Ocropus site explains why:

There are some things the currently trained models for ocropus-rpred will not handle well, largely because they are nearly absent in the current training data. That includes all-caps text, some special symbols (including “?”), typewriter fonts, and subscripts/superscripts. This will be addressed in a future release, and, of course, you are welcome to contribute new, trained models.

We’ll fix this in the next post by training our own model.

The command to make predictions is:

ocropus-rpred -m en-default.pyrnn.gz book/0001/*.png

I believe the r stands for “RNN” as in “Recurrent Neural Net”.

The outputs are book/0001/01000{1,2,3,4}.txt.

If you want to see charts like the one above, pass --show or --save.

Extracting the text

We’re on the home stretch!

One way to get a text file out of Ocropus is to concatenate all the transcribed text files:

cat book/????/??????.txt > ocr.txt

The files are all in alphabetical order, so this should do the right thing.

In practice, I found that I often disagreed with the line order that Ocropus chose. For example, I’d say that August 5, 1934. is the second line of the image we’ve been working with, not the fourth.

Ocropus comes with an ocropus-hocr tool which converts its output to hOCR format, an HTML-based format designed by Thomas Breuel, who also developed Ocropus.

We can use it to get bounding boxes for each text box:

$ ocropus-hocr -o book/book.html book/0001.bin.png
$ cat book/book.html
...
<div class='ocr_page' title='file book/0001.bin.png'>
<span class='ocr_line' title='bbox 3 104 607 133'>O1inton Street, aouth from LIYingston Street.</span><br />
<span class='ocr_line' title='bbox 3 22 160 41'>P. L. Sperr.</span><br />
<span class='ocr_line' title='bbox 1 1 228 19'>NO REPODUCTIONS.</span><br />
<span class='ocr_line' title='bbox 377 67 579 88'>Auguat S, 1934.</span><br />
</div>
...

Ocropus tends to read text more left to right than top to bottom. Since I know my images only have one column of text, I’d prefer to emphasize the top-down order. I wrote a small tool to reorder the text in the way I wanted.
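Here's a sketch of that kind of reordering (an illustration, not the tool itself). It parses the ocr_line spans with a regular expression and sorts them by their vertical position. It assumes, as the boxes above suggest, that y increases upward in this output, so the topmost line has the largest y value. The name reorder_lines is made up.

```python
import re

LINE_RE = re.compile(
    r"<span class='ocr_line' title='bbox (\d+) (\d+) (\d+) (\d+)'>(.*?)</span>")

SAMPLE = """
<span class='ocr_line' title='bbox 3 104 607 133'>O1inton Street, aouth from LIYingston Street.</span><br />
<span class='ocr_line' title='bbox 3 22 160 41'>P. L. Sperr.</span><br />
<span class='ocr_line' title='bbox 1 1 228 19'>NO REPODUCTIONS.</span><br />
<span class='ocr_line' title='bbox 377 67 579 88'>Auguat S, 1934.</span><br />
"""

def reorder_lines(hocr):
    """Return line texts ordered top to bottom, then left to right."""
    lines = []
    for x0, y0, x1, y1, text in LINE_RE.findall(hocr):
        lines.append((int(y1), int(x0), text))
    # Assumption: larger y means higher on the page in this output.
    lines.sort(key=lambda t: (-t[0], t[1]))
    return [t[2] for t in lines]

for line in reorder_lines(SAMPLE):
    print(line)
```

With the sample above, the date line moves up to second place.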

Conclusions

Congrats on making it this far! We’ve walked through the steps of running the Ocropus pipeline.

The overall results aren’t good (~10% of characters are incorrect), at least not yet. In the next post, I’ll show how to train a new LSTM model that completely destroys this problem.

Finding blocks of text in an image using Python, OpenCV and numpy

2015-01-07T00:00:00+00:00

As part of an ongoing project with the New York Public Library, I’ve been attempting to OCR the text on the back of the Milstein Collection images. Here’s what they look like:

A few things to note:

There’s a black border around the whole image, gray backing paper and then white paper with text on it.
Only a small portion of the image contains text.
The text is written with a typewriter, so it’s monospace. But the typewriter font isn’t always consistent across the collection. Sometimes a single image has two fonts!
The image is slightly rotated from vertical.
The images are ~4x the resolution shown here (2048px tall)
There are ~34,000 images: too many to affordably turk.

OCR programs typically have to do some sort of page-layout analysis to find out where the text is and carve it up into individual lines and characters. When you hear “OCR”, you might think about fancy Machine Learning techniques like Neural Nets. But it’s a dirty secret of the trade that page layout analysis, a much less glamorous problem, is at least as important in getting good results.

The most famous OCR program is Tesseract, a remarkably long-lived open source project developed over the past 20+ years at HP and Google. I quickly noticed that it performed much better on the Milstein images when I manually cropped them down to just the text regions first:

So I set out to write an image cropper: a program that could automatically find the green rectangle in the image above. This turned out to be surprisingly hard!

Computer Vision problems like this one are difficult because they’re so incredibly easy for humans. When you looked at the image above, you could immediately isolate the text region. This happened instantaneously, and you’ll never be able to break down exactly how you did it.

The best we can do is come up with ways of breaking down the problem in terms of operations that are simple for computers. The rest of this post lays out a way I found to do this.

First off, I applied the Canny edge detector to the image. This produces white pixels wherever there’s an edge in the original image. It yields something like this:

This removes most of the background noise from the image and turns the text regions into bright clumps of edges. It turns the borders into long, crisp lines.

The sources of edges in the image are the borders and the text. To zero in on the text, it’s going to be necessary to eliminate the borders.

One really effective way to do this is with a rank filter. This essentially replaces a pixel with something like the median of the pixels to its left and right. The text areas have lots of white pixels, but the borders consist of just a thin, 1 pixel line. The areas around the borders will be mostly black, so the rank filter will eliminate them. Here’s what the image looks like after applying a vertical and horizontal rank filter:

The borders are gone but the text is still there! Success!
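Here's a small pure-Python sketch of why this works. It uses a true median as the rank, on a list-of-rows image of 0/1 values, and keeps a pixel only if it survives both the horizontal and the vertical pass. The window size and function names are assumptions. A real pipeline would use an image library's rank filter on the full-size edge image.

```python
def median_1d(values, size):
    """Median filter over a 1D list, with the window clipped at the ends."""
    half = size // 2
    out = []
    for i in range(len(values)):
        window = sorted(values[max(0, i - half): i + half + 1])
        out.append(window[len(window) // 2])
    return out

def remove_thin_lines(img, size=5):
    """Drop features that are thin in either direction; keep blobs."""
    h, w = len(img), len(img[0])
    rows = [median_1d(row, size) for row in img]
    cols = [median_1d([img[y][x] for y in range(h)], size) for x in range(w)]
    # A pixel stays white only if both filtered versions agree it is white.
    return [[min(img[y][x], rows[y][x], cols[x][y]) for x in range(w)]
            for y in range(h)]
```

A 1-pixel vertical border is mostly surrounded by black within each row, so the horizontal pass removes it. The vertical pass does the same for horizontal borders.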

While this is effective, it still leaves bits of text outside the borders (look at the top left and bottom right). That may be fine for some applications, but I wanted to eliminate these because they’re typically uninteresting and can confuse later operations. So instead of applying the rank filter, I found the contours in the edge image. These are sets of white pixels which are connected to one another. The border contours are easy to pick out: they’re the ones whose bounding box covers a large fraction of the image:

With polygons for the borders, it’s easy to black out everything outside them.

What we’re left with is an image with the text and possibly some other bits due to smudges or marks on the original page.

At this point, we’re looking for a crop (x1, y1, x2, y2) which:

maximizes the number of white pixels inside it and
is as small as possible.

These two goals are in opposition to one another. If we took the entire image, we’d cover all the white pixels. But we’d completely fail on goal #2: the crop would be unnecessarily large. This should sound familiar: it’s a classic precision/recall tradeoff:

The recall is the fraction of white pixels inside the cropping rectangle.
The precision is the fraction of the image outside the cropping rectangle.

A fairly standard way to solve precision/recall problems is to optimize the F1 score, the harmonic mean of precision and recall. This is what we’ll try to do.

The set of all possible crops is quite large: W²H², where W and H are the width and height of the image. For a 1300x2000 image, that’s about 7 trillion possibilities!

The saving grace is that most crops don’t make much sense. We can simplify the problem by finding individual chunks of text. To do this, we apply binary dilation to the de-bordered edge image. This “bleeds” the white pixels into one another. We do this repeatedly until there are only a few connected components. Here’s what it looks like:

As we hoped, the text areas have all bled into just a few components. There are five connected components in this image. The white blip in the top right corresponds to the “Q” in the original image.
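The dilate-until-few-components loop can be sketched in plain Python. This is a toy version on a list-of-rows image of 0/1 values, using a 4-neighbor structuring element. The target of five components and the step cap are assumptions, and the function names are made up.

```python
from collections import deque

NEIGHBORS = ((1, 0), (-1, 0), (0, 1), (0, -1))

def dilate(img):
    """One step of binary dilation: every white pixel bleeds into its 4 neighbors."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            if img[y][x]:
                for dy, dx in NEIGHBORS:
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:
                        out[ny][nx] = 1
    return out

def count_components(img):
    """Count 4-connected groups of white pixels with a breadth-first search."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if img[y][x] and not seen[y][x]:
                count += 1
                seen[y][x] = True
                queue = deque([(y, x)])
                while queue:
                    cy, cx = queue.popleft()
                    for dy, dx in NEIGHBORS:
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count

def dilate_until(img, max_components=5, max_steps=50):
    """Dilate repeatedly until at most max_components remain."""
    steps = 0
    while count_components(img) > max_components and steps < max_steps:
        img = dilate(img)
        steps += 1
    return img, steps
```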

By including some of these components and rejecting others, we can form good candidate crops. Now we’ve got a subset sum problem: which subset of components produces a crop which maximizes the F1 score?

There are 2^N possible subsets to examine. In practice, though, I found that a greedy approach worked well: order the components by the number of white pixels they contain (in the original image). Keep adding components while it increases the F1 score. When nothing improves the score, you’re done!
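Here's a simplified sketch of the greedy search, using the precision and recall definitions from above. It assumes each component is described by a bounding box (x1, y1, x2, y2) and a white-pixel count, makes a single pass in sorted order, and ignores white pixels from other components that happen to fall inside the growing crop. The names are made up for illustration.

```python
def union(a, b):
    """Smallest box containing both boxes."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def area(box):
    x1, y1, x2, y2 = box
    return (x2 - x1) * (y2 - y1)

def f1_score(crop, covered, total_white, image_area):
    recall = covered / total_white             # white pixels inside the crop
    precision = 1 - area(crop) / image_area    # fraction of the image left out
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def greedy_crop(components, total_white, image_area):
    """components: list of (box, white_pixel_count) pairs."""
    ordered = sorted(components, key=lambda c: -c[1])
    crop, covered = ordered[0]
    best = f1_score(crop, covered, total_white, image_area)
    for box, count in ordered[1:]:
        new_crop = union(crop, box)
        new_covered = covered + count
        score = f1_score(new_crop, new_covered, total_white, image_area)
        if score > best:  # accept only components that help
            crop, covered, best = new_crop, new_covered, score
    return crop, best
```

A far-away speck adds a little recall but forces a much bigger crop, so its F1 score drops and it gets rejected.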

Here’s what that procedure produces for this image:

The components are ordered as described above. Component #1 contains the most white pixels in the original image. The first four components are accepted and the fifth is rejected because it hurts the F1 score:

Accept #1, F1 Score → 0.886
Accept #2, F1 Score → 0.931
Accept #3, F1 Score → 0.949
Accept #4, F1 Score → 0.959
Reject #5 (F1 Score → 0.888)

Applying this crop to the original image, you get this:

That’s 875x233, whereas the original was 1328x2048. That’s a 92.5% decrease in the number of pixels, with no loss of text! This will help any OCR tool focus on what’s important, rather than the noise. It will also make OCR run faster, since it can work with smaller images.

This procedure worked well for my particular application. Depending on how you count, I’d estimate that it gets a perfect crop on about 98% of the images, and its errors are all relatively minor.

If you want to try using this procedure to crop your own images, you can find the source code here. You’ll need to install OpenCV, numpy and PIL to make it work.

I tried several other approaches which didn’t work as well. Here are some highlights:

I ran the image through Tesseract to find areas which contained letters. These should be the areas that we crop to! But this is a bit of a chicken and the egg problem. For some images, Tesseract misses the text completely. Cropping fixes the problem. But we were trying to find a crop in the first place!
I tried running the images through unpaper first, to remove noise and borders. But this only worked some of the time and I found unpaper’s interface to be quite opaque and hard to tweak.
I ran Canny, then calculated row and column sums to optimize the x- & y-coordinates of the crop independently. The text regions did show up clearly in charts of the row sums:
The four spikes are the tops and bottoms of the two borders. The broad elevated region in the middle is the text. Making this more precise turned out to be hard. You lose a lot of structure when you collapse a dimension—this problem turned out to be easier to solve as a single 2D problem than as two 1D problems.

In conclusion, I found this to be a surprisingly tricky problem, but I’m happy with the solution I worked out.

In the next post, I’ll talk about my experience running OCR tools over these cropped images.

Choosing an iOS Podcasting App

2014-12-29T00:00:00+00:00

I recently switched back to iOS after a few years using Android. One aspect of Android that always bothered me was that I had trouble finding a great Podcasting app. I’d happily used Instacast on iOS, but I couldn’t find anything quite like it for Android.

I eventually settled on Podcast Addict. It epitomizes a stereotype of Android apps: tons of features, gajillions of options and a UI that was clearly designed by an engineer.

After switching back to iOS, I had a surprisingly hard time finding a Podcasting app which could do everything I wanted. For me, the hard requirements were:

A view of podcasts ordered by most-recently updated
Stream by default—don’t download episodes unless I ask!
An option to disable play-through.

In the last few years, Instacast received a major update. This introduced a new player interface which doesn’t work on the iOS lock screen. As a result, you can’t use the play/pause/volume features on your earbuds!

It also lacks the ability to sort podcasts by most-recently-updated. I subscribe to a mix of podcasts that update frequently and infrequently, so I find this to be a far more useful view than a complete list of all episodes. In particular, I don’t want an hourly podcast to crowd out less frequently updated shows.

So I set out to find a new Podcasting app. I really wish the iOS App Store had a try-before-you-buy option. Order-by-date is a fairly obscure feature, and it was impossible to determine if an app supported it without buying it.

Since I spent about $10 trying out apps and deleting them, here’s my guide to spare you that chore!

Instacast

As I mentioned, it’s missing the order by most recently updated feature and the ability to control playback via the lock screen.

Overcast

Overcast is the best looking of the bunch. I appreciate its freemium model and functional shoutouts to alternative apps—they made moving my list of feeds between apps painless.

That being said, I had a few problems with Overcast:

It doesn’t support streaming
Disabling play-through is a paid feature (which is fine, but it’s worth noting!)
It makes a distinction between all episodes and unplayed episodes for each Podcast which I don’t find to be very helpful. I’m not a completist—I tend to pick and choose episodes.
It’s very aggressive about downloading new episodes!

Podcasts (built-in app)

Podcasts is a barebones app for listening to Podcasts. I couldn’t tell if it supported streaming (I don’t think it does) and it definitely doesn’t support showing your podcasts ordered by when they were last updated. Next!

PodCruncher

This app came up when I searched for Downcast on the App Store and looked promising. Unfortunately, it hasn’t been updated for the new “flat” look in iOS 7 and it doesn’t support the larger screen on the iPhone 6. It also lacks the most-recently-updated view.

Downcast

Finally I found Downcast, which meets all my Podcasting needs. It lets you order your podcasts by most-recently updated, though the option is tricky to find (Edit→Sort→Publication Date):

And there’s an option to disable play-through on the player screen, if you can recognize it:

Next on my list to try were Pocket Casts and Castro, but I stopped when I found that I was happy with Downcast.

After all this, I’m coming to appreciate that everyone has different ways of using their podcasting app, which makes designing them hard. I don’t care at all about Smart Playlists (which many apps emphasize) and I care about slightly obscure features like ordering my podcasts by when they were updated. This is a case of dangling by a trivial feature.

JavaScript String slice, substr, substring: which to use?

2014-11-17T00:00:00+00:00

I recently read Douglas Crockford’s JavaScript: The Good Parts. It’s a classic (published in 2008) which is credited with reviving respect for JavaScript as a programming language. Given its title, it’s also famously short.

One very specific thing it cleared up for me was what to do with all of JavaScript’s various substring methods:

String.prototype.substr(start[, length])
String.prototype.substring(start[, stop])

One method takes an offset and a length. The other takes two offsets. The names don’t reflect this distinction in any way, so they’re impossible to remember. I resorted to writing an Alfred Snippet with the syntaxes, to quickly look it up (since I always had to).

They also have some other differences in behavior: substring doesn’t allow its offset to be negative, but substr does. This is arbitrary. It’s impossible to remember because there’s no rhyme or reason to it.

Crockford cleared this all up. The solution is to never use either method!

Instead, you should use the slice method:

String.prototype.slice(start[, stop])

This takes two offsets. Either can be negative. And it’s the exact same as the corresponding Array slice method.

So: stop using substr and substring. Use slice!

GitHub integration and Image Diff improvements headline webdiff 0.8

2014-11-07T00:00:00+00:00

I’ve released webdiff 0.8.0, which you can install via:

pip install --upgrade webdiff

The most interesting new features are GitHub pull request integration and expanded image diffing modes.

You can view a GitHub Pull Request in webdiff by running something like:

webdiff https://github.com/hammerlab/cycledash/pull/175

Any GitHub Pull Request URL will do. This will pull down the files from GitHub to local disk and then diff them in the standard webdiff UI. My main use case for this is looking at screenshot diffs and thinking “I want to see bigger images in this PR diff”. Speaking of which…

This version includes a few improvements to the image diff mode:

A “shrink to fit” option, which is enabled by default. This shrinks large images to fit in your browser window.
Consistent use of red/green borders for before/after images.
“Onion Skin” diff mode, which fades one image into the other.
“Swipe” diff mode, which lets you drag a dividing line between the images.

These are based on GitHub’s image view modes. I still find “blink” to be the most helpful for spotting small changes, but now you’ve got choices!

There were a few smaller changes as well. Full release notes are on PyPI.

Life after Google, Six Months In

2014-10-31T00:00:00+00:00

It’s been almost exactly six months since I ended an eight year run at Google. One of the biggest reasons to do this was to come back up to speed with the open source ecosystem and to experience a different working environment (sample size 1→2!).

When I joined Google, it seemed like our tech stack was years ahead of anything else. This included tools like BUILD files for managing dependencies and compilation, borg for running jobs in a data center, and closure compiler for dealing with JavaScript’s idiosyncrasies.

The open source world has come a long way since 2006. It’s solved many of the same problems that Google did, but in different ways. There’s the hadoop stack for distributed work. And there are polyfills and CommonJS for JavaScript. These aren’t necessarily better or worse than Google’s solutions, but they are different. And they’re the tools that new developers are learning.

Long term, this is a big problem for Google. “Ahead of the field” can rapidly turn into “Not Invented Here”. “Better” can become merely “different”.

A simple example of this is goog.bind. It works around some confusing behavior involving this in JavaScript. It was a great tool in 2004. But modern JavaScript has its own solution (Function.prototype.bind). It’s been around for years and is supported in 90+% of browsers. Google will still be writing goog.bind long after IE8 ceases to be relevant.

Facebook has a better solution to this: they always use the latest version of the JavaScript standard, and then transpile to something that older browsers will understand. This way they stay on the mainstream of technological development.

Here are a few other things I’ve found notable:

Almost any Google technology has an open-source equivalent.
Examples include Travis-CI (similar to TAP), the Hadoop stack (MapReduce, CNS, Dremel, …), CommonJS (Closure modules).
Package managers
For most Google engineers, something like 95+% of the code you work with is first-party (i.e. written at Google). For a smaller group, this ratio is going to be dramatically different. As a result, third party package managers are much more important. And the good news is that they’ve gotten much better over the last eight years! Leaving Google has finally forced me to learn how to use tools like NPM and pip. I’ve found this incredibly empowering. I used to avoid external dependencies for personal projects. I suspect many Google engineers do the same.
Markdown is more pervasive than I’d realized
I was vaguely aware of Markdown while I was at Google, but didn’t really see the point. Now that I’m out, I’ve been surprised to see how pervasive it is. I suspect that much of this comes from GitHub, where you use Markdown for READMEs, issues and code review comments. You also use it with GitHub pages, so I’m typing in it right now! It’s worthwhile to learn Markdown a bit better. It’s like HTML with infinitely less boilerplate.
Knowledge of git
It’s completely reasonable for someone with ten years of experience at Google to never have run git. I’m happy I was involved enough with open source projects that I have years of basic experience with it. But others have clearly gone much deeper. My git skills have gotten much better since leaving Google. I know how to use git rebase now!
sysadmin knowledge
borg meant that I never had to learn any systems administration. It’s a particular blind spot for me. For example, upstart was released over eight years ago, just after I joined Google. It’s incredibly widely used. I’d never heard of it.
Google uses a lot of email
We use almost no email in my new group. HipChat takes the place of a group mailing list and your coworkers @mention you when they want you to chime in. The chat format keeps things short. I still get emails for code reviews, but I could imagine this going through GitHub notifications instead.
Buying your own lunch isn’t that big a deal.
As it turns out, there are people in NYC who will make you food in exchange for money, or even deliver it! The variety is much greater than what you get at a Google Cafe, and it’s nice to have a good reason to go outside during the day. I also like the more flexible eating schedule. Almost everyone in my office eats lunch very late, perhaps at 3 or 4.

Fully Migrated to GitHub Pages

2014-10-23T00:00:00+00:00

My danvk.org site is now fully hosted on GitHub pages. I changed the DNS entry last night.

My hope was to do this without breaking anything. That didn’t prove to be possible, but I came close. And overall, the process wasn’t too bad! It was helpful to make a census of material that was on my old site using access logs. This turned up a few redirects I wouldn’t have thought of, and also reminded me of the many features my old site accumulated over the years. Some of these are now accessible under the “Features” menu on the new site.

There were a few pain points:

Redirects

It would have been enormously helpful if GitHub pages supported something like mod_rewrite. As it was I had to kill a few old links because I was completely unable to generate 301/302 redirects. I wound up hard-coding JavaScript redirects instead. It’s not ideal, and I’ll probably lose some pagerank, but it’s the best I could do.
Migrating the domain

I really hate DNS. It’s impossible to know whether your site isn’t working because you’ve misconfigured your DNS, or because the new records haven’t propagated out yet. I’m surprised more sites don’t go down because of DNS problems. The Global DNS Propagation Tracker was indispensable as a sanity check.

One thing that worked really well was migrating my old WordPress blog. I’d expected this to be a complete pain, but it was nothing of the sort. I used httrack to mirror the rendered version of my blog site to a folder on local disk. Then I checked that folder into GitHub. Done.

Finally, I learned a lot about Jekyll from this process. It really is a static content generator. It’s not a serving system. This is why it can’t do things like 301/302 redirects. GitHub pages would be a much more powerful serving system if it included support for things like mod_rewrite. But then it would be less Jekyll-y. The beauty of the system is that it’s pure static content, and hence insanely fast and simple.

Filtering JSON with pyjsonselect and jss

2014-10-13T00:00:00+00:00

The data store for Comparea is a giant 23MB GeoJSON file. Most of the space in that file is taken up by the giant lists of coordinates which define the boundaries of each shape. But there’s also some interesting metadata hidden amongst all those latitudes and longitudes:

{
  "features": [
    {
      "geometry": {
        "type": "Polygon",
        "coordinates": [
          [
            [
              -69.89912109375001,
              12.452001953124963
            ],
            [
              -69.89570312500004,
              12.422998046875009
            ],
            ...
          ]
        ]
      },
      "type": "Feature",
      "properties": {
        "description": "Aruba is an island.",
        "wikipedia_url": "http://en.wikipedia.org/wiki/Aruba",
        "area_km2": 154.67007756254557,
        "population": 103065,
        "population_year": "???",
        "name": "Aruba"
      },
      "id": "ABW"
    }
  ]
}

I’d hoped that I could use jq to filter out all the coordinates and just look at the metadata. But I got bogged down reading through its extensive manual. At the end of the day, I didn’t want to learn an ad-hoc language just for filtering JSON files.

The maddening thing was that there’s already a great language for selecting elements in trees: CSS Selectors! I did some searching and learned that there’s already a standard for applying CSS-like selectors to JSON called JSONSelect. It dates from 2011. It has a spec and conformance tests, and it’s been implemented in a number of languages.

So I picked my language of choice (Python) and began implementing a new command line tool for filtering JSON files.

The first issue I ran into: the standard Python implementation didn’t conform to the standard! It only implemented 2/3 levels of CSS selectors from the spec, and many of the interesting selectors are in level 3.

The reference JavaScript implementation was only 572 lines of code and, with all those tests, I figured it wouldn’t be too hard to port it directly to Python. This was a fun project—there’s something very zen about coding against a spec, getting test after test to pass. I learned about a few nuances of JavaScript and Python by doing this:

Their regular expressions differ in how they specify unicode ranges
The reference implementation made use of the null vs. undefined distinction
JavaScript’s typeof operator is quite odd
JavaScript’s Array.prototype.concat method is quite subtle in its behavior

I wound up re-implementing all of these quirks in Python.

At the end of the day, I published pyjsonselect, the first fully-conformant JSONSelect implementation in Python. A small win for the open source world!

jss

So, how does the tool work? You can read about installation and basic usage on GitHub, but here are a few motivating examples.

jss is a JSON→JSON converter. It supports three modes:

select: find all the values that match a selector (1→N)
filter out (-v): remove all values which match a selector (1→1)
filter in (-k): keep only values which match a selector (1→1)

Here’s how the filter out mode works:

$ jss -v '.coordinates' comparea.geo.json

{
  "features": [
    {
      "geometry": {
        "type": "Polygon"
      },
      "type": "Feature",
      "properties": {
        "description": "Aruba is an island.",
        "wikipedia_url": "http://en.wikipedia.org/wiki/Aruba",
        "area_km2": 154.67007756254557,
        "population": 103065,
        "population_year": "???",
        "name": "Aruba"
      },
      "id": "ABW"
    },
    ...
  ]
}

That knocked out all the coordinates keys from the GeoJSON file!
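For intuition, the effect of this particular command can be approximated in a few lines of plain Python. This is only a sketch of the idea for a single key name, not how jss is implemented; drop_key is a hypothetical helper and the sample document is made up to resemble the output above.

```python
import json


def drop_key(value, key):
    """Return a copy of a parsed JSON value with every `key` entry removed."""
    if isinstance(value, dict):
        return {k: drop_key(v, key) for k, v in value.items() if k != key}
    if isinstance(value, list):
        return [drop_key(item, key) for item in value]
    return value  # strings, numbers, booleans and null pass through unchanged


# Made-up miniature of a GeoJSON file (assumption, for illustration only).
doc = {
    "features": [
        {
            "geometry": {"type": "Polygon", "coordinates": [[[0, 0], [1, 1]]]},
            "type": "Feature",
            "properties": {"name": "Aruba"},
            "id": "ABW",
        }
    ]
}

print(json.dumps(drop_key(doc, "coordinates"), indent=2))
```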

I eventually did figure out how to do this in jq. Here’s what it looks like:

$ jq 'del(..|.coordinates?| select(. != []))' comparea.geo.json
(same output)

To come up with that incantation, I had to dig through jq’s github issues. It’s certainly not something I could re-type from memory! The jss version is clear as could be.

It’s also significantly faster. For the 23MB comparea.geo.json file, the jss command runs in 1.7s on my laptop vs. 12.9s for jq. The trick to this speed is appropriate pruning of the selector search.

Here’s how the “select” mode works:

$ jss '.name' comparea.geo.json

"Aruba"
"Afghanistan"
"Angola"
"Anguilla"
"Albania"
"Andorra"
"United Arab Emirates"
"Argentina"
"Armenia"
"American Samoa"
...

Unlike “filter out”, which maps one JSON object to another JSON object, “select” extracts multiple values from a single object. Each line of output is its own JSON value. This is why it’s 1→N, vs 1→1 for the other modes. It’s useful if you want to do more processing using grep, sed and other familiar line-oriented tools.
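The 1→N shape can be sketched in plain Python as a generator that walks the document and yields every value stored under a given key. As before, this is an illustration under assumptions rather than the jss implementation: select_key is a hypothetical helper and the sample document is made up.

```python
import json


def select_key(value, key):
    """Yield every value stored under `key`, in document order."""
    if isinstance(value, dict):
        for k, v in value.items():
            if k == key:
                yield v
            # Keep descending: matches may be nested inside other values.
            yield from select_key(v, key)
    elif isinstance(value, list):
        for item in value:
            yield from select_key(item, key)


# Made-up miniature of a GeoJSON file (assumption, for illustration only).
doc = {
    "features": [
        {"properties": {"name": "Aruba"}, "id": "ABW"},
        {"properties": {"name": "Afghanistan"}, "id": "AFG"},
    ]
}

# One JSON value per line, which is what makes the output grep-friendly.
for match in select_key(doc, "name"):
    print(json.dumps(match))
```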

Fancy selectors

You can specify as many operations as you like. Here’s a more complex invocation:

$ jss -v .coordinates -k '.features>*:has(:contains("ZAF"))' comparea.geo.json

{
  "features": [
    {
      "geometry": {
        "type": "Polygon"
      },
      "type": "Feature",
      "id": "ZAX",
      "properties": {
        "description": "South Africa, officially the Republic of South Africa, is a country located at the southern tip of Africa. It has 2,798 kilometres of coastline that stretches along the South Atlantic and Indian oceans.",
        "population_source": "World Factbook",
        "sov_a3": "ZAF",
        "freebase_mid": "/m/0hzlz",
        "name": "South Africa",
        "population_source_url": "https://www.cia.gov/library/publications/the-world-factbook/fields/2119.html",
        "area_km2_source_url": "https://www.cia.gov/library/publications/the-world-factbook/fields/2147.html",
        "population_date": "July 2014",
        "wikipedia_url": "http://en.wikipedia.org/wiki/South_Africa",
        "area_km2": 1214470,
        "area_km2_source": "World Factbook",
        "population": 48375645
      }
    }
  ]
}

After filtering out the coordinates fields, it keeps only elements directly under the features key (i.e. top-level features) that contain “ZAF” somewhere (the “sov_a3” field, in this case).
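Spelled out in plain Python, the two steps look roughly like the sketch below. It is an approximation under assumptions, not jss itself: the helper names are hypothetical, the sample document is made up, and the result keeps only the features key, mirroring the output shown above.

```python
def drop_key(value, key):
    """Return a copy of a parsed JSON value with every `key` entry removed."""
    if isinstance(value, dict):
        return {k: drop_key(v, key) for k, v in value.items() if k != key}
    if isinstance(value, list):
        return [drop_key(item, key) for item in value]
    return value


def contains_string(value, needle):
    """True if any string value nested inside `value` contains `needle`."""
    if isinstance(value, str):
        return needle in value
    if isinstance(value, dict):
        return any(contains_string(v, needle) for v in value.values())
    if isinstance(value, list):
        return any(contains_string(item, needle) for item in value)
    return False


def keep_features_containing(doc, needle):
    """Keep only the direct children of "features" that contain `needle`."""
    return {"features": [f for f in doc["features"] if contains_string(f, needle)]}


# Made-up miniature of a GeoJSON file (assumption, for illustration only).
doc = {
    "features": [
        {"geometry": {"type": "Polygon", "coordinates": [[0, 0]]},
         "properties": {"sov_a3": "ABW", "name": "Aruba"}},
        {"geometry": {"type": "Polygon", "coordinates": [[1, 1]]},
         "properties": {"sov_a3": "ZAF", "name": "South Africa"}},
    ]
}

# Same order as the command line: filter out first, then filter in.
result = keep_features_containing(drop_key(doc, "coordinates"), "ZAF")
print(result)
```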

Isn’t this just as complicated as the jq syntax? Sure! But at least you learned something useful. If you get better at writing CSS selectors as a result of filtering JSON files, then that’s great! You’ve become a better web developer in the process.

You can install jss with pip. Read more on github!

Facebook (non-)Insights

2014-10-13T00:00:00+00:00

There was a spike of traffic to Comparea over the weekend:

Awesome! All of it came from Facebook and went to a comparison of France vs Australia:

I can easily get insight into who tweeted it. But Facebook is a big black box. In theory, I can use Facebook Insights to track this. It claims that 50 actions have led to ~2400 visits to my site, but declines to say anything more:

I understand that Facebook is more private than Twitter, but this is frustrating. I’d hope to at least see which country the shares are happening in. Are these French people coming to Comparea? Australians? Facebook won’t even reveal that all the recent visits are to a single URL! I’d think that thousands of clicks would be enough to provide anonymity there.

I’m not sure exactly what sorts of insights Facebook can share without revealing identities, but I’d very much hoped for more than this.

Has anyone had positive experiences with Facebook Insights?

Trying out GitHub Pages

2014-10-01T00:00:00+00:00

I’m going to try hosting my site and blog on GitHub Pages. My hope is that blogging using GitHub and Markdown will lower the barrier to writing, and that GitHub Pages will eliminate any worries about the performance and security of hosting my own site.

This is all very much a work in progress, so feedback is welcome!