Stepping back a little bit, what’s the status of my Boggle efforts?
The overall goal is to find the highest-scoring Boggle board and prove that there are no better boards out there.
I’ve solved this problem for 3×3 Boggle boards. Here’s the best board:
P  E  R
L  A  T
D  E  S
Its longest words include REPLATED and REPASTED. You can see all the words here. This board has more points on it than any other board (using the Enable2K word list) and I’ve proved it.
The 4×4 problem is much harder, though, because there are eight billion times more boards to consider! I believe that this board is the best one:
P  E  R  S
L  A  T  G
S  I  N  E
T  E  R  S
You can see a list of all the words on it here. The longest word is REPLASTERING. I haven’t been able to prove it, though, and I haven’t put any serious time into the problem since 2009.
You can find all the Boggle-related code I’ve written on GitHub and learn all about the algorithms and techniques by browsing posts in the Boggle category on this blog.
I built the online solver back in 2006, mostly just for fun. In the six years since, I’ve wound up using it almost exclusively on my phone during real-life Boggle games. The iPhone didn’t exist when I built that page, and its fancy design (with hover effects and fixed-position elements) is ill-suited to that device. Minimal designs tend to work better on mobile, and I’ve just pushed out a new version for phones which is very minimal indeed.
The code behind this Boggle solver differs from my other Boggle code in that it’s designed for low latency rather than high throughput. That’s precisely what you want from a web solver: you’re only finding the words on a single board, so time spent loading the word list dominates time spent finding words on the board. The current version uses a custom dictionary format which can be mmapped. This winds up being about 3x faster than reading a dictionary text file and building a Trie on every invocation.
I haven’t worked on Boggle in the past three years, so the state of the art in finding the highest-scoring boards is still where it was when I wrote my last blog post about breaking 3×3 Boggle. The interesting development then was that you could get something like a 1000x speedup by upper-bounding whole classes of boards rather than scoring each board individually. This is a big enough speed boost to make 3×3 Boggle “solvable” in a day on a desktop computer, but not enough to make 3×4 or 4×4 solvable, even on a big cluster of computers. I was optimistic that memoization could yield some more big wins, but when those failed to materialize I got frustrated and stopped working on the project.
There are two main shortcomings to the max/no mark upper bound:
These can both be fixed, but only at great computational cost. Remember the trade-off between the “tightness” of a bound and the difficulty of computing it?
Our first example:
t  i  . 
{a,e}  .  . 
r  .  . 
Let’s look at the search tree starting with the “t” cell. There are three possible words: “tar”, “tie” and “tier”. Here’s the search tree that max/no mark generates:
(I’m drawing the trees slightly differently than I did in the last post, to make the branching more explicit. Branches inside dotted lines come from multiple possibilities on a cell. Other branches result from the ordinary process of looking in different directions from a cell.)
In this example, max/no mark chooses the “a” when it gets to the “{a,e}” cell directly from the “t”. This makes sense, since it results in one point to the e’s zero. However, when it gets to the “{a,e}” cell through the “ti”, it chooses the “e”. This also makes sense, since it gets two points from the “e” (“tie” and “tier”) vs. zero from the “tia” prefix. By the time it makes this choice, it has no memory of choosing the “a” the last time it encountered this cell. If it did, since the “tie” prefix results in two points to the “ta” prefix’s one, it would make sense to go back and change that choice. But remembering this would require lots of expensive bookkeeping. It’s faster (but yields a looser bound) if we just accept the possibility that we’ll make different choices.
Example number 2 of max/no mark’s failings:
t  h  . 
h  e  . 
.  .  . 
The problem in this case is that max/no mark doesn’t remember which words it’s already found. At first glance, it seems like this problem would be easy to remedy. And it certainly would on this board, in which each cell only has a single possible letter. But if you think about a board with lots of choices, you’ll start to see why it’s best to just give up and accept a looser bound.
]]>Here’s a class of boards:
f  {a,u}  r 
.  .  e 
.  .  . 
The dots mean “no letters here”. The class contains two different boards: one with an “a” and one with a “u”. The “a” board has two points’ worth of words on it, while the “u” board has only one. (We’re only looking at words starting with ‘f’ here.)
The diagrams show the solver starting with the ‘f’ on each board and exploring adjacent cells. When it finds a word, it scores it and passes the total score back up the depth-first search tree.
Here’s how the max/no mark bound sees that board class:
When it gets to the “{a,u}” cell, it tries both possible letters. The “a” tree brings back 2 points, whereas the “u” tree brings back 1 point. So it chooses “a” and counts two points. As it so happens, this is the score of the highestscoring board. The sum/union bound would have added the 1 and the 2, resulting in an upper bound of 3. The max/no mark bound takes advantage of the fact that this cell can only be one of two possibilities, not both.
Now what if we throw a few more letters on:
f  {a,u}  r 
z  .  e 
z  y  . 
With the new letters, there are more points coming from the u:
The two points going through the ‘a’ are dropped on the floor. sum/union would have resulted in a bound of 4 + 2 = 6. When there are lots of letter choices on every cell, you can see why max/no mark is a much tighter bound.
It’s important to note that there are two sources of branching in these search trees: (1) being able to go in multiple directions from a cell (e.g. f→u→r or f→u→z) and (2) having multiple choices on a cell (e.g. f→a or f→u). The max/no mark bound sums the scores resulting from choices in case (1) and takes the max of choices in case (2). The sum/union bound takes the sum in both cases.
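To make the two rules concrete, here’s a small sketch (my own Python, not the project’s C++) that computes the max/no mark bound for the first example board above. The trie holds just “tar”, “tie” and “tier”, each worth one point:

```python
# Cells are lists of possible letters; None marks an empty cell.
GRID = [["t"],      ["i"], None,
        ["a", "e"], None,  None,
        ["r"],      None,  None]
# Hand-built trie for {tar, tie, tier}; "$" marks a complete 1-point word.
TRIE = {"t": {"a": {"r": {"$": 1}},
              "i": {"e": {"$": 1, "r": {"$": 1}}}}}

def neighbors(i):
    x, y = i % 3, i // 3
    return [ny * 3 + nx
            for ny in range(max(0, y - 1), min(3, y + 2))
            for nx in range(max(0, x - 1), min(3, x + 2))
            if (nx, ny) != (x, y)]

def max_no_mark(i, node, visited):
    best = 0
    for letter in GRID[i]:            # rule (2): max over the cell's choices
        if letter not in node:
            continue
        child = node[letter]
        score = child.get("$", 0)     # points for any word completed here
        for n in neighbors(i):        # rule (1): sum over directions
            if GRID[n] is not None and n not in visited:
                score += max_no_mark(n, child, visited | {i})
        best = max(best, score)
    return best

bound = sum(max_no_mark(i, TRIE, set())
            for i in range(9) if GRID[i] is not None)
# bound == 3: the "a" choice contributes "tar" and the "e" choice
# contributes "tie"/"tier", even though no single board has all three.
```

This reproduces the inconsistency described above: the bound is 3 while the best real board in the class (the “e” board) only scores 2.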
So we have to come up with a technique that doesn’t involve looking at every single board. And I’ve come up with just such a method! This is the “exciting news” I alluded to in the last post.
Here’s the general technique:

1. Find a very high-scoring board (using, say, simulated annealing).
2. Divide the space of all possible boards into classes.
3. Calculate an upper bound on the score of the best board in each class.
4. If a class’s upper bound is lower than the score from step #1, every board in the class can be eliminated at once. Otherwise, subdivide the class and repeat.
Classes of Boards
By “class of boards”, I mean something like this:
{a,e,i,o,u}  {a,e,i,o,u}  r 
{b,c,d,f,g,h}  a  t 
d  e  {r,s,t,v} 
The squares that contain a set of letters can take on any of those letters. So this board is part of that class:
a  i  r 
d  a  t 
d  e  s 
189 points 
and so is this:
o  u  r 
f  a  t 
d  e  t 
114 points 
All told, there are 5 * 5 * 6 * 4 = 600 boards that are part of this class, each with its own score. Other fun classes of boards include “boards with only vowels” (1,953,125 members) and “boards with only consonants” (794,280,046,581 members).
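The counting is just a product over the cells — a quick sanity check:

```python
from math import prod

# Each cell of the example class above, as a string of its possible letters.
board_class = ["aeiou", "aeiou", "r",
               "bcdfgh", "a",    "t",
               "d",      "e",    "rstv"]
n_boards = prod(len(cell) for cell in board_class)   # 5 * 5 * 6 * 4 = 600
all_vowels = 5**9         # 1,953,125 boards of only vowels
all_consonants = 21**9    # 794,280,046,581 boards of only consonants
```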
Follow me past the fold for more…
Upper Bounds
Now on to step #3 of the general technique: calculating an upper bound. This is going to be easier if we introduce some mathematical notation:
b  =  a Boggle board
Score(b)  =  the sum of the scores of all the words contained on b
B  =  a class of boards, i.e. a set of boards {b1, b2, …}
Score(B)  =  the maximum of Score(b) over all boards b in B
An upper bound is a function UpperBound(B) such that UpperBound(B) ≥ Score(B), i.e. no board in the class B scores more than UpperBound(B).
There’s one really easy upper bound: UpperBound(B) = Score(B) itself! This just enumerates all the boards in the class B, scores each and takes the maximum score. It’s very expensive to compute for a large class of boards and hence not very practical. You and I both know that no board containing only consonants has any points on it. We don’t need to enumerate all 794 billion such boards to determine this.
With upper bounds, there’s a trade-off between how hard they are to compute and how “tight” they are, i.e. how closely they approximate Score(B). Score(B) itself is as tight as possible but hard to compute. At the other end of the spectrum, we know that all the words on a board are in the dictionary. So we could just sum up the scores of all the words in the dictionary and get a number, say 1,000,000. Then UpperBound(B) = 1,000,000 is an upper bound. It is very easy to compute, but not very tight.
The trick is to hit some sort of sweet spot that strikes a good balance between “tightness” and ease of computation. Over the rest of this blog post, I’ll present two upper bounds that do this. Upper bounds have the nice property that if f and g are both upper bounds, then min(f, g) is also an upper bound. So by finding two bounds, we’ll get a third that’s better than either one alone.
sum/union
The idea of this bound is to find all the words that can possibly occur in a class of boards. Since each word can only be found once, we can add the scores of all these words to get an upper bound.
To get the list of words, we use the same depth-first search strategy we used to find words on a single board. The wrinkle is that, when we encounter a cell with multiple possible letters, we have to do a separate depth-first search for each.
At first glance, it doesn’t seem like this would be tractable for a board class like this one (alternating vowels and consonants):
{a,e,i,o,u}  {bd,fh,jn,pt,vz}  {a,e,i,o,u} 
{bd,fh,jn,pt,vz}  {a,e,i,o,u}  {bd,fh,jn,pt,vz} 
{a,e,i,o,u}  {bd,fh,jn,pt,vz}  {a,e,i,o,u} 
In addition to the branching from going different directions on each square, there’s also a huge amount of branching from trying each letter on each square. But we’re saved by the same lesson we learned in boggle post #3: the dictionary is exceptionally effective at pruning thorny search trees. If we prune search trees like ‘bqu’ that don’t begin words, then there doesn’t wind up being that much work to do.
We can find all possible words on the above board class in just under 1 second. This is about 10,000 times slower than scoring a conventional board, but it’s certainly tractable. The resulting bound is 195,944. Given that no board scores higher than 545 points, this is a wild overestimate. But at least it’s a better bound than a million!
This technique does especially well on boards like this one, which contains all consonants:
{bd,fh,jn,pt,vz}  {bd,fh,jn,pt,vz}  {bd,fh,jn,pt,vz} 
{bd,fh,jn,pt,vz}  {bd,fh,jn,pt,vz}  {bd,fh,jn,pt,vz} 
{bd,fh,jn,pt,vz}  {bd,fh,jn,pt,vz}  {bd,fh,jn,pt,vz} 
This takes 0.23 seconds to score and results in a bound of 208 points (it contains words like ‘crypt’ and ‘gypsy’). We’ve already found a single board that has 545 points on it. So we can eliminate this entire class of 794 billion boards. That’s a speed of over 3 trillion boards/second! Of course, this board class is not typical.
It’s also worth pointing out why this upper bound isn’t tight. Consider this class of boards:
{a,i}  r  z 
f  z  z 
z  z  z 
You can find both “fir” and “far” on boards in this class, but no single board contains both. So while “fir” and “far” each contribute a point to the upper bound, they should really contribute only a single point. The sum/union bound doesn’t take into account the relationships between various letter choices. It’s the best trade-off between computability and “tightness” we’ve seen so far, but it’s not good enough to make the problem tractable.
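Here’s a small sketch of sum/union on that very class (my own Python, with a toy trie holding only the relevant f-words, “far” and “fir”): try every letter choice in the search, record each findable word once, then sum.

```python
GRID = [["a", "i"], ["r"], ["z"],
        ["f"],      ["z"], ["z"],
        ["z"],      ["z"], ["z"]]
# Toy trie: just "far" and "fir", one point each; "$" marks a complete word.
TRIE = {"f": {"a": {"r": {"$": 1}}, "i": {"r": {"$": 1}}}}

def neighbors(i):
    x, y = i % 3, i // 3
    return [ny * 3 + nx
            for ny in range(max(0, y - 1), min(3, y + 2))
            for nx in range(max(0, x - 1), min(3, x + 2))
            if (nx, ny) != (x, y)]

def find_words(i, node, visited, prefix, found):
    for letter in GRID[i]:           # try every choice on this cell
        if letter not in node:       # the dictionary prunes dead prefixes
            continue
        child = node[letter]
        if "$" in child:
            found[prefix + letter] = child["$"]   # each word counted once
        for n in neighbors(i):
            if n not in visited:
                find_words(n, child, visited | {i}, prefix + letter, found)

found = {}
for i in range(9):
    find_words(i, TRIE, set(), "", found)
sum_union = sum(found.values())
# sum_union == 2, even though no single board in the class scores more than 1.
```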
max/no mark
In the sum/union upper bound, we dealt with multiple possible letters on the same square by trying each and adding the resulting scores (taking care not to count any word twice). But why take the sum of all choices when we know that any given board can only take on one of the possibilities? It would result in a much better bound if we took the max of the scores resulting from each possible choice, rather than the sum. This is the idea behind the “max/no mark” bound.
This is a huge win over sum/union, especially when there are many squares containing many possible letters. It does have one major drawback, though. The sum/union bound took advantage of the fact that each word could only be found once. With the max/no mark bound, the bookkeeping for this becomes completely intractable. The words we find by making a choice on one square may affect the results of a choice somewhere else. We can’t make the choices independently. The optimal set of choices becomes an optimization problem in its own right.
Rather than deal with this, max/no mark just throws up its hands. This is what the “no mark” refers to. In the past, we’ve recorded the words we find by marking the Trie. By not marking the Trie with found words, we accept that we’ll sometimes double-count words. But it still winds up being an excellent upper bound.
Let’s try some of our previous examples:
{a,e,i,o,u}  {a,e,i,o,u}  r 
{b,c,d,f,g,h}  a  t 
d  e  {r,s,t,v} 
sum/union: 2880  
max/no mark: 1307 
Alternating vowels and consonants:
{a,e,i,o,u}  {bd,fh,jn,pt,vz}  {a,e,i,o,u} 
{bd,fh,jn,pt,vz}  {a,e,i,o,u}  {bd,fh,jn,pt,vz} 
{a,e,i,o,u}  {bd,fh,jn,pt,vz}  {a,e,i,o,u} 
sum/union: 195944  
max/no mark: 15692 
A class that can be entirely eliminated:
{b,d,f,g,j,k,m,p,v,w,x,z}  a  {s,y} 
{i,o,u}  y  a 
{s,y}  {c,h,l,n,r,t}  {c,h,l,n,r,t} 
sum/union: 2497  
max/no mark: 447 
max/no mark isn’t always better than sum/union:
{b,d}  a  {b,d} 
a  {b,d}  a 
{b,d}  a  {b,d} 
sum/union: 9  
max/no mark: 132 
This is something of a worstcase because, while there are relatively few distinct words, there are many different ways to find them.
Putting it all together
Our two bounds do well in different situations. max/no mark works best when there are lots of choices to be made on particular cells and there are relatively few ways to make any particular word. sum/union works best when there are lots of possibilities but relatively few distinct words. Putting them together results in a bound that’s good enough to find the best 3×3 boggle board using the technique described at the beginning of this post.
Given an initial class of boards, we wind up with what I call a “breaking tree”. If the initial class has an upper bound less than 545 points, then we’re done. Otherwise, we pick a cell to split and try each possibility.
Here’s a relatively small breaking tree that results from running this program:
$ ./3x3/ibucket_breaker --best_score 520 --break_class "bdfgjkmpvwxz a sy iou xyz aeiou sy chlnrt chlnrt"
( 0%) (0;1/1) bdfgjkmpvwxz a sy iou xyz aeiou sy chlnrt chlnrt (820, 77760 reps)
split cell 4 (xyz)
Will evaluate 3 more boards...
( 0%) (1;1/3) bdfgjkmpvwxz a sy iou x aeiou sy chlnrt chlnrt (475, 25920 reps)
(33.333%) (1;2/3) bdfgjkmpvwxz a sy iou y aeiou sy chlnrt chlnrt (703, 25920 reps)
split cell 5 (aeiou)
Will evaluate 5 more boards...
(33.333%) (2;1/5) bdfgjkmpvwxz a sy iou y a sy chlnrt chlnrt (447, 5184 reps)
( 40%) (2;2/5) bdfgjkmpvwxz a sy iou y e sy chlnrt chlnrt (524, 5184 reps)
split cell 3 (iou)
Will evaluate 3 more boards...
( 40%) (3;1/3) bdfgjkmpvwxz a sy i y e sy chlnrt chlnrt (346, 1728 reps)
(42.222%) (3;2/3) bdfgjkmpvwxz a sy o y e sy chlnrt chlnrt (431, 1728 reps)
(44.444%) (3;3/3) bdfgjkmpvwxz a sy u y e sy chlnrt chlnrt (339, 1728 reps)
(46.667%) (2;3/5) bdfgjkmpvwxz a sy iou y i sy chlnrt chlnrt (378, 5184 reps)
(53.333%) (2;4/5) bdfgjkmpvwxz a sy iou y o sy chlnrt chlnrt (423, 5184 reps)
( 60%) (2;5/5) bdfgjkmpvwxz a sy iou y u sy chlnrt chlnrt (318, 5184 reps)
(66.667%) (1;3/3) bdfgjkmpvwxz a sy iou z aeiou sy chlnrt chlnrt (509, 25920 reps)
The numbers in parentheses are the upper bounds. When they get below 520 (the parameter I set on the command line), a subclass is fully broken.
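The control loop is simple once the bound exists. Here’s a sketch of it in Python (bound_fn stands in for the combined min(sum/union, max/no mark) bound, and splitting the cell with the most letters is just one possible heuristic — the real breaker’s choices may differ):

```python
def break_class(cells, best_score, bound_fn):
    # cells: one string of possible letters per board position.
    if bound_fn(cells) <= best_score:
        return []                      # the whole class is eliminated at once
    if all(len(c) == 1 for c in cells):
        return [cells]                 # a single board the bound couldn't kill
    i = max(range(len(cells)), key=lambda j: len(cells[j]))  # split widest cell
    survivors = []
    for letter in cells[i]:            # try each possibility for that cell
        survivors += break_class(cells[:i] + [letter] + cells[i + 1:],
                                 best_score, bound_fn)
    return survivors
```

Surviving singleton classes would then be scored exactly with the ordinary solver. For example, with a toy bound that only depends on the first cell:

```python
def toy_bound(cells):
    return 3 if "a" in cells[0] else 1
```

break_class(["ab", "c"], 2, toy_bound) splits the first cell, prunes the “b” branch (bound 1 ≤ 2) and keeps only the board ["a", "c"].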
Using this technique and the following partition of the 26 letters:
I was able to go through all 262,144 (=4^9) board classes in about six hours on a single machine. This resulted in the boards I listed in the last post. Six hours is a big improvement over two years!
If that same factor (two years to six hours) held for the 4×4 case, then we’d be down to 380 years of compute time to find the best 4×4 boggle board. Or, equivalently, 138 days on 1000 machines. That’s still a lot. We’re not quite there yet, but we’re getting closer!
Code for the program that went through all possible board classes can be found here. While many people have found highscoring boards, I haven’t found any previous work on this upper bounding approach. So if you have any ideas/suggestions on how to improve the bound, they’re probably novel and useful!
]]>Last we spoke about Boggle, we used Simulated Annealing to get at the question “what is the highestscoring Boggle board?”
We found a board with 3625 points on it this way. It would have been nice to say that it was the best of all possible boards, but that would have been too rash. While it is a very good board and I have never seen a higherscoring one, that doesn’t mean there isn’t one out there. Maybe I’ve just been looking in the wrong places.
To prove that the 3625-pointer is the best board, we’d need to show that every other board scores fewer points. In a previous post, I estimated that there are 2^69 possible Boggle boards. At 10,000 boards/second, checking them all would take 1.9 billion years of compute time!
When you run up against a computationally intractable problem, there are a few standard ways to deal with it:

1. Give up.
2. Find a fundamentally better algorithm.
3. Solve a simpler version of the problem.

Option 1 is the easiest. Option 2 is the hardest. And option 3 is the compromise we’ll be taking for the next few danvk.org boggle posts. Our simpler problem today: 3×3 boggle.
If we drop seven letters (4×4 – 3×3 = 7), we’re left with only 26^9 / 8 ≅ 2^39 boards to consider. If we can achieve 10,000 boards/sec, this would be just over two years’ worth of computation. That’s still a lot, but it’s much more reasonable than 1.9 billion years!
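The arithmetic checks out:

```python
# 3x3 boards, modulo the eight-fold symmetry, checked at 10,000 boards/sec.
boards = 26**9 // 8                      # ~6.8e11, roughly 2^39
years = boards / 10_000 / (365 * 24 * 3600)
# years comes out around 2.15 -- "just over two years" of computation
```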
I believe I have solved the 3×3 boggle problem using an approach that is only slightly more clever than this. I used simulated annealing to find very highscoring 3×3 boards. This was the best one I found, along with its 4×4 best board buddy:
p  e  r
l  a  t
d  e  s
545 points

p  e  r  s
l  a  t  g
s  i  n  e
t  e  r  s
3625 points
It’s worth noting that the optimal 3×3 board has a 2×3 region in common with the 3625-point 4×4 board.
In the next post we’ll talk about how I showed that this board was higherscoring than all 2^39 others in significantly less than two years (it took one day). As a teaser, I’ve included all the 3×3 boards with more than 500 points worth of words below the fold.
I’ve made the boards fit on a single line by reading them like a book: left to right, then top to bottom.
Board  Score 

perlatdes  545 
pelsartes  542 
delratpes  540 
lepsartes  536 
tepsarles  528 
selratpes  528 
legsartes  527 
berlatdes  526 
relsatpes  524 
perlatces  523 
derlatpes  522 
telsarpes  520 
parletdes  520 
lersatpes  520 
saltipres  518 
tepsarled  514 
tegsarles  514 
tedsalrep  514 
repsalted  514 
ledsartep  514 
pelsatres  513 
tegsarlep  511 
lepsarteg  511 
tapselres  510 
serlitpas  508 
lecsartep  508 
tedsarleg  507 
persatles  507 
legsarted  507 
tepsarleg  505 
selratdes  505 
legsartep  505 
canretdes  505 
pinlatres  504 
nirgatres  504 
lecsartes  504 
serlatpes  503 
letrasdep  503 
derlitpas  503 
derlatbes  502 
cerlatpes  502 
retgasnil  501 
nilgasret  501 
metparles  501 
merpatles  501 
tedsarlep  500 
tapselred  500 
redseltap  500 
pecsartes  500 
lepsarted  500 
As always, this is using the Enable2K word list. Code can be found here.
In case you’d forgotten, we’ve developed a blazing-fast Boggle solver capable of scoring 10,000 boards a second. What to do with all that power? Other than some interesting analyses, the most interesting question is:
What is the highestscoring Boggle Board?
In this post, we’ll try to answer that question using Simulated Annealing. Here’s a sneak peek at one of the exceptionally word-rich boards we’ll find:
p  e  r  s 
l  a  t  g 
s  i  n  e 
t  e  r  s 
3625 points 
Follow me past the fold for more…
Here’s the basic idea of Simulated Annealing:

1. Start with a random board.
2. Score it.
3. Make a small change to the board.
4. Score the changed board.
5. Decide whether to keep the change or revert it.
6. Go back to step #3 and repeat.
If we make good decisions in step #5, the boards we keep will be progressively higher scoring.
The big decisions come in steps #3 and #5. What does it mean to change a board?
I decided to use two basic permutations to change the board: rerolling a single die (replacing the letter on one cell), and swapping two dice. (The post originally showed a before-and-after example of each; clearly one of the changes was beneficial and the other was not.)
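The two permutations can be sketched like so (a simplification: here a “reroll” can produce any letter, whereas the real code would restrict it to the faces of the die on that cell):

```python
import random
import string

def mutate(board):
    """board: a list of 16 lowercase letters; returns a changed copy."""
    b = board[:]
    if random.random() < 0.5:
        # "Reroll a die": replace one cell's letter. (Any letter here;
        # the actual dice faces are an assumption this sketch skips.)
        b[random.randrange(16)] = random.choice(string.ascii_lowercase)
    else:
        # "Swap two dice": exchange the letters on two distinct cells.
        i, j = random.sample(range(16), 2)
        b[i], b[j] = b[j], b[i]
    return b
```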
Generally speaking, we should accept the changes that increase the score in step #5 and reject the changes that don’t. This is a perfectly reasonable strategy. It’s called Gradient Descent and is widely used.
With Simulated Annealing, we sometimes go with the lower-scoring board. The idea is to avoid getting stuck in a rut. Maybe our board is pretty good. It scores higher than all the boards around it. But there are lots of boards a little more distant that are even higher-scoring. With Simulated Annealing, we can take a short-term loss for a long-term gain.
The exact details aren’t too interesting. If they interest you, take a look at my code.
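The standard way to do the accept/reject step is the Metropolis rule — I’m assuming my code does something like this (the exact schedule and constants are in the repository):

```python
import math
import random

def accept(old_score, new_score, temperature):
    # Always keep an improvement; sometimes keep a worse board, with a
    # probability that shrinks as the score gap widens or the temperature
    # cools. At high temperature almost anything is accepted; near zero
    # this degenerates into plain hill climbing.
    if new_score >= old_score:
        return True
    return random.random() < math.exp((new_score - old_score) / temperature)
```

The temperature is lowered gradually over the run, so early generations explore freely and late generations only polish.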
Now for the fun part! Here are the results of a sample run. I’ve concatenated the letters on each row of the board so that they’ll fit on a single line.
Gen.  Score  Reign  Board 

0  19  1  kqiqwhyyjthdsibk 
1  4  1  iqiqjhyywthdskbk 
2  5  1  ibiqjhyywthdskqk 
3  9  1  ibiqjhcywthdskqk 
4  10  1  ibiqjychsthdwkqk 
5  8  1  ibicjyqhsthdwkqk 
6  9  1  ibicjyqdsthywkqk 
7  23  1  ibicjygdhtsywkqk 
8  16  1  ibncjygdhtsywkqa 
9  15  1  ibncyygdhtsymkqa 
10  15  1  ibncyygwhtsytkda 
11  38  1  ibncyygahtsyhkdw 
12  146  2  ibscyogahtnyhkdw 
14  140  4  ibscyogahtnbhkdw 
18  165  2  ibbcyogahtnshkdw 
20  159  1  ibbcdogahynshktw 
21  169  1  ibbcdogahynshmtw 
22  151  4  ibbcdogahynshbtw 
26  150  3  hbbcdogaiynshbtw 
29  142  2  hbbcdogagynshbtw 
31  148  1  hbbcdogaygnshbtw 
32  151  1  hbbqdogaygnshbtw 
33  174  7  hbtmdogaygnshbtw 
40  175  2  hdtmbogaygnshbtw 
42  172  3  hdtmbogaygnstbtw 
45  165  1  hdtmbogaygnetbtw 
46  274  1  hdtmbogaygnetbts 
47  298  5  hdtmbogabgnetyts 
52  296  8  hdtmbogabgnswyte 
60  387  1  hdtmboeabgnswytg 
61  403  7  hdtgboeabmnswytg 
68  409  3  hdtpboeabmnswytg 
71  413  6  hdtpboeabmnsgytg 
77  419  3  hptdboeabmnsgytg 
80  416  8  hptdbaeobmnsgytg 
88  421  9  phtdbaeobmnsgytg 
97  449  40  phtdbaeoymnsgytg 
137  454  2  phtsbaeoymnsgytg 
139  463  11  phtseaeoymnsgytg 
150  581  3  phtsmaeoyensgstg 
153  662  26  yhtsmaeopensgstg 
179  690  1  yhtsmaespenogstg 
180  734  2  lhtsmaespenogstg 
182  754  13  lctsoaespenogstg 
195  849  1  lctsoasepenogstg 
196  912  39  lctseasepenogstg 
235  974  10  lctseasepenoistg 
245  978  10  lctseasedenoistg 
255  1046  25  lctseasedsnoietg 
280  1281  53  lctseasedsrointg 
333  1511  2  lctseasedsronitg 
335  1511  8  lctseasedsronitg 
343  1515  17  lctseaiedsronitg 
360  1537  2  lctseaiedsronitp 
362  1643  32  lctseaiedtronisp 
394  1708  21  lstseaiedtronisp 
415  1727  46  lstseiaedtronisp 
461  1744  40  lstseiaedtrinosp 
501  1873  36  lstseiaentrinosp 
537  2053  9  lstseiaentrisosp 
546  2162  13  lstseiaentrigosp 
559  2445  42  sltseiaentrigosp 
601  2445  21  sltseiaentrigosp 
622  2524  58  sltseiaentrigesp 
680  2597  37  stlseiaentrigesp 
717  2614  232  stlseiaentragesp 
949  3090  276  stlseiaentrpgesa 
1225  3151  26  stlseiaentrpdesa 
1251  3182  n/a  stlseiaentrpdeso 
Some fun words on that last board: “desperate”, “entreaties”, “operated” and “serenest”. No shortage of 11-pointers on this board!
That’s all for now. Next time we’ll look at some of the highestscoring boards that simulated annealing finds.
As usual, you can find all the source code for this project in the performance-boggle project on Google Code.
More generally, Java gave off a mighty stink back in 2004. Any GUI that I ran on the Mac looked out of place and felt clunky. Performance was poor. But in retrospect, I suspect much of the rank Java smell was really coming from the design-patterns gibberish I was being force-fed at the same time. Why use a simple array when you could use an AbstractListFactory that does the same thing with 10x the code bloat?
Regular readers only get one guess what program I wrote to get in the swing of things.
The C++ code translated almost line-for-line to Java. Most of the changes were either syntax (“->” to “.”) or swapping String for char*. It took a while to remember that everything is a pointer in Java. The Java statement Trie t = new Trie(); is the equivalent of C++’s Trie* t = new Trie; and not plain old Trie t;, which uses the stack. Another major pain was the lack of unsigned types, which I used as hash codes in some tests.
Here’s the Java code, for those interested. I expected Java code to be an order of magnitude slower than the equivalent C++, so this initial benchmark surprised me:
C++:  Evaluated 21632 boards in 0.694 seconds = 31,162 bds/sec
Java: Evaluated 21632 boards in 1.108 seconds = 19,523 bds/sec
That’s only a 37% performance penalty, far less than the 90% or more I was expecting. Since my code doesn’t create/destroy any objects during the critical section, I’m really testing the effectiveness of the JIT. Since this code is a direct C conversion, it makes sense that it does well.
I don’t understand the internals of Java well enough to know how valid my benchmark is. Profiling C++ code is relatively straightforward, since nothing is going on behind your back. Not the case with Java. For example, does that 1.108 seconds include the time it took the JIT to compile my code? That might explain the whole perf difference. Also, am I fully optimizing? I hear there’s a difference between the client and server JITs. Which one am I using? How can I tell? Lazyweb?
A few more thoughts on the whole experience:

- Having typing “string.” bring up a list of methods was great. This was less useful for more obscure Javaisms, like System.getProperty("user.dir") to get the current working directory.
- Eclipse automatically added import java.io.File; to my code, which would have taken me a while to figure out on my own. That’s always a nuisance in C++.
- Tag a method with @Test and it becomes a unit test. Very cool.
- "a" + "b" is a special kludge for new StringBuilder().Append("a").Append("b").toString(). I also missed printf when writing to stdout. Is there no natural way to mix numbers and strings?

A tool like Eclipse is a great boon in understanding a large, foreign code base. One thing I’ve discovered in my past year at Google is that, while I’ve become a fairly good coder through experience, I have almost no experience navigating other people’s code. Make no mistake, navigating other people’s code is a skill. And to a surprising extent, it’s a skill disjoint from coding itself. We have a few tools to help with this at Google, but nothing quite so low-latency as what Eclipse pops up when you type a period. The possibility of tools like Eclipse is a strong argument for languages with a simple syntax. When anyone can parse a language, they can build amazing tools like this one.
Last time, we completed a board solver that could score 10,000 boards per second. So what to do with all that power? One option was to compute statistics on random boards. Another is to hunt for the holy grail of Boggle: the highest possible scoring Boggle board. The next few posts will be devoted to the search for this board.
Before undertaking any search, you need to get a feel for your search space. In our case, that’s the set of all 4×4 Boggle boards. How many are there? We can do a few backoftheenvelope calculations.
To create a board, you roll 16 dice. Each has six possible letters on it, which gives 6^16 possibilities. These dice land in some permutation on the board, which gives another factor of 16!. Finally, a 4×4 board has eight-fold symmetry, which takes us down to 6^16 * 16! / 8 ≈ 7.3e24 ≈ 2^83.
That’s one upper bound. But it assumed that all 6*16 = 96 symbols on the dice were unique. Obviously they’re not. After a roll, each of the 16 squares will have one of 26 letters on it. Divide by the symmetries, and you get 26 ^ 16 / 8 = 5e21 ≈ 2^72. Much better!
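Both back-of-the-envelope bounds are easy to verify:

```python
from math import factorial

dice_bound = 6**16 * factorial(16) // 8   # ~7.3e24, about 2^83
letter_bound = 26**16 // 8                # ~5.5e21, about 2^72
```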
I haven’t been able to come up with any better upper bounds than these. The main flaw in the second approximation is that not all boards can be rolled with the sixteen dice that Hasbro provides. A board of all z’s or qu’s simply can’t occur. If we knew the probability that any sixteen character sequence could be rolled, this would give a true approximation of the number of distinct boards.
The easiest way to do this is with a computer program. It picks a random sequence of sixteen characters, checks whether that board can be rolled, and repeats several thousand times. Checking whether a given board can be rolled is a bipartite matching problem (match each cell to a die showing that letter), so it’s solvable exactly in polynomial time — and in this case even a greedy approximation works quite well. I wrote a program (leave a comment if you want to see it) to do this, and processed one million 16-character sequences. Only 84,492 could be represented with the usual dice, or 8.4%. This gives a total of
(26 ^ 16 / 8) * (84,492 / 1,000,000) = 4.6e20 ≈ 2^69.
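Here’s a sketch of the rollability check as exact bipartite matching (augmenting paths), swapped in for the greedy approximation. The die faces listed are the classic Hasbro set as I know it — treat the exact faces as an assumption, with “q” standing for the Qu face:

```python
DICE = ["aaeegn", "abbjoo", "achops", "affkps", "aoottw", "cimotu",
        "deilrx", "delrvy", "distty", "eeghnw", "eeinsu", "ehrtvw",
        "eiosst", "elrtty", "himnqu", "hlnnrz"]

def can_roll(board):
    """board: a sequence of 16 lowercase letters."""
    match = [-1] * 16  # die index -> board position currently assigned to it

    def assign(pos, seen):
        # Try to find a die for board[pos], displacing earlier assignments
        # along an augmenting path if necessary.
        for i, faces in enumerate(DICE):
            if board[pos] in faces and i not in seen:
                seen.add(i)
                if match[i] == -1 or assign(match[i], seen):
                    match[i] = pos
                    return True
        return False

    return all(assign(pos, set()) for pos in range(16))
```

A board of all z’s (or all q’s) fails immediately, since only one die carries each of those letters.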
If you like confidence intervals, my sample size of one million boards gives a 95% confidence interval of [4.55e20, 4.67e20] for the total number of boards. Pretty good.
So, assuming we could enumerate all these boards quickly, how long would it take for our faster solver to find the best board? At 10k boards/sec, we’re looking at 4.5e16 seconds = 1.9 billion years! Clearly we need to do better.
$ time ruby short.rb c a t d l i n e m a r o p e t s
2338
ruby short.rb c a t d l i n e m a r o p e t s  241.95s user 1.20s system 97% cpu 4:08.35 total
$ time jruby short.rb c a t d l i n e m a r o p e t s
2338
jruby short.rb c a t d l i n e m a r o p e t s  1178.86s user 40.84s system 108% cpu 18:44.44 total
I’d heard JRuby was slow, but this is spectacular. Four times slower than the already-slow Ruby?
I’d always thought that the point of JRuby was to run Ruby programs on the JVM, and hence get the benefits of the JVM’s JIT. I guess not. With that kind of performance, the only possible justification is the ability to use Java libraries.