08.11.09

A Few More Boggle Examples

Posted in boggle at 9:01 pm by danvk

Hopefully the last boggle post for a while! I figured I should throw up a few more examples while I still have time left on my OmniGraffle trial.

There are two main shortcomings to the max/no mark upper bound:

  1. It can’t keep track of which words it’s already found.
  2. It can’t keep track of which choices it’s made on each cell.

These can both be fixed, but only by incurring great computational complexity. Remember the tradeoff between the “tightness” of a bound and the difficulty of computing it?

Our first example:

t i .
{a,e} . .
r . .

Let’s look at the search tree starting with the “t” cell. There are three possible words: “tar”, “tie” and “tier”. Here’s the search tree that max/no mark generates:

tar-tier

(I’m drawing the trees slightly differently than I did in the last post, to make the branching more explicit. Branches inside dotted lines come from multiple possibilities on a cell. Other branches result from the ordinary process of looking in different directions from a cell.)

In this example, max/no mark chooses the “a” when it gets to the “{a,e}” cell directly from the “t”. This makes sense, since it results in one point to the e’s zero. However, when it gets to the “{a,e}” cell through the “ti”, it chooses the “e”. This also makes sense, since it gets two points from the “e” (“tie” and “tier”) vs. zero from the “tia” prefix. By the time it makes this choice, it has no memory of choosing the “a” the last time it encountered this cell. If it did, since the “tie” prefix results in two points to the “ta” prefix’s one, it would make sense to go back and change that choice. But remembering this would require lots of expensive bookkeeping. It’s faster (but yields a looser bound) if we just accept the possibility that we’ll make different choices.

Example number 2 of max/no mark’s failings:

t h .
h e .
. . .

the-the

The problem in this case is that max/no mark doesn’t remember which words it’s already found. At first glance, it seems like this problem would be easy to remedy. And it certainly would on this board, in which each cell only has a single possible letter. But if you think about a board with lots of choices, you’ll start to see why it’s best to just give up and accept a looser bound.

Some max/no mark examples

Posted in boggle at 1:41 pm by danvk

After the last post, several people mentioned that they were confused about how the “max/no mark” upper bound on the highest score in a class of Boggle boards worked. With some help from OmniGraffle, I’ve created some instructive examples.

Here’s a class of boards:

f {a,u} r
. . e
. . .

The dots mean “no letters here”. The class contains two different boards:

f a r
. . e
. . .

fare-score

 
f u r
. . e
. . .

fur-score

It contains two boards. The one with an “a” has two points worth of words on it, while the one with a “u” only has one. (We’re only looking at words starting with ‘f’ here.)

The diagrams show that the solver starts with the ‘f’ on each board and explores adjacent cells. When it finds a word, it scores it and passes the total score back up the depth-first search tree.

Here’s how the max/no mark bound sees that board class:


fare-fur-score

When it gets to the “{a,u}” cell, it tries both possible letters. The “a” tree brings back 2 points, whereas the “u” tree brings back 1 point. So it chooses “a” and counts two points. As it so happens, this is the score of the highest-scoring board. The sum/union bound would have added the 1 and the 2, resulting in an upper bound of 3. The max/no mark bound takes advantage of the fact that this cell can only be one of two possibilities, not both.

Now what if we throw a few more letters on:

f {a,u} r
z . e
z y .

With the new letters, there are more points coming from the u:


fare-fur-fuzzy-score

The two points going through the ‘a’ are dropped on the floor. sum/union would have resulting in a bound of 4+2. When there are lots of letter choices on every cell, you can see why max/no mark is a much tighter bound.

It’s important to note that there are two sources of branching in these search trees: (1) being able to go in multiple directions from a cell (i.e. f->u->r or z) and (2) having multiple choices on a cell (i.e. f->a or u). The max/no mark bound sums the scores resulting from choices in case (1) and takes the max of choices in case (2). The sum/union bound takes the sum in both cases.

08.08.09

Breaking 3×3 Boggle

Posted in boggle, math, programming at 10:35 am by danvk

Why is finding the highest-scoring Boggle board so difficult? It’s because there are so many boards to consider: 2^72 for the 4×4 case and 2^40 for the 3×3 case. At 10,000 boards/second the former corresponds to about 2 billion years of compute time, and the latter just two years. Just enumerating all 2^72 boards would take over 100,000 years.

So we have to come up with a technique that doesn’t involve looking at every single board. And I’ve come up with just such a method! This is the “exciting news” I alluded to in the last post.

Here’s the general technique:

  1. Find a very high-scoring board (maybe this way)
  2. Consider a large class of boards
  3. Come up with an upper bound on the highest score achieved by any board in the class.
  4. If it’s lower than the score in step #1, we can eliminate all the boards in the class. If it’s not, subdivide the class and repeat step #2 with each subclass.

Classes of Boards
By “class of boards”, I mean something like this:

{a,e,i,o,u} {a,e,i,o,u} r
{b,c,d,f,g,h} a t
d e {r,s,t,v}

The squares that contain a set of letters can take on any of those letters. So this board is part of that class:

a i r
d a t
d e s
189 points

and so is this:

o u r
f a t
d e t
114 points

All told, there are 5 * 5 * 6 * 4 = 600 boards that are part of this class, each with its own score. Other fun classes of boards include “boards with only vowels” (1,953,125 members) and “boards with only consonants” (794,280,046,581 members).

Follow me past the fold for more…
Read the rest of this entry »

08.04.09

Solving Boggle by Taking Option Three

Posted in boggle at 6:18 pm by danvk

There is exciting news to report in the land of Boggle!

Last we spoke about Boggle, we used Simulated Annealing to get at the question “what is the highest-scoring Boggle board?”

We found a board with 3625 points on it this way. It would have been nice to say that it was the best of all possible boards, but that would have been too rash. While it is a very good board and I have never seen a higher-scoring one, that doesn’t mean there isn’t one out there. Maybe I’ve just been looking in the wrong places.

To prove that the 3625-pointer is the best board, we’d need to show that every other board scores fewer points. In a previous post, I estimated that there were 2^69 possible boggle boards. At 10,000 boards/second, this would take 1.9 billion years of compute time!

When you run up against a computationally intractable problem, there are a few standard ways to deal with it:

  1. Give up.
  2. Come up with a brilliant new algorithm to solve the problem more quickly.
  3. Solve a simpler problem.

Option 1 is the easiest. Option 2 is the hardest. And option 3 is the compromise we’ll be taking for the next few danvk.org boggle posts. Our simpler problem today: 3×3 boggle.

If we drop seven letters (4×4 – 3×3 = 7), we’re left with only 26^9 / 8 ≅ 2^39 boards to consider. If we can achieve 10,000 boards/sec, this would be just over two years’ worth of computation. That’s still a lot, but it’s much more reasonable than 1.9 billion years!

I believe I have solved the 3×3 boggle problem using an approach that is only slightly more clever than this. I used simulated annealing to find very high-scoring 3×3 boards. This was the best one I found, along with its 4×4 best board buddy:


p e r
l a t
d e s
545 points

 

p e r s
l a t g
s i n e
t e r s
3625 points

It’s worth noting that the optimal 3×3 board has a 2×3 region in common with the 3625 point 4×4 board.

In the next post we’ll talk about how I showed that this board was higher-scoring than all 2^39 others in significantly less than two years (it took one day). As a teaser, I’ve included all the 3×3 boards with more than 500 points worth of words below the fold.

Read the rest of this entry »

02.19.09

Sky-High Boggle Scores with Simulated Annealing

Posted in boggle at 1:09 am by danvk

Don’t let the sixteen month hiatus fool you. There’s just no end to Boggle posts on danvk.org!

In case you’d forgot, we’ve developed a blazing fast boggle solver capable of scoring 10,000 Boggle boards a second. What to do with this? Other than some interesting analyses, the most interesting question is:

What is the highest-scoring Boggle Board?

In this post, we’ll try to answer that question using Simulated Annealing. Here’s a sneak peak at one of the exceptionally word-rich boards we’ll find:

p e r s
l a t g
s i n e
t e r s
3625 points

Follow me past the fold for more…

Read the rest of this entry »

10.09.07

A Java Surprise

Posted in boggle, programming at 11:44 pm by danvk

java.png
I’ve always been a Java and Eclipse naysayer, but I’m afraid new experiences are forcing me to reevaluate my skepticism. The last time I used Java was JDK 1.3 on a Sparc workstation back in early 2004. Eclipse was hella slow on that hardware, and somehow my workspace wound up in a temporary directory. This was a very bad thing, because as soon as I logged out, my project was gone forever. So I had good reason to swear off Eclipse.

More generally, Java left off a mighty stink back in 2004. Any GUI that I ran on the Mac would look out of place and felt clunky. Performance was poor. But in retrospect, I suspect much of the rank Java smell was really coming from the design patterns gibberish I was being force-fed at the same time. Why use a simple array when you could use an AbstractListFactory that does the same thing with 10x code bloat?

Regular readers only get one guess what program I wrote to get in the swing of things.
Read the rest of this entry »

08.02.07

How many Boggle boards are there?

Posted in boggle, math at 8:45 pm by danvk

I’ve taken a several months break from my Boggle series, mostly because I think everyone got tired of it. I’m going to come back to it, but hopefully at a slower pace this time around.

Last time, we completed a board solver that could score 10,000 boards per second. So what to do with all that power? One option was to compute statistics on random boards. Another is to hunt for the holy grail of Boggle: the highest possible scoring Boggle board. The next few posts will be devoted to the search for this board.

Before undertaking any search, you need to get a feel for your search space. In our case, that’s the set of all 4×4 Boggle boards. How many are there? We can do a few back-of-the-envelope calculations.

To create a board, you roll 16 dice. Each has six possible letters on it, which gives 6^16 possibilities. These dice land in some permutation on the board, which gives another factor of 16!. Finally, a 4×4 board has eight-fold symmetry, which takes us down to 6^16 * 16! / 8 = 7.3e24 ≈ 2 ^ 83.

That’s one upper bound. But it assumed that all 6*16 = 96 symbols on the dice were unique. Obviously they’re not. After a roll, each of the 16 squares will have one of 26 letters on it. Divide by the symmetries, and you get 26 ^ 16 / 8 = 5e21 ≈ 2^72. Much better!

I haven’t been able to come up with any better upper bounds than these. The main flaw in the second approximation is that not all boards can be rolled with the sixteen dice that Hasbro provides. A board of all z’s or qu’s simply can’t occur. If we knew the probability that any sixteen character sequence could be rolled, this would give a true approximation of the number of distinct boards.

The easiest way to do this is with a computer program. It picks a random sequence of sixteen characters, checks whether this board can be rolled, and repeats several thousand times. I believe that checking whether a given board can be rolled is NP-Complete, but in this case the greedy approximation works quite well. I wrote a program (leave a comment if you want to see it) to do this, and processed one million 16-character sequences. Only 84,492 could be represented with the usual dice, or 8.4%. This gives a total of

(26 ^ 16 / 8) * (84,492 / 1,000,000) = 4.6e20 ≈ 2^69.

If you like confidence intervals, my sample size of one million boards gives a 95% confidence interval of [4.545e24, 4.666e24] for the total number of boards. Pretty good.

So, assuming we could enumerate all these boards quickly, how long would it take for our faster solver to find the best board? At 10k boards/sec, we’re looking at 4.5e16 seconds = 1.9 billion years! Clearly we need to do better.

05.02.07

JRuby Performance, Yowch!

Posted in boggle, programming at 10:58 pm by danvk

I took JRuby 0.9.9 for a spin with the exceptionally-inefficient Boggle program from a few months back. Here are the numbers:

$ time ruby short.rb c a t d l i n e m a r o p e t s
2338
ruby short.rb c a t d l i n e m a r o p e t s  241.95s user 1.20s system 97% cpu 4:08.35 total
$ time jruby short.rb c a t d l i n e m a r o p e t s
2338
jruby short.rb c a t d l i n e m a r o p e t s  1178.86s user 40.84s system 108% cpu 18:44.44 total

I’d heard JRuby was slow, but this is spectacular. Four times slower than the already-slow Ruby?

I’d always thought that the point of JRuby was to run Ruby programs on the JVM, and hence get the benefits of the JVM’s JIT. I guess not. With that kind of performance, the only possible justification is the ability to use Java libraries.

02.16.07

Initial Impressions of D

Posted in boggle, programming at 12:07 am by danvk

I got curious about the programming language D after noticing it outperform C++ in the somewhat-silly Computer Language Shootout and reading Steve Yegge’s thoughts on it. I’ve played around it for the past few days. Here’s some thoughts.

First of all, D is what you might call a “low PageRank language.” If you search for ‘D’, there’s nothing related to the language until result #15. The next result isn’t until #160, and it’s pretty random. Contrast that with C. This makes searching for help on D almost impossible. Try “D slow file I/O” for a taste. This isn’t necessarily Google’s fault, either. There’s just not that many good D resources online yet.
Read the rest of this entry »

02.11.07

Some Boggle Statistics

Posted in boggle at 10:46 pm by danvk

With a fast boggle solver in hand, it’s time for some fun statistics. These are all based on boggle boards rolled with real boggle dice. I’m going sans-code this time, but if you’re interested in seeing it, feel free to holla.

Most common words:

3 letters   4 letters   5 letters
Word Freq (%)   Word Freq (%)   Word Freq (%)
toe 19.258   teen 6.718   eaten 2.034
tee 19.074   tees 6.564   enate 2
ten 17.944   tent 6.02   sente 1.954
net 17.944   note 5.976   setae 1.944
tea 17.65   tone 5.838   tense 1.86
set 17.51   teat 5.804   tease 1.856
eta 17.176   toes 5.664   teeth 1.788
ate 17.176   toea 5.548   eater 1.788
tae 16.518   nets 5.432   teens 1.712
eat 16.518   test 5.344   seton 1.702
tie 16.432   rete 5.208   notes 1.702
het 15.684   nett 5.204   tents 1.646
ret 15.108   nest 5.174   retie 1.632
eth 14.938   tens 5.172   steno 1.624
oes 14.698   sent 5.156   sheet 1.618
the 14.542   neat 5.146   ester 1.618
eon 14.474   etna 5.144   oaten 1.61
one 14.366   ante 5.144   teats 1.608
ose 13.82   thee 5.064   tones 1.606
see 13.78   tote 5.052   enter 1.596

I looked these words up and they all check out. See the Scrabble dictionary if you’re not convinced.

How many words can we expect to find on each board?

words.png

That looks like a log-normal distribution. The mean is 98.53 words. How many points?

scores.png

That’s also a log-normal distribution with the characteristically long tail. The mean is 140.97 points per board.

How many words of each length can we expect to find on a board? Here’s a histogram of the number of words of each length on a board:

lens.png

Those also look like log-normals, with four letter words being most common.

Put another way, what’s the likelihood of finding a word of a given length on a board?

Len. Likelihood
3 99.97994%
4 99.901%
5 98.62%
6 87.56%
7 56.21%
8 21.36%
9 3.94%
10 0.442%
11 0.0362%
12 0.00228%
13 0.0001%

For context, the longest word I’ve ever found in a game was “thrashers” at nine letters.

The most common words were based on a 50,000 board sample. The graphs are based on a 5,000,000 board sample. Feel free to contact me if you’d like source or the Excel spreadsheet.

« Previous entries Next Page » Next Page »