danvk.org » math

Takeaways from Stanford’s Machine Learning Class

danvk — Tue, 20 Dec 2011 00:04:33 +0000

Over the past two months, I’ve participated in Andrew Ng’s online Stanford Machine learning class. It’s a very high-level overview of the field with an emphasis on applications and techniques, rather than theory. Since I just finished the last assignment, it’s a fine time to write down my thoughts on the class!

Overall, I’ve learned quite a bit about how ML is used in practice. Some highlights for me:

Gradient descent is a very general optimization technique. If you can calculate a function and its partial derivatives, you can use gradient descent. I was particularly impressed with the way we used it to train Neural Networks. We learned how the networks operated, but had no need to think about how to train them — we just used gradient descent.
There are many advanced “unconstrained optimization” algorithms which can be used as alternatives to gradient descent. These often have the advantage that you don’t need to tune parameters like a learning rate.
Regularization is used almost universally. I’d previously had very negative associations with using high-order polynomial features, since I most often saw them used in examples of overfitting. But I realize now that they are quite reasonable to add if you also make good use of regularization.
The backpropagation algorithm for Neural Networks is really just an efficient way to compute partial derivatives (for use by gradient descent and co).
Learning curves (plots of train/test error as a function of the number of examples) are a great way to figure out how to improve your ML algorithm. For example, if your training and test errors are both high, it means that you’re not overfitting your data set and there’s no point in gathering more data. What it does mean is that you need to add more features (e.g. the polynomial which I used to fear) in order to increase your performance.

The other takeaway is that, as in many fields, there are many “tricks of the trade” in Machine Learning. These are bits of knowledge that aren’t part of the core theory, but which are still enormously helpful for solving real-world problems.

As an example, consider the last problem in the course: Photo OCR. The problem is to take an image like this:

and extract all the text: “LULA B’s ANTIQUE MALL”, “LULA B’s”, “OPEN” and “Lula B’s”. Initially, this seems quite daunting. Machine Learning is clearly relevant here, but how do you break it down into concrete problems which can be attacked using ML techniques? You don’t know where the text is and you don’t even have a rough idea of the text’s size.

This is where the “tricks” come in. Binary classifiers are the “hammer” of ML. You can write a binary classifier to determine whether a fixed-size rectangle contains text:

Positive examples
Negative examples

You then run this classifier over thousands of different “windows” in the main image. This tells you where all the bits of text are. If you ignore all the non-contiguous areas, you have a pretty good sense of the bounding boxes for the text in the image.

But even given the text boxes, how do you recognize the characters? Time for another trick! We can build a binary classifier to detect a gap between letters in the center of a fixed-size rectangle:

Positive examples
Negative examples

If we slide this along, it will tell us where each character starts and ends. So we can chop the text box up into character boxes. Once we’ve done that, classifying characters in a fixed-size rectangle is another concrete problem which can be tackled with Neural Networks or the like.

In an ML class, you’re presented with this pipeline of ML algorithms for the Photo OCR problem. It makes sense. It reduces the real-world problem into three nice clean, theoretical problems. In the class, you’d likely spend most of your time talking about those three concrete problems. In retrospect, the pipeline seems as natural as could be.

But if you were given the Photo OCR problem in the real world, you might never come up with this breakdown. Unless you knew the trick! And the only way to learn tricks like this is to see them used. And that’s my final takeaway from this practical ML class: familiarity with a vastly larger set of ML tricks.

Crossword Word Frequency

danvk — Sat, 26 Dec 2009 17:45:02 +0000

In a previous post, I discussed downloading several years’ worth of New York Times Crosswords and categorizing them by day of week. Now, some analysis!

Here were the most common words over the last 12 years, along with the percentage of puzzles in which they occurred:

Percentage	Word	Length
6.218%	ERA	3
5.703%	AREA	4
5.413%	ERE	3
5.055%	ELI	3
4.854%	ONE	3
4.585%	ALE	3
4.496%	ORE	3
4.361%	ERIE	4
4.339%	ALOE	4
4.317%	ETA	3
4.317%	ALI	3
4.227%	OLE	3
4.205%	ARE	3
4.138%	ESS	3
4.138%	EDEN	4
4.138%	ATE	3
4.048%	IRE	3
4.048%	ARIA	4
4.004%	ANTE	4
3.936%	ESE	3
3.936%	ENE	3
3.914%	ADO	3
3.869%	ELSE	4
3.825%	NEE	3
3.758%	ACE	3

(you can click column headings to sort.)

So “ERA” appears, on average, in about 23 puzzles per year. How about if we break this down by day of week? Follow me past the fold…

Monday:

Percentage	Word	Length
9.404%	ALOE	4
8.777%	AREA	4
7.837%	ERIE	4
6.426%	ONE	3
6.426%	IDEA	4
6.426%	ARIA	4
6.270%	ONCE	4
6.270%	EDEN	4
6.113%	ERA	3
6.113%	ELSE	4
6.113%	ASEA	4
5.799%	ERE	3
5.643%	ORE	3
5.643%	ETAL	4
5.643%	ARE	3
5.643%	ANTE	4
5.486%	OREO	4
5.486%	ALEE	4
5.329%	TREE	4
5.329%	ESS	3
5.329%	ELI	3
5.329%	ACRE	4
5.172%	TSAR	4
5.172%	ANTI	4
5.016%	ORAL	4

The four letter words are more common now. Also look how much higher the percentages are. There’s less variety in the fill of Monday puzzles. “ALOE” and “ARIA” are classic crossword words, not to mention “OREO”.

Saturday:

Percentage	Word	Length
3.286%	ERA	3
2.973%	ONE	3
2.973%	ETE	3
2.817%	TEN	3
2.817%	EVE	3
2.817%	ETA	3
2.660%	IRE	3
2.660%	ERR	3
2.660%	ERE	3
2.504%	OTIS	4
2.504%	OLE	3
2.504%	ENE	3
2.504%	ELL	3
2.504%	ELI	3
2.504%	ARE	3
2.504%	ARA	3
2.504%	ALA	3
2.504%	ACE	3
2.347%	RTE	3
2.347%	ICE	3
2.347%	ATE	3
2.347%	ALE	3
2.191%	TSE	3
2.191%	TERSE	5
2.191%	SRI	3

Lots of three letter words and much lower percentages. “OTIS” is surprising to me, but I don’t do many Saturday puzzles, so who am I to say?

It would be really interesting to combine this with some document frequency numbers for the English language. This would find words which are much more common in crosswords than they are in general, i.e. crosswordese.

I’d include everything necessary to reproduce this here, but the puzzles are not free. See this directory for the program I used to tabulate the statistics and complete word counts, both overall and for each day of the week. The first puzzle in my collection was 2006-10-23 and the last was 2009-01-19.

How long is a day?

danvk — Fri, 06 Nov 2009 00:39:40 +0000

As we approach the winter solstice, the days get shorter and shorter. There’s a common misconception about how quickly this change happens. Most people know that:

The summer solstice (June 21) is the longest day of the year.
The winter solstice (December 21) is the shortest day of the year.
The days get shorter between Summer and Winter.
The days get longer between Winter and Summer.

Many people take these four pieces of information and assume that the day length changes like this over the course of the year:

(The x-axis is the date. The y-axis is length of the day in hours.)

This is consistent with the four pieces of information, but is incorrect! There aren’t many sharp edges like that in Physics. Reality is much smoother:

The length of the day slowly increases as we approach the summer solstice, then slowly decreases as we leave it. This is great — it means that there are lots of long days in the summer. As we get to the autumnal equinox, the rate of change hits a maximum. The same thing happens around the winter solstice, only in reverse.

The summer solstice is the longest day of the year, but not by much! Here’s some day lengths for San Francisco:

Date	Day Length	Difference
Jun 18, 2009	14h 46m 45s	+ 09s
Jun 19, 2009	14h 46m 51s	+ 06s
Jun 20, 2009	14h 46m 54s	+ 02s
Jun 21, 2009	14h 46m 54s	< 1s
Jun 22, 2009	14h 46m 50s	− 03s
Jun 23, 2009	14h 46m 43s	− 06s
Jun 24, 2009	14h 46m 33s	− 10s

The lengths of the days around the solstice differ by only a few seconds! On the other hand, here are some day lengths around the autumnal equinox (September 22):

Date	Day Length	Difference
Sep 19, 2009	12h 15m 35s	− 2m 24s
Sep 20, 2009	12h 13m 10s	− 2m 24s
Sep 21, 2009	12h 10m 46s	− 2m 24s
Sep 22, 2009	12h 08m 21s	− 2m 24s
Sep 23, 2009	12h 05m 56s	− 2m 24s
Sep 24, 2009	12h 03m 32s	− 2m 24s
Sep 25, 2009	12h 01m 07s	− 2m 24s

The length of each day changes by several minutes in September. Over a single week the day gets a whole 15 minutes shorter!

note: the interactive graphs are dygraphs, a JS library I created. Check it out!

Breaking 3×3 Boggle

danvk — Sat, 08 Aug 2009 17:35:04 +0000

Why is finding the highest-scoring Boggle board so difficult? It’s because there are so many boards to consider: 2^72 for the 4×4 case and 2^40 for the 3×3 case. At 10,000 boards/second the former corresponds to about 2 billion years of compute time, and the latter just two years. Just enumerating all 2^72 boards would take over 100,000 years.

So we have to come up with a technique that doesn’t involve looking at every single board. And I’ve come up with just such a method! This is the “exciting news” I alluded to in the last post.

Here’s the general technique:

Find a very high-scoring board (maybe this way)
Consider a large class of boards
Come up with an upper bound on the highest score achieved by any board in the class.
If it’s lower than the score in step #1, we can eliminate all the boards in the class. If it’s not, subdivide the class and repeat step #2 with each subclass.

Classes of Boards
By “class of boards”, I mean something like this:

{a,e,i,o,u}	{a,e,i,o,u}	r
{b,c,d,f,g,h}	a	t
d	e	{r,s,t,v}

The squares that contain a set of letters can take on any of those letters. So this board is part of that class:

a	i	r
d	a	t
d	e	s
189 points

and so is this:

o	u	r
f	a	t
d	e	t
114 points

All told, there are 5 * 5 * 6 * 4 = 600 boards that are part of this class, each with its own score. Other fun classes of boards include “boards with only vowels” (1,953,125 members) and “boards with only consonants” (794,280,046,581 members).

Follow me past the fold for more…

Upper Bounds
Now on to step #3 of the general technique: calculating an upper bound. This is going to be easier if we introduce some mathematical notation:

b	=	A boggle board
Score(b)	=	sum of the scores of all the words contained on b
B	=	a class of boards, i.e.
Score(B)	=

An upper bound is a function such that , i.e. .

There’s one really easy upper bound: ! This just enumerates all the boards in the class B, scores each and takes the maximum score. It’s very expensive to compute for a large class of boards and hence not very practical. You and I both know that no board in containing only consonants has any points on it. We don’t need to enumerate through all 794 billion such boards to determine this.

With upper bounds, there’s a trade-off between how hard they are to compute and how “tight” they are, i.e. how closely they approximate . is very tight but is hard to compute. At the other end of the spectrum, we know that all the words on a board are in the dictionary. So we could just sum up the scores of all the words in the dictionary and get a number, say 1,000,000. Then is an upper bound. It is very easy to compute, but is not very tight.

The trick is to hit some sort of sweet spot that strikes a good balance between “tightness” and ease of computation. Over the rest of this blog post, I’ll present two upper bounds that do this. Upper bounds have the nice property that if and are two upper bounds, then is also an upper bound. So by finding two bounds, we’ll get a third that’s better than either one alone.

sum/union
The idea of this bound is to find all the words that can possibly occur in a class of boards. Since each word can only be found once, we can add the scores of all these words to get an upper bound.

To get the list of words, we use the same depth-first search strategy as we did to find words on a single board. The wrinkle is that, when we encounter a cell with multiple possible letters, we have to do a separate depth-first search for each.

At first glance, it doesn’t seem like this would be tractable for a board class like this one (alternating vowels and consonants):

{a,e,i,o,u}	{b-d,f-h,j-n,p-t,v-z}	{a,e,i,o,u}
{b-d,f-h,j-n,p-t,v-z}	{a,e,i,o,u}	{b-d,f-h,j-n,p-t,v-z}
{a,e,i,o,u}	{b-d,f-h,j-n,p-t,v-z}	{a,e,i,o,u}

In addition to the branching from going different directions on each square, there’s also a huge amount of branching from trying each letter on each square. But we’re saved by the same lesson we learned in boggle post #3: the dictionary is exceptionally effective at pruning thorny search trees. If we prune search trees like ‘bqu’ that don’t begin words, then there doesn’t wind up being that much work to do.

We can find all possible words on the above board in just under 1 second. This is about 10,000 times slower than it takes to score a conventional board, but it’s certainly tractable. The resulting score is 195,944. Given that no board scores higher than 545 points, this is a wild overestimate. But at least it’s a better bound than a million!

This technique does especially well on boards like this one, which contains all consonants:

{b-d,f-h,j-n,p-t,v-z}	{b-d,f-h,j-n,p-t,v-z}	{b-d,f-h,j-n,p-t,v-z}
{b-d,f-h,j-n,p-t,v-z}	{b-d,f-h,j-n,p-t,v-z}	{b-d,f-h,j-n,p-t,v-z}
{b-d,f-h,j-n,p-t,v-z}	{b-d,f-h,j-n,p-t,v-z}	{b-d,f-h,j-n,p-t,v-z}

This takes 0.23 seconds to score and results in a bound of 208 points (it contains words like ‘crypt’ and ‘gypsy’). We’ve already found a single board that has 545 points on it. So we can eliminate this entire class of 794 billion boards. That’s a speed of over 3 trillion boards/second! Of course, this board class is not typical.

It’s also worth pointing out why this upper bound isn’t tight. Consider this class of boards:

{a,i}	r	z
f	z	z
z	z	z

You can find both “fir” and “far” on boards in this class, but there aren’t any boards that contain both. So while each “fir and “far” contribute a point to the upper bound, they should only really contribute a single point. The sum/union bound doesn’t take into account the relationships between various letter choices. It’s the best trade-off between computability and “tightness” we’ve seen so far, but it’s not good enough to make the problem tractable.

max/no mark
In the sum/union upper bound, we dealt with multiple possible letters on the same square by trying each and adding the resulting scores (taking care not to count any word twice). But why take the sum of all choices when we know that any given board can only take on one of the possibilities? It would result in a much better bound if we took the max of the scores resulting from each possible choice, rather than the sum. This is the idea behind the “max/no mark” bound.

This is a huge win over sum/union, especially when there are many squares containing many possible letters. It does have one major drawback, though. The sum/union bound took advantage of the fact that each word could only be found once. With the max/no mark bound, the bookkeeping for this becomes completely intractable. The words we find by making a choice on one square may affect the results of a choice somewhere else. We can’t make the choices independently. The optimal set of choices becomes an optimization problem in its own right.

Rather than deal with this, max/no mark just throws up its hands. This is what the “no mark” refers to. In the past, we’ve recorded the words we find by marking the Trie. By not marking the Trie with found words, we accept that we’ll double-count words sometimes. But it still winds up being an excellent upper bound.

Lets try some of our previous examples:

{a,e,i,o,u}	{a,e,i,o,u}	r
{b,c,d,f,g,h}	a	t
d	e	{r,s,t,v}
sum/union: 2880
max/no mark: 1307

Alternating vowels and consonants:

{a,e,i,o,u}	{b-d,f-h,j-n,p-t,v-z}	{a,e,i,o,u}
{b-d,f-h,j-n,p-t,v-z}	{a,e,i,o,u}	{b-d,f-h,j-n,p-t,v-z}
{a,e,i,o,u}	{b-d,f-h,j-n,p-t,v-z}	{a,e,i,o,u}
sum/union: 195944
max/no mark: 15692

A class that can be entirely eliminated:

{b,d,f,g,j,k,m,p,v,w,x,z}	a	{s,y}
{i,o,u}	y	a
{s,y}	{c,h,l,n,r,t}	{c,h,l,n,r,t}
sum/union: 2497
max/no mark: 447

max/no mark isn’t always better than sum/union:

{b,d}	a	{b,d}
a	{b,d}	a
{b,d}	a	{b,d}
sum/union: 9
max/no mark: 132

This is something of a worst-case because, while there are relatively few distinct words, there are many different ways to find them.

Putting it all together
Our two bounds do well in different situations. max/no mark works best when there are lots of choices to be made on particular cells and there are relatively few ways to make any particular word. sum/union works best when there are lots of possibilities but relatively few distinct words. Putting them together results in a bound that’s good enough to find the best 3×3 boggle board using the technique described at the beginning of this post.

Given an initial class of boards, we wind up with what I call a “breaking tree”. If the initial class has an upper bound less than 545 points, then we’re done. Otherwise, we pick a cell to split and try each possibility.

Here’s a relatively small breaking tree that results from running this program:

$ ./3x3/ibucket_breaker --best_score 520 --break_class "bdfgjkmpvwxz a sy iou xyz aeiou sy chlnrt chlnrt"
(     0%) (0;1/1) bdfgjkmpvwxz a sy iou xyz aeiou sy chlnrt chlnrt (820, 77760 reps)
                            split cell 4 (xyz) Will evaluate 3 more boards...
(     0%)  (1;1/3) bdfgjkmpvwxz a sy iou x aeiou sy chlnrt chlnrt (475, 25920 reps)
(33.333%)  (1;2/3) bdfgjkmpvwxz a sy iou y aeiou sy chlnrt chlnrt (703, 25920 reps)
                            split cell 5 (aeiou) Will evaluate 5 more boards...
(33.333%)   (2;1/5) bdfgjkmpvwxz a sy iou y a sy chlnrt chlnrt (447, 5184 reps)
(    40%)   (2;2/5) bdfgjkmpvwxz a sy iou y e sy chlnrt chlnrt (524, 5184 reps)
                            split cell (iou) 3 Will evaluate 3 more boards...
(    40%)    (3;1/3) bdfgjkmpvwxz a sy i y e sy chlnrt chlnrt (346, 1728 reps)
(42.222%)    (3;2/3) bdfgjkmpvwxz a sy o y e sy chlnrt chlnrt (431, 1728 reps)
(44.444%)    (3;3/3) bdfgjkmpvwxz a sy u y e sy chlnrt chlnrt (339, 1728 reps)
(46.667%)   (2;3/5) bdfgjkmpvwxz a sy iou y i sy chlnrt chlnrt (378, 5184 reps)
(53.333%)   (2;4/5) bdfgjkmpvwxz a sy iou y o sy chlnrt chlnrt (423, 5184 reps)
(    60%)   (2;5/5) bdfgjkmpvwxz a sy iou y u sy chlnrt chlnrt (318, 5184 reps)
(66.667%)  (1;3/3) bdfgjkmpvwxz a sy iou z aeiou sy chlnrt chlnrt (509, 25920 reps)

The numbers in parentheses are the upper bounds. When they get below 520 (the parameter I set on the command line), a sub-class is fully broken.

Using this technique and the following partition of the 26 letters:

bdfgjvwxz
aeiou
lnrsy
chkmpt

I was able to go through all 262,144 (=4^9) board classes in about six hours on a single machine. This resulted in the boards I listed in the last post. Six hours is a big improvement over two years!

If that same factor (two years to six hours) held for the 4×4 case, then we’d be down to 380 years of compute time to find the best 4×4 boggle board. Or, equivalently, 138 days on 1000 machines. That’s still a lot. We’re not quite there yet, but we’re getting closer!

Code for the program that went through all possible board classes can be found here. While many people have found high-scoring boards, I haven’t found any previous work on this upper bounding approach. So if you have any ideas/suggestions on how to improve the bound, they’re probably novel and useful!

Why You Invest When the Market Goes Down

danvk — Sun, 22 Feb 2009 21:51:07 +0000

The S&P 500 certainly hasn’t made anyone rich over the last year:

Most people see this and think “an investment one year ago would have lost 45% of its value”. Others think “great, now stocks are all cheaper!”

In reality, most ordinary people invest portions of their paychecks, either through their 401(k) or a personal account. This means they’re doing time averaging. Sure, investments when the market was up aren’t doing well. But investments when the market was down are doing just fine.

This is all kind of wishy-washy, though. Let’s try to quantify it. Suppose a market drop and recovery looks like a parabola:

The prices here are parabolas

for various values of a. a=0 means the market is flat. a=0.5 means the market loses 50% of its value.

If you invest dt dollars in this market at each point in time, you can work out a nasty integral and show that the number of shares you have at time t is:

and hence the value of your shares at the end is:

Here’s what that looks like:

The x-axis is a, the fraction by which the market drops. The y-axis is your total return on investment. If the market drops by 50% (a=0.5) then your total return on investment is around 55%. With time-averaging, the more the market drops, the better you do.

This makes a lot of sense if you think about it. Say the market drops 99.9% and then recovers. The shares you bought when it was at its bottom earned you a return of 1000x. Investing at the bottom is important! You should keep investing, even as the market drops. If you don’t, you’ll miss that bottom.

A Minus Times a Minus

danvk — Thu, 01 Jan 2009 02:32:14 +0000

Back in grade school, I never got a satisfactory answer to the question “why is a minus times a minus equal to a plus?”

I didn’t think about it again for years, until my mom asked for an explanation to give when she taught her students. This time around, I immediately realized how you’d go about proving it. Here’s how it goes:

1 + (-1)	=	0	(definition of “-1″)
(-1) * 1 + (-1) * (-1)	=	0	(multiply both sides by -1 and distribute)
-1 + (-1)(-1)	=	0	(anything times 1 is itself)
(-1)(-1)	=	1	(add 1 to both sides)

Easy as could be, right? Unfortunately, mom preferred my brother’s explanation:

A minus times a minus is equal to a plus.
The reasons for this, we need not discuss.

How to Read Polls

danvk — Thu, 25 Sep 2008 22:45:29 +0000

On September 15, SurveyUSA released this poll of likely voters in Virginia:

McCain (R) 46%
Obama (D) 50%
Margin of Error: +/-3.7%

Tables like this appear on TV and in newspapers all the time. But they’re never accompanied by any explanation of how to interpret the margin of error. Commentators usually interpret it in one of two ways:

Obama is ahead by more than the margin of error, hence his lead is statistically significant.
That “+/-” means either number could be off by that amount. If you added 3.7% to McCain’s 46% and subtracted 3.7% from Obama’s 50%, McCain would actually be ahead. So Obama’s lead is not statistically significant; it is less than twice the margin of error.

In either case, they are wrong.

So what’s the right way to interpret the margin of error? A lead is significant if it is 1.6 times the margin of error or greater. That’s 5.9% for our poll, so Obama’s lead is not significant.

This is a strange, non-intuitive rule, which explains why commentators don’t use it. The derivation is more revealing than the rule itself.

Obama’s lead is “statistically significant” if there’s a 95% probability that Obama is actually ahead. The “95%” is completely arbitrary, but the probability

P(Obama ahead)

is quite interesting. I wish news organizations would report this probability instead of the margin of error. It’s easier to interpret the statement “There’s an 86.5% chance that Obama is ahead” than a statement about margins of error.

These margins of error, incidentally, are just one over the square root of the sample size. For the poll described above, there were 732 voters surveyed. The square root of 732 is 27 and one over that is .03696 or 3.7%. The reported margin of error is not a standard deviation.

The probability that Obama is ahead can be determined using Bayes' Rule, which quantifies the effect of evidence on our belief in a hypothesis. It relates a Hypothesis (H) and an Observation (O):

H = Obama is ahead of McCain.
O = In a poll of 732 likely voters, 50% preferred Obama and 46% preferred McCain.

Here it is:

Bayes’ Rule: P(H|O) = P(O|H)/P(O) * P(H)

This rule is important enough that each of these quantities has a name:

P(H) is the prior probability, our belief that the hypothesis is true before seeing additional evidence.
P(O|H) is the likelihood function, the probability of seeing the evidence if the hypothesis were true.
P(O) is the marginal probablity, the probablity of seeing the evidence at all. It’s often thought of as a normalizing term.
P(H|O) is the posterior probability. It’s what we’re really after, the likelihood of the hypothesis in light of new evidence.

Let’s start with the likelihood function, P(O|H). What are the odds of seeing this survey is a certain portion p of voters prefer Obama? It follows from the binomial formula:

pO = portion of voters preferring Obama
pM = portion of voters preferring McCain
a = pO * N (number of voters who prefer Obama)
b = pM * N (number of voters who prefer McCain)
N = a + b (total decided voters)
P(O|H) = B(a, b) = N! / (a! b!) * pO^a (1-pO)^b

This is a binomial distribution over pO. Notice that we’re only considering the two-way vote here, the 96% of the electorate that prefers either McCain or Obama.

To aid in the application of Bayes’ Rule, statisticians have developed the notion of a conjugate prior. The conjugate prior for the binomial distribution is the beta distribution. This means that, if our likelihood function is a binomial distribution, we can choose a beta distribution for our prior probability and get another beta distribution for the posterior probability.

In this case, it’s simplest to assume a uniform distribution for Obama’s portion of the vote. In other words, it’s equally probable that he’ll get 1% of the vote as it is that he’ll get 50% or 99% of it. Mathematically, if pO is the portion of voters who prefer Obama, then

pO ~ U(0, 1) = B(1, 1)

Bayes’ rule then gives the following distribution for pO after observing the poll:

pO’ ~ B(a + 1, b + 1) = B(pO * N + 1, pM * N + 1)

This is concentrated in a small region (note the x-axis) around 50 / (50 + 46) = 52.1%, Obama’s fraction of the two-way vote. The probability that Obama is ahead is the portion of mass to the right of pO’ = 50%:

This fraction is calculated numerically using an integral. It’s an important enough quantity to have a name, but not important enough to have a short, catchy name. It’s the regularized incomplete beta function,

P(Obama ahead) = I_0.5(b, a) = I_0.5(732 * 0.46, 732 * 0.50)

It can be calculated using a program like Mathematica or Octave, or by using an online calculator.

Another way of formulating this is to ask, “what is the fraction Δ by which a candidate must lead in a poll to have a 95% chance of really being ahead?” For a small sample, Δ will be large. For a large sample it will be small.

In a survey of N voters, a candidate with a lead of Δ can claim his chance of leading is:

P(leading) = I_0.5(N*(0.5-Δ), N*(0.5+Δ))

By inverting the regularized incomplete beta function, one can calculate what lead is necessary for 95% confidence. But that’s hard. Here’s a table to make things simpler:

N	MoE	Δ	Δ/MoE
100	10.0%	16.36%	1.6362
200	7.07%	11.60%	1.6402
500	4.47%	7.35%	1.6431
1000	3.16%	5.20%	1.6438
1500	2.58%	4.25%	1.6443
2000	2.24%	3.68%	1.6444
2500	2.00%	3.29%	1.6445
3000	1.83%	3.00%	1.6445

The ratio Δ/MoE; quickly approaches a constant, somewhere around 1.644. Hence the rule I mentioned at the beginning of the post. If a candidate is ahead by more than about 1.6 times the sampling error, that corresponds to 95% confidence. If the lead is equal to the sampling error, this corresponds to about 85% confidence. A lead of half the sampling error corresponds to about 70% confidence.

Living with probability 1-ln(2)

danvk — Tue, 16 Oct 2007 03:35:31 +0000

I learned a great brainteaser while hiking in Yosemite with many a Googler this weekend. It’s from a collection called “Seven Puzzles You Think You Must Not Have Heard Correctly” (PDF), and that title is particularly apt for this one.

Here’s the question:

The names of 100 prisoners are placed in 100 wooden boxes, one name to a box, and the boxes are lined up on a table in a room. One by one, the prisoners are led into the room; each may look in at most 50 boxes, but must leave the room exactly as he found it and is permitted no further communication with the others.

The prisoners have a chance to plot their strategy in advance, and they are going to need it, because unless every single prisoner finds his own name all will subsequently be executed.

Find a strategy for them which which has probability of success exceeding 30%.

Since each prisoner has a 50/50 shot at finding his own number, you’d expect their chances of survival to be (1/2)^100, which is tiny! But that’s not quite true. Here’s the reason it’s not impossible: the number of prisoners you expect to find their own number will always be 50 no matter what your strategy. But your strategy can skew the distribution. For example, if everyone looks at the same fifty lockers, you’ll always have exactly 50 prisoners find their own numbers. The problem is to find a strategy that skews the distribution towards the extremes: either it succeeds spectacularly or fails miserably.

There’s a solution in that link above, and I’ll post my own in a few days. Give it a shot!

How many Boggle boards are there?

danvk — Fri, 03 Aug 2007 03:45:32 +0000

I’ve taken a several months break from my Boggle series, mostly because I think everyone got tired of it. I’m going to come back to it, but hopefully at a slower pace this time around.

Last time, we completed a board solver that could score 10,000 boards per second. So what to do with all that power? One option was to compute statistics on random boards. Another is to hunt for the holy grail of Boggle: the highest possible scoring Boggle board. The next few posts will be devoted to the search for this board.

Before undertaking any search, you need to get a feel for your search space. In our case, that’s the set of all 4×4 Boggle boards. How many are there? We can do a few back-of-the-envelope calculations.

To create a board, you roll 16 dice. Each has six possible letters on it, which gives 6^16 possibilities. These dice land in some permutation on the board, which gives another factor of 16!. Finally, a 4×4 board has eight-fold symmetry, which takes us down to 6^16 * 16! / 8 = 7.3e24 ≈ 2 ^ 83.

That’s one upper bound. But it assumed that all 6*16 = 96 symbols on the dice were unique. Obviously they’re not. After a roll, each of the 16 squares will have one of 26 letters on it. Divide by the symmetries, and you get 26 ^ 16 / 8 = 5e21 ≈ 2^72. Much better!

I haven’t been able to come up with any better upper bounds than these. The main flaw in the second approximation is that not all boards can be rolled with the sixteen dice that Hasbro provides. A board of all z’s or qu’s simply can’t occur. If we knew the probability that any sixteen character sequence could be rolled, this would give a true approximation of the number of distinct boards.

The easiest way to do this is with a computer program. It picks a random sequence of sixteen characters, checks whether this board can be rolled, and repeats several thousand times. I believe that checking whether a given board can be rolled is NP-Complete, but in this case the greedy approximation works quite well. I wrote a program (leave a comment if you want to see it) to do this, and processed one million 16-character sequences. Only 84,492 could be represented with the usual dice, or 8.4%. This gives a total of

(26 ^ 16 / 8) * (84,492 / 1,000,000) = 4.6e20 ≈ 2^69.

If you like confidence intervals, my sample size of one million boards gives a 95% confidence interval of [4.545e24, 4.666e24] for the total number of boards. Pretty good.

So, assuming we could enumerate all these boards quickly, how long would it take for our faster solver to find the best board? At 10k boards/sec, we’re looking at 4.5e16 seconds = 1.9 billion years! Clearly we need to do better.

Dan fights Bad Math!

danvk — Mon, 16 Jul 2007 06:07:16 +0000

Inspired by the recent Laffer curve fuss, I’ll be doing my MarkCC imitation tonight, calling out the scourge of bad math wherever it rears its ugly head.

Case number 1: I’m on something of an Alex Haley kick, so I’ve been reading The Autobiography of Malcolm X. Here’s a passage that jumped out at me:

Here is an example: a British anthropologist named Dr. Louis S. B. Leakey is displaying some fossil bones—a foot, part of a hand, some jaws, and skull fragments. On the basis of these, Dr. Leakey has said it’s time to rewrite completely the history of man’s origin.
This species of man lived 1,818,036 years before Christ. And these bones were found in Tanganyika. In the Black Continent.¹

How good of Dr. Leakey to be so precise! And here I thought radio-carbon dating was only accurate to about one part in a hundred at best. What year was this written in again? 1964? Well, 1,818,036 + 1,964 = 1,820,000. Something makes me think Alex never learned about significant figures..

Case number 2: This one’s an exercise for the reader, since my tongue is tied. According to The Economist’s Technology Quarterly,

The average installed storage capacity of a Fortune 1000 company grew from 198 terabytes in early 2005 to 680 terabytes in October 2006, according to figures from Deloitte, a consultancy.²

What was different about the Fortune 1000 in early 2005 and October 2006? Why might this claim by incredibly misleading?