12.26.09

Crossword Word Frequency

Posted in math, programming at 10:45 am by danvk

In a previous post, I discussed downloading several years’ worth of New York Times Crosswords and categorizing them by day of week. Now, some analysis!

Here are the 25 most common words over the last 12 years, along with the percentage of puzzles in which each one occurred:

Percentage Word Length
6.218% ERA 3
5.703% AREA 4
5.413% ERE 3
5.055% ELI 3
4.854% ONE 3
4.585% ALE 3
4.496% ORE 3
4.361% ERIE 4
4.339% ALOE 4
4.317% ETA 3
4.317% ALI 3
4.227% OLE 3
4.205% ARE 3
4.138% ESS 3
4.138% EDEN 4
4.138% ATE 3
4.048% IRE 3
4.048% ARIA 4
4.004% ANTE 4
3.936% ESE 3
3.936% ENE 3
3.914% ADO 3
3.869% ELSE 4
3.825% NEE 3
3.758% ACE 3


So “ERA” appears, on average, in about 23 puzzles per year. How about if we break this down by day of week? Follow me past the fold…
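For the curious, here is a rough sketch of how a percentage like this can be computed. It is not the program I used; the function name `document_frequency` and the toy puzzles are made up for illustration, and it assumes each puzzle has already been reduced to a list of its answer words.

```python
from collections import Counter


def document_frequency(puzzles):
    """Return {word: fraction of puzzles whose fill contains the word}.

    `puzzles` is a list of word lists, one list per puzzle (an assumed
    input format, not the real puzzle file format).
    """
    counts = Counter()
    for words in puzzles:
        # A word counts at most once per puzzle, however often it repeats.
        counts.update(set(words))
    total = len(puzzles)
    return {word: n / total for word, n in counts.items()}


# Toy data for illustration only; these are not real puzzles.
puzzles = [
    ["ERA", "AREA", "ALOE"],
    ["ERA", "ERA", "OREO"],
    ["ERIE", "AREA"],
    ["ERA", "ONE"],
]

freq = document_frequency(puzzles)
# Sort by descending frequency, breaking ties alphabetically.
for word, f in sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))[:3]:
    print(f"{100 * f:.3f}% {word} {len(word)}")

# With one puzzle per day, 6.218% of puzzles works out to about
# 0.06218 * 365 puzzles per year.
era_per_year = round(0.06218 * 365)
print(era_per_year)  # 23
```

Counting each word once per puzzle is what makes this a "percentage of puzzles" rather than a raw word count.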


Monday:

Percentage Word Length
9.404% ALOE 4
8.777% AREA 4
7.837% ERIE 4
6.426% ONE 3
6.426% IDEA 4
6.426% ARIA 4
6.270% ONCE 4
6.270% EDEN 4
6.113% ERA 3
6.113% ELSE 4
6.113% ASEA 4
5.799% ERE 3
5.643% ORE 3
5.643% ETAL 4
5.643% ARE 3
5.643% ANTE 4
5.486% OREO 4
5.486% ALEE 4
5.329% TREE 4
5.329% ESS 3
5.329% ELI 3
5.329% ACRE 4
5.172% TSAR 4
5.172% ANTI 4
5.016% ORAL 4

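Four-letter words are much more common here, and the percentages are noticeably higher: the top Monday word appears in 9.4% of Monday puzzles, versus 6.2% for the top word overall. In other words, there’s less variety in the fill of Monday puzzles. “ALOE” and “ARIA” are classic crossword words, not to mention “OREO”.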

Saturday:

Percentage Word Length
3.286% ERA 3
2.973% ONE 3
2.973% ETE 3
2.817% TEN 3
2.817% EVE 3
2.817% ETA 3
2.660% IRE 3
2.660% ERR 3
2.660% ERE 3
2.504% OTIS 4
2.504% OLE 3
2.504% ENE 3
2.504% ELL 3
2.504% ELI 3
2.504% ARE 3
2.504% ARA 3
2.504% ALA 3
2.504% ACE 3
2.347% RTE 3
2.347% ICE 3
2.347% ATE 3
2.347% ALE 3
2.191% TSE 3
2.191% TERSE 5
2.191% SRI 3

Lots of three-letter words and much lower percentages: the top Saturday word, “ERA”, appears in only 3.3% of Saturday puzzles. “OTIS” is surprising to me, but I don’t do many Saturday puzzles, so who am I to say?
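The per-day tables use the same calculation, just restricted to puzzles published on a given weekday. A minimal sketch of that grouping, assuming each puzzle comes as a `(date, words)` pair; the name `frequency_by_weekday` and the sample data are hypothetical, not taken from my actual program:

```python
from collections import Counter, defaultdict
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def frequency_by_weekday(dated_puzzles):
    """Map weekday name -> {word: fraction of that day's puzzles containing it}."""
    counts = defaultdict(Counter)
    totals = Counter()
    for day, words in dated_puzzles:
        name = WEEKDAYS[day.weekday()]  # date.weekday(): Monday is 0
        totals[name] += 1
        counts[name].update(set(words))  # once per puzzle
    return {
        name: {word: n / totals[name] for word, n in c.items()}
        for name, c in counts.items()
    }


# Toy data for illustration only; the word lists are invented.
dated_puzzles = [
    (date(2009, 1, 12), ["ALOE", "AREA"]),  # a Monday
    (date(2009, 1, 19), ["ALOE", "OREO"]),  # a Monday
    (date(2009, 1, 17), ["ERA", "OTIS"]),   # a Saturday
]

by_day = frequency_by_weekday(dated_puzzles)
print(by_day["Monday"]["ALOE"])
print(by_day["Saturday"]["OTIS"])
```

The denominator is the number of puzzles for that weekday, not the whole collection, which is why a Monday percentage can be higher than the overall one for the same word.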

It would be really interesting to combine this with some document frequency numbers for the English language. This would find words which are much more common in crosswords than they are in general, i.e. crosswordese.
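One simple way to do this would be to divide each word's crossword frequency by its frequency in general English text and sort by the ratio. The sketch below is only an illustration of that idea: the crossword numbers come from the overall table above, but the English frequencies are invented placeholders, and `crosswordese_scores` and its `floor` parameter are hypothetical names.

```python
def crosswordese_scores(crossword_freq, english_freq, floor=1e-6):
    """Rank words by (crossword frequency) / (general English frequency).

    `floor` keeps the ratio finite for words missing from the English
    table; its value here is an arbitrary assumption.
    """
    scores = {
        word: cf / max(english_freq.get(word, 0.0), floor)
        for word, cf in crossword_freq.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Crossword document frequencies, from the overall table above.
crossword_freq = {"ALOE": 0.04339, "ERIE": 0.04361, "ONE": 0.04854}

# Placeholder English document frequencies: NOT real measurements,
# just plausible-looking numbers so the example runs.
english_freq = {"ALOE": 0.001, "ERIE": 0.002, "ONE": 0.9}

ranked = crosswordese_scores(crossword_freq, english_freq)
for word, score in ranked:
    print(f"{word}: {score:.2f}")
```

With real frequency data, words like “ONE” that are common everywhere should fall to the bottom, leaving the true crosswordese at the top.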

I’d include everything necessary to reproduce this here, but the puzzles are not free. See this directory for the program I used to tabulate the statistics and complete word counts, both overall and for each day of the week. The first puzzle in my collection was 1996-10-23 and the last was 2009-01-19.

2 Comments

  1. Mom said,

    December 26, 2009 at 11:24 am

    I enjoyed reading this Dan. Hope Rex Parker gets ahold of it!

  2. Pam D'Angelo (Ben's mom) said,

    January 7, 2010 at 1:41 pm

    What fun to see the numbers behind what I do every day. Thanks for an enjoyable, enlightening post.