01.31.07

Reading a dictionary into a hash in Ruby

Posted in programming at 12:15 am by danvk

My experiment with code golf the other day forced me to think about ways to load a dictionary into a hash using only one line of code. I couldn’t think of a way then, but today I came up with this:

h = Hash[*(IO.readlines(Dict).map{|w|[w.chomp,1]}.flatten)]

Slow as hell, but it works! With little searching, I found another method, which is much better:

h = IO.readlines(Dict).inject({}) {|h,w| h[w.chop!]=1; h }

The “;” in there pushes the meaning of “one line”, but change “1; h” to “1 and h” and it’s less ambiguous. That got me thinking about all the different ways to load a dictionary into a hash. Which one is the fastest? Here’s all the ones I could think of, sorted from fastest to slowest. They were tested with Dict="/usr/share/dict/words":

Time	Command
0.443s	`h = {}; open(Dict).read.split.each { \|w\| h[w]=1 }`
0.448s	`h = {}; IO.read(Dict).split.each { \|w\| h[w]=1 }`
0.555s	`h = {}; IO.readlines(Dict).map { \|w\| h[w.chop!] = 1 }`
0.568s	`h = {}; IO.readlines(Dict).map { \|w\| h[w.chomp!] = 1 }`
0.882s	`h = {}; IO.readlines(Dict).map { \|w\| h[w.chop] = 1 }`
0.897s	`h = {}; IO.readlines(Dict).map { \|w\| h[w.chomp] = 1 }`
0.958s	`h = {}; IO.foreach(Dict) { \|w\| h[w.chop!] = 1 }`
1.064s	`h = open(Dict).read.split.inject({}) {\|h,w\| h[w.chop!]=1; h }`
1.065s	`h = IO.read(Dict).split.inject({}) {\|h,w\| h[w.chop!]=1; h }`
1.306s	`h = {}; IO.foreach(Dict) { \|w\| h[w.chomp] = 1 }`
1.475s	`h = {}; open(Dict).read.scan(/\w+/) { \|w\| h[w]=1 }`
1.479s	`h = {}; IO.read(Dict).scan(/\w+/) { \|w\| h[w]=1 }`
1.172s	`h = IO.readlines(Dict).inject({}) {\|h,w\| h[w.chomp]=1; h }`
26.022s	`h = Hash[*(IO.readlines(Dict).map{\|w\|[w.chomp,1]}.flatten)]`

There are some clear trends here. First and foremost, it pays to eschew line-oriented processing. The read method slurps in the whole string and then split segments it into words. I suspect that split with no parameter (=split on whitespace) is so common that it’s special-cased in the Ruby interpreter. A little digging into the Ruby C source confirms that guess. The relevant stuff is the awk_split stuff starting at line 3514. Just for the record, the Ruby C source is some of the cleanest I’ve ever seen.

Another clear trend: using the self-modifying chomp! and chop! is much faster than chomp and chop. This makes a lot of sense, since Ruby doesn’t need to create bunches of new strings.

So in conclusion, the fastest technique is:

h = {}; open(Dict).read.split.each { |w| h[w]=1 }

and the fastest one-liner is:

h = open(Dict).read.split.inject({}) {|h,w| h[w.chop!]=1; h }

I suspect that a solution using scan could win if Ruby had a special case for whitespace in that routine like it does in split.

Permalink

Comments are closed.

danvk.org

01.31.07

Reading a dictionary into a hash in Ruby

Archives

Categories

Meta