01.31.07

Reading a dictionary into a hash in Ruby

Posted in programming at 12:15 am by danvk

My experiment with code golf the other day forced me to think about ways to load a dictionary into a hash using only one line of code. I couldn’t think of a way then, but today I came up with this:

h = Hash[*(IO.readlines(Dict).map{|w|[w.chomp,1]}.flatten)]

Slow as hell, but it works! With little searching, I found another method, which is much better:

h = IO.readlines(Dict).inject({}) {|h,w| h[w.chop!]=1; h }

The “;” in there pushes the meaning of “one line”, but change “1; h” to “1 and h” and it’s less ambiguous. That got me thinking about all the different ways to load a dictionary into a hash. Which one is the fastest? Here’s all the ones I could think of, sorted from fastest to slowest. They were tested with Dict="/usr/share/dict/words":

Time Command
0.443s h = {}; open(Dict).read.split.each { |w| h[w]=1 }
0.448s h = {}; IO.read(Dict).split.each { |w| h[w]=1 }
0.555s h = {}; IO.readlines(Dict).map { |w| h[w.chop!] = 1 }
0.568s h = {}; IO.readlines(Dict).map { |w| h[w.chomp!] = 1 }
0.882s h = {}; IO.readlines(Dict).map { |w| h[w.chop] = 1 }
0.897s h = {}; IO.readlines(Dict).map { |w| h[w.chomp] = 1 }
0.958s h = {}; IO.foreach(Dict) { |w| h[w.chop!] = 1 }
1.064s h = open(Dict).read.split.inject({}) {|h,w| h[w.chop!]=1; h }
1.065s h = IO.read(Dict).split.inject({}) {|h,w| h[w.chop!]=1; h }
1.306s h = {}; IO.foreach(Dict) { |w| h[w.chomp] = 1 }
1.475s h = {}; open(Dict).read.scan(/\w+/) { |w| h[w]=1 }
1.479s h = {}; IO.read(Dict).scan(/\w+/) { |w| h[w]=1 }
1.172s h = IO.readlines(Dict).inject({}) {|h,w| h[w.chomp]=1; h }
26.022s h = Hash[*(IO.readlines(Dict).map{|w|[w.chomp,1]}.flatten)]

There are some clear trends here. First and foremost, it pays to eschew line-oriented processing. The read method slurps in the whole string and then split segments it into words. I suspect that split with no parameter (=split on whitespace) is so common that it’s special-cased in the Ruby interpreter. A little digging into the Ruby C source confirms that guess. The relevant stuff is the awk_split stuff starting at line 3514. Just for the record, the Ruby C source is some of the cleanest I’ve ever seen.

Another clear trend: using the self-modifying chomp! and chop! is much faster than chomp and chop. This makes a lot of sense, since Ruby doesn’t need to create bunches of new strings.

So in conclusion, the fastest technique is:

h = {}; open(Dict).read.split.each { |w| h[w]=1 }

and the fastest one-liner is:

h = open(Dict).read.split.inject({}) {|h,w| h[w.chop!]=1; h }

I suspect that a solution using scan could win if Ruby had a special case for whitespace in that routine like it does in split.

Comments are closed.