More than one way to get it wrong…
One of the features that Ruby is known for is the multitude of different ways that you can accomplish the same thing. This is in direct contrast with Python’s “there should be one, and preferably only one way to do it” mantra, and gives Ruby perhaps a hint more of an aesthetic quality than other languages. Want to replace all of the values in an Array using a function? Well then, you should use map…or was that collect? Why not pick the shorter one? Or the one that reads better? Or the one that rhymes?
Ruby’s multitude of ways don’t always boil down to using different names for the same method. Often times you can achieve the same end result in Ruby using wildly different approaches. Sometimes these approaches will be mostly equivalent in terms of performance and memory use. Most of the time, they won’t be. It is also not unreasonable to expect that the approach that runs the fastest in one situation might not be the fastest in another. For this reason, it is doubly important for Rubyists to become comfortable with the concept and the practice of benchmarking.
Getting to Know Your Tools
Luckily, Ruby’s standard library ships with just the thing you need in order to benchmark your code. Appropriately enough, this library is called Benchmark. Since it is part of the standard library, and not the core, you will still have to require it before you can use it. Here’s a simple example of timing how long Ruby takes to add all the numbers from 1 to 1,000,000:
require 'benchmark'
result = Benchmark.measure do
sum = 0
1.upto(1_000_000) { |i| sum += i }
end
puts result
Running this script gives us:
> ruby count.rb 0.090000 0.000000 0.090000 ( 0.094249 )
Great! So, what do those numbers mean?
The Benchmark library uses a convention of reporting 4 different timing numbers (always reported in seconds): user, system, total, and real. The user time is the time that your Ruby program spends in userland, which is to say time spent not doing kernel level stuff. System time, then, is the time that your Ruby program spends in the kernel doing things such as switching threads, acquiring locks, and any input/output coordination. Total is just the sum of these two numbers.
Real time is a bit interesting and definitely something we should pay attention to when benchmarking. It is sometimes called “wall clock” time and represents the actual time that it took the program to run start to finish, whereas the “total” time represents only the amount of time that the code spent processing (time “on the chip”). These two numbers can differ for a couple of reasons.
Consider that you’re computer is probably running quite a few other programs in addition to the benchmark. In the course of running the Ruby program, your machine might become distracted and decide to run another process. The time spent running any other process does not count as running time for your Ruby program, but it does add to the real time it takes that program to run. So, can this explain the 0.004249 seconds difference we see here?
One way to get a better idea of how much context switching affects our run time is to use the -l option of the time command. Many shells contain a built-in version of time that is not as capable as /usr/bin/time, so we must be sure to use the full path. Here, then, is what it will tell us:
> /usr/bin/time -l ruby count.rb
0.090000 0.000000 0.090000 ( 0.097167 )
0.18 real 0.14 user 0.03 sys
4505600 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
10280 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
17 signals received
44 voluntary context switches
23 involuntary context switches
Wow! Our machine switched away from Ruby a total of 67 times while running this tiny little benchmark. Really, if you consider all that your computer is doing at any given time, 67 context switches doesn’t seem like a whole lot, but it likely explains at least some of the discrepancy between the real and total time measurements.
This leaves only one piece of our benchmark output slightly mysterious: system time. So far, every time we’ve run the benchmark the system time has been 0 s. Really, though, this shouldn’t be all that surprising. Up to this point our benchmark has just been doing simple math. The kernel doesn’t need to get involved until we start doing something more complicated, like allocating memory.
Consider this variation on our benchmark:
require 'benchmark'
result = Benchmark.measure do
str = ''
1.upto(100_000) { str += 'a' }
end
puts result
Running this, we now get:
> ruby count.rb 2.480000 1.580000 4.060000 ( 4.066068 )
Ah! Now we have a system time to report. This is because the String#+= function creates a new object each time (which, consequently, is also why you should avoid using it for appending strings to strings if possible). When the Ruby interpreter starts, it requests a chunk of memory from the operating system that it will then use to create the initial batch of objects it needs as it begins running a program.
Here, though, our benchmark program has exhausted this initial allocation, and the Ruby interpreter needs to go back to the operating system and ask for more memory. Since memory allocation requires the participation of the kernel, this work is now reflected in the system time reported. (For bonus credit, try re-running this benchmark with different values for the #upto call and watch what happens to the values reported.)
The Global VM Lock and You
One other reason that real and total times might differ is if your program is utilizing more than one processor or core. Again, total time is “time on chip” so if you have 4 chips, then it is conceivable that every second of real time could correspond to 4 seconds of total time. In practice, you’ll almost never get such complete parallelism, but paying attention to the difference between real and total time in a multi-threaded application can give you a concrete idea of just how well your program is parallelizing its work. We can alter our little benchmark to utilize threads to see this effect in practice:
require 'benchmark'
result = Benchmark.measure do
sum = 0
(1..4).map {
Thread.new do
1.upto(1_000_000) { |i| sum += i }
end
}.each(&:join)
end
puts result
Here I’m (ab)using Ruby’s Array#map function to create 4 threads, each of which will add up the numbers from 1 to 1,000,000, and then wait for them all to finish. If we run this now, we’ll get:
> ruby count.rb 0.410000 0.000000 0.410000 ( 0.401647 )
The machine I ran this benchmark on has 4 cores. In theory if the total time required for this task is 0.41 s and Ruby can achieve perfect parallelism, then the real time could be as little as 0.1025 s. Instead, we see that the real time is just about 0.402 s. That is less than the total time, indicating that we are achieving at least some parallelism, but nowhere near our theoretical best-case scenario.
In order to understand what’s happening here, let’s try another slightly different experiment. This time, instead of adding up numbers, we’ll write to and read from some files…with a little twist. Instead of using the blocking IO methods in Ruby that one would traditionally use, we’ll let Ruby do non-blocking IO.
require 'benchmark'
result = Benchmark.measure do
(1..4).map { |i|
if i.odd?
Thread.new do
f = File.open("tmp#{i}.txt", 'w')
1.upto(100_000) do
f.write_nonblock('a' * 100) rescue nil
end
end
else
Thread.new do
f = File.open("tmp#{i - 1}.txt", 'r')
1.upto(100_000) do
f.read_nonblock(100) rescue nil
end
end
end
}.each(&:join)
end
puts result
Non-blocking IO is tricky to get right. There are a number of contingencies and exceptional cases that you need to handle, but as this is just for illustrative purposes, we’re simply silencing exceptions with a rescue nil statement. Don’t do this in production code!
Here, then, is what running this benchmark gives us:
> ruby count.rb 0.790000 1.550000 2.340000 ( 1.850699)
Excellent! We still don’t have perfect parallelism, but it’s clear that the real time (1.85 s) is significantly less than the total time (2.34 s). So why does non-blocking IO do so much better parallelizing than adding numbers?
You may have heard Rubyists talk about a Global Interpreter Lock (GIL) or, by the name that Matz gives it, a Global VM Lock (GVL). The effect of the GVL is that even if you have multiple Ruby threads running at the same time, only one line of Ruby code can be executing at a given time across all running threads. In the benchmark where we have Ruby adding numbers, the time it takes to perform the actual addition is tiny compared to the time it takes the Ruby runtime to interpret and execute the addition expression. Since that interpretation step is subject to the GVL, we won’t get much parallelism at all.
There are a few ways to get around the limitation of the GVL in Ruby, but by far the most common is to use non-blocking IO. When Ruby encounters a non-blocking IO operation, it will release the GVL while the operation is executing, allowing other threads to execute other lines of Ruby code. This explains why the second of our “parallel” benchmarks is able to achieve far more real parallelism than the one performing addition.
Threading is Never Easy
The other way to get around the GVL is to simply use a Ruby implementation that does not have a GVL, such as JRuby. If we revisit our threaded summation benchmark from before, but this time run it with JRuby, we can begin to really appreciate the difference a GVL can make. In fact, the multithreaded performance of JRuby 1.7.0 was so outstanding that I had to increase the number of inner-loop iterations to 10,000,000 just to get benchmark numbers that made sense. Here is what running that benchmark looks like:
> jruby count.rb 9.320000 0.130000 9.450000 ( 2.374000)
WOW! The total time for 4 threads adding up all the numbers from 1 to 10,000,000 was 9.45 s, meaning that perfect 4-way parallelism would give a real time of 2.36 s, and here JRuby gives us a real time of 2.37 s! (Running the original benchmark that only added the numbers from 1 to 1,000,000 actually gave results that indicated better than perfect parallelism. I can only assume such results indicate either subtle timing inaccuracies for such a quick benchmark, or that JRuby’s optimizations are skewing the true total time.)
Such ideal parallelism seems almost too good to be true. That’s because it is, in a sense. Let’s add a tad more reporting to our summation benchmark. This time, we’re going to output not only the timing results, but also the final sum:
require 'benchmark'
sum = 0
result = Benchmark.measure do
(1..4).map { |i|
Thread.new do
1.upto(10_000_000) { |i| sum += i }
end
}.map(&:join)
end
puts result
puts "Total sum is: #{sum}"
Now, running under both Ruby and JRuby reveals a problem:
> ruby count.rb 3.810000 0.000000 3.810000 ( 3.818641) Total sum is: 200000020000000 > jruby count.rb 9.460000 0.100000 9.560000 ( 2.324000) Total sum is: 71815504623895
Huh, we seem to be missing quite a bit of the sum using JRuby. As it turns out, the GVL can be a pain, but it can also be a major boon. Since the GVL will lock around each line of Ruby code that executes, we get a “free” threadsafe variable with sum. Under JRuby, where there is no GVL to protect us, there is nothing preventing from multiple threads assigning to sum simultaneously. Since each subsequent value calculated for sum depends on the previous value of sum, having more than one thread write to sum at once will generate erroneous results.
As the saying goes: with great power comes great responsibility. If we wanted to write a truly equivalent summation benchmark in JRuby, we would have to add our own locking around the assignments to sum:
require 'benchmark'
sum = 0
sum_mtx = Mutex.new
result = Benchmark.measure do
(1..4).map { |i|
Thread.new do
1.upto(1_000_000) { |i| sum_mtx.synchronize { sum += i } }
end
}.map(&:join)
end
puts result
puts "Total sum is: #{sum}"
Note that, in order to get reasonable benchmark times, I had to take the number of inner loop iterations back down to 1,000,000. Running this modified benchmark yields:
> jruby count.rb 12.830000 46.770000 59.600000 ( 48.168605) Total sum is: 2000002000000
Ouch! It turns out that making up for the missing GVL by performing our own locking is rather expensive. It also, unsurprisingly, eliminated most (but not all) of the parallelism benefit that we had before. Of course, this only becomes a problem if you have to share state between your threads. We could just as easily have created a separate sum variable for each thread and summed across each thread’s sum at the end, avoiding the need for a lock.
Without getting too far into a discussion of the cost/benefit of having a GVL, it should be clear that raw benchmark results are just one part of the story in finding the best or fastest way of doing something. As with any metric, you must always consider the results of a benchmark in the larger context of the problem you are attempting to solve. Still, benchmarking is an important skill for any Ruby programmer, and it can help you to make decisions based on evidence, instead of whether or not the methods you call all end with the letter “t”.
This is part 1 of 3 in a series of “Prologue” blog posts leading up to a talk I will be giving at RubyConf 2012 in Denver titled “There and Back Again -or- How I Set Out to Benchmark an Algorithm and Ended Up Fixing Ruby”. Check back for the next installment where we will explore the Benchmark library in a bit more depth.

manhattanmetric
Just pushed the first of 3 “Prelude” posts leading up to my RubyConf talk: http://t.co/QrGVXBtS