Failing Quickly When Testing For Performance

Posted by ryan Fri, 24 Nov 2006 05:46:00 GMT

I was working with an algorithm today that I discovered had a bug that caused it to run for an unacceptable amount of time, hogging a lot of system resources in the process. Whenever I find a bug in a piece of code I'm working on, I write a failing unit test for it that defines the correct behavior. For this algorithm, I needed to define what an "acceptable amount of time" was and test for that level of performance in a way that would give consistent results across multiple computers with differing resource loads and load fluctuations. I also needed to ensure that the test would fail as quickly as possible if the algorithm did not perform as desired.

The method containing the algorithm takes a string parameter such as "1-4, 23, 50-52", specified as user input and representing a range of numbers. It then generates an array of numbers; for the string just mentioned, the array would contain the numbers 1, 2, 3, 4, 23, 50, 51, and 52. The method also takes an optional parameter for the maximum count of numbers it is acceptable to generate, since generating an array containing every number for a range string like "1-9999999999999" would send the generating system into epileptic fits, complete with bus lines frothing. As you may have guessed, this was where the problem was: the method in question generated all of the numbers in the specified range string, and only then checked whether the count of numbers generated exceeded the specified maximum.
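The generator class itself isn't shown here, but a minimal sketch of the buggy version might look something like the following. The parsing details, and the decision to return nil when the limit is exceeded, are my assumptions; only the class name, the constructor arguments, and the numbers method come from the tests below.

class NumberGenerator
  def initialize(range_string, result_size_limit = nil)
    @range_string = range_string
    @result_size_limit = result_size_limit
  end

  # Buggy version: expands every range completely, and only then checks
  # whether the result exceeds the limit.
  def numbers
    result = []
    @range_string.split(",").each do |part|
      first, last = part.strip.split("-").map { |s| s.to_i }
      last ||= first
      (first..last).each { |n| result << n }
    end
    return nil if @result_size_limit && result.size > @result_size_limit
    result
  end
end

The fix, of course, is to check the running count against the limit inside the generation loop and bail out as soon as it is exceeded; that is the behavior the test below tries to pin down.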

For my test, I needed to define an acceptable response time for a given maximum size of the generated array of numbers. It seems to me that the algorithm should take the same amount of time to complete for a range of 10 numbers with a maximum result size of 5 as it does for a range of 10 million, billion, or squigillion numbers with the same result size: as soon as the algorithm determines that the given range will exceed the maximum, it should stop. The challenge is that different computers will take different amounts of time to reach the maximum, so a reasonably accurate, system-specific timing expectation needed to be calculated.

For this purpose, I wrote a method that determines the range of acceptable response times for the algorithm, given a desired number count, maximum result size, and the number of sample timings to make, since timings will differ slightly from one invocation to another.

require 'date'

def acceptable_timing(number_count, result_size_limit, sample_count=10)
  timings = []

  sample_count.times do
    generator = NumberGenerator.new("1-#{number_count}", result_size_limit)
    start_time = DateTime.now
    generator.numbers
    end_time = DateTime.now
    timings << end_time - start_time # elapsed time, as a fraction of a day
  end

  # Acceptable range: zero up to the mean plus one standard deviation
  0.0..average(timings) + standard_deviation(timings)
end
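The average and standard_deviation helpers aren't shown above; minimal versions might look like this (my own sketch, accumulating into a Float so the Rational date arithmetic comes out as an ordinary number):

def average(values)
  values.inject(0.0) { |sum, v| sum + v } / values.size
end

def standard_deviation(values)
  mean = average(values)
  variance = values.inject(0.0) { |sum, v| sum + (v - mean) ** 2 } / values.size
  Math.sqrt(variance)
end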

The next challenge was testing the numbers method with a range string that represents a large set of numbers, while using the same result_size_limit that was used in the call to acceptable_timing. I decided that a range of 9999999 numbers was sufficiently large to demonstrate whether the timing was acceptable; after all, it should take the same amount of time with the same result size limit as if I were to use 100 numbers, right? The problem with a set of 9999999 numbers, though, is that with the bug present, the test would hang for an extremely long time and hog a lot of system resources. We want our tests to fail as fast as possible and give a useful error message if and when that failure occurs.

To ensure that the test fails fast, I decided to launch a separate thread to call the method under test so that I can stop it as soon as it's determined that it's taking longer than the acceptable amount of time to return.

def completes_within?(threshold, &block)
  start_time = DateTime.now
  thread = Thread.new(&block)
  loop do
    unless threshold.include?(DateTime.now - start_time)
      thread.kill
      return false
    end
    return true unless thread.alive? # the block has finished running
    sleep 0.001                      # don't peg the CPU while waiting
  end
ensure
  thread.join
end

And finally, the test:

def test_numbers_fails_fast_when_result_size_limit_exceeded
  range_size = 9999999
  result_size_limit = 5
  generator = NumberGenerator.new("1-#{range_size}", result_size_limit)

  acceptable_amount_of_time = acceptable_timing(100, result_size_limit)

  assert_equal true,
    completes_within?(acceptable_amount_of_time) { generator.numbers },
    "Exceeded acceptable time to determine that a range of #{range_size} " \
    "numbers exceeds the limit of #{result_size_limit}"
end

I considered using a range size smaller than 9999999 to avoid the threading and keep the solution simpler. My reasoning for not doing so: any smaller number would still have to be sufficiently larger than the range size I used to determine the acceptable amount of time, because the larger range size is what gives me confidence that a failed timing indicates a real bug rather than a resource spike on the computer running the test. And if I have to pick a large number anyway, a buggy run will take that much longer to fail, which violates the idea of fail-fast testing. I might as well just abort the method as soon as I know it's going to take too long.

To further improve the reliability of this test, the completes_within? method could be called multiple times and, if a success is ever achieved, the test passes. However, this would make the test run longer, so the choice of whether to use it or not should depend on the variation in resource load that is expected amongst the computers that will be running the tests. If the tests are running on a dedicated machine, this technique probably wouldn't be needed.
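As a sketch, that retry wrapper could be as simple as the following (the method name and the attempt count are my own invention):

def eventually_completes_within?(threshold, attempts = 3, &block)
  attempts.times do
    return true if completes_within?(threshold, &block)
  end
  false
end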

To gain full confidence that the test will never fail spuriously, the structure of the code could be modified so that the test can observe whether the algorithm considers the result limit while it generates the numbers, or only afterwards, as the buggy version of the algorithm did. The tradeoff is that a certain amount of the algorithm's logic must be externalized so that the necessary assertions can be set up in the test. This makes the algorithm itself less adaptable to change, since the test would then be verifying not only the results but also the way the algorithm works, and some otherwise-valid changes could make it fail inappropriately.
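For example, if the generator were changed to yield each number as it is produced (a hypothetical hook, not part of the code above), the test could count how many numbers were actually generated and assert that generation stopped as soon as the limit was hit, with no timing involved at all:

def test_numbers_stops_generating_once_limit_exceeded
  result_size_limit = 5
  generator = NumberGenerator.new("1-9999999", result_size_limit)

  # Hypothetical hook: the generator yields each number as it produces it.
  generated_count = 0
  generator.numbers { |n| generated_count += 1 }

  # A correct implementation notices the limit on the very next number
  # and stops, instead of expanding the whole range first.
  assert generated_count <= result_size_limit + 1,
    "Generated #{generated_count} numbers before honoring the limit"
end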