
Refactor benchmark tools for statistical significance

Checklist
  • tests and code linting passes
  • a test and/or benchmark is included
  • documentation is changed or added
  • the commit message follows commit guidelines
Affected core subsystem(s)

benchmark

Description of change

I have been rather confused about the benchmark suite and I don't think it is as user friendly as the rest of node core. This PR attempts to remove most of the confusion I faced when I started using it. Primarily it:

  • removes unused/undocumented files.
  • allows partially setting the benchmark variables using process arguments (see the example below).
  • refactors compare.js such that comparing node versions and getting statistical significance is easy.
  • refactors the plot.R tool (now called scatter) to show a scatter plot with confidence bars.
  • refactors the cli tools such that the cli API is more homogeneous.
  • documents all the tools.
  • removes the implicit process.exit(0) after bench.end().
  • uses process.send to avoid most parsing (the benchmark cli arguments haven't changed).
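For example, only the variables given on the command line are pinned, while any unspecified variables keep the values configured in the benchmark file. A sketch, using the timers/timers.js benchmark and the type=depth value discussed further down:

    node benchmark/timers/timers.js type=depth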

The specifics are documented in the commit messages. Please also see the new README, as quite a lot has changed (be sure to check my spelling!).

Note that some benchmarks take a very long time to complete, e.g. timers/timers.js with type=depth thousands=500 takes 11.25 min. Thus running it 30 times for statistical significance is unreasonable. I suspect the only reason it is set to so many iterations is to get a small variance, but with the new compare tool the variance can be estimated instead of being reduced. Thus we can reduce the number of iterations and still get the information we need. But I suggest we do that in another pull request, as it is a very different discussion.

Motivation (long story): I wanted to benchmark the effect of some async_wrap changes. I went to the benchmark/ directory and read the README. However, I quickly discovered that it primarily covered running benchmarks a single time and how to write benchmarks; most importantly, it didn't explain how to compare two node versions. This is now documented in the new README.

I then had to search for the tools myself and discovered the large number of benchmark files which were not put into categorized directories. I assumed they were somehow extra significant, but in reality they just appear to be unused. These files are now removed.

After discovering the compare tool, which had the cli API:

    node benchmark/compare.js
        <node-binary1> <node-binary2>
        [--html] [--red|-r] [--green|-g]
        [-- <type> [testFilter]]

I was confused about what --red and --green were, and about how node-binary1 and node-binary2 were compared: should I write ./node-old ./node-new or ./node-new ./node-old if I wanted a positive factor to signify an improvement? The new compare API is:

usage: ./node benchmark/compare.js <type> ...
  --new    ./new-node-binary  new node binary (required)
  --old    ./old-node-binary  old node binary (required)
  --runs   30                 number of samples
  --filter pattern            string to filter benchmark scripts
  --var    variable=value     set benchmark variable (can be repeated)

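A comparison run with the new tool could then look something like this (the binary paths, the buffers type, the filter pattern and the output file name are all illustrative; the CSV output goes to stdout as described below):

    ./node benchmark/compare.js --new ./node-new --old ./node-old --runs 30 --filter read buffers > compare.csv
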
After understanding common.js, it was still unclear whether the performance difference was statistically significant. I tried running the benchmark 5 times and got an improvement in 4 of the 5 runs, even though I was expecting the same performance or slower (spoiler: the difference wasn't significant). The compare.js script now runs the benchmarks many times (30 by default), and there is an R script to analyse the csv results.
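
The CSV can then be fed to the accompanying R script. A minimal sketch, assuming the script is benchmark/compare.R and the CSV from the previous step was saved as compare.csv (both file names are illustrative):

    cat compare.csv | Rscript benchmark/compare.R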

At this point I wanted to do a rewrite of the benchmark tools (not the benchmarks themselves) and changed a few other things in the process as well. I'm a mathematician, so I care a lot about statistical significance :)
