Skip to content

test: avoid test-cluster-master-kill flakiness

Checklist
  • tests and code linting passes
  • a test and/or benchmark is included
  • the commit message follows commit guidelines
Affected core subsystem(s)

test, cluster

Description of change

I've observed that test-cluster-master-kill fails intermittently on an AIX 6.1 machine (oslevel 6100-07-08-1339) due to timeout before worker termination. There was a previous PR (https://github.com/nodejs/node-v0.x-archive/pull/9431) which arbitrarily increased the timeout for AIX to 1 second, however this value is still a guess and appears to be insufficient. It's also worth noting the arbitrary timeouts have also caused problems for other platforms, see https://github.com/nodejs/node/pull/2891#issuecomment-179158473. Arbitrary timeouts cannot compensate for external factors such as system load.

In this PR I propose removing the timeout mechanism entirely, here is how the test currently works:

  1. Fork a master process from the test harness.
  2. The master uses cluster.fork to create a worker.
  3. The worker starts an HTTP server.
  4. The master will kill itself with process.exit.
  5. The worker will then also call process.exit as part of its IPC pipe disconnect handler. (*)
  6. The test harness will wait an arbitrary amount of time after the master's exit to check for the child's liveness, if it is still alive the test fails.

(*) Without this mechanism the worker would become an orphan child of init.

Step 6 is inherently flaky. The test was originally added as part of https://github.com/nodejs/node-v0.x-archive/commit/94d337eb0fbf78bda2af6938b4a65d3c94808caf#diff-0faa53fc02580d5de2ebb484c41d691cR498 where it specifically tested the new disconnect pathway.

The test boils down to the following actions:

  1. Cause the worker's disconnect handler to be run by calling exit(0) in the cluster master.
  2. Insure that the disconnect handler also exits the worker process.

Since step 2 does not actually require step 1, I propose the following alternate flow:

  1. Fork a master process from the test harness.
  2. The master uses cluster.fork to create a worker.
  3. The worker starts an HTTP server.
  4. The master kills the IPC pipe (note this not the same as using the graceful cluster shutdown API in https://github.com/nodejs/node-v0.x-archive/pull/2740/commits/6c383c61611fb78d0961a332c1ad55d1d5514ba6) which causes the child to disconnect itself. Ungracefully closing the IPC pipe causes the child to execute the same disconnect handler it executes when the master calls exit(0).
  5. The master waits for the child's exit event.
  6. The test harness checks for the child's liveness after the master's exit event.

With this setup there is no need for the arbitrary wait time. The obvious problem is if the child never exits the test will hang - however in that case the test will still be killed by the test harness's global timeout.

I do believe a timeout mechanism is useful for detecting liveness issues, but the presence of arbitrary timeouts in the tests themselves should be minimized, the single timeout in the test harness suffices.

Any comments are appreciated, especially from @AndreasMadsen.

Merge request reports

Loading