Regression Testing Performance?

One of the biggest things you worry about when you develop software is that a change in one area of the codebase will have ripple effects and break something else.  This is why it’s smart to put aggressive regression testing in place, preferably automated regression testing done in conjunction with a continuous integration process.  This way, a pile of tests gets run automatically every time you commit source code changes to the repository.

At Flex, we use Subversion for source control and a continuous integration tool called Bamboo.  Every five minutes, Bamboo checks Subversion for new commits and triggers a full rebuild and retest of the codebase.  Here’s a screenshot from Bamboo for clarification:

As you can see, we have the code broken into various modules and each module has a certain number of tests.  We also have the Functional Test Suite, which adds additional tests for pricing, availability math, etc.  We don’t do a release of Flex until this screen is clear of pending builds and 100% green.

We don’t catch every ripple effect, or regression as we call them, but this tool catches a lot of regressions that would otherwise slip through.

But lately we’ve noticed a new kind of regression, one that isn’t strictly functional.  The code still works as it always did, but somewhere along the line a change that seems minor to us causes a ripple effect in terms of performance.  Today we discovered that the calculation logic for total prep time was getting invoked for every bar code scan and slowing down scan response times to unacceptable levels in certain circumstances.  This of course was a ripple effect of tweaking the total prep time value so it could be retrieved in a batch HQL query, meaning that every time an equipment list was saved to the database, the total prep time was recalculated.  Doing this for every bar code scan introduced slowness in the form of an N+1 select issue (where N is the number of line items on the underlying pull sheet).
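For readers who haven’t run into an N+1 select before, here is a minimal sketch of the pattern.  It is not Flex code and it doesn’t use Hibernate; FakeDb and its methods are hypothetical stand-ins that only count queries, and the batched version is there for contrast rather than as the fix we actually applied.

```java
import java.util.ArrayList;
import java.util.List;

public class NPlusOneDemo {
    /** Hypothetical stand-in for a database; it only counts the queries it receives. */
    static class FakeDb {
        int queries = 0;

        List<Integer> loadLineItemIds(int count) {
            queries++; // one query for the parent list
            List<Integer> ids = new ArrayList<>();
            for (int i = 0; i < count; i++) {
                ids.add(i);
            }
            return ids;
        }

        int loadPrepMinutes(int lineItemId) {
            queries++; // one extra query per line item
            return 5;  // made-up value, irrelevant to the query count
        }

        int loadTotalPrepMinutes(List<Integer> ids) {
            queries++; // one aggregate query covering every line item
            return 5 * ids.size();
        }
    }

    /** N+1 pattern: one query for the list, then one more for each of its n items. */
    static int queriesPerItem(int n) {
        FakeDb db = new FakeDb();
        for (int id : db.loadLineItemIds(n)) {
            db.loadPrepMinutes(id);
        }
        return db.queries;
    }

    /** Batched pattern: one query for the list, one for the total. */
    static int queriesBatched(int n) {
        FakeDb db = new FakeDb();
        List<Integer> ids = db.loadLineItemIds(n);
        db.loadTotalPrepMinutes(ids);
        return db.queries;
    }

    public static void main(String[] args) {
        System.out.println(queriesPerItem(50)); // prints 51
        System.out.println(queriesBatched(50)); // prints 2
    }
}
```

The per-item count grows with the size of the pull sheet, which is why the slowdown only showed up in certain circumstances.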

Once we identified the bottleneck, it was an easy fix.  We pulled the totalPrepTime out of Hibernate, and this reduced the scan time in our test case from 5.7 seconds to 580 milliseconds.  We also found another performance slowdown in scans when containers are configured to automatically process their contents.  We call these auto scans, and the auto scans were needlessly performing orphan scan checks (which include an availability check), so for containers with lots of contents, this could result in a major slowdown.  We’ve added a new flag to skip orphan detection when processing a container child.  This sped things right up.

This got Chris and me thinking about how we might do a better job of catching performance regressions in the same way we catch functional regressions now.  If we established a benchmark for scan performance in our test bed, we could trigger a test failure whenever a test run exceeds the benchmark by some predetermined value, perhaps by one standard deviation.

It’ll be tricky to establish initial benchmarks, and I think I favor a statistical technique that “learns” what the typical performance characteristics are, maybe with some provisions for adding hard limits for certain test scenarios.  Otherwise it’s just trial and error and you’d end up chasing a lot of false positives, meaning test failures that don’t correspond to a real slowdown.
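To make the idea concrete, here is a rough sketch of what such a check might look like.  Nothing like this exists in our test suite yet; the class name, the method names and the history of timings are all made up for illustration, and the threshold of k standard deviations above the mean is just one possible rule.

```java
import java.util.List;

public class PerfBaseline {
    /** Arithmetic mean of the recorded timings, in milliseconds. */
    static double mean(List<Double> samples) {
        double sum = 0.0;
        for (double s : samples) {
            sum += s;
        }
        return sum / samples.size();
    }

    /** Sample standard deviation (n - 1 denominator); needs at least two samples. */
    static double stdDev(List<Double> samples) {
        double m = mean(samples);
        double squares = 0.0;
        for (double s : samples) {
            squares += (s - m) * (s - m);
        }
        return Math.sqrt(squares / (samples.size() - 1));
    }

    /**
     * Flags a run as a regression if it breaks the optional hard limit, or if it
     * is slower than the learned baseline by more than k standard deviations.
     * Pass Double.POSITIVE_INFINITY as hardLimitMs when no hard limit applies.
     */
    static boolean isRegression(List<Double> history, double latestMs,
                                double k, double hardLimitMs) {
        if (latestMs > hardLimitMs) {
            return true;
        }
        if (history.size() < 2) {
            return false; // not enough data to have learned a baseline yet
        }
        return latestMs > mean(history) + k * stdDev(history);
    }

    public static void main(String[] args) {
        // Hypothetical timings from earlier test runs, in milliseconds.
        List<Double> history = List.of(560.0, 590.0, 575.0, 600.0, 580.0);
        double noLimit = Double.POSITIVE_INFINITY;

        System.out.println(isRegression(history, 585.0, 1.0, noLimit));  // prints false
        System.out.println(isRegression(history, 5700.0, 1.0, noLimit)); // prints true
        System.out.println(isRegression(history, 585.0, 1.0, 500.0));    // prints true
    }
}
```

With this made-up history the mean is 581 ms and the standard deviation is a little over 15 ms, so a 585 ms run passes and a 5.7 second run fails.  One standard deviation is a tight band that would flag plenty of normal runs, so in practice k would probably need to be larger.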

Adding statistical performance analysis to our test suite sounds like a lot of fun for me and a good way to prevent accidentally releasing new performance bugs in the future.  When the FastTrack schedule gets back under control, maybe I’ll twist Chris’s arm to let me take a crack at it.
