Big Tester Is Watching You - The Other Use For Production Data - Duncan Nisbet

We all know that we can use data generated from real customers to help us to create user journeys, scenarios & personas etc. but are you aware of the benefits of visualising some key aspects of that Production data for the Development team to see?

This is a story of how I got my love of big data & the wealth of information it can provide to the Development team. My experience of monitoring & visualising Production data to get an idea of customer experience is very limited, & limited to web applications at that. I’m hoping it will shape my ideas for future challenges…

I was skeptical at first - An XP Developer (whose opinion I value), fresh from Velocity Conf, had seen a presentation from Etsy about metrics-driven engineering & was trying to convince me that we didn’t need to run as many tests in the Production environment to prove releases. Eventually my trust grew as I started to understand the metrics & what kind of stats I could produce myself with the Production data.

I liken it to the difference between white & black box testing: Previously the releases would go out into the Production environment & us Testers would then retest the changes that went out in that release. Once tested, we’d forget about the changes & move onto the next release. We had no real idea what experience the customer was having, similar to having no idea what the code is doing in black box testing.

With the move to more frequent deployments, spending this time re-testing the changes in Production wasn’t feasible, as the other work was piling up behind us. This manual “Testing In Production” was causing the work in the SDLC to clump & batch, with a lot of “throwing over the wall” going on. Lo & behold, Testers were becoming the bottleneck.

Once we started getting used to the metrics & graphs & what they stood for, our confidence in them started to grow & we gradually moved away from re-testing the changes in Production. Eventually, all we had was a suite of simple automated tests, executed from the cloud (outside our internal network), plus us quickly eyeballing the site to ensure it wasn’t completely goosed.
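To give a flavour of what I mean by "simple automated tests", here is a minimal sketch in Python. It is not the suite we actually ran; the function names (`fetch_status`, `smoke_check`) & the URLs in the usage note are made up for illustration, & I'm assuming that "healthy" just means each page answers with a 200.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url, or 0 if the site is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx / 5xx responses are raised as exceptions by urllib.
        return err.code
    except OSError:
        # DNS failures, refused connections, timeouts etc.
        return 0


def smoke_check(
    urls: Iterable[str],
    fetch: Callable[[str], int] = fetch_status,
) -> Dict[str, int]:
    """Hit each URL once & return only the ones that did not answer 200."""
    failures = {}
    for url in urls:
        status = fetch(url)
        if status != 200:
            failures[url] = status
    return failures
```

Something like `smoke_check(["https://www.example.com/", "https://www.example.com/basket"])` could be run from a box outside the internal network after each release; an empty result means the key pages are at least answering. The `fetch` parameter is only there so the check can be exercised without touching a real site.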

The graphs on display were kept simple so they could be observed at a glance from a distance. They showed traffic (no. of HTTP requests), site performance (page load & response times), errors (4xx & 5xx series) & 3rd party integration points (both internal & external).
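As a rough illustration of the numbers behind graphs like these, the sketch below rolls individual requests up into per-minute figures for traffic, response time & errors. The `Request` record & its field names are my own assumptions for the example; in practice the data would come from access logs or whatever your monitoring tool collects.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Request:
    timestamp: int      # Unix time in seconds (assumed field)
    status: int         # HTTP status code
    response_ms: float  # server response time in milliseconds


def summarise_by_minute(requests: Iterable[Request]) -> Dict[int, dict]:
    """Group requests into one-minute buckets keyed by the minute's start time."""
    buckets = defaultdict(
        lambda: {"requests": 0, "errors_4xx": 0, "errors_5xx": 0, "total_ms": 0.0}
    )
    for request in requests:
        bucket = buckets[request.timestamp // 60 * 60]
        bucket["requests"] += 1
        bucket["total_ms"] += request.response_ms
        if 400 <= request.status < 500:
            bucket["errors_4xx"] += 1
        elif 500 <= request.status < 600:
            bucket["errors_5xx"] += 1

    # Replace the running total with a mean so each bucket is ready to plot.
    return {
        minute: {
            "requests": b["requests"],
            "errors_4xx": b["errors_4xx"],
            "errors_5xx": b["errors_5xx"],
            "mean_response_ms": b["total_ms"] / b["requests"],
        }
        for minute, b in buckets.items()
    }
```

Each bucket is one point on each graph: requests per minute for traffic, the mean for performance & the two error counts for the 4xx & 5xx lines.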

Seeing how & when customers used our site gave us a much richer & more varied idea of the state of our web application than a Tester performing a single user journey.

During a release, we could see traffic drop off the boxes as they were taken out of the pool & the new code deployed to them.

When these boxes were brought back up & traffic pointed at them, we could immediately see if there were any problems with the new release - these problems would manifest themselves as (for example) spiking response times, 500 or 404 errors & 3rd party integration timeouts / circuit breakers.

Having this information allowed us to have the conversation about whether we rollback the change, forward fix or leave it alone.
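We did this comparison by eye on the graphs, but the same idea can be sketched in code: compare a window of requests from before the release with a window from after it & flag anything that has got noticeably worse. The function name & both thresholds below are assumptions picked purely for illustration, not values we used.

```python
from statistics import mean
from typing import List, Sequence, Tuple

Sample = Tuple[int, float]  # (HTTP status code, response time in ms)


def release_warnings(
    before: Sequence[Sample],
    after: Sequence[Sample],
    max_error_rate_increase: float = 0.02,  # assumed threshold
    max_slowdown: float = 1.5,              # assumed threshold
) -> List[str]:
    """Compare non-empty before/after windows & list what got worse."""

    def error_rate(samples: Sequence[Sample]) -> float:
        return sum(1 for status, _ in samples if status >= 500) / len(samples)

    def mean_ms(samples: Sequence[Sample]) -> float:
        return mean(ms for _, ms in samples)

    warnings = []
    if error_rate(after) - error_rate(before) > max_error_rate_increase:
        warnings.append("5xx error rate up")
    if mean_ms(after) > max_slowdown * mean_ms(before):
        warnings.append("response time up")
    return warnings
```

An empty list doesn't prove the release is good, & a warning doesn't decide between rollback & forward fix for you. It just prompts the conversation sooner.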

We would dive deeper into the graphs & logs on the servers to triage the problem & provide the Ops team with an informed response.

We’d let them know if there was a problem with our release, not the other way round.

The primary effect was that the number of successful releases went up & the number of rollbacks went down. The Ops team obviously loved this. This helped us to earn their respect & trust.

A secondary effect was more cultural - the team took more pride in their work & invested themselves more in the stability & performance of our product out in the wild.

Questions would be asked during times when there was no release, or when people walked into the office, when the graphs were not following their usual pattern: “Anybody know why we’re seeing more 404s than yesterday?” or “That 3rd party provider has tripped again. Is anyone on the case?”

I’ve gone on too long already, but to sum up, some of the benefits I’ve found with monitoring Production data include:

Knowing what experience the customer is having right now compared to previous times in history;
Confidence that the Production environment is stable /enough/ for a release;
Monitoring the Production environment during & after a release;
Providing information about problems in the Production environment;
Enabling the Development team to provide feedback to the Ops team on the state of the release;
Freeing up Testers’ time so they can add more value in their testing where they can make more of a difference.

My follow up post to this one is going to be about my current challenge of visualising the Production data in my new company & how the Development team can use that data.

Some useful links for further reading:

Mark Crossfield’s post on the Graphite monitoring solution he kicked off at my previous company (his present company)
Etsy’s blog post “Measure Anything, Measure Everything” (also quoted in Mark’s post)
Matt Heusser’s post on his visit to Etsy - check out those monitors!
Picture showing investigation of an outage / drop in traffic from the previous night - not even the logs could tell you why…
