Cynefin & Software Bugs

I have recently finished the fabulous but gruelling “BBST Bug Advocacy” course run by the AST, so my head is full of ideas about bugs, including how to reproduce them & document them in such a way that those making decisions about them have the information they need.

This studying came on top of a lot of work I’ve been doing on the Cynefin framework. My brain is now trying to weave the models I learned on Bug Advocacy into the Cynefin framework.

This post is where I am up to right now with it…


When we discover a bug, it is not always apparent what action caused the undesired behaviour. Of course, there are those bugs which we can easily identify, but then there are those that require some time to investigate. There are also those WTF bugs where we have no idea where they came from.

In this post, I try to map the different types of bugs I have encountered to the Cynefin framework. The intention of this is to help me understand the dynamics of bug reproduction & documentation so that I have another tool to use when giving my bugs credibility.

If you’re new to the Cynefin framework, I suggest you read my previous post which has a host of links to help you get up to speed.

 

For the purposes of this post, I am using the following definition of a bug:

“Any behaviour that threatens the value of a software product for some person who matters”

Misbehaviour is undesired behaviour for favoured stakeholders (potentially desired behaviour for disfavoured stakeholders)

 

Obvious domain

In the Obvious domain, cause & effect are, well, obvious. In the context of bugs, this means that we can clearly see what action has caused the undesired behaviour.

E.g. a form has no submit button – the user cannot submit the form.

It is easy for us to categorise this bug (e.g. as missing functionality or a deviation from the acceptance criteria).

We can capture the misbehaviour in an automated check to catch it if it reappears in future.

As a Tester on one project, I would write a failing check to demonstrate the presence of a bug (supported by conversations with Programmers & Product Owners, of course). I would comment this check out & commit it to the repo in order to pair on fixing it with the Programmer. Depending on certain factors (e.g. where/when the bug was found), this check would serve as the only documentation of the bug.
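As a rough illustration of the shape such a failing check might take (a minimal sketch using only the Python standard library – the form markup & the `form_has_submit_button` helper are invented for this example; the real check on a project would drive the application itself):

```python
from html.parser import HTMLParser

class SubmitButtonFinder(HTMLParser):
    """Looks for a submit control (<button> or <input>) inside a form."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.found_submit = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.in_form = True
        elif self.in_form and tag in ("button", "input"):
            # A <button> inside a form defaults to type="submit"
            default = "submit" if tag == "button" else ""
            if attrs.get("type", default) == "submit":
                self.found_submit = True

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False

def form_has_submit_button(html):
    finder = SubmitButtonFinder()
    finder.feed(html)
    return finder.found_submit

# The failing check demonstrating the bug: this form has no submit
# button, so the assertion would fail until the bug is fixed.
broken_form = '<form><input type="text" name="email"></form>'
# assert form_has_submit_button(broken_form)  # commented out & committed, per the practice above
```

The commented-out assertion mirrors the practice of committing the failing check to the repo before pairing on the fix.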

 

Complicated domain

In the Complicated domain, cause & effect are separated by space & time. There may also be several actions that caused the effect.

E.g. clicking the submit button on a form provides no feedback to the customer (of a success or failure)

Trying to identify what action caused the misbehaviour requires analysis (or investigation) & typically expertise.

The purpose of this investigation is primarily to refine the reproduction steps to the bare minimum required to repeatedly demonstrate the misbehaviour.

In the example above, determining the steps might include clicking the submit button with:

  • all the form fields empty
  • only the mandatory fields empty
  • all the fields completed correctly
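Those field-state combinations lend themselves to a small isolation loop. Here is a sketch (the field names, the mandatory/optional split & the `submit` stand-in are all invented – in reality `submit` would drive the actual form):

```python
import itertools

# Hypothetical form: field name -> is it mandatory?
FIELDS = {"name": True, "email": True, "comment": False}

def submit(filled):
    """Stand-in for submitting the real form; returns True on success.
    The fault simulated here: submission fails whenever any
    mandatory field is left empty."""
    return all(field in filled for field, mandatory in FIELDS.items() if mandatory)

def isolate():
    """Try every combination of filled fields, smallest first, and
    collect the ones that reproduce the failure - the first hit is
    the minimal set of reproduction steps."""
    failing_combos = []
    for size in range(len(FIELDS) + 1):
        for combo in itertools.combinations(FIELDS, size):
            if not submit(set(combo)):
                failing_combos.append(set(combo))
    return failing_combos
```

In this invented fault, the smallest failing combination is the completely empty form, which would point the investigation at the mandatory-field handling.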

 

But there are other factors to consider, such as: is this behaviour the actual fault, or is it merely a symptom of a more sinister fault? How likely is it that a customer would find this bug?

Typical questions I’d be asking for the example above include:

  • did the form actually submit (check DB)
  • what errors are being thrown, if any (check logs, dev tools console if in web browser)
  • if the form did not submit, how far did the request go?

 

On the Bug Advocacy course, we learned of the RIMGEA heuristic for investigating bugs. I’d be looking to apply this heuristic in the Complicated domain:

  • Replicate (demonstrate that the bug can be reproduced)
  • Isolate (minimum steps to reproduce)
  • Maximise (how bad can behaviour get)
  • Generalise (how likely are customers to experience the behaviour)
  • Externalise (what’s the impact for the customers & our organisation)
  • And say it clearly & dispassionately (leave out the accusation & judgement)

 

For complicated bugs, after some time & investigation we will hopefully be able to create automated checks around the misbehaviour in case it should reappear. At this stage, we have moved the bug from the Complicated to the Obvious domain.

 

Complex domain

In the Complex domain, cause & effect are only apparent in retrospect.

E.g. Form submission is periodically unsuccessful in Production environment

In the Complex domain, we are dealing with hard-to-reproduce, or supposedly “non-reproducible”, bugs.

From the information we had at the time, we could not envisage the misbehaviour. We wouldn’t have caught this bug with a scripted test.

It could be the case that this bug wasn’t uncovered during development or testing – if so, why was that?

We need to probe the system misbehaviour by running some experiments, in order to try & reproduce the behaviour.

Some experiments I’d run to try & identify when the form fails to submit include:

  • after a set number of attempts
  • after the page containing the form has been open for a set amount of time
  • some action I’m doing differently
  • the data I’m using
  • how does the failure rate compare on a different system
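A probe for that kind of intermittent failure can be as simple as a loop that re-runs the submission and records the conditions around each failure. A sketch (the `attempt_submit` stand-in & its random 1-in-10 failure rate are invented – a real probe would hit the actual system):

```python
import random
import time
from collections import Counter

def attempt_submit(attempt_number, page_open_seconds):
    """Stand-in for one real form submission; returns True on success.
    Simulated fault: roughly 1 in 10 requests fails at random."""
    return random.random() > 0.1

def probe(attempts=200):
    """Re-run the submission, bucketing failures by attempt number so
    that patterns (e.g. 'it always fails after N attempts') stand out."""
    started = time.monotonic()
    failures = Counter()
    for n in range(1, attempts + 1):
        page_open = time.monotonic() - started  # could bucket on this too
        if not attempt_submit(n, page_open):
            low = ((n - 1) // 50) * 50 + 1
            failures[f"attempts {low}-{low + 49}"] += 1
    return failures
```

If the failures cluster in one bucket – or correlate with how long the page has been open, or with particular data – that is the signal that starts shifting the bug towards the Complicated domain.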

 

Through experimentation, we are exploring the software to find out more about it & why we might be seeing the misbehaviour that we are seeing.

We are using the RIMGEA heuristic to help guide our exploration, narrow down the circumstances under which we experience the behaviour & hopefully shift the bug from the Complex to the Complicated or even the Obvious domain (in which case we had completely missed a fundamental aspect of the system when picking up the development work).

 

Chaotic domain

No cause & effect relationship is perceivable.

E.g. clicking the submit button on a form empties the customer’s account

Why should submitting a form wipe out a customer’s account?

Now isn’t the time to be asking why – we need to act! We need to stop customers’ accounts haemorrhaging money. It’s time to implement those procedures you have for situations just like this.

I would argue that no testing happens in Chaos – you throw up constraints to limit the impact of the misbehaviour, which moves the situation from Chaos into Complexity. From here, you’re back to dealing with the behaviour as a complex bug.

For Production bugs, this is likely to involve a “war room” for the post-mortem to generate actions to prevent this bug from ever happening again.

 

Summing up

In a group setting, I’d like to experiment using the Cynefin framework as a retrospective tool, possibly in bug triage to see if there’s any useful information that could be gleaned from the model.

Personally, I’m going to apply the framework to the bugs I come across to help determine what level of further analysis I might need to do.

One interesting thought that sprang to mind whilst writing this post is that my actions in the Complicated & Complex domains were very similar. It seems my ideas of investigation & analysis are very similar to those of experiments. This means that I am potentially treating complex bugs as complicated & vice versa. I wonder if I’m missing something there.

  • Dave Snowden

    I like this – we are thinking of setting up some work on Cynefin and also SenseMaker® around software testing. Would you be interested in getting involved? If so DM me or email

    • DuncanNisbet

      Hi Dave, I would definitely be interested in getting involved!

      I’ll send you an email via your cognitive edge email address.

      It would be great to hear what you liked about the post as well.

      Thanks,

      Duncs


  • Augusto Evangelisti

    Duncan, this is excellent, I am looking forward to what you and Dave might discover looking into it!

    • DuncanNisbet

      Thanks Gus – me too!