Drilling down a software bug: lessons about observability, monitoring, automation and good practices

Katerina Koutsonikoli @magikat0

B.3.019 - Monday 4th February 2019 - 17:30 → 18:25

So, here is the case: once every few days the Kafka cluster dies, and the whole application dies along with it, resulting in revenue loss for the company.

Starting from there, I would like to describe how we approached the unknown issue, through assumptions, failures, and trial and error, until we found the root cause: a known bug in the version we were running of this famous distributed software.
Of course, every piece of software has bugs, and hitting a major one is not that uncommon. What really matters is the lessons learnt during the process:

  • how we monitor our infrastructure, and ways to improve it
  • what actually happens inside our application, how failures of external software affect it, and how we can improve stability
  • tools we have vs. tools we need: a call for more automation
  • what resources we actually need for our use case, whether we under- or overprovision, and how to understand our scale and optimize costs
  • the importance of documentation and post-mortems, and the 5 Whys.

Major takeaway of this talk: treat your incidents as a way to understand more about your systems (both technical systems: infra, code, tools, and non-technical systems: teams, workflows, procedures, practices) and to design them better.

Speaker Info