Incident Management
When the shit hits the fan
Late Friday evening, just when you are about to sit down for that family dinner, the phone buzzes and you get notifications from your favourite monitoring service that the website is down.
You excuse yourself and prepare for a whole night of fixing whatever needs to be fixed.
Lesson number one in incident management is: avoid it. Implement release windows so that nothing is released to production in the evenings or after Friday lunch.
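A release window like this can be enforced in the deployment pipeline rather than left to good intentions. The following is a minimal sketch, not a definitive implementation. The function name and the cut-off hours (window opens at 09:00, evening starts at 17:00, Friday lunch at 12:00) are assumptions, as is blocking weekends; the text only says "evenings" and "after Friday lunch".

```python
from datetime import datetime

# Assumed cut-offs; adjust to your own team's working hours.
WINDOW_OPENS_HOUR = 9     # assumption: no releases before the team is online
EVENING_STARTS_HOUR = 17  # assumption: "evening" starts at 17:00
FRIDAY_CUTOFF_HOUR = 12   # assumption: "Friday lunch" is 12:00
FRIDAY = 4                # datetime.weekday(): Monday is 0, Sunday is 6


def is_release_allowed(now: datetime) -> bool:
    """Return True if a production release may start at `now` (local time)."""
    weekday = now.weekday()
    if weekday > FRIDAY:
        # Saturday or Sunday (assumed to be outside the window).
        return False
    if weekday == FRIDAY and now.hour >= FRIDAY_CUTOFF_HOUR:
        # After Friday lunch.
        return False
    # Only during working hours, never in the evening.
    return WINDOW_OPENS_HOUR <= now.hour < EVENING_STARTS_HOUR


if __name__ == "__main__":
    if is_release_allowed(datetime.now()):
        print("Release window is open.")
    else:
        print("Release window is closed. Try again on the next working day.")
```

A pipeline step could call this check before deploying and stop the release when it returns False, with a documented override for genuine emergencies.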
Nothing is worse than having to spend the weekend (in the best case) fixing some new bug, or starting your laptop on Monday morning only to realise that an entire weekend of uptime, and with it the sales or the service, is gone.
No one is there to fix it, so you have to scramble together whichever developers are unfortunate enough to respond to your calls and WhatsApp messages.
Once you are all online, you spend the night fending off the obvious question from Product Owners, Managers, Co-workers, Stakeholders, Customers or your family: “When will it be fixed?”
Or, you can have a written Incident Management process which still needs people to do some work, but clearly takes care of the entire incident in a responsible, manageable and outcome-driven way.
With a rotating schedule of developers in a call chain, no one has to be available all the time or sit and wait for issues. But if the, yeah, stuff hits the fan, then you are ready. Because it will.
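A rotating call chain can be as simple as a list that shifts by one position every week. The sketch below assumes a weekly rotation and uses hypothetical names (`call_chain`, `rotation_start`, the example developers); a real setup would more likely live in a paging tool, but the logic is the same.

```python
from datetime import date


def call_chain(developers: list[str], rotation_start: date, today: date) -> list[str]:
    """Return the developers in the order they should be called today.

    The first entry is the primary on-call. If they do not answer,
    the next one in the list is called, and so on down the chain.
    Assumption: the chain rotates by one position every 7 days.
    """
    if not developers:
        raise ValueError("the call chain needs at least one developer")
    weeks_elapsed = (today - rotation_start).days // 7
    offset = weeks_elapsed % len(developers)
    # Rotate the list so that a different developer is first each week.
    return developers[offset:] + developers[:offset]


if __name__ == "__main__":
    team = ["Ana", "Ben", "Chi"]  # hypothetical team
    chain = call_chain(team, rotation_start=date(2024, 1, 1), today=date.today())
    print("Call order this week:", " -> ".join(chain))
```

Because the order is computed from the date, everyone can see well in advance which weeks they are first in line and plan around them.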
Once the process is activated, it is just as important that stakeholders are trained to understand that there is a process and that the issue is being worked on, no matter how fast they want it resolved.
They need to respect the process in order to avoid mayhem, blaming, conflicts and “he said, she said”. And most importantly, to get it fixed.
Read more in "The CTO Playbook" available on Amazon/Kindle.