Implement an error budget to gauge how much time is spent on features and how much on tech debt
From the CI Long polling retro
This comes from the SRE book. The general idea is that we should track an SLA and aim to meet it exactly. For example, with an SLA of 99.5%, if we are under it (say, at 99.4%), we can't introduce new changes that would make things unstable. If we are above it, we can introduce breaking changes because we have error budget left.
So far so good, and it makes sense. But the book then introduces another, quite interesting concept: the error budget has to be spent. While you are over your SLA you can make changes, and you have room to deliver new features and whatever else makes sense. But if you are still over your SLA, you need to induce an outage so that people don't assume the systems will be 100% available. You have to meet your SLA from both directions, either unintentionally or intentionally.
When the error budget is depleted, your team needs to work on preventing failure so you can meet your SLA in the next cycle.
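To make the arithmetic concrete, here is a minimal sketch of the budget calculation and the resulting policy. The class name, the 30-day cycle, the `cycle_ending` flag and the policy labels are assumptions made for illustration; neither the SRE book nor this proposal prescribes them.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical helper: tracks downtime against an SLA for one cycle."""

    sla_target: float    # e.g. 0.995 for a 99.5% SLA
    window_minutes: int  # length of the measurement cycle, in minutes

    @property
    def allowed_downtime(self) -> float:
        # The budget is whatever the SLA does not promise.
        return self.window_minutes * (1 - self.sla_target)

    def remaining(self, downtime_minutes: float) -> float:
        # Positive: budget left to spend. Zero or negative: SLA missed.
        return self.allowed_downtime - downtime_minutes

    def policy(self, downtime_minutes: float, cycle_ending: bool = False) -> str:
        if self.remaining(downtime_minutes) <= 0:
            # Budget depleted: stop risky changes, work on preventing failure.
            return "freeze"
        if cycle_ending:
            # Budget left over at the end of the cycle (assumed trigger):
            # spend it on purpose with an induced outage.
            return "induce_outage"
        # Budget available: there is room to ship features.
        return "ship"


# Assumed 30-day cycle: 30 * 24 * 60 = 43200 minutes.
budget = ErrorBudget(sla_target=0.995, window_minutes=43200)
print(round(budget.allowed_downtime))  # about 216 minutes of allowed downtime
print(budget.policy(100))
print(budget.policy(250))
print(budget.policy(100, cycle_ending=True))
```

With these assumed numbers, a 99.5% SLA leaves roughly 216 minutes of downtime per cycle. How downtime is measured, and when leftover budget should be spent, would be up to each team.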
Since we are moving toward a culture where teams own their own services, the impact of this error budget could be huge.
For example, forcing the filesystem to fail and go away will force the rest of the systems to evolve until they are capable of surviving such a failure. The same goes for the DB, for git, and for sidekiq. This will force the whole application to degrade gracefully, because the failure of one system impacts the SLA of the others: if your system is impacted by the failure of another, you lose error budget, so you are forced to work on preventing failure.
This matters because it is one thing for the application to fail with a 502, and a different thing for it to reply with text that provides context on why it could not fulfill the request (disks are not available, for example). In the second case you are complying with your SLA, because you are degrading your service within the available resources and being explicit about what the problem is, which makes everyone else's life easier.
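The difference between failing opaquely and degrading gracefully can be sketched as follows. Everything here is hypothetical: `StorageUnavailable`, `read_repository_file`, `handle_request` and the choice of a 503 status are illustrative assumptions, not existing application code.

```python
from dataclasses import dataclass


class StorageUnavailable(Exception):
    """Hypothetical error raised when the filesystem backend is unreachable."""


@dataclass
class Response:
    status: int
    body: str


def read_repository_file(path: str, storage_up: bool) -> str:
    # Stand-in for a real filesystem call; storage_up simulates the outage.
    if not storage_up:
        raise StorageUnavailable("disks are not available")
    return f"contents of {path}"


def handle_request(path: str, storage_up: bool = True) -> Response:
    try:
        return Response(200, read_repository_file(path, storage_up))
    except StorageUnavailable as exc:
        # Degrade gracefully: answer the request and say what is wrong,
        # instead of letting the failure surface as an opaque 502.
        return Response(503, f"Cannot serve {path}: {exc}.")


print(handle_request("README.md"))
print(handle_request("README.md", storage_up=False))
```

Whether such an explicit, degraded reply counts as meeting the SLA depends on how each team defines it; the sketch only shows the handler staying in control of the failure.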