The first DevOps assessments started back in 2012, just under nine years ago. Over the years, the available research has made it clear that certain things about DevOps are as true today as they were in the early days of the movement. DevOps still gives organizations a serious competitive advantage. Automation, collaboration and sharing are as important as ever. And the organizations doing DevOps well don’t have to trade off moving fast against keeping things stable and secure.
This intention – moving fast while keeping things stable and secure – can be misunderstood. It does not mean pushing your developers to commit more times per day, to work overtime, or to risk-accept all open security issues. Automation, collaboration and sharing, as mentioned, are the key factors here. Understanding them the right way is the true spirit of DevOps (and not hiring “DevOps engineers”). The people who worked closely with DevOps and wanted to document the right approach created an organization to do so.
DevOps Research and Assessment (DORA) – over the years, they have surveyed more than 50,000 technical professionals worldwide to better understand how the technical practices, cultural norms, and lean management practices we associate with DevOps affect IT performance and organizational performance.
“They’re measurements which express the goal of making money perfectly well, but which also permit you to develop operational rules for running your plant,” he says. “There are three of them. Their names are throughput, inventory and operational expense.”

– Eliyahu M. Goldratt, The Goal: A Process of Ongoing Improvement
Mr Goldratt was writing a book about running a manufacturing plant. What the DevOps movement took from it was that you can run software products in a very similar fashion: the manufacturing spirit of lean, forged to fit the software world (mainly by systems administrators who had had enough of the sloppy software delivery practices that cost them long nights). This is what DevOps is today. If you are looking for proof, companies like Amazon, Google, Facebook or, more local to the UK, JustEat and Zoopla are run with the same assumptions in mind.
It took a while to build a set of metrics that would best describe a software-based organization’s throughput, inventory and operational expense. That is why the group of people behind DORA started by regularly surveying the industry. They knew the traps of statistics: false correlations, and ending up tracking things that do not matter.
The book they published – Accelerate: Building and Scaling High-Performing Technology Organizations – has a section dedicated to how they arrived at the right science. Note that their goal was to find an answer to why some organizations perform well, including financially, while other organizations that spend as much or more money on their ‘IT’ do not. The metrics they came up with correlated that success with the study results, and they have been in use for the last six years, in most cases very precisely predicting which tech companies have a much bigger chance to succeed.
What, then, are the DORA metrics that the science behind the State of DevOps reports gave birth to? I have listed them below. Before you read through them, note that:
Metrics tell the story. Don’t think about how to make the metrics better; think about which practice you might be missing.
- Production Deploy Frequency – how big is your batch (the theory is, the fewer times you deploy, the bigger your batch, i.e. the amount of stuff you release at once; the bigger the batch, the more potential problems it carries and, most likely, the more time it takes to get to production)
defined as: how often your organization deploys code to production.
- Lead Time – how many blockers stand between devs and production (the longer it is, the bigger the merge requests, and the more process issues, tech debt, conflicting requirements, rework and unexpected work)
defined as: how long it takes for a code commit to be deployed to production.
- Mean Time To Recover – how fast you can recover (but it also tells you how much time it takes you to notice there is an issue in the first place)
defined as: how long it takes to restore the service after a service incident occurred.
- Change Fail Rate – what is your quality focus (the higher it is, the more questionable your quality practices)
defined as: what percentage of changes result either in degraded service or subsequently require remediation (e.g. lead to impairment or outage, require a hotfix, rollback or fix forward).
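To make the four definitions concrete, here is a minimal sketch of how the metrics could be computed from a handful of deployment and incident records. The records, field layout and one-week window are made up for illustration, not taken from any particular tool.

```python
from datetime import datetime
from statistics import mean

# Hypothetical deployment records: (commit_time, deploy_time, needed_remediation)
deploys = [
    (datetime(2021, 3, 1, 9, 0), datetime(2021, 3, 1, 15, 0), False),
    (datetime(2021, 3, 2, 10, 0), datetime(2021, 3, 3, 11, 0), True),
    (datetime(2021, 3, 4, 8, 30), datetime(2021, 3, 4, 12, 0), False),
]

# Hypothetical incident records: (service_degraded_at, service_restored_at)
incidents = [
    (datetime(2021, 3, 3, 11, 30), datetime(2021, 3, 3, 12, 15)),
]

period_days = 7  # the observation window

# Production Deploy Frequency: deploys per day over the window.
deploy_frequency = len(deploys) / period_days

# Lead Time: mean hours from commit to production deploy.
lead_time = mean((d - c).total_seconds() / 3600 for c, d, _ in deploys)

# Mean Time To Recover: mean minutes from degradation to restoration.
mttr = mean((r - s).total_seconds() / 60 for s, r in incidents)

# Change Fail Rate: share of deploys that needed remediation.
change_fail_rate = sum(1 for *_, failed in deploys if failed) / len(deploys)

print(f"Deploy frequency: {deploy_frequency:.2f}/day")
print(f"Lead time: {lead_time:.1f} h")
print(f"MTTR: {mttr:.0f} min")
print(f"Change fail rate: {change_fail_rate:.0%}")
```

The point of the sketch is only that each metric reduces to simple arithmetic once the timestamps are recorded; the hard part in practice is capturing those timestamps reliably.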
In the end it follows the rules of lean:
- Value is only delivered once it leaves the factory floor, i.e. is released to production (production deploy frequency)
- Pay attention and improve daily to have as little unexpected work and rework as possible (change fail rate)
- Optimize your bottlenecks to have the right amount of Work In Progress (lead time)
- Invest in automation, scalability, observability, security and other practices that help to keep your services running seamlessly (mean time to recover)
Any improvements made anywhere besides the bottleneck are an illusion.

– Gene Kim, The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win
I am currently involved in building the capability to apply DORA metrics. They are an excellent tool for starting to introduce the right practices into your organization, for starting to ask the right questions, and for rebuilding your organization’s culture.
As for how you can implement the measurements, here are some ideas I have found useful:
- Production Deploy Frequency – best if automated via CI/CD tools like Jenkins or CircleCI. However, I have also seen integrations with ServiceNow or Jira. All that matters is to record each production deployment.
- Lead Time – any version control tool, in conjunction with a CI/CD tool. You can configure Jenkins, GitLab, CircleCI and others with GitHub, SVN, etc. Then you just measure the time between when a commit was created and when the same commit went live via CI/CD. Worst case, you can even use Excel and measure these things manually.
- Mean Time To Recover – I really liked Blameless for integrating all recovery procedures. However, multiple other tools like ServiceNow and Jira provide the capability to track and record the duration of outages.
- Change Fail Rate – the best ideas I have seen were based on automatic tracking of Google’s four golden signals after a change was deployed. However, the number of redeploys can be retrieved from CI/CD tooling or even recorded manually in tools like Jira or ServiceNow.
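If you do start manually, even a flat file is enough to get Lead Time and Change Fail Rate going. Below is a minimal sketch assuming a made-up CSV layout with one row per production deploy; the column names and data are illustrative, not taken from any particular tool.

```python
import csv
import io
from datetime import datetime

# Hypothetical manual tracking file: one row per production deploy.
RAW = """commit_sha,committed_at,deployed_at,needed_remediation
a1b2c3d,2021-03-01T09:00,2021-03-01T15:00,no
e4f5a6b,2021-03-02T10:00,2021-03-03T11:00,yes
f7c8d9e,2021-03-04T08:30,2021-03-04T12:00,no
"""

rows = list(csv.DictReader(io.StringIO(RAW)))

# Lead Time per deploy: hours between commit and production release.
lead_times = [
    (datetime.fromisoformat(r["deployed_at"])
     - datetime.fromisoformat(r["committed_at"])).total_seconds() / 3600
    for r in rows
]

# Change Fail Rate: share of deploys that needed remediation.
fail_rate = sum(r["needed_remediation"] == "yes" for r in rows) / len(rows)

print("lead times (h):", [round(h, 1) for h in lead_times])
print(f"change fail rate: {fail_rate:.0%}")
```

A spreadsheet with the same four columns gives you the same numbers; the value of starting this simply is that the team agrees on what counts as a deploy and what counts as remediation before any tooling is bought.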
Whether you decide to automate early or just start collecting the metrics manually, remember to make the metrics really visible. Find a senior sponsor who will continuously remind others of what the metrics say and keep them under discussion.
This is all I have prepared for today. I hope you will find it helpful and interesting enough to start discussing this approach within your organization.
The key message you need to remember from this article is: if you can’t measure it, you won’t be able to improve it.
Think about the metrics by which your result will be judged, and make a resolution to live by them every day so that in the end, your result will be judged a success.
My personal choice is the DORA metrics. They can help you make important decisions to improve your software development and delivery processes. Further investigation can help find bottlenecks in the process, and with these findings organizations can make informed adjustments to their workflows, automation, team composition, tools, and more.