It seems so long ago now, when really it was only around 18 months ago that we could deploy changes to production just once a fortnight. Even then, we were so nervous about doing so that we had three or four engineers working out of hours to complete the deployment. We would even deploy to our 2 production instances on separate nights, so the first one could settle in before we deployed to the second. Now we deploy twice daily, and the whole process is fully automated. The net result? 480 hours saved on lead time to deployment!
So, how did we do it?
Automating regression testing
Before we started on this journey, all regression testing was done manually. This took 2 people between 2 and 3 days to complete. We did have a test automation framework in place. Unfortunately, it had long been neglected, and no one really understood exactly what it was testing. After some investigation, we decided the best approach was to start again with the automated tests. By starting again, we could be sure that the tests had suitable validation and followed the highest value user journeys. We also decided to make sure they would not be reliant on pre-populated data, as the legacy framework had been.
As you may have seen elsewhere on my blog, we settled on a C# Selenium / SpecFlow framework. We decided to go with this for several reasons. C# was the primary programming language of most software engineers working on the product at the time. Our legacy framework was written in a similar way, so some engineers were already familiar with it. It would also allow us to re-implement the good parts of that legacy framework.
Test coverage grew gradually, as did our confidence in the new framework and tests. We started by automating the tests we were running manually as part of our deployments. This gave us a small, targeted set of tests to focus on, and one that brought immediate value once complete. From there, we identified the manual test scenarios where automation would bring the highest reward. These were scenarios that were complex to run manually, or that required specialist product knowledge that was not widespread within the team at the time. By the time we reached sufficient coverage to rely on automation alone, we had reduced the time taken for regression testing to just 1 hour. A best case saving of 44 man-hours!
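The 44 man-hour figure can be reproduced with some quick arithmetic. This is only a sketch: the 2 people, the 2 to 3 days and the 1 hour automated run come from the text above, while the 7.5-hour working day is an assumption (the length of a working day is not stated here).

```python
# Worked arithmetic for the regression-testing saving.
# ASSUMPTION: a 7.5-hour working day; this value is not given in the post.
HOURS_PER_DAY = 7.5

# From the post: the automated regression run took about 1 hour at this stage.
AUTOMATED_RUN_HOURS = 1


def manual_effort_hours(people: int, days: float,
                        hours_per_day: float = HOURS_PER_DAY) -> float:
    """Total person-hours spent on one manual regression cycle."""
    return people * days * hours_per_day


# 2 people for 3 days is the slowest manual cycle, so it gives the largest saving.
largest_saving = manual_effort_hours(people=2, days=3) - AUTOMATED_RUN_HOURS
# 2 people for 2 days is the quickest manual cycle, so it gives the smallest saving.
smallest_saving = manual_effort_hours(people=2, days=2) - AUTOMATED_RUN_HOURS

print(largest_saving)   # 44.0
print(smallest_saving)  # 29.0
```

Under that assumption, the saving per regression cycle lands somewhere between 29 and 44 person-hours, depending on how long the manual run would have taken.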
Since then, we have continued to improve and grow our test coverage, resulting in more diverse coverage and more comprehensive testing of the product before each release. It would be reasonable to think that this increased the duration of the tests too. You would be wrong. Through careful performance optimisation, despite the greater coverage, the tests now take just 30 minutes to run!
Every time I think about how we used to do things, I shake my head in disbelief. Not because what we did was wrong, but because of how far we have come! As I mentioned in the introduction, at the start of this journey we deployed fortnightly, but there is more to the story than just that:
- Deployments were completed fortnightly, with a 2-day burn-in between our 2 production instances.
- The deployment process was undertaken manually and at risk of human error.
- Deployments were undertaken out of hours and required the product to be taken offline to complete.
- Post-release smoke tests were completed manually (until the automation above was ready to take over).
- The duration of the manual regression testing, and the need to allow time for bug fixes, meant the release candidate was already over a week out of date compared to the main branch by the time it was deployed to production.
The first, and most important, part of improving our deployments was the implementation of Octopus Deploy. We were fortunate to be working with a fantastic engineer who migrated our manually executed legacy deployment tooling to Octopus Deploy. Once Octopus Deploy was in place, it opened the floodgates for what was to follow.
- Deployments could now be completed by anyone with access to do so in Octopus Deploy, without needing to manually access production EC2 instances. In both cases access was limited to specific roles, but with Octopus Deploy it could be opened up much more widely.
- As confidence grew, we were able to move to deploying both production environments in parallel.
- Later, we were able to move to weekly deployments. This amazed us at the time. How could we do better than that? Well, that was to come!
Automating the pipeline
After we reached weekly deployments, it was time to start improving the Octopus Deploy processes. Instead of manually creating our release candidate, we created a script to do that for us. The script could be run as an Octopus Deploy Runbook on a schedule.
At the time, we were using TeamCity as our build tool. TeamCity has a concept it calls VCS Triggers (Version Control System Triggers). These triggers respond to changes in your version control system (in our case, GitHub) and then run the configured builds. We set up VCS triggers to automatically build our release candidates when a “Release Branch” (identified by a naming convention) was created or updated. Once built, TeamCity had a final build step that would trigger Octopus Deploy to automatically deploy the release candidate to a designated test environment. At this point we had our regression testing fully automated, so the tests were triggered as part of the Octopus Deploy deployment.
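The idea of identifying a release branch by naming convention can be sketched in a few lines. This is not TeamCity configuration or its API; it only illustrates the decision a trigger makes. The `release/*` pattern and the function names are hypothetical, as the actual convention is not described here.

```python
from fnmatch import fnmatchcase

# HYPOTHETICAL naming convention; the real pattern is not given in the post.
RELEASE_BRANCH_PATTERN = "release/*"


def is_release_branch(branch_name: str,
                      pattern: str = RELEASE_BRANCH_PATTERN) -> bool:
    """Return True when a branch name matches the release naming convention."""
    # fnmatchcase is case-sensitive on every platform, unlike fnmatch.
    return fnmatchcase(branch_name, pattern)


def on_branch_created_or_updated(branch_name: str) -> str:
    """Decide whether a branch change should produce a release candidate build."""
    if is_release_branch(branch_name):
        return "build-release-candidate"
    return "no-release-build"


print(on_branch_created_or_updated("release/2024-01-15"))  # build-release-candidate
print(on_branch_created_or_updated("feature/login-form"))  # no-release-build
```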
We now had a fully automated release candidate creation and sign off process – mind blown!
That wasn’t enough though! I now saw the potential to do the unthinkable…
Daily deployments were going to require some big changes to our ways of working. We had to find a way of testing the process without disrupting our software engineers. One of the major changes to the deployment process itself was being able to deploy the product without taking it offline. The solution to all of this was to take a step back and use the tools we had built for automating the process in a more manual fashion. This would allow us to manually review the code that had been merged to main each day before committing to carrying out our daily deployment. Doing the manual review allowed us to assess the risk of deploying changes without taking the product offline. We could also determine what changes in process would be needed to minimise risk when moving away from our weekly, offline, out-of-hours deployments to daily online deployments. We spent around 3 weeks doing this before deciding to bring up the proposed changes with our software engineers.
When we spoke with our software engineers, we put forward the details of how this would work, and what considerations they needed to have when undertaking development. Despite some concerns, there was overwhelming support. In fact, those conversations resulted in us deciding to go a step further than planned. After all, if we could release daily, what was stopping us from releasing…
The answer was nothing, so that is exactly what we did!
To support the twice daily deployments, we had to take our automation of deployments one step further. We introduced automatic promotion to production: a simple trigger set up in Octopus that, following a successful deployment (and testing) of a release candidate to our pre-prod test environment, would automatically deploy it to production. Post-release smoke tests and suitable alerting were set up too, of course.
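The rule behind automatic promotion is small enough to write down. The sketch below is not Octopus Deploy configuration or its API; it only models the condition described above, and the `PreProdResult` type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreProdResult:
    """Outcome of releasing a candidate to the pre-prod test environment.

    HYPOTHETICAL type for illustration; not part of any real tool.
    """
    deployment_succeeded: bool
    regression_tests_passed: bool


def should_promote_to_production(result: PreProdResult) -> bool:
    """Promote only when both the deployment and its tests succeeded."""
    return result.deployment_succeeded and result.regression_tests_passed


print(should_promote_to_production(PreProdResult(True, True)))   # True
print(should_promote_to_production(PreProdResult(True, False)))  # False
```

Keeping the gate this strict means a failed regression run simply leaves production on the previous release.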
Software Development practices improvements
It’s hard to think now that when this journey started, our main branch was not always deployable. Our development team at the time developed in feature branches, but they were merged to main before testing had taken place. It was not uncommon to have a broken main branch. Developer testing was also pretty limited. Some developers would undertake a TDD approach, but that was about as good as it got. So how did we make this better?
Shift left approach
First up, we had to start introducing a shift left approach to testing and quality. We already had some elements of this in place. We undertook story shaping sessions (Three amigos) and had well documented Acceptance Criteria and Acceptance tests on tickets before development work began. Unfortunately, though, the game of Jira board ping pong continued, with software engineers considering the work “dev complete” without testing. I can remember more than one occasion where work was “dev complete” but the code wouldn’t build, or start, or was just so fundamentally broken you couldn’t test anything and had to send it back to the software engineer. To overcome this, we had to get the software engineers on side when it came to the concept of developer testing. For some, this was easier than for others.
Removing the fence
The thing with the over-the-fence approach is that it causes testing bottlenecks. By the time a ticket is tested, a software engineer can be days into another ticket. When you then go back to them with a problem, they have to drop what they are doing, re-familiarise themselves with the ticket you have been testing, and work out how to fix it. Then, once they have sent it back to you, they have to do the same again with the ticket they were pulled away from. Repeat that a couple of times and the cognitive overhead can cost the software engineer a huge amount of time. Far more time, in fact, than it would have taken them to check that their changes work correctly and fulfil the requirements of the ticket.
So how do you get them to try it to see the improvement?
There are many ways, and it depends on the team and where they are at with the journey. It can be as simple as asking them to give it a go, and supporting them with it. Other teams or individuals may be less open to the idea, and so may need to be shown how it can improve their development experience. Often this requires persuading at least one engineer to give it a go, but once you have, and you get buy-in from that engineer, the others will follow.
Always deployable main branch
This was a big one. By the time we got to this point, things had improved greatly, but the majority of testing was still happening in the main branch. The game changer for us was the introduction of ephemeral test environments. The initial introduction was manual, in the sense that you had to trigger the creation and destruction of an environment, but it was enough. We could now have as many environments as needed at any time, easily and conveniently. There was no longer an excuse for testing in main. As a way of enforcing this, we introduced the concept of Quality Engineering as code owners. This meant that a Quality Engineer had to approve a pull request before it could be merged to main. Just like that, we had an always deployable main branch.
We’ve continued to make improvements to our ephemeral environment process, and now have them tied to pull requests directly. They are created when a pull request is created (or if created as draft, once the PR is moved to ready for review), and then once the pull request is merged (or closed) the environment is automatically destroyed. Any code pushed to a pull request is automatically built and deployed to the environment too.
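The lifecycle described above can be summarised as a mapping from pull request events to environment actions. This is a minimal sketch rather than our actual automation: the event names and action names are hypothetical, and ignoring pushes to a draft pull request (because no environment exists yet) is an assumption.

```python
from typing import Optional


def environment_action(event: str, is_draft: bool = False) -> Optional[str]:
    """Map a pull request event to an ephemeral environment action.

    Event and action names are HYPOTHETICAL, chosen for illustration only.
    Returns None when nothing should happen.
    """
    if event in ("opened", "ready_for_review"):
        # Draft PRs get no environment until they are marked ready for review.
        return None if is_draft else "create"
    if event == "pushed":
        # ASSUMPTION: a draft PR has no environment yet, so there is nothing to deploy to.
        return None if is_draft else "build_and_deploy"
    if event in ("merged", "closed"):
        # Tear the environment down once the PR is finished with.
        return "destroy"
    return None


print(environment_action("opened"))                  # create
print(environment_action("opened", is_draft=True))   # None
print(environment_action("ready_for_review"))        # create
print(environment_action("pushed"))                  # build_and_deploy
print(environment_action("merged"))                  # destroy
```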
Build confidence gradually
This, for me, is the big takeaway from this experience. No matter how confident you are in your own work and what you are proposing, you need to allow others to build that same confidence gradually. That applies whether it’s confidence in your automated testing, the improved developer experience that comes with developer testing, or simply confidence in you as an engineering leader. Once you have that confidence, you need to treat it gently, because if you misuse it, you’ll find it’s a fragile thing that is far easier lost than gained.
In this story, that confidence was achieved by successfully making our way to weekly deployments, then experimenting in a safe way with daily deployments. When it came to proposing the changes required to make this permanent, I had the buy-in, support and confidence of our engineering community. So much so that when I began talking more widely about moving to a daily deployment pattern, the resounding response was, “If you say we can, let’s do it!”. In fact, the confidence from some was so high that not only did they believe we could now successfully move to daily deployments, but they also wanted me to begin looking at twice daily! That kind of support and confidence is priceless. It also got us to twice daily deployments and 480 hours of deployment lead time saved!