DevOps automation is the process of getting machines to handle repetitive work in the software deployment and operations lifecycle so that operators can deploy iterative updates faster and their systems operate more reliably.
Since the term DevOps was coined in 2009, automation has moved steadily rightward through the lifecycle: from automating development, integration, and delivery work to today's frontier on the operational side, where new tools automate observability, reliability, and remediation.
Why should you automate DevOps work?
From an engineer’s perspective, DevOps tools empower development teams and make them more effective. By decreasing cross-team dependency and avoiding manual processes for infrastructure provisioning and configuration, developers increase the frequency of releases and receive faster feedback, improving their overall experience.
From the business perspective, DevOps automation reduces the lead time to deploy features. Automation also increases platform reliability and availability through auto-healing (finding and fixing errors automatically) and by reducing incidents caused by human error or environment inconsistencies. Additionally, it reduces the need for large operations teams, minimizes duplicated and repetitive effort across development teams, and reduces cross-team friction.
DevOps tools can and should be used to achieve those outcomes, and DevOps techniques integrate naturally with Agile methodology.
What should be automated?
If you’re looking for where to focus your automation efforts, we recommend starting with where your largest bottlenecks are.
According to Puppet, the goal of DevOps automation is to progress towards an entirely self-service model where:
- incident responses are automated
- resources are available to developers on demand
- applications are re-architected based on business needs
- security teams are involved in the design and development
To get here, organizations progress through a series of automation stages, solving common problems along the way, including slow service provisioning (either being too complex or requiring too much effort and cross-team collaboration) and difficulties setting up test and deployment pipelines.
Once organizations have automated the bottlenecks engineers hit when releasing and deploying new code to production, dramatically speeding up engineering velocity in the process, they start looking at what they can automate in the operational stage, when software is running in production and delivering business value.
For operations bottlenecks, research recent incident data—particularly repetitive incidents. Identify problems that cause prolonged or frequent incidents. Improvements should increase reliability and availability for the platform, either by preventing a class of incidents or decreasing their impact.
We recommend asking what the biggest sources of toil are for your engineering department. Toil is work that is manual, repetitive, automatable, and devoid of enduring value. Reducing toil frees engineers to work in other areas.
After selecting the problems that automation can help you with, define what success would look like and build a business case.
How should you automate DevOps processes?
With your automation problem in hand, try to identify what tools exist to solve your problem.
As a general rule, it's usually more effective to use readily available tools and standards instead of building and maintaining your own.
When adding a new tool to your technology stack, think about:
- Direct costs (licensing and hosting)
- Rollout effort (initial investment)
- Maintenance effort (ongoing investment)
- Complexity added to the system
- Reliability and support requirements
- Breadth (what other problems the tool can solve)
Ideally, you're looking for a tool that's not only flexible enough to solve your current problem but also useful for future challenges. Favor tools that can be reused in other layers without placing a heavy burden on the teams maintaining or using them. You should also decide which tools you'd host yourself and which you'd prefer to consume as SaaS.
Let’s get into the areas where tooling exists.
Continuous Integration and Continuous Delivery
One of the tenets of DevOps is the ability to safely and reproducibly deploy artifacts to all environments consistently. So it’s no surprise that the most established and popular class of DevOps tools are the CI/CD orchestrators required to build code and deploy artifacts safely and consistently (such as Jenkins, GoCD).
CI/CD tools allow you to create a deployment pipeline that starts from a commit in a version control system (such as Git, SVN), assesses it with several different quality checks (code linting, unit tests, integration tests, and end-to-end tests) and, if all quality checks pass, deploys that version to production. The deployment pipeline includes continuous integration (CI), continuous delivery (CD), and infrastructure provisioning, depending on your architecture.
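To make the pipeline flow concrete, here is a minimal sketch of the stage-by-stage logic a CI/CD orchestrator runs. The stage names and placeholder commands are illustrative, not taken from any specific tool; real pipelines define these in the tool's own configuration (a Jenkinsfile, for example).

```python
import subprocess

# Hypothetical quality-check stages; each maps a stage name to a
# shell command. "true" stands in for real linter/test commands.
STAGES = [
    ("lint", ["true"]),
    ("unit-tests", ["true"]),
    ("integration-tests", ["true"]),
]

def run_pipeline(stages=STAGES):
    """Run each quality check in order; deploy only if every check passes."""
    for name, cmd in stages:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            # A failing check stops the pipeline before deployment.
            return f"failed at {name}"
    return "deployed"
```

The key property is the early exit: a failing check blocks the deploy step, which is what makes the pipeline a safety gate rather than just a script runner.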
Feature flagging tools are also part of this group, as they are used to quickly deploy code to production in a safe and controlled way.
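The core of a feature flag is a runtime gate that decides per-user whether new code is active. This is a minimal sketch of that idea; real flagging tools add targeting rules, gradual percentage rollouts, and audit trails, and the flag shape below is an assumption for illustration.

```python
# Hypothetical flag store: each flag has an on/off switch and an
# optional allow-list of user groups (empty set = everyone).
FLAGS = {
    "new-checkout": {"enabled": True, "allowed_groups": {"internal-staff"}},
}

def is_enabled(flag_name, user_groups, flags=FLAGS):
    """Return True if the flag is on and the user is in an allowed group."""
    flag = flags.get(flag_name)
    if flag is None or not flag["enabled"]:
        return False
    allowed = flag["allowed_groups"]
    return not allowed or bool(allowed & set(user_groups))
```

Because the gate is evaluated at runtime, code can ship to production dark and be switched on for internal staff first, exactly the controlled rollout described above.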
Every company should use a CI/CD tool. Questions you may want to consider include:
- How much effort will it require to provision and maintain the tool to the needed availability?
- How much effort does it require to support and maintain your pipelines?
- How easy is it to create pipelines for development teams?
- How easy is it to create shared pipeline templates for multiple teams?
- What security features do you need from your tools?
Those questions will allow you to select the best tool for your scenario.
Configuration and Infrastructure as Code Tools
Keeping all infrastructure, configuration, and application code stored in a version control system is another crucial component of DevOps automation. The ability to define infrastructure and configuration as code allows engineers to apply the same scrutiny and auditing to infrastructure code as they do to application code.
There are several different flavors of tools:
- Infrastructure provisioning: tools used to provision infrastructure components (such as network components, managed services, virtual machines) from code. Examples: Terraform, Pulumi, CloudFormation.
- Configuration management: used to configure the operational system, software requirements, package dependencies, and system files within machines. Examples: Chef, Puppet, Ansible, Packer.
- Container technologies: used to provision vendor-agnostic container orchestrators to run containerized applications. Examples: Kubernetes, OpenShift, Nomad.
- Serverless frameworks: used to aid the deployment of serverless functions. Examples: Serverless Framework, Chalice, AWS CDK.
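What these tool families share is a declarative, idempotent model: you describe the desired state, the tool compares it with actual state, and only the difference is applied. Here is a toy sketch of that plan/apply loop; the function names and state shape are illustrative, not any real tool's API.

```python
def plan(desired, actual):
    """Return the changes needed to move `actual` to the `desired` state."""
    return {key: value for key, value in desired.items()
            if actual.get(key) != value}

def apply_plan(actual, changes):
    """Apply the planned changes in place and return the new state."""
    actual.update(changes)
    return actual
```

Running the plan twice against the same desired state yields an empty diff the second time, which is what makes these tools safe to re-run: applying the same code again changes nothing.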
Mature DevOps teams use a combination of tools to achieve infrastructure and configuration as code. Tools vary based on the technology stack and business requirements. When evaluating tools, ensure they allow your teams to deploy infrastructure safely and easily, while also sharing templates and standards with ease.
Observability and Monitoring tools
Monitoring and observability tools are a newer, less established class than those used for CI/CD.
Tools in the observability and monitoring space include application log services (ELK stack) and data-gathering agents and visualization tools for metrics and instrumentation data (Prometheus, Grafana, Datadog, New Relic). The space also includes monitoring systems that generate alerts when the platform isn't working as expected, based on metrics, logs, or health checks (Sensu, Nagios, Dynatrace, CloudWatch).
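At their core, these alerting systems evaluate rules against a stream of metric values. The sketch below shows that threshold-check loop in miniature; the rule shape and metric names are assumptions for illustration, not any vendor's configuration format.

```python
# Hypothetical alert rules: each maps a metric to a maximum value
# and the alert to fire when the metric breaches it.
RULES = [
    {"metric": "error_rate", "max": 0.05, "alert": "High error rate"},
    {"metric": "p99_latency_ms", "max": 500, "alert": "Slow responses"},
]

def evaluate(metrics, rules=RULES):
    """Return the alerts fired by the current metric snapshot."""
    return [rule["alert"] for rule in rules
            if metrics.get(rule["metric"], 0) > rule["max"]]
```

In a real system the fired alerts would be routed to a paging tool rather than returned, but the rule evaluation is the same idea.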
Resilience and Reliability
Relying on human intervention for reliability (either for problem identification or remediation) is unsustainable in the long term. There’s a lot of innovation happening to automate the organizational incident process of on-call management, incident process management, and remediation.
Breaking down the tools:
- On-call Management: defines on-call shifts and notifications, sending notifications to the on-call engineer in case of a problem (triggered from the monitoring tools). These are sometimes referred to as "paging systems" for historical reasons (PagerDuty, VictorOps).
- Incident Management: stores data from past and current incidents, allowing deep investigation and communication with external parties (Kintaba, Blameless, FireHydrant, Transposit).
- Automated Remediation Systems: tools that centralize remediation and auto-healing processes, preventing or fixing repetitive incidents without human intervention. (Shoreline).
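These three tool families fit together in one flow: try an automated fix for a recognized incident type, and page the on-call engineer only when there is no playbook or the fix fails. The sketch below shows that flow; the playbook structure and function names are illustrative assumptions.

```python
def restart_service(incident):
    """Placeholder for a real runbook action; returns True on success."""
    return True

# Hypothetical mapping from incident type to an automated fix.
PLAYBOOKS = {"service_down": restart_service}

def handle_incident(incident, page):
    """Auto-remediate known incidents; escalate everything else."""
    action = PLAYBOOKS.get(incident["type"])
    if action and action(incident):
        return "auto-remediated"
    page(incident)  # fall back to the on-call/paging system
    return "escalated"
```

Each incident type moved into the playbook is one fewer page for the on-call rotation, which is how these tools chip away at operational toil.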
What does good DevOps automation look like?
Let's consider a fictitious environment where all the base infrastructure is managed as code, using Terraform.
When a new application is created, its pipeline is defined by using a shared library in its Jenkinsfile, which automatically includes all default steps and required quality checks, as well as build notifications to the company communication tool (for example, their Slack channel). The application repository also contains a Terraform file with some basic configuration (for example, memory and CPU requirements and a health check endpoint) and uses a module shared by all applications. This module contains the required DNS, load balancer, and container configuration, plus some automatic monitoring.

Developers do not need to set up their pipelines or default infrastructure, as these are all controlled by the shared code. If they need extra resources like data storage, they can easily add additional resources to their Terraform files.
After a repository commit, Jenkins runs the new pipeline. After running the commands for code linting, unit tests, and integration tests, the pipeline triggers a Terraform deployment to the staging environment and runs end-to-end tests. A Slack notification alerts developers to check the result and further run any manual tests they desire. If developers are happy with the result, they can promote the build in Jenkins via a manual approval button, triggering a Terraform deployment to production. Controlled by feature flags, this new feature is only available for internal staff.
Engineers ensure that monitoring and resilience are adequate before the feature flag switches on for external customers.
And once the code is live in production, the operations teams can build automations that handle common issues without cutting a ticket. For example, creating a job to automatically resize disks that are running out of space, or a job to identify TLS certificates nearing expiration and auto-renew them with HashiCorp Vault. These operational automations live in version control alongside the rest of the Terraform code.
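The decision logic behind those two example jobs is simple to sketch. The thresholds, field names, and renewal window below are illustrative assumptions; the actual resize and renewal actions would call the cloud provider and Vault.

```python
import datetime

def disks_to_resize(disks, usage_threshold=0.85):
    """Disks above the usage threshold are candidates for automatic resize."""
    return [d["name"] for d in disks
            if d["used"] / d["size"] > usage_threshold]

def certs_to_renew(certs, now, window_days=30):
    """Certificates expiring within the window should be auto-renewed."""
    cutoff = now + datetime.timedelta(days=window_days)
    return [c["domain"] for c in certs if c["expires"] <= cutoff]
```

Run on a schedule, checks like these turn two recurring ticket classes into background maintenance.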
Tools to automate testing and deployment, like CI/CD and Terraform, are now very well established, dramatically reducing the time (and toil) it takes to ship code. A great deal of innovation is happening in the tools we use to operate our software in production; we see this in observability and incident management, for example, and in automated remediation tools like Shoreline.
As we continue to offload the repetitive work of operating software in production, businesses will see gains in reliability and customer satisfaction, and engineering organizations will benefit from fewer tickets and fewer 3 a.m. outages.