
Boosting Developer Productivity by 25% with Incident Automation

Shoreline recently helped Razorpay, a FinTech leader in India, elevate their system reliability and improve developer productivity by 25% as part of their strategic initiative for incident automation.

Vikrant Saini, Technology Operations Leader at Razorpay, recently published a blog post about an innovative approach to developer productivity: automated incident management. In the post, Vikrant describes Shoreline’s role in helping Razorpay with runbook automation as part of its Incident Management charter, which resulted in a 20% reduction in “Time to Isolate” and a 25% improvement in developer productivity.

What did it take for Razorpay to realize these improvements?

The Challenge: Reducing Dependency on Manual Intervention

As a leading payments aggregator processing $100 billion annually with 10 million merchants, Razorpay encountered a recurring challenge. Its incident management team, which handles numerous incidents each month, frequently needed to pull on-call engineers into incident management calls. During these calls, they primarily ran basic SQL queries to identify issues, many of which originated outside the Razorpay ecosystem. This repetitive task was more than a time sink during the call; it was a barrier to the productivity and efficiency of both the incident management team and the engineers. Because incident escalation comes at a heavy cost, this was an ideal use case for automation.

Shoreline's Solution: Incident Automation to Reduce Toil

Razorpay recognized the potential to automate the Standard Operating Procedures that execute these SQL queries and brought in Shoreline to collaborate on the solution. The idea was simple yet powerful: if an issue could be quickly identified through predefined steps before an incident management call even started, why not let an automated system handle it? This would shorten the time to incident resolution and free up engineers for more critical tasks, enhancing overall productivity.
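To make the idea of predefined steps concrete, here is a minimal sketch in Python of an alert-triggered runbook that runs a fixed set of diagnostic SQL queries and produces a text report that could be posted before a call starts. It is not Shoreline's API or Razorpay's implementation: the alert name, the `payments` table, its columns, and the `RUNBOOKS` mapping are all hypothetical, and an in-memory SQLite database stands in for a real data store.

```python
import sqlite3

# Hypothetical mapping from alert name to predefined diagnostic queries.
# Alert names, table names, and columns are illustrative assumptions only.
RUNBOOKS = {
    "payment_failure_spike": [
        (
            "Failures by gateway",
            "SELECT gateway, COUNT(*) AS failures FROM payments "
            "WHERE status = 'failed' "
            "GROUP BY gateway ORDER BY failures DESC, gateway",
        ),
    ],
}


def run_runbook(conn, alert_name):
    """Run every predefined query for an alert and return a text report."""
    steps = RUNBOOKS.get(alert_name)
    if steps is None:
        return f"No runbook registered for alert '{alert_name}'."
    lines = [f"Runbook results for alert '{alert_name}':"]
    for title, query in steps:
        lines.append(f"* {title}")
        cursor = conn.execute(query)
        columns = [col[0] for col in cursor.description]
        for row in cursor.fetchall():
            pairs = ", ".join(f"{c}={v}" for c, v in zip(columns, row))
            lines.append(f"    {pairs}")
    return "\n".join(lines)


def demo_connection():
    """Build a small in-memory database with made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE payments (id INTEGER, gateway TEXT, status TEXT)")
    conn.executemany(
        "INSERT INTO payments VALUES (?, ?, ?)",
        [
            (1, "bank_a", "failed"),
            (2, "bank_a", "failed"),
            (3, "bank_b", "failed"),
            (4, "bank_b", "captured"),
        ],
    )
    return conn


if __name__ == "__main__":
    print(run_runbook(demo_connection(), "payment_failure_spike"))
```

In a real setup the report would be sent to the incident channel rather than printed, and the queries would be the ones senior engineers had already approved for each alert.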

The automation solution was a collaborative effort between Razorpay’s Incident Management, Infrastructure, and Development teams and the Shoreline team. Together, the teams identified the top five alerts by trigger frequency and meticulously crafted a Runbook for each. Runbooks are the Standard Operating Procedures, or instructions, that guide actions when incidents occur. These Runbooks were approved by senior engineers and integrated into Shoreline. Once automated, each Runbook underwent multiple rounds of testing to ensure that the results were accurate and that the teams could interpret them effectively.
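Picking the top alerts by trigger frequency is a simple counting exercise. The sketch below assumes the alert history is available as a plain list of alert names, one entry per trigger; the alert names and the `top_alerts` helper are hypothetical and only illustrate the selection step.

```python
from collections import Counter


def top_alerts(alert_events, n=5):
    """Return the n most frequently triggered alerts as (name, count) pairs."""
    return Counter(alert_events).most_common(n)


if __name__ == "__main__":
    # Made-up alert history: one entry per time an alert fired.
    history = (
        ["payment_failure_spike"] * 7
        + ["refund_queue_backlog"] * 4
        + ["webhook_delivery_errors"] * 2
    )
    for name, count in top_alerts(history, n=2):
        print(f"{name}: {count}")
```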

Streamlining Communication: A Pre and Post-Automation Comparison

Automation significantly altered how incidents were communicated. Previously, Slack threads for incident resolution were bustling with inputs from various teams. Post automation, these threads saw a remarkable reduction in messages from engineers, as the investigation results were now being automatically published.

Implementing automation at Razorpay with Shoreline was a meticulous, multi-step process. First, the team identified repetitive alerts that could be automated and created detailed Runbooks outlining the steps for automated resolution. These Runbooks were then integrated into Shoreline's automation tool and rigorously tested for accuracy and efficiency. The process also established a standardized procedure for Runbook automation that emphasized clarity and ease of understanding. Approval from senior engineers and multiple dry runs were critical to the project's success, because they aligned the automation with Razorpay's operational standards.

This transition to automated incident management gradually shifted responsibilities from engineers to the system, markedly reducing response times and manual intervention. The outcome was twofold: a significant reduction in the “Mean Time to Isolate,” which improved application availability, and a notable boost in developer productivity, as engineers were freed from time-intensive incident management tasks.

The Outcomes: A Testament to Success

Our collaborative efforts yielded impressive results, and we couldn’t say it better than Vikrant himself:

“Here at Razorpay, we’re thrilled to report a significant 20% reduction in the “Time to Isolate,” i.e., identifying the root cause of service disruption incidents. In numerous cases, the automation of runbooks via Shoreline has empowered our teams to swiftly pinpoint the incident’s cause in a remarkable span of less than 5 minutes. These results demonstrate the tangible benefits of our automation efforts, enhancing efficiency and reducing downtime with remarkable success.”

- Vikrant Saini, Technology Operations Leader