As the GM responsible for developing Dataiku Online, a SaaS offering based on our core product Dataiku, I knew from day one that providing an outstanding customer experience was our top priority. Sure, we had to first build the SaaS solution, but we would soon have to ensure it ran as reliably as any world-class service.
Spike in Customer Usage Drives Spike in DevOps Toil
As we launched our new SaaS solution publicly in 2021 and made it the platform for all Dataiku free pilots, usage spiked. Our infrastructure team got busy managing the environment to ensure that performance and availability targets were exceeded for our customers. Because the environment is sophisticated and built on managed Kubernetes, we needed experienced engineers to handle every remediation quickly and safely.
Some of this work became quite repetitive, as the same issues cropped up daily, or more often. For example:
- With a large and growing fleet under heavy use by data scientists, of course disks were going to fill up.
- With new infrastructure coming online all the time, of course some metadata would sometimes get corrupted.
- With free trials open to the public, of course some users would operate outside the fair use policy of the trial program.
It was clear that our experienced engineers did not like fixing the same things again and again. There’s a reason Google SREs call it toil.
Automation Offered Better Experiences and Outcomes
While these issues were not quite predictable, their solutions were well understood. We knew that we couldn’t simply hire more engineers to handle these issues manually, because it would be impossible to retain them over time doing this unfulfilling, repetitive work. We saw the opportunity to automate remediations for these incidents and others like them. This would save countless hours for our infrastructure team, allowing them to focus more on engaging projects, while simultaneously delivering a better experience for our customers.
Based on a recommendation from our investor at Dawn Capital, we met with the team at Shoreline.io and learned how they might accelerate the elimination of unnecessary toil. (Dawn invested in Shoreline the following year.) Shoreline’s pitch demonstrated how they had solved a similar use case, and we grew confident that they could be a close to out-of-the-box solution to our current challenges.
The first problem we wanted to solve was freeing up or adding space on disks that could fill rapidly as customers ran their analyses. Handling these situations manually was taking 1-2 hours per day, and that figure was growing.
With Shoreline, we can collect custom-defined metrics to create precise alarm triggers that detect when customers have limited space left. On top of that, we set up advanced automated remediation: cleaning some temporary files or triggering a resize of the disk. Now, before a full disk can affect the user experience, Shoreline takes care of the issue automatically. This automation runs all day, overnight, and all weekend long, reducing stress on our team.
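The decision behind this kind of remediation can be sketched in a few lines of Python. This is a minimal illustration, not Shoreline's API or our production configuration: the thresholds, the `DiskState` type, and the function names are all hypothetical, chosen only to show the "alarm, then clean up or resize" logic described above.

```python
import shutil
from dataclasses import dataclass

# Hypothetical thresholds for illustration; the real values are not stated here.
ALARM_THRESHOLD = 0.85  # alarm fires once the disk is at least 85% full
TARGET_USAGE = 0.70     # cleanup alone is enough if it brings usage to 70% or less


@dataclass
class DiskState:
    total_bytes: int
    used_bytes: int
    temp_bytes: int  # space held by temporary files assumed safe to delete

    @property
    def used_fraction(self) -> float:
        return self.used_bytes / self.total_bytes


def choose_remediation(disk: DiskState) -> str:
    """Return 'none', 'clean_temp', or 'resize' for the given disk state."""
    if disk.used_fraction < ALARM_THRESHOLD:
        return "none"  # alarm not triggered, nothing to do
    # Estimate usage if every temporary file were removed.
    after_cleanup = (disk.used_bytes - disk.temp_bytes) / disk.total_bytes
    if after_cleanup <= TARGET_USAGE:
        return "clean_temp"  # the cheaper fix is sufficient
    return "resize"  # cleanup would not free enough space


def read_disk_state(path: str, temp_bytes: int = 0) -> DiskState:
    """Build a DiskState from the local filesystem using shutil.disk_usage."""
    usage = shutil.disk_usage(path)
    return DiskState(usage.total, usage.used, temp_bytes)
```

Trying the cheap action first and falling back to a resize only when cleanup would not be enough is one reasonable ordering; other policies could fit just as well.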
Shoreline offers a lot of power to our small team. Whenever we find a fix that can be automated, we automate it. In the span of 6 months, as of June 2022, we have automated the following solutions:
- Ensuring we never run out of storage space
- Restoring corrupted metadata to standard specs
- Shutting down inappropriate use of the platform
There is nothing better than seeing that Slack message saying another issue has been automatically resolved. Or that one more disk has plenty of space. Or that one more illegitimate use of our free trial program was stopped. And on and on.
Results Validate Automation Strategy and Enable Scale
It’s easy to measure time savings from automation. Each time the automation is run, you can just add up the minutes and hours for work that wasn’t required. Almost 170 remediations were automatically triggered last month, conservatively saving over 20 FTE days of DevOps work, while improving app performance.
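As a rough sanity check, the figures above can be turned into an average saving per remediation. The sketch below assumes an 8-hour FTE day, which is an assumption made for illustration and not a number from our tracking. It also treats "almost 170" and "over 20" as exactly 170 and 20.

```python
HOURS_PER_FTE_DAY = 8  # assumption for illustration only

remediations = 170     # "almost 170" automated remediations in the month
fte_days_saved = 20    # "over 20" FTE days saved, taken at its lower bound

hours_saved = fte_days_saved * HOURS_PER_FTE_DAY
minutes_per_remediation = hours_saved * 60 / remediations

print(f"{hours_saved} hours saved, "
      f"about {minutes_per_remediation:.0f} minutes per remediation")
```

Under those assumptions, each automated remediation stands in for a little under an hour of manual work.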
Perhaps even more impactful are all the benefits of our approach that are harder to measure:
- The impact of improved customer experience, or avoidance of a negative experience due to an outage that was prevented
- A happier development team that enjoys building new software and new automations much more than repeatedly working on the same issues. This builds their coding skills, which is great for everyone.
- A more productive infrastructure team with low turnover, which saves our management group a great deal of hiring time. Hiring is especially hard in today’s competitive job market.
- More effective and efficient scaling of the business. Each new region means more ops work, because cloud regions are set up separately, but our automations are rolled out to all regions.
For all these reasons, as we bring each automation online, we briefly celebrate the benefits it will bring, then quickly shift to asking, what can we automate next?
Looking to the future, we see an opportunity to start using Shoreline Notebooks to let our support team get information and take pre-approved, safe actions on our production cloud. This would enable them to handle more tasks without giving them full access to the infrastructure. We’ll expand what they can do while enforcing best practices with full auditability and permission management.