
Slow Database Queries
Fire an alert when there’s too many slow queries in the db cluster
The problem
When we first build the app, there’s usually not much data, so all database actions run really fast. In time, we build up data, and what we thought would scale suddenly doesn’t in unexpected ways. This can cause performance issues in production.
So why doesn’t the developer notice this and fix it in the next version? In the non-production environment the test data is probably not representative of real-world data. There’s a lot less, it’s usually 27 versions named “test”, and likely not much history. So in development, the queries still run fast.
Users are really sensitive to page render time. If the site is slow – even by a few hundred milliseconds – users will leave to your competitors. Google uses page speed to rank search results, and if you don’t get onto page 1 of Google’s results, you’ll likely not get discovered by new users.
Obviously maintaining fast query speed is essential. But it’s often difficult to identify slow queries unless you’re specifically looking for them, and even harder to reproduce the slow results. Was it the result of a noisy neighbor? Was the query deadlocked against another query running at the same time? It takes a lot of manual labor to discover and reproduce query performance bottlenecks.
The solution
This Shoreline automation watches query performance in the db cluster via the db’s CLI. When the number of queries taking longer than the configured threshold is exceeded, an alert is triggered in Shoreline and other task tracking systems like OpsGenie or PagerDuty. In the alert is the detail of all recent slow queries. The user can click a link in the alert to launch a Shoreline runbook allowing them to restart the cluster, a node in the cluster, or adjust partitioning, statistics, or indexes to increase query performance.