How to Solve the Challenges of MELT Data at Scale

Recently, I read a paper by Slack on managing MELT challenges at scale.

(MELT stands for four data types: metrics, events, logs, and traces. The paper is definitely worth the read.)

I feel like their approach, which combines Prometheus, Kafka, Secor, S3, Spark, Elasticsearch, and Presto, is too complicated.

Because:

It's super expensive.
There are a lot of parts that can break down.
You have to keep a lot of things in mind just to run a query.

The core issue is that they're converting a distributed systems problem into a centralized problem by pushing data to a central location.

This approach is fundamentally broken because it requires storing terabytes of data and pushing lots of traffic over the network, even though you don't need that data 99.99% of the time.

And the bigger the data set, the slower it is to analyze.

So when you need it, it's slow.

Further, it's fragile because the time you need to observe your system is exactly the time when something's gone wrong.

During a network event, for example, that is often exactly when the telemetry stops arriving.

Finally, you have to be a wizard to predict what you'll care about in the future: when a new kind of event occurs, you won't have a dashboard or log ready for it.

For example, at AWS, one of the large-scale events I ran into was due to a BIOS upgrade by EC2.

But there was no way I would have been logging, or collecting metrics on, which BIOS version each host was running.

For such things, you need to be able to execute a query at scale across your fleet and see what's going on in the live environment.
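To make that concrete, here is a minimal Python sketch of an ad hoc fleet query. It is an illustration under stated assumptions, not Shoreline's API: the FLEET dictionary, query_node, and query_fleet are hypothetical names, and the "remote call" is simulated in-process so the example runs on its own. The point is that the question ("which BIOS versions are running right now?") is asked of the live nodes in parallel, and unreachable nodes are reported instead of failing the whole query.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical fleet state. In a real system each entry would live on its own host.
FLEET = {
    "node-1": {"bios_version": "1.0"},
    "node-2": {"bios_version": "1.1"},
    "node-3": {"bios_version": "1.1"},
    "node-4": None,  # simulates a node that cannot be reached
}

def query_node(node, field):
    """Stand-in for a remote call that reads one live attribute from a node."""
    state = FLEET[node]
    if state is None:
        raise ConnectionError(f"{node} unreachable")
    return state[field]

def query_fleet(nodes, field):
    """Fan the query out in parallel; return (value counts, unreachable nodes)."""
    counts, failed = Counter(), []
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = {pool.submit(query_node, n, field): n for n in nodes}
        for fut in as_completed(futures):
            try:
                counts[fut.result()] += 1
            except ConnectionError:
                # Keep the partial answer; just record who did not respond.
                failed.append(futures[fut])
    return counts, sorted(failed)

counts, failed = query_fleet(list(FLEET), "bios_version")
print(dict(counts), failed)
```

Nothing about the BIOS version had to be logged or exported ahead of time; the field is read at query time, which is what makes the unanticipated question answerable.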

That’s why, at Shoreline, we favor modeling the distributed system as a distributed system.

We keep the data locally at the edge and process it locally.

We invest in sophisticated query processing that executes commands in parallel across the fleet, against the data where it lives, in real time and with fault tolerance.
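One way to picture "process it locally" is as partial aggregation: each node reduces its raw samples to a small fixed-size summary, and only the summaries are merged. The sketch below is a generic illustration of that idea, assuming made-up latency samples and hypothetical names (Partial, summarize_locally, combine); it does not describe Shoreline's actual query engine.

```python
from dataclasses import dataclass

@dataclass
class Partial:
    """Fixed-size summary that can be merged in any order."""
    count: int = 0
    total: float = 0.0
    maximum: float = float("-inf")

    def merge(self, other: "Partial") -> "Partial":
        return Partial(
            self.count + other.count,
            self.total + other.total,
            max(self.maximum, other.maximum),
        )

def summarize_locally(samples):
    """Runs on the node: raw samples never leave the host."""
    summary = Partial()
    for s in samples:
        summary = summary.merge(Partial(1, s, s))
    return summary

def combine(partials):
    """Runs wherever the question was asked: merges the small summaries."""
    out = Partial()
    for p in partials:
        out = out.merge(p)
    return out

# Hypothetical per-node latency samples (ms); node-3 has no data yet.
local_data = {"node-1": [10.0, 20.0], "node-2": [30.0], "node-3": []}

fleet = combine(summarize_locally(v) for v in local_data.values())
mean = fleet.total / fleet.count
print(fleet.count, mean, fleet.maximum)  # prints: 3 20.0 30.0
```

The amount of data crossing the network depends on the number of nodes, not on how many samples each node holds, which is the property the centralized pipeline lacks.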

So we have an agent that collects data, analyzes it, and takes action when necessary.
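The collect, analyze, act loop can be sketched in a few lines. This is a toy model, not the Shoreline agent: EdgeAgent, its disk-usage threshold, and the three-reading window are assumptions chosen for illustration, and act() is a placeholder where a real agent would run a local remediation.

```python
from dataclasses import dataclass, field

@dataclass
class EdgeAgent:
    threshold_pct: float = 90.0   # hypothetical disk-usage alarm level
    window: int = 3               # consecutive breaches required before acting
    history: list = field(default_factory=list)
    actions_taken: int = 0

    def collect(self, reading):
        """Keep only the most recent `window` readings, locally."""
        self.history.append(reading)
        self.history = self.history[-self.window:]

    def needs_action(self):
        """Analyze: act only when every reading in a full window breaches."""
        return (len(self.history) == self.window
                and all(r > self.threshold_pct for r in self.history))

    def act(self):
        """Placeholder for a local repair, e.g. clearing temporary files."""
        self.actions_taken += 1
        self.history.clear()

    def step(self, reading):
        self.collect(reading)
        if self.needs_action():
            self.act()
            return True
        return False

agent = EdgeAgent()
readings = [85.0, 91.0, 93.0, 95.0, 70.0]   # simulated disk usage, percent
fired = [agent.step(r) for r in readings]
print(fired)  # prints: [False, False, False, True, False]
```

Because the decision is made from data the node already holds, the loop keeps working even if the node is cut off from everything else.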

Here are its advantages:

It scales with your fleet size by using a tiny bit of resources on each node.
As you increase the number of nodes, it gets scaled automatically.
There's no network latency because you don't have to push data to some central location.
Since it's running at the edge, the mean time to diagnose and repair for automated actions can be reduced to seconds.
It's intrinsically fault-tolerant because the edge node can take its local actions autonomously.

So, whether you use Shoreline or not, you must consider building these systems in a distributed, fault-tolerant manner.