Major outage

Intermittent JVM Memory Issues

JVMs often face memory issues that can lead to hours of SSH-ing into box after box trying to catch the issue as it happens.
Customer experience impact: High (up to 30 minutes of poor performance)
Occurrence frequency: High (frequently until the root cause is identified)
Time to repair manually: 2-4 hours
Shoreline time to repair: 1-2 minutes

The problem

Java virtual machines (JVMs) often run into memory issues, usually because certain requests, payloads or jobs consume more memory than anticipated. The Java garbage collector is quite robust, so the situation eventually resolves itself, but while it is occurring, garbage collection takes priority and latency often spikes, leading to a poor customer experience. Permanently fixing this type of issue often requires a heap dump and garbage collection statistics that are only available while the issue is occurring. The situation is harder still because few people understand how garbage collection works, which makes diagnosis trickier. SREs are frequently asked to capture the debug data, which can lead to hours of SSH-ing into box after box trying to catch a JVM experiencing the memory issue.

The solution

With Shoreline, customers can set an alarm that fires when the heap size exceeds a certain threshold. Once the alarm fires, a script runs jcmd, jstack, jstat and jmap and captures their output to collect a heap dump, thread dump, GC stats and heap stats. Once this data is collected, it is pushed to a cloud storage service and the JVM is restarted. This all happens in 1-2 minutes, keeping the impact on the customer experience to a minimum. It also saves the SRE hours of exploratory work and ensures that engineering has everything it needs to fix the root cause of the issue.
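
As a rough illustration of the collection step, the sketch below shows one way a script could check heap usage and gather the four artifacts with the standard JDK tools. It is not Shoreline's actual script or alarm syntax: the function names (heap_used_kb, diagnostic_commands, collect), the output file names, and the mapping of each tool to an artifact are all assumptions made for this example. It assumes the JDK tools are on the PATH and run as the same user as the target JVM. Uploading to cloud storage and restarting the JVM are left out because they depend on the environment.

```python
import subprocess
from pathlib import Path


def heap_used_kb(jstat_gc_output: str) -> float:
    """Sum the used-space columns of `jstat -gc <pid>` output (values are in KB).

    Assumes the first line is the header and the second line holds the values.
    Columns missing or shown as "-" (some collectors) are skipped.
    """
    header, values = jstat_gc_output.strip().splitlines()[:2]
    row = dict(zip(header.split(), values.split()))
    used = 0.0
    for col in ("S0U", "S1U", "EU", "OU"):  # survivor, eden and old gen usage
        value = row.get(col, "-")
        if value != "-":
            used += float(value)
    return used


def diagnostic_commands(pid: int, out_dir: Path) -> dict[str, list[str]]:
    """Build one command per artifact; the tool-to-artifact mapping is an assumption."""
    heap_file = out_dir / f"heap-{pid}.hprof"  # hypothetical file name
    return {
        "heap_dump": ["jmap", f"-dump:format=b,file={heap_file}", str(pid)],
        "thread_dump": ["jstack", str(pid)],
        "gc_stats": ["jstat", "-gcutil", str(pid)],
        "heap_stats": ["jcmd", str(pid), "GC.class_histogram"],
    }


def collect(pid: int, out_dir: Path) -> None:
    """Run each command and save its stdout and stderr next to the heap dump."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, argv in diagnostic_commands(pid, out_dir).items():
        result = subprocess.run(argv, capture_output=True, text=True, check=False)
        (out_dir / f"{name}-{pid}.txt").write_text(result.stdout + result.stderr)
```

A caller might run `jstat -gc <pid>`, pass its output to heap_used_kb, and call collect(pid, Path("/tmp/jvm-diag")) only when the result exceeds a chosen threshold; both the threshold and the directory here are placeholders.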