Our service was eating memory alive.
512% above allocation. Emergency rolling restarts every few hours. Grafana dashboards screaming red. And the worst part — we had no idea why.
What started as a routine monitoring alert turned into a 72-hour detective story that taught us everything we thought we knew about Go resource management was wrong. If you’ve ever stared at a memory graph climbing toward the ceiling at 3 AM, this story might save your service — and your sanity.
The Setup: A Service That Shouldn’t Have Been This Hard
Our file service was supposed to be simple. Upload files, download files, generate public URLs. Built in Go, deployed on Kubernetes, backed by S3-compatible storage. The kind of service you’d expect to set up once and forget about.
It worked with multiple storage providers — Minio, Google Cloud Storage, local NVMe — which made it flexible. It also made it complex. And complexity, as we’d soon learn, is where memory leaks hide.
The Crisis: When Everything Starts Bleeding
The symptoms hit us like a slow-motion car crash:
- Memory climbed past 2GB and kept going
- Rolling restarts gave us maybe 30 minutes of relief before the climb resumed
- Every dependent service started showing the same pattern — like a contagion
We tried the obvious fixes first. Restart the deployment. Bump the memory limits. Check for obvious leaks in the upload path. Nothing worked. The memory kept climbing.
That’s when we knew: this wasn’t a simple leak. This was something we weren’t seeing.
The Investigation: Following the Breadcrumbs
We turned to our Grafana dashboards and found our first real clue: goroutine count was climbing alongside memory. Not spiking — steadily, relentlessly increasing. Every request added goroutines. Some of them never left.
“When memory grows but garbage collection can’t reclaim it, look to your goroutines. They’re often holding onto resources long after their work is done.”
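If you export metrics with the standard Prometheus client, that goroutine signal comes for free: the default registry includes a Go collector that publishes go_goroutines and go_memstats_* alongside your own metrics. A minimal sketch:

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
    // The default registry already includes the Go collector, so
    // /metrics exposes go_goroutines and go_memstats_* with no extra code.
    http.Handle("/metrics", promhttp.Handler())
    http.ListenAndServe(":8080", nil)
}

Plot go_goroutines next to container memory in Grafana, and a leak like ours shows up as two lines climbing in lockstep.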
We set up K6 stress tests, watched /metrics in real time, and fired up pprof. The profiler pointed to two suspects:
- MongoDB connection pool — holding onto connections
- HTTP server goroutines — not properly closing
MongoDB was a false alarm. It only touched metadata logs. The HTTP goroutines were the real problem — but figuring out why they weren’t closing would take us down a rabbit hole we didn’t see coming.
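If you haven't wired pprof into a service before, it's one blank import and one internal port. A sketch (the port and layout are illustrative):

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
    // Serve the profiler on a loopback-only port, away from public traffic.
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()
    // ... start the real service here ...
}

From there, go tool pprof http://localhost:6060/debug/pprof/goroutine dumps every live goroutine with its stack trace; a leak shows up as an ever-growing count piled onto the same frames.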
The Twist: It Wasn’t What We Thought
We focused on uploads first. The upload flow opened a file, copied it into memory to calculate its size, then sent it to Minio. That copy step created a duplicate of every upload in memory. We removed the copy, passed -1 as the object size so the Minio client streams the upload without needing to know the size up front, and deployed.
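In minio-go terms, the change looks roughly like this (a sketch, not our exact wrapper code; names are illustrative):

import (
    "bytes"
    "context"
    "io"

    "github.com/minio/minio-go/v7"
)

// Before: buffer the whole file just to learn its size,
// duplicating every upload in memory.
func uploadBuffered(ctx context.Context, c *minio.Client, bucket, name string, file io.Reader) error {
    data, err := io.ReadAll(file) // full in-memory copy
    if err != nil {
        return err
    }
    _, err = c.PutObject(ctx, bucket, name, bytes.NewReader(data),
        int64(len(data)), minio.PutObjectOptions{})
    return err
}

// After: pass -1 as the size and minio-go streams the upload in
// multipart chunks, so no duplicate ever lives in memory.
func uploadStreaming(ctx context.Context, c *minio.Client, bucket, name string, file io.Reader) error {
    _, err := c.PutObject(ctx, bucket, name, file, -1, minio.PutObjectOptions{})
    return err
}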
Memory improved slightly. But it still climbed.
That partial win told us something critical: uploads were contributing, but they weren’t the main problem. The real leak was hiding in the download path.
The Breakthrough: A File That Never Closed
Here’s what we found in the download handler:
- Service downloads file from S3
- Forwards it directly to the client
- Never closes the file handle
The challenge was subtle. You can't just defer file.Close() in the storage function that opens the stream: the deferred close fires as soon as that function returns, terminating the stream before the client has finished downloading. So we left it open. And every download request left another file handle dangling.
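In code, the leaky shape looked something like this (reconstructed for illustration; storage.GetFile stands in for our storage layer):

func downloadHandlerLeaky(w http.ResponseWriter, r *http.Request) {
    // The storage layer hands back a plain io.Reader, so the handler
    // has no Close method to call even if it wanted to.
    fileReader, err := storage.GetFile(r.Context(), r.URL.Query().Get("file"))
    if err != nil {
        http.Error(w, "File retrieval failed", http.StatusInternalServerError)
        return
    }
    io.Copy(w, fileReader) // the underlying S3 stream is never closed
}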
How do you close a resource after the client finishes downloading, when you don’t know when they’re done?
The Fix: Context Cancellation to the Rescue
Two insights cracked it:
- Change io.Reader to io.ReadCloser — giving us explicit control over when the resource closes
- Use Go's context cancellation to detect when a client disconnects or the request completes
Here’s the simplified version:
func downloadHandler(w http.ResponseWriter, r *http.Request) {
    // Illustrative: pull the object name from the query string
    filename := r.URL.Query().Get("file")
    // Get file as ReadCloser, not just Reader
    fileReader, err := storage.GetFileAsReadCloser(r.Context(), filename)
    if err != nil {
        http.Error(w, "File retrieval failed", http.StatusInternalServerError)
        return
    }
    // Clean up when the request context is done: it is canceled when
    // the client disconnects or the handler returns
    go func() {
        <-r.Context().Done()
        fileReader.Close()
    }()
    // Stream to the client; a mid-transfer error usually means the
    // client went away, so there is nothing left to send them
    if _, err := io.Copy(w, fileReader); err != nil {
        log.Printf("download interrupted: %v", err)
    }
}
That small goroutine watching r.Context().Done() is the key. When the client disconnects or the handler returns, the context is canceled, the Done channel closes, and the file closes with it. No more dangling handles. No more memory bleeding.
The Results: From 512% to Stable
After deploying the fix, we ran the same stress tests. The difference was night and day:
- Memory fluctuated normally — rose under load, dropped after
- Goroutine count stabilized — no more relentless climb
- Zero emergency restarts — the service just worked
The graph went from a staircase to a heartbeat.
Five Lessons That Changed How We Write Go
1. Always Close What You Open (Even the Hard Cases)
It sounds obvious until you’re staring at a 2GB memory leak. Go’s garbage collector handles memory, not file handles, network connections, or goroutines. If you open it, you own it.
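A tiny illustration of what the GC can't save you from:

func leak() {
    ch := make(chan int)
    go func() {
        <-ch // nobody ever sends on ch, so this goroutine never exits
    }()
    // The goroutine's stack and everything it references stay reachable
    // forever; to the garbage collector, they are still "in use".
}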
2. Context Cancellation Is Your Cleanup Trigger
Go’s context package isn’t just for timeouts. It’s a lifecycle coordination tool. Listen for ctx.Done() and you can trigger cleanup even when you can’t directly control the operation’s end.
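The download fix generalizes into a tiny helper you can reuse anywhere a resource should live exactly as long as its context (a sketch; it assumes the Closer is safe to close once the context ends):

import (
    "context"
    "io"
)

// closeWhenDone ties a resource's lifetime to a context: the moment
// the context is canceled, the resource is closed.
func closeWhenDone(ctx context.Context, c io.Closer) {
    go func() {
        <-ctx.Done()
        c.Close()
    }()
}

With that, the handler's cleanup collapses to closeWhenDone(r.Context(), fileReader).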
3. Choose Interfaces That Expose What You Need
The difference between io.Reader and io.ReadCloser was the difference between a leak and a fix. Ask yourself: does this interface give me the operations I need for proper resource management?
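For reference, the standard library definitions make the gap explicit: Reader can only read, while ReadCloser adds the one method that matters for cleanup.

// From the io package:
type Reader interface {
    Read(p []byte) (n int, err error)
}

type Closer interface {
    Close() error
}

type ReadCloser interface {
    Reader
    Closer
}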
4. Monitor Everything — Especially Goroutines
Without Grafana showing goroutine count alongside memory, we’d have chased red herrings for days. Comprehensive monitoring doesn’t just alert you — it gives you the clues to solve the mystery.
5. Test Incrementally and Measure Everything
Our step-by-step approach — test, fix, measure, repeat — was crucial. The first fix didn’t solve everything, but it validated our understanding and narrowed the search. Don’t guess. Measure.
The Real Takeaway
Memory leaks in Go are rarely about memory. They’re about resources — goroutines, file handles, connections — that the garbage collector doesn’t know about. Understand how Go’s concurrency model interacts with system resources, and you’ll build services that stay stable under load.
The next time you see memory climbing without bounds, don’t just check allocations. Check your goroutines. Trace your resource usage. Make sure everything that opens, closes.
Hit a similar wall with Go resource management? What patterns have worked for you? Share your experience in the comments or reach out — I’m always interested in how other teams handle these problems.