How Loud Should Your Systems Be When They Break?
In “Notify on Success? Notify on Failure? No — Notify Intelligently,” I shared a story about a side project that failed silently for months: a scraper that was supposed to run every 10 minutes, but stopped without a trace. I had notifications set up, but they only fired if the job ran. When it didn’t run at all, I got nothing.
That experience burned something into my brain that I didn’t have words for at the time:
Not all failures are equal — and not all systems need to scream when they fail.
Some systems will scream. Others will die quietly in a corner unless you check on them. And if you don’t design your observability to match that reality, you’ll either miss something important… or drown in noise.
🔊 Systems with Inherently Loud Failures vs 🔇 Systems without Inherently Loud Failures
Over the years, I’ve started mentally sorting systems into two categories:
🔊 1. Systems where Failures are Inherently Loud
These are the systems that you’ll hear about whether you want to or not.
Think: critical data pipelines, production APIs, batch jobs that feed dashboards used by actual humans.
Example:
I once worked on a nightly SAP → PySpark → Athena → Tableau pipeline. If the job failed, it technically triggered an alert — but we didn’t really need it. By 8:00am PST, we’d have a dozen Slack messages from business users asking where the data was.
The failure wasn’t invisible. It was socially surfaced by its consumers.
For systems like this:
- You’ll know when it breaks, alert or no alert
- Observability is still valuable, but mostly for debugging, not detection
- Your main goal is fast root cause analysis and a low mean time to recovery
🔇 2. Systems where Failures are not Inherently Loud
These are the systems that fail quietly. Systems that do a job in the background, without anyone checking in regularly.
Think:
- Cron jobs that sync backups or data to S3
- Scripts that send monthly summary reports
- Side scrapers pulling data every X minutes
- Low-priority analytics pipelines
If these stop working, nobody will notice. There’s no dashboard refresh. No angry Slack message. No built-in social feedback loop.
These are the systems that break… and stay broken… until you randomly check one day and realize they haven’t run in three months.
For systems like this:
- Failure needs to be detected externally
- Logging isn’t enough: a job that never runs writes no logs, so you need a heartbeat whose absence gets noticed, not just logs of activity
- Uptime monitoring, beacon files, and dashboards are your best tools
Silent systems require a different kind of observability — not just logs, but proof of life.
🧭 How to Choose the Right Level of Observability
When you’re building (or inheriting) a system, ask yourself:
- If this breaks, how soon would I know?
- Who depends on this — and when?
- How bad is it if it fails silently for a day? A week? A month?
- Would anyone message me? Or would no one know?
- Would I remember to check it regularly? (Be honest.)
If your answers suggest no one would notice, then you need to treat this system like a silent failure candidate — and design for it explicitly.
🛠️ Patterns That Work for Silent Systems
If you’re working with background jobs, scrapers, or quiet data tools, here are some simple patterns that make a big difference:
✅ Beacon File
- On every successful run, drop a file (or DB row, or object) with a timestamp
- Separate cron job checks freshness
- If no update in expected interval, send alert
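Here is a minimal Python sketch of the beacon pattern, assuming the beacon is a small text file holding a Unix timestamp. The function names and the file format are my own placeholders for illustration, not a standard.

```python
import time
from pathlib import Path
from typing import Optional


def write_beacon(path: Path) -> None:
    """Call this as the very last step of a successful run."""
    path.write_text(str(time.time()))


def beacon_is_fresh(
    path: Path, max_age_seconds: float, now: Optional[float] = None
) -> bool:
    """Return True if the beacon exists and is newer than max_age_seconds."""
    now = time.time() if now is None else now
    try:
        last_success = float(path.read_text())
    except (FileNotFoundError, ValueError):
        # A missing or unreadable beacon counts as stale: no proof of life.
        return False
    return (now - last_success) <= max_age_seconds
```

The separate checker would call `beacon_is_fresh` on its own schedule and send an alert whenever it returns `False`. For a job that runs every 10 minutes, a threshold of two or three intervals is probably a reasonable starting point, so a single slow run doesn't page you.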
✅ External Watchdog
- Job writes status → external service checks for new writes
- Decouples monitoring from the system being monitored
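One way the watchdog side might look, as a sketch: it takes the last status write it has seen for each job and flags anything that has been quiet for too long. `JobExpectation` and `find_silent_jobs` are hypothetical names, and how the last-write times get fetched (a database, S3, an API) is left out on purpose.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


@dataclass
class JobExpectation:
    name: str
    max_silence: timedelta  # longest acceptable gap between status writes


def find_silent_jobs(
    expectations: List[JobExpectation],
    last_writes: Dict[str, datetime],
    now: Optional[datetime] = None,
) -> List[str]:
    """Return the names of jobs whose last status write is missing or too old."""
    now = now or datetime.now(timezone.utc)
    silent = []
    for job in expectations:
        last = last_writes.get(job.name)
        # A job that has never written a status is treated as silent too.
        if last is None or now - last > job.max_silence:
            silent.append(job.name)
    return silent
```

The important part is where this runs: somewhere other than the box that runs the jobs, so that a dead machine or a broken cron daemon can't take the monitor down with it.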
✅ Status Dashboard
- Aggregates recent runs and last-success timestamps
- Can be static, serverless, or just a flat JSON viewer
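As a rough sketch, the aggregation behind such a dashboard can be very small. This assumes each run is recorded as a dict with `job`, `finished_at` (an ISO 8601 UTC string, so that string order matches time order), and `ok`. That record shape is my assumption for the example.

```python
import json
from typing import Any, Dict, Iterable


def build_status(runs: Iterable[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Summarize run records into one entry per job."""
    status: Dict[str, Dict[str, Any]] = {}
    # Process runs oldest first so the last assignment wins.
    for run in sorted(runs, key=lambda r: r["finished_at"]):
        entry = status.setdefault(
            run["job"],
            {"last_run": None, "last_success": None, "runs": 0, "failures": 0},
        )
        entry["last_run"] = run["finished_at"]
        entry["runs"] += 1
        if run["ok"]:
            entry["last_success"] = run["finished_at"]
        else:
            entry["failures"] += 1
    return status


def status_json(runs: Iterable[Dict[str, Any]]) -> str:
    """Render the summary as JSON that a static page or viewer can load."""
    return json.dumps(build_status(runs), indent=2, sort_keys=True)
```

Writing the output of `status_json` to a file on a schedule gives you the flat JSON version of the dashboard. The one column I'd always keep is `last_success`, since a stale value there is the signal a silent system never sends on its own.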
✅ Heartbeat Monitoring
- Simple pings to a service like healthchecks.io
- Automatically alerts if heartbeat goes missing
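A small wrapper along these lines is one way to wire a job to a heartbeat service. It is a sketch, assuming a healthchecks.io-style ping URL, where a plain request signals success and appending `/fail` signals failure. The URL in the usage comment is a placeholder, and `run_with_heartbeat` is a name I made up for this example.

```python
import urllib.request
from typing import Callable


def _send(url: str) -> None:
    """Best-effort ping; network errors must not crash the job itself."""
    try:
        urllib.request.urlopen(url, timeout=10).close()
    except OSError:
        pass


def run_with_heartbeat(
    job: Callable[[], None],
    ping_url: str,
    send: Callable[[str], None] = _send,
) -> None:
    """Run the job, then report success or failure to the heartbeat service."""
    try:
        job()
    except Exception:
        send(ping_url + "/fail")
        raise
    send(ping_url)


# Usage (placeholder URL, replace with your own check's ping URL):
# run_with_heartbeat(scrape_once, "https://hc-ping.com/your-check-uuid")
```

The explicit `/fail` ping is a bonus. The real protection is that if the job never runs at all, no ping arrives, and the service alerts on the silence. That is exactly the case my scraper's notifications missed.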
You don’t need full-blown distributed tracing — you need a way to know that your system is still alive, even if it hasn’t done anything interesting lately.
🧠 Final Thought
Not everything needs dashboards, logs, alerts, or metrics.
But everything needs a way to tell you it's alive.
Some systems will scream when they break. Some will whisper.
Design accordingly.