
How Loud Should Your Systems Be When They Break?

In Notify on Success? Notify on Failure? No — Notify Intelligently, I shared a story about a side project that failed silently for months — a scraper that was supposed to run every 10 minutes, but stopped without a trace. I had notifications set up, but only if the job ran. When it didn’t, I got nothing.

That experience seared into my brain a lesson I didn’t have words for at the time:

Not all failures are equal — and not all systems need to scream when they fail.

Some systems will scream. Others will die quietly in a corner unless you check on them. And if you don’t design your observability to match that reality, you’ll either miss something important… or drown in noise.


🔊 Inherently Loud Failures vs 🔇 Inherently Quiet Failures

Over the years, I’ve started mentally sorting systems into two categories:


🔊 1. Systems Where Failures Are Inherently Loud

These are the systems you’ll hear about whether you want to or not.

Think: critical data pipelines, production APIs, batch jobs that feed dashboards used by actual humans.

Example:

I once worked on a nightly SAP → PySpark → Athena → Tableau pipeline. If the job failed, it technically triggered an alert — but we didn’t really need it. By 8:00am PST, we’d have a dozen Slack messages from business users asking where the data was.

The failure wasn’t invisible. It was socially surfaced by its consumers.

For systems like this:

  • You’ll know when it breaks, alert or no alert
  • Observability is still valuable, but mostly for debugging, not detection
  • Your main goal is fast root-cause analysis and a low mean time to recovery
※ Worth noting: this was a client transitioning from medium-business norms (~200–1000 employees) to enterprise-level expectations. We didn’t have true on-call rotations or formal SLAs. In a perfect world, yes — teams would be paged before the Slack DMs started. But that just wasn’t the reality at the time.

🔇 2. Systems Where Failures Are Not Inherently Loud

These are the systems that fail quietly. Systems that do a job in the background, without anyone checking in regularly.

Think:

  • Cron jobs that sync backups or data to S3
  • Scripts that send monthly summary reports
  • Side scrapers pulling data every X minutes
  • Low-priority analytics pipelines

If these stop working, nobody will notice. There’s no dashboard refresh. No angry Slack message. No built-in social feedback loop.

These are the systems that break… and stay broken… until you randomly check one day and realize they haven’t run in three months.

For systems like this:

  • Failure needs to be detected externally
  • Logging isn’t enough — you need proof of a heartbeat, not just logs of activity
  • Uptime monitoring, beacon files, and dashboards are your best tools

Silent systems require a different kind of observability — not just logs, but proof of life.


🧭 How to Choose the Right Level of Observability

When you’re building (or inheriting) a system, ask yourself:

  • If this breaks, how soon would I know?
  • Who depends on this — and when?
  • How bad is it if it fails silently for a day? A week? A month?
  • Would anyone message me? Or would no one know?
  • Would I remember to check it regularly? (Be honest.)

If your answers suggest no one would notice, then you need to treat this system like a silent failure candidate — and design for it explicitly.


🛠️ Patterns That Work for Silent Systems

If you’re working with background jobs, scrapers, or quiet data tools, here are some simple patterns that make a big difference:

Beacon File

  • On every successful run, drop a file (or DB row, or object) with a timestamp
  • Separate cron job checks freshness
  • If no update in expected interval, send alert

External Watchdog

  • Job writes status → external service checks for new writes
  • Decouples monitoring from the system being monitored

Status Dashboard

  • Aggregates recent runs and last-success timestamps
  • Can be static, serverless, or just a flat JSON viewer

Heartbeat Monitoring

  • Simple pings to a service like healthchecks.io
  • Automatically alerts if heartbeat goes missing
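The job-side code for heartbeat monitoring is a one-line HTTP GET to your check's unique URL. A sketch, with a hypothetical check URL and an injectable `send` so the logic is testable without network access:

```python
from urllib.request import urlopen

# Hypothetical check URL — healthchecks.io gives each check its own UUID.
HEARTBEAT_URL = "https://hc-ping.com/your-check-uuid"

def send_heartbeat(url: str = HEARTBEAT_URL, send=None) -> bool:
    """Fire-and-forget ping at the end of each successful run.
    Swallows errors so a monitoring hiccup never crashes the job itself;
    the service alerts on *missing* pings, so a failed send self-corrects."""
    send = send or (lambda u: urlopen(u, timeout=10))
    try:
        send(url)
        return True
    except Exception:
        return False
```

Note the inversion that makes this pattern work: the alert fires when the service *stops hearing* from you, so the failure mode of the job (it doesn't run at all) is exactly the signal being monitored.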

You don’t need full-blown distributed tracing — you need a way to know that your system is still alive, even if it hasn’t done anything interesting lately.


🧠 Final Thought

Not everything needs dashboards, logs, alerts, or metrics.
But everything needs a way to tell you it's alive.

Some systems will scream when they break. Some will whisper.
Design accordingly.