
How Loud Should Your Systems Be When They Break?

In Notify on Success? Notify on Failure? No — Notify Intelligently, I shared a story about a side project that failed silently for months — a scraper that was supposed to run every 10 minutes, but stopped without a trace. I had notifications set up, but only if the job ran. When it didn’t, I got nothing.

That experience seared into my brain a lesson I didn’t have words for at the time:

Not all failures are equal — and not all systems need to scream when they fail.

Some systems will scream. Others will die quietly in a corner unless you check on them. And if you don’t design your observability to match that reality, you’ll either miss something important… or drown in noise.


🔊 Inherently Loud Failures vs 🔇 Inherently Quiet Failures

Over the years, I’ve started mentally sorting systems into two categories:


🔊 1. Systems Where Failures Are Inherently Loud

These are the systems you’ll hear about whether you want to or not.

Think: critical data pipelines, production APIs, batch jobs that feed dashboards used by actual humans.

Example:

I once worked on a nightly SAP → PySpark → Athena → Tableau pipeline. If the job failed, it technically triggered an alert — but we didn’t really need it. By 8:00am PST, we’d have a dozen Slack messages from business users asking where the data was.

The failure wasn’t invisible. It was socially surfaced by its consumers.

For systems like this:

  • You’ll know when it breaks, alert or no alert
  • Observability is still valuable, but mostly for debugging, not detection
  • Your main goal is fast root-cause analysis and a low mean time to recovery
※ Worth noting: this was a client transitioning from medium-business norms (~200–1000 employees) to enterprise-level expectations. We didn’t have true on-call rotations or formal SLAs. In a perfect world, yes — teams would be paged before the Slack DMs started. But that just wasn’t the reality at the time.

🔇 2. Systems Where Failures Are Not Inherently Loud

These are the systems that fail quietly. Systems that do a job in the background, without anyone checking in regularly.

Think:

  • Cron jobs that sync backups or data to S3
  • Scripts that send monthly summary reports
  • Side scrapers pulling data every X minutes
  • Low-priority analytics pipelines

If these stop working, nobody will notice. There’s no dashboard refresh. No angry Slack message. No built-in social feedback loop.

These are the systems that break… and stay broken… until you randomly check one day and realize they haven’t run in three months.

For systems like this:

  • Failure needs to be detected externally
  • Logging isn’t enough — you need proof of a heartbeat, not just logs of activity
  • Uptime monitoring, beacon files, and dashboards are your best tools

Silent systems require a different kind of observability — not just logs, but proof of life.


🧭 How to Choose the Right Level of Observability

When you’re building (or inheriting) a system, ask yourself:

  • If this breaks, how soon would I know?
  • Who depends on this — and when?
  • How bad is it if it fails silently for a day? A week? A month?
  • Would anyone message me? Or would no one know?
  • Would I remember to check it regularly? (Be honest.)

If your answers suggest no one would notice, then you need to treat this system like a silent failure candidate — and design for it explicitly.


🛠️ Patterns That Work for Silent Systems

If you’re working with background jobs, scrapers, or quiet data tools, here are some simple patterns that make a big difference:

Beacon File

  • On every successful run, drop a file (or DB row, or object) with a timestamp
  • Separate cron job checks freshness
  • If no update in expected interval, send alert

External Watchdog

  • Job writes status → external service checks for new writes
  • Decouples monitoring from the system being monitored

Status Dashboard

  • Aggregates recent runs and last-success timestamps
  • Can be static, serverless, or just a flat JSON viewer

Heartbeat Monitoring

  • Simple pings to a service like healthchecks.io
  • Automatically alerts if heartbeat goes missing
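The job-side code for heartbeat monitoring is a one-line HTTP GET to your check's unique URL. A sketch, with a hypothetical check URL and an injectable `send` so the logic is testable without network access:

```python
from urllib.request import urlopen

# Hypothetical check URL — healthchecks.io gives each check its own UUID.
HEARTBEAT_URL = "https://hc-ping.com/your-check-uuid"

def send_heartbeat(url: str = HEARTBEAT_URL, send=None) -> bool:
    """Fire-and-forget ping at the end of each successful run.
    Swallows errors so a monitoring hiccup never crashes the job itself;
    the service alerts on *missing* pings, so a failed send self-corrects."""
    send = send or (lambda u: urlopen(u, timeout=10))
    try:
        send(url)
        return True
    except Exception:
        return False
```

Note the inversion that makes this pattern work: the alert fires when the service *stops hearing* from you, so the failure mode of the job (it doesn't run at all) is exactly the signal being monitored.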

You don’t need full-blown distributed tracing — you need a way to know that your system is still alive, even if it hasn’t done anything interesting lately.


🧠 Final Thought

Not everything needs dashboards, logs, alerts, or metrics.
But everything needs a way to tell you it's alive.

Some systems will scream when they break. Some will whisper.
Design accordingly.