Notify on Success? Notify on Failure? No — Notify Intelligently.
Back in 2017 or 2018, I wrote a tool to scrape cryptocurrency prices. It ran every 10 minutes on a Windows VM I didn’t fully trust, and I wanted peace of mind that it was working. So I did the obvious thing:
I made it email me after every successful run.
This wasn’t a production-grade system — it was a scrappy side tool I’d built to collect high-frequency price data for analysis. At the time, free APIs only offered historical prices at 1-hour or 1-day resolution. That wasn’t good enough. I wanted the sideways movement, the in-between candle ticks that would never show up in OHLC data — the kind of noise that might hold patterns worth exploiting.
So I wrote a job that hit the API every 10 minutes and saved the price.
And since this was one of many small tools I’d build alongside my day job, the goal was to set it and forget it. Just let it run in the background and quietly do its thing.
At first, it worked. But after a few days, the noise became unbearable: 144 success emails per day. I cut back how often it emailed me, then eventually turned off notifications completely.
I still assumed it was running fine.
Months later, I checked the data — and found a massive gap. The scraper had silently stopped working, and I had no idea when or why. The emails were gone. The visibility was gone. And so was about three months of data.
Looking back, I'm not entirely sure of the exact order in which each iteration of this little side project played out. I hadn’t adopted Git yet (we were on TFS at work), and I only migrated the project to GitHub in 2019. What I do remember is imagining the “real” architecture: a frontend dashboard, a second system that monitored this one, clean decoupling of signal and job logic...
But this was before I knew how to build any of that well. I would've likely been stuck battling CORS errors on some serverless function for days, and honestly... it just wasn’t worth it for a side project that scraped crypto prices.
At the time, I was still doing GUI-based cloud setups, remoting into Windows boxes and dragging and dropping files to get them transferred over, and just starting to learn about modern frontend frameworks. I wasn’t using Linux, IaC, SSH, or even the terminal much beyond starting and testing the applications I was building. Today, that’s second nature: I’ve built up a system where I can spin up an entire project with a single command. My create_project.sh script scaffolds the full directory layout and injects project names into Docker Compose files, environment configs, deployment scripts, and even the scp + ssh wiring. My scripts handle it all: build, ship, launch, with safety logic for dev vs prod. One command, the whole enchilada.
And I have years of experience building reactive frontends in every major framework.
Years later, that “ultimate system” I had envisioned (decoupled jobs, clean frontends, automated infra) is just a normal Tuesday. Back then, though, I was just trying to make something work.
Even when the emails were working, my observability strategy was broken.
🔁 The Classic Monitoring Pattern
It’s tempting to frame your job logic like this:
try:
    run_job()
    notify("success")
except Exception as e:
    notify("failure: " + str(e))
At a glance, this seems reasonable. You’ll know when things break, right?
Only... what if run_job() never gets called?
What if:
- The cron job silently stops working
- The VM restarts and doesn’t recover
- Your job scheduler is misconfigured
- A dependency breaks before the script even starts
You won’t catch that in your try/except, because your code never ran.
❌ The Hidden Failure Class
This is the blind spot with "notify on failure" setups: They only work if the code runs in the first place.
But many of the worst, hardest-to-detect bugs happen outside your job logic:
- Cron jobs silently disabled
- Permissions changed
- Environment variables lost
- Script removed or renamed
- Network outages that prevent notification delivery itself
✅ The Better Pattern: External Confirmation
So what do I do now?
Instead of having the job email me, I have it leave behind a breadcrumb. Every time it runs, it drops a status file somewhere external — like S3.
Example:
{
  "job": "crypto-scraper",
  "status": "ok",
  "timestamp": "2025-06-26T04:40:00Z"
}
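Here’s a minimal sketch of that breadcrumb write, assuming boto3 with AWS credentials already configured; the bucket name, key layout, and `write_heartbeat` helper are just placeholders, not my exact setup:

```python
import json
from datetime import datetime, timezone

import boto3  # assumes AWS credentials are available in the environment


def write_heartbeat(job_name: str, bucket: str = "my-job-heartbeats") -> None:
    """Drop a small status file in S3 so an external watchdog can see the job ran."""
    payload = {
        "job": job_name,
        "status": "ok",
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    s3 = boto3.client("s3")
    s3.put_object(
        Bucket=bucket,
        Key=f"heartbeats/{job_name}.json",  # one key per job, overwritten each run
        Body=json.dumps(payload).encode("utf-8"),
        ContentType="application/json",
    )


# called at the very end of a successful run
# write_heartbeat("crypto-scraper")
```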
Then, a completely separate cron job (or serverless function) monitors that external location.
If more than 30 minutes go by and no new file appears, then I get an alert.
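A sketch of that watchdog, under the same assumptions (boto3, the hypothetical bucket and key layout from above); how `alert` actually delivers the message is up to you — email, Slack, SNS, whatever you’ll actually read:

```python
from datetime import datetime, timedelta, timezone

import boto3
from botocore.exceptions import ClientError

MAX_AGE = timedelta(minutes=30)


def alert(message: str) -> None:
    # placeholder: wire this to email, Slack, SNS, etc.
    print("ALERT:", message)


def check_heartbeat(job_name: str, bucket: str = "my-job-heartbeats") -> None:
    """Alert if the job's breadcrumb is missing or older than MAX_AGE."""
    s3 = boto3.client("s3")
    try:
        head = s3.head_object(Bucket=bucket, Key=f"heartbeats/{job_name}.json")
    except ClientError:
        alert(f"{job_name}: no heartbeat file found")
        return

    age = datetime.now(timezone.utc) - head["LastModified"]
    if age > MAX_AGE:
        alert(f"{job_name}: last heartbeat was {age} ago")


if __name__ == "__main__":
    check_heartbeat("crypto-scraper")
```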
This does a few important things:
- Decouples job execution from notification
- Makes it easy to spot “silent failures” (when the job never runs at all)
- Avoids flooding your inbox
- Gives you an audit trail (you can list past runs, runtime durations, etc.)
🔧 Minimal Tools, Maximum Confidence
This setup doesn’t require anything fancy:
- ✅ Cron job writes to a cloud bucket (S3, GCS, etc.)
- ✅ Watchdog script runs on a schedule and checks timestamps
- ✅ Optional: Web dashboard to show status of all jobs at a glance
It doesn’t matter if the job is on a flaky Windows VM, an old Raspberry Pi, or a modern cloud instance. As long as it can leave a breadcrumb, you’ll know it’s alive.
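On a Linux box, the wiring can be as small as two crontab entries; the paths and schedules here are placeholders, not a prescription:

```
# scraper runs every 10 minutes and writes its own breadcrumb
*/10 * * * * /usr/bin/python3 /opt/jobs/crypto_scraper.py
# watchdog runs every 15 minutes and alerts if the breadcrumb goes stale
*/15 * * * * /usr/bin/python3 /opt/jobs/watchdog.py
```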
🧠 Takeaways
- “Notify on success” creates noise and fatigue
- “Notify on failure” misses invisible failure modes
- A decoupled, external check is often more reliable than internal logging or emails
- Monitoring that the job ran is different from monitoring what the job did
- Good observability isn’t about sophistication — it’s about coverage
👉 What Next?
If this resonates with you, I’ve written a follow-up piece that digs deeper:
How Much Observability Is Enough? — Designing systems with the right kind of feedback loop based on how you’ll notice failure.