Observability as a First-Class System
I started my career in BI and analytics, where understanding system performance meant defining KPIs, tracking distributions, and making decisions from data. Sports teams use this. Businesses use this. It’s the foundation for understanding how anything is actually performing.
That foundation still shapes how I approach observability today.
I treat observability as a first-class system—something you design alongside the application, not bolt on after it breaks.
What Observability Means to Me
At a high level, observability is about answering:
- What is the system doing?
- Is it behaving as expected?
- If not, why?
It’s not just metrics, logs, or dashboards.
It’s about designing systems so they can explain themselves in production.
There’s a difference between:
- Knowing that something failed
- Understanding why it failed
- Knowing whether it’s still working at all
Good observability gives you all three.
How I Build Observability Into Systems
I think about observability in layers:
1. Instrumentation & Metrics
I’ve implemented request-level instrumentation (middleware) to track:
- Latency distributions (not just averages)
- Outliers and tail latency
- Endpoint-level performance
These metrics feed into dashboards used for debugging and performance tuning.
Prometheus and Grafana are part of my standard backend project templates, so metrics and alerting are built in from the start:
https://github.com/Trones21/proj-template/blob/main/python-django/src/compose_deploy.yaml
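As a minimal sketch of the idea (the real templates wire `prometheus_client` into Prometheus and Grafana; the `LatencyTracker` and `timed` names here are illustrative, stdlib-only stand-ins):

```python
import time
import statistics
from collections import defaultdict

class LatencyTracker:
    """Records per-endpoint request latencies so percentiles,
    not just averages, can be reported."""

    def __init__(self):
        self.samples = defaultdict(list)

    def observe(self, endpoint, seconds):
        self.samples[endpoint].append(seconds)

    def summary(self, endpoint):
        data = sorted(self.samples[endpoint])
        # quantiles with n=100 yields cut points at p1..p99
        q = statistics.quantiles(data, n=100)
        return {
            "count": len(data),
            "mean": statistics.fmean(data),
            "p50": q[49],
            "p99": q[98],  # tail latency, where the outliers live
        }

def timed(tracker, endpoint, handler):
    """Middleware-style wrapper: time the handler, record the sample."""
    def wrapped(*args, **kwargs):
        start = time.perf_counter()
        try:
            return handler(*args, **kwargs)
        finally:
            tracker.observe(endpoint, time.perf_counter() - start)
    return wrapped
```

Reporting p50 and p99 side by side is the point: a healthy average can hide a p99 that is an order of magnitude worse.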
2. System-Level Monitoring
Beyond request metrics, I design systems to monitor themselves.
In QLIR, I built a monitoring service that watches other processes:
- Health and liveness
- Memory usage
- Process state
This service acts as a central observer and drives alerts through a notification layer:
https://github.com/Trones21/qlir/blob/main/src/qlir/servers/ops_watcher/___ops_watcher_README.md
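The core of any watcher like this is a liveness probe that doesn't depend on the watched process cooperating. A POSIX-only sketch of that piece (the actual ops_watcher is linked above; function names here are illustrative, and memory tracking would need something like psutil on top):

```python
import os

def process_alive(pid):
    """Liveness probe: signal 0 delivers nothing, but raises if
    the pid no longer exists."""
    try:
        os.kill(pid, 0)
        return True
    except ProcessLookupError:
        return False
    except PermissionError:
        # Process exists but belongs to another user.
        return True

def check_fleet(pids):
    """Central-observer loop body: classify each watched pid."""
    return {pid: ("up" if process_alive(pid) else "down") for pid in pids}
```

A loop calling `check_fleet` on a schedule, plus a notification layer for anything marked "down", is the skeleton of an external watchdog.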
3. Custom Observability Tooling
When existing tools don’t provide the right signals, I build them.
I created a custom Prometheus exporter:
dir_exporter
https://github.com/Trones21/dir_exporter
It recursively scans directory trees and exposes two signals as time-series metrics:
- File counts
- Byte totals
This enables:
- Detecting stalled ingestion pipelines
- Tracking dataset growth over time
- Capacity planning
This is an example of turning system state into observable signals.
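The core scan can be sketched in a few lines (the real exporter is linked above; this stdlib version just computes the two gauges, which an exporter would then serve to Prometheus on a scrape endpoint):

```python
import os

def scan_tree(root):
    """Walk a directory tree and return (file_count, total_bytes),
    the two signals exposed per tree."""
    files = 0
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                total += os.path.getsize(path)
                files += 1
            except OSError:
                # File vanished mid-scan; skip rather than crash the scan.
                continue
    return files, total
```

Sampled on an interval, these two numbers become time series: a flat file count on an ingestion directory is a stalled pipeline, and a byte-total trend line is capacity planning data.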
4. Alerting & Feedback Loops
On AWS, I’ve implemented event-driven alerting (SNS) for data pipelines, focusing on:
- Failure detection
- Performance degradation
More importantly, I design alerting systems to avoid blind spots:
- Not just notifying on failure
- But requiring systems to actively prove they are still running
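The "prove you are still running" half can be sketched as a heartbeat check (a conceptual sketch, not the AWS setup; the names and the 300-second threshold are hypothetical and would be tuned per pipeline cadence):

```python
import time

# Hypothetical staleness threshold, in seconds.
MAX_SILENCE_SECONDS = 300

def record_heartbeat(state, name, now=None):
    """Each job writes a timestamp on every successful cycle."""
    state[name] = now if now is not None else time.time()

def stale_jobs(state, now=None, max_silence=MAX_SILENCE_SECONDS):
    """Watchdog view: anything silent too long is presumed down,
    even though it never emitted a failure event."""
    now = now if now is not None else time.time()
    return [name for name, ts in state.items() if now - ts > max_silence]
```

The key property: a job that dies before it can report a failure still shows up, because the alert fires on the absence of a signal rather than the presence of an error.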
Mental Models I Use
Over time, I’ve developed a few mental models that guide how I design observability systems:
🔊 Loud vs 🔇 Silent Systems
Some systems will scream when they break. Others will fail quietly.
You need to design observability differently for each.
❌ Notify on Failure vs ✅ External Confirmation
“Notify on failure” only works if the code runs.
Some failures happen before your code even executes.
External watchdogs and “proof of life” signals are often more reliable.
❤️ Proof of Life vs Logs
Logs tell you what happened.
They don’t tell you if the system is still alive.
For background jobs and pipelines, you need explicit signals that confirm continued execution.
Related Articles
I’ve written more detailed breakdowns of these ideas through real systems and failure modes:
- How Loud Should Your Systems Be When They Break?
- Notify on Success? Notify on Failure? No — Notify Intelligently
These focus on:
- Silent failures
- Alert fatigue
- Designing the right level of feedback for different systems
Final Thought
Good observability isn’t about having more data.
It’s about having the right signals when it matters.
And designing systems so you can understand why something is happening—not just that it failed.