Observability as a First-Class System
I started my career in BI and analytics, where understanding system performance meant defining KPIs, tracking distributions, and making decisions from data. Sports teams use this. Businesses use this. It’s the foundation for understanding how anything is actually performing.
That foundation still shapes how I approach observability today.
I treat observability as a first-class system—something you design alongside the application, not bolt on after it breaks.
What Observability Means to Me
At a high level, observability is about answering:
- What is the system doing?
- Is it behaving as expected?
- If not, why?
It’s not just metrics, logs, or dashboards.
It’s about designing systems so they can explain themselves in production.
There’s a difference between:
- Knowing that something failed
- Understanding why it failed
- Knowing whether it’s still working at all
Good observability gives you all three.
How I Build Observability Into Systems
I think about observability in layers:
1. Instrumentation & Metrics
I’ve implemented request-level instrumentation (middleware) to track:
- Latency distributions (not just averages)
- Outliers and tail latency
- Endpoint-level performance
These metrics feed into dashboards used for debugging and performance tuning.
Prometheus and Grafana are part of my standard backend project templates, so metrics and alerting are built in from the start:
https://github.com/Trones21/proj-template/blob/main/python-django/src/compose_deploy.yaml
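As a minimal sketch of the idea (the real templates wire `prometheus_client` into Prometheus and Grafana; the `LatencyTracker` and `timed` names here are illustrative, stdlib-only stand-ins):

```python
import time
import statistics
from collections import defaultdict

class LatencyTracker:
    """Records per-endpoint request latencies so percentiles,
    not just averages, can be reported."""

    def __init__(self):
        self.samples = defaultdict(list)

    def observe(self, endpoint, seconds):
        self.samples[endpoint].append(seconds)

    def summary(self, endpoint):
        data = sorted(self.samples[endpoint])
        # quantiles with n=100 yields cut points at p1..p99
        q = statistics.quantiles(data, n=100)
        return {
            "count": len(data),
            "mean": statistics.fmean(data),
            "p50": q[49],
            "p99": q[98],  # tail latency, where the outliers live
        }

def timed(tracker, endpoint, handler):
    """Middleware-style wrapper: time the handler, record the sample."""
    def wrapped(*args, **kwargs):
        start = time.perf_counter()
        try:
            return handler(*args, **kwargs)
        finally:
            tracker.observe(endpoint, time.perf_counter() - start)
    return wrapped
```

Reporting p50 and p99 side by side is the point: a healthy average can hide a p99 that is an order of magnitude worse.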
2. System-Level Monitoring
Beyond request metrics, I design systems to monitor themselves.
In QLIR, I built a monitoring service that watches other processes:
- Health and liveness
- Memory usage
- Process state
This service acts as a central observer and drives alerts through a notification layer:
https://github.com/Trones21/qlir/blob/main/src/qlir/servers/ops_watcher/___ops_watcher_README.md
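The core of any watcher like this is a liveness probe that doesn't depend on the watched process cooperating. A POSIX-only sketch of that piece (the actual ops_watcher is linked above; function names here are illustrative, and memory tracking would need something like psutil on top):

```python
import os

def process_alive(pid):
    """Liveness probe: signal 0 delivers nothing, but raises if
    the pid no longer exists."""
    try:
        os.kill(pid, 0)
        return True
    except ProcessLookupError:
        return False
    except PermissionError:
        # Process exists but belongs to another user.
        return True

def check_fleet(pids):
    """Central-observer loop body: classify each watched pid."""
    return {pid: ("up" if process_alive(pid) else "down") for pid in pids}
```

A loop calling `check_fleet` on a schedule, plus a notification layer for anything marked "down", is the skeleton of an external watchdog.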
3. Custom Observability Tooling
When existing tools don’t provide the right signals, I build them.
I created a custom Prometheus exporter:
dir_exporter
https://github.com/Trones21/dir_exporter
It recursively scans directory trees and exposes two signals as time-series metrics:
- File counts
- Byte totals
This enables:
- Detecting stalled ingestion pipelines
- Tracking dataset growth over time
- Capacity planning
This is an example of turning system state into observable signals.
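The core scan can be sketched in a few lines (the real exporter is linked above; this stdlib version just computes the two gauges, which an exporter would then serve to Prometheus on a scrape endpoint):

```python
import os

def scan_tree(root):
    """Walk a directory tree and return (file_count, total_bytes),
    the two signals exposed per tree."""
    files = 0
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                total += os.path.getsize(path)
                files += 1
            except OSError:
                # File vanished mid-scan; skip rather than crash the scan.
                continue
    return files, total
```

Sampled on an interval, these two numbers become time series: a flat file count on an ingestion directory is a stalled pipeline, and a byte-total trend line is capacity planning data.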
4. Alerting & Feedback Loops
On AWS, I’ve implemented event-driven alerting (SNS) for data pipelines, focusing on:
- Failure detection
- Performance degradation
More importantly, I design alerting systems to avoid blind spots:
- Not just notifying on failure
- But requiring systems to actively prove they are still running
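The "prove you are still running" half can be sketched as a heartbeat check (a conceptual sketch, not the AWS setup; the names and the 300-second threshold are hypothetical and would be tuned per pipeline cadence):

```python
import time

# Hypothetical staleness threshold, in seconds.
MAX_SILENCE_SECONDS = 300

def record_heartbeat(state, name, now=None):
    """Each job writes a timestamp on every successful cycle."""
    state[name] = now if now is not None else time.time()

def stale_jobs(state, now=None, max_silence=MAX_SILENCE_SECONDS):
    """Watchdog view: anything silent too long is presumed down,
    even though it never emitted a failure event."""
    now = now if now is not None else time.time()
    return [name for name, ts in state.items() if now - ts > max_silence]
```

The key property: a job that dies before it can report a failure still shows up, because the alert fires on the absence of a signal rather than the presence of an error.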
Mental Models I Use
Over time, I’ve developed a few mental models that guide how I design observability systems:
🔊 Loud vs 🔇 Silent Systems
Some systems will scream when they break. Others will fail quietly.
You need to design observability differently for each.
❌ Notify on Failure vs ✅ External Confirmation
“Notify on failure” only works if the code runs.
Some failures happen before your code even executes.
External watchdogs and “proof of life” signals are often more reliable.
❤️ Proof of Life vs Logs
Logs tell you what happened.
They don’t tell you if the system is still alive.
For background jobs and pipelines, you need explicit signals that confirm continued execution.
Related Articles
I’ve written more detailed breakdowns of these ideas through real systems and failure modes:
- How Loud Should Your Systems Be When They Break?
- Notify on Success? Notify on Failure? No — Notify Intelligently
These focus on:
- Silent failures
- Alert fatigue
- Designing the right level of feedback for different systems
Final Thought
Good observability isn’t about having more data.
It’s about having the right signals when it matters.
And designing systems so you can understand why something is happening—not just that it failed.