3 AM. Your on-call engineer is staring at a blank dashboard while your company's payment processing is failing. The monitoring alerts are screaming, but they only tell you that "API latency is high." Is it the database? The cache layer? A dependency? The CPU? Nobody knows. By the time someone figures it out—usually by adding logging to production—you've lost $50K in transactions. This is what happens when you confuse monitoring with observability.
Most teams don't understand this distinction. And frankly, it costs them.
The Myth of Metrics
I spent five years as an SRE at a fintech startup in Ho Chi Minh City, and I watched us make every beginner mistake in the book. We had 127 Prometheus metrics. Beautiful dashboards in Grafana. Alerts for everything. Yet somehow, when something broke, we were essentially blind.
The problem? We were monitoring—checking predefined conditions we thought might matter. We weren't building observability—the ability to ask arbitrary questions about our systems without having written the code to answer them first.
Here's what I mean: with monitoring, you ask, "Is this metric above threshold X?" With observability, you ask, "Why are 0.3% of our payments stuck in a pending state?"
The distinction matters because you can't predict everything that will go wrong. Your system is too complex. The interactions are too numerous. Emergent behavior happens. That's why observability is fundamentally about capturing rich enough raw signals that you can investigate anything.
Three Pillars, One Philosophy
People will tell you observability has three pillars: metrics, logs, and traces. That's true but incomplete. The real answer is: metrics answer 'how much,' logs answer 'what happened,' and traces answer 'in what sequence.'
But here's what nobody talks about: they're useless in isolation.
I've seen teams with world-class logging that couldn't correlate log events across a distributed system because they didn't instrument their applications with trace IDs. I've seen impeccable metrics that missed entire categories of failures because nobody was recording the right dimensions. I've seen detailed traces that were too expensive to sample broadly, so when rare bugs occurred, they went untraced.
The magic happens when these three work together. You see a spike in error rates (metrics). You drill into logs from that time window (logs). You pick a failing request and follow its entire path across 12 services (traces). Suddenly, you know exactly what happened.
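To make that correlation concrete, here is a minimal sketch of stamping every log line with the active trace ID, assuming an OpenTelemetry-instrumented Node.js service and pino as the logger (both are my assumptions, not something this post prescribes):

```typescript
import { trace } from "@opentelemetry/api";
import pino from "pino";

const logger = pino({
  mixin() {
    // Merge the current trace context into every log line so a log search
    // can pivot straight into the corresponding distributed trace.
    const span = trace.getActiveSpan();
    if (!span) return {};
    const { traceId, spanId } = span.spanContext();
    return { traceId, spanId };
  },
});

// Inside any instrumented request handler:
logger.info({ orderId: "ord_123" }, "payment moved to pending"); // now carries traceId
```

With that in place, the workflow above stops being aspirational: the metric spike points to a time window, the logs in that window carry trace IDs, and each trace ID opens the full request path.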
The Economics Nobody Mentions
Here's what your CFO won't ask but should: How much is observability costing you, and what's the break-even point?
A properly instrumented system with comprehensive observability can cost 2-3x your actual infrastructure spend. At my fintech startup, we spent $40K/month on infrastructure and $80K/month on Datadog. Executives questioned it constantly, until one incident in 2019 when we detected a memory leak in production within 6 minutes instead of the 3+ hours it would have taken with basic monitoring. That six-minute difference prevented an estimated $200K in chargebacks and regulatory fines.
Calculate your own break-even. How much does an hour of downtime cost your business? If it's $10K/hour and you have one incident per quarter, observability that costs you $30K/month is a rounding error.
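If you want to run that math yourself, here is a rough calculator; every input is an assumption to replace with your own figures, and the "other losses avoided" value simply echoes the 2019 incident above:

```typescript
// Back-of-the-envelope break-even for observability spend. All numbers are illustrative.
function observabilityBreakEven(opts: {
  downtimeCostPerHour: number;           // what an hour of downtime costs the business
  incidentsPerYear: number;              // how often you get paged for something real
  hoursSavedPerIncident: number;         // how much faster you diagnose with good observability
  otherLossesAvoidedPerIncident: number; // chargebacks, fines, SLA penalties, etc.
  monthlyObservabilitySpend: number;
}) {
  const annualSavings =
    opts.incidentsPerYear *
    (opts.downtimeCostPerHour * opts.hoursSavedPerIncident + opts.otherLossesAvoidedPerIncident);
  const annualSpend = opts.monthlyObservabilitySpend * 12;
  return { annualSavings, annualSpend, worthIt: annualSavings > annualSpend };
}

console.log(observabilityBreakEven({
  downtimeCostPerHour: 10_000,
  incidentsPerYear: 4,
  hoursSavedPerIncident: 3,               // assumed: 6 minutes vs 3+ hours to find the cause
  otherLossesAvoidedPerIncident: 200_000, // assumed: the chargebacks-and-fines scenario above
  monthlyObservabilitySpend: 30_000,
}));
// → { annualSavings: 920000, annualSpend: 360000, worthIt: true }
```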
The Vietnam Market Peculiarity
Something interesting happens in Vietnam: rapid growth without legacy infrastructure. I've consulted with fintech and logistics startups here that built observability correctly from day one. They scaled to handle 10x user growth without the observability debt that plagues companies built on older patterns.
But there's a catch. Vietnamese developers often gravitate toward open-source solutions (Prometheus, Grafana, OpenTelemetry) because cost matters. And that's smart—but it requires discipline. Open-source tooling won't hold your hand. You need a strong platform engineering culture to make it work. Many Vietnamese startups solve this by bringing in experienced practitioners or partnering with firms that specialize in this.
The Practical Reality
If you're starting observability work, understand what's actually hard:
Instrumentation is boring but critical. Adding @opentelemetry/auto-instrumentations-node to your Node.js app takes 30 seconds. Getting every database query, every Redis call, and every external API dependency *properly traced with business context* takes weeks. Most teams do the minimum and wonder why traces tell them nothing.
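For illustration, here is a hedged sketch of what "properly traced with business context" can look like with the OpenTelemetry tracing API; the payment function, gateway stub, and attribute names are all hypothetical:

```typescript
import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("payments-service");

// Stand-in for the real gateway client.
async function callPaymentGateway(order: { id: string }) {
  return { status: "captured" };
}

async function chargePayment(order: { id: string; amountVnd: number; method: string; tier: string }) {
  return tracer.startActiveSpan("payment.charge", async (span) => {
    // Business attributes: these are what make the trace useful during an incident,
    // because you can slice failing traces by method, amount, or customer tier.
    span.setAttribute("payment.method", order.method);
    span.setAttribute("payment.amount_vnd", order.amountVnd);
    span.setAttribute("customer.tier", order.tier);
    try {
      const result = await callPaymentGateway(order);
      span.setAttribute("payment.gateway_status", result.status);
      return result;
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}
```

Auto-instrumentation gives you the skeleton of the trace; attributes like these are the weeks of work that make it answer business questions.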
Cardinality is your enemy. If you're tagging every metric with user ID and request ID, congratulations—you've just created infinite time series. Your observability system becomes expensive and unusable. You need discipline about what you tag and why.
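A quick sketch of the trap, assuming prom-client as the metrics library; the metric and label names are illustrative:

```typescript
import client from "prom-client";

// BAD: user_id and request_id are unbounded, so every request mints a new time series.
// new client.Counter({
//   name: "payments_total",
//   help: "Payments processed",
//   labelNames: ["user_id", "request_id", "status"],
// });

// BETTER: bounded, low-cardinality dimensions you actually aggregate over.
const paymentsTotal = new client.Counter({
  name: "payments_total",
  help: "Payments processed",
  labelNames: ["status", "method", "region"],
});

paymentsTotal.inc({ status: "failed", method: "qr", region: "vn-south" });
// High-cardinality detail (user ID, request ID) belongs in logs and trace attributes,
// where it can be searched without exploding your metric store.
```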
High cardinality + sampling = blindness. If you only sample 1% of traces, and a rare bug affects 0.5% of requests, you'll probably never see it. This is where purposeful sampling and tail-based sampling come in, but most teams don't know these techniques exist.
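As a rough sketch, this is where the head-based sampling knob usually sits in an OpenTelemetry Node.js setup; tail-based sampling happens downstream in the collector (its tail_sampling processor), not in application code:

```typescript
import { NodeSDK } from "@opentelemetry/sdk-node";
import { ParentBasedSampler, TraceIdRatioBasedSampler } from "@opentelemetry/sdk-trace-base";

const sdk = new NodeSDK({
  // Head-based: decide up front and keep 1% of traces. Cheap, but a bug that hits
  // 0.5% of requests will only rarely appear in what you kept.
  sampler: new ParentBasedSampler({ root: new TraceIdRatioBasedSampler(0.01) }),
});

sdk.start();
```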
What Expert Practitioners Actually Do
After talking to dozens of site reliability engineers across Vietnamese and Southeast Asian tech companies, here's what separates the good from the mediocre:
1. They instrument for business outcomes, not infrastructure. Not "track CPU usage" but "track payment success rate by geography by payment method by customer tier."
2. They build alerting on real SLOs, not arbitrary thresholds. "Error rate exceeds 0.1%" means nothing. "We're violating our SLO of 99.9% availability" means act now. (See the burn-rate sketch after this list.)
3. They treat observability as a product. The best teams have a shared observability platform that developers actually *want* to use because it's easy and answers questions quickly.
4. They understand that perfect instrumentation is the enemy of good. You don't need 100% trace sampling. You need smart sampling, good log levels, and the right metrics. Perfection gets expensive fast.
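Here is a small, illustrative burn-rate calculation for the SLO-based alerting in point 2; the traffic numbers are invented:

```typescript
const sloTarget = 0.999;             // 99.9% availability SLO
const errorBudget = 1 - sloTarget;   // 0.1% of requests may fail over the SLO window

// Observed over the last hour (in practice this comes from your metrics backend):
const requestsLastHour = 120_000;
const failedLastHour = 480;
const observedErrorRate = failedLastHour / requestsLastHour; // 0.4%

// Burn rate: how many times faster than sustainable you are spending error budget.
const burnRate = observedErrorRate / errorBudget; // 4x here
if (burnRate > 1) {
  console.log(`Burning error budget ${burnRate.toFixed(1)}x too fast: page someone.`);
}
```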
The Uncomfortable Truth
Most observability implementations fail not because the tools are bad, but because teams underestimate the engineering effort required. They think buying a tool like Datadog or New Relic solves the problem. It doesn't. The tool is 20% of the solution. The other 80% is deciding what to measure, how to instrument code, what to alert on, and building the muscle memory to actually *use* the tools to solve problems.
Observability is less about technology and more about culture and discipline.
---
If you're building systems that matter—especially in high-growth environments like Vietnam's fintech and e-commerce scene—observability isn't optional. It's the difference between sleeping at night and living in fear of that 3 AM page. Tools like Prometheus and Grafana work great if you have the engineering depth, but increasingly, teams are recognizing that the human cost of building and maintaining this in-house is too high. Companies like Idflow Technology are helping Southeast Asian startups instrument their systems properly without needing to hire a dedicated platform team—which, frankly, is a smart move if you're not yet at the scale where that investment makes sense.
Build observability into your systems early. Your future self will thank you.