REST API Monitoring & Observability
The three pillars of observability — metrics, logs, and distributed traces — with OpenTelemetry instrumentation, health checks, and alerting best practices for production REST APIs.
Monitoring vs Observability
Monitoring tells you when something is wrong — dashboards, alerts, uptime checks. Observability lets you understand why something is wrong, even for failure modes you didn't anticipate, by providing rich queryable telemetry.
| Aspect | Monitoring | Observability |
|---|---|---|
| Question answered | "Is the API up?" | "Why is request #4829 slow?" |
| Data | Predefined metrics & alerts | Metrics + Logs + Traces (any query) |
| Unknown failures | Misses novel failure modes | Can debug anything with enough data |
| Tools | Pingdom, UptimeRobot, PagerDuty | OpenTelemetry, Grafana, Datadog, Jaeger |
You need both. Monitoring catches known problems fast; observability helps you understand and fix unknown problems. The industry standard in 2026 is OpenTelemetry (OTel), a CNCF project that unifies metrics, logs, and traces under a single vendor-neutral SDK.
The Four Golden Signals
Google's SRE book defines four golden signals that every API should track:
| Signal | What to Measure | Alert Threshold (example) |
|---|---|---|
| Latency | p50, p95, p99 response time (distinguish success vs error latency) | p99 > 500ms for 5 min |
| Traffic | Requests per second (RPS), broken down by endpoint and method | RPS drops >50% vs 1h average |
| Errors | 5xx error rate; also track 4xx separately (client vs server errors) | 5xx rate > 1% for 2 min |
| Saturation | CPU, memory, DB connection pool, queue depth — how "full" is the service? | DB pool > 90% for 3 min |
Metrics with Prometheus & Express
Prometheus is the de facto standard for API metrics collection. The prom-client library exposes metrics at a /metrics endpoint that Prometheus scrapes.
// npm install prom-client express
const express = require('express');
const promClient = require('prom-client');
const app = express();
// Collect default Node.js metrics (event loop lag, GC, heap, etc.)
promClient.collectDefaultMetrics();
// Custom histogram for HTTP request duration
const httpDuration = new promClient.Histogram({
name: 'http_request_duration_seconds',
help: 'HTTP request duration in seconds',
labelNames: ['method', 'route', 'status_code'],
buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5]
});
// Counter for total requests
const httpRequests = new promClient.Counter({
name: 'http_requests_total',
help: 'Total HTTP requests',
labelNames: ['method', 'route', 'status_code']
});
// Middleware — measure every request
app.use((req, res, next) => {
const end = httpDuration.startTimer();
res.on('finish', () => {
const labels = {
method: req.method,
route: req.route?.path || req.path,
status_code: res.statusCode
};
end(labels);
httpRequests.inc(labels);
});
next();
});
// Expose metrics endpoint for Prometheus scraping
app.get('/metrics', async (req, res) => {
res.set('Content-Type', promClient.register.contentType);
res.end(await promClient.register.metrics());
});
app.listen(3000);
With this in place, Prometheus scrapes /metrics every 15 seconds and Grafana can visualise latency percentiles, error rates, and throughput in real time.
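The histogram and counter above cover latency, traffic, and errors; the fourth golden signal, saturation, is usually exported as a gauge. A minimal sketch, assuming an existing node-postgres pool object (its totalCount and options.max fields are the assumption here):
// Gauge for DB connection pool saturation (assumes a pg.Pool instance named `pool`)
const dbPoolUsage = new promClient.Gauge({
  name: 'db_pool_connections_used_ratio',
  help: 'Fraction of the DB connection pool currently in use',
  collect() {
    // Called on every /metrics scrape
    this.set(pool.totalCount / pool.options.max);
  }
});
The same pattern works for queue depth or any other "how full is the service" measurement.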
Structured Logging
Logs should be machine-parseable JSON — not plain text strings. Structured logs enable filtering, aggregation, and correlation across millions of events in tools like Datadog, Loki, or CloudWatch.
// npm install pino
const pino = require('pino');
const logger = pino({ level: 'info' });
// Bad — plain text log, impossible to filter
console.log('POST /orders 201 42ms user=alice');
// Good — structured JSON log
logger.info({
method: 'POST',
path: '/orders',
status: 201,
duration_ms: 42,
user_id: 'user_alice',
request_id: 'req_8f3a2b',
order_id: 'order_98765'
}, 'Request completed');
// Output:
// {"level":30,"time":1714000000000,"method":"POST","path":"/orders",
// "status":201,"duration_ms":42,"user_id":"user_alice",
// "request_id":"req_8f3a2b","order_id":"order_98765","msg":"Request completed"}
Request ID Correlation
Always generate a unique request_id per incoming request and propagate it through every log line and downstream service call. This lets you reconstruct the full journey of a single request across your entire system:
const { randomUUID } = require('crypto');
app.use((req, res, next) => {
// Use incoming request-id (from upstream) or generate a new one
req.requestId = req.headers['x-request-id'] || randomUUID();
res.setHeader('X-Request-ID', req.requestId);
// Attach to every log in this request's scope
req.log = logger.child({ request_id: req.requestId });
next();
});
app.post('/orders', async (req, res) => {
req.log.info({ user_id: req.user.id }, 'Creating order');
const order = await orderService.create(req.body);
req.log.info({ order_id: order.id }, 'Order created');
res.status(201).json(order);
});
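Propagating the ID to downstream services is just a matter of forwarding the same header on outbound calls. A sketch using Node 18+ global fetch and a hypothetical inventory-service URL:
// Forward the request ID so downstream logs can be correlated with this request
async function getInventory(sku, requestId) {
  const response = await fetch(`http://inventory-service/items/${sku}`, {
    headers: { 'X-Request-ID': requestId }
  });
  return response.json();
}
// Inside a route handler: await getInventory(item.sku, req.requestId);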
Distributed Tracing with OpenTelemetry
Distributed tracing shows the full execution path of a request across multiple services, databases, and external calls — each step represented as a span within a trace. OpenTelemetry is the CNCF standard for vendor-neutral tracing.
// npm install @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node
// tracing.js — load BEFORE anything else
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const sdk = new NodeSDK({
serviceName: 'orders-api',
traceExporter: new OTLPTraceExporter({
url: 'http://otel-collector:4318/v1/traces'
}),
instrumentations: [
getNodeAutoInstrumentations() // auto-instruments Express, HTTP, pg, redis, etc.
]
});
sdk.start();
process.on('SIGTERM', () => sdk.shutdown());
With auto-instrumentation, every Express route, database query, Redis call, and outbound HTTP request is automatically traced — no manual span creation required for the basics.
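It is also worth stamping the active trace ID onto your structured logs so you can jump from a log line in Loki or Datadog straight to the corresponding trace. A sketch, assuming the per-request pino logger (req.log) from the request-ID middleware above:
const { trace } = require('@opentelemetry/api');
// Enrich the per-request logger with the current trace context
app.use((req, res, next) => {
  const activeSpan = trace.getActiveSpan();
  if (activeSpan && req.log) {
    const { traceId, spanId } = activeSpan.spanContext();
    req.log = req.log.child({ trace_id: traceId, span_id: spanId });
  }
  next();
});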
Manual Spans for Business Logic
const { trace, SpanStatusCode } = require('@opentelemetry/api');
const tracer = trace.getTracer('orders-api');
async function processOrder(orderData) {
// Create a custom span for business logic
return tracer.startActiveSpan('process-order', async (span) => {
try {
span.setAttributes({
'order.user_id': orderData.userId,
'order.item_count': orderData.items.length,
'order.total_usd': orderData.total
});
const inventory = await checkInventory(orderData.items);
const payment = await chargePayment(orderData.payment);
const order = await createOrder({ ...orderData, paymentId: payment.id });
span.setStatus({ code: SpanStatusCode.OK });
return order;
} catch (err) {
span.recordException(err);
span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
throw err;
} finally {
span.end();
}
});
}
Health Check Endpoints
Health checks allow load balancers, Kubernetes, and monitoring tools to determine whether your API is ready to receive traffic.
| Endpoint | Purpose | Returns healthy when... |
|---|---|---|
| GET /health | General health check | Process is running |
| GET /health/live | Liveness (Kubernetes) | Process hasn't deadlocked |
| GET /health/ready | Readiness (Kubernetes) | DB connected, cache warm, dependencies up |
| GET /health/startup | Startup probe | Initialisation complete |
// Health check implementation
app.get('/health/live', (req, res) => {
// Liveness: just confirm the process is alive
res.status(200).json({ status: 'ok', timestamp: new Date().toISOString() });
});
app.get('/health/ready', async (req, res) => {
const checks = {};
// Check database
try {
await db.query('SELECT 1');
checks.database = 'ok';
} catch (e) {
checks.database = 'error: ' + e.message;
}
// Check Redis cache
try {
await redis.ping();
checks.cache = 'ok';
} catch (e) {
checks.cache = 'error: ' + e.message;
}
const healthy = Object.values(checks).every(v => v === 'ok');
res.status(healthy ? 200 : 503).json({
status: healthy ? 'ok' : 'degraded',
checks,
uptime: process.uptime()
});
});
Return 200 OK for healthy, 503 Service Unavailable for unhealthy. Kubernetes uses liveness to restart stuck pods and readiness to stop routing traffic to pods that aren't ready.
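A related pattern worth noting: during a rolling deploy, flip readiness to unhealthy as soon as SIGTERM arrives so Kubernetes drains traffic before the process exits. A minimal sketch, assuming `server` holds the return value of app.listen(3000):
// Fail readiness (but not liveness) while the pod is draining
let shuttingDown = false;
process.on('SIGTERM', () => {
  shuttingDown = true;
  // Let in-flight requests finish, then stop the HTTP server
  setTimeout(() => server.close(() => process.exit(0)), 10000);
});
// At the top of the /health/ready handler, before the dependency checks:
// if (shuttingDown) return res.status(503).json({ status: 'shutting_down' });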
Tooling: Grafana, Datadog, Jaeger
| Tool | Type | Best For | Pricing |
|---|---|---|---|
| Prometheus + Grafana | Metrics + Dashboards | Self-hosted, cost-effective at scale | Free (OSS) |
| Grafana Tempo | Distributed tracing | OSS, integrates with Grafana stack | Free (OSS) / Cloud paid |
| Jaeger | Distributed tracing | Self-hosted CNCF tracing | Free (OSS) |
| Datadog | Full observability platform | Enterprise, unified metrics/logs/traces | Paid (per host) |
| New Relic | Full observability platform | APM, browser, infrastructure | Free tier / Paid |
| OpenTelemetry Collector | Telemetry pipeline | Vendor-neutral collection + routing | Free (OSS) |
Recommended OSS Stack (2026)
# docker-compose.yml — full observability stack
services:
  app:
    build: .
    environment:
      OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector:4318
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    # Receives from app, exports to Prometheus + Tempo + Loki
  prometheus:
    image: prom/prometheus:latest
    # Scrapes metrics from otel-collector and /metrics endpoints
  grafana:
    image: grafana/grafana:latest
    # Dashboards for metrics (Prometheus), traces (Tempo), logs (Loki)
  tempo:
    image: grafana/tempo:latest
    # Stores and queries distributed traces
  loki:
    image: grafana/loki:latest
    # Stores and queries structured logs
Alerting Best Practices
Alert on symptoms, not causes. "p99 latency > 1s" is a symptom alert — it fires when users are affected. "DB query count > 1000/s" is a cause alert — it may fire without impacting users.
Recommended Alert Rules
# Prometheus alerting rules (alerting.yml)
groups:
  - name: api-alerts
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: rate(http_requests_total{status_code=~"5.."}[5m])
          / rate(http_requests_total[5m]) > 0.01
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "API error rate > 1%"
      # High p99 latency
      - alert: HighLatency
        expr: histogram_quantile(0.99,
          rate(http_request_duration_seconds_bucket[5m])) > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency > 1s"
      # Service down (no requests for 2 minutes)
      - alert: ServiceDown
        expr: rate(http_requests_total[2m]) == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "API receiving no traffic"
Runbook Best Practices
- Every alert should link to a runbook with step-by-step diagnosis and remediation
- Include a "for" duration to avoid alerting on transient spikes (e.g., for: 2m)
- Use severity levels: critical (page on-call now), warning (investigate within hours), info (log only)
- Review and tune alert thresholds quarterly — alert fatigue is as dangerous as no alerts
- Test your alerts: use amtool alert add or Alertmanager test mode to verify routing
Related Topics
- API Testing: Unit, integration, and load testing your REST API before it reaches production.
- Error Handling: Consistent error responses that make debugging and monitoring easier.
- Rate Limiting: Protect your API with throttling, a key part of saturation monitoring.
- Microservices: Distributed tracing is especially important in microservices architectures.
- gRPC vs REST: Observability considerations differ between gRPC and REST APIs.
- HTTP Headers: Request-ID, tracing headers, and deprecation headers for observability.