REST API Monitoring & Observability

The three pillars of observability — metrics, logs, and distributed traces — with OpenTelemetry instrumentation, health checks, and alerting best practices for production REST APIs.

Monitoring vs Observability

Monitoring tells you when something is wrong — dashboards, alerts, uptime checks. Observability lets you understand why something is wrong, even for failure modes you didn't anticipate, by providing rich queryable telemetry.

Aspect            | Monitoring                      | Observability
Question answered | "Is the API up?"                | "Why is request #4829 slow?"
Data              | Predefined metrics & alerts     | Metrics + Logs + Traces (any query)
Unknown failures  | Misses novel failure modes      | Can debug anything with enough data
Tools             | Pingdom, UptimeRobot, PagerDuty | OpenTelemetry, Grafana, Datadog, Jaeger

You need both. Monitoring catches known problems fast; observability helps you understand and fix unknown problems. The industry standard in 2026 is OpenTelemetry (OTel) — a CNCF graduated project that unifies metrics, logs, and traces under a single vendor-neutral SDK.

The Four Golden Signals

The Google SRE book defines four golden signals that every API should track (example PromQL queries follow the table):

Signal     | What to Measure                                                            | Alert Threshold (example)
Latency    | p50, p95, p99 response time (distinguish success vs error latency)        | p99 > 500ms for 5 min
Traffic    | Requests per second (RPS), broken down by endpoint and method             | RPS drops >50% vs 1h average
Errors     | 5xx error rate; also track 4xx separately (client vs server errors)       | 5xx rate > 1% for 2 min
Saturation | CPU, memory, DB connection pool, queue depth — how "full" is the service? | DB pool > 90% for 3 min
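As a rough translation into PromQL, the first three signals map directly onto the metrics defined in the next section (the http_request_duration_seconds histogram and http_requests_total counter); saturation depends on which exporters you run, so the last query below uses a prom-client default metric as a stand-in:

# Latency: p99 over the last 5 minutes
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Traffic: requests per second, broken down by route
sum by (route) (rate(http_requests_total[5m]))

# Errors: share of requests returning 5xx
sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# Saturation: one example, Node.js event loop lag from prom-client's default metrics
nodejs_eventloop_lag_seconds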

Metrics with Prometheus & Express

Prometheus is the de facto standard for API metrics collection. The prom-client library exposes metrics at a /metrics endpoint that Prometheus scrapes.

// npm install prom-client express
const express = require('express');
const promClient = require('prom-client');
const app = express();

// Collect default Node.js metrics (event loop lag, GC, heap, etc.)
promClient.collectDefaultMetrics();

// Custom histogram for HTTP request duration
const httpDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5]
});

// Counter for total requests
const httpRequests = new promClient.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status_code']
});

// Middleware — measure every request
app.use((req, res, next) => {
  const end = httpDuration.startTimer();
  res.on('finish', () => {
    const labels = {
      method: req.method,
      route: req.route?.path || req.path,
      status_code: res.statusCode
    };
    end(labels);
    httpRequests.inc(labels);
  });
  next();
});

// Expose metrics endpoint for Prometheus scraping
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', promClient.register.contentType);
  res.end(await promClient.register.metrics());
});

app.listen(3000);

With this in place, Prometheus scrapes /metrics every 15 seconds and Grafana can visualise latency percentiles, error rates, and throughput in real time.
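The scrape side lives in Prometheus's own configuration. A minimal sketch, assuming the API container is reachable as app:3000 on the same network:

# prometheus.yml (minimal sketch; target hostname is an assumption)
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'orders-api'
    metrics_path: /metrics
    static_configs:
      - targets: ['app:3000']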

Structured Logging

Logs should be machine-parseable JSON — not plain text strings. Structured logs enable filtering, aggregation, and correlation across millions of events in tools like Datadog, Loki, or CloudWatch.

// npm install pino
const pino = require('pino');
const logger = pino({ level: 'info' });

// Bad — plain text log, impossible to filter
console.log('POST /orders 201 42ms user=alice');

// Good — structured JSON log
logger.info({
  method:      'POST',
  path:        '/orders',
  status:      201,
  duration_ms: 42,
  user_id:     'user_alice',
  request_id:  'req_8f3a2b',
  order_id:    'order_98765'
}, 'Request completed');

// Output:
// {"level":30,"time":1714000000000,"method":"POST","path":"/orders",
//  "status":201,"duration_ms":42,"user_id":"user_alice",
//  "request_id":"req_8f3a2b","order_id":"order_98765","msg":"Request completed"}

Request ID Correlation

Always generate a unique request_id per incoming request and propagate it through every log line and downstream service call. This lets you reconstruct the full journey of a single request across your entire system:

const { randomUUID } = require('crypto');

app.use((req, res, next) => {
  // Use incoming request-id (from upstream) or generate a new one
  req.requestId = req.headers['x-request-id'] || randomUUID();
  res.setHeader('X-Request-ID', req.requestId);

  // Attach to every log in this request's scope
  req.log = logger.child({ request_id: req.requestId });
  next();
});

app.post('/orders', async (req, res) => {
  req.log.info({ user_id: req.user.id }, 'Creating order');
  const order = await orderService.create(req.body);
  req.log.info({ order_id: order.id }, 'Order created');
  res.status(201).json(order);
});
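Propagation is simply a matter of forwarding the same header on every outbound call. A sketch using Node's built-in fetch (Node 18+); the inventory-service URL is a placeholder:

app.get('/orders/:id', async (req, res) => {
  // Forward the correlation ID so downstream logs carry the same request_id
  const response = await fetch(`http://inventory-service/items/${req.params.id}`, {
    headers: { 'X-Request-ID': req.requestId }
  });
  res.json(await response.json());
});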

Distributed Tracing with OpenTelemetry

Distributed tracing shows the full execution path of a request across multiple services, databases, and external calls — each step represented as a span within a trace. OpenTelemetry is the CNCF standard for vendor-neutral tracing.

// npm install @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node
// tracing.js — load BEFORE anything else

const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

const sdk = new NodeSDK({
  serviceName: 'orders-api',
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces'
  }),
  instrumentations: [
    getNodeAutoInstrumentations() // auto-instruments Express, HTTP, pg, redis, etc.
  ]
});

sdk.start();
process.on('SIGTERM', () => sdk.shutdown());

With auto-instrumentation, every Express route, database query, Redis call, and outbound HTTP request is automatically traced — no manual span creation required for the basics.
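Traces and logs become far more useful together when each log line also carries the active trace ID, so a slow trace found in Tempo or Jaeger can be matched to its log entries. A sketch that builds on the pino request logger from earlier (register it after the request-id middleware):

const { trace } = require('@opentelemetry/api');

app.use((req, res, next) => {
  const activeSpan = trace.getActiveSpan();
  if (activeSpan && req.log) {
    // Add the OTel trace ID to this request's log context alongside request_id
    req.log = req.log.child({ trace_id: activeSpan.spanContext().traceId });
  }
  next();
});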

Manual Spans for Business Logic

const { trace, SpanStatusCode } = require('@opentelemetry/api');
const tracer = trace.getTracer('orders-api');

async function processOrder(orderData) {
  // Create a custom span for business logic
  return tracer.startActiveSpan('process-order', async (span) => {
    try {
      span.setAttributes({
        'order.user_id':    orderData.userId,
        'order.item_count': orderData.items.length,
        'order.total_usd':  orderData.total
      });

      const inventory = await checkInventory(orderData.items);
      const payment   = await chargePayment(orderData.payment);
      const order     = await createOrder({ ...orderData, paymentId: payment.id });

      span.setStatus({ code: SpanStatusCode.OK });
      return order;
    } catch (err) {
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      throw err;
    } finally {
      span.end();
    }
  });
}

Health Check Endpoints

Health checks allow load balancers, Kubernetes, and monitoring tools to determine whether your API is ready to receive traffic.

Endpoint            | Purpose                | Returns healthy when...
GET /health         | General health check   | Process is running
GET /health/live    | Liveness (Kubernetes)  | Process hasn't deadlocked
GET /health/ready   | Readiness (Kubernetes) | DB connected, cache warm, dependencies up
GET /health/startup | Startup probe          | Initialisation complete

// Health check implementation
app.get('/health/live', (req, res) => {
  // Liveness: just confirm the process is alive
  res.status(200).json({ status: 'ok', timestamp: new Date().toISOString() });
});

app.get('/health/ready', async (req, res) => {
  const checks = {};

  // Check database
  try {
    await db.query('SELECT 1');
    checks.database = 'ok';
  } catch (e) {
    checks.database = 'error: ' + e.message;
  }

  // Check Redis cache
  try {
    await redis.ping();
    checks.cache = 'ok';
  } catch (e) {
    checks.cache = 'error: ' + e.message;
  }

  const healthy = Object.values(checks).every(v => v === 'ok');
  res.status(healthy ? 200 : 503).json({
    status: healthy ? 'ok' : 'degraded',
    checks,
    uptime: process.uptime()
  });
});

Return 200 OK for healthy, 503 Service Unavailable for unhealthy. Kubernetes uses liveness to restart stuck pods and readiness to stop routing traffic to pods that aren't ready.
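A sketch of how these endpoints might be wired into a Kubernetes container spec; the port and timings are assumptions to tune for your service:

# Kubernetes probes (container spec excerpt)
livenessProbe:
  httpGet:
    path: /health/live
    port: 3000
  initialDelaySeconds: 10
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health/ready
    port: 3000
  periodSeconds: 5
  failureThreshold: 3
startupProbe:
  httpGet:
    path: /health/startup
    port: 3000
  periodSeconds: 2
  failureThreshold: 30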

Tooling: Grafana, Datadog, Jaeger

Tool                    | Type                         | Best For                                 | Pricing
Prometheus + Grafana    | Metrics + Dashboards         | Self-hosted, cost-effective at scale     | Free (OSS)
Grafana Tempo           | Distributed tracing          | OSS, integrates with Grafana stack       | Free (OSS) / Cloud paid
Jaeger                  | Distributed tracing          | Self-hosted CNCF tracing                 | Free (OSS)
Datadog                 | Full observability platform  | Enterprise, unified metrics/logs/traces  | Paid (per host)
New Relic               | Full observability platform  | APM, browser, infrastructure             | Free tier / Paid
OpenTelemetry Collector | Telemetry pipeline           | Vendor-neutral collection + routing      | Free (OSS)

Recommended OSS Stack (2026)

# docker-compose.yml — full observability stack
services:
  app:
    build: .
    environment:
      OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector:4318

  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    # Receives from app, exports to Prometheus + Tempo + Loki

  prometheus:
    image: prom/prometheus:latest
    # Scrapes metrics from otel-collector and /metrics endpoints

  grafana:
    image: grafana/grafana:latest
    # Dashboards for metrics (Prometheus), traces (Tempo), logs (Loki)

  tempo:
    image: grafana/tempo:latest
    # Stores and queries distributed traces

  loki:
    image: grafana/loki:latest
    # Stores and queries structured logs
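The collector's fan-out is defined in its own config file. A minimal sketch of the pipelines implied by the comments above; endpoints assume the compose service names and default ports:

# otel-collector config.yaml (sketch; endpoints are assumptions)
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889            # scraped by the prometheus service
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
    traces:
      receivers: [otlp]
      exporters: [otlp/tempo]
    logs:
      receivers: [otlp]
      exporters: [loki]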

Alerting Best Practices

Alert on symptoms, not causes. "p99 latency > 1s" is a symptom alert — it fires when users are affected. "DB query count > 1000/s" is a cause alert — it may fire without impacting users.

Recommended Alert Rules

# Prometheus alerting rules (alerting.yml)
groups:
  - name: api-alerts
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status_code=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m])) > 0.01
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "API error rate > 1%"

      # High p99 latency
      - alert: HighLatency
        expr: histogram_quantile(0.99,
              sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency > 1s"

      # Service down (no requests for 2 minutes)
      - alert: ServiceDown
        expr: sum(rate(http_requests_total[2m])) == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "API receiving no traffic"

Runbook Best Practices

  • Every alert should link to a runbook with step-by-step diagnosis and remediation
  • Include a "for" duration to avoid alerting on transient spikes (e.g., for: 2m)
  • Use severity levels: critical (page on-call now), warning (investigate within hours), info (log only); a routing sketch follows this list
  • Review and tune alert thresholds quarterly — alert fatigue is as dangerous as no alerts
  • Test your alerts: use amtool alert add or Alertmanager test mode to verify routing
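Severity-based paging is usually handled in Alertmanager's routing tree. A sketch; receiver names are placeholders and the actual PagerDuty/Slack integrations are omitted:

# alertmanager.yml (routing sketch; receiver configs omitted)
route:
  receiver: slack-warnings            # default for anything unmatched
  group_by: ['alertname', 'severity']
  routes:
    - matchers:
        - severity = "critical"
      receiver: pagerduty-oncall      # pages the on-call engineer
    - matchers:
        - severity = "warning"
      receiver: slack-warnings

receivers:
  - name: pagerduty-oncall
    # pagerduty_configs: ...
  - name: slack-warnings
    # slack_configs: ...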