REST API Monitoring & Observability
The three pillars of observability — metrics, logs, and distributed traces — with OpenTelemetry instrumentation, health checks, and alerting best practices for production REST APIs.
Monitoring vs Observability
Monitoring tells you when something is wrong — dashboards, alerts, uptime checks. Observability lets you understand why something is wrong, even for failure modes you didn't anticipate, by providing rich queryable telemetry.
| Aspect | Monitoring | Observability |
|---|---|---|
| Question answered | "Is the API up?" | "Why is request #4829 slow?" |
| Data | Predefined metrics & alerts | Metrics + Logs + Traces (any query) |
| Unknown failures | Misses novel failure modes | Can debug anything with enough data |
| Tools | Pingdom, UptimeRobot, PagerDuty | OpenTelemetry, Grafana, Datadog, Jaeger |
You need both. Monitoring catches known problems fast; observability helps you understand and fix unknown problems. The industry standard in 2026 is OpenTelemetry (OTel), a CNCF project that unifies metrics, logs, and traces under a single vendor-neutral SDK.
The Four Golden Signals
Google's SRE book defines four golden signals that every API should track:
| Signal | What to Measure | Alert Threshold (example) |
|---|---|---|
| Latency | p50, p95, p99 response time (distinguish success vs error latency) | p99 > 500ms for 5 min |
| Traffic | Requests per second (RPS), broken down by endpoint and method | RPS drops >50% vs 1h average |
| Errors | 5xx error rate; also track 4xx separately (client vs server errors) | 5xx rate > 1% for 2 min |
| Saturation | CPU, memory, DB connection pool, queue depth — how "full" is the service? | DB pool > 90% for 3 min |
Metrics with Prometheus & Express
Prometheus is the de facto standard for API metrics collection. The prom-client library exposes metrics at a /metrics endpoint that Prometheus scrapes.
// npm install prom-client express
const express = require('express');
const promClient = require('prom-client');
const app = express();
// Collect default Node.js metrics (event loop lag, GC, heap, etc.)
promClient.collectDefaultMetrics();
// Custom histogram for HTTP request duration
const httpDuration = new promClient.Histogram({
name: 'http_request_duration_seconds',
help: 'HTTP request duration in seconds',
labelNames: ['method', 'route', 'status_code'],
buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5]
});
// Counter for total requests
const httpRequests = new promClient.Counter({
name: 'http_requests_total',
help: 'Total HTTP requests',
labelNames: ['method', 'route', 'status_code']
});
// Middleware — measure every request
app.use((req, res, next) => {
const end = httpDuration.startTimer();
res.on('finish', () => {
const labels = {
method: req.method,
route: req.route?.path || req.path,
status_code: res.statusCode
};
end(labels);
httpRequests.inc(labels);
});
next();
});
// Expose metrics endpoint for Prometheus scraping
app.get('/metrics', async (req, res) => {
res.set('Content-Type', promClient.register.contentType);
res.end(await promClient.register.metrics());
});
app.listen(3000);
With this in place, Prometheus scrapes /metrics every 15 seconds and Grafana can visualise latency percentiles, error rates, and throughput in real time.
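The histogram and counter above cover latency, traffic, and errors; the fourth golden signal, saturation, is usually exported as a gauge. A minimal sketch, assuming an existing node-postgres pool object (its totalCount and options.max fields are the assumption here):
// Gauge for DB connection pool saturation (assumes a pg.Pool instance named `pool`)
const dbPoolUsage = new promClient.Gauge({
  name: 'db_pool_connections_used_ratio',
  help: 'Fraction of the DB connection pool currently in use',
  collect() {
    // Called on every /metrics scrape
    this.set(pool.totalCount / pool.options.max);
  }
});
The same pattern works for queue depth or any other "how full is the service" measurement.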
Structured Logging
Logs should be machine-parseable JSON — not plain text strings. Structured logs enable filtering, aggregation, and correlation across millions of events in tools like Datadog, Loki, or CloudWatch.
// npm install pino
const pino = require('pino');
const logger = pino({ level: 'info' });
// Bad — plain text log, impossible to filter
console.log('POST /orders 201 42ms user=alice');
// Good — structured JSON log
logger.info({
method: 'POST',
path: '/orders',
status: 201,
duration_ms: 42,
user_id: 'user_alice',
request_id: 'req_8f3a2b',
order_id: 'order_98765'
}, 'Request completed');
// Output:
// {"level":30,"time":1714000000000,"method":"POST","path":"/orders",
// "status":201,"duration_ms":42,"user_id":"user_alice",
// "request_id":"req_8f3a2b","order_id":"order_98765","msg":"Request completed"}
Request ID Correlation
Always generate a unique request_id per incoming request and propagate it through every log line and downstream service call. This lets you reconstruct the full journey of a single request across your entire system:
const { randomUUID } = require('crypto');
app.use((req, res, next) => {
// Use incoming request-id (from upstream) or generate a new one
req.requestId = req.headers['x-request-id'] || randomUUID();
res.setHeader('X-Request-ID', req.requestId);
// Attach to every log in this request's scope
req.log = logger.child({ request_id: req.requestId });
next();
});
app.post('/orders', async (req, res) => {
req.log.info({ user_id: req.user.id }, 'Creating order');
const order = await orderService.create(req.body);
req.log.info({ order_id: order.id }, 'Order created');
res.status(201).json(order);
});
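Propagating the ID to downstream services is just a matter of forwarding the same header on outbound calls. A sketch using Node 18+ global fetch and a hypothetical inventory-service URL:
// Forward the request ID so downstream logs can be correlated with this request
async function getInventory(sku, requestId) {
  const response = await fetch(`http://inventory-service/items/${sku}`, {
    headers: { 'X-Request-ID': requestId }
  });
  return response.json();
}
// Inside a route handler: await getInventory(item.sku, req.requestId);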
Distributed Tracing with OpenTelemetry
Distributed tracing shows the full execution path of a request across multiple services, databases, and external calls — each step represented as a span within a trace. OpenTelemetry is the CNCF standard for vendor-neutral tracing.
// npm install @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node
// tracing.js — load BEFORE anything else
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const sdk = new NodeSDK({
serviceName: 'orders-api',
traceExporter: new OTLPTraceExporter({
url: 'http://otel-collector:4318/v1/traces'
}),
instrumentations: [
getNodeAutoInstrumentations() // auto-instruments Express, HTTP, pg, redis, etc.
]
});
sdk.start();
process.on('SIGTERM', () => sdk.shutdown());
With auto-instrumentation, every Express route, database query, Redis call, and outbound HTTP request is automatically traced — no manual span creation required for the basics.
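It is also worth stamping the active trace ID onto your structured logs so you can jump from a log line in Loki or Datadog straight to the corresponding trace. A sketch, assuming the per-request pino logger (req.log) from the request-ID middleware above:
const { trace } = require('@opentelemetry/api');
// Enrich the per-request logger with the current trace context
app.use((req, res, next) => {
  const activeSpan = trace.getActiveSpan();
  if (activeSpan && req.log) {
    const { traceId, spanId } = activeSpan.spanContext();
    req.log = req.log.child({ trace_id: traceId, span_id: spanId });
  }
  next();
});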
Manual Spans for Business Logic
const { trace, SpanStatusCode } = require('@opentelemetry/api');
const tracer = trace.getTracer('orders-api');
async function processOrder(orderData) {
// Create a custom span for business logic
return tracer.startActiveSpan('process-order', async (span) => {
try {
span.setAttributes({
'order.user_id': orderData.userId,
'order.item_count': orderData.items.length,
'order.total_usd': orderData.total
});
const inventory = await checkInventory(orderData.items);
const payment = await chargePayment(orderData.payment);
const order = await createOrder({ ...orderData, paymentId: payment.id });
span.setStatus({ code: SpanStatusCode.OK });
return order;
} catch (err) {
span.recordException(err);
span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
throw err;
} finally {
span.end();
}
});
}
Health Check Endpoints
Health checks allow load balancers, Kubernetes, and monitoring tools to determine whether your API is ready to receive traffic.
| Endpoint | Purpose | Returns healthy when... |
|---|---|---|
| GET /health | General health check | Process is running |
| GET /health/live | Liveness (Kubernetes) | Process hasn't deadlocked |
| GET /health/ready | Readiness (Kubernetes) | DB connected, cache warm, dependencies up |
| GET /health/startup | Startup probe | Initialisation complete |
// Health check implementation
app.get('/health/live', (req, res) => {
// Liveness: just confirm the process is alive
res.status(200).json({ status: 'ok', timestamp: new Date().toISOString() });
});
app.get('/health/ready', async (req, res) => {
const checks = {};
// Check database
try {
await db.query('SELECT 1');
checks.database = 'ok';
} catch (e) {
checks.database = 'error: ' + e.message;
}
// Check Redis cache
try {
await redis.ping();
checks.cache = 'ok';
} catch (e) {
checks.cache = 'error: ' + e.message;
}
const healthy = Object.values(checks).every(v => v === 'ok');
res.status(healthy ? 200 : 503).json({
status: healthy ? 'ok' : 'degraded',
checks,
uptime: process.uptime()
});
});
Return 200 OK for healthy, 503 Service Unavailable for unhealthy. Kubernetes uses liveness to restart stuck pods and readiness to stop routing traffic to pods that aren't ready.
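A related pattern worth noting: during a rolling deploy, flip readiness to unhealthy as soon as SIGTERM arrives so Kubernetes drains traffic before the process exits. A minimal sketch, assuming `server` holds the return value of app.listen(3000):
// Fail readiness (but not liveness) while the pod is draining
let shuttingDown = false;
process.on('SIGTERM', () => {
  shuttingDown = true;
  // Let in-flight requests finish, then stop the HTTP server
  setTimeout(() => server.close(() => process.exit(0)), 10000);
});
// At the top of the /health/ready handler, before the dependency checks:
// if (shuttingDown) return res.status(503).json({ status: 'shutting_down' });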
Tooling: Grafana, Datadog, Jaeger
| Tool | Type | Best For | Pricing |
|---|---|---|---|
| Prometheus + Grafana | Metrics + Dashboards | Self-hosted, cost-effective at scale | Free (OSS) |
| Grafana Tempo | Distributed tracing | OSS, integrates with Grafana stack | Free (OSS) / Cloud paid |
| Jaeger | Distributed tracing | Self-hosted CNCF tracing | Free (OSS) |
| Datadog | Full observability platform | Enterprise, unified metrics/logs/traces | Paid (per host) |
| New Relic | Full observability platform | APM, browser, infrastructure | Free tier / Paid |
| OpenTelemetry Collector | Telemetry pipeline | Vendor-neutral collection + routing | Free (OSS) |
Recommended OSS Stack (2026)
# docker-compose.yml — full observability stack
services:
  app:
    build: .
    environment:
      OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector:4318
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    # Receives from app, exports to Prometheus + Tempo + Loki
  prometheus:
    image: prom/prometheus:latest
    # Scrapes metrics from otel-collector and /metrics endpoints
  grafana:
    image: grafana/grafana:latest
    # Dashboards for metrics (Prometheus), traces (Tempo), logs (Loki)
  tempo:
    image: grafana/tempo:latest
    # Stores and queries distributed traces
  loki:
    image: grafana/loki:latest
    # Stores and queries structured logs
Alerting Best Practices
Alert on symptoms, not causes. "p99 latency > 1s" is a symptom alert — it fires when users are affected. "DB query count > 1000/s" is a cause alert — it may fire without impacting users.
Recommended Alert Rules
# Prometheus alerting rules (alerting.yml)
groups:
  - name: api-alerts
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: rate(http_requests_total{status_code=~"5.."}[5m])
          / rate(http_requests_total[5m]) > 0.01
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "API error rate > 1%"
      # High p99 latency
      - alert: HighLatency
        expr: histogram_quantile(0.99,
          rate(http_request_duration_seconds_bucket[5m])) > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency > 1s"
      # Service down (no requests for 2 minutes)
      - alert: ServiceDown
        expr: rate(http_requests_total[2m]) == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "API receiving no traffic"
Runbook Best Practices
- Every alert should link to a runbook with step-by-step diagnosis and remediation
- Include a "for" duration to avoid alerting on transient spikes (e.g., for: 2m)
- Use severity levels: critical (page on-call now), warning (investigate within hours), info (log only)
- Review and tune alert thresholds quarterly — alert fatigue is as dangerous as no alerts
- Test your alerts: use amtool alert add or Alertmanager test mode to verify routing
Related Topics
- API Testing: Unit, integration, and load testing your REST API before it reaches production.
- Error Handling: Consistent error responses that make debugging and monitoring easier.
- Rate Limiting: Protect your API with throttling, a key part of saturation monitoring.
- Microservices: Distributed tracing is especially important in microservices architectures.
- gRPC vs REST: Observability considerations differ between gRPC and REST APIs.
- HTTP Headers: Request-ID, tracing headers, and deprecation headers for observability.