
Health Endpoint Monitoring

Expose health checks for load balancers and orchestrators: liveness vs readiness probes, deep health checks, and dependency monitoring.


What Is a Health Endpoint?

A health endpoint is a dedicated HTTP endpoint (typically `GET /health` or `GET /healthz`) that reports whether a service instance is operating correctly. Load balancers, Kubernetes, and monitoring systems poll this endpoint to decide whether to route traffic to the instance, restart it, or alert on-call engineers.

A well-designed health endpoint checks the service's actual ability to serve requests — not just that the process is running — by verifying connectivity to databases, caches, message brokers, and other critical dependencies.

Liveness vs Readiness Probes

Kubernetes distinguishes two health probe types with different semantics and consequences:

| Probe | Question It Answers | Failure Action | Checks |
| --- | --- | --- | --- |
| Liveness | Is this container alive (not stuck/deadlocked)? | Kill and restart the container | Process responsive, no deadlock. Avoid heavy dependency checks — a DB outage should NOT restart the container |
| Readiness | Is this container ready to serve traffic? | Remove from load balancer pool (don't restart) | Database reachable, migrations complete, warm-up done, dependency health acceptable |
| Startup (optional) | Has the container finished initializing? | Kill if startup takes too long | One-time check; prevents liveness from killing slow-starting containers |
⚠️

Don't Check External Dependencies in Liveness Probes

If your liveness probe checks the database and the database goes down, Kubernetes will restart every pod — causing a thundering-herd restart storm that makes recovery much harder. Liveness probes should only check that the application process itself is responsive (e.g., can it accept HTTP connections and respond to a simple endpoint).
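In Kubernetes, this separation is expressed directly in the pod spec. A minimal sketch — container name, image, port, paths, and timings here are illustrative assumptions, not defaults:

```yaml
# Hypothetical pod spec fragment
containers:
  - name: api
    image: example/api:2.3.1
    ports:
      - containerPort: 3000
    livenessProbe:          # checks only that the process responds
      httpGet:
        path: /health/live
        port: 3000
      periodSeconds: 10
      failureThreshold: 3   # restart after 3 consecutive failures
    readinessProbe:         # checks dependencies; failure removes the pod from Service endpoints
      httpGet:
        path: /health/ready
        port: 3000
      periodSeconds: 5
      failureThreshold: 2
    startupProbe:           # gives slow starters up to 30 × 5 s before liveness applies
      httpGet:
        path: /health/live
        port: 3000
      periodSeconds: 5
      failureThreshold: 30
```

Note that the liveness and startup probes hit the shallow `/health/live` path, while only readiness hits the dependency-checking `/health/ready` path — exactly the split the warning above calls for.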

Anatomy of a Deep Health Check

A deep health check returns the status of each critical dependency individually, enabling operators to quickly identify which component is failing. The response should include overall status plus per-component status.

```json
// GET /health/ready response
{
  "status": "degraded",
  "version": "2.3.1",
  "uptime": 3642,
  "checks": {
    "database": {
      "status": "healthy",
      "latency_ms": 4
    },
    "redis": {
      "status": "healthy",
      "latency_ms": 1
    },
    "payment_api": {
      "status": "unhealthy",
      "error": "Connection refused",
      "latency_ms": null
    },
    "disk_space": {
      "status": "healthy",
      "free_gb": 45.2
    }
  }
}
```

The overall status is `degraded` (not fully `healthy`) because the payment API is unreachable. The service might still be able to serve read requests, so it returns a non-fatal status rather than marking itself unhealthy entirely. Define per-component severity: critical dependencies (DB) failing = `unhealthy`; optional dependencies failing = `degraded`.
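That severity rule can be folded into a small helper. A sketch — the `critical` flag and type names are assumptions for illustration, not part of a standard API:

```typescript
type ComponentStatus = "healthy" | "unhealthy";
type OverallStatus = "healthy" | "degraded" | "unhealthy";

interface ComponentCheck {
  status: ComponentStatus;
  critical: boolean; // does this dependency's failure make the whole service unusable?
}

// Fold per-component results into one overall status:
// any critical failure => "unhealthy"; any optional failure => "degraded".
function overallStatus(checks: ComponentCheck[]): OverallStatus {
  if (checks.some((c) => c.status === "unhealthy" && c.critical)) return "unhealthy";
  if (checks.some((c) => c.status === "unhealthy")) return "degraded";
  return "healthy";
}
```

With the database marked critical and the payment API optional, an unreachable payment API yields `degraded` while a database failure yields `unhealthy` — matching the JSON response above.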

Health Check Implementation

```typescript
import express from "express";
import { db } from "./db";
import { redis } from "./cache";

const app = express();

// Liveness: is the process alive? No dependency checks here.
app.get("/health/live", (req, res) => {
  res.json({ status: "ok" });
});

// Readiness: can this instance serve traffic?
app.get("/health/ready", async (req, res) => {
  const checks: Record<string, object> = {};
  let overall: "healthy" | "degraded" | "unhealthy" = "healthy";

  // Database check (critical dependency)
  try {
    const start = Date.now();
    await db.query("SELECT 1");
    checks.database = { status: "healthy", latency_ms: Date.now() - start };
  } catch (err) {
    checks.database = { status: "unhealthy", error: String(err) };
    overall = "unhealthy"; // critical dependency
  }

  // Redis check (non-critical: the service can serve cache misses without it)
  try {
    const start = Date.now();
    await redis.ping();
    checks.redis = { status: "healthy", latency_ms: Date.now() - start };
  } catch (err) {
    checks.redis = { status: "degraded", error: String(err) };
    if (overall === "healthy") overall = "degraded"; // non-critical
  }

  const statusCode = overall === "unhealthy" ? 503 : 200;
  res.status(statusCode).json({ status: overall, checks });
});

app.listen(3000);
```

Health Checks in Load Balancers

AWS ALB and NLB poll a configured health check endpoint at a fixed interval. An instance is marked healthy only after passing a configured number of consecutive checks (the healthy threshold) and unhealthy only after failing a configured number (the unhealthy threshold); this hysteresis prevents flapping on a single transient failure. Unhealthy instances are removed from the target group and receive no traffic until they recover. The health check path, port, protocol, interval (default 30 s), timeout, and both thresholds are all configurable.
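This hysteresis can be sketched as a small state machine. The class and thresholds below are illustrative, not AWS defaults:

```typescript
// Tracks consecutive passes/failures and flips state only at the thresholds,
// so a single blip never removes (or re-adds) an instance.
class HealthTracker {
  private passStreak = 0;
  private failStreak = 0;
  healthy = true;

  constructor(
    private readonly healthyThreshold: number,   // consecutive passes to become healthy
    private readonly unhealthyThreshold: number, // consecutive failures to become unhealthy
  ) {}

  // Record one probe result and return the (possibly updated) state.
  record(pass: boolean): boolean {
    if (pass) {
      this.passStreak++;
      this.failStreak = 0; // a pass resets the failure streak
      if (!this.healthy && this.passStreak >= this.healthyThreshold) this.healthy = true;
    } else {
      this.failStreak++;
      this.passStreak = 0; // a failure resets the pass streak
      if (this.healthy && this.failStreak >= this.unhealthyThreshold) this.healthy = false;
    }
    return this.healthy;
  }
}
```

With an unhealthy threshold of 3 and a 10 s interval, detecting a dead instance takes roughly 30 s — a useful back-of-envelope when tuning interval and thresholds against your availability target.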

[Diagram: Load balancer health check cycle — automatic traffic removal on failure and re-addition on recovery]

Health Check Standards

The Health Check API pattern (from Microsoft Azure patterns) and the IETF draft for `application/health+json` (draft-inadarei-api-health-check) both define standard response formats. Spring Boot Actuator (`/actuator/health`) and Node.js `@godaddy/terminus` provide batteries-included health check frameworks that integrate with Kubernetes probes out of the box.

💡

Interview Tip

Health endpoints come up when discussing operational excellence and Kubernetes deployments. Key distinctions to make: liveness (restart the pod) vs readiness (remove from LB pool), why you must never check external dependencies in liveness probes, and how health checks enable zero-downtime deployments (readiness probe prevents traffic until the new pod is fully initialized). Mention the 'thundering herd restart' anti-pattern as a concrete failure mode.
