GitHub Engineering · March 3, 2026

Rebuilding GitHub Enterprise Server's Search Architecture for High Availability with Elasticsearch CCR

GitHub re-engineered the search architecture of GitHub Enterprise Server (GHES) to improve high availability and reduce administrative overhead. The previous design ran one Elasticsearch cluster spanning the primary and replica GHES nodes, which made maintenance error-prone and could cause downtime. The new architecture runs an independent, single-node Elasticsearch cluster on each GHES instance and uses Elasticsearch's Cross Cluster Replication (CCR) to copy index data between them, which simplifies replication and keeps search data durable.

GitHub's search functionality is critical, powering not just explicit search bars but also features like issue counts, release pages, and project views. Historically, managing search indexes in GitHub Enterprise Server (GHES) High Availability (HA) setups was challenging due to the intricate integration with older Elasticsearch versions. Administrators faced issues with corrupted or locked indexes if maintenance or upgrade steps weren't followed precisely.

Challenges with Previous Elasticsearch Architecture

The prior GHES HA model used a leader/follower pattern: a primary node handled all writes, and replica nodes stood by for failover. An Elasticsearch cluster has no matching notion of a fixed leader node, because any node in the cluster may hold primary shards. To replicate search data anyway, GitHub engineers created a single Elasticsearch cluster spanning the primary and replica GHES nodes. This initially offered straightforward data replication and local search performance, but it introduced significant operational complexity.

Locked States After Shard Movement: Elasticsearch could move primary shards (responsible for writes) onto a replica node. If that replica was then taken down, GHES could enter a locked state: Elasticsearch couldn't become healthy until the replica rejoined, and the replica couldn't rejoin until Elasticsearch was healthy.
Complex Maintenance: Managing Elasticsearch clusters across disparate primary/replica GHES nodes made maintenance and upgrades highly error-prone, often requiring specific sequences to avoid data integrity issues.
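
The locked state described in the first item is a circular wait, which a toy model makes concrete. This is an illustrative sketch only, not GHES code: the `Node` class and both rule functions are hypothetical, and the health rule is simplified to "every primary shard sits on a running node".

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for one GHES node in the shared cluster."""
    name: str
    running: bool
    primary_shards: int  # primary shards currently hosted on this node


def cluster_healthy(nodes):
    # Simplified rule: healthy only if no primary shard lives on a stopped node.
    return all(n.running for n in nodes if n.primary_shards > 0)


def can_start(node, nodes):
    # Rule from the text: a stopped replica waits for a healthy cluster.
    return node.running or cluster_healthy(nodes)


primary = Node("primary", running=True, primary_shards=3)
replica = Node("replica", running=False, primary_shards=1)  # a shard moved here
nodes = [primary, replica]

# Neither condition can become true first, so the system is stuck.
deadlocked = not cluster_healthy(nodes) and not can_start(replica, nodes)
```

In this model the deadlock disappears as soon as the stopped replica holds no primary shards, which is the property the single-node design guarantees by construction.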

Solution: Embracing Elasticsearch Cross Cluster Replication (CCR)

The pivotal change was adopting Elasticsearch's native Cross Cluster Replication (CCR) feature. Instead of a single logical Elasticsearch cluster spread across GHES nodes, the new approach utilizes several independent, single-node Elasticsearch clusters. Each GHES server instance now runs its own single-node Elasticsearch cluster.
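
As a rough sketch of what this topology implies at the Elasticsearch level, the snippet below builds two pieces of configuration: node settings that keep a cluster single-node, and the request body that registers another cluster as a CCR remote. `discovery.type: single-node` and the `cluster.remote.<alias>.seeds` setting are standard Elasticsearch settings; the function names, cluster names, alias, and host are hypothetical, and the source does not show GitHub's actual configuration.

```python
import json


def single_node_settings(cluster_name):
    # `discovery.type: single-node` tells Elasticsearch not to discover
    # or join peers, so each GHES node keeps its own cluster.
    return {"cluster.name": cluster_name, "discovery.type": "single-node"}


def remote_cluster_request(alias, seed_host, transport_port=9300):
    # Body for PUT /_cluster/settings on the follower side. It registers
    # the leader cluster under `alias` so CCR calls can refer to it.
    seed = f"{seed_host}:{transport_port}"
    return {"persistent": {"cluster": {"remote": {alias: {"seeds": [seed]}}}}}


# Hypothetical names, for illustration only.
primary_settings = single_node_settings("ghes-search-primary")
replica_settings = single_node_settings("ghes-search-replica")
register_leader = remote_cluster_request("ghes_primary", "primary.example.internal")

print(json.dumps(register_leader, indent=2))
```

Because the two clusters only know each other as remotes, stopping one node cannot change shard allocation or cluster health on the other.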

How CCR Addresses HA Challenges

CCR provides controlled, native replication of index data between these independent Elasticsearch clusters. Data is replicated only after it has been durably persisted to Lucene segments (Elasticsearch's underlying storage), so followers only ever receive durable writes. This gives search data a leader/follower pattern that matches GHES's HA model and prevents critical data from being stranded on read-only nodes during outages or maintenance.
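
For reference, attaching a follower index to a leader is a single call to Elasticsearch's CCR follow API, issued against the follower cluster. The helper below only builds that request so it can be inspected without a running cluster. The endpoint and body fields come from the Elasticsearch CCR API; the helper name, the `issues-1` index, and the `ghes_primary` alias are hypothetical.

```python
def follow_request(follower_index, remote_cluster, leader_index):
    """Return (method, path, body) for the CCR create-follower call."""
    # wait_for_active_shards=1 makes the call return once the follower's
    # primary shard is active rather than immediately.
    path = f"/{follower_index}/_ccr/follow?wait_for_active_shards=1"
    body = {"remote_cluster": remote_cluster, "leader_index": leader_index}
    return "PUT", path, body


# Hypothetical index and alias; the follower keeps the leader's index name.
method, path, body = follow_request("issues-1", "ghes_primary", "issues-1")
```

The follower pulls changes from the leader, so the replica node initiates replication and the primary needs no knowledge of which followers exist.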

Custom Workflows for Lifecycle Management

While Elasticsearch CCR handles document replication, GitHub engineered custom workflows for the rest of the index's lifecycle, including failover, index deletion, and upgrades. A key part of the migration involves a bootstrap step to attach followers to existing indexes, followed by setting up auto-follow policies for future indexes. This ensures a seamless transition and continuous replication.
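
To give a sense of what a failover workflow has to do, the sketch below lists the sequence Elasticsearch documents for turning a follower index into a regular, writable index: pause following, close the index, unfollow, then reopen. Whether GitHub's workflow uses exactly these steps is not stated in the source, and the function names and index names here are hypothetical.

```python
def promote_follower_steps(index):
    """Ordered (method, path) calls that convert a follower to a regular index."""
    return [
        ("POST", f"/{index}/_ccr/pause_follow"),  # stop pulling from the leader
        ("POST", f"/{index}/_close"),             # unfollow requires a closed index
        ("POST", f"/{index}/_ccr/unfollow"),      # drop the follower metadata
        ("POST", f"/{index}/_open"),              # reopen as a writable index
    ]


def failover_plan(follower_indexes):
    # Promote every follower in turn; the order within one index matters.
    plan = []
    for index in follower_indexes:
        plan.extend(promote_follower_steps(index))
    return plan


plan = failover_plan(["issues-1", "code-search-1"])
```

Producing the plan as data before executing it keeps the workflow easy to log and to retry step by step if a call fails midway.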

python

# The helper functions and the *_index_patterns values are assumed to be
# defined elsewhere; the original post does not show them.
def bootstrap_ccr(primary, replica):
    # Fetch the current indexes on each node
    primary_indexes = list_indexes(primary)
    replica_indexes = list_indexes(replica)

    # Keep only the GHES-managed indexes, dropping system indexes
    managed = [index for index in primary_indexes if is_managed_ghe_index(index)]

    # Indexes missing on the replica need a follower index created;
    # indexes already present must be confirmed as following the leader
    for index in managed:
        if index not in replica_indexes:
            ensure_follower_index(replica, leader=primary, index=index)
        else:
            ensure_following(replica, leader=primary, index=index)

    # Finally, set up auto-follow patterns so that
    # new indexes are followed automatically
    ensure_auto_follow_policy(
        replica,
        leader=primary,
        patterns=[managed_index_patterns],
        exclude=[system_index_patterns],
    )
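
The final auto-follow step plausibly maps onto Elasticsearch's auto-follow pattern API. The sketch below builds such a request under that assumption. The endpoint and the `leader_index_patterns`, `leader_index_exclusion_patterns`, and `follow_index_pattern` fields belong to the Elasticsearch CCR API (exclusion patterns require a reasonably recent release); the helper name, policy name, alias, and patterns are hypothetical and do not come from GitHub's implementation.

```python
def auto_follow_request(policy_name, remote_cluster, patterns, exclude):
    """Return (method, path, body) for creating an auto-follow pattern."""
    body = {
        "remote_cluster": remote_cluster,
        "leader_index_patterns": list(patterns),
        "leader_index_exclusion_patterns": list(exclude),
        # Name each follower index after its leader index.
        "follow_index_pattern": "{{leader_index}}",
    }
    return "PUT", f"/_ccr/auto_follow/{policy_name}", body


# Hypothetical policy: follow managed indexes, skip dot-prefixed system ones.
method, path, body = auto_follow_request(
    "ghes-managed", "ghes_primary", ["issues-*", "code-search-*"], [".*"]
)
```

Auto-follow patterns apply only to indexes created after the pattern exists, which is why a separate bootstrap pass over existing indexes is needed first.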
