GitHub re-engineered its search architecture for GitHub Enterprise Server (GHES) to improve high availability and reduce administrative overhead. The previous design, which clustered Elasticsearch across primary and replica GHES nodes, led to complex maintenance issues and potential downtime. The new architecture leverages Elasticsearch's Cross Cluster Replication (CCR) to enable independent, single-node Elasticsearch clusters on each GHES instance, significantly simplifying data replication and ensuring durability.
Read original on GitHub Engineering
GitHub's search functionality is critical, powering not just explicit search bars but also features like issue counts, release pages, and project views. Historically, managing search indexes in GitHub Enterprise Server (GHES) High Availability (HA) setups was challenging due to the intricate integration with older Elasticsearch versions. Administrators faced issues with corrupted or locked indexes if maintenance or upgrade steps weren't followed precisely.
The prior GHES HA model used a leader/follower pattern, with a primary node handling writes and replicas standing by for failover. Elasticsearch, however, had no matching leader/follower arrangement for nodes inside a single cluster, so GitHub engineers formed one Elasticsearch cluster spanning the primary and replica GHES nodes. This initially offered straightforward data replication and local search performance, but it introduced significant operational complexity: the cluster, not GHES, decided where index data lived, so critical data could end up on a replica node that GHES treats as read-only.
The pivotal change was adopting Elasticsearch's native Cross Cluster Replication (CCR) feature. Instead of a single logical Elasticsearch cluster spread across GHES nodes, the new approach utilizes several independent, single-node Elasticsearch clusters. Each GHES server instance now runs its own single-node Elasticsearch cluster.
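For one independent cluster to follow another, the follower cluster first has to know the leader as a remote cluster. The sketch below only builds the body of Elasticsearch's cluster settings request for that registration and sends nothing. The alias `ghes-primary` and the seed address are hypothetical placeholders, not values from GitHub's setup.

```python
import json


def remote_cluster_request(alias, seed):
    """Build the request that registers a leader cluster on a follower.

    alias: name the follower will use to refer to the leader (hypothetical).
    seed:  host:port of the leader's transport interface (hypothetical).
    """
    body = {"persistent": {"cluster": {"remote": {alias: {"seeds": [seed]}}}}}
    return ("PUT", "/_cluster/settings", body)


# Example: the replica's single-node cluster learns about the primary's.
method, path, body = remote_cluster_request(
    "ghes-primary", "primary.example.internal:9300"
)
print(method, path)
print(json.dumps(body, indent=2))
```

Each single-node cluster would typically also be started with `discovery.type: single-node` in its `elasticsearch.yml`, so it never tries to join the other node's cluster.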
How CCR Addresses HA Challenges
CCR allows for controlled, native replication of index data between these independent Elasticsearch clusters. Data is copied only after it has been persisted to Lucene segments (Elasticsearch's underlying storage), so a follower only ever receives data that is already durable on the leader. Replication is asynchronous, so a follower may briefly lag behind its leader. This gives search data a leader/follower pattern that aligns with GHES's HA model and prevents critical data from being marooned on read-only nodes during outages or maintenance.
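As a rough illustration of what starting replication for one index looks like, the sketch below builds the Elasticsearch CCR follow request that would be sent to the follower cluster. It constructs the request only and performs no network call. The index name `ghe-issues-1` and the remote alias `ghes-primary` are made up for the example.

```python
def follow_request(index, remote_cluster):
    """Build the CCR request that creates a follower index.

    The follower index gets the same name as the leader index, which is
    an assumption of this sketch rather than a CCR requirement.
    """
    path = f"/{index}/_ccr/follow?wait_for_active_shards=1"
    body = {"remote_cluster": remote_cluster, "leader_index": index}
    return ("PUT", path, body)


method, path, body = follow_request("ghe-issues-1", "ghes-primary")
print(method, path, body)
```

The request goes to the follower, which then pulls operations from the leader. The leader does not push to it.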
While Elasticsearch CCR handles document replication, GitHub engineered custom workflows for the rest of the index's lifecycle, including failover, index deletion, and upgrades. A key part of the migration involves a bootstrap step to attach followers to existing indexes, followed by setting up auto-follow policies for future indexes. This ensures a seamless transition and continuous replication.
function bootstrap_ccr(primary, replica):
    # Fetch the current indexes on each
    primary_indexes = list_indexes(primary)
    replica_indexes = list_indexes(replica)

    # Filter out the system indexes
    managed = filter(primary_indexes, is_managed_ghe_index)

    # For indexes without follower patterns we need to
    # initialize that contract
    for index in managed:
        if index not in replica_indexes:
            ensure_follower_index(replica, leader=primary, index=index)
        else:
            ensure_following(replica, leader=primary, index=index)

    # Finally we will set up auto-follower patterns
    # so new indexes are automatically followed
    ensure_auto_follow_policy(
        replica,
        leader=primary,
        patterns=[managed_index_patterns],
        exclude=[system_index_patterns]
    )
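The pseudocode above can be approximated in runnable form. The following Python sketch is not GitHub's implementation: it plans the Elasticsearch CCR requests a bootstrap might issue against the replica, without sending them. The `ghe-*` naming pattern, the policy name `ghes-managed`, and the choice to merely inspect indexes that already exist on the replica are all assumptions made for illustration.

```python
from fnmatch import fnmatch

MANAGED_PATTERNS = ["ghe-*"]  # hypothetical naming scheme for managed indexes
SYSTEM_PATTERNS = [".*"]      # system and hidden indexes start with a dot


def is_managed(index):
    """Return True for managed indexes, False for system or unrelated ones."""
    if any(fnmatch(index, p) for p in SYSTEM_PATTERNS):
        return False
    return any(fnmatch(index, p) for p in MANAGED_PATTERNS)


def plan_bootstrap(primary_indexes, replica_indexes, remote="ghes-primary"):
    """Return the (method, path, body) requests to send to the replica."""
    requests = []
    on_replica = set(replica_indexes)
    for index in sorted(i for i in primary_indexes if is_managed(i)):
        if index not in on_replica:
            # No copy on the replica yet: create a follower index.
            body = {"remote_cluster": remote, "leader_index": index}
            requests.append(("PUT", f"/{index}/_ccr/follow", body))
        else:
            # Already present: inspect its follower status before acting.
            requests.append(("GET", f"/{index}/_ccr/info", None))
    # Auto-follow policy so indexes created later are followed as well.
    policy = {
        "remote_cluster": remote,
        "leader_index_patterns": MANAGED_PATTERNS,
        "leader_index_exclusion_patterns": SYSTEM_PATTERNS,
        "follow_index_pattern": "{{leader_index}}",
    }
    requests.append(("PUT", "/_ccr/auto_follow/ghes-managed", policy))
    return requests


plan = plan_bootstrap(
    primary_indexes=["ghe-issues-1", "ghe-code-1", ".security-7"],
    replica_indexes=["ghe-code-1"],
)
for method, path, _body in plan:
    print(method, path)
```

Returning a plan instead of calling Elasticsearch directly keeps the decision logic easy to test. A real workflow would also have to decide what to do when an index exists on the replica but is not a follower, which this sketch leaves open.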