ByteByteGo·March 3, 2026

Building a Unified Financial Data Pipeline for Consistency and Quality

Agoda transitioned from disparate data pipelines to a Financial Unified Data Pipeline (FINUDP) to overcome inconsistencies and ensure a single source of truth for critical financial data. This article details the architectural decisions, technical practices like shadow testing and proactive monitoring, and challenges faced in building a robust, high-quality data system using Apache Spark.


Agoda faced significant challenges with multiple, independently developed data pipelines for financial data. Each team's pipeline, while initially offering simplicity and clear ownership, led to duplicate data sources, inconsistent definitions and transformations, and a lack of centralized monitoring and quality control. The result was discrepancies in financial reporting, eroded trust in the data, and wasted computational resources.

The FINUDP Architecture

To address these issues, Agoda developed the Financial Unified Data Pipeline (FINUDP), a centralized system built on Apache Spark for distributed processing. The architecture comprises:

  • Source Tables: Raw data from upstream systems (bookings, payments).
  • Execution Layer: Apache Spark for data processing, integrating with GoFresh for schedule monitoring and internal alerting for job failures.
  • Data Lake: Storage for processed data with built-in validation mechanisms.
  • Downstream Consumers: Finance, Planning, and Ledger teams consuming the validated data.
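To make the flow above concrete, here is a minimal, pure-Python sketch of a single pipeline stage: read from a source, transform, validate before writing, and alert on failure. All names here (`run_finudp_stage`, `PipelineResult`) are illustrative, not Agoda's actual API, and the real system runs these steps as distributed Spark jobs.

```python
from dataclasses import dataclass

@dataclass
class PipelineResult:
    rows_written: int
    validation_passed: bool

def run_finudp_stage(source_rows, transform, validate):
    """Transform source rows, validate them, and only then 'write' to the lake.

    In the real pipeline this would be a Spark job; validation failure
    would trigger alerting instead of a plain exception.
    """
    transformed = [transform(row) for row in source_rows]
    if not all(validate(row) for row in transformed):
        raise ValueError("validation failed; blocking write and alerting")
    return PipelineResult(rows_written=len(transformed), validation_passed=True)

# Example: normalize booking amounts to two decimal places and
# reject negative amounts before they reach downstream consumers.
bookings = [{"booking_id": 1, "amount": 10.507}, {"booking_id": 2, "amount": 3.1}]
result = run_finudp_stage(
    bookings,
    transform=lambda r: {**r, "amount": round(r["amount"], 2)},
    validate=lambda r: r["amount"] >= 0,
)
print(result.rows_written)  # 2
```

The key design point mirrored here is that validation sits between transformation and the data lake write, so bad data is stopped before any downstream team can consume it.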
ℹ️

Key Non-Functional Requirements

FINUDP was designed with three critical non-functional requirements in mind: data freshness (hourly updates, monitored by GoFresh), reliability (automated data quality checks with immediate alerts), and maintainability (strong peer-reviewed designs, mandatory code reviews, and shadow testing).

Technical Practices for Data Quality Assurance

Agoda implemented several technical practices to ensure the reliability and quality of FINUDP, crucial for any production-grade data system:

  • Shadow Testing: Running new pipeline code alongside the old version on production data in a test environment and comparing outputs to catch side effects.
  • Staging Environment: A production-mirroring environment for comprehensive testing of new features, logic, and schema changes before deployment.
  • Proactive Monitoring: Daily snapshots, partition count checks, anomaly detection, and a multi-level alerting system (email, Slack, GoFresh, NOC) for rapid issue detection.
  • Data Integrity Verification: Using third-party tools (e.g., Quilliup) with SQL queries to compare source and target data and alert on significant deviations.
  • Data Contracts: Formal agreements with upstream teams defining data rules and structure, with both detection (real-time monitoring) and preventative (CI pipeline integration) contracts.

These practices collectively enhance data reliability, ensure consistent data quality, and provide robust mechanisms for identifying and resolving issues quickly, significantly improving trust in financial data.

Tags: data pipeline, ETL, Apache Spark, data quality, data consistency, shadow testing, data governance, financial data
