This article discusses architectural considerations for building high-performance, low-latency machine learning inference pipelines, specifically for real-time algorithmic trading. It highlights challenges in feature engineering and model execution for time-sensitive predictions, advocating for in-memory data structures and multiprocessing to avoid bottlenecks.
Achieving low latency in machine learning inference pipelines is critical, especially in domains like algorithmic trading. Traditional batch processing or synchronous model execution can introduce unacceptable delays, making predictions stale and reducing their value. The core challenge lies in transforming raw, continuous data streams into features and performing model inference within sub-millisecond windows.
Machine learning models require structured feature vectors, not raw data streams. A feature engine is responsible for converting a continuous firehose of raw text (e.g., WebSocket JSON frames from an order book) into stationary rolling windows of statistical features on the fly. Because this conversion runs on every incoming frame, any inefficiency in it adds directly to end-to-end prediction latency.
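As a rough illustration of the first step of such a feature engine, the sketch below turns one raw JSON frame into two numeric inputs. The frame schema (`"bids"` and `"asks"` as lists of `[price, size]` pairs), the function name `book_volumes`, and the `depth` parameter are all assumptions made for this example; real exchange feeds differ.

```python
import json


def book_volumes(frame: str, depth: int = 5) -> tuple[float, float]:
    """Sum resting size on each side of the book for the top `depth` levels.

    Assumes a hypothetical frame layout:
    {"bids": [[price, size], ...], "asks": [[price, size], ...]}
    """
    msg = json.loads(frame)
    # Sizes often arrive as strings in JSON feeds, so convert explicitly
    bid_volume = sum(float(size) for _price, size in msg["bids"][:depth])
    ask_volume = sum(float(size) for _price, size in msg["asks"][:depth])
    return bid_volume, ask_volume
```

Keeping this step limited to parsing and a couple of sums means the per-frame cost stays small before any rolling statistics are computed.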
Sliding Ring-Buffer Pattern
Instead of relying on heavy database aggregation queries, high-throughput pipelines can employ an in-memory Sliding Ring-Buffer Pattern. This pattern allows for efficient, continuous computation of micro-structural features (like Order Book Imbalance) directly in RAM using tools like Redis or fixed-size NumPy arrays, achieving sub-millisecond recalculation intervals.
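A minimal sketch of the ring-buffer idea, assuming a fixed-size NumPy array and the common definition of Order Book Imbalance as (bid volume - ask volume) / (bid volume + ask volume). The class name `RollingImbalance` and its fields are hypothetical, and a rolling mean is only one of many features that could be maintained this way.

```python
import numpy as np


class RollingImbalance:
    """Fixed-size ring buffer holding the last `window` imbalance values."""

    def __init__(self, window: int) -> None:
        self.buf = np.zeros(window, dtype=np.float64)
        self.window = window
        self.head = 0      # index of the slot that will be overwritten next
        self.count = 0     # number of valid samples, capped at `window`
        self.total = 0.0   # running sum of the values currently stored

    def push(self, bid_volume: float, ask_volume: float) -> float:
        """Add one observation and return the rolling mean imbalance."""
        denom = bid_volume + ask_volume
        obi = (bid_volume - ask_volume) / denom if denom > 0 else 0.0
        # Overwrite the oldest slot and adjust the running sum in O(1);
        # unused slots start at zero, so the subtraction is a no-op until full
        self.total += obi - self.buf[self.head]
        self.buf[self.head] = obi
        self.head = (self.head + 1) % self.window
        self.count = min(self.count + 1, self.window)
        return self.total / self.count
```

Each update touches one array slot and a few scalars, so the cost does not grow with the window length. A long-running process may want to recompute the sum from the buffer occasionally to limit floating-point drift.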
Synchronously executing a deep learning prediction within the main data ingestion thread is a major bottleneck. It can block network sockets, lead to buffer overflows, and cause frame drops. Decoupling data ingestion from model execution is essential for reliable, high-speed performance.
Decoupling Inference with Multiprocessing
To achieve reliable execution speeds, architects can use a Multiprocessing Worker Pool. In this pattern, feature vectors are handed to an isolated inference process through inter-process queues, so the incoming data feed is never blocked by model execution. Additionally, exporting the model to an optimized serialized format and running it with an engine such as ONNX Runtime or TensorRT further improves inference speed, since the forward pass executes in native C++ code rather than in the Python interpreter.
import multiprocessing as mp
import numpy as np
import onnxruntime as ort


def inference_worker_loop(task_queue, execution_queue, model_path):
    # Initialize the high-performance inference session within the isolated worker process
    session = ort.InferenceSession(model_path)
    input_name = session.get_inputs()[0].name
    while True:
        # Wait for the next feature vector sent by the ingestion process
        features = task_queue.get()
        if features is None:
            # A None sentinel tells the worker to shut down
            break
        # Run the forward pass in ONNX Runtime's native (C++) backend
        prediction = session.run(None, {input_name: features.astype(np.float32)})
        # Assumed completion of the truncated listing: pass the first model
        # output on to the execution side through execution_queue
        execution_queue.put(prediction[0])
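The ingestion side of this hand-off can be sketched as follows. This is an illustrative helper, not part of the original listing: the name `offer_features` and the policy of dropping a vector when the queue is full are assumptions. It relies only on `put_nowait`, which both `multiprocessing.Queue` and `queue.Queue` provide and which raises `queue.Full` on a bounded queue that has no free slot.

```python
import queue


def offer_features(task_queue, features) -> bool:
    """Try to enqueue a feature vector without blocking the ingestion loop.

    Returns True if the vector was queued, False if the queue was full
    (meaning the inference worker is falling behind).
    """
    try:
        task_queue.put_nowait(features)
        return True
    except queue.Full:
        # Assumed policy: drop the vector instead of stalling the socket reader
        return False
```

With a bounded queue, a slow model then shows up as dropped feature vectors rather than as a stalled network socket, which is usually the preferable failure mode when stale predictions have little value.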