Distributed stream processing systems underpin real-time analytics at ByteDance, which operates one of the world’s largest production Apache Flink deployments. Yong Fang, Yuxing Han, and Meng Wang, all from ByteDance Inc., alongside Yifan Zhang, Yue Ma, and Chi Zhang, detail StreamShield, a novel resiliency solution designed to address the challenges of maintaining operational stability and recovering from failures within such a massive system. This research is significant because it moves beyond theoretical approaches to present a production-proven system, incorporating runtime optimisation, a hybrid replication strategy, and high availability features for external systems. StreamShield also introduces a robust testing pipeline, demonstrably improving both the efficiency and effectiveness of Flink clusters at ByteDance.
Achieving resiliency and stability, essential for meeting strict Service Level Objectives (SLOs), remains challenging in large-scale production environments.
StreamShield is designed along complementary perspectives of the engine, cluster, and release. From the engine perspective, StreamShield enhances runtime resiliency through adaptive optimizations and fine-grained fault-tolerance mechanisms, narrowing the recovery scope. It accelerates job startup by optimizing job parsing and task deployment, and incorporates a HotUpdate mechanism that reuses existing resources to reduce restart overhead and improve deployment efficiency.
Adaptive Shuffle enables dynamic and load-aware data redistribution, while Autoscaling adjusts resource allocation in response to workload surges. WeakHash relaxes strict key-to-task binding, mapping each key to a bounded set of candidate tasks to alleviate data skew. Single-task Recovery narrows the recovery scope to individual tasks, restoring failed tasks without restarting neighbors, substantially reducing recovery latency.
From the cluster perspective, StreamShield adopts a hybrid replication strategy, balancing recovery latency and operational overhead. It employs passive replication as the default and selectively switches to active replication for latency-critical workloads with stringent availability requirements. Robustness against failures in external dependencies is strengthened through high-availability configuration, backoff and retry strategies, and multi-layer fault tolerance techniques, localizing failures and preserving cluster-wide stability.
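To illustrate the backoff-and-retry idea for calls to external dependencies, here is a minimal, self-contained sketch; the retry limits, delay values, and the wrapped call are hypothetical and are not drawn from StreamShield’s code.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

/** Minimal sketch of exponential backoff with jitter for calls to an external dependency. */
public final class RetryWithBackoff {

    /** Retries the given call up to maxAttempts times, sleeping between attempts. */
    public static <T> T call(Callable<T> action, int maxAttempts, long baseDelayMs, long maxDelayMs)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e;
                if (attempt == maxAttempts) {
                    break;
                }
                // Exponential backoff capped at maxDelayMs, with random jitter to avoid retry storms.
                long delay = Math.min(maxDelayMs, baseDelayMs << (attempt - 1));
                long jitter = ThreadLocalRandom.current().nextLong(delay / 2 + 1);
                Thread.sleep(delay / 2 + jitter);
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical external lookup; in practice this could be a call to a
        // dimension-table service, a metadata store, or checkpoint storage.
        String result = call(() -> "ok", 5, 100, 5_000);
        System.out.println(result);
    }
}
```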
From the release perspective, StreamShield incorporates a robust testing and deployment pipeline, including systematic chaos testing and performance benchmarking to validate stability under fault-prone and high-load conditions. This pipeline bridges controlled validation with real-world deployment, automating defect detection and preventing costly post-deployment failures.
This paper makes several contributions through the design and deployment of StreamShield. These include engine-level resiliency techniques that accelerate job startup, adapt execution dynamically, and enhance fault tolerance. Cluster-level resiliency is achieved through a hybrid replication strategy and high-availability mechanisms for external dependencies.
A comprehensive testing and deployment pipeline, integrating chaos testing, micro- and macro-benchmarking, and online probe tasks, is also presented. As background, Flink’s architecture comprises a Client, a JobManager, and a set of TaskManagers. The Client compiles the job into a logical dataflow graph and submits it to the JobManager, which converts it into a physical execution plan.
The JobManager identifies available resources by querying registered TaskManagers and assigns tasks to suitable task slots. TaskManagers instantiate tasks and begin processing data streams, while the JobManager coordinates checkpointing and monitors execution progress. A job represents the complete streaming application, expressed as a logical dataflow of operators.
An operator denotes a high-level transformation or computation. During execution, each operator is parallelized into multiple tasks, which are the actual runtime units scheduled on TaskManagers and assigned to task slots.
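To make the job, operator, and task terminology concrete, the sketch below shows a minimal Flink DataStream job written against the public DataStream API (it is not taken from the paper). Each chained transformation is an operator; with the parallelism set to 4, each operator runs as four tasks that the JobManager places into TaskManager slots.

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class WordCountJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(4); // each operator below becomes 4 parallel tasks

        env.fromElements("flink", "stream", "flink")           // source operator
           .map(word -> Tuple2.of(word, 1))                     // map operator
           .returns(Types.TUPLE(Types.STRING, Types.INT))
           .keyBy(t -> t.f0)                                    // hash-partitioned by key
           .sum(1)                                              // stateful aggregation operator
           .print();                                            // sink operator

        // The Client compiles this dataflow and submits it to the JobManager,
        // which schedules the resulting tasks onto TaskManager slots.
        env.execute("word-count-sketch");
    }
}
```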
Evaluating StreamShield’s resilience through automated functional testing and chaos engineering is critical for ensuring consistent performance
A robust testing and release pipeline forms the cornerstone of the StreamShield solution for resilient stream processing. This pipeline integrates multi-layer validation to ensure both functional correctness and production reliability of the Apache Flink engine. It begins with continuous integration and unit testing upon version packaging, where automated test suites verify core functionalities and guard against regressions.
Daily builds then undergo chaos testing to examine system resilience under diverse failure scenarios, while micro benchmarking regularly measures the performance of operator execution and state backends. To comprehensively assess fault tolerance, chaos testing simulates anomalies at both the hardware and process levels.
Hardware-level perturbations introduce increased network latency, bandwidth throttling, CPU interference, and disk I/O contention, replicating resource heterogeneity and transient infrastructure bottlenecks. Process-level failures proactively terminate TaskManager and JobManager instances during active job execution, emulating node crashes or container evictions.
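A process-level chaos step of the kind described above can be pictured with a small sketch: pick a random worker, kill its TaskManager JVM, and observe whether the job recovers from its last checkpoint. The host names, the passwordless-SSH setup, and the use of `pkill` against a standalone cluster’s `TaskManagerRunner` process are all illustrative assumptions, not details from the paper.

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

/** Sketch of a process-level chaos step: kill one TaskManager and let the job recover. */
public class KillRandomTaskManager {
    public static void main(String[] args) throws Exception {
        // Hypothetical worker hosts of a standalone Flink cluster; in practice these
        // would come from the resource manager or a cluster inventory service.
        List<String> workers = List.of("worker-01", "worker-02", "worker-03");
        String victim = workers.get(ThreadLocalRandom.current().nextInt(workers.size()));

        // Assumes passwordless SSH; TaskManagerRunner is the main class of a Flink TaskManager JVM.
        new ProcessBuilder("ssh", victim, "pkill", "-f", "TaskManagerRunner")
                .inheritIO()
                .start()
                .waitFor();

        System.out.println("Killed TaskManager on " + victim
                + "; the JobManager should now restart the affected tasks from the last checkpoint.");
    }
}
```

Hardware-level perturbations such as added network latency or bandwidth throttling could be injected analogously, for instance with Linux traffic-control tooling, though the summary does not specify the tooling used at ByteDance.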
These chaos tests systematically evaluate the system’s ability to recover execution state from checkpoints, reassign resources, and restore operator pipelines without compromising correctness. Complementing fault-resilience validation, a structured performance testing framework employs both micro and macro benchmarking.
Micro benchmarking focuses on critical computational paths, specifically operator execution, job scheduling, and state management. Evaluations of state management investigate the latency of read/write operations, scalability under large key-space expansion, and recovery efficiency across diverse checkpointing configurations, all conducted in single-node environments to minimise external variability.
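As a rough picture of what a single-node state micro-benchmark might look like, the sketch below times point writes and reads against an embedded RocksDB instance, the store underlying Flink’s RocksDB state backend. It assumes the `rocksdbjni` library on the classpath; the key count, value size, and database path are arbitrary choices, and a real harness would also sweep key-space size and checkpointing configuration.

```java
import org.rocksdb.Options;
import org.rocksdb.RocksDB;

import java.nio.charset.StandardCharsets;

/** Sketch of a state read/write micro-benchmark against an embedded RocksDB instance. */
public class StateBackendMicroBench {
    public static void main(String[] args) throws Exception {
        RocksDB.loadLibrary();
        try (Options options = new Options().setCreateIfMissing(true);
             RocksDB db = RocksDB.open(options, "/tmp/state-bench")) {

            int keys = 1_000_000;
            byte[] value = new byte[128];

            long start = System.nanoTime();
            for (int i = 0; i < keys; i++) {
                db.put(("key-" + i).getBytes(StandardCharsets.UTF_8), value);
            }
            System.out.printf("avg write: %d ns%n", (System.nanoTime() - start) / keys);

            start = System.nanoTime();
            for (int i = 0; i < keys; i++) {
                db.get(("key-" + i).getBytes(StandardCharsets.UTF_8));
            }
            System.out.printf("avg read:  %d ns%n", (System.nanoTime() - start) / keys);
        }
    }
}
```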
Macro benchmarking deploys synthetic streaming queries at cluster scale, measuring end-to-end throughput and resource utilisation under realistic workloads. Pre-production and online probe tasks, lightweight streaming jobs emulating representative workloads, continuously validate engine stability and interoperability with external dependencies.
StreamShield achieves scale and stability through dynamic partitioning and fault tolerance mechanisms
The authors have engineered a resilient stream processing solution, StreamShield, currently sustaining over 70,000 concurrent streaming jobs and managing more than 11 million resource slots, demonstrating effective operation at substantial scale within ByteDance’s Flink clusters. The architecture incorporates runtime optimisation, fine-grained fault tolerance, a hybrid replication strategy, and high availability features designed to address challenges inherent in large-scale deployments.
WeakHash, a partitioning strategy implemented within StreamShield, dynamically selects execution targets for each record, associating each key with a bounded group of candidate tasks. This diffuses the load of heavily skewed keys across multiple tasks, preventing hotspots and improving overall workload balance, particularly in data warehousing workloads involving dimension table lookups.
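The notion of binding a key to a bounded set of candidate tasks, rather than to exactly one, can be approximated with a custom Flink `Partitioner`. The sketch below illustrates the concept only and is not ByteDance’s WeakHash implementation; the window size and the round-robin choice within the window are assumptions.

```java
import org.apache.flink.api.common.functions.Partitioner;

import java.util.concurrent.atomic.AtomicLong;

/**
 * Sketch of a WeakHash-style partitioner: each key hashes to a contiguous window of
 * `candidates` partitions, and records for that key are spread round-robin across the window.
 */
public class WeakHashPartitioner implements Partitioner<String> {

    private final int candidates;               // size of the candidate set per key
    private final AtomicLong counter = new AtomicLong();

    public WeakHashPartitioner(int candidates) {
        this.candidates = candidates;
    }

    @Override
    public int partition(String key, int numPartitions) {
        int width = Math.min(candidates, numPartitions);
        int base = Math.floorMod(key.hashCode(), numPartitions);
        // Rotate within the candidate window so a hot key no longer pins a single task.
        int offset = (int) (counter.getAndIncrement() % width);
        return (base + offset) % numPartitions;
    }
}
```

Applied via `stream.partitionCustom(new WeakHashPartitioner(4), keySelector)`, records for a hot key would be spread over at most four downstream tasks instead of one; downstream logic must then tolerate the relaxed key-to-task binding.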
WeakHash proves effective in workloads where minor data loss is tolerable, stabilising query latency under heavy skew. AutoScaling within StreamShield analyses operator-level signals, including input rate, processing rate, backlog growth, and operator fan-out, to identify scaling bottlenecks. Smoothing and compensation strategies are incorporated to improve model accuracy under metric distortion caused by transient fluctuations or workload skew.
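How such signals could be combined into a scaling decision is sketched below; the exponential smoothing, the backlog drain budget, and the formula itself are illustrative assumptions rather than StreamShield’s actual model.

```java
/** Sketch of deriving a target parallelism from operator-level rate and backlog signals. */
public class ScalingModel {

    private double smoothedInputRate;                  // EWMA-smoothed records/s entering the operator
    private static final double ALPHA = 0.3;           // smoothing factor to damp transient spikes
    private static final double DRAIN_SECONDS = 300;   // time budget to work off accumulated backlog

    /**
     * @param inputRate      observed records/s arriving at the operator
     * @param perTaskRate    records/s one task can process at the current resource allocation
     * @param backlogRecords records currently buffered upstream of the operator
     * @param maxParallelism upper bound imposed by cluster policy
     */
    public int targetParallelism(double inputRate, double perTaskRate,
                                 double backlogRecords, int maxParallelism) {
        // Smooth the input rate so short bursts do not trigger a rescale.
        smoothedInputRate = ALPHA * inputRate + (1 - ALPHA) * smoothedInputRate;

        // Capacity needed to keep up with arrivals and drain the backlog within the budget.
        double required = smoothedInputRate + backlogRecords / DRAIN_SECONDS;
        int target = (int) Math.ceil(required / perTaskRate);

        return Math.max(1, Math.min(target, maxParallelism));
    }
}
```

A production controller would wrap a calculation like this with the safeguards described next, rolling back failed adjustments and rate-limiting how often the job is reconfigured.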
Automatic rollback upon failed adjustments, business-driven shrinkage policies, and protective measures such as rate limiting further enhance stability during large-scale reconfiguration. Region checkpointing refines the recovery process in large-scale stateful workloads, reducing recovery latency and disruption.
This approach allows for more targeted recovery, improving the overall resiliency of the streaming cluster. Together, these techniques, spanning runtime optimisation, fine-grained fault tolerance, hybrid replication, and safeguards against failures in external systems, work in concert to minimise disruption and maintain service availability during adverse events.
The effectiveness of StreamShield is demonstrated through extensive evaluations on a production cluster operating at the scale noted above, highlighting the system’s ability to run efficiently in a demanding, real-world environment. A fine-grained recovery mechanism further reduces downtime by avoiding unnecessary reinitialisation of unaffected tasks, ensuring continuous throughput even during failures. While the current implementation focuses on reactive recovery, future development will explore proactive diagnosis and recovery utilising machine learning to anticipate and mitigate potential issues before they impact critical applications.
👉 More information
🗞 StreamShield: A Production-Proven Resiliency Solution for Apache Flink at ByteDance
🧠 ArXiv: https://arxiv.org/abs/2602.03189
