Serverless ETL Orchestration with Apache Airflow and AWS Step Functions: A Comparative Study

Authors

  • Chiranjeevi Devi, LinkedIn Corp, USA
  • Naveen Kumar Siripuram, CVS Health, USA
  • Amsa Selvaraj, Amtech Analytics, USA

Keywords:

serverless, ETL, orchestration, Apache Airflow, AWS Step Functions, Lambda, SCD Type 2, operational complexity

Abstract

This study compares Apache Airflow and AWS Step Functions as orchestrators for serverless ETL in large retail data pipelines. Two semantically identical Slowly Changing Dimension Type 2 (SCD Type 2) ingestion pipelines were built: one on containerised Airflow, the other using Step Functions to trigger AWS Lambda functions. The two pipelines are then assessed on cold-start delay, execution cost, operational complexity, and scalability under varying workloads. Real-world measurements show that traditional workflow schedulers and serverless orchestration frameworks each have strengths and weaknesses in governance, observability, retry semantics, state durability, and infrastructure-as-code (IaC) integration. Airflow offers better DAG-level control and plugin extensibility, while Step Functions is cheaper and more flexible. The findings help teams plan the migration of traditional batch ETL workloads to serverless orchestration, highlighting cost-performance trade-offs and organisational compliance considerations.
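For readers unfamiliar with SCD Type 2, the core merge step that both pipelines would need to perform can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the `DimRow` type, the `apply_scd2` function, and the sample keys are hypothetical names introduced here, and a real pipeline would run the equivalent logic against a warehouse table rather than an in-memory list.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class DimRow:
    """One version of a dimension record (hypothetical schema)."""
    key: str                          # business key, e.g. a product SKU
    attrs: tuple                      # tracked attributes compared for changes
    valid_from: date
    valid_to: Optional[date] = None   # None marks the current version

    @property
    def is_current(self) -> bool:
        return self.valid_to is None


def apply_scd2(dimension, incoming, load_date):
    """Return a new list of rows with SCD Type 2 history applied.

    `incoming` maps business key -> attrs tuple for this load.
    """
    current = {r.key: r for r in dimension if r.is_current}
    result = []
    for row in dimension:
        new_attrs = incoming.get(row.key)
        if row.is_current and new_attrs is not None and new_attrs != row.attrs:
            # Attributes changed: close the old version instead of overwriting it.
            result.append(replace(row, valid_to=load_date))
        else:
            result.append(row)
    for key, attrs in incoming.items():
        old = current.get(key)
        if old is None or old.attrs != attrs:
            # New key or changed attributes: open a fresh current version.
            result.append(DimRow(key, attrs, valid_from=load_date))
    return result


# Usage: one existing record changes, one new record arrives.
dim = [DimRow("sku-1", ("red",), date(2024, 1, 1))]
batch = {"sku-1": ("blue",), "sku-2": ("green",)}
out = apply_scd2(dim, batch, date(2024, 6, 1))
```

Because unchanged records produce no new versions, re-running the same batch leaves the dimension untouched, a property that matters when an orchestrator retries a failed task.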

Published

16-06-2025

How to Cite

[1]
Chiranjeevi Devi, Naveen Kumar Siripuram, and Amsa Selvaraj, “Serverless ETL Orchestration with Apache Airflow and AWS Step Functions: A Comparative Study”, European Journal of Quantum Computing and Intelligent Agents, vol. 9, pp. 15–52, Jun. 2025, Accessed: Jun. 11, 2026. [Online]. Available: https://ejqcia.org/index.php/publication/article/view/32