A Governed Lakehouse DataOps Architecture: Design and Evaluation in Healthcare
Healthcare organizations increasingly require secure, governed, and AI-ready data pipelines that can handle heterogeneous and sensitive data sources. This study designs and evaluates a unified DataOps reference architecture that operationalizes the full data lifecycle through a governed Medallion Lakehouse model. The proposed architecture integrates data-centric CI/CD, Infrastructure as Code, workflow orchestration, governance and metadata management, monitoring, and explicit promotion contracts across the Bronze, Silver, and Gold layers. The framework was implemented and evaluated in a controlled healthcare testbed using approximately 3.5 GB of multi-source clinical data over a 25-day workload. In this evaluation, the architecture achieved a DataOps Operational Excellence Index (DOEI) of 0.92, an ingestion throughput of approximately 100 MB/s, a data quality score of 97.87%, and a 72% reduction in infrastructure provisioning time, from 3 hours to 50 minutes. The main novelty of this work lies in combining a governed Lakehouse-based DataOps architecture with explicit promotion contracts and a composite benchmarking index for assessing operational maturity. The resulting framework is reproducible, auditable, and scalable, and is intended for secure data operations in regulated environments such as healthcare.
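The promotion contracts mentioned above can be pictured as a gate between adjacent Medallion layers. The following is a minimal sketch under stated assumptions, not the authors' implementation: the names PromotionContract, quality_score, and can_promote, the 0.95 threshold, the sample records, and the completeness-based definition of quality are all hypothetical, since the abstract does not specify how the contracts or the data quality score are computed.

```python
from dataclasses import dataclass

# Medallion layers in promotion order, as named in the abstract.
LAYERS = ("bronze", "silver", "gold")


@dataclass(frozen=True)
class PromotionContract:
    """Hypothetical contract governing promotion between two adjacent layers."""
    source: str
    target: str
    min_quality: float        # assumed threshold: minimum share of valid records (0..1)
    required_columns: tuple   # assumed rule: these fields must be present and non-empty


def quality_score(records, required_columns):
    """Share of records in which every required column is present and non-empty.

    This completeness measure is an illustrative stand-in, not the paper's metric.
    """
    if not records:
        return 0.0
    valid = sum(
        all(record.get(col) not in (None, "") for col in required_columns)
        for record in records
    )
    return valid / len(records)


def can_promote(records, contract):
    """Return (decision, score) for promoting a batch under the given contract."""
    step = LAYERS.index(contract.target) - LAYERS.index(contract.source)
    if step != 1:
        raise ValueError("a contract must connect adjacent layers in order")
    score = quality_score(records, contract.required_columns)
    return score >= contract.min_quality, score


# Illustrative batch: one of four records is missing a required value.
batch = [
    {"patient_id": "p1", "heart_rate": 72},
    {"patient_id": "p2", "heart_rate": None},
    {"patient_id": "p3", "heart_rate": 80},
    {"patient_id": "p4", "heart_rate": 65},
]
bronze_to_silver = PromotionContract(
    source="bronze",
    target="silver",
    min_quality=0.95,
    required_columns=("patient_id", "heart_rate"),
)
promoted, score = can_promote(batch, bronze_to_silver)
```

In this sketch the batch would be held back in Bronze because its completeness falls below the assumed threshold. Making the rule an explicit, versionable object is one way such a gate could be audited and enforced from an orchestration or CI/CD step.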
This work (including HTML and PDF files) is licensed under a Creative Commons Attribution 4.0 International License.





















