Vol. 1, Issue 1, Part A (2024)

Automated data cleaning using machine learning: A scalable framework

Author(s):

Lukas Reinhardt, Anna Müller, Tobias Schneider and Elisa Weber

Abstract:

Ensuring the reliability and accuracy of large-scale data is a critical prerequisite for effective machine learning and analytics. Traditional data cleaning approaches that rely on static rules and manual intervention are increasingly inadequate for today’s heterogeneous, high-volume data ecosystems. This study presents a scalable, machine learning-driven framework for automated data cleaning that unifies error detection, imputation, duplicate resolution, and label-error auditing within a single adaptive orchestration layer. The proposed system was evaluated on four benchmark datasets spanning the finance, e-commerce, healthcare, and web log domains, and compared against leading methods including rule-based cleaning, ActiveClean, HoloClean, and BoostClean. Statistical analysis of residual error rate, detection F1-score, imputation accuracy, and runtime efficiency showed significant improvements across all datasets. The framework achieved an average 38% reduction in residual errors and a 2-5% increase in downstream classification accuracy while maintaining shorter execution times owing to distributed orchestration via Spark MLlib. The results confirm that machine learning-orchestrated cleaning substantially enhances data quality and model reliability without incurring scalability penalties. The discussion highlights the importance of adaptive ensemble detection, probabilistic linkage, and non-parametric imputation in addressing complex, multi-type data inconsistencies. The study concludes that automated, intelligent data cleaning should be treated as an integrated component of modern analytics pipelines rather than as a peripheral preprocessing step. Practical recommendations emphasize the need for explainable automation, adoption of distributed computing, continuous validation loops, and label-quality auditing in organizational data governance. Collectively, the findings provide a scalable blueprint for industries seeking to maintain high-quality data streams that support trustworthy AI and data-driven decision-making across diverse domains.
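
To make the headline metrics concrete, the following is a minimal sketch of how a detection F1-score and a relative reduction in residual error rate could be computed. The abstract does not give the paper's exact definitions, so the ones used here are assumptions: the residual error rate is taken as the fraction of cells that still differ from a ground-truth copy after cleaning, and the reduction is measured relative to a baseline cleaner. All function names and the toy values are hypothetical and are not results from the study.

```python
def detection_f1(flagged, true_errors):
    """F1-score of an error detector over sets of cell identifiers (assumed definition)."""
    flagged, true_errors = set(flagged), set(true_errors)
    true_positives = len(flagged & true_errors)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(flagged)
    recall = true_positives / len(true_errors)
    return 2 * precision * recall / (precision + recall)


def residual_error_rate(cleaned, ground_truth):
    """Fraction of cells still differing from the ground truth after cleaning (assumed definition)."""
    if len(cleaned) != len(ground_truth) or not ground_truth:
        raise ValueError("inputs must be non-empty and of equal length")
    wrong = sum(c != g for c, g in zip(cleaned, ground_truth))
    return wrong / len(ground_truth)


def relative_reduction(baseline_rate, new_rate):
    """Relative drop in error rate versus a baseline, e.g. 0.38 for a 38% reduction."""
    if baseline_rate <= 0:
        raise ValueError("baseline rate must be positive")
    return (baseline_rate - new_rate) / baseline_rate


# Hypothetical toy column of ten cells; values are illustrative only.
ground_truth = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
baseline_out = [1, 0, 3, 0, 5, 0, 7, 0, 9, 10]   # 4 cells still wrong
framework_out = [1, 2, 3, 0, 5, 6, 7, 0, 9, 10]  # 2 cells still wrong

baseline_rate = residual_error_rate(baseline_out, ground_truth)
framework_rate = residual_error_rate(framework_out, ground_truth)
reduction = relative_reduction(baseline_rate, framework_rate)

# Detector flags cells 1-4 while cells 2-6 are truly erroneous.
f1 = detection_f1({1, 2, 3, 4}, {2, 3, 4, 5, 6})
```

Under these assumed definitions, a reported "38% reduction in residual errors" would correspond to `relative_reduction` returning 0.38 when averaged over the benchmark datasets.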

Pages: 33-37

How to cite this article:
Lukas Reinhardt, Anna Müller, Tobias Schneider and Elisa Weber. Automated data cleaning using machine learning: A scalable framework. J. Mach. Learn. Data Sci. Artif. Intell. 2024;1(1):33-37.