AUTOMATION OF DATA CLEANING PROCESS BY UNDERSTANDING NUANCES IN COVID-19 DATA

Jillian Yasmin Chua

Abstract


Data has become essential in helping governments and healthcare organizations respond effectively to the spread of the COVID-19 virus. Basing decisions on data leads to better-grounded policies and response implementations. However, given the immense scale of the data and eHealth systems that were unprepared for it, data quality is often overlooked during data collection. Data cleaning is the most crucial and the most time-consuming part of data mining. This study examines the nuances in COVID-19 data and presents data cleaning and validation methods that monitor and improve data quality without consuming too much time. Inconsistent data formats were observed to stem from data submissions coming from multiple systems. Data quality issues in COVID-19 health data were categorized into validity, consistency, completeness, and uniqueness. These categories were applied in analysing the data to identify the issues and challenges that cause poor data quality and to define the data cleaning workflows. A data cleaning process framework was then designed to guide the development of data cleaning and validation scripts that resolve these issues and improve the quality of the COVID-19 data. The overall data quality of the COVID-19 data used in the study is 77%, where the data is 86% valid, 60% consistent, and 90% complete, with 7% duplicate data.
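As an illustration of the four categories named above, the following minimal Python sketch scores a small set of case records for completeness, validity, consistency, and duplication. It is not the study's implementation: the field names (case_id, age, date_onset, date_report), the age range, the date rule, and the sample records are all hypothetical assumptions chosen for the example, and the abstract does not state how the study's per-category scores or its overall score were computed.

```python
from datetime import date

# Hypothetical schema; the study's actual fields are not given in the abstract.
REQUIRED_FIELDS = ("case_id", "age", "date_onset", "date_report")


def is_missing(value):
    """Treat None and blank strings as missing."""
    return value is None or str(value).strip() == ""


def parse_iso_date(value):
    """Return a date for 'YYYY-MM-DD' strings, or None if unparseable."""
    try:
        return date.fromisoformat(str(value).strip())
    except ValueError:
        return None


def completeness(records):
    """Share of required fields that are filled in, across all records."""
    total = len(records) * len(REQUIRED_FIELDS)
    filled = sum(not is_missing(r.get(f)) for r in records for f in REQUIRED_FIELDS)
    return filled / total if total else 0.0


def validity(records):
    """Share of records whose age is a whole number in an assumed 0-120 range."""
    def valid(r):
        age = str(r.get("age", "")).strip()
        return age.isdecimal() and 0 <= int(age) <= 120
    return sum(valid(r) for r in records) / len(records) if records else 0.0


def consistency(records):
    """Share of records whose onset date parses and is not after the report date."""
    def consistent(r):
        onset = parse_iso_date(r.get("date_onset"))
        report = parse_iso_date(r.get("date_report"))
        return onset is not None and report is not None and onset <= report
    return sum(consistent(r) for r in records) / len(records) if records else 0.0


def duplicate_rate(records):
    """Share of records that repeat an already seen case_id."""
    ids = [r.get("case_id") for r in records]
    return 1 - len(set(ids)) / len(ids) if ids else 0.0


# Made-up sample records, for illustration only.
SAMPLE = [
    {"case_id": "C001", "age": "34", "date_onset": "2021-03-01", "date_report": "2021-03-03"},
    {"case_id": "C002", "age": "-5", "date_onset": "2021-03-05", "date_report": "2021-03-04"},
    {"case_id": "C003", "age": "", "date_onset": "03/07/2021", "date_report": "2021-03-08"},
    {"case_id": "C001", "age": "34", "date_onset": "2021-03-01", "date_report": "2021-03-03"},
]

report = {
    "completeness": completeness(SAMPLE),
    "validity": validity(SAMPLE),
    "consistency": consistency(SAMPLE),
    "duplicate_rate": duplicate_rate(SAMPLE),
}

for name, score in report.items():
    print(f"{name}: {score:.0%}")
```

Keeping each category as a separate function means a check can be tightened or replaced without touching the others, for example when a new submitting system introduces a different date format.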


Keywords


Data cleaning, Data validation, Data quality
