A multi-layer approach for data cleaning in the healthcare domain
Abstract
Many of today's data sources generate data that are complex, frequently error-prone, and drawn from multiple origins. These sources vary in complexity, but most share a common characteristic: they perform no quality checks on the data they collect. Consequently, every platform that uses data from such sources needs a mechanism that assures the reliability of the collected data, so that the platform's other mechanisms (e.g., risk analysis and prediction mechanisms) receive high-quality data that support the best possible knowledge extraction for decision making. The need for such a mechanism is even greater in the healthcare domain, where clean data are essential for bringing consistency to healthcare data that might be inaccurate, outdated, redundant or incomplete. Considering these challenges, this paper proposes a data cleaning mechanism that assures the quality and reliability of data regardless of their origin. The mechanism consists of three (3) sub-components, which are responsible for ingesting and storing the data and which also include a set of cleaning actions. These actions, namely “Validation”, “Cleaning”, “Verification” and “Logging”, combine multiple well-established data cleaning techniques to ensure the effectiveness and efficiency of the whole data cleaning procedure. The mechanism is evaluated on three (3) separate datasets from the healthcare domain that contain different types of data and errors in their records.
The results of the mechanism (i.e., the cleaned data) are compared with the ground truth of these datasets, showing that the data cleaning was performed successfully and efficiently and providing extensive insight into the mechanism's capabilities.
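
As an illustration only, the four cleaning actions named above could be chained as in the following minimal Python sketch. The rules, the field names (`patient_id`, `heart_rate`), the value range and all function names are hypothetical assumptions made for this example; they are not taken from the proposed mechanism, whose actual techniques may differ.

```python
from typing import Callable, Dict, List, Optional, Tuple

Record = Dict[str, object]

# Hypothetical validation rules (assumption): field name -> predicate.
RULES: Dict[str, Callable[[object], bool]] = {
    "patient_id": lambda v: isinstance(v, str) and v != "",
    "heart_rate": lambda v: isinstance(v, (int, float)) and 20 <= v <= 250,
}


def validate(record: Record) -> List[str]:
    """Validation: return the names of the fields that violate their rule."""
    return [name for name, ok in RULES.items() if not ok(record.get(name))]


def clean(record: Record, bad_fields: List[str]) -> Optional[Record]:
    """Cleaning: drop records without a usable identifier and mark
    other invalid values as missing (None)."""
    if "patient_id" in bad_fields:
        return None
    fixed = dict(record)
    for name in bad_fields:
        fixed[name] = None
    return fixed


def verify(record: Record) -> bool:
    """Verification: every value is either missing or satisfies its rule."""
    return all(
        record.get(name) is None or ok(record.get(name))
        for name, ok in RULES.items()
    )


def run_pipeline(records: List[Record]) -> Tuple[List[Record], List[str]]:
    """Apply Validation, Cleaning and Verification to each record and
    collect a Logging trail of every corrective action taken."""
    cleaned: List[Record] = []
    log: List[str] = []
    seen = set()
    for i, rec in enumerate(records):
        bad = validate(rec)
        fixed = clean(rec, bad)
        if fixed is None:
            log.append(f"record {i}: dropped (invalid patient_id)")
            continue
        for name in bad:
            log.append(f"record {i}: {name} set to missing")
        # Treat records with identical cleaned content as redundant.
        key = tuple(sorted((k, repr(v)) for k, v in fixed.items()))
        if key in seen:
            log.append(f"record {i}: duplicate removed")
            continue
        if not verify(fixed):
            log.append(f"record {i}: failed verification, dropped")
            continue
        seen.add(key)
        cleaned.append(fixed)
    return cleaned, log
```

Calling `run_pipeline` on a list of dictionaries returns the cleaned records together with the log entries, so the cleaned output can then be compared against a ground-truth version of the same dataset.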