DonNTU   Masters' portal


Abstract

Introduction

In today's world, the quality of information plays a key role in large corporations, and many projects depend directly on the quality of the data and on how the enterprise handles it. It is therefore very important that each branch supplying such data keeps its share of “bad” data to a minimum; otherwise the amount of bad data will multiply once the same data is integrated into the data warehouse.

Data cleaning (also called data cleansing or scrubbing) deals with identifying and removing errors and inconsistencies in data in order to improve data quality [5].

1. Relevance of the topic

All large corporations receive and process huge amounts of data, particularly personal data, collected from all branches of the company. Every branch has its own data structure, and after integration into a single data source such as a data warehouse (DW), the problem of data unreliability arises: the data that has to be used for analysis is disparate and is represented differently in different views. Such data is of poor quality, because it contains mistakes and corrupt values, and it becomes useless for analysis. Therefore, to obtain real analytical results from the existing data, different methods for correction, de-duplication and cleaning need to be used.
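The integration and de-duplication problem described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the branch records, the field names (full_name, city, name, town) and the helper functions are hypothetical and are not taken from any real system or from the tools discussed in this work.

```python
# Hypothetical records from two branches; field names and values are invented
# for illustration. Each branch uses its own structure for the same kind of data.
branch_a = [
    {"full_name": "Ivan Petrov ", "city": "Donetsk"},
    {"full_name": "Anna Sidorova", "city": "Kyiv"},
]
branch_b = [
    {"name": "ivan  petrov", "town": "DONETSK"},
    {"name": "Oleg Bondar", "town": "Lviv"},
]


def normalize(value):
    """Collapse whitespace and ignore case so trivially different spellings compare equal."""
    return " ".join(value.split()).lower()


def to_common_schema(record, name_field, city_field):
    """Map a branch-specific record onto one shared structure."""
    return {"name": record[name_field], "city": record[city_field]}


def integrate(sources):
    """Merge all sources, keeping the first record seen for each normalized (name, city) key."""
    seen = set()
    merged = []
    for records, name_field, city_field in sources:
        for record in records:
            row = to_common_schema(record, name_field, city_field)
            key = (normalize(row["name"]), normalize(row["city"]))
            if key not in seen:
                seen.add(key)
                merged.append(row)
    return merged


warehouse = integrate([
    (branch_a, "full_name", "city"),
    (branch_b, "name", "town"),
])
```

In this sketch the two spellings of the same person are treated as one record only because they differ in case and whitespace. Real duplicates usually differ in less regular ways, so exact matching on a normalized key should be seen as a starting point rather than a complete method.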

2. Statement of the problem of data cleaning

Many companies on the market offer data cleaning software, such as Trillium Software, Group-1 Software, Innovative Systems, Vality / Ascential Software, First Logic, Deductor and others [7]. These tools help to detect and automatically fix the most important types of errors in personal data (e.g. correcting the names and addresses of people using a national directory of names and addresses). But these tools are not perfect. They are unable to work with all types of "dirty" data, and for this reason not all companies use the tools available on the market. Another factor is cost, which makes acquiring these tools undesirable. A further factor is insufficient handling of data quality, caused by a lack of complete knowledge and understanding of the types of pollution imported into the data warehouse and of their influence (in the future they will affect the reliability of the information obtained from the data warehouse).
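Directory-based correction of the kind mentioned above can be sketched in a few lines of Python. This is not how any of the listed commercial tools work internally; it is a minimal sketch that assumes a small hypothetical reference list (REFERENCE_CITIES) in place of a national directory and uses the standard library function difflib.get_close_matches for approximate string matching. The function name correct_city and the cutoff value 0.8 are also assumptions chosen for illustration.

```python
import difflib

# Hypothetical reference directory of valid city names; it stands in for a
# national directory of names and addresses.
REFERENCE_CITIES = ["Donetsk", "Kharkiv", "Kyiv", "Lviv", "Odesa"]


def correct_city(raw, reference=REFERENCE_CITIES, cutoff=0.8):
    """Return the closest reference entry, or None when nothing is similar enough."""
    # Compare in lower case, but return the spelling stored in the directory.
    lookup = {name.lower(): name for name in reference}
    candidate = " ".join(raw.split()).lower()
    matches = difflib.get_close_matches(candidate, list(lookup), n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else None
```

For example, correct_city("Donetks") returns "Donetsk", while correct_city("Xyz") returns None, so values that cannot be matched are left for manual review instead of being silently replaced. The cutoff controls this trade-off: a lower value corrects more typos but also raises the risk of replacing a valid value with a wrong one.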

3. Classification of existing errors

There are many types of errors that do not depend on the domain. There are six types of errors:

According to the above details we can classify these types of errors as: