Abstract
Contents
- Introduction
- 1. Relevance of the topic
- 2. Statement of the problem of data cleaning
- 3. Classification of existing errors
- 4. Methods and tools for data cleansing in modern enterprise information systems
- 5. Error type: gaps in data
- 6. Error type: contradictory information
- 7. Error type: duplication
- 8. Error type: inconsistency of data formats
- Conclusion
- List of sources
Introduction
In today's world, the quality of information plays a key role in large corporations, and many projects depend directly on the quality of the data the enterprise collects and uses. It is therefore very important that every branch supplying such data keeps the share of "bad" data to a minimum; otherwise the impact of bad data multiplies when the data from different branches are integrated into a data warehouse.
Data cleaning (also data cleansing or scrubbing) is the process of identifying and removing errors and inconsistencies in data in order to improve its quality [5].
1. Relevance of the topic
Large corporations receive and process huge amounts of data, in particular personal data collected from all branches of the company. Every branch has its own data structure, and after integration into a single source such as a data warehouse (DW), the problem of unreliable data arises, because disparate data arrive in different representations and yet have to be used for analysis. Such data are of poor quality: they contain mistakes and corrupted values, which makes them useless for analysis. Therefore, to obtain real analytical value from the existing data, various methods for their correction, de-duplication and cleaning have to be applied.
2. Statement of the problem of data cleaning
There are many companies on the market that offer data cleaning software, such as Trillium Software, Group-1 Software, Innovative Systems, Vality / Ascential Software, First Logic, Deductor and others [7]. These products help to detect and automatically fix the most important types of errors in personalized data (for example, correcting people's names and addresses using national name and address directories). But these tools are not perfect: they cannot handle all types of "dirty" data, and for this reason not all companies use them. Another factor is cost, which makes acquiring such tools unattractive. Finally, data quality is often handled insufficiently because there is no complete knowledge and understanding of the types of pollution imported into the data warehouse, or of their influence on the reliability of the information that will later be obtained from it.
3. Classification of existing errors
Many types of errors do not depend on the subject domain. The following types of errors can be distinguished:
- Contradictory information: values that do not comply with laws, regulations, or reality. First of all, it must be decided what is to be regarded as contradictory. For example, under the laws of Ukraine a pension card is replaced when the holder's name changes, but not when the holder's sex changes [3].
- Abnormal values: values that clearly fall outside the overall picture. Most of these values are corrected manually, because forecasting tools lack knowledge about the nature of the underlying processes and would treat any anomaly as perfectly normal data. This badly distorts the picture of the future: some random failure or success becomes the norm [3].
- Gaps in data: an error type where fields that should be filled are left partially or completely empty. This problem is very serious for most data warehouses. Most prediction methods assume that the data arrive as a uniform, continuous stream; in practice this is rare. As a result, one of the most popular data warehouse applications, prediction, is implemented poorly or with significant limitations [3].
- Noise: data whose readings are significantly higher or lower than the plausible values. Noise is often encountered when analyzing data; it carries no valuable information and prevents a clear view of the overall picture.
- Inconsistency of data formats: data of the same type recorded in different formats.
- Data entry errors: errors and omissions that predominate in data entered manually by a human. Typographical errors are the kind of error where the data contain missing or extra characters, or are garbled.
- Duplication: repeated records of the same data. Repetition of data is the most common error when data are loaded into a data warehouse.
Based on the above, these types of errors can be grouped as:
- missing data;
- duplication;
- contradictory information;
- inconsistency of data formats.
4. Methods and tools for data cleansing in modern enterprise information systems
To date, there are many methods for cleaning data of errors and omissions. No expert will point to any one of them as the most effective, since each method takes a different approach to the problem.
This problem is solved in three different ways:
- simple methods;
- methods based on the concepts of mathematical statistics;
- ETL tools (from the English Extract, Transform, Load, literally "extract, transform, load", one of the key processes in managing a data warehouse [5]).
Simple methods (regular expressions, strict formal rules, etc.) are primitive and solve the problem only partially, so researchers also turned to mathematical statistics and intelligent methods. A minimal illustration of such a rule is sketched below.
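The following sketch shows what a "simple method" of this kind might look like: a strict formal rule expressed as a regular expression that flags values which do not match the expected phone format. The pattern and field are illustrative assumptions, not taken from any specific tool mentioned in the article.

```python
import re

# Illustrative rule: a phone number must contain 10-12 digits, optionally
# preceded by "+", once common separators have been stripped.
PHONE_PATTERN = re.compile(r"^\+?\d{10,12}$")

def is_valid_phone(raw: str) -> bool:
    """Return True if the value matches the expected phone format."""
    digits = re.sub(r"[\s()\-]", "", raw)   # drop spaces, parentheses, hyphens
    return bool(PHONE_PATTERN.match(digits))

records = ["(063) 111 11 11", "380631111111", "abc-123"]
for value in records:
    print(value, "->", "ok" if is_valid_phone(value) else "suspicious")
```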
Methods based on mathematical statistics calculate the required characteristics over all the available data, i.e. over the whole range of existing values and features. Based on these results, such methods can identify suspicious values that differ strongly from the rest, and for the remaining gaps they can calculate the values that are most likely expected. Thus, by analyzing the data through its statistical characteristics, one assesses the overall picture of the data and, against this background, detects possible errors and corrects them toward the most similar expected values [2]. A minimal sketch of this idea follows.
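The sketch below illustrates the statistical approach in the simplest possible form: robust characteristics (median and median absolute deviation) are computed over all known values, strongly deviating observations are flagged, and gaps are filled with the expected value. The modified z-score threshold of 3.5 is an illustrative assumption, not a prescription from the article.

```python
import statistics

def clean_numeric(values, threshold=3.5):
    """Flag outliers and fill gaps in a numeric column using robust statistics."""
    known = [v for v in values if v is not None]
    median = statistics.median(known)
    mad = statistics.median([abs(v - median) for v in known]) or 1.0
    cleaned = []
    for v in values:
        if v is None:
            cleaned.append(median)                        # fill a gap with the expected value
        elif 0.6745 * abs(v - median) / mad > threshold:
            cleaned.append(None)                          # mark a suspicious outlier for review
        else:
            cleaned.append(v)
    return cleaned

print(clean_numeric([10, 12, None, 11, 13, 500]))   # 500 is flagged, the gap becomes 12
```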
5. Error type: gaps in data
This type of error can be solved in two different ways:
The machine dictionary method. A machine dictionary is an ordered set of linguistic information stored in computer memory in a certain sequence. The method looks up the value to be checked in a pre-compiled machine dictionary, which should include all possible values that the given field can take; when dealing with personal information, classifiers are used. A classifier is a dictionary of object names, of the groups into which they are divided by degree of similarity, and of their identifying codes, for example a classifier of telephone codes and mobile operators, a classifier of addresses, and so on. With such a classifier one can get rid of gaps in the fields: the missing part of the information is searched for in the classifier using the data that is available. If only one suitable option is found, it is entered in place of the gap; otherwise all candidate values are given to an expert decision maker, who chooses the option closest to the source [2]. A minimal sketch of such a lookup is shown below.
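A minimal sketch of the classifier lookup, assuming a tiny hand-made table of mobile operator codes (the table, field names and operator names are purely illustrative): the missing field is filled when exactly one match is found, otherwise the candidates are returned for an expert to decide.

```python
# Illustrative classifier: mobile operator by phone prefix.
OPERATOR_BY_CODE = {"063": "lifecell", "067": "Kyivstar", "050": "Vodafone"}

def fill_operator(record):
    """Fill a missing 'operator' field from the phone prefix, or return the
    candidate list for an expert when no unambiguous match is found."""
    if record.get("operator"):
        return record, []
    code = record.get("phone", "")[:3]
    match = OPERATOR_BY_CODE.get(code)
    if match:
        record["operator"] = match
        return record, []
    return record, sorted(OPERATOR_BY_CODE.values())   # leave the decision to an expert

print(fill_operator({"phone": "0631111111", "operator": None}))
```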
The intelligent method. Sometimes data arrive with the city or postal code missing from the address field; in this case "improvement" (enrichment) can be used. Enrichment adds a number of facts to the information that already exists: for example, the country, region, district, longitude and latitude of the area can be added. The same method can be used to assign a gender to a customer based on an analysis of the name and other indicators of the profile. However, the most valuable supplement to a client profile is additional third-party data containing demographic and psychographic information [2]. A small sketch of such enrichment follows.
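The sketch below shows enrichment under simple assumptions: a missing region is derived from the postal code prefix and a missing gender from the first name, using tiny illustrative lookup tables that stand in for real reference or third-party data.

```python
# Illustrative reference tables; real enrichment would use third-party data.
REGION_BY_POSTAL_PREFIX = {"01": "Kyiv", "49": "Dnipro", "61": "Kharkiv"}
GENDER_BY_FIRST_NAME = {"Olena": "F", "Andriy": "M"}

def enrich(profile):
    """Add missing attributes derived from the data already present."""
    if "region" not in profile and "postal_code" in profile:
        profile["region"] = REGION_BY_POSTAL_PREFIX.get(profile["postal_code"][:2])
    if "gender" not in profile and "first_name" in profile:
        profile["gender"] = GENDER_BY_FIRST_NAME.get(profile["first_name"])
    return profile

print(enrich({"first_name": "Olena", "postal_code": "01001"}))
```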
6. Error type: contradictory information
A simple method. Using a classifier, identifying codes are obtained for the data being checked. If a value cannot be mapped to any code, or if the codes obtained for related fields contradict each other, then a mistake has probably been made. To fix it, the fields are checked separately for typos, and additional values that might recover the lost data are considered; then the codes are searched for in the classifier again with the new findings, and the process is repeated until this type of error is eliminated [2]. A minimal consistency check of this kind is sketched below.
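A minimal sketch of the cross-field check, assuming an illustrative classifier of city telephone codes: a record is flagged when its phone code is unknown or contradicts the stated city.

```python
# Illustrative classifier: city by area telephone code.
CITY_BY_PHONE_CODE = {"044": "Kyiv", "056": "Dnipro"}

def find_contradictions(record):
    """Return a list of detected contradictions between related fields."""
    issues = []
    code = record.get("phone", "")[:3]
    expected_city = CITY_BY_PHONE_CODE.get(code)
    if expected_city is None:
        issues.append(f"phone code {code!r} not found in the classifier")
    elif expected_city != record.get("city"):
        issues.append(f"city {record.get('city')!r} contradicts phone code {code!r}")
    return issues

print(find_contradictions({"city": "Dnipro", "phone": "0441234567"}))
```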
Validation. It sometimes happens that a person enters a wrong code that does not belong to the stated city or cannot be matched to the area of residence. In this case intelligent tools must be used that can recognize valid international addresses. Some applications combine such validation software with e-mail address files, checking the validity of international address data [1].
7. Error type: duplication
The method of "hard" rules. This method involves comparison of the phase parameters of objects with the use of "hard" rules for calculating the coefficient of coincidence for each field being tested. The resulting coefficient of similarity between objects is calculated as the sum of the coefficients for each field, and if its value exceeds a predetermined threshold, then the objects are considered duplicates. In Figure 1, the mechanism of the submission.
A self-learning duplicate search algorithm. This method applies machine learning models to the search for potential duplicates. The module consists of two stages: training the model and applying it. The first stage prepares the sample data on which the model is trained; after this stage, the model is put into production operation. This approach involves periodically retraining the constructed models, which allows them to adapt to changes in the data. A sketch of the idea is shown below.
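The following sketch shows one plausible realization of the idea, not the method used by any specific product: a classifier is trained on labelled record pairs described by per-field similarity features and then applied to new pairs; in production it would be retrained periodically. The features, training data and use of scikit-learn are illustrative assumptions.

```python
from sklearn.linear_model import LogisticRegression

# Each row: [name_similarity, phone_similarity, address_similarity]
X_train = [
    [0.95, 1.00, 0.80],   # labelled duplicate
    [0.90, 1.00, 0.70],   # labelled duplicate
    [0.30, 0.10, 0.20],   # labelled non-duplicate
    [0.50, 0.00, 0.40],   # labelled non-duplicate
]
y_train = [1, 1, 0, 0]

# Training stage: fit the model on the prepared sample.
model = LogisticRegression().fit(X_train, y_train)

# Application stage: score new candidate pairs.
new_pairs = [[0.88, 1.00, 0.75], [0.40, 0.20, 0.30]]
print(model.predict_proba(new_pairs)[:, 1])   # probability that each pair is a duplicate
```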
Comparison and analysis of the results. This module compares and evaluates the results obtained with the "hard" rules and with the machine learning models, and forms the final set of potentially similar objects. Potential duplicates are then grouped according to rules that are always individual and depend on the task; one possible grouping is to form groups of similar customers who reside in the same neighborhood or city [4].
Coordination and consolidation. Reconciliation is needed to set priorities between the fields being confirmed and to control the order in which the fields are compared when duplicates are merged into a single record. A small sketch of such a priority-based merge follows.
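A minimal sketch of consolidation by field priority, under illustrative assumptions about source names and their trust order: duplicates are merged into one "golden" record, and for each field the first non-empty value from the most trusted source wins.

```python
# Illustrative source priorities: lower number = more trusted source.
SOURCE_PRIORITY = {"crm": 0, "web_form": 1, "legacy": 2}

def consolidate(duplicates):
    """Merge duplicate records into one, field by field, by source priority."""
    ordered = sorted(duplicates, key=lambda r: SOURCE_PRIORITY[r["source"]])
    golden = {}
    for record in ordered:
        for field, value in record.items():
            if field != "source" and value and field not in golden:
                golden[field] = value
    return golden

print(consolidate([
    {"source": "web_form", "name": "Ivan Petrenko", "phone": ""},
    {"source": "crm", "name": "I. Petrenko", "phone": "0631111111"},
]))
```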
8. Error type: inconsistency of data formats
Standardization. Names, phone numbers and addresses can be entered in different formats that are all grammatically correct. For example, "Street", "street." and "Str." refer to the same concept as part of an address, and "(063) 111 11 11", "380631111111" and "38 (063) 1111111" denote the same phone number. Postal and telephone services have standards for these and other similar cases (so far such services exist only in the United States and Russia). The most important objects of standardization are customer records, whose accuracy can be significantly improved through the harmonization process. Special programs transform the fields being standardized into a fixed pattern suitable for the postal and telephone services. A minimal sketch of phone-number standardization is shown below.
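The sketch below brings the phone-number spellings from the example above to a single canonical form. The target "380XXXXXXXXX" format and the default country code are illustrative assumptions, not an official standard.

```python
import re

def standardize_phone(raw: str, country_code: str = "38") -> str:
    """Normalize differently formatted phone numbers to one canonical form."""
    digits = re.sub(r"\D", "", raw)          # keep digits only
    if len(digits) == 10:                    # local form, e.g. "0631111111"
        digits = country_code + digits
    return digits

for raw in ["(063) 111 11 11", "380631111111", "38 (063) 1111111"]:
    print(raw, "->", standardize_phone(raw))
```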
Conclusion
Despite the fact that there are many platforms, systems and tools for data transformation and cleaning, they are still not sufficient for full data cleaning: these tools cannot perfectly remove duplication, data loss and inconsistencies. Therefore, experts continue to look for optimal solutions for the various data cleaning problems.
List of sources
1. Чубукова И.А. Article: The Data Mining process. Initial stages [electronic resource]. Available at: http://www.intuit.ru/...
2. Беликова Александра. Article: The problem of processing personal data [electronic resource]. Available at: http://www.basegroup.ru/library/...
3. Арустамов Алексей. Article: Preprocessing and cleaning of data before loading into a warehouse [electronic resource]. Available at: http://sysdba.org.ua/proektirovanie-bd/etl/predobrabotka-i-ochistka-dannyih-pered-zagruzkoy-v-hranilische.html
4. Basegroup. Article: Technology for processing customer databases [electronic resource]. Available at: http://www.dupmatch.com/...
5. Article: ETL [electronic resource]. Available at: http://ru.wikipedia.org/wiki/ETL
6. Вон Ким. Article: Three main shortcomings of modern data warehouses [electronic resource]. Available at: http://citforum.ru/data...
7. Роналд Фоурино. Article: Electronic data quality: the hidden perspective of data cleaning [electronic resource]. Available at: http://www.iso.ru/р... (an electronic resource that stores articles published in well-known journals)