Chernov - Automatic Data Mining from Databases

Abstract of thesis

Topic of master's work:

"Automatic Data Mining from Databases"

Author: Chernov Ivan

Table of contents

Introduction
Relevance
Purpose of work
Scientific novelty
Practical value
What is Data mining?
Data Mining Method Review
Unsolved problems
Results obtained
Expected results
Conclusion
List of references

Introduction

Databases today can range in size into the terabytes — more than 1,000,000,000,000 bytes of data. Within these masses of data lies hidden information of strategic importance. But when there are so many trees, how do you draw meaningful conclusions about the forest?

The newest answer is data mining, which is being used both to increase revenues and to reduce costs. The potential returns are enormous. Innovative organizations worldwide are already using data mining to locate and appeal to higher-value customers, to reconfigure their product offerings to increase sales, and to minimize losses due to error or fraud.

Data mining is a process that uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make valid predictions.

The first and simplest analytical step in data mining is to describe the data: summarize its statistical attributes (such as means and standard deviations), visually review it using charts and graphs, and look for potentially meaningful links among variables (such as values that often occur together). Collecting, exploring and selecting the right data are critically important.
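
The descriptive step can be illustrated with a minimal sketch that uses only the Python standard library. The customer records, the attribute names and the helper pearson() are assumptions made for illustration; they do not come from the thesis.

```python
import statistics

# Hypothetical sample: (age, monthly_spend) per customer; the values are made up.
records = [(23, 120.0), (35, 200.0), (41, 260.0), (52, 310.0), (29, 150.0)]

ages = [r[0] for r in records]
spend = [r[1] for r in records]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Statistical attributes of each variable plus one link between variables.
summary = {
    "mean_age": statistics.fmean(ages),
    "stdev_age": statistics.stdev(ages),
    "mean_spend": statistics.fmean(spend),
    "stdev_spend": statistics.stdev(spend),
    "corr_age_spend": pearson(ages, spend),
}

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

A correlation close to 1 or -1 only suggests a link worth examining; it does not by itself give a predictive model.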

But data description alone cannot provide an action plan. You must build a predictive model based on patterns determined from known results, then test that model on results outside the original sample. A good model should never be confused with reality (you know a road map isn’t a perfect representation of the actual road), but it can be a useful guide to understanding your business. The final step is to empirically verify the model. For example, from a database of customers who have already responded to a particular offer, you’ve built a model predicting which prospects are likeliest to respond to the same offer. Can you rely on this prediction? Send a mailing to a portion of the new list and see what results you get.
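
The idea of building a model from known results and testing it on data outside the original sample can be sketched as follows. The response history, the segment names and the 50% decision threshold are hypothetical, and the "model" is deliberately trivial (a response rate per segment), so this is an illustration of holdout testing rather than a recommended method.

```python
from collections import defaultdict

# Hypothetical history: (customer segment, responded to the offer?)
history = [
    ("student", False), ("student", False), ("student", True), ("student", False),
    ("family", True), ("family", True), ("family", False), ("family", True),
    ("retired", True), ("retired", False), ("retired", False), ("retired", False),
]

# Hold out every fourth record for testing; the rest form the training sample.
test = history[3::4]
train = [rec for i, rec in enumerate(history) if i % 4 != 3]

def fit(sample):
    """Learn the response rate of each segment from known results."""
    counts = defaultdict(lambda: [0, 0])  # segment -> [responses, total]
    for segment, responded in sample:
        counts[segment][0] += int(responded)
        counts[segment][1] += 1
    return {seg: resp / total for seg, (resp, total) in counts.items()}

def predict(model, segment):
    """Predict a response when the learned rate is at least 50%."""
    return model.get(segment, 0.0) >= 0.5

model = fit(train)
# Accuracy is measured only on records the model has never seen.
hits = sum(predict(model, seg) == responded for seg, responded in test)
accuracy = hits / len(test)
print(f"holdout accuracy: {accuracy:.2f}")
```

With so few records the accuracy figure means little; the point is only that the test records are kept apart from the ones used to build the model.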

The algorithms used in Data Mining require a great deal of computation. This used to be a limiting factor for the wide practical application of Data Mining, but the growth in performance of modern processors has made the problem far less acute. It is now possible to carry out a high-quality analysis of hundreds of thousands or millions of records in an acceptable time.

Relevance

Powerful computer systems that store and manage enormous databases have become an integral part of the operation of large corporations and even of small companies. Nevertheless, the mere presence of data is not enough to improve performance indicators. One must be able to transform raw data into information that is useful for decision making. This is the main purpose of Data Mining technologies.

Purpose of work

The work studies different methods of knowledge extraction, in particular by comparing several modifications of decision trees with a genetic algorithm. To compare their efficiency, a knowledge extraction system is being developed that extracts knowledge using the methods named above. A possible final outcome of the work is the development of a new method that combines or modifies already known methods.

Scientific novelty

The scientific novelty consists in a thorough analysis of the quality of the extracted knowledge, an assessment of the speed and accuracy of the algorithms, the choice of the best algorithm for the designed system, and the creation of a new modification of the decision tree method based on CART.

Practical value

The practical value of this work lies in the analysis of existing knowledge extraction methods and in the choice of the best method, or modification of existing methods, for extracting knowledge that most completely reflects the nature of the input data. Three circumstances have to be taken into account. First, databases reach enormous sizes (up to 850 gigabytes). Second, in some systems extraction must take place in real time. Third, most modern databases are distributed. For this reason the work envisages the creation of a distributed Data Mining system that works with remote databases.

What is Data mining?

The terms «knowledge discovery in databases» and «intelligent data analysis» are synonyms of Data Mining [1]. The origin of all these terms is connected with a new stage in the development of data processing tools and methods. The need for such processing arose because improvements in data recording and storage technologies have brought huge streams of information into the most diverse fields. The activity of any enterprise is now accompanied by the registration and recording of all its details.

Features of modern data:

Data have an unlimited volume.
Data are heterogeneous (quantitative, continuous, text).
The extracted knowledge must be concrete and understandable.
Data Mining tools must be simple to use and must work with «raw» data.

The scope of Data Mining is unlimited: it applies wherever there are data. Above all, however, Data Mining methods have, to put it mildly, intrigued commercial enterprises that develop projects based on data warehouses (Data Warehousing). The experience of many such enterprises shows that the return on Data Mining can reach 1000%. For example, there are reports of an economic effect 10 to 70 times greater than the initial costs of 350 to 750 thousand dollars [19]. There is information about a 20 million dollar project that paid for itself in only 4 months. Another example is annual savings of 700 thousand dollars from introducing Data Mining in a supermarket chain in Great Britain.

Data Mining Method Review

Data Mining is a multidisciplinary field that arose and is developing on the basis of achievements in applied statistics, pattern recognition, artificial intelligence methods, database theory and other areas. Hence the abundance of methods and algorithms implemented in various working Data Mining systems [1, 2, 16]. Many of these systems integrate several approaches at once. Nevertheless, as a rule, every system has some key component, which makes it possible to distinguish the following classes of algorithms [17]:

Statistical methods

Subject-oriented analytical systems are very diverse. The broadest subclass of such systems, widespread in financial market research, is called «technical analysis». It is a collection of several dozen methods for forecasting price dynamics and choosing the optimal structure of an investment portfolio, based on various empirical models of market dynamics [17, 18]. These methods often use simple statistical tools, but they take maximal account of the specifics of the field (professional language, systems of various indices and so on). The latest versions of almost all well-known statistical packages include Data Mining elements along with traditional statistical methods. Their main attention, however, is still given to classical methods such as regression and factor analysis [9]. A disadvantage of systems of this class is that they require special training of the user. Note also that powerful modern statistical packages are too «heavy» for mass use in finance and business [18, 9].

Statistical packages have an even more serious, fundamental shortcoming that limits their use in Data Mining. Most of the methods included in these packages rely on a statistical paradigm in which averaged characteristics of the sample play the main role. When real, complex phenomena are studied, these characteristics are often fictitious quantities [1].
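
A small numerical illustration of why an averaged characteristic can be a fictitious quantity: in the sketch below the sample mixes two hypothetical customer groups (the amounts are invented), and the mean describes neither of them.

```python
import statistics

# Hypothetical purchase amounts from two very different customer groups.
small_buyers = [9, 10, 11, 10, 10]
large_buyers = [990, 1000, 1010, 1000, 1000]
sample = small_buyers + large_buyers

mean = statistics.fmean(sample)
# Distance from the mean to the closest real observation.
nearest_gap = min(abs(x - mean) for x in sample)

print(f"mean purchase: {mean}")
print(f"closest real purchase differs from the mean by: {nearest_gap}")
```

No customer in this sample spends anything close to the mean amount, so a decision based on the "average customer" would fit nobody.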

Neural network algorithms

Neural networks are a large class of systems whose architecture is analogous to the structure of nervous tissue built from neurons. One of the most widespread architectures, the multilayer perceptron with backpropagation of error, imitates the work of neurons in a hierarchical network in which every neuron of a higher level is connected by its inputs to the outputs of the neurons of the layer below. The neurons of the lowest layer receive the values of the input parameters on the basis of which some decision must be made, the development of a situation forecast, and so on. These values are treated as signals passed to the next layer, weakened or strengthened depending on the numerical values (weights) assigned to the interneuron connections. As a result, the neuron of the topmost layer produces a value that is regarded as the answer, the reaction of the whole network to the entered values of the input parameters. Before the network can be used, it must be «trained» on previously obtained data for which both the values of the input parameters and the right answers are known. Training consists of selecting the weights of the interneuron connections that bring the answers of the network as close as possible to the known right answers [12, 5].
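
The description above can be made concrete with a minimal sketch of a multilayer perceptron with one hidden layer, trained by backpropagation of error. The network size, the starting weights, the learning rate and the training sample (the logical OR function) are all assumptions chosen for illustration and are not taken from the thesis.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, w_out):
    """Propagate inputs through one hidden layer to a single output neuron."""
    # Each weight row ends with a bias term; zip() pairs only the input weights.
    hidden = [sigmoid(sum(w * v for w, v in zip(row, x)) + row[-1]) for row in w_hidden]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + w_out[-1])
    return hidden, output

def train_step(x, target, w_hidden, w_out, rate=0.5):
    """One backpropagation update for a squared-error loss."""
    hidden, output = forward(x, w_hidden, w_out)
    # Error signal at the output neuron (sigmoid derivative included).
    delta_out = (output - target) * output * (1.0 - output)
    # Error signals passed back to the hidden neurons through the output weights.
    delta_hidden = [delta_out * w_out[j] * h * (1.0 - h) for j, h in enumerate(hidden)]
    for j, h in enumerate(hidden):
        w_out[j] -= rate * delta_out * h
    w_out[-1] -= rate * delta_out
    for j, row in enumerate(w_hidden):
        for i, v in enumerate(x):
            row[i] -= rate * delta_hidden[j] * v
        row[-1] -= rate * delta_hidden[j]

def total_error(data, w_hidden, w_out):
    """Sum of squared differences between network answers and right answers."""
    return sum((forward(x, w_hidden, w_out)[1] - t) ** 2 for x, t in data)

# Training sample with known right answers: the logical OR of two inputs.
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w_hidden = [[0.5, -0.4, 0.1], [-0.3, 0.6, -0.2]]  # arbitrary starting weights
w_out = [0.3, -0.5, 0.2]

error_before = total_error(data, w_hidden, w_out)
for _ in range(2000):
    for x, t in data:
        train_step(x, t, w_hidden, w_out)
error_after = total_error(data, w_hidden, w_out)

print(f"error before training: {error_before:.4f}")
print(f"error after training:  {error_after:.4f}")
```

The learned behaviour ends up encoded in nine numbers that say little to a human reader, which already hints at the interpretation problem of much larger networks.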

The main disadvantage of the neural network paradigm is the need for a very large training sample. Another substantial drawback is that even a trained neural network is a «black box». The knowledge fixed as the weights of several hundred interneuron connections is very hard for a person to analyze and interpret [2].

Decision trees

Decision trees are a way of representing rules in a hierarchical, sequential structure in which every object corresponds to a single node that gives the decision [1, 21]. A rule here means a logical construction of the form «if ... then ... else ...».
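
A minimal sketch of how such a hierarchy of «if ... then ... else ...» rules can be stored and applied. The loan example, the attribute names and the thresholds are hypothetical; the thesis does not prescribe this representation.

```python
# A hypothetical tree for a loan decision; attributes and thresholds are illustrative.
tree = {
    "test": ("income", 50000),            # if income >= 50000 ...
    "then": {"leaf": "approve"},          # ... then approve
    "else": {                             # ... else ask the next question
        "test": ("years_employed", 3),
        "then": {"leaf": "approve"},
        "else": {"leaf": "reject"},
    },
}

def classify(node, obj):
    """Follow if-then-else tests from the root until a leaf gives the decision."""
    while "leaf" not in node:
        attribute, threshold = node["test"]
        node = node["then"] if obj[attribute] >= threshold else node["else"]
    return node["leaf"]

applicant = {"income": 20000, "years_employed": 5}
print(classify(tree, applicant))
```

Every object follows exactly one path from the root, so it ends in exactly one leaf, and the path itself reads as a rule a person can check.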

Fig. 1 – Example of decision tree construction (animation, 3 KB)

The application domain of decision trees is currently wide, but all the tasks solved with this tool can be grouped into the following three classes [21]:

Data description: decision trees make it possible to store information about data in a compact form; instead of the data we can keep a decision tree that contains an exact description of the objects.
Classification: decision trees cope well with classification tasks, i.e. assigning objects to one of a set of classes known in advance. The target variable must have discrete values.
Regression: if the target variable has continuous values, decision trees make it possible to establish the dependence of the target variable on the independent (input) variables. For example, numerical forecasting tasks (predicting the values of the target variable) belong to this class.
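
For the classification case, the way a tree chooses a test can be sketched with the Gini impurity, the splitting criterion commonly associated with the CART algorithm. This is the textbook criterion, not the modification developed in this work, and the data and function names below are assumptions for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted impurity) of the best binary split x <= threshold."""
    best = (None, gini(labels))  # no split: impurity of the whole node
    for threshold in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical data: one numeric input variable and a discrete target variable.
ages = [22, 25, 30, 35, 40, 52]
bought = ["no", "no", "no", "yes", "yes", "yes"]

threshold, impurity = best_split(ages, bought)
print(f"split at age <= {threshold}, weighted Gini impurity {impurity}")
```

A full tree builder would apply best_split recursively to each resulting subset and over every input variable; for a regression task the impurity would typically be replaced by the variance of the target variable.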

Unsolved problems:

Most algorithms require a large number of calculations, which makes Data Mining slow.
Knowledge extraction methods are compared speculatively or against a small number of analogues. There is no exact comparison of the methods and no assessment of the quality of the extracted knowledge.
There are no distributed Data Mining systems that work with remote databases.

Results obtained:

A Data Mining module based on the decision tree method has been created.
A modification of the decision tree method, based on the CART algorithm, has been developed.
Results comparing the modified and standard methods have been obtained.

Expected results:

Creation of a knowledge extraction system with integrated functions for extracting and visualizing knowledge, together with an expert system that works on the basis of the extracted knowledge.
Genetic algorithms for knowledge extraction will be added.
A comparative analysis of different knowledge extraction methods, using the performance of the system as an example.

Conclusion

Based on the current state of the software market, it was concluded that creating such a system is worthwhile, provided that the shortcomings of previous systems are taken into account:

high price;
absence of Russian or Ukrainian interfaces;
orientation toward a narrow circle of Data Mining specialists;
no support for high-performance distributed computing;
orientation toward local databases.

List of references

Kosko B. Fuzzy systems as universal approximators // IEEE Transactions on Computers, vol. 43, No. 11, November 1994. – P. 1329-1333.
Cordon O., Herrera F. A general study on genetic fuzzy systems // Genetic Algorithms in Engineering and Computer Science, 1995. – P. 33-57.
Knowledge Discovery Through Data Mining: What Is Knowledge Discovery? – Tandem Computers Inc., 1996.
Quinlan J. R. C4.5: Programs for Machine Learning. – Morgan Kaufmann Publishers, 1993.
Murthy S. Automatic construction of decision trees from data: A multi-disciplinary survey. – 1997.
Buntine W. A theory of classification rules. – 1992.
Michie D. et al. (eds.) Machine Learning, Neural and Statistical Classification. – 1994.
Holland J. H. Adaptation in Natural and Artificial Systems. – Ann Arbor: University of Michigan Press, 1975.
Shahidi A. Mathematical Apparatus of Decision Trees. – BaseGroup Labs. http://basegroup.ru/trees/math_c45_part1.en.htm http://basegroup.ru/trees/math_c45_part2.en.htm
Carvalho D. R., Freitas A. A. A Hybrid Decision Tree/Genetic Algorithm Method for Data Mining. http://www.cs.kent.ac.uk/people/staff/aaf/pub_papers.dir/Info-Sci-J-2003.pdf
