Databases today can range in size into the terabytes — more than 1,000,000,000,000 bytes of data. Within these masses of data lies hidden information of strategic importance. But when there are so many trees, how do you draw meaningful conclusions about the forest?
The newest answer is data mining, which is being used both to increase revenues and to reduce costs. The potential returns are enormous. Innovative organizations worldwide are already using data mining to locate and appeal to higher-value customers, to reconfigure their product offerings to increase sales, and to minimize losses due to error or fraud.
Data mining is a process that uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make valid predictions.
The first and simplest analytical step in data mining is to describe the data — summarize its statistical attributes (such as means and standard deviations), visually review it using charts and graphs, and look for potentially meaningful links among variables (such as values that often occur together). As emphasized in the section on THE DATA MINING PROCESS, collecting, exploring and selecting the right data are critically important.
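A minimal sketch of this descriptive step is shown below. The file name «customers.csv» and its columns are hypothetical, and the pandas and matplotlib libraries are assumed to be available; this is an illustration, not part of the system described in this work.

```python
# Descriptive step: summarize statistical attributes, look for links
# among variables, and review the data visually.
# NOTE: "customers.csv" and the column names are hypothetical examples.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("customers.csv")
cols = ["age", "income", "spend"]

# Means, standard deviations, quartiles and other summary statistics.
print(df[cols].describe())

# Pairwise correlations highlight values that tend to occur together.
print(df[cols].corr())

# A simple chart for visual review of one variable's distribution.
df["income"].hist(bins=30)
plt.show()
```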
But data description alone cannot provide an action plan. You must build a predictive model based on patterns determined from known results, then test that model on results outside the original sample. A good model should never be confused with reality (you know a road map isn’t a perfect representation of the actual road), but it can be a useful guide to understanding your business. The final step is to empirically verify the model. For example, from a database of customers who have already responded to a particular offer, you’ve built a model predicting which prospects are likeliest to respond to the same offer. Can you rely on this prediction? Send a mailing to a portion of the new list and see what results you get.
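The holdout check described above can be sketched as follows. The synthetic data stands in for a real database of customers with known responses, and scikit-learn is an assumed tool here, not part of the original work.

```python
# Build a predictive model on known results, then test it on results
# outside the original sample (a holdout set playing the role of the
# new prospect list).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in for customers whose response to the offer is already known.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# Keep 30% of the data out of training to verify the model empirically.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Score the model on results it has never seen.
print("holdout accuracy:", accuracy_score(y_test, model.predict(X_test)))
```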
Data Mining algorithms require a great deal of computation. This used to be the main factor holding back wide practical application of Data Mining, but today's growth in processor performance has taken the edge off the problem. It is now possible to carry out a high-quality analysis of hundreds of thousands or millions of records in acceptable time.
Powerful computer systems that store and manage enormous databases have become an integral attribute of the operations of large corporations and even small companies. Nevertheless, the presence of data is not in itself enough to improve performance indicators. Raw data must be transformed into information useful for decision making, and this is the principal purpose of Data Mining technologies.
This work studies different methods of knowledge extraction, in particular a comparison of several modifications of decision trees and of a genetic algorithm. To compare their efficiency, a knowledge-extraction system is being developed that extracts knowledge using the methods indicated above. A possible outcome of the work is the development of a new method built as a combination or modification of already known ones.
The scientific novelty consists in a deep analysis of the quality of the extracted knowledge, an assessment of the speed and accuracy of the algorithms, the choice of the best algorithm for the designed system, and also the creation of a new modification of the decision-tree method based on CART.
The practical value of this work lies in the analysis of existing knowledge-extraction methods and in the choice of the most suitable method, or modification of existing methods, for extracting the knowledge that most fully reflects the character of the input data. First, the databases involved are of enormous size (up to 850 gigabytes). Second, in some systems extraction must take place in real time. Third, most modern databases are distributed. This suggests the creation of a distributed Data Mining system that works with remote databases.
The terms «knowledge discovery in databases» and «intelligent data analysis» are also used as synonyms of Data Mining [1]. The origin of all these terms is connected with a new turn in the development of data-processing tools and methods. The need for such processing arose because improvements in recording and storage technologies have brought huge streams of information in the most diverse fields, and the activity of any enterprise is now accompanied by the registration and recording of all the details of that activity.
Features of modern data:
- enormous volume (databases of hundreds of gigabytes);
- the need, in some systems, to extract knowledge in real time;
- a distributed structure, with data spread across remote databases.
The field of application of Data Mining is in no way restricted: it applies wherever there is data. Above all, though, Data Mining methods have attracted commercial enterprises developing projects on the basis of data warehouses (Data Warehousing). The experience of many such enterprises shows that the return on Data Mining can reach 1000%. For example, there are reports of an economic effect 10-70 times the initial costs of 350 to 750 thousand dollars [19], and of a 20-million-dollar project that paid for itself in only 4 months. Another example is an annual saving of 700 thousand dollars achieved by introducing Data Mining in a supermarket chain in Great Britain.
Data Mining is a multidisciplinary field that emerged and is growing on the basis of achievements in applied statistics, pattern recognition, artificial-intelligence methods, database theory, and other areas. Hence the large number of methods and algorithms implemented in different operating Data Mining systems [1, 2, 16]. Many such systems integrate several approaches at once. Nevertheless, as a rule, each system has some key component on the basis of which the following classes of algorithms can be distinguished [17]:
The subject-oriented analytical systems are very diverse. The widest subclass of such systems, popular in financial-market research, goes by the name «technical analysis»: a collection of several dozen methods for forecasting price dynamics and choosing an optimal investment-portfolio structure, based on various empirical models of market dynamics [17, 18]. These methods often use simple statistical tools, but they take maximum account of the specifics of their field (professional terminology, systems of indicators, and so on).
The latest versions of almost all well-known statistical packages include Data Mining elements alongside the traditional statistical methods, yet their main attention is still given to classical techniques: regression, factor analysis, and others [9]. The drawback of systems of this class is the special training they demand of the user. It is also noted that powerful modern statistical packages are too «heavy» for mass application in finance and business [18, 9].
Statistical packages have an even more serious, fundamental shortcoming that limits their application in Data Mining. Most of the methods in these packages rest on a statistical paradigm in which averaged characteristics of the sample play the leading role, and when real, complex phenomena are studied, these characteristics often turn out to be fictitious quantities [1].
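A tiny numerical illustration (an assumed example, not taken from the cited sources) of how an averaged characteristic can be fictitious: for a two-cluster sample, the mean describes no actual observation.

```python
# Two groups of customers: 50 earning 20,000 and 50 earning 180,000.
# The mean income is 100,000 - a value no customer actually has.
import numpy as np

incomes = np.concatenate([np.full(50, 20_000.0), np.full(50, 180_000.0)])
mean = incomes.mean()
print("mean income:", mean)  # 100000.0
print("observations within 50,000 of the mean:",
      int(np.sum(np.abs(incomes - mean) < 50_000)))  # 0
```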
Neural networks are a large class of systems whose architecture is analogous to the structure of nervous tissue built from neurons. In one of the most widespread architectures, the multilayer perceptron with error backpropagation, the operation of neurons in a hierarchical network is imitated, where each neuron of a higher level is connected by its inputs to the outputs of neurons of the layer below. The neurons of the lowest layer receive the values of the input parameters on the basis of which a decision must be made, a situation forecast, and so on. These values are treated as signals transmitted to the next layer, weakened or strengthened depending on the numerical values (weights) assigned to the interneuron connections. As a result, the neuron of the topmost layer produces a value that is regarded as the answer: the reaction of the whole network to the input values. Before the network can be used, it must first be «trained» on previously obtained data for which both the input parameter values and the correct answers are known. Training consists in selecting the weights of the interneuron connections that bring the network's answers as close as possible to the known correct answers [12, 5].
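A minimal sketch of such a network, a two-layer perceptron trained by error backpropagation on the classic XOR problem, is given below. The layer sizes, learning rate, and number of epochs are illustrative assumptions, not parameters from this work.

```python
# Multilayer perceptron with error backpropagation, using only NumPy.
# All hyperparameters here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Training sample: input parameter values and the known correct answers.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Interneuron connection weights: input layer -> hidden layer -> output.
W1 = rng.normal(size=(2, 4)); b1 = np.zeros(4)
W2 = rng.normal(size=(4, 1)); b2 = np.zeros(1)

lr = 0.5
for epoch in range(5000):
    # Forward pass: signals travel upward, scaled by the weights.
    h = sigmoid(X @ W1 + b1)      # hidden-layer activations
    out = sigmoid(h @ W2 + b2)    # the network's answer

    # Backward pass: propagate the output error down through the layers.
    err_out = (out - y) * out * (1 - out)
    err_h = (err_out @ W2.T) * h * (1 - h)

    # Training = adjusting weights to bring answers closer to the truth.
    W2 -= lr * h.T @ err_out;  b2 -= lr * err_out.sum(axis=0)
    W1 -= lr * X.T @ err_h;    b1 -= lr * err_h.sum(axis=0)

print(np.round(out, 2))  # should approach [[0], [1], [1], [0]]
```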
The basic shortcoming of the neural-network paradigm is the need for a very large training sample. Another substantial drawback is that even a trained neural network is a «black box»: the knowledge, recorded as the weights of several hundred interneuron connections, does not readily lend itself to analysis and interpretation by a human [2].
Decision trees are a method of representing rules in a hierarchical, sequential structure, where a unique node yielding a decision corresponds to every object [1, 21]. By a rule is meant a logical construction of the form «if ... then ... else ...», as the sketch below illustrates.
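A minimal sketch of this representation uses scikit-learn's CART-style DecisionTreeClassifier on the bundled iris data. The library and data set are assumptions for illustration, not the system developed in this work.

```python
# Fit a small CART-style decision tree and print its rules as a
# hierarchy of if/else tests.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each internal node is an if/else test; each leaf yields a decision.
print(export_text(tree, feature_names=list(iris.feature_names)))
```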
Fig. 1 – Example of decision-tree construction (animation, 3 KB)
The application domain of decision trees is currently wide, but all the tasks solved by this tool can be grouped into the following three classes [21]:
Based on the current state of the software market, it was concluded that it is expedient to create a similar system, one that takes into account the shortcomings of the previous systems: