Abstract

of the master's thesis "Software Development for Analyzing and Optimizing Subscription Agency Activity"


Author: Kravchenko Elena


INTRODUCTION

The modern world is a world of information. It is well known that the volume of information grows exponentially, which confronts us with the task of storing it rationally and reliably while providing convenient access and guaranteeing its integrity. But storage is only part of the problem. Besides accumulating information, it is necessary to process it and to extract useful knowledge from the mass of facts and data. This task is far from trivial, and it gave rise to the field known as Data Mining, or knowledge extraction. Knowledge extraction means deriving, from a large volume of information about some process or phenomenon, new knowledge that is not obvious to the user and can subsequently be applied. The field is generally considered to have originated in the 1980s, when the technology of knowledge extraction separated from classical statistical analysis and gradually became an independent discipline.

The field comprises a variety of approaches whose theoretical foundations are only loosely connected (apart from shared subject matter and terminology), yet they do not contradict one another; rather, they are complementary. In the course of this work it is planned to carry out a comparative analysis of the various Data Mining methods and approaches and then to choose the algorithm best suited to the goals of the subject domain and the requirements imposed on the resulting solution, and to build on its basis a forecast for the subject domain under study. As the field is studied further, the problem statement may be refined or adjusted.

1 REVIEW OF THE BASIC QUESTIONS OF DATA MINING

1.1 Origins of the field
An important stage in the emergence of Data Mining as an independent field was the appearance of the Group Method of Data Handling (GMDH), whose foundations were laid by the Kyiv mathematician A. G. Ivakhnenko. GMDH, an embodiment of the inductive approach, is an original method for building models from experimental data under uncertainty. Models of optimal complexity obtained by this method capture the unknown laws governing the object (process) under study, information about which is implicitly contained in the data sample. To construct models, GMDH applies the principles of automatic generation of candidate (not final) solutions and successive selection of the best models according to external criteria. The effectiveness of the method has been confirmed by the solution of numerous real problems in ecology, hydrometeorology, economics, and engineering.
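The GMDH selection loop described above can be sketched in code. The following is a deliberately minimal, hypothetical illustration, not Ivakhnenko's full algorithm: the candidate models are simple one-variable linear fits computed in closed form on a training subsample, and the "external criterion" is the squared error on a separate validation subsample; the toy data and the 50/50 split are invented for the example.

```python
# Minimal GMDH-style sketch: generate simple candidate models and
# select the best by an external criterion (validation error).
# Real GMDH grows multilayer polynomial models of rising complexity;
# here each candidate is just a line fitted to one input variable.

def fit_line(xs, ys):
    """Closed-form least-squares fit of y = a + b*x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx else 0.0
    a = my - b * mx
    return a, b

def external_criterion(model, xs, ys):
    """Squared error of the model on a validation subsample."""
    a, b = model
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

def select_model(features, target, split=0.5):
    """Fit one candidate per input variable on the training part,
    return the candidate with the lowest validation error."""
    cut = int(len(target) * split)
    best = None
    for column in features:
        model = fit_line(column[:cut], target[:cut])
        err = external_criterion(model, column[cut:], target[cut:])
        if best is None or err < best[0]:
            best = (err, column, model)
    return best

# Toy data: the target depends on the second input, not the first.
x1 = [0.1, 0.9, 0.4, 0.7, 0.2, 0.8, 0.3, 0.6]
x2 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y  = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9, 14.2, 15.8]  # roughly 2*x2

err, chosen, (a, b) = select_model([x1, x2], y)
print(chosen is x2, round(b, 1))  # the x2 candidate wins; slope near 2
```

The key GMDH idea survives even in this toy form: models are ranked not by how well they fit the data they were trained on, but by an external criterion computed on data they have not seen.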

1.2 Types of knowledge-extraction models
In the course of its Data Mining research, IBM identified two large groups of knowledge-extraction models: verification models and knowledge-discovery models. According to IBM's classification, the first group covers tasks of stating and testing a statistical hypothesis. This group is quite sensitive to the human factor, since proposing the hypotheses falls to the expert. For the second group this factor is insignificant, because the search for knowledge is carried out automatically over the given data set; for such tasks, clearly, what matters is the quality and volume of the data (sample length, presence of gaps, and so on). The figure shows the general scheme of discovering rules and relations in a database when hypotheses formulated by the expert are combined with automatically generated ones.

1.3 Methods of knowledge extraction
Data-extraction algorithms occupy the central place in the field. BaseGroup Labs, one of the largest producers of software supporting Data Mining technology, gives the following classification of knowledge-search algorithms: the tasks solved by Data Mining methods can conventionally be divided into five classes.
1. Classification: assigning objects (observations, events) to one of several classes known in advance. It is performed by analysing objects that have already been classified and formulating a set of rules.
2. Clustering: grouping objects (observations, events) on the basis of properties describing their essence. Objects within a cluster should be similar to one another and differ from objects belonging to other clusters.
3. Regression, including the forecasting task: establishing the dependence of continuous target variables on input variables. Forecasting of time series from historical data belongs to the same type of task.
4. Association: discovering regularities between related events. An example of such a regularity is a rule stating that event Y follows from event X. Such rules are called associative. The task was first proposed for finding typical patterns in supermarket purchases, so it is sometimes also called market basket analysis.
5. Sequential patterns: establishing regularities between events related in time. Generally speaking, it matters little by which algorithm each of the five Data Mining tasks is solved; the main thing is to have a solution method for each class of problems.
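As an illustration of the association task from the list above, the following sketch evaluates a single candidate rule X → Y over a toy set of market baskets by its support and confidence. The baskets and the chosen rule are invented for the example; real association-rule miners such as Apriori enumerate many frequent itemsets rather than testing one rule.

```python
# Market-basket sketch: score one association rule X -> Y by its
# support P(X and Y) and confidence P(Y | X) over transactions.
# Toy data; a real miner (e.g. Apriori) searches all frequent itemsets.

baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]

def rule_stats(antecedent, consequent, transactions):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    n = len(transactions)
    has_x = sum(1 for t in transactions if antecedent <= t)
    has_both = sum(1 for t in transactions
                   if antecedent <= t and consequent <= t)
    support = has_both / n
    confidence = has_both / has_x if has_x else 0.0
    return support, confidence

support, confidence = rule_stats({"bread"}, {"milk"}, baskets)
print(support, confidence)  # 0.6 0.75
```

Here "bread → milk" holds in 3 of the 5 baskets (support 0.6), and milk appears in 3 of the 4 baskets containing bread (confidence 0.75); a miner would keep only rules whose support and confidence exceed user-set thresholds.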

CONCLUSIONS

The relevance of the field described above is so evident that its rapid development in the future is beyond doubt. The worldwide growth, over the past several years, of theoretical and practical work devoted to knowledge-extraction technologies is vivid confirmation of this. The enormous number of task classes to which Data Mining technology can be successfully applied, together with the continuing informatization of society, guarantees that within the next 5-10 years this technology will confidently take its place in practically every field of activity. The primary goal facing programmers working in this sphere is therefore, first of all, to study the efficiency of existing algorithms and to optimize them in the context of particular subject domains.

At present the master's work is not finished. The planned completion date is December 2007. The full results of the work will be available from the author or the supervisor after that date.

References:
  1. Вапник В.Н. Восстановление зависимостей по эмпирическим данным. М.: Наука, 1979.
  2. Грешилов А.А., Стакун В.А., Стакун А.А. Математические методы построения прогнозов. М.: Радио и связь, 1997.
  3. Дюк В.А., Самойленко А.П. Data Mining: учебный курс. СПб.: Питер, 2001. 368 с.
  4. Ивахненко А.Г. Индуктивный метод самоорганизации моделей сложных систем. К.: Наукова думка, 1982.
  5. Ивахненко А.Г., Степашко В.С. Помехоустойчивость моделирования. К.: Наукова думка, 1982.
  6. Круглов В.В., Дли М.И. Интеллектуальные информационные системы: компьютерная поддержка систем нечеткой логики и нечеткого вывода. М.: Физматлит, 2002.
  7. Machine Learning, Neural and Statistical Classification. Ed. D. Michie et al., 1994.
  8. Muller J.-A., Lemke F. Self-Organising Data Mining. BoD, Hamburg, 2000.
  9. Pyle D. Data Preparation for Data Mining. Morgan Kaufmann, San Francisco, 1999.
  10. Sharkey A.J.C. Combining Artificial Neural Nets: Ensemble and Modular Multi-Net Systems. Springer, London, 1999.
  11. Хальд А. Математическая статистика с техническими приложениями. М.: ИЛ, 1956.
