DonNTU   Masters' portal

Abstract

Introduction

The problem of pattern recognition has long attracted the attention of psychologists, physiologists, engineers and mathematicians. In recent years interest in it has grown considerably, because the need to solve it has become acute in many areas of science and technology. This is due to the development of a wide variety of devices (robots, technical and medical diagnostic systems, personal, mobile and hand-held computers) whose automatic operation is impossible without recognizing the current state of the objects, processes, events and conditions in which these devices operate. Devices that recognize various objects often make it possible to replace a human, as an element of a complex system, with a specialized machine. Such a substitution can significantly extend the capabilities of systems that perform complex information and logic tasks. Moreover, the machine that replaces the human acts uniformly and always delivers the same quality, provided it is in working order.

Most modern applied problems that are solved by building recognition systems are characterized by a large volume of raw data and by the ability to add new data while the system is already in operation.

1. Relevance of the topic

Currently, a major research direction is the research and development of algorithms and methods for building data mining software. The relevance of this topic is determined by a number of important practical problems whose solution requires comprehensive analysis of large volumes of heterogeneous data. The volume and complexity of such data often make it impossible to use effectively the traditional means of analysis based on statistical analysis, information retrieval and expert evaluation, which creates the need for data mining tools based on machine learning and artificial intelligence methods. The volume of data is so large that a person simply cannot analyze it unaided, although the need for such analysis is obvious, because these raw data contain knowledge that can be used in decision making.

That is why data filtering is a relevant scientific and technical task that requires modern approaches to its solution.

2. Goal and tasks of the research

The object of study is data processing in recognition systems.

The subject of the study is an algorithm for constructing weighted training samples.

The purpose of the master's thesis is to develop and study an algorithm for constructing training samples based on data clustering methods.

To achieve this purpose, the following tasks need to be solved:

3. Algorithm MST (Algorithm based on Minimum Spanning Trees)

Cluster analysis (self-learning, unsupervised learning, taxonomy) is used to generate automatically the list of classes for the training set. All objects of the sample are presented to the system without any indication of the class to which they belong. Problems of this kind are solved, for example, by a person in the course of scientific inquiry. This experience should be used when designing the corresponding algorithms.

Cluster analysis rests on the compactness hypothesis. It is assumed that the training sample in feature space consists of a set of clusters (like galaxies in the universe). The task of the system is to identify these clusters and describe them formally. The geometric interpretation of the compactness hypothesis is as follows.

Objects belonging to the same taxon are located close to each other compared with objects belonging to different taxa. Closeness can be understood more broadly than in the geometric interpretation. For example, the pattern that describes the relationships among the objects of one taxon may differ from the patterns in other taxa, as is the case in linguistic methods.

In general, applying cluster analysis comes down to the following steps:

– selection of the sample of objects for clustering;

– definition of the set of variables by which the objects in the sample will be evaluated and, if necessary, normalization of the values of these variables;

– computation of a measure of similarity between objects;

– application of a cluster analysis method to create groups of similar objects (clusters);

– presentation of the analysis results.

After obtaining and analyzing the results, the chosen metric and clustering method can be adjusted to obtain optimal results.
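The normalization and similarity steps listed above can be sketched as follows. This is only an illustration: the source does not name a normalization scheme or a metric, so min-max scaling and Euclidean distance are assumptions, and the function names and the sample data are hypothetical.

```python
import math

def min_max_normalize(rows):
    """Scale every feature (column) to [0, 1]. Min-max scaling is an assumption."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.0
         for x, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

def euclidean(a, b):
    """Euclidean distance, used here as the (assumed) dissimilarity measure."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_matrix(points):
    """Pairwise distances between all objects of the sample."""
    n = len(points)
    return [[euclidean(points[i], points[j]) for j in range(n)] for i in range(n)]

# Hypothetical sample: two features measured on very different scales.
sample = [[1.0, 200.0], [2.0, 220.0], [9.0, 800.0], [10.0, 780.0]]
scaled = min_max_normalize(sample)
dist = distance_matrix(scaled)
```

Without normalization the second feature would dominate every distance. After scaling, objects 0 and 1 remain close to each other and far from objects 2 and 3.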

Purpose of the MST algorithm (algorithm based on minimum spanning trees): clustering large sets of arbitrary data.

Description of the algorithm [13]

Step 1: Construction of a minimum spanning tree:

Given is a connected, undirected graph G(V, E) with weighted edges, where V is the set of vertices and E is the set of their possible pairwise connections (edges). For each edge (u, v) a real number w(u, v) is uniquely defined: its weight (the length or cost of the connection).
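For reference, the standard definition of the minimum spanning tree of such a graph can be written as below. The symbol T* is introduced here for illustration and does not appear in the source.

```latex
\[
T^{*} = \arg\min_{\substack{T \subseteq E \\ (V,\,T)\ \text{is a spanning tree}}}
\; \sum_{(u, v) \in T} w(u, v),
\qquad |T^{*}| = |V| - 1 .
\]
```

In words: among all subsets of edges that connect every vertex without forming a cycle, the minimum spanning tree is one with the smallest total weight, and it always contains exactly |V| - 1 edges.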

Borůvka's algorithm:

1. For each vertex of the graph, find the incident edge with minimum weight.

2. Add the edges found to the spanning tree, provided they are safe.

3. Find and add safe edges for the vertices not yet connected to the spanning tree.

Total running time: O(E log V).
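A minimal sketch of Borůvka's algorithm is given below. It assumes a connected graph supplied as a list of (weight, u, v) tuples with vertices numbered from 0, and it takes "safe" to mean that the edge joins two different components; the function name, the edge format and the sample graph are hypothetical and not taken from the source.

```python
def boruvka_mst(n, edges):
    """Return the MST edges of a connected graph with vertices 0..n-1."""
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst = []
    components = n
    while components > 1:
        # For every component, find its cheapest outgoing edge.
        # Comparing whole tuples breaks weight ties consistently.
        cheapest = {}
        for edge in edges:
            _, u, v = edge
            ru, rv = find(u), find(v)
            if ru == rv:
                continue
            for r in (ru, rv):
                if r not in cheapest or edge < cheapest[r]:
                    cheapest[r] = edge
        if not cheapest:
            break  # the graph is not connected
        # Add the selected edges that are still safe (join two components).
        for w, u, v in cheapest.values():
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                mst.append((w, u, v))
                components -= 1
    return mst

# Hypothetical sample graph: 5 vertices, 7 weighted edges.
graph = [(1, 0, 1), (3, 0, 2), (2, 1, 2), (4, 1, 3),
         (5, 2, 3), (7, 3, 4), (6, 2, 4)]
tree = boruvka_mst(5, graph)
total_weight = sum(w for w, _, _ in tree)
```

Each pass at least halves the number of components, which is where the logarithmic factor in the running time comes from.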

Kruskal's algorithm:

1. Traverse the edges in order of increasing weight. If an edge is safe, add it to the spanning tree.

Total running time: O(E log E).
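Kruskal's algorithm can be sketched in a few lines. As before, the edge format (weight, u, v), the function name and the sample graph are assumptions made for illustration, and an edge is treated as safe when its endpoints lie in different components.

```python
def kruskal_mst(n, edges):
    """Return the MST edges of a graph with vertices 0..n-1."""
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst = []
    for w, u, v in sorted(edges):      # edges in order of increasing weight
        ru, rv = find(u), find(v)
        if ru != rv:                   # safe: the edge does not close a cycle
            parent[ru] = rv
            mst.append((w, u, v))
    return mst

# Hypothetical sample graph: 5 vertices, 7 weighted edges.
graph = [(1, 0, 1), (3, 0, 2), (2, 1, 2), (4, 1, 3),
         (5, 2, 3), (7, 3, 4), (6, 2, 4)]
tree = kruskal_mst(5, graph)
```

Sorting the edges dominates the cost, which gives the O(E log E) bound stated above.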

Prim's algorithm:

1. Select a root vertex.

2. Starting from the root, add safe edges to the spanning tree.

Total running time: O(E log V).
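The following sketch of Prim's algorithm uses a binary heap from Python's standard `heapq` module. The edge format (weight, u, v), the function name and the sample graph are assumptions for illustration; here an edge is safe when it leads from the tree to a vertex not yet in the tree.

```python
import heapq

def prim_mst(n, edges, root=0):
    """Return the MST edges of a connected graph with vertices 0..n-1."""
    # Build an adjacency list; each entry is (weight, from_vertex, to_vertex).
    adjacency = [[] for _ in range(n)]
    for w, u, v in edges:
        adjacency[u].append((w, u, v))
        adjacency[v].append((w, v, u))

    visited = [False] * n
    visited[root] = True
    heap = list(adjacency[root])
    heapq.heapify(heap)

    mst = []
    while heap and len(mst) < n - 1:
        w, u, v = heapq.heappop(heap)   # lightest edge leaving the tree
        if visited[v]:
            continue                    # both ends already in the tree
        visited[v] = True
        mst.append((w, u, v))
        for edge in adjacency[v]:
            if not visited[edge[2]]:
                heapq.heappush(heap, edge)
    return mst

# Hypothetical sample graph: 5 vertices, 7 weighted edges.
graph = [(1, 0, 1), (3, 0, 2), (2, 1, 2), (4, 1, 3),
         (5, 2, 3), (7, 3, 4), (6, 2, 4)]
tree = prim_mst(5, graph)
total_weight = sum(w for w, _, _ in tree)
```

On this sample graph the tree has the same total weight as the one produced by the other two algorithms, as expected for a minimum spanning tree.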

Step 2: Separation into clusters. The edges with the largest weights separate the clusters. The principle of operation of the groups of methods described above is shown as a dendrogram in Fig. 3.1.

Figure 3.1 – Dendrogram of agglomerative and divisive methods (animation: 10 frames, 5 repetition cycles, 106 kilobytes)
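Both steps of the MST algorithm can be combined into the short sketch below. It makes several assumptions that the source does not state: the graph is the complete graph on the sample points, edge weights are Euclidean distances, the tree is built with Kruskal's algorithm, and the desired number of clusters k is given in advance, so that the k - 1 heaviest tree edges are removed. All names and the sample points are hypothetical.

```python
import math
from itertools import combinations

def find(parent, x):
    """Union-find lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def mst_clusters(points, k):
    """Split points into k clusters (1 <= k <= len(points)) via an MST."""
    n = len(points)
    # Complete graph: one edge per pair, weighted by Euclidean distance.
    edges = sorted((math.dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(n), 2))

    # Step 1: minimum spanning tree (Kruskal); edges arrive in ascending order.
    parent = list(range(n))
    mst = []
    for w, u, v in edges:
        ru, rv = find(parent, u), find(parent, v)
        if ru != rv:
            parent[ru] = rv
            mst.append((w, u, v))

    # Step 2: drop the k - 1 heaviest tree edges; the rest define the clusters.
    kept = mst[:len(mst) - (k - 1)]
    parent = list(range(n))
    for _, u, v in kept:
        parent[find(parent, u)] = find(parent, v)

    groups = {}
    for i in range(n):
        groups.setdefault(find(parent, i), []).append(i)
    return sorted(groups.values())

# Hypothetical sample: two tight groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (20, 0)]
clusters = mst_clusters(points, 3)
```

The result lists the indices of the points in each cluster. Building the complete graph costs O(n^2) edges, so for the large data sets mentioned above a sparser neighbourhood graph would probably be needed in practice.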

Conclusion

In the course of this work, methods for constructing weighted training samples were examined. The main methods of data clustering were reviewed. It was established that various data preprocessing methods have been developed in recent years. It was noted that, because of the large volumes of raw data, preprocessing of information is of great importance.

Based on the results of the analysis, the task of constructing weighted training samples was selected as the object of research.

References

  1. Aizerman M.A., Braverman E.M., Glushkov V.M. et al. Theory of pattern recognition and learning systems. – Izv. AN SSSR, Tekhnicheskaya kibernetika, No. 5, 1963, pp. 98-101.
  2. Vaintsvaig M.N. The "Kora" learning algorithm for pattern recognition. In: Learning algorithms for pattern recognition. – Moscow: 1972, pp. 110-116.
  3. Vasilyev V.I. Recognition systems: A handbook. – Kyiv: Naukova Dumka, 1983. – 423 p.
  4. Zagoruiko N.G. Applied methods of data and knowledge analysis. – Novosibirsk: Institute of Mathematics Publishing House, 1999. – 270 p.
  5. Volchenko E.V. A grid approach to constructing weighted training samples of w-objects in adaptive recognition systems // Bulletin of the National Technical University "Kharkiv Polytechnic Institute". Collection of scientific papers. Thematic issue: Informatics and Modelling. – Kharkiv: NTU "KhPI", 2011. – No. 36. – pp. 12-22.
  6. Volchenko E.V. A modified method of potential functions // Bionics of Intelligence. – 2006. – No. 1 (64). – pp. 86-92.
  7. Volchenko E.V., Kuzmenko I.Yu. Analysis of methods for finding outliers in training samples / Kharkiv Polytechnic Institute // Proceedings of the XI International Scientific and Technical Conference / Section "Young Scientists". – Kharkiv, KhPI, 2011, pp. 12-13.
  8. Shkarpetkina Yu.G. Abstract of the master's thesis "Research and development of a method for filling gaps in weighted training data samples" [Electronic resource]. – Access mode: http://masters.donntu.ru/2012/iii/shkarpetkina/diss/index.htm
  9. Chubukova I.A. Data Mining. Textbook. – Moscow: Internet University of Information Technologies; BINOM. Laboratoriya znanii, 2006. – 382 p.: ill., tables. – (Series "Fundamentals of Information Technology")
  10. Paklin N. "Clustering categorical data: the scalable CLOPE algorithm". [Electronic resource]. – Access mode: http://www.basegroup.ru/clusterization/clope.htm
  11. Sudipto Guha, Rajeev Rastogi, Kyuseok Shim. "CURE: An Efficient Clustering Algorithm for Large Databases". Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, pp. 73-84.
  12. Tian Zhang, Raghu Ramakrishnan, Miron Livny. "BIRCH: An Efficient Data Clustering Method for Very Large Databases". Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, pp. 103-114.
  13. Daniel Fasulo. "An Analysis of Recent Work on Clustering Algorithms". [Electronic resource]. – Access mode: http://logic.pdmi.ras.ru/ics/papers/aca.pdf
  14. Paklin N. "Clustering algorithms in the service of Data Mining". [Electronic resource]. – Access mode: http://www.basegroup.ru/clusterization/datamining.htm