
DonNTU Master's student: Koshelyeva Viktoriya Andriivna

Faculty: Computing engineering and informatics
Speciality: Software of the automated systems
Department: Applied mathematics and informatics
Theme of final work: «Analysis of methods of automatic knowledge extraction out of relational databases»
Supervisor: Associate Professor Fedyayev Oleh Ivanovych


Materials on the theme of final work: Abstract | Library | References | Report on the search | Individual work


Cluster Analysis



One of the main approaches in Data Mining is clustering, which is used to group large volumes of data. The resulting clusters are characterized by the fact that the elements within each group are more "similar" to one another than to the elements of neighboring clusters. In general, all clustering methods can be divided into hierarchical and non-hierarchical ones. Non-hierarchical methods are mostly used for the analysis of large amounts of data, because they are faster. [8]


Cluster analysis reveals previously unknown regularities in the data, which are virtually impossible to discover in other ways, and presents them in a user-friendly form. Cluster analysis methods are used both as independent tools and as a part of other Data Mining tools (for example, neural networks).


Cluster analysis is used for processing large amounts of data, from 10 thousand to millions of records, each of which may contain hundreds of attributes, and is widely used in pattern recognition, finance, insurance, demographics, trade, marketing research, medicine, chemistry, biology, etc.


To date, a large number of clustering techniques applicable to numerical data have been developed. For non-numerical (categorical) data there are far fewer generally accepted methods (ROCK, DBSCAN, BIRCH, CP, CURE, etc.). Processing data of mixed type currently causes great difficulty and remains an area of research.


Suggested stages of the cluster analysis



In general, all stages of cluster analysis are interrelated, and the decisions taken at one stage determine the actions at subsequent stages. [9]


Selection of the data. The analyst should decide whether to use all the observations or to exclude certain data or samples from the data set.


The choice of the metric and of the method of standardization of the source data.


Determination of the number of clusters (for iterative cluster analysis).


Choice of the clustering method (amalgamation or linkage rules).


In the view of many experts, the choice of clustering technique is decisive in determining the shape and the specifics of the clusters.


Analysis of the clustering results. This stage addresses the following questions: whether the obtained partition into clusters is random; whether the partition is reliable and stable on sub-samples of the data; whether there is a relationship between the clustering results and variables that were not involved in the clustering process; and whether the clustering results can be interpreted.


Testing the clustering results. Clustering results should also be tested by formal and informal methods. Formal methods depend on the method that was used for clustering. Informal methods include the following procedures for checking the quality of clustering:
  • analysis of the clustering results obtained on certain samples of the data set;
  • cross-checking;
  • clustering after changing the order of the observations in the data set;
  • clustering after deleting some observations from the data set;
  • clustering on small samples of the data set.

One way of checking clustering quality is to apply several methods and compare the results. A lack of similarity does not necessarily mean that the results are incorrect, but the presence of similar groups is considered a sign of clustering quality.
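
The comparison of results produced by several methods can be made quantitative. The following is a minimal sketch, assuming the Rand index (the share of object pairs on which two partitions agree) as the similarity measure; the source does not prescribe a particular measure, and the label lists are hypothetical.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Share of object pairs on which two partitions agree.

    A pair "agrees" if both partitions put the two objects together,
    or both put them apart. Cluster names themselves do not matter.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same objects")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two objects")
    agree = 0
    for i, j in pairs:
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a == same_b:
            agree += 1
    return agree / len(pairs)

# Hypothetical results of three clustering methods on the same six objects
method_1 = [0, 0, 0, 1, 1, 1]
method_2 = ["b", "b", "b", "a", "a", "a"]  # same groups, different names
method_3 = [0, 0, 1, 1, 2, 2]              # a different partition

print(rand_index(method_1, method_2))  # identical partitions give 1.0
print(rand_index(method_1, method_3))  # partial agreement, below 1.0
```

A value close to 1 indicates that two methods found similar groups; a low value alone does not show which of the methods, if any, is wrong.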


Like any other method, cluster analysis has certain weaknesses, that is, difficulties, problems and constraints.


In cluster analysis it is important to take into account that the clustering results depend on the criteria used to split the original data. Reducing the dimension of the data may introduce some distortion, and some individual characteristics of objects may be lost because of generalization.
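
The dependence of the results on the splitting criteria can be illustrated with the choice of metric and standardization mentioned among the stages above. The sketch below assumes z-score standardization and the Euclidean metric, which are common choices but not ones named in the source; the records (age, income) are hypothetical.

```python
import math
from statistics import mean, pstdev

def standardize(rows):
    """Z-score each column: subtract its mean, divide by its standard deviation."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    # A constant column has zero deviation; fall back to 1.0 to avoid division by zero.
    stdevs = [pstdev(c) or 1.0 for c in columns]
    return [
        [(v - m) / s for v, m, s in zip(row, means, stdevs)]
        for row in rows
    ]

def euclidean(p, q):
    """Euclidean distance between two equally long numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def nearest(rows, index):
    """Index of the row closest to rows[index]."""
    others = [i for i in range(len(rows)) if i != index]
    return min(others, key=lambda i: euclidean(rows[index], rows[i]))

# Hypothetical records: (age in years, income in currency units)
records = [
    [25, 50000],
    [60, 50500],
    [27, 53000],
    [58, 20000],
]

print(nearest(records, 0))               # raw data: income dominates the distance
print(nearest(standardize(records), 0))  # standardized: both attributes contribute
```

On the raw data the nearest neighbor of the first record is the second one, because the income scale dominates the distance; after standardization it is the third one, which is close in both age and income. The same objects can therefore end up in different clusters depending on this choice.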


There are a number of complications that should be considered before clustering.
  • The complexity of choosing the characteristics on which the clustering is based. An unwise choice leads to an inadequate partition into clusters and, as a consequence, to an incorrect solution of the task.
  • The complexity of choosing the clustering method. This choice requires a good knowledge of the techniques and of the prerequisites for their use. To test the effectiveness of a specific technique for a certain subject area, it is advisable to use the following procedure: take several groups that are known a priori to differ from one another, and mix their representatives randomly. Then clustering is performed to restore the original partition into groups. The share of objects whose discovered cluster matches their initial group is an indicator of the effectiveness of the method.
  • The problem of choosing the number of clusters. If there is no information about the possible number of clusters, it is necessary to conduct experiments with different numbers of clusters and choose the optimum number based on the results.
  • The problem of interpreting the clustering results. The shape of the clusters is in most cases determined by the chosen method. It should be kept in mind, however, that specific methods tend to create clusters of certain shapes even if the studied data set does not actually contain any clusters.
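
The procedure for testing the effectiveness of a method, described in the list above, can be sketched as follows. This is an illustration under stated assumptions, not the author's implementation: k-means with farthest-point initialization stands in for the method under test, the three groups are hypothetical and deliberately well separated, and the helper names (sq_dist, kmeans, match_fraction) are invented for the example.

```python
import random
from itertools import permutations

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=10):
    """Plain k-means; returns one cluster label (0..k-1) per point."""
    # Assumption: farthest-point initialization, chosen to keep the sketch deterministic.
    centres = [points[0]]
    while len(centres) < k:
        centres.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centres)))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centre.
        labels = [min(range(k), key=lambda c: sq_dist(p, centres[c])) for p in points]
        # Move every centre to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centres[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def match_fraction(true_labels, found_labels, k):
    """Share of objects whose found cluster matches the initial group,
    under the best renaming of the cluster labels."""
    best = 0
    for perm in permutations(range(k)):
        hits = sum(1 for t, f in zip(true_labels, found_labels) if perm[f] == t)
        best = max(best, hits)
    return best / len(true_labels)

# Step 1: several groups known a priori to differ (hypothetical 2-D data).
offsets = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5)]
group_origins = [(0, 0), (10, 0), (0, 10)]
labelled = [
    ((gx + dx, gy + dy), g)
    for g, (gx, gy) in enumerate(group_origins)
    for dx, dy in offsets
]

# Step 2: mix the representatives randomly.
random.Random(42).shuffle(labelled)
points = [p for p, _ in labelled]
true_labels = [g for _, g in labelled]

# Step 3: cluster and measure how well the initial groups are restored.
found = kmeans(points, k=3)
score = match_fraction(true_labels, found, k=3)
print(score)
```

Because the groups here are far apart, the initial partition is restored completely and the score is 1.0. On real data from a subject area the share of matches is typically lower, and comparing it across methods gives the indicator of effectiveness described above.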
