M.G. Titarenko, I.A. Kolomoytseva, R.R. Gilmanova
Proceedings of the International Scientific and Practical Conference "Software Engineering: Methods and Technologies for the Development of Information and Computing Systems" (PIIVS-2018). Donetsk: DonNTU, 2018.
M.G. Titarenko, I.A. Kolomoytseva, R.R. Gilmanova. Review and analysis of the effectiveness of binary classification algorithms for classifying information on countries' international trade activity. The paper reviews existing classification algorithms, describes the selection of features and test data, tests the classifiers, and evaluates and compares the effectiveness of binary classification algorithms on information about countries' international trade activity.
Nowadays an enormous number of news items about countries' international trade activity appears on the internet. However, these articles, notes and reviews are often presented to the user as a general list, usually sorted by the time added, which does not help to evaluate the relevance of the information or to check whether it corresponds to the category being searched for. This creates a need for automatic classification of international trade information. The study is relevant to information retrieval systems oriented toward searching and processing international trade information.
The article reviews information classification algorithms and compares them on data about countries' international trade activity.
Every classification is performed on the basis of some set of features. To classify a text, it is first necessary to compute the values of the selected features for this text. One of the most effective measures for automatic feature extraction today is TF-IDF [1]. TF-IDF is a statistical measure used to evaluate the importance of a word in a document that belongs to a collection of documents: the weight of a word is proportional to its frequency in the document and inversely proportional to its frequency across the whole collection. The number of features was set to 10; the value was chosen empirically, taking into account how the f1 measure changes at higher values of this parameter.
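As an illustration, a minimal sketch of the TF-IDF weighting described above (the function name, tokenization and example documents are hypothetical; library implementations differ in smoothing and normalization details):

```python
import math
from collections import Counter

def tfidf(docs):
    """Weight each term t of each document d by tf * idf:

    tf(t, d) = count of t in d / number of terms in d
    idf(t)   = log(N / number of documents containing t)
    """
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()                      # document frequency of each term
    for tokens in tokenized:
        df.update(set(tokens))
    weights = []
    for tokens in tokenized:
        counts, total = Counter(tokens), len(tokens)
        weights.append({t: (c / total) * math.log(n / df[t])
                        for t, c in counts.items()})
    return weights

docs = ["trade exports grew", "trade deficit narrowed", "trade with asia grew"]
w = tfidf(docs)
# "trade" occurs in every document, so its idf (and hence weight) is zero,
# while rarer terms such as "exports" receive positive weight.
```

This illustrates why TF-IDF suppresses words common to the whole collection while keeping words that distinguish individual documents.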
To test the classification algorithms, a set of 10788 classified Reuters articles was used. The training set consists of 7769 articles and the test set of 3019. The articles are classified into 90 categories. Since this study implements a binary classifier of articles on international trade, the other 89 categories were merged and labeled as "other".
In this work SVM (support vector machine), k-nearest neighbors (kNN), the Gaussian classifier, the decision tree, the random forest and the naive Bayes classifier are selected as classifiers.
SVM (support vector machine) is a family of classification algorithms that map the source vectors into a higher-dimensional space and find a separating hyperplane between the presented classes [2].
In the research this classifier was tested with different values of the kernel and penalty parameters. The precision, recall and f1 metrics were evaluated. The results are presented in table 1.
Parameters | Precision | Recall | F1 |
kernel = "linear", C = 0.025 | 0.924 | 0.9612 | 0.9423 |
gamma = 2, C = 1 | 0.9578 | 0.9626 | 0.946 |
gamma = 3, C = 1 | 0.9522 | 0.9626 | 0.9477 |
According to the obtained data and the f1 measure, the third parameter set is optimal. The data from this row will be used in the final comparison.
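A sketch of how such a test could be run; scikit-learn is an assumed toolkit (the article does not name its implementation), and the synthetic 10-feature data below merely stands in for the TF-IDF vectors of the Reuters articles:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import precision_score, recall_score, f1_score

# Synthetic stand-in for the 10 TF-IDF features used in the study.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Best row of table 1: RBF kernel with gamma = 3, C = 1.
clf = SVC(kernel="rbf", gamma=3, C=1).fit(X_tr, y_tr)
pred = clf.predict(X_te)

precision = precision_score(y_te, pred, zero_division=0)
recall = recall_score(y_te, pred, zero_division=0)
f1 = f1_score(y_te, pred, zero_division=0)
```

The same fit-predict-score loop applies to every classifier in the comparison; only the estimator and its parameters change.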
The k-nearest neighbors algorithm (kNN) is based on the rule that the tested object, with its set of features, belongs to the class that holds the majority among the object's k nearest neighbors [3].
In the research this classifier was tested with different values of the parameter k: 3, 5 and 10 neighbors. The precision, recall and f1 metrics were evaluated. The results are presented in table 2.
k | Precision | Recall | F1 |
3 | 0.946 | 0.9559 | 0.95 |
5 | 0.9494 | 0.9603 | 0.9527 |
10 | 0.9528 | 0.9566 | 0.9498 |
According to the obtained data and the f1 measure, k = 5 is optimal. The data from this row will be used in the final comparison.
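The majority-vote rule itself can be sketched in a few lines (a toy illustration with hypothetical 2-D points, not the study's implementation):

```python
import math
from collections import Counter

def knn_predict(X_train, y_train, x, k=5):
    """Classify x by majority vote among its k nearest training points."""
    nearest = sorted(range(len(X_train)),
                     key=lambda i: math.dist(X_train[i], x))[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Two toy clusters: label 0 near the origin, label 1 near (5, 5).
X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y = [0, 0, 0, 1, 1, 1]
label = knn_predict(X, y, (0.5, 0.5), k=3)  # → 0, all three neighbors vote 0
```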
The main idea of the Gaussian classifier is the assumption that the likelihood function of the training set is known for every class and is equal to the density of a Gaussian (normal) distribution [4].
In the research this classifier was tested with different values of the radial basis function (RBF) argument. The precision, recall and f1 metrics were evaluated. The results are presented in table 3.
RBF(x) | Precision | Recall | F1 |
1.0 | 0.924 | 0.9612 | 0.9423 |
0.5 | 0.924 | 0.9612 | 0.9423 |
1.5 | 0.924 | 0.9612 | 0.9423 |
According to the obtained data and the f1 measure, the RBF argument has little influence on the results.
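Assuming the classifier is scikit-learn's GaussianProcessClassifier (a guess suggested by the RBF argument; the article does not name the implementation), the insensitivity to the RBF argument has a plausible explanation: that classifier refits the kernel hyperparameters by maximum likelihood during training, so different initial length scales can converge to the same fit. A sketch on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.metrics import f1_score

X, y = make_classification(n_samples=100, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Initial RBF length scales matching the arguments of table 3; note that
# the hyperparameters are re-optimized while fitting.
scores = []
for ls in (0.5, 1.0, 1.5):
    clf = GaussianProcessClassifier(kernel=1.0 * RBF(ls), random_state=0)
    clf.fit(X_tr, y_tr)
    scores.append(f1_score(y_te, clf.predict(X_te), zero_division=0))
```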
A decision tree is a classifier that builds, from the training set, a structure whose internal nodes are distinguishing attributes, whose leaves store the values of the objective function, and whose edges carry the corresponding attribute values. The goal of a decision tree is to create a model that predicts the value of the objective function on the basis of several inputs [4].
In the research this classifier was tested with different values of the maximum tree depth argument. The precision, recall and f1 metrics were evaluated. The results are presented in table 4.
Max depth | Precision | Recall | F1 |
5 | 0.9458 | 0.9573 | 0.9501 |
10 | 0.9421 | 0.9523 | 0.9465 |
15 | 0.943 | 0.95 | 0.9462 |
According to the obtained data and the f1 measure, a maximum tree depth of 5 is optimal. The data from this row will be used in the final comparison.
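A sketch of the decision-tree test, again assuming scikit-learn and synthetic stand-in data (the article does not name its implementation):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Best row of table 4: maximum tree depth of 5. Capping the depth
# limits the number of attribute tests on any root-to-leaf path,
# which counteracts overfitting on the training set.
clf = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)
f1 = f1_score(y_te, clf.predict(X_te), zero_division=0)
```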
Random forest is a machine learning algorithm that uses a homogeneous ensemble of decision trees. The main idea is that a large ensemble of decision trees, by virtue of its size, improves the result of classification compared to a single tree [5].
In the research this classifier was tested with different values of the maximum tree depth. The precision, recall and f1 metrics were evaluated. The results are presented in table 5.
Max depth | Precision | Recall | F1 |
5 | 0.924 | 0.9502 | 0.9487 |
10 | 0.9606 | 0.9626 | 0.9612 |
15 | 0.9419 | 0.953 | 0.9527 |
According to the obtained data and the f1 measure, a maximum tree depth of 10 is optimal. The data from this row will be used in the final comparison.
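A corresponding sketch for the random forest (scikit-learn assumed, synthetic stand-in data; the ensemble size of 100 trees below is a hypothetical choice, as the article only reports the depth):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Best row of table 5: maximum tree depth of 10. Each of the 100 trees
# is trained on a bootstrap sample, and the ensemble votes on the class.
clf = RandomForestClassifier(n_estimators=100, max_depth=10,
                             random_state=0).fit(X_tr, y_tr)
f1 = f1_score(y_te, clf.predict(X_te), zero_division=0)
```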
The naive Bayes classifier is based on Bayes' theorem. It has become one of the standard universal classification methods. An advantage of this classifier is the relatively small amount of data necessary for training [6].
In the research this classifier was tested and the precision, recall and f1 metrics were evaluated. The results are presented in table 6.
Precision | Recall | F1 |
0.9551 | 0.6568 | 0.7602 |
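A toy sketch of Bayes' rule in use; the multinomial variant is an assumption (the article does not state which naive Bayes model was used), and the tiny term-count matrix is hypothetical:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical term counts: rows = documents, columns = vocabulary terms
# ("trade", "sport", "price"); label 1 = international trade, 0 = other.
X = np.array([[3, 0, 1],
              [2, 0, 0],
              [0, 2, 3],
              [0, 3, 1]])
y = np.array([1, 1, 0, 0])

# The classifier estimates per-class term probabilities from the counts
# and applies Bayes' theorem, naively assuming terms are independent.
clf = MultinomialNB().fit(X, y)
label = clf.predict([[4, 0, 0]])[0]  # a document dominated by "trade"
```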
After testing the classifiers, a comparative analysis by precision, recall and f1 measure was carried out. Because a sufficiently large collection was used and the features were selected with the TF-IDF algorithm, the results differ only slightly, and all the classifiers show good rates of text classification for countries' international trade activity. The only exception is the naive Bayes classifier, whose f1 measure of 0.7602 is not a satisfactory result for binary classification. The results of the comparison are presented in figure 1. According to the weighted estimate, the homogeneous ensemble RandomForest with a maximum tree depth of 10 performed best.
The classification algorithms SVM, k-nearest neighbors, the Gaussian classifier, the decision tree, the random forest and the naive Bayes classifier were analyzed. The selection algorithm was implemented and the classification features were selected with TF-IDF. The algorithms were tested with different parameters, and the optimal parameters for every algorithm were defined on the basis of the f1 measure. The algorithms were compared by precision, recall and f1 measure. The homogeneous ensemble (random forest) was identified as the optimal classifier for binary classification of information on countries' international trade activity, while the unsatisfactory classification results of the naive Bayes classifier were also noted.