Review and analysis of effectiveness of the binary classification algorithms to classify information of countries’ international trade activity

Authors

M.G. Titarenko, I.A. Kolomoytseva, R.R. Gilmanova

Source

Proceedings of the International Scientific and Practical Conference "Software Engineering: Methods and Technologies for the Development of Information and Computing Systems" (ПИИВС-2018). Donetsk: DonNTU, 2018.

Abstract

The paper analyzes existing classification algorithms, selects the features and the test data, tests the classifiers, and evaluates and compares the effectiveness of binary classification algorithms for classifying information on countries’ international trade activity.

Introduction

An enormous number of news items about countries’ international trade activity now appears on the internet. These articles, notes and reviews are often presented to the user as a general list, usually sorted by the time they were added, which does not let the user evaluate the relevance of the information or determine whether it corresponds to the category being searched. This creates a need for automatic classification of international trade information. The study is relevant to information retrieval systems oriented toward searching and processing international trade information.

The article reviews information classification algorithms and compares their performance on data about countries’ international trade activity.

Classification features selection

Every classification is performed on the basis of some features. To classify a text, the values of the selected features must first be determined for that text. One of the most effective measures for determining such features automatically is TF-IDF [1]. TF-IDF is a statistical measure used to evaluate the importance of a word in a document that is part of a document collection. The weight of a word is proportional to its frequency in the document and inversely proportional to its frequency in the whole collection. The number of features was set to 10; the value was chosen arbitrarily, but with regard to how the f1 measure changed at higher values of this parameter.
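The weighting described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the common variant tf = count / document length and idf = log(N / document frequency), and the helper `top_features`, which keeps the terms with the highest total weight, is a hypothetical way of picking a fixed number of features.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length,
    idf = log(number of documents / documents containing the term).
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


def top_features(documents, k=10):
    """Hypothetical selection rule: the k terms with the highest total weight."""
    totals = Counter()
    for doc_weights in tf_idf(documents):
        totals.update(doc_weights)
    return [term for term, _ in totals.most_common(k)]


# Toy collection, invented for illustration.
docs = [
    "trade deficit widens as exports fall".split(),
    "exports rise and trade surplus grows".split(),
    "wheat harvest estimate is lowered".split(),
]
weights = tf_idf(docs)
print(top_features(docs, k=3))
```

In the toy collection, "deficit" occurs in one document and "trade" in two, so "deficit" receives the higher weight in the first document.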

Data selection for classification

To test the classification algorithms, a set of 10788 categorized articles from Reuters was used. The training set consists of 7769 articles and the test set of 3019. The articles are classified into 90 categories. Because the study implements a binary classifier of articles by international trade features, the other 89 categories were marked as "other".

The classifiers selected in this work are SVM (support vector machine), KNearestNeighbors, Gauss Classifier, Decision Tree, RandomForest and Naive Bayesian Classifier.
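Collapsing the 90 categories into two classes might look like the sketch below. The category name "trade" and the sample category lists are assumptions made for illustration, as is the possibility that one article carries several categories.

```python
def to_binary_labels(article_categories, target="trade"):
    """Map each article's category list to 1 (target present) or 0 (other)."""
    return [1 if target in categories else 0 for categories in article_categories]


# Hypothetical category lists for four articles.
sample = [["trade"], ["grain", "wheat"], ["trade", "money-fx"], ["crude"]]
labels = to_binary_labels(sample)
print(labels)
```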

SVM

SVM (support vector machine) is a set of classification algorithms that map the source vectors into a higher-dimensional space and find a hyperplane that separates the given classes [2].

In the research this classifier is tested on different values of the kernel and penalty parameters. The metrics of precision, recall and f1 are evaluated. The results are presented in Table 1.
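The mapping into a higher-dimensional space is usually done implicitly through a kernel function. The sketch below assumes an RBF kernel, which the source does not name explicitly, and uses hypothetical support vectors, coefficients and bias rather than values from a trained model.

```python
import math


def rbf_kernel(x, y, gamma):
    """k(x, y) = exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def decision_value(x, support_vectors, dual_coefs, bias, gamma):
    """Signed score of x; its sign tells on which side of the hyperplane x lies."""
    return sum(
        coef * rbf_kernel(sv, x, gamma)
        for sv, coef in zip(support_vectors, dual_coefs)
    ) + bias


# Hypothetical model with one support vector per class.
support_vectors = [(0.0, 0.0), (1.0, 1.0)]
dual_coefs = [-1.0, 1.0]
score_near_positive = decision_value((1.0, 1.0), support_vectors, dual_coefs, 0.0, gamma=2)
score_near_negative = decision_value((0.0, 0.0), support_vectors, dual_coefs, 0.0, gamma=2)
```

A larger gamma makes the kernel fall off faster with distance, so each support vector influences a smaller neighborhood.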

Table 1 – Metrics for SVC classifier

Parameters	Precision	Recall	F1
kernel = "linear", C = 0.025	0.924	0.9612	0.9423
gamma = 2, C = 1	0.9578	0.9626	0.946
gamma = 3, C = 1	0.9522	0.9626	0.9477

According to the obtained data and the f1 measure, the third parameter set (gamma = 3, C = 1) is optimal. The data from this row are used in the subsequent comparison.
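The three metrics used in this and the following tables can be computed as sketched below. The sketch covers the positive class only; the source does not state how the reported values were averaged over the two classes, so it is not meant to reproduce the table figures.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Invented labels: 1 = international trade, 0 = other.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```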

KNearestNeighbors

The KNearestNeighbors algorithm (kNN) is based on the rule that a tested object with its set of features belongs to the class to which the majority of the object’s k nearest neighbors belong [3].

In the research this classifier is tested on different values of the parameter k: 3, 5 and 10 neighbors. The metrics of precision, recall and f1 are evaluated. The results are presented in Table 2.
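The majority rule can be sketched as follows. Euclidean distance is an assumption, since the source does not name the distance measure, and the sample points are invented.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=5):
    """Label the query with the majority class among its k nearest training points."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Invented two-dimensional feature vectors: 0 = other, 1 = international trade.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = [0, 0, 0, 1, 1, 1]
print(knn_predict(points, labels, (0.5, 0.5), k=3))
print(knn_predict(points, labels, (5.2, 5.1), k=3))
```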

Table 2 – Metrics for kNN classifier

k	Precision	Recall	F1
3	0.946	0.9559	0.95
5	0.9494	0.9603	0.9527
10	0.9528	0.9566	0.9498

According to the obtained data and the f1 measure, k = 5 is optimal. The data from this row are used in the subsequent comparison.

Gauss classifier

The main idea of the Gauss classifier is the assumption that the likelihood function of every class is known from the training set and is equal to the density of the Gaussian normal distribution [4].

In the research this classifier is tested on different values of the radial basis function (RBF) argument. The metrics of precision, recall and f1 are evaluated. The results are presented in Table 3.
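The idea can be illustrated with a one-dimensional sketch: estimate a mean and a variance per class from the training values, then assign a new value to the class with the highest prior-weighted Gaussian density. The single feature, the equal priors and the sample values are assumptions, and the sketch follows the description above only, not any particular library implementation.

```python
import math
from statistics import mean, pvariance


def fit_gaussian(values):
    """Estimate the mean and variance of one class from its training values."""
    return mean(values), pvariance(values)


def gaussian_density(x, mu, var):
    """Density of the normal distribution N(mu, var) at x."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def classify(x, class_params, priors):
    """Return the class that maximizes prior * likelihood."""
    return max(
        class_params,
        key=lambda c: priors[c] * gaussian_density(x, *class_params[c]),
    )


# Invented values of a single feature for each class.
samples = {"trade": [0.8, 0.9, 1.0, 1.1], "other": [0.0, 0.1, 0.2, 0.1]}
class_params = {name: fit_gaussian(values) for name, values in samples.items()}
priors = {"trade": 0.5, "other": 0.5}
print(classify(0.95, class_params, priors))
```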

Table 3 – Metrics for Gauss classifier

RBF(x)	Precision	Recall	F1
1.0	0.924	0.9612	0.9423
0.5	0.924	0.9612	0.9423
1.5	0.924	0.9612	0.9423

According to the obtained data and the f1 measure, the RBF argument has little influence on the results; the reported metrics are identical for all three values.

Decision tree

A decision tree is a classifier that builds, from the training set, a structure consisting of nodes that hold the attributes on which objects are distinguished, leaves that hold the values of the objective function, and edges labeled with the attribute values that lead from one node to the next. The goal of a decision tree is to create a model that predicts the value of the objective function from several inputs [4].

In the research this classifier is tested on different values of the maximum tree depth argument. The metrics of precision, recall and f1 are evaluated. The results are presented in Table 4.
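The structure described above can be sketched with nested dictionaries: inner nodes test one attribute, edges correspond to the outcome of the test, and leaves store the predicted class. The tree below is built by hand for illustration; the thresholds and feature indices are hypothetical, and the training procedure itself is not shown.

```python
def predict(node, features):
    """Walk from the root to a leaf, following the edge chosen by each attribute test."""
    while isinstance(node, dict):
        branch = "left" if features[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node  # a leaf stores the predicted class


def depth(node):
    """Number of attribute tests on the longest path from the root to a leaf."""
    if not isinstance(node, dict):
        return 0
    return 1 + max(depth(node["left"]), depth(node["right"]))


# Hypothetical tree over two numeric features.
tree = {
    "feature": 0, "threshold": 0.2,
    "left": "other",
    "right": {"feature": 1, "threshold": 0.5, "left": "trade", "right": "other"},
}
print(predict(tree, [0.4, 0.3]), depth(tree))
```

Limiting the maximum depth caps the number of tests on any path, which is the parameter varied in the experiment.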

Table 4 – Metrics for Decision tree

Max depth	Precision	Recall	F1
5	0.9458	0.9573	0.9501
10	0.9421	0.9523	0.9465
15	0.943	0.95	0.9462

According to the obtained data and the f1 measure, a maximum tree depth of 5 is optimal. The data from this row are used in the subsequent comparison.

RandomForest

RandomForest is a machine learning algorithm that uses a homogeneous ensemble of decision trees. The main idea is that a large ensemble of decision trees improves the classification result through the sheer number of trees [5].

In the research this classifier is tested on different values of the maximum tree depth. The metrics of precision, recall and f1 are evaluated. The results are presented in Table 5.
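Two ingredients of such an ensemble can be sketched briefly: drawing a bootstrap sample for each tree and combining the trees by majority vote. Tree training is omitted, and the three "trees" below are hypothetical stand-ins reduced to single threshold tests.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement; each tree would be trained on one such sample."""
    return [rng.choice(rows) for _ in rows]


def forest_predict(trees, features):
    """Majority vote over the predictions of all trees in the ensemble."""
    votes = Counter(tree(features) for tree in trees)
    return votes.most_common(1)[0][0]


# Hypothetical already-trained trees, each reduced to one threshold test.
trees = [
    lambda f: "trade" if f[0] > 0.2 else "other",
    lambda f: "trade" if f[1] > 0.5 else "other",
    lambda f: "trade" if f[0] + f[1] > 0.6 else "other",
]
sample = bootstrap_sample(list(range(10)), random.Random(0))
print(forest_predict(trees, [0.3, 0.6]))
```

Because each tree sees a different sample, the errors of individual trees tend to differ, and the vote smooths them out.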

Table 5 – Metrics for RandomForest

Max depth	Precision	Recall	F1
5	0.924	0.9502	0.9487
10	0.9606	0.9626	0.9612
15	0.9419	0.953	0.9527

According to the obtained data and the f1 measure, a maximum tree depth of 10 is optimal. The data from this row are used in the subsequent comparison.

Naive Bayesian Classifier

The Naive Bayesian Classifier is based on Bayes’ theorem. It has become one of the standard universal classification methods. An advantage of this classifier is the relatively small amount of data needed for training [6].

In the research this classifier is tested. The metrics of precision, recall and f1 are evaluated. The results are presented in Table 6.

Table 6 – Metrics for Naive Bayesian Classifier
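A minimal sketch of the idea: by Bayes’ theorem, the probability of a class given a text is proportional to the class prior times the probability of the words given the class, and the "naive" assumption treats the words as independent. The word-count variant with add-one smoothing shown here is chosen for illustration; the source does not say which variant was used, and the training documents are invented.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(documents, labels):
    """Collect class counts and per-class word counts from tokenized documents."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(documents, labels):
        word_counts[label].update(tokens)
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return class_counts, word_counts, vocabulary


def predict_naive_bayes(model, tokens):
    """Return the class maximizing log P(class) + sum of log P(word | class)."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for word in tokens:
            if word not in vocabulary:
                continue  # words never seen in training are ignored
            # Add-one smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        scores[label] = score
    return max(scores, key=scores.get)


train_docs = [["trade", "deficit"], ["trade", "exports"], ["wheat", "harvest"], ["crude", "oil"]]
train_labels = ["trade", "trade", "other", "other"]
model = train_naive_bayes(train_docs, train_labels)
print(predict_naive_bayes(model, ["trade", "exports", "rise"]))
```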

Precision	Recall	F1
0.9551	0.6568	0.7602

Classifiers comparison

After testing, a comparative analysis of the classifiers by precision, recall and f1 measure is carried out. Because a sufficiently large collection was used and the features were selected with TF-IDF, the results differ only slightly, and nearly all classifiers show good rates of text classification for countries’ international trade activity. The only exception is the Naive Bayesian Classifier, whose f1 measure of 0.7602 is not a satisfactory result for binary classification. The results of the comparison are presented in Picture 1. According to the weighted estimate, the homogeneous ensemble RandomForest with a maximum tree depth of 10 performed best.

Picture 1 – Classifiers comparison
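The comparison can be reproduced in part from the tables. The sketch below ranks the classifiers by the best f1 value reported for each in Tables 1 to 6; it uses f1 alone, because the weights of the weighted estimate are not given in the source.

```python
# Best f1 per classifier, copied from Tables 1 to 6.
best_f1 = {
    "SVM (gamma = 3, C = 1)": 0.9477,
    "kNN (k = 5)": 0.9527,
    "Gauss classifier": 0.9423,
    "Decision tree (max depth 5)": 0.9501,
    "RandomForest (max depth 10)": 0.9612,
    "Naive Bayesian Classifier": 0.7602,
}

# Sort from the highest to the lowest f1.
ranking = sorted(best_f1.items(), key=lambda item: item[1], reverse=True)
for name, f1 in ranking:
    print(f"{name}: {f1:.4f}")
```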

Conclusion

The classification algorithms SVM, KNearestNeighbors, Gauss Classifier, Decision Tree, RandomForest and Naive Bayesian Classifier are analyzed. The feature selection algorithm is implemented, and the classification features are selected with TF-IDF. The algorithms are tested with different parameters, and the optimal parameters for every algorithm are determined on the basis of the f1 measure. The algorithms are compared by precision, recall and f1 measure. The homogeneous ensemble (RandomForest) is identified as the optimal classifier for binary classification of information on countries’ international trade activity. The Naive Bayesian Classifier gave unsatisfactory classification results.

References

1. Salton, G., Buckley, C. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 1988.
2. Cristianini, N., Shawe-Taylor, J. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000.
3. Lantz, B. Machine Learning with R. Packt Publishing, Birmingham-Mumbai, 2013.
4. Breiman, L., Friedman, J. H., Olshen, R. A., Stone, C. J. Classification and Regression Trees. Monterey, CA: Wadsworth & Brooks/Cole Advanced Books & Software, 1984.
5. Hastie, T., Tibshirani, R., Friedman, J. Chapter 15. Random Forests // The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. Springer-Verlag, 2009. 746 p.
6. Hand, D. J., Yu, K. Idiot’s Bayes - not so stupid after all? International Statistical Review, 2001. pp. 385-399.
7. Bolshakova, E. I., Klyshinsky, E. S., Lande, D. V., Noskov, A. A., Peskova, O. V., Yagunova, E. V. Automatic Processing of Natural Language Texts and Computational Linguistics: textbook (in Russian). Moscow: MIEM, 2011. 272 p.