Abstract
- Introduction
- 1. Relevance of the topic
- 2. Purpose and objectives of the study, planned results
- 3. Overview of existing tools
- 4. Formalized problem statement
- 5. An overview of the document text preprocessing model
- 6. An Overview of Knowledge Representation Models
- 7. An overview of text classification models
- 7.1 Bayes method
- 7.2 Support Vector Machine (SVM)
- 7.3 k-nearest neighbors
- Conclusions
- List of sources
Introduction
Complaint – the name of the document in which a consumer's claim against a supplier of goods or services is stated. The complaint is made in writing and serves as the basis for taking measures to eliminate identified shortcomings, defects, rejects and other violations.
In the modern world, companies still pay undeservedly little attention to customer service issues, in particular to the resolution of complaints, forgetting that their reputation is at stake.
In order to learn how to manage complaints and use them to grow a business, one needs to go beyond the current understanding of a customer complaint as simply an expression of dissatisfaction. A rational resolution of a complaint that satisfies both parties can only be reached in a friendly environment. It is necessary to see in a complaint a manifestation of the client's highest trust and a way to improve the quality of the goods and services provided.
A complaint allows the buyer of a product or the recipient of a service to claim that it was provided under improper conditions. The claim can concern the quality, quantity, assortment or weight of any inventory items, a unilateral change in their cost, delivery time and other parameters.
A complaint can be made on behalf of an individual or an organization. In the second case, the letter can be written by any employee of the company who is authorized to create such claims and has a sufficient level of knowledge, qualification and familiarity with the law.
Today this document does not have a unified template that is mandatory for use, therefore it can be drawn up in any form.
An important task when working with claims is to classify them by the type of claim and to determine which department or specific employee should receive each claim in order to analyze it and prevent the described errors in the future.
To solve this problem, it is proposed to create a decision support system (DSS) for the production documentation management process: a computer-based automated system, an intelligent tool used by decision makers in difficult conditions for a complete and objective analysis of the subject activity. The DSS is designed to support multicriteria decisions in a complex information environment, where "multicriteria" means that the results of the decisions made are evaluated not by one indicator but by a set of many indicators (criteria) considered simultaneously.
1. Relevance of the topic
Due to the increased volume of electronic document management, it has become difficult for sales department employees to process the large amount of information.
Today the complaint does not have a unified template that is mandatory for use, therefore it can be drawn up in any form and is a document with an unstructured form. There is a need to extract useful information and, subsequently, to classify complaints according to various criteria (for example, by type of complaint) and to identify the department responsible for the defect. This gives rise to the task of developing a modern intelligent system to support managerial decision making in the sales department.
The main activity of the enterprise in question is the production and marketing of cosmetic products. In the chain from the company to the consumer, problems with the product may arise: an incorrectly pasted label, defective packaging, damage to the goods during transportation, etc. In such cases, the client can contact the manufacturer in order to resolve the situation by preparing and submitting a claim.
2. Purpose and objectives of the study, planned results
The purpose of creating an intelligent system for processing and classifying complaint texts at the enterprise is to increase the efficiency of complaint handling by reducing the time employees spend on information analysis.
To do this, you need to complete the following tasks:
- analyze the complaint handling process in the enterprise;
- explore existing methods and models for the task of classifying documents;
- develop a module for importing documents from various sources;
- develop an algorithm for indexing (preprocessing) documents;
- develop an algorithm for classifying indexed documents;
- provide the user with recommendations for making decisions on how to fix problems in the future;
- test the developed system and analyze the results.
The object of research is the process of processing complaints in the sales department.
The subject of the work is the classification of complaint texts according to the type of problem using pre-processing of the document text, knowledge representation model and text classification methods.
Expected scientific novelty:
- development of an ontological model of the subject area for working with complaints;
- development of an algorithm for classifying complaint texts.
3. Overview of existing tools
Let us consider several well-known tools that are similar in purpose to the system being developed:
- RCO Text Categorization – a solution that, based on lexical profiles, determines whether a text belongs to a given set of categories; for each term from the lexical profiles that is found in the text, it records the number of its occurrences as well as its positions in the text. [1]
- OpenText Auto-Classification – an application that provides an orderly and secure classification of content. The application uses the OpenText Content Analytics engine, which processes each document, email or post on a social network, classifying the received data in accordance with corporate policy and legal requirements. [2]
- ABBYY FlexiCapture – a universal platform for intelligent information processing. The system classifies incoming documents of any type both by appearance and by textual content. Image-based classification relies on machine learning; with it, documents can be sorted by appearance or by the relative position of elements. Text classification is based on statistical and semantic analysis. [3]
The reviewed tools have the following advantages:
- Ability to work not only with electronic documents, but also with scans of documents.
- Processing different types of documents.
- Scalability and high performance.
Also, the tools have their drawbacks:
- Opacity – it is not specified which knowledge representation models and classification methods they use.
- Security – it is not known how much you can trust these tools, how securely documents will be stored and processed.
- Price – all of the above tools do not have a free version, so you will need to pay for their use.
- Deployment – the selected system has to be fitted into the existing document processing workflow.
Next, consider the models and methods used in existing software solutions.
4. Formalized problem statement
Let D be the set of documents, C the set of categories, and F an unknown target function that, for a given pair (di, cj), indicates whether document di belongs to category cj or not.
The task of classification is to build a classifier F' that approximates the function F as closely as possible.
The problem is stated as exact (single-label) classification, i.e. each document belongs to exactly one category.
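The statement above can be written compactly as follows (the notation is a standard formalization and is not quoted verbatim from the cited sources):

```latex
F : D \times C \to \{0, 1\}, \qquad
F(d_i, c_j) =
\begin{cases}
1, & \text{if document } d_i \text{ belongs to category } c_j, \\
0, & \text{otherwise.}
\end{cases}
```

The classifier to be built is a function F' : D × C → {0, 1} that approximates F as closely as possible; exact classification additionally requires that for every document di exactly one category cj satisfies F(di, cj) = 1.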
5. An overview of the document text preprocessing model
The process of obtaining an indexed representation of a document body is called document indexing. Indexing is performed in two steps (see Figure 1) [4]:
- Extract terms – at this stage, the search and selection of the most significant terms in the entire set of documents is performed. The result of this stage is the set of terms T used to obtain the weight characteristics of documents.
- Weighing – the significance of the term for this document is determined. The weight of terms is given by a special weight function.

Figure 1 – Term extraction stage
Let us take a closer look at the term extraction stage (a small indexing sketch in Python is given after the list below):
- Graphematic analysis – all characters that are not letters are filtered out (for example, html tags and punctuation marks).
- Lemmatization – when building a text classifier, it makes no sense to distinguish between forms (conjugation, declension) of a word, since this leads to an excessive growth of the dictionary, increases resource consumption, and reduces the speed of algorithms. Lemmatization is the reduction of each word to its normal form.
- Reducing the dimension of the feature space – words that are not useful for the classifier are removed.
- Extracting key terms – usually single words found in the document are used as terms. This can lead to distortion or loss of meaning, which may, for example, reside in phraseological units that are indivisible lexical units from the point of view of linguistics. Therefore, instead of individual words, phrases (key terms) specific to the given subject area are extracted.
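A minimal sketch of such an indexing pipeline is given below. It assumes Python with scikit-learn; the stop-word list, the regular expressions and the example documents are purely illustrative, and lemmatization (which a real system could perform with a morphological analyzer) is only indicated by a comment.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer  # assumption: scikit-learn is available

STOP_WORDS = {"the", "a", "of", "and", "or", "was"}  # illustrative stop-word list

def extract_terms(text: str) -> str:
    """Term extraction for one document: graphematic analysis,
    (simplified) normalization and dimensionality reduction."""
    text = re.sub(r"<[^>]+>", " ", text)              # drop HTML tags
    text = re.sub(r"[^A-Za-zА-Яа-яЁё\s]", " ", text)  # keep letters only
    words = text.lower().split()
    # a real system would lemmatize each word here (e.g. with a morphological analyzer)
    words = [w for w in words if w not in STOP_WORDS and len(w) > 2]
    return " ".join(words)

documents = [
    "The label was pasted <b>incorrectly</b> on 120 jars.",
    "The packaging of the goods was damaged during transportation!",
]

# weighing step: TF-IDF assigns a weight to every extracted term (and two-word phrase)
vectorizer = TfidfVectorizer(preprocessor=extract_terms, ngram_range=(1, 2))
doc_term_matrix = vectorizer.fit_transform(documents)
print(vectorizer.get_feature_names_out())
print(doc_term_matrix.shape)
```

Here TF-IDF plays the role of the weight function from the weighing step above.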
6. An Overview of Knowledge Representation Models
A knowledge representation model (KRM) is a way of encoding knowledge (information extracted from documents) for storage, convenient access and interaction, suited to the task of an intelligent system. [5]
Four main KRMs are common:
1. Production model – based on the production (rule) as its constructive unit (a minimal sketch in Python is given after the list of pros and cons below):
IF Condition THEN Action
Pros of production models:
- deleting, changing or adding any production can be performed independently of all other productions (it does not lead to changes in other productions). Knowledge can be entered in arbitrary order, as in a dictionary or encyclopedia. Practice shows that this is a natural way for an expert to extend his knowledge;
- if any rule is added or modified, everything that was done earlier remains in force and is not affected by the new rule;
- the vast majority of human knowledge can be written as productions. Human knowledge is modular, and production systems therefore represent it more naturally and are easier to read;
- production systems can implement any algorithm if necessary and are capable of reflecting any procedural knowledge available to the computer.
Cons of the production system:
- with a large number of productions, it becomes difficult to check the consistency of the production system;
- due to the inherent nondeterminism of the system (the ambiguous choice of productions to fire from the set of activated productions), there are fundamental difficulties in verifying the correctness of the system.
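A minimal sketch of the production model applied to complaint routing; the rules, keywords and department names are hypothetical examples, not taken from the enterprise described above.

```python
# Each production has the form "IF condition THEN action".
RULES = [
    (lambda text: "label" in text,     "labeling department"),
    (lambda text: "packaging" in text, "packaging department"),
    (lambda text: "transport" in text, "logistics department"),
]

def infer(complaint_text: str) -> list[str]:
    """Fire every production whose condition is true for the complaint text."""
    text = complaint_text.lower()
    return [action for condition, action in RULES if condition(text)]

print(infer("The packaging was torn during transport"))
# -> ['packaging department', 'logistics department']
```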
2. Semantic network – based on a directed graph. The vertices of the graph are concepts, and the arcs are relationships between concepts (a minimal sketch is given after the list of pros and cons below).
Pros of semantic networks:
- universality, the semantic network allows you to represent any existing system in the form of a diagram;
- visibility of the knowledge system represented graphically;
- closeness of the network structure representing the knowledge system to the semantic structure of natural language phrases.
Cons of semantic networks:
- formation and modification of the semantic model is difficult;
- searching for a solution in the semantic network is reduced to the task of finding a fragment of the network corresponding to the subnet that reflects the query;
- the more relationships between concepts, the more difficult it is to use and modify knowledge.
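A minimal sketch of a semantic network stored as a set of labeled arcs; the concepts and relations below are illustrative.

```python
# A semantic network as a set of labeled arcs (subject, relation, object).
network = {
    ("complaint", "concerns", "product"),
    ("product", "has_defect", "damaged packaging"),
    ("damaged packaging", "handled_by", "packaging department"),
}

def related(concept: str) -> list[tuple[str, str]]:
    """Return all (relation, object) pairs for arcs that start at the given concept."""
    return [(rel, obj) for subj, rel, obj in network if subj == concept]

print(related("product"))            # -> [('has_defect', 'damaged packaging')]
print(related("damaged packaging"))  # -> [('handled_by', 'packaging department')]
```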
3. Frame model – based on the frame, a template that describes an object of the subject area using slots. A slot is an attribute of the object; it has a name, a value, a stored data type and a daemon. A daemon is a procedure that is performed automatically under certain conditions (a minimal sketch is given after the list of pros and cons below).
Pros of the knowledge frame model include:
- flexibility, that is, a structural description of complex objects;
- visibility, i.e. data on generic relationships are stored explicitly;
- property inheritance mechanism. Frames have the ability to inherit the values of the characteristics of their parents, which are at a higher level of the hierarchy, which ensures the wide distribution of languages of this type in intellectual systems.
Cons of the frame system are:
- high complexity of systems in general;
- lack of strict formalization;
- it is hard to change the hierarchy;
- exception handling is difficult.
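A minimal sketch of a frame with slots and a daemon; the slot names and the daemon behaviour are illustrative.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

@dataclass
class Slot:
    """A slot is an attribute of a frame: name, value, stored data type, daemon."""
    name: str
    value: Any = None
    value_type: type = str
    daemon: Optional[Callable[["Slot"], None]] = None  # runs automatically on change

    def set(self, value: Any) -> None:
        self.value = self.value_type(value)
        if self.daemon is not None:
            self.daemon(self)  # the daemon fires automatically

# the frame "Complaint" described through its slots
complaint_frame = {
    "type":       Slot("type"),
    "department": Slot("department", daemon=lambda s: print(f"routing to {s.value}")),
}

complaint_frame["department"].set("packaging department")
# -> routing to packaging department
```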
4. Formal logical model – based on first-order predicate logic. It is assumed that the subject area contains a finite, non-empty set of objects. On this set, relations between objects are established with the help of interpreted functions and predicates. In turn, on the basis of these relations, all the laws and rules of the subject area are built.
Pros of the logical model:
- regardless of the number of formulas and procedures, logical inference provides a single, uniform output mechanism;
- due to the fact that the logical model uses mathematical formulas that have been extensively studied to date, the methods of the model can be accurately justified;
- due to the strict representation of formulas as procedures, the method can be implemented unambiguously using logic programming languages (for example, Prolog, Planner, Visual Prolog, Oz and others);
- due to the peculiarities of the process of deriving new knowledge, only a set of axioms needs to be stored in the knowledge base, which in turn greatly lightens the knowledge base of the future intelligent system.
Cons of the logical model:
- due to the fact that the facts (formulas) look very similar, the model is difficult to use for specific subject areas;
- due to the lack of certainty in some areas of science, it is difficult to add the necessary number of axioms to the logical model for the correct operation of the future system;
- inference drawn from correct axioms may not make sense to the human mind: a program can make the connections correctly but arrive at a completely meaningless conclusion;
- each axiom must have a strict derivation, often yielding either "yes" or "no". This is very difficult to achieve in the humanities, and therefore the complexity of development grows exponentially.
Recently, a new way of representing knowledge in intelligent systems has been gaining popularity: the ontology. An ontology is understood as a system of concepts (entities), relations between them and operations on them in the considered subject area; in other words, an ontology is a specification of the content of the domain. [6]
The use of ontologies avoids wasting computation time on the analysis of concepts that are not part of the subject area.
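A minimal sketch of how a domain ontology can restrict the set of terms considered by the classifier; the ontology fragment below is a hypothetical example for the complaint domain.

```python
# Hypothetical fragment of a complaint-domain ontology: concept -> complaint category
ONTOLOGY = {
    "label":     "labeling defect",
    "packaging": "packaging defect",
    "damage":    "transport damage",
    "delivery":  "transport damage",
}

def filter_by_ontology(terms: list[str]) -> list[str]:
    """Keep only the terms that are concepts of the ontology."""
    return [t for t in terms if t in ONTOLOGY]

print(filter_by_ontology(["hello", "packaging", "torn", "delivery"]))
# -> ['packaging', 'delivery']
```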
7. An overview of text classification models
7.1 Bayes Method
This algorithm is based on the maximum a posteriori probability principle. For the object being classified, the likelihood functions of each class are computed, and from them the posterior probabilities of the classes are calculated. The object is assigned to the class with the maximum posterior probability.
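A minimal sketch of a naive Bayes text classifier, assuming scikit-learn and a tiny hypothetical training sample of complaint texts.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# hypothetical training sample: complaint text -> complaint type
train_texts = [
    "the label is pasted incorrectly and is unreadable",
    "the packaging was damaged during transportation",
    "the label colours do not match the approved design",
    "the box arrived crushed and wet",
]
train_labels = ["labeling defect", "transport damage",
                "labeling defect", "transport damage"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# the class with the maximum posterior probability is chosen
print(model.predict(["the label fell off the bottle"]))        # -> ['labeling defect']
print(model.predict_proba(["the box was broken in transit"]))  # posterior probabilities
```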
Pros:
- to use the method, knowledge of a priori information is sufficient;
- inferential statements are easy to understand;
- the method provides a way to use subjective probabilistic estimates.
Cons:
- determining all interactions in Bayesian networks for complex systems is not always feasible;
- the Bayesian approach requires knowledge of a set of conditional probabilities, which are usually obtained by expert methods, so its application relies heavily on expert judgment.
7.2 Support Vector Machine (SVM)
It is used to solve classification problems. The main idea of the method is to construct a hyperplane that separates the sample objects in an optimal way. The algorithm works under the assumption that the greater the distance between the separating hyperplane and objects of separable classes, the smaller the average classifier error will be. [7,10]
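A minimal sketch of an SVM text classifier on the same kind of data, assuming scikit-learn's LinearSVC and a hypothetical training sample.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

train_texts = [
    "the label is pasted incorrectly",
    "the packaging was damaged during transportation",
    "wrong label on the cream jar",
    "the goods were crushed in transit",
]
train_labels = ["labeling defect", "transport damage",
                "labeling defect", "transport damage"]

# LinearSVC fits a maximum-margin separating hyperplane in TF-IDF space
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_texts, train_labels)
print(model.predict(["the label on the bottle is misplaced"]))  # expected: ['labeling defect']
```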
Pros:
- The Convex Quadratic Programming Problem is well studied and has a unique solution.
- SVM is equivalent to a two-layer neural network, where the number of neurons in the hidden layer is determined automatically as the number of support vectors.
- The principle of the optimal separating hyperplane leads to maximizing the width of the separating strip, and therefore to a more confident classification.
Cons:
- instability to noise: outliers in the original data become margin-violating support objects and directly affect the construction of the separating hyperplane;
- general methods for constructing kernels and straightening (feature) spaces best suited for a particular task are not described;
- there is no built-in feature selection.
7.3 k-nearest neighbors
In order to find the rubrics relevant to a document, the document is compared with all documents in the training set. For each document from the training sample, a distance is computed as the cosine of the angle between the feature vectors. The k documents of the training sample closest to the given one are then selected. Relevance is calculated for each category, and categories with relevance above some given threshold are considered relevant to the document. [8,11]
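A minimal sketch of the k-nearest-neighbours approach with the cosine distance, assuming scikit-learn and a hypothetical training sample.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

train_texts = [
    "the label is pasted incorrectly",
    "the packaging was damaged during transportation",
    "wrong label on the jar",
    "the box was crushed in transit",
]
train_labels = ["labeling defect", "transport damage",
                "labeling defect", "transport damage"]

# cosine metric: neighbours are the documents with the smallest angle to the query vector
model = make_pipeline(TfidfVectorizer(),
                      KNeighborsClassifier(n_neighbors=3, metric="cosine"))
model.fit(train_texts, train_labels)
print(model.predict(["the label on the bottle is damaged"]))
```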
Pros:
- robustness to anomalous outliers, since records containing them are unlikely to be among the k nearest neighbors; if this does happen, the impact on the voting (especially weighted voting) is also likely to be insignificant, and therefore the impact on the classification result will be small;
- The software implementation of the algorithm is relatively simple;
- The results of the algorithm are easy to interpret. The logic of the algorithm is understandable to experts in various fields.
Cons:
- the method does not build any model that generalizes previous experience, although such classification rules could themselves be of interest;
- when classifying an object, all available data is used, so the KNN method is quite computationally expensive, especially in the case of large amounts of data;
- high labor intensity due to the need to calculate distances to all examples;
- Increased requirements for the representativeness of the source data.
All of the previously listed methods, except for the Bayesian method, use a vector representation of the document, in which the content is represented as a vector of the terms contained in the document. The classifier is a special document whose vector is formed at the training stage and consists of the average values of the weights of the terms included in the documents of the training sample. These methods have much in common and differ only in the way of training and building the classifier vector. The classification itself is the calculation of the angle between two vectors as the degree of their similarity.
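A minimal sketch of the classifier-vector idea described above: the vector of each category is the mean of its training document vectors, and classification compares the query vector with each such vector by cosine similarity (assuming scikit-learn and NumPy; the data are hypothetical).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

train_texts = [
    "the label is pasted incorrectly",
    "wrong label on the jar",
    "the packaging was damaged during transportation",
    "the box was crushed in transit",
]
train_labels = ["labeling defect", "labeling defect",
                "transport damage", "transport damage"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(train_texts).toarray()

# the "classifier vector" of each category = mean of its training document vectors
centroids = {label: X[[i for i, l in enumerate(train_labels) if l == label]].mean(axis=0)
             for label in set(train_labels)}

query = vectorizer.transform(["the label came off the bottle"]).toarray()[0]
scores = {label: float(cosine_similarity([query], [c])[0, 0])
          for label, c in centroids.items()}
print(max(scores, key=scores.get), scores)
```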
If a domain ontology is used for classification, then the document vector can be compared with the vector of the ontology itself. This implies two important differences from classical machine learning methods: [9]
- The description of the subject area in the form of an ontology is itself a classifier, thus, time and computing resources are not wasted on building an average document from the training sample.
- With this approach, only those terms that are included in the considered ontology are included in the document vector. This means that concepts that are not among the ontology concepts are excluded from the process of calculating term weights.
Conclusions
At this stage of the master's work, the goal and objectives of the system have been defined, and tools similar in subject to the master's work have been studied and analyzed. Existing methods of knowledge representation, text preprocessing and text classification have been described and analyzed.
At the time of writing this abstract, the master's work has not yet been completed. Expected completion: May 2023. The full text of the work and materials on the topic can be obtained from the author or his supervisor after the specified date.
List of used sources
- RCO Text Categorization Engine [Electronic resource]. – Access mode: [Link]
- OpenText Auto-Classification [Electronic resource]. – Access mode: [Link]
- ABBYY FlexiCapture. Универсальная платформа для интеллектуальной обработки информации [Electronic resource]. – Access mode: [Link]
- Леонова Ю. В., Федотов А. М., Федотова О. А. О подходе к классификации авторефератов диссертаций по темам // Вестн. НГУ. Серия: Информационные технологии. 2017. Т. 15, № 1. С. 47–58.
- Представления знаний в интеллектуальных системах, экспертные системы [Electronic resource]. – Access mode: [Link]
- Грушин М.А. Автоматическая классификация текстовых документов с помощью онтологий // ФГБОУ ВПО МГТУ им. Н.Э. Баумана. Эл № ФС77-51038.
- Воронцов К. В. Лекции по методу опорных векторов [Electronic resource]. – Access mode: [Link]
- Классификация данных методом k-ближайших соседей [Electronic resource]. – Access mode: [Link]
- Данченков С.И., Поляков В.Н. Классификация текстов в системе узлов лексической онтологии // Физико-математические науки. Том 152, кн.1, 2010 г.
- Машина опорных векторов [Electronic resource]. – Access mode: [Link]
- Метод k взвешенных ближайших соседей (пример) [Electronic resource]. – Access mode: [Link]