Abstract

Introduction

Currently, there is a need to create a "living stylistic dictionary of the Russian language", as well as a language corpus and principles for the stylistic description of vocabulary using a particular word as an example. This demand is driven by changes in the vocabulary of the modern Russian language, which are connected, firstly, with socio-economic changes in the life of modern society and, secondly, with the movement of a significant share of communication into the Internet space.

This goal gives rise to the following tasks: to define the concept of natural language processing and its criteria, and to identify the main tasks of natural language processing.

1. Methods of information retrieval in natural language text processing

A significant place in text search technology is occupied by natural language processing.

Natural language processing (NLP) is understood as solving problems related to understanding and analyzing texts, performing various operations on them, and generating them [9]. Examples of such tasks: classification and clustering of stored document collections, in-depth text analysis, translation of documents from one language to another, etc. The whole variety of information retrieval methods based on text processing and analysis relies on document indexing. Most information retrieval systems preprocess (index) all documents available in the system; metasearch systems are the exception [9]. The main difficulties encountered in processing natural language texts are listed below:

  1. The problem of synonymy. One concept can be expressed by different words. As a result, relevant documents that use synonyms of the concepts specified by the user in the query can be missed.
  2. The problem of homonymy and related phenomena. Grammatical homonyms are words with different meanings that coincide in spelling in certain grammatical forms; they can belong to the same or to different parts of speech. Lexical homonyms are words of the same part of speech that are identical in sound and spelling but differ in lexical meaning.
  3. Stable word combinations. Collocations can have a meaning different from the meanings of the individual words they consist of.
  4. Morphological variation. In many natural languages, words have several morphological forms that differ in spelling (a small indexing sketch after this list illustrates how this is handled).
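To make the role of indexing and morphological normalization more concrete, below is a minimal illustrative Python sketch (not taken from any of the systems described here; the toy normalizer, documents, and query are invented for the example). It builds an inverted index over normalized word forms so that a query can match a document even when the word forms differ.

    from collections import defaultdict

    def normalize(token):
        """Toy normalizer: lowercase and strip a few common English suffixes.
        A real system would use a proper stemmer or lemmatizer."""
        token = token.lower()
        for suffix in ("ing", "ed", "es", "s"):
            if token.endswith(suffix) and len(token) - len(suffix) >= 3:
                return token[: -len(suffix)]
        return token

    def build_index(documents):
        """Inverted index: normalized term -> set of ids of documents containing it."""
        index = defaultdict(set)
        for doc_id, text in enumerate(documents):
            for token in text.split():
                index[normalize(token)].add(doc_id)
        return index

    def search(index, query):
        """Ids of documents containing every (normalized) term of the query."""
        terms = [normalize(t) for t in query.split()]
        if not terms:
            return set()
        result = set(index.get(terms[0], set()))
        for term in terms[1:]:
            result &= index.get(term, set())
        return result

    docs = [
        "the system indexes documents before searching",
        "users search indexed collections of texts",
    ]
    idx = build_index(docs)
    print(search(idx, "indexing documents"))  # finds document 0 despite different word forms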

2. Main tasks of language processing

Natural language processing is a general direction of artificial intelligence and mathematical linguistics. It studies the problems of computer analysis and synthesis of natural languages. In relation to artificial intelligence, analysis means understanding language, and synthesis means generating literate text. Solving these problems would mean creating more convenient forms of interaction between computer and human. Understanding and recognizing natural language is a key task, because recognizing living language requires enormous knowledge of the language system, its features, and its patterns.

There are five main and most urgent tasks of natural language processing [1, 8]:

  1. One of the most important tasks is speech recognition: the process of converting a human voice signal into digital information. It can be used by people who are unable to type by hand, or simply to simplify and speed up text input.
  2. Text analysis is the process of extracting meaningful, high-quality information from natural language text in order to automate data extraction and analysis.
  3. Information retrieval is the process of finding, in documents contained in searchable databases, information that corresponds to a given query on a subject.
  4. Information extraction covers natural language processing tasks that automatically extract the required data from a source of information, usually unstructured text.
  5. Machine translation, or automatic translation, is the translation of spoken or written natural-language texts into another natural language by means of computer programs designed for this type of task.

3. Using machine learning to work with text

The general interest in neural network technologies and deep learning has not passed over computational linguistics, that is, automatic processing of natural language texts. At recent conferences of the Association for Computational Linguistics (ACL), the main scientific forum in this field, the vast majority of reports were devoted to the use of neural networks, both for solving already known problems and for studying new ones that could not be solved with standard machine learning tools. The increased attention of linguists to neural networks has several reasons. The use of neural networks, firstly, significantly improves the quality of solutions to some standard tasks of classifying texts and sequences; secondly, it reduces the labor involved in working directly with texts; and thirdly, it makes it possible to solve new problems (for example, to create chatbots). At the same time, neural networks cannot be considered a completely independent mechanism for solving linguistic problems.

One of the most popular applications of neural networks is the construction of word vectors, which belongs to the field of distributional semantics: it is believed that the meaning of a word can be understood from the meaning of its context, that is, from the surrounding words. Indeed, if we come across an unfamiliar word in a text in a language we know, in most cases we can guess its meaning. The mathematical model of word meaning is the word vector: a row of a large word-context matrix built on a sufficiently large corpus of texts. The "contexts" of a particular word can be its neighboring words, words occurring with it in the same syntactic or semantic construction, and so on. The cells of such a matrix can hold raw frequencies (how many times a word occurred in a given context), but more often they hold the positive pointwise mutual information (PPMI) coefficient, which shows how non-random the appearance of a word in a particular context was. Such matrices can be successfully used to cluster words or to search for words that are close in meaning to a given word.
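As a rough illustration of the word-context matrix and PPMI weighting described above, here is a small, self-contained Python sketch on a toy three-sentence corpus. The corpus, window size, and similarity check are invented for the example; a real system would use a large corpus and sparse matrices.

    import numpy as np
    from collections import Counter

    corpus = [
        "the cat drinks milk",
        "the dog drinks water",
        "the cat chases the dog",
    ]
    window = 2  # a word's context = words within 2 positions of it

    # Count word-context co-occurrences within the window.
    pairs = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    pairs[(word, tokens[j])] += 1

    vocab = sorted({w for pair in pairs for w in pair})
    index = {w: k for k, w in enumerate(vocab)}

    # Raw word-context co-occurrence matrix: rows are words, columns are contexts.
    M = np.zeros((len(vocab), len(vocab)))
    for (w, c), n in pairs.items():
        M[index[w], index[c]] = n

    # PPMI(w, c) = max(0, log(P(w, c) / (P(w) * P(c)))).
    total = M.sum()
    p_wc = M / total
    p_w = M.sum(axis=1, keepdims=True) / total
    p_c = M.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))
    ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

    # Rows of `ppmi` are the word vectors; cosine similarity finds close words.
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    print(cosine(ppmi[index["cat"]], ppmi[index["dog"]]))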

4. Application of machine learning in everyday life

Supervised learning methods are used when we know the so-called answers for the objects of an existing training sample and want to predict them for new objects. The answers are also called the dependent variable. Within this class there are, in turn, several types of tasks.

In the first type, the answers are values of some numeric quantity, as in the story with coffee: for each object of the training sample we knew the amount of coffee drunk, and for the new object (Nikita) the model predicted this value. This type of problem, when the dependent variable is a real number (that is, it can take any value on the number line), is called a regression problem.

In problems of the second type, the answers belong to a limited set of possible categories (or classes). Continuing the office analogy: imagine that the office manager Michael bought two kinds of New Year gifts for his colleagues, t-shirts and notebooks. In order not to spoil the surprise, Michael wants to build a model that would predict which gift an employee would like to receive, based on data from personal profiles (an attentive reader will notice that in reality, to build the model, Michael would still have to ask some colleagues about the desired gift in order to form a training sample). This type of task, when objects have to be assigned to one of several possible categories, that is, when the dependent variable takes a finite number of values, is called a classification task. The gift example is a binary classification: there are only two classes, t-shirts and notebooks; when there are more classes, one speaks of multi-class classification. Perhaps the most familiar example of classification is credit scoring: when deciding whether to give you a loan, the bank relies on the prediction of a model trained on a variety of features (age, salary level, various parameters of credit history) to determine whether you are able to return the requested amount.

Another type of supervised learning is the ranking task. It is solved when you search for something in a search engine such as Google: there are many documents, and they need to be sorted by their relevance (semantic proximity) to the query.

Unsupervised learning methods are used when there are no correct answers, only objects and their features, and the task is to determine the structure of this set of objects. These include clustering: given a set of objects, they have to be divided into groups so that each group contains objects similar to each other. This can be useful, for example, when there is a large collection of texts that needs to be structured automatically by dividing the texts into topics. Clustering can also be used to split the users of an online store into segments, for example, in order to offer different products to different groups based on their interests. Another example of unsupervised learning is anomaly detection, mentioned earlier: among many objects, those that differ strongly from the majority have to be highlighted. Anomaly detection methods are used to find atypical transactions or atypical behavior on a site in order to prevent fraud; they also help detect failures in various systems on the basis of multiple sensors.
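As a small, hypothetical illustration of the clustering task just mentioned (dividing texts by topic), here is a Python sketch using TF-IDF features and k-means from scikit-learn; the four example texts and the choice of two clusters are invented for the illustration.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    texts = [
        "the central bank raised interest rates again",
        "inflation and interest rates worry investors",
        "the team won the championship final",
        "the striker scored twice in the final match",
    ]

    # Turn each text into a TF-IDF feature vector (no answers are given: unsupervised).
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(texts)

    # Split the collection into 2 groups of similar documents.
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
    labels = kmeans.fit_predict(X)

    for text, label in zip(texts, labels):
        print(label, text)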
In addition to supervised and unsupervised learning, there are more refined types of tasks: for example, in semi-supervised learning the answers are known only for part of the objects of the sample.

Within the types of problems described above, machine learning offers different algorithms. One of them we have already met: linear regression, which was applied in the problem of predicting the amount of coffee. Linear regression is one of the most well-studied methods of statistics and machine learning. It is suitable for describing linear dependencies, that is, those that can be well approximated by a straight line.

Today machine learning algorithms can be divided into traditional methods and deep learning (a common name for different kinds of multilayer neural networks). For traditional algorithms to work well, the data preprocessing stage known as feature engineering is very important (there is no established Russian translation of this term; it can be roughly rendered as feature construction). It is the process of forming and selecting features. As a rule, working with features is a laborious, time-consuming process that requires deep immersion in the subject area of the problem being solved. Jeremy Howard, one of the authors of the well-known deep learning course fast.ai, gives the following example. A team of specialists from Stanford led by Andrew Beck studied breast cancer. To build a model capable of predicting whether a patient with a tumor would survive, they had to examine a huge number of breast biopsies. In this way they determined which patterns in the images may be associated with the patient's death and formed hundreds of complex features, such as relations between adjacent epithelial cells, after which a team of programmers developed algorithms for recognizing these features in the images.

The fundamental difference of deep learning is that it is able to take on most of the work of forming features, using only uniformly presented input data without manually constructed complex features. In the breast cancer example, medical images can be fed in simply as sequences of individual pixel brightnesses. Layer by layer, multilayer neural networks combine pixels into increasingly useful levels of abstraction and thus form a representation of the image as a whole, as well as of the parts of it that affect the final prediction (for example, the tumor and its size).
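Returning to the simplest of the algorithms mentioned above, here is a toy linear regression sketch in the spirit of the coffee example; the numbers and the use of scikit-learn are invented for illustration, and the point is simply fitting a straight line to (feature, answer) pairs.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Feature: hours spent in the office; answer: cups of coffee drunk that day.
    hours = np.array([[4], [6], [8], [9], [10]])
    cups = np.array([1, 2, 3, 3, 4])

    model = LinearRegression()
    model.fit(hours, cups)

    # Predict for a new object: an employee who spends 7 hours in the office.
    print(model.predict(np.array([[7]])))   # roughly 2-3 cups
    print(model.coef_, model.intercept_)    # slope and intercept of the fitted line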

Conclusion

Text processing problems arose almost immediately after the advent of computer technology. Despite the half-century history of research in the field of artificial intelligence, the accumulated experience of computational linguistics, and the huge leap in the development of IT and related disciplines, a satisfactory solution to most practical text processing problems has not yet been found. However, the IT industry demands satisfactory solutions to at least some text processing tasks. Thus, the development of data warehouses makes relevant the tasks of extracting information and of generating correctly constructed text documents. The rapid development of the Internet has led to the creation and accumulation of huge amounts of textual information, which requires full-text search tools and automatic text classification (in particular, anti-spam software); the first problem is more or less satisfactorily solved, while the second is still far from being solved.

At the time of writing this abstract, the master's work has not yet been completed. Final completion: May 2019. The full text of the work and materials on the topic can be obtained from the author or the supervisor after the specified date.

References

  1. Moore E.F. Gedanken-experiments on sequential machines / E.F. Moore // Automata studies, Annals of mathematical studies. – 1956. – vol. 34. – pp. 129-153.
  2. Gill A. Introduction to the Theory of Finite-State Machines / A. Gill. – Moscow: Nauka, 1966. – 272 p.
  3. Miller R. Switching Theory / R. Miller. – Moscow: Nauka, 1971. – Vol. 2: Sequential Circuits and Machines. – 304 p.
  4. Minsky M. Computations and Automata / M. Minsky. – Moscow: Mir, 1971. – 364 p.
  5. Hopcroft J. Introduction to Automata Theory, Languages, and Computation / J. Hopcroft, R. Motwani, J. Ullman. – Moscow: Williams, 2002. – 528 p.
  6. Ito M. Algebraic theory of automata and languages / M. Ito. – World Scientific Publishing, 2004. – 199 pp.
  7. Machine learning in natural language processing tasks: a review of the current state of research. – Access mode: https://cyberleninka.ru..
  8. Wilkinson B. Fundamentals of Digital Circuit Design / B. Wilkinson. – Moscow: Williams, 2004. – 320 p.
  9. Modern methods of natural language processing. – Access mode: https://cyberleninka.ru..
  10. Breeding K. Digital design fundamentals / K. Breeding. – Prentice Hall, 1992. – 446 pp.
  11. 5 natural language processing methods that are rapidly changing the world around us. – Access mode: https://neurohive.io....
  12. Yasnitsky L.N. Introduction to Artificial Intelligence / L.N. Yasnitsky. – Moscow: Academia Publishing Center, 2005. – 176 p.
  13. How to solve 90% of NLP tasks: a step-by-step guide to natural language processing. – Access mode: https://habr.com..
  14. Natural language processing in Python. – Access mode: https://proglib.io/p/fun-nlp/
  15. How to solve the problem of machine understanding of natural language. – Access mode: https://habr.com/post/271321/
  16. Modern methods of natural language processing. – Access mode: https://periodicals.karazin.ua....