Abstract

Contents

Introduction
1. Relevance of the topic
2. Goal and tasks of the research
3. Research and development overview
3.1 Overview of the basic principles of constructing computer morphology
3.2 Graphematic analysis
3.3 Morphological analysis using a dictionary
4. Morphological analyzer
4.1 General requirements for the morphological analyzer
4.2 Morphological analysis of a word
4.3 Basic dictionaries
Conclusion
References

Introduction

Most natural languages exhibit morphological variability of words. It is especially pronounced in Russian and Ukrainian, which are inflectional languages with a complex system of inflections.

An information retrieval system, or any other system that works with Russian or another inflected language, must take this feature into account. This is usually done by a dedicated component called the morphological analysis module.

The purpose of the work is to research and develop algorithms for constructing a dictionary-based morphological analyzer with a formal description of the language. The main problem is to develop methods for structuring the morphological dictionary and the programming methods needed to implement the analyzer.

1. Relevance of the topic

Morphological analysis and synthesis of word forms are relevant because a morphological analysis block is a necessary part of most programs that work with natural language texts, whatever their level and purpose.

Today, developing algorithms for a dictionary-based morphological analyzer is one of the pressing tasks in computational linguistics. Algorithms developed now will make it easier to build more complex ones with broader functionality in the future.

2. Goal and tasks of the research

The purpose of the work is to research and develop algorithms for constructing a dictionary-based morphological analyzer with a formal description of the language for a web platform. The main problem is to develop methods for structuring the morphological dictionary and the programming methods needed to implement the analyzer.

The main objectives of the study:

Study algorithms for forming the structure of the morphological dictionary under the restrictions imposed by web platform development tools.
Design the database model.
Develop algorithms for constructing a morphological analyzer based on a dictionary.
Develop a method for recognizing words that are not in the dictionary.

The subject of the research is methods of creating websites, sales automation systems (CRM, BI), algorithms for the process of making managerial decisions.

The object of the research is the result of execution of algorithms for constructing a morphological analyzer based on a dictionary and their further optimization.

3. Research and development overview

Morphological analysis determines, for a given word form, the lexeme it belongs to (its initial form) and its morphological characteristics, such as gender, case, and number. The analyzer being developed must perform morphological analysis and identify the nouns that denote concepts in a given text.

3.1 Overview of the basic principles of constructing computer morphology

The methods of morphological analysis used in linguistic processors can be divided into declarative and procedural ones. Declarative methods rely on a complete dictionary of all possible word forms of every word. Each word form is supplied with complete and unambiguous morphological information, which includes both constant and variable morphological parameters. Morphological analysis is then reduced to finding the word form in the dictionary and passing the morphological information stored for it to the program.[11]
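As an illustration of the declarative approach, here is a minimal Python sketch. The toy dictionary FULL_FORMS, its feature names, and the function analyze_declarative are assumptions made for this example and are not taken from the work itself.

```python
# Hypothetical toy dictionary of complete word forms (illustration only).
# Each word form maps to every analysis it admits:
# (lemma, constant parameters, variable parameters).
NOUN_MASC = {"pos": "noun", "gender": "masc"}

FULL_FORMS = {
    "стол": [
        ("стол", NOUN_MASC, {"case": "nom", "number": "sing"}),
        ("стол", NOUN_MASC, {"case": "acc", "number": "sing"}),
    ],
    "стола": [("стол", NOUN_MASC, {"case": "gen", "number": "sing"})],
    "столы": [
        ("стол", NOUN_MASC, {"case": "nom", "number": "plur"}),
        ("стол", NOUN_MASC, {"case": "acc", "number": "plur"}),
    ],
}


def analyze_declarative(word_form):
    """Look the word form up and return its stored analyses (empty if unknown)."""
    return FULL_FORMS.get(word_form.lower(), [])


print(analyze_declarative("Стола"))
```

The whole analysis is a single lookup, which is why this approach is fast but requires every word form to be stored in advance.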

Procedural methods use probabilistic and statistical techniques and empirically constructed lexicons of suffixes or quasi-suffixes and of stems or quasi-stems. Each word is divided into a stem and an affix, and the dictionary contains only the stems, along with references to the corresponding rows in the table of possible affixes. The main criterion for splitting a word into a stem and an affix is that the stem must remain unchanged in all word forms of the word. Since a large number of Russian words share the same affixes, the combined size of the stem dictionary and the affix dictionary is much smaller than that of the complete dictionary of word forms used in declarative methods. However, the analysis procedure becomes more complicated: all stems that coincide with the initial letters of the analyzed word must be selected from the stem dictionary one by one, and for each such stem all of its possible affixes must be tried. If a "stem + affix" variant coincides exactly with the analyzed word, the variant is considered successful, and the morphological information corresponding to that stem and affix is passed to the program. As a rule, the constant morphological parameters are determined by the stem, and the variable ones by the affix.
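The stem-and-affix search described above can be sketched as follows. The tables STEMS and AFFIX_ROWS, the row name "noun_masc_hard", and the feature labels are hypothetical and serve only to show the control flow.

```python
# Hypothetical affix table: row id -> list of (affix, variable parameters).
AFFIX_ROWS = {
    "noun_masc_hard": [
        ("", {"case": "nom", "number": "sing"}),
        ("а", {"case": "gen", "number": "sing"}),
        ("у", {"case": "dat", "number": "sing"}),
        ("ы", {"case": "nom", "number": "plur"}),
    ],
}

# Hypothetical stem dictionary: stem -> (constant parameters, affix row id).
STEMS = {
    "стол": ({"pos": "noun", "gender": "masc"}, "noun_masc_hard"),
    "столб": ({"pos": "noun", "gender": "masc"}, "noun_masc_hard"),
}


def analyze_procedural(word):
    """Try every stem that starts the word, then every affix allowed for it."""
    results = []
    for stem, (constant, row_id) in STEMS.items():
        if not word.startswith(stem):
            continue
        remainder = word[len(stem):]
        for affix, variable in AFFIX_ROWS[row_id]:
            if affix == remainder:  # exact "stem + affix" match
                results.append((stem, constant, variable))
    return results


print(analyze_procedural("столбы"))
```

For «столбы» both stems match the beginning of the word, but only «столб» + «ы» reproduces it exactly, so only that variant is returned.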

A combined version of morphological analysis is also possible. In this case, both a dictionary of word forms and a dictionary of stems are used. First, the word is looked up in the dictionary of word forms, and if the search succeeds, the analysis is complete. Otherwise, the stem dictionary and the procedural analysis method are used.
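A minimal sketch of the combined scheme, under the assumption that irregular forms are kept in a small word-form dictionary and regular ones are handled through stems. All names and data below (WORD_FORMS, STEMS, AFFIXES, analyze_combined) are hypothetical.

```python
# Hypothetical data: a word-form dictionary for irregular forms,
# plus a stem dictionary with an affix table for regular ones.
WORD_FORMS = {"люди": [("человек", "noun, nom, plur")]}
STEMS = {"завод": "noun_masc"}
AFFIXES = {"noun_masc": {"": "nom, sing", "а": "gen, sing", "ы": "nom, plur"}}


def analyze_combined(word):
    """Stage 1: word-form lookup. Stage 2: procedural stem + affix analysis."""
    hits = WORD_FORMS.get(word)
    if hits:
        return hits
    results = []
    for stem, row_id in STEMS.items():
        if word.startswith(stem):
            tags = AFFIXES[row_id].get(word[len(stem):])
            if tags is not None:
                results.append((stem, "noun, " + tags))
    return results


print(analyze_combined("люди"))
print(analyze_combined("заводы"))
```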

3.2 Graphematic analysis

Before proceeding with the morphological analysis of individual words, it is necessary to carry out a graphematic analysis of the input text.

The main purpose of the graphematic module is to extract complete word forms from the array of texts in the database [5]. Graphematic analysis works with the external representation of the text and uses a table of stop words. This table stores numbers, special characters, and high-frequency words of the language that are irrelevant for text search and do not need morphological analysis.

Graphematic analysis has three functions:

cutting off stop words in the text;

splitting data into three streams (full word forms, abbreviations, digital and symbolic complexes);

indexing of each stream.

The unit of graphematic analysis is a string of characters delimited by spaces on both sides. Each selected string is processed sequentially by heuristic rules: punctuation marks are cut off, the string is checked for vowels, for alternation of upper and lower case, and so on. Depending on the results, the string is sent to one of three data streams:

digital and symbolic complexes («кг», «ст.», «12.01.99»);

abbreviations - the names of states, organizations, enterprises («СССР», «ЮНЕСКО», «ДорСтройСервис»);

full word forms.

The first two data streams do not need morphological analysis. Complete word forms are passed to morphological analysis, whose purpose is to partition the whole set of word forms into subsets by lexeme (a set of word forms that differ from each other only in inflectional meanings), to reduce all elements of each subset to a single stem, to determine the grammatical characteristics of the lexeme unambiguously, and to index the texts by the stems found in them.
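The routing of character strings into three streams can be sketched as below. The stop-word sample, the punctuation set, and the three heuristics (inner capital letters, digits, absence of vowels) are simplifying assumptions for this example and not the actual rule set of the graphematic module.

```python
# Hypothetical stop-word table and heuristics (illustration only).
STOP_WORDS = {"и", "в", "на", "не"}
VOWELS = set("аеёиоуыэюя")
PUNCTUATION = ".,;:!?()\"«»"


def classify(chunk):
    """Return (stream name, token), or None if the chunk is dropped."""
    token = chunk.strip(PUNCTUATION)  # cut off punctuation marks
    if not token or token.lower() in STOP_WORDS:
        return None
    if any(ch.isupper() for ch in token[1:]):  # capitals inside the string
        return "abbreviation", token
    if any(ch.isdigit() for ch in token) or not (set(token.lower()) & VOWELS):
        return "digital_symbolic", token  # digits or no vowels at all
    return "word_form", token


def split_streams(text):
    """Split a text into the three streams, keeping the original order."""
    streams = {"word_form": [], "abbreviation": [], "digital_symbolic": []}
    for chunk in text.split():
        result = classify(chunk)
        if result is not None:
            stream, token = result
            streams[stream].append(token)
    return streams


streams = split_streams("В СССР завод выпускал 5 кг стали и ДорСтройСервис.")
print(streams)
```

Only the "word_form" stream would then be passed on to the morphological analyzer.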

3.3 Morphological analysis using a dictionary

Models that use a dictionary can give a more complete analysis of a word form, that is, they operate with a larger number of grammatical features. Their analysis is also more accurate than that of models without a dictionary. On real texts, however, dictionary-based systems often fail, because complete dictionaries do not exist. The vocabulary of a language is constantly replenished with new words. Each subject area has its own terminology, its own subset of the vocabulary, and it is impossible to include all existing terminology in a general dictionary. It is equally impossible to list all existing first names and surnames that have a regular declension.

This method follows the academic linguistic model of description. It distinguishes the main paradigmatic classes, which correspond to the types of declension and conjugation, together with the rules of regular (phonetic) alternations, while irregular forms, such as the strong verbs of German and English, are listed explicitly. Lexicons of this type for Russian are compiled on the basis of a grammatical dictionary model, for example those of A. Zaliznyak or A. Lebedev. Such a model defines 8 classes of nominal declension and 16 classes of verb conjugation, and alternations in the stem and in the verb theme are placed in a separate set of post-morphological alternation rules.

The indexing process is the same as in the method of constructing morphology without a dictionary. First, the text undergoes graphematic analysis and is divided into words; then the words are fed to the input of the morphological analyzer.

The input parameter is the textual representation of the original word. The purpose and result of morphological analysis is to determine the morphological characteristics of a word and its main word form. The list of morphological characteristics and the permissible values of each of them depend on the natural language. However, a number of characteristics (for example, the part of speech) are present in many languages. The result of the morphological analysis of a word is often ambiguous.

A. Zaliznyak's dictionary contains the initial forms of Russian words, each supplied with a specific code [2]. A known system of rules makes it possible to construct all forms of a word from its initial form and the corresponding code. Besides building each word form, the system of rules automatically assigns morphological characteristics to it. Exact (clear) morphological analysis requires a dictionary of all words and all word forms of the language. Such a dictionary takes a word form as input and returns its morphological characteristics. It can be built from A. Zaliznyak's dictionary or a similar one by an obvious algorithm: go through all the words of the source dictionary, generate all word forms of each of them, and enter these forms into the new dictionary.
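The "obvious algorithm" can be sketched as follows. The paradigm codes "m1" and "f1" and their ending tables are invented for this example and do not reproduce the actual codes or rules of A. Zaliznyak's dictionary.

```python
from collections import defaultdict

# Hypothetical paradigm table:
# code -> (ending to strip from the initial form, [(ending, characteristics)]).
PARADIGMS = {
    "m1": ("", [("", "nom sing"), ("а", "gen sing"),
                ("у", "dat sing"), ("ы", "nom plur")]),
    "f1": ("а", [("а", "nom sing"), ("ы", "gen sing"),
                 ("е", "dat sing"), ("ы", "nom plur")]),
}

# Hypothetical source dictionary: initial form + code.
LEXICON = [("завод", "m1"), ("карта", "f1")]


def build_full_form_dictionary(lexicon, paradigms):
    """Generate every word form of every entry and index the forms."""
    forms = defaultdict(list)
    for lemma, code in lexicon:
        strip, endings = paradigms[code]
        stem = lemma[:len(lemma) - len(strip)]
        for ending, characteristics in endings:
            forms[stem + ending].append((lemma, characteristics))
    return dict(forms)


full_forms = build_full_form_dictionary(LEXICON, PARADIGMS)
print(full_forms["карты"])
```

The form «карты» receives two entries here, which illustrates how one input word can have several variants of morphological characteristics.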

With this approach, morphological analysis of a word amounts to finding it in the dictionary, where the exact values of all its morphological characteristics are already stored. One and the same input word may have several variants of these values.

Unfortunately, this method is not always applicable: an input word may be missing from the dictionary of all word forms. This can happen because of errors in the input text, because of proper names in the text, and so on. When the method does not give the desired result, fuzzy morphology is applied.

The purpose of morphemic analysis of a word is to separate a word into prefixes, roots, suffixes, and endings.

The dictionary of morphemes of the Russian language indicates the division of each word into separate parts, but the types of each of them are not indicated - which of them is a prefix, which is a root, etc.

The set of all the roots of words in the Russian language is open, but the set of all possible prefixes, suffixes and endings is limited; in addition, it is known that in any word prefixes come first, then roots, then suffixes and endings. Therefore, on the basis of the dictionary of morphemes of the Russian language, you can build another dictionary that will contain not only the division of each word into parts, but also the type of each of them. In this case, to carry out morphemic analysis of the word, you must refer to this dictionary.

Morphemic analysis is not limited to dictionary calls. In a situation where a word is absent in the dictionary, it is possible to directly conduct an analysis based on the standard structure of Russian words (prefix - root - suffix - ending) and the set of all prefixes, suffixes and endings.
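A minimal sketch of such dictionary-free morphemic analysis, assuming small closed lists of prefixes, suffixes, and endings. The lists below are tiny illustrative samples, and the function split_morphemes is hypothetical.

```python
# Hypothetical closed sets of affixes (tiny samples for illustration).
PREFIXES = ["пере", "при", "под", "по", "за"]
SUFFIXES = ["ик", "к", "н"]
ENDINGS = ["ами", "ой", "а", "ы", ""]


def split_morphemes(word):
    """Return all (prefix, root, suffix, ending) splits with a non-empty root."""
    results = []
    prefixes = [""] + [p for p in PREFIXES if word.startswith(p)]
    for prefix in prefixes:
        rest = word[len(prefix):]
        for ending in ENDINGS:
            if not rest.endswith(ending):
                continue
            body = rest[:len(rest) - len(ending)]
            suffixes = [""] + [s for s in SUFFIXES if body.endswith(s)]
            for suffix in suffixes:
                root = body[:len(body) - len(suffix)]
                if root:  # the root may not be empty
                    results.append((prefix, root, suffix, ending))
    return results


candidates = split_morphemes("подсказка")
print(candidates)
```

The function returns several candidate splits, so a real analyzer would still need some way to rank them or filter them against known roots.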

Let us return to the morphological analysis of a word in the situation where its characteristics could not be determined by the methods of exact (clear) morphology, but the word could be split into parts. The presence of certain morphemes can determine the morphological characteristics of a word: a system of rules can be built that relies on the presence or absence of particular parts and produces one or more hypotheses about the morphological parameters. Such a set of rules can be constructed in two ways. The first is based on the morphemic analysis of the words contained in the dictionary of all word forms together with their morphological characteristics. More formally, pairs of values are known, each consisting of the morphemic structure of a word and its morphological characteristics. These pairs are nothing more than the "input" and "output" of a system of rules that determines the morphological characteristics of a word from its morphemic structure. The task of building such a system of rules can be solved with a self-learning system, implemented with decision trees, inductive logic programming (ILP), or other algorithms.
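The idea of learning rules from "input/output" pairs can be illustrated with a deliberately simplified stand-in. Instead of decision trees or ILP, the sketch below counts which part-of-speech tags co-occur with word endings of one to three letters. The training pairs, the tag names, and both functions are assumptions made for this example.

```python
from collections import Counter, defaultdict

# Hypothetical training pairs (word, tag); a real system would take them
# from the dictionary of all word forms.
SAMPLES = [
    ("читать", "verb"), ("писать", "verb"), ("играть", "verb"),
    ("красный", "adj"), ("синий", "adj"), ("новый", "adj"),
    ("кровать", "noun"),
]


def train_ending_rules(samples, max_len=3):
    """Count how often each final letter sequence occurs with each tag."""
    counts = defaultdict(Counter)
    for word, tag in samples:
        for n in range(1, min(max_len, len(word)) + 1):
            counts[word[-n:]][tag] += 1
    return counts


def guess_tags(word, rules, max_len=3):
    """Return tags for the longest known ending, most frequent first."""
    for n in range(min(max_len, len(word)), 0, -1):
        ending = word[-n:]
        if ending in rules:
            return [tag for tag, _ in rules[ending].most_common()]
    return []


rules = train_ending_rules(SAMPLES)
print(guess_tags("гуглить", rules))
```

The result is a ranked list of hypotheses rather than a single answer, which matches the "one or more assumptions" produced by the rule system described above.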

4. Morphological analyzer

4.1 General requirements for the morphological analyzer

The main task of the morphological analyzer is to determine the morphological features of the words in a text and their canonical forms (CF). The CF can be either a stem (or even a root) or the traditional normal form (NF) of a word, that is, a certain initial form: for a verb, the infinitive; for a noun, the nominative singular.

Reduction to the canonical form requires a dictionary from which the canonical form of a word and information about its morphological features and declension can be obtained. The effectiveness of the morphological analyzer depends on how the following main components are implemented.

1. Organization of the structure of the dictionary. The dictionary should be easy to implement and should not contain complex or bulky elements. Access to dictionary elements should use as little time and additional resources as possible.

2. Algorithm for determining the canonical form and declension. The dictionary elements should contain the information needed for reduction to the canonical form, and this information should be as complete as possible for identifying lexemes, so that no additional pre- or post-processing methods are required.

3. Algorithms for determining the morphological characteristics of words that are absent in the dictionary. Natural languages are constantly updated with new words, so it is very important to update the dictionary in a timely manner.

4.2 Morphological analysis of a word

Morphological analysis can be divided into "direct" and "reverse". Both methods are widely used in search engines.

The direct method consists in finding all word forms of a word from its normal form. This operation is used when selecting documents. Since documents containing any form of the word are selected, the search result includes not only documents where the word appears in the same form as in the query, but also documents containing other forms of this word.

The algorithm for constructing all word forms and determining its morphological characteristics primarily depends on the structure of the morphological dictionary used by the analyzer.

The reverse method consists in finding the normal form from an arbitrary word form. This operation is used when indexing text. It reduces both the size of the index and the time needed to find the documents that satisfy the query.

The algorithm for finding the normal form from an arbitrary one also depends on the structure of the dictionary.
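Both operations can be illustrated on a toy search index. The dictionary FORMS_BY_LEMMA, the sample documents, and all function names are hypothetical. The sketch also ignores homonymy: for instance, «стали» is treated only as a form of «сталь».

```python
from collections import defaultdict

# Hypothetical morphological dictionary: normal form -> all word forms.
FORMS_BY_LEMMA = {
    "завод": ["завод", "завода", "заводу", "заводы"],
    "сталь": ["сталь", "стали", "сталью"],
}
# Inverse mapping used by the reverse method.
LEMMA_BY_FORM = {form: lemma
                 for lemma, forms in FORMS_BY_LEMMA.items()
                 for form in forms}


def expand(lemma):
    """Direct method: all word forms of a normal form."""
    return FORMS_BY_LEMMA.get(lemma, [lemma])


def normalize(word):
    """Reverse method: normal form of an arbitrary word form."""
    return LEMMA_BY_FORM.get(word.lower(), word.lower())


def build_index(documents):
    """Index documents by normal forms only, which keeps the index small."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.split():
            index[normalize(word)].add(doc_id)
    return index


def search(index, query_word):
    return index.get(normalize(query_word), set())


docs = {1: "заводы выпускают сталь", 2: "цена стали растет", 3: "новый завод"}
index = build_index(docs)
print(search(index, "сталью"))
```

Here the query is normalized in the same way as the indexed text. An alternative design is to index raw word forms and expand the query with expand() at search time.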

4.3 Basic dictionaries

A morphological dictionary can be built from any grammatical dictionary that describes a formal model of the language. For Russian, the following dictionaries can be used:

A. Lebedev's spelling dictionary of the Russian language;

A. Zaliznyak's grammatical dictionary.

4.3.1 Dictionary of the Russian language by A. Lebedev

A. Lebedev's spelling dictionary consists of two files: the dictionary itself and the affix rules. The dictionary contains 137.2 thousand words and more than 1.354 million word forms. It was developed specifically for the Ispell spell checker, but a morphological dictionary for the analyzer can be built on its basis.

The first file is the dictionary: a list of words in NF, each separated by the "/" symbol from the identifiers of its groups of word formation rules.

Word forms for the word "dress" are built according to the rules of group C. Note that a word can carry several groups (join / BLWR): this means that all the rules from the listed groups (B, L, W, R) apply to that word. This is a fairly flexible principle: it reduces redundancy, because a number of rules can be placed in separate groups and applied in addition to the main group.
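Reading one line of such a dictionary file can be sketched as follows, assuming the "word/FLAGS" layout described above with one-letter group identifiers. The sample lines are hypothetical and are not quoted from A. Lebedev's dictionary.

```python
def parse_dictionary_line(line):
    """Split 'word/FLAGS' into the word and a list of one-letter rule groups."""
    word, _, flags = line.strip().partition("/")
    return word, list(flags)


# Hypothetical sample lines in the assumed format.
print(parse_dictionary_line("играть/BLWR\n"))
print(parse_dictionary_line("метро"))
```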

The affix rule file (*.aff) consists of flags and word formation rules.

It can be read as follows: in the word «играть» the ending «-ть» is replaced with «ю», but only if the word matches the mask «[АЕ]ть», where the mask is an ordinary Perl-compatible regular expression. The flags listed next to a word in the dictionary indicate which groups of rules can be applied to it and which cannot.

Morphology based on this dictionary is much inferior to commercial developments, since the dictionary was originally intended for spell checking. For example, «приодень», «приоденьте» and «приоденьтесь» are treated as completely different words, although in reality they have the same stem. Also, the dictionary lacks "new" words borrowed from other languages, for example «менчердайзер».

Some of these shortcomings can be overcome with A. Zaliznyak's dictionary, on the basis of which almost all modern systems working with Russian morphology have been built.
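The application of a single rule of this kind can be sketched in Python. The function apply_suffix_rule and its parameter layout (mask, ending to strip, ending to add) are assumptions for this example; the sketch does not parse the real *.aff syntax.

```python
import re


def apply_suffix_rule(word, mask, strip, add):
    """Apply one suffix rule: if the end of the word matches the mask,
    remove the ending `strip` and append `add`; otherwise return None."""
    if not re.search(mask + r"$", word, flags=re.IGNORECASE):
        return None
    if strip and not word.endswith(strip):
        return None
    return word[:len(word) - len(strip)] + add


# The rule from the text: «-ть» is replaced with «ю» under the mask «[АЕ]ть».
print(apply_suffix_rule("играть", "[АЕ]ть", "ть", "ю"))
print(apply_suffix_rule("говорить", "[АЕ]ть", "ть", "ю"))
```

The mask is matched case-insensitively because it is written in upper case in the text, while the sample words are in lower case.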

4.3.2 Dictionary of the Russian language by A. Zaliznyak

A. Zaliznyak's grammar dictionary currently contains 161 thousand words [2]. This is a fundamental work on morphology, where for the first time a systematic approach to describing grammatical paradigms was proposed, including not only a change in the letter composition of words, but also stress.

The dictionary was first published in 1977, since then it has been reprinted several times. The electronic version of this dictionary formed the basis for most modern computer programs that work with Russian morphology (spell checking systems, automatic translation, abstracting, etc.).

Each word is represented in the dictionary by its initial, or dictionary, form (which serves as the headword of the entry). For declinable parts of speech, this is the nominative case (singular if the word inflects for number, masculine if it inflects for gender); for verbs, the infinitive; for uninflected parts of speech, the only available form.

All words (both in their original and in other forms) are given in the usual spelling notation, but with an indication of stress.

The structure of a dictionary entry.

A dictionary entry generally consists (not counting the headword) of:

1) the main alphabetic symbol,

2) index,

3) additional marks and instructions (in special cases, one or another of these elements may be absent).

If different meanings of the word correspond to differences in the formation of forms, the dictionary entry is divided into a corresponding number of parts ("sub-entries"), each of which is built as an independent entry (that is, it has its own main alphabetic symbol, its own index, and its own additional marks and instructions). Each such sub-entry begins on a new line.

Main alphabetic symbol.

The main alphabetic symbol (for all words except nouns and verbs) is a letter abbreviation that denotes the part of speech. For nouns, it consists of the symbols of gender and animacy (or the symbol "plural"); for verbs, of the symbols of aspect and of transitivity or intransitivity. The part of speech is not stated separately in these cases, since it is clear from the main alphabetic symbol.

Index.

Only inflected parts of speech have an index. The index is made up of the following elements (only the first of them is required; any of the others may be absent).

1. A number (from 0 to 8 for nominal parts of speech, from 1 to 16 for verbs): the type of declension or conjugation.

2. A superscript asterisk (*) or circle (°) next to the number: symbols for the presence of certain alternations in the stem (in some special cases * and ° appear simultaneously).

3. A Latin letter (or two letters separated by a slash), with or without primes: the stress scheme. This index element is missing only after the digit 0.

4. A Russian letter or letter sequence (between dashes, in brackets), for example (-д-) for "to lead" or (-им-) for "to separate": an indication (possible only for a small number of verbs) that allows the present-tense stem to be formed correctly.

5. One or more circled numbers (from 1 to 9): symbols of the most common deviations from standard declension or conjugation.

6. The sign "ё": a symbol of an alternation in the stem, ё (stressed) - е (unstressed). A variant of this sign is the sign "o" (possible only for verbs to sing).

Examples of indices: 1a; 3a/c; 5*b; 1*a/b2, ё.

For the overwhelming majority of words, the index consists only of the 1st and 3rd elements, that is, of a number and a Latin letter (or letters), for example, 1a, 4b, 3a/c (or even of the single digit 0). In what follows, such indices are called simple.

The combination of the 1st, 2nd and 3rd elements of an index is called the main part of the index.
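Splitting the main part of an index into its elements can be sketched with a regular expression. The pattern below is an assumption for this example: it expects ASCII digits and Latin letters, restricts stress letters to the range a-f, and treats * and ° as plain characters after the number. It does not handle the additional elements 4 to 6.

```python
import re

# Assumed shape of the main part: number, optional * and/or °,
# optional stress scheme (a letter with primes, possibly two via a slash).
INDEX_RE = re.compile(
    r"^(?P<type>\d{1,2})"
    r"(?P<alternation>[*°]*)"
    r"(?P<stress>[a-f]'*(?:/[a-f]'*)?)?$"
)


def parse_main_index(index):
    """Return the elements of the main part of an index, or None if it
    does not fit the assumed pattern."""
    match = INDEX_RE.match(index)
    if match is None:
        return None
    return {
        "type": int(match.group("type")),
        "alternation": match.group("alternation"),
        "stress": match.group("stress"),
    }


for sample in ["1a", "3a/c", "5*b", "0"]:
    print(sample, parse_main_index(sample))
```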

For some nouns and adjectives, the index (along with an additional gender or part-of-speech symbol) is enclosed in angle brackets. This means that the word is declined as indicated in the angle brackets, although it belongs to a different gender or a different part of speech. Examples:

- "man": мо <жо 1a> (that is, this noun is masculine but is declined following the feminine pattern);

- "comma": ж <п 1b> (that is, this noun is declined as an adjective).

Building forms directly from the index.

The method of constructing forms relies on the fact that each element of the index has an independent morphological meaning. These meanings are explained in the "Grammatical information" part of A. Zaliznyak's dictionary, in the sections entitled "The meaning of alphabetic characters and index elements". The method of constructing forms directly from the index is given at the end of these sections. If the dictionary entry contains additional notes or instructions, the constructed forms must be amended accordingly with this method, as with the previous one.

Conclusion

In the course of this work, the following topics were considered and the following tasks were solved:

1. Basic principles of constructing computer morphology.

2. Basic algorithms for creating an experimental model of a morphological analyzer based on a dictionary with a formal description of the language for the web platform.

3. Methods for constructing the structure of the morphological dictionary and programming methods were developed that solve the stated problem with minimal effort for texts of arbitrary (or almost arbitrary) complexity.

4. Three structures of the morphological dictionary for the proposed analyzer model, differing from each other in complexity of construction, speed, and quality of analysis.

5. The method for determining the morphological characteristics of words that are absent in the dictionary was considered and introduced into the proposed model of the analyzer.

6. To implement the proposed model of the morphological analyzer and to analyze the effectiveness of the developed dictionary structures, modern web programming tools were used, such as the high-level programming language PHP.

This development is open: the proposed model of the morphological analyzer can be built into any web-platform computer system that processes natural language texts.

References

1. Goldsmith J. Unsupervised learning of the morphology of a natural language / J. Goldsmith. // University of Chicago. – 1998. – No. 1. – P. 1-46.

2. Зализняк А. А. Грамматический словарь русского языка. Словоизменение / А. А. Зализняк. – М. : Русский язык, 1977. – 880 с.

3. Аношкина Ж. Г. Морфологический процессор русского языка / Ж. Г. Аношкина. // Альманах «Говор». – 1995. – №6. – С. 17-23.

4. Гершензон Л. М. Синтаксический анализ в системе РМЛ / Л. М. Гершензон, И. М. Ножов, Д. В. Панкратов, А. В. Сокирко [Электронный ресурс]. – 2003. – Режим доступа: http://www.Aot.Ru/docs/synan.html

5. Ножов И. М. Процессор автоматизированного морфологического анализа без словаря. Деревья и корреляция / И. М. Ножов. // Диалог'2000. Труды конференции. – Протвино, 2000. – Т. 2. – С. 284-290.

6. Сокирко А. В. Морфологические модули на сайте / А. В. Сокирко. // Диалог-2004. – 2004. – Т. 1. – С. 3-18.

7. Rabiner L. R. A tutorial on Hidden Markov Models and selected applications in speech recognition / L. R. Rabiner. – Proc. of the IEEE, 1989. – 340 p.

8. Ермаков А. Е. Выделение объектов в тексте на основе формальных описаний / А. Е. Ермаков, В. В. Плешко, В. А. Митюнин. // Информационные технологии. – 2003. – №12. – С. 1-6.

9. Анисимов А. В. Компьютерная лингвистика для всех. Мифы. Алгоритмы. Язык / А. В. Анисимов. – Киев : Наукова думка, 1991. – 447 с.

10. Baeza-Yates R. Modern information retrieval / R. Baeza-Yates, B. Ribeiro-Neto. – ACM Press, 1999. – 580 p.

11. Андреев А. М., Березкин Д. В., Брик А. В. Лингвистический процессор для информационно-поисковой системы [Электронный ресурс]. – Режим доступа: http://www.inteltec.ru/publish/articles/textan/art_21br.shtml

Krut Andrey

Institute of Computer Science and Technology

Department of Artificial Intelligence and Systems Analysis

Speciality Software technologies for intelligent systems

Development of algorithms for constructing a morphological analyzer based on a dictionary

Scientific adviser: Assoc. Olga Kopytova