Andrew Leonov

Faculty of Computer Science and Technology

Department of Artificial Intelligence

Speciality «Artificial Intelligence Systems»

Methods for automated correction of specialized natural language texts

Scientific adviser: Ph.D., Professor Roman Babakov
Abstract
Contents
Introduction

Correction of textual information is one of the most important components of projects aimed at workflow automation. Thanks to modern computer technologies, many methods for correcting textual information have been developed, which has made it possible to create systems that satisfy the basic requirements of workflow systems. However, the problems of increasing the speed and quality of correction and of minimizing memory consumption remain open and require additional research in this area.

When checking text, modern text editors do not correct errors themselves but only offer options for correcting them. This requires user intervention, which is not always convenient. Automatic correction of spelling errors can be a more effective means of minimizing typos when creating electronic text documents. This problem defines the purpose of the present research.

1. Relevance of the topic

The penetration of electronic information into all areas of human activity compels libraries to actively adopt modern information technologies, implement automated systems, create electronic libraries, and develop Internet services. The library world perceives these changes as one of the characteristic trends of a future society, in which the library acts as a system that facilitates access to information resources.

With the growing volume of electronic scientific publications, the number of publishers, university publishing departments, and research institutions producing them is constantly increasing, as is the number of individual authors who use electronic devices to write articles, dissertations, and other documents. At the same time, the level of user training in computer typesetting and knowledge of typographic rules and traditions remains low. These rules govern the layout of headings, lists, tables, bibliographies, formulas, numbers, and more. Errors associated with non-compliance with these rules are called typographic errors. At the current level of technology, such errors are fixed by proofreaders manually, which is time-consuming. Most errors are typical, which creates the prerequisites for automating the proofreading process.

Automating the proofreading stage in the preparation of scientific publications would significantly reduce its cost and duration and improve the quality of electronic text information. In this paper, this problem is posed as a problem of automatic processing of specialized texts.

At the moment, there are high-quality tools for automatically finding and fixing spelling errors that use dictionaries and morphological analysis of the word forms of the text, but most of them are commercial.

Thus, there is a need for a new study focusing directly on the automation of spelling error correction.

2. Research goals and objectives

The objects of study are structured text documents that can be described using a syntax tree. The subject of the study is algorithms for the automatic correction of specialized text documents.

The purpose of the research is to develop methods, algorithms, and a technology for creating an automated system that significantly increases the efficiency of working with text documents.

To achieve this goal, the work poses the problems of formalizing the rules that describe the correction of typographic and spelling errors, and of developing efficient algorithms for finding the locations of errors in documents and synthesizing rules for correcting them.

Many problems stem from the fact that when processing documents manually, proofreaders follow recommendations that are not sufficiently formalized, and manually compiling a reasonably complete set of rules suitable for automatic use is hard to accomplish. Some of the recommendations used are quite complex and highly context-dependent, which requires complex models to describe the correction rules.

The task of automatic correction of textual information is therefore to construct a set of rules that can be used in algorithms for finding and correcting errors.
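In the simplest case such rules can be formalized as pattern/replacement pairs applied to the text. The following Python sketch only illustrates this idea; the rule set and function names are hypothetical, and real correction rules would be far more numerous and context-dependent.

    import re

    # A hypothetical, minimal rule set: each rule is a (pattern, replacement) pair.
    TYPOGRAPHIC_RULES = [
        (re.compile(r" {2,}"), " "),                # collapse runs of spaces
        (re.compile(r"\s+([,.;:!?])"), r"\1"),      # no space before punctuation
        (re.compile(r"([,.;:!?])(\w)"), r"\1 \2"),  # one space after punctuation
    ]

    def apply_rules(text, rules=TYPOGRAPHIC_RULES):
        """Apply every pattern/replacement rule to the text in order."""
        for pattern, replacement in rules:
            text = pattern.sub(replacement, text)
        return text

    print(apply_rules("Errors ,such as  these,are typical."))
    # -> "Errors, such as these, are typical."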

3. Error detection methods

At least three methods of automated detection of spelling errors in texts are known: statistical, polygram-based, and dictionary-based [1]. In the statistical method, the word forms that make up the text are extracted one after another, and during the check the resulting list is ordered by frequency of occurrence. When scanning of the text is complete, the ordered list is presented to a human controller, for example, on a display screen. In any competently written text, spelling errors are unsystematic and rare, so the distorted word forms end up somewhere near the end of the list. Having noticed them there, the controller can find them in the text and correct them.
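A minimal sketch of the statistical method in Python (assuming simple whitespace tokenization; a real system would use a proper tokenizer) builds the frequency-ordered list whose rare tail is shown to the controller:

    from collections import Counter

    def rare_word_forms(text, max_frequency=1):
        """Return the rare tail of the frequency-ordered list of word forms:
        the forms occurring at most `max_frequency` times are the candidates
        a human controller would inspect for possible misspellings."""
        counts = Counter(text.lower().split())
        ordered = counts.most_common()  # most frequent first
        return [w for w, c in ordered if c <= max_frequency]

    sample = "the dog saw the cat and the dog barked at teh cat"
    print(rare_word_forms(sample))
    # -> ['saw', 'and', 'barked', 'at', 'teh']  (the typo 'teh' ends up in the tail)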

In the polygram method [1], all two- and three-letter combinations in the text (bigrams and trigrams) are checked against a table of their admissibility in the natural language. If a word form contains no inadmissible polygrams, it is considered correct; otherwise it is considered questionable and is presented to a human for inspection and, if necessary, correction.
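As a sketch, a bigram admissibility check might look as follows in Python; the admissibility table here is a toy set, whereas a real one would be derived from a large corpus of the language.

    # Toy table of letter bigrams considered admissible in the language.
    ADMISSIBLE_BIGRAMS = {"th", "he", "in", "er", "an", "re", "nd", "on",
                          "en", "at", "og", "do", "ca", "or", "ec", "ct"}

    def is_questionable(word, table=ADMISSIBLE_BIGRAMS):
        """A word form is questionable if it contains any bigram absent from the table."""
        word = word.lower()
        bigrams = [word[i:i + 2] for i in range(len(word) - 1)]
        return any(bg not in table for bg in bigrams)

    for w in ("dog", "dqg"):
        print(w, "questionable" if is_questionable(w) else "correct")
    # dog -> correct, dqg -> questionable (the bigrams 'dq' and 'qg' are inadmissible)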

In the dictionary method, all word forms included in the text, after ordering or without it, in their original form or after morphological analysis, are compared with the contents of a pre-existing machine dictionary. If the dictionary admits a word form, it is considered correct; otherwise it is presented to the controller. The controller may leave the word as it is; leave it and add it to the dictionary, so that later in the session similar words are recognized by the system without comment; replace (correct) the word in this place; request the same replacement throughout the text; or edit the word together with its surroundings. The operations available for questionable portions of the text can be combined at the discretion of the designer of the correction system.
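A minimal sketch of the dictionary check in Python (with a toy machine dictionary; the interactive actions of the controller are reduced here to collecting the questionable word forms):

    # Toy machine dictionary; a real one would contain the word forms of the
    # language or be combined with morphological analysis of stems and endings.
    MACHINE_DICTIONARY = {"the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"}

    def questionable_forms(text, dictionary=MACHINE_DICTIONARY):
        """Return the word forms that the dictionary does not admit; in a full
        system these would be presented to the controller for correction,
        replacement throughout the text, or insertion into the dictionary."""
        tokens = text.lower().split()
        return [t for t in tokens if t not in dictionary]

    print(questionable_forms("the quick brown fox jumsp over the lazy dog"))
    # -> ['jumsp']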

The results of numerous studies [1] have shown that only the dictionary method both saves human labor and leads to a minimum of erroneous actions of both kinds: skipping errors in the text, on the one hand, and flagging correct words as questionable, on the other. The dictionary method has therefore become dominant, although the polygram method is sometimes used as an auxiliary one.

4. Algorithms for analyzing textual information
4.1. Morphological analysis algorithms

Morphological analysis algorithms recognize the elements of the morphological structure of a word: the root, the stem, affixes, and the ending. The algorithms most commonly used at the morphological level are stemming and lemmatization. The goal of stemming is to identify the semantically similar stems of word forms, which is necessary for adequate weighting of terms in information retrieval. The input of a stemmer is text; the output is a list of the stems of the words in the input text. Stemmers have been developed since the late 1950s and are classified into algorithmic and dictionary-based. Algorithmic stemmers operate on the basis of data files containing lists of suffixes and inflections: during morphological analysis the program matches the suffixes and endings of the words in the input text against the corresponding list, and the analysis begins with the last character of the word. Dictionary stemmers operate on the basis of dictionaries of word stems: during morphological analysis such a stemmer matches the stems of the words in the input text against the corresponding dictionary, and the analysis begins with the first character of the word.
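The difference between the two kinds of stemmers can be illustrated with a deliberately simplified Python sketch; the suffix list and the stem dictionary are toy examples, not the data used in real stemmers.

    # Toy suffix list for an algorithmic stemmer; matching starts from the end of the word.
    SUFFIXES = ("ing", "edly", "ed", "es", "s")

    def algorithmic_stem(word):
        """Strip the longest matching suffix from the end of the word."""
        for suffix in sorted(SUFFIXES, key=len, reverse=True):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return word

    # Toy stem dictionary for a dictionary stemmer; matching starts from the
    # beginning of the word.
    STEM_DICTIONARY = {"bett", "child", "comput"}

    def dictionary_stem(word):
        """Return the longest dictionary stem that the word starts with, if any."""
        candidates = [s for s in STEM_DICTIONARY if word.startswith(s)]
        return max(candidates, key=len) if candidates else word

    print(algorithmic_stem("computing"))   # -> 'comput'
    print(dictionary_stem("children"))     # -> 'child'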

Dictionary stemmers provide greater search precision, while algorithmic stemmers provide greater completeness but allow more errors, which manifest themselves as understemming or overstemming. Overstemming occurs when words with different semantics are identified under one stem; understemming occurs when words with the same semantics are not identified under one stem, for example bet as the stem of better, or childr as the stem of children. In the first case there is overstemming, since with bet as its stem the adjective better is identified with the verb bet and its derivatives (bets, betting), whose meaning has nothing to do with the meaning of the adjective. In the second case there is understemming, since the stem childr does not allow the plural (children) and the singular (child) to be identified as one token.

4.2. Parsing algorithms

One of the fundamental algorithms used at the syntactic level is syntactic splitting. The input of the splitter is text; the output is the list of sentences of the text. Syntactic decomposition algorithms have been developed since the 1960s and recognize sentences on the basis of text formatting characters: spaces, punctuation, and carriage-return characters. Splitting text into sentences is complicated by the lack of standard text formatting; periods, exclamation marks, and question marks, which are commonly used as separators, can appear not only at the end but also in the middle of a sentence. Sentences are the basic unit of analysis in many systems, and in automatic summarization the output text consists of sentences. Mistakes in identifying sentences significantly reduce the effectiveness of such systems as a whole.
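A minimal sentence splitter can be sketched in Python with a regular expression over the separator characters mentioned above; the handling of abbreviations is deliberately naive and only illustrates why mid-sentence periods make the problem harder.

    import re

    # Toy abbreviation list; a full splitter would need a much richer one,
    # plus rules for numbers, initials, ellipses, and formatting characters.
    ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "dr.", "prof."}

    def split_sentences(text):
        """Split text on '.', '!' or '?' followed by whitespace and a capital letter,
        then re-join fragments that end in a known abbreviation."""
        parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
        sentences, buffer = [], ""
        for part in parts:
            buffer = (buffer + " " + part).strip() if buffer else part
            if buffer.split()[-1].lower() not in ABBREVIATIONS:
                sentences.append(buffer)
                buffer = ""
        if buffer:
            sentences.append(buffer)
        return sentences

    print(split_sentences("Dr. Smith arrived. He brought data, e.g. Tables and figures."))
    # -> ['Dr. Smith arrived.', 'He brought data, e.g. Tables and figures.']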

According to the deduction-inversion architecture of text decomposition, the text is first divided into paragraphs, then into words, and then sentences are assembled from the words. Thus, decomposition begins with a larger unit (the paragraph), proceeds to a smaller unit (the word), and then returns to a larger one (the sentence). The deduction-inversion architecture makes it possible to ignore such components of the text as headings, subheadings, and tables of contents, since they are not included in paragraphs.
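A sketch of this order of decomposition (paragraphs, then words, then sentences) might look as follows; paragraph boundaries are assumed here to be blank lines, which is only one possible convention.

    def deduction_inversion_decompose(text):
        """Decompose text into paragraphs, then words, then re-assemble sentences;
        headings and other fragments outside paragraphs are simply not seen here."""
        paragraphs = [p for p in text.split("\n\n") if p.strip()]
        result = []
        for paragraph in paragraphs:
            words = paragraph.split()                     # paragraph -> words
            sentences, current = [], []
            for word in words:                            # words -> sentences
                current.append(word)
                if word.endswith((".", "!", "?")):
                    sentences.append(" ".join(current))
                    current = []
            if current:
                sentences.append(" ".join(current))
            result.append(sentences)
        return result

    print(deduction_inversion_decompose("First sentence. Second one!\n\nNext paragraph here."))
    # -> [['First sentence.', 'Second one!'], ['Next paragraph here.']]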

Syntactic decomposition is the basis for a variety of algorithms that recognize the phrase structure of a sentence. Widespread are algorithms for extracting n-grams: phrases consisting of two (bigrams), three (trigrams), or more (tetragrams, pentagrams, hexagrams) tokens [2]. Phrases are extracted taking into account the position of each token in the sentence. For example, the sentence John has a dog includes 4 unigrams, 3 bigrams (John has, has a, a dog), 2 trigrams (John has a, has a dog), and 1 tetragram (the whole sentence). In a sentence of n tokens the number of bigrams ng(s) is n - 1 and the number of trigrams is n - 2; in general, the number of n-grams of order k is ng(s) = n - (k - 1) = n - k + 1. Recognition of n-grams is based on the corresponding rules.
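The extraction of position-ordered n-grams over tokens can be sketched in Python, reproducing the John has a dog example (whitespace tokenization is assumed):

    def ngrams(tokens, n):
        """Return all position-ordered n-grams of the token sequence;
        a sentence of k tokens yields k - n + 1 n-grams."""
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    tokens = "John has a dog".split()
    for n in range(1, 5):
        print(n, ngrams(tokens, n))
    # 1 -> 4 unigrams, 2 -> 3 bigrams, 3 -> 2 trigrams, 4 -> 1 tetragram (the whole sentence)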

Analysis of the distribution of n-grams reveals statistically significant phrases and is often used in stochastic algorithms for annotating tokens with part-of-speech tags. The beginning and end of a sentence are marked with special conditional tags (false tags), which makes it possible to treat even a sentence consisting of a single token as a trigram and to set the probability parameters needed to select a particular tag.

The distribution of n-grams is also used for automatic classification and categorization, since it serves as an important parameter for assigning a text to a specific category, type, group, or genre. In analysis at the syntactic level the basic units are bigrams and trigrams, since the recurrence of phrases with many tokens is unlikely. Higher-order n-gram analysis is used in automatic spelling correction, as well as in optical character recognition (OCR), where the basic units are the characters within tokens.

Chunkers are used to analyze morphologically significant phrases; their output is a list of phrases of a certain type (nominal, verbal, adjectival, adverbial). The most common are noun phrase chunkers, which recognize phrases with a governing noun. It is phrases of this type that denote the objects described in a text, and ranking them by weighting coefficients yields a list of keywords reflecting the main content of the text. Summarizing a text on the basis of a dictionary of nouns gives almost the same results as summarization that also takes into account words belonging to other parts of speech [3]. Phrases of this type are recognized on the basis of preliminary part-of-speech tagging and the combination of individual parts of speech in the sentence according to grammar rules, as illustrated after the next paragraph.

Phrase structure rules were developed for English within the concept of generative grammar proposed by Chomsky. Grammar rules are written in the form NP → NN; NP → Det NN; NP → Det A NN, which specifies the composition of the phrase, in this case a noun phrase (NP), and the word order [4]. The first rule states that a noun phrase may consist of a single noun (NN); in the second case it consists of a determiner (Det) and a noun, the determiner preceding the noun, with the reverse order being incorrect; in the third case the phrase consists of a determiner, an adjective (A), and a noun, and other word orders are incorrect.
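These three rules can be expressed, for example, as a regular-expression chunk grammar in NLTK; the sketch below assumes NLTK is installed and uses hand-tagged tokens so that no tagger model needs to be downloaded.

    import nltk

    # The rules NP -> NN, NP -> Det NN, NP -> Det A NN collapse into one
    # chunk pattern: an optional determiner, optional adjectives, then a noun.
    grammar = "NP: {<DT>?<JJ>*<NN.*>}"
    chunker = nltk.RegexpParser(grammar)

    # Hand-tagged sentence (Penn Treebank tags) to keep the sketch self-contained.
    tagged = [("John", "NNP"), ("has", "VBZ"), ("a", "DT"),
              ("big", "JJ"), ("dog", "NN")]

    tree = chunker.parse(tagged)
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
        print(subtree)
    # (NP John/NNP)
    # (NP a/DT big/JJ dog/NN)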

Hierarchical syntactic structures are used in machine translation systems to establish the equivalence of syntactic structures in two languages. At the syntactic level, decomposition can be carried out not only into word combinations and sentences but also into clauses, i.e. elementary predicative structures expressing a judgment. The notion of a clause corresponds to a certain extent to the notion of a proposition in linguistics, but clauses are identified on formal grounds, which may include, for example, the presence of a noun phrase and an accompanying verbal group. Decomposition into clauses is used in text mining systems for a more adequate transmission of the text content.

Conclusions

Thus, this paper has described the basic algorithms for automatic correction of texts, and a software package has been designed that automatically detects errors and omissions in documents containing structured text information. A distinctive feature of the package is its focus on documents with specialized text.

Promising further tasks include developing effective methods for training the classifiers that detect errors and improving the algorithms used.

At the time of writing this abstract, the master's work is not yet complete. Expected completion date: winter 2014-2015. The full text of the work and materials on the topic can be obtained from the author or his adviser after that date.
References

  1. Peterson J.L. Computer programs for detecting and correcting spelling errors // Communications of the ACM. – 1980. – Vol. 23, № 12. – P. 676-687.
  2. Bickel S., Haider P., Scheffer T. Predicting Sentences using N-Gram Language Models. – 2005. [Electronic resource]. – Access mode: http://delivery.acm.org/10.1145/1230000/1220600/p193-bickel.pdf
  3. Яцко В.А. Симметричное реферирование: теоретические основы и методика / В.А. Яцко // Научно-техническая информация. Сер. 2. – 2002. – № 5. – С. 18-28.
  4. Brinton L.J. The structure of modern English / L.J. Brinton. – Amsterdam; Philadelphia: John Benjamins, 2000. – 335 p.
  5. Яцко В.А. Алгоритмы предварительной обработки текста: декомпозиция, аннотирование, морфологический анализ / В.А. Яцко, М.С. Стариков, Е.В. Ларченко // Научно-техническая информация. Сер. 2. – 2009.
  6. Штурман Я.П. Анализ систем автоматизированного обнаружения орфографических ошибок. – НТИ, 1985.
  7. Бабко-Малая О.Б. Методы и системы автоматизированного обнаружения и коррекции текстовых ошибок / О.Б. Бабко-Малая, В.А. Шемраков // Препринт № 5. – Л.: БАН СССР, 1987. – 46 с.
