DonNTU   Masters' portal

Abstract


Introduction

The special role of linguistics in solving practical problems of society stems from the nature of natural human language, a unique means of storing and transmitting information. Identifying the formal structures of natural language (NL), formalizing language as a whole, and building a constructive theory and computer model of language have been priority directions of the science in recent decades.

The task of intelligent processing of natural language text first appeared at the turn of the 1960s and 1970s. The emergence of computers and the development of Chomsky's theory of generative language models led to close cooperation between linguistics and computer science and to the birth of computational linguistics. Its task is to develop algorithms and programs based on the formal linguistic models created within mathematical linguistics.

The fullest capabilities and the highest quality of text analysis are obtained by complete linguistic analysis. The linguistic processor (LP) of a system that supports complete analysis of NL text contains three main components corresponding to the levels of language: morphological, syntactic and semantic. The output of one component is the input of the next: the morphological component builds morphological interpretations of the input words; the syntactic component builds the syntactic structure of the sentence; the semantic component builds the semantic graph of the text.
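The three-stage pipeline can be sketched as follows. This is a minimal illustrative sketch, not the thesis's actual processor: all function names, tags and the toy lexicon are invented for the example.

```python
# A minimal sketch of the three-stage linguistic processor (LP) pipeline:
# morphological -> syntactic -> semantic. All names and tags are
# illustrative, not taken from any real system.

def morphological(text):
    """Tag each word form with an (illustrative) part-of-speech label."""
    lexicon = {"the": "DET", "cat": "NOUN", "sleeps": "VERB"}
    return [(w, lexicon.get(w, "UNK")) for w in text.lower().split()]

def syntactic(tagged):
    """Group the tagged word forms into a flat sentence structure."""
    return {"subject": [w for w, t in tagged if t in ("DET", "NOUN")],
            "predicate": [w for w, t in tagged if t == "VERB"]}

def semantic(structure):
    """Reduce the syntactic structure to a simple predicate-argument pair."""
    return (structure["predicate"][0], structure["subject"][-1])

# The output of each component is the input of the next.
print(semantic(syntactic(morphological("The cat sleeps"))))  # ('sleeps', 'cat')
```

The point of the sketch is only the data flow: each component consumes the previous component's output and enriches the representation.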

Identifying words connected by sense is an essential step in extracting knowledge from NL texts. This problem cannot be solved without high-quality parsing, because structural and semantic relations are expressed grammatically through syntactic relations. Syntax describes how word forms combine into phrases and sentences and the types of syntactic relationships between words and sentences, that is, the language mechanisms that underlie speech formation. During parsing the input text is converted into a data structure, usually a tree, that reflects the syntactic structure of the input sequence of word forms and is well suited for further processing at the semantic level.

1. Theme urgency

Information retrieval systems, interactive systems, machine translation and automatic summarization tools, text categorization and spell-checking modules all perform some form of NL-text analysis. The scope of automatic text processing is thus quite diverse, and NL-text analysis is a highly topical issue given the rapid growth in the volume of textual information and its complex structure.

Today, building a full-fledged linguistic processor (LP) is one of the most urgent tasks in computational linguistics. Solving it would allow a high degree of formalization of language structures for a variety of applications. Building reliable syntactic structures for all sentences of a text is an important and necessary step in automatic text understanding. The entities of the input text, their properties and the relations between them are described at the level of syntactic patterns, because these do not depend on the meaning of utterances; morphological and syntactic features and structures therefore serve as local-context parsing rules. Parsing thus determines the quality of the whole linguistic processor, and the creation of an effective syntactic component is an urgent task.

2. Goal and tasks of the research

The goal of this work is to develop a methodology for detecting syntactic groups in English sentences.

The main objectives of the study are:

  1. To make an analytical review of automatic parsing methods.
  2. To study the types of syntactic relations between word forms in English sentences.
  3. To develop formal rules for constructing simple syntactic groups within a sentence.
  4. To study the minimal structural schemes (MSS) of simple English sentences and to develop MSS templates for automatic selection of the predicate core of a sentence.
  5. To develop algorithms, based on the formal rules, for identifying syntactic groups and to implement them in software.
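Objective 4 can be illustrated by a small sketch of template matching. The MSS templates and tag names below are invented for the example; the thesis's actual MSS inventory is not reproduced here.

```python
# Illustrative sketch: match a sentence's part-of-speech sequence against
# minimal structural scheme (MSS) templates to locate the predicate core.
# Templates and tags are hypothetical examples, not the thesis's inventory.

MSS_TEMPLATES = [
    ("NOUN", "VERB"),          # e.g. "Birds fly"
    ("NOUN", "VERB", "NOUN"),  # e.g. "Cats chase mice"
]

def find_predicate_core(pos_tags):
    """Return the first MSS template found as a contiguous subsequence,
    together with its starting position, or None if no template matches."""
    for template in MSS_TEMPLATES:
        n, m = len(pos_tags), len(template)
        for i in range(n - m + 1):
            if tuple(pos_tags[i:i + m]) == template:
                return template, i
    return None

print(find_predicate_core(["DET", "NOUN", "VERB"]))  # (('NOUN', 'VERB'), 1)
```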

Object of research: semantic analysis of sentences.

Subject of research: detection of syntactic groups.

Methods of research: methods of automatic parsing of sentences in NL texts.

3. A review of research and development

The main task of parsing is to build the syntactic structure of the input sentence using morphological information about its word forms.

The most common representations of a sentence's syntactic structure are dependency graphs and immediate-constituents (IC) graphs; they are used in pure form or in mixed forms that combine the properties of both [1, 2].

Description of structures in the form of a classical dependency graph is based on the concept of binary phrases in a sentence with distinguished head and dependent elements. The elements are represented by graph nodes, and the subordination of one node to another by directed arcs, so a dependency graph is a directed graph. Typically one node, which in most models corresponds to the predicate, has no node that governs it, and is called the vertex. Sometimes the subject and predicate are designated as two vertices.

The subordination relation defines a partial order on the set of nodes. If several nodes are subordinate to one node, no order between them is defined: the dependency graph carries no information about the relative proximity of a dependent word to its head. Usually the subordination relation is divided into several types, and the arcs of the graph are marked with indices of syntactic relations.
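Such a graph can be stored simply as a set of labelled head-to-dependent arcs. The sketch below uses an invented sentence and invented relation labels; it shows that the predicate is recoverable as the one node no arc points to, and that co-dependents carry no mutual order.

```python
# A dependency graph as labelled directed arcs (head, dependent, relation).
# Relation labels ("subj", "obj") are illustrative.

# Sentence: "She reads books" -- "reads" is the vertex (predicate).
arcs = [
    ("reads", "She", "subj"),
    ("reads", "books", "obj"),
]

def dependents(head, graph):
    """All (dependent, relation) pairs subordinate to the given head."""
    return [(dep, rel) for h, dep, rel in graph if h == head]

def vertex(graph):
    """The vertex is the node that is no other node's dependent."""
    heads = {h for h, _, _ in graph}
    deps = {d for _, d, _ in graph}
    return (heads - deps).pop()

print(vertex(arcs))               # reads
print(dependents("reads", arcs))  # [('She', 'subj'), ('books', 'obj')]
```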

The immediate-constituents tree model is based on the idea of constructing a sentence as a sequence of pairwise syntagmatic combinations of components, from the minimal ones (individual words) to the maximal one (the sentence), whose components, in the case of a complete personal sentence, are the subject group and the predicate group.

Representation of syntactic structure as an IC tree agrees well with traditional sentence analysis, in which the subject, the predicate and their elements are described by categorial characteristics: the names of parts of speech or groups.

It should be emphasized that IC trees and dependency trees describe the syntactic structure of a sentence in different aspects. The former describe word combination explicitly but ignore the orientation of relations; the latter make it possible to consider directed relations, but only between individual words.
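The contrast can be made concrete with a toy IC tree. Unlike the dependency arcs above, the tree groups words into labelled constituents and preserves their linear order, but no arc direction is recorded. The labels and the example sentence are illustrative.

```python
# An immediate-constituents tree as nested tuples: each node pairs a
# category label with its constituents, down to individual word forms.
# Labels (S, NP, VP, ...) follow traditional categorial notation.

ic_tree = ("S",
           ("NP", ("DET", "the"), ("NOUN", "cat")),   # subject group
           ("VP", ("VERB", "sleeps")))                # predicate group

def leaves(node):
    """Recover the word-form sequence: IC trees keep linear order."""
    label, *children = node
    if len(children) == 1 and isinstance(children[0], str):
        return [children[0]]              # pre-terminal: a single word
    return [w for child in children for w in leaves(child)]

print(" ".join(leaves(ic_tree)))  # the cat sleeps
```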

Existing methods of representing syntactic structures have certain disadvantages: dependency trees do not consider relationships between phrases and syntactically cohesive groups of words, while IC systems ignore directed relations and cannot describe discontinuous phrases. Moreover, in these representations the sentence members are determined on the basis of formal characteristics rather than their semantic content. Therefore neither model gives a complete picture of the syntactic structure of a sentence.

From the perspective of formal theories of natural language description, two approaches are distinguished: formal-grammatical and probabilistic-statistical. The formal-grammatical approach aims to create complex systems of rules that allow a decision in favor of one structure or another in each case; the statistical approach gathers occurrence statistics for different structures in similar contexts and chooses between alternative structures on that basis.

Chomsky proposed a classification of formal languages and grammars that laid the basis for formal-grammatical approaches. For computational linguistics the most important of these are finite-state grammars, context-free grammars (CFG) and context-sensitive grammars.

Finite-state machines are a declarative description method and are very efficient in terms of speed, but they are limited in their ability to describe many natural language structures, such as embedded clauses.

A higher level is represented by CFGs, which are described as productions mapping a non-terminal symbol on the left-hand side to a string of terminal and non-terminal symbols on the right-hand side. CFG syntax is simple, but the simple CFG apparatus is not sufficient to describe some phenomena of natural language. In particular, context-free rules are inconvenient for describing agreement (for example, in person and number between subject and predicate) and for representing discontinuous dependencies caused by the movement of words within a phrase. In addition, a rule that expresses the relationship between components does not reflect an important feature of natural languages: the absorption of one category by another, whereby a new component appears as a substitute for the main category.
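The agreement problem can be seen in a toy CFG. To express subject-verb agreement context-freely, the grammar below has to split its categories (NP_sg/NP_pl, VP_sg/VP_pl), which is exactly the inconvenience just described. The grammar is invented for illustration.

```python
# A toy CFG as a dict from non-terminal symbols to lists of right-hand
# sides. Splitting NP/VP into singular and plural variants is the price
# of encoding agreement context-freely.

cfg = {
    "S":     [["NP_sg", "VP_sg"], ["NP_pl", "VP_pl"]],
    "NP_sg": [["cat"]],
    "NP_pl": [["cats"]],
    "VP_sg": [["sleeps"]],
    "VP_pl": [["sleep"]],
}

def generate(symbol):
    """Expand a symbol using its first production; terminals pass through."""
    if symbol not in cfg:
        return [symbol]
    return [w for s in cfg[symbol][0] for w in generate(s)]

print(" ".join(generate("S")))  # cat sleeps
```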

The so-called PCFGs (probabilistic context-free grammars), in which each rule is augmented with a probability estimate, form the basis of most probabilistic-statistical analysis methods.
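In a PCFG the score of a parse is the product of the probabilities of the rules used in its derivation, and the analyzer prefers the highest-scoring structure. The rules and numbers below are invented for illustration.

```python
# A toy PCFG: each rule (lhs, rhs) carries a probability. The probability
# of a derivation is the product of its rule probabilities.

pcfg = {
    ("S",  ("NP", "VP")): 1.0,
    ("NP", ("cat",)):     0.6,
    ("NP", ("dog",)):     0.4,
    ("VP", ("sleeps",)):  1.0,
}

def parse_probability(rules_used):
    """Multiply the probabilities of all rules in a derivation."""
    p = 1.0
    for rule in rules_used:
        p *= pcfg[rule]
    return p

derivation = [("S", ("NP", "VP")), ("NP", ("cat",)), ("VP", ("sleeps",))]
print(parse_probability(derivation))
```

When several derivations cover the same sentence, comparing these products is what lets a statistical parser decide between alternative structures.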

Conclusion

This work is aimed at improving the automatic parsing of English sentences.

A review of automatic parsing methods and models for representing the syntactic structure of sentences showed that the model in the form of predicate structures is the most promising, because it not only describes the argument structure and the number of a predicate's actants, but also takes their semantic content into account through a semantic classification of predicates.

The topic of the final work will be the development of a parser for English texts based on algorithms for identifying syntactic groups; it will produce the syntactic structure of sentences in the form of predicate structures and improve the quality of subsequent semantic analysis. The predicate model is a path to text understanding, which is closely related to the identification of the predicate structures that characterize the meaning of a sentence, as well as of the chains of predicate structures that convey the meaning of the text [18].
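A predicate structure can be pictured as a record of the predicate, its semantic class and its actants with their roles. The role names, the class label and the example sentence below are illustrative only, not the thesis's actual inventory.

```python
# Illustrative predicate structure for "The teacher gives the student a
# book": the predicate, its (hypothetical) semantic class, and its actants
# keyed by role.

predicate_structure = {
    "predicate": "give",
    "class": "transfer",            # semantic class of the predicate
    "actants": {
        "agent": "teacher",
        "object": "book",
        "recipient": "student",
    },
}

def arity(ps):
    """Number of actants the predicate binds in this sentence."""
    return len(ps["actants"])

print(arity(predicate_structure))  # 3
```

A chain of such structures, one per sentence, is the representation through which, per [18], the meaning of a whole text is approximated.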

References

  1. Гладкий А.В. Синтаксические структуры естественного языка в автоматизированных системах общения. – М.: Наука, 1985. – 144 с.
  2. Ножов И.М. Морфологическая и синтаксическая обработка текста (модели программы). – М.: Наука, 2003 – 140 с.
  3. Автоматическая обработка текстов на естественном языке и компьютерная лингвистика: учеб. пособие / Большакова Е.И. и др. – М.: МИЭМ, 2011.
  4. Автоматическая Обработка Текста [Электронный ресурс]. – Режим доступа: http://www.aot.ru/technology....
  5. Taylor A., Marcus M., Santorini B. The Penn Treebank: The Overview // ARPA Human Language Technology Workshop, 1998. – P. 3–22.
  6. Толдова С.Ю., Соколова Е.Г., Астафьева И., Гарейшина А., Королева А., Привознов Д., Сидорова Е., Тупикина Л., Ляшевская О.Н. Оценка методов автоматического анализа текста 2011–2012: синтаксические парсеры русского языка // Компьютерная лингвистика и интеллектуальные технологии: По материалам ежегодной Международной конференции Диалог (Бекасово, 30 мая – 3 июня 2012 г.). Вып. 11 (18): В 2 т. Т. 2: Доклады специальных секций – М.: Изд-во РГГУ, 2012. – С. 77–90.
  7. Anisimovich K.V., Druzhkin K.Ju., Minlos F.R., Petrova M.A., Selegey V.P., Zuev K.A. Syntactic and semantic parser based on ABBYY Compreno linguistic technologies // Компьютерная лингвистика и интеллектуальные технологии: По материалам ежегодной Международной конференции Диалог (Бекасово, 30 мая–3 июня 2012 г.). Вып. 11 (18): В 2 т. Т. 2: Доклады специальных секций – М.: Изд-во РГГУ, 2012. – С. 91–103.
  8. Iomdin L., Petrochenkov V., Sizov V., Tsinman L. ETAP parser: state of the art // Компьютерная лингвистика и интеллектуальные технологии: По материалам ежегодной Международной конференции Диалог (Бекасово, 30 мая–3 июня 2012 г.). Вып. 11 (18): В 2 т. Т. 2: Доклады специальных секций – М.: Изд-во РГГУ, 2012. – С. 119–131.
  9. Antonova A.A., Misyurev A.V. Russian dependency parser SyntAutom at the DIALOGUE – 2012 parser evaluation task // Компьютерная лингвистика и интеллектуальные технологии: По материалам ежегодной Международной конференции Диалог (Бекасово, 30 мая–3 июня 2012 г.). Вып. 11 (18): В 2 т. Т. 2: Доклады специальных секций – М.: Изд-во РГГУ, 2012. – С. 104–118.
  10. Каневский Е.А., Боярский К.К. Семантико-синтаксический анализатор SemSin [Электронный ресурс]. – Режим доступа: http://www.dialog-21.ru/digests/dialog2012/materials/pdf/Kanevsky....
  11. Загнітко А.П. Теоретична граматика української мови: Синтаксис: Монографія. Донецьк: ДонНУ, 2001. – 662 с.
  12. Вихованець І.Р. Частини мови в семантико-граматичному аспекті / І.Р. Вихованець. – К.: Наук. думка, 1988. – 256 с.
  13. Ермоленко Т.В. Синтаксическая модель предложения русского языка на основе предикатных структур // Искусственный интеллект. – 2012. – № 3. – С. 126–136.
  14. Харламов А.А., Ермоленко Т.В. Разработка компонента синтаксического анализа предложений русского языка для интеллектуальной системы обработки естественно-языкового текста // Программная инженерия № 7, 2013. С. 37–47.
  15. Бондаренко Е.А. Принципы автоматической обработки естественно-языковых текстов: валентностный подход / Е.А. Бондаренко, О.А. Каплина // Искусственный интеллект. – 2013. – N 1. – С. 80–90.
  16. Харламов А.А. Метод выделения главных членов предложения в виде предикативных структур, использующих минимальные структурные схемы / А.А Харламов, Т.В. Ермоленко, Г.В. Дорохина, Д.С. Гнитько // Речевые технологии. – 2012. – № 2. – С. 75–85.
  17. Дорохина Г.В. Автоматическое выделение синтаксически связанных слов простого распространенного неосложненного предложения / Г.В. Дорохина, Д.С. Гнитько // Сучасна інформаційна Україна: інформатика, економіка, філософія: матеріали доповідей конференції, 12 – 13 травня 2011 року, Донецьк, 2011. Т. 1. – С. 34–38.
  18. Alexander A. Kharlamov, Tatyana V. Yermolenko, Andrey A. Zhonin Text Understanding as Interpretation of Predicative Structure Strings of Main Text’s Sentences as Result of Pragmatic Analysis (Combination of Linguistic and Statistic Approaches) // Speech and Computer 15th International Conference, SPECOM 2013, Pilsen, Czech Republic, September 2013. Proceedings. – P. 333–339.

Important note

The master's work is not yet completed. It will be finished in December 2014. The full text of the work and related materials can be obtained from the author or her adviser after that date.