Techniques used in Natural Language Processing

Source: http://www.doc.ic.ac.uk/~nd/surprise_97/journal/vol2/ctt

Introduction

In the first article, we briefly went through an overview of some simple techniques used in Natural Language Processing. However, those techniques are too simplistic to be effective and are thus seldom used in actual systems. In real life, different applications require different depths of NLP, and many more efficient and specific techniques have been developed to cater to these different needs. In this article, we will take a look at some of the more practical methods used in real-life applications.

Syntactically Driven Parsing

Syntactically driven parsing, as the name implies, deals with the syntax of the input utterances. It works by first constructing a complete syntactic analysis of the input utterance and then creating from it an internal representation that is easily understood by the computer. As brought out in the first article, a simple form of syntactic analysis is to create a parse tree using a context-free grammar (CFG). However, as was also pointed out, this form of analysis faces some serious problems, e.g. number agreement and passive sentences. The following techniques have been developed as an answer to some of these problems.
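
To make the starting point concrete, here is a minimal sketch in Python of parsing with a context-free grammar. The toy grammar, lexicon and naive top-down parser are illustrative assumptions, not the grammar from the first article.

    # A toy CFG and a naive recursive-descent parser over it.
    GRAMMAR = {
        "S":  [["NP", "VP"]],
        "NP": [["DET", "N"]],
        "VP": [["V", "NP"], ["V"]],
    }
    LEXICON = {"the": "DET", "cat": "N", "rice": "N", "ate": "V", "eats": "V"}

    def parse(symbol, words, pos=0):
        """Expand `symbol` at words[pos]; return (tree, next_pos) or None."""
        if symbol in set(LEXICON.values()):        # pre-terminal: match one word
            if pos < len(words) and LEXICON.get(words[pos]) == symbol:
                return (symbol, words[pos]), pos + 1
            return None
        for production in GRAMMAR[symbol]:         # try each rule in order
            children, p = [], pos
            for child in production:
                result = parse(child, words, p)
                if result is None:
                    break                          # this rule fails; try the next
                subtree, p = result
                children.append(subtree)
            else:                                  # every child matched
                return (symbol, children), p
        return None

    tree, end = parse("S", "the cat ate the rice".split())
    # end == 5 (all words consumed); tree nests as
    # ('S', [('NP', ...), ('VP', [('V', 'ate'), ('NP', ...)])])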

Transformational Grammar

Transformational grammar is basically an extension of context-free grammar. A parse tree is first generated using the CFG, and a set of transformation rules is then used to map this syntax tree into a related parse tree. For example, a basic parse tree can be created using the context-free grammar with tags for the plurality or passiveness of the sentence. These tags then activate the relevant transformation rules, which rearrange the parse tree in such a way that the required agreement is transmitted to all parts of the tree. This, however, is not such a practical method, because the process is simply too time-consuming and computationally costly.
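
As an illustration, the sketch below applies one such transformation rule to the nested-tuple trees produced by the toy parser above, rearranging an active tree into its passive counterpart. The rule shape and the morphology lookup are assumptions for demonstration, not the article's notation.

    # One transformation rule: S(NP1, VP(V, NP2)) -> S(NP2, was V-en by NP1).
    def passivize(tree):
        label, children = tree
        if label != "S" or len(children) != 2:
            return tree                            # rule does not apply
        np1, (_, vp_children) = children
        if len(vp_children) != 2:                  # only transitive VPs qualify
            return tree
        v, np2 = vp_children
        new_vp = ("VP", [("AUX", "was"), past_participle(v),
                         ("PP", [("P", "by"), np1])])
        return ("S", [np2, new_vp])

    def past_participle(v):
        """Toy morphology table (a hypothetical stand-in for a real lexicon)."""
        return ("V", {"ate": "eaten"}.get(v[1], v[1] + "en"))

    active = ("S", [("NP", [("DET", "the"), ("N", "cat")]),
                    ("VP", [("V", "ate"),
                            ("NP", [("DET", "the"), ("N", "rice")])])])
    print(passivize(active))
    # ('S', [the-rice NP, ('VP', [('AUX', 'was'), ('V', 'eaten'),
    #                             ('PP', [('P', 'by'), the-cat NP])])])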

Augmented Transition Network (ATN)

A transition network is a network consisting of nodes and labelled arcs. The nodes represent the different states of a process and the arcs represent the transitions from state to state, with the labels referring to word categories in NLP. A Recursive Transition Network (RTN) is a transition network whose arc labels may refer to other networks as well as to word categories; it has the same descriptive power as context-free grammars. To extend the power of the RTN and of transformational grammar, the Augmented Transition Network (ATN) was developed as a method of expressing syntactic grammars that is computationally tractable and can capture linguistic generalizations in a concise way. Basically, an ATN consists of an RTN augmented by a set of tests that must be satisfied before an arc is traversed, and a set of registers that can be used to save intermediate results or global state. To demonstrate the power of the ATN, let us consider the same example from article 1, but in the passive form, which the CFG is unable to resolve: "The rice was eaten by the cat." The figure and table below show an example ATN and the tests and actions for traversing each arc, respectively.
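
In lieu of the figure, the behaviour of such a network can be sketched as a small interpreter. This is a minimal sketch: arc numbers and node names follow the walkthrough below (the intermediate node C is assumed), noun phrases are pre-chunked, the category lookup is a flat table, and there is no backtracking, whereas a real ATN recursively invokes an NP sub-network and backtracks.

    # A minimal ATN interpreter: an arc is (source, label, test, actions,
    # target); `registers` holds the global state the article describes.
    def run_atn(arcs, finals, tokens, categories):
        registers, node = {}, "A"
        for token in tokens:
            for src, label, test, actions, dst in arcs:
                if (src == node and label == categories[token]
                        and test(registers, token)):
                    actions(registers, token)      # augmentation: run the actions
                    node = dst
                    break
            else:
                return None                        # no arc accepts this token
        return registers if node in finals else None

    T = lambda regs, tok: True                     # the article's 'T': no test

    def np_actions(regs, tok):                     # actions on arc 2
        regs["SUBJ"], regs["TYPE"] = tok, "DECLARATIVE"

    def v_actions(regs, tok):                      # actions on arc 3
        regs["V"] = tok

    # Two arcs already accept "The cat eats." (D is a final node).
    arcs = [("A", "NP", T, np_actions, "C"),       # arc 2
            ("C", "V",  T, v_actions,  "D")]       # arc 3, agreement test omitted
    cats = {"the cat": "NP", "eats": "V"}          # noun phrase pre-chunked
    print(run_atn(arcs, {"D"}, ["the cat", "eats"], cats))
    # {'SUBJ': 'the cat', 'TYPE': 'DECLARATIVE', 'V': 'eats'}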

"The rice" In order to parse the above sentence, we start with the left most node (node A) in the ATN. From here, there's two paths available to be traversed, arc 1 and arc 2. However, since the first part of the sentence, "The rice", is a noun phase (NP), arc 2 is chosen. But before arc 2 is traversed, Table 1 is checked for the additional tests and actions needed to be performed. In this case, a T in the Test entry means that there's no additional tests. The actions performed are to set the subject (SUBJ) register to the thing being parsed, ie rice, and the TYPE register to DECLARATIVE.

"was" The next word "was" is treated as a verb in this case and is parsed through arc 3 to node D. The additional test here is to check that the verb agrees with the subject, rice in this case. This is how agreements are enforced in ATN. On passing the test, the verb (V) register is set. Note that node D has a line through it, indicating that this could be the end of the parse if there is no input left. Thus sentences like "The cat eats." can be accepted. Since there is more inputs available, the parse continues,

"eaten" From node D, there is again two paths to choose from. Arc 4 is taken because "eaten" is a verb. Additional tests on arc 5 requires that the input is a past particle, in which eaten satisfies. The second test requires that the content of the register is to be, in which was satisfies as it is a form of to be.Four sets of actions are taken here. Firstly, the content of the object (OBJ) register is substituted by the contents of the subject (SUBJ) register. Next the content of the verb register is overwritten with the past participle form, was eaten. A flag, which indicates that the sentence is passive, is then set to TRUE and a placeholder SOMEONE is put in the SUBJ register.

"by" From node E, arc 7 is chosen. The passive flag is checked and set to false before traversing the arc to node F.

"the cat" The last part of the sentence is simply another noun phrase the cat and thus arc 8 is traversed, putting the cat as the subject. Since there is no more input left, the process finally terminates at node E. This simple example illustrates ATN's power and ability in capturing the regularities of language in a concise and elegant way. An example of a system based on ATN grammar is the RUS system developed by Bobrow amd Webber in 1980.