DonNTU
Department CMS
www.ksm.donntu.ru
Master of DonNTU Christina Larionova


Faculty: Computers and Information Science
Speciality: Computer Ecology-Economic Monitoring
Department: Monitoring Computer Systems
Theme of master's work:

Methods of encoding arbitrary information
on the base of linguistic resources in computer texts


Scientific supervisor: Ph.D Nataliya Y. Gubenko
Abstract
Master's Qualification Work


«Methods of encoding arbitrary information on the base of linguistic resources in computer texts»
Introduction

Researchers today focus mainly on stegosystems that hide data in image, audio and video files, where imperceptible modifications of the carrier can hold a relatively large amount of information. Such systems have one drawback, however: users find it hard to come up with a pretext for regularly exchanging their own original photographs or other stegocontainers.

The hidden message can instead be embedded in the text itself. However, almost all text stegosystems known so far are fragile. Primitive algorithms based on selecting certain words from the text, inserting meaningless spaces and so on can be exposed, if not fully broken, by statistical analysis, or they noticeably distort the meaning and style of the text [1].

Linguistic steganography is the discipline of encoding arbitrary binary information in texts with the help of linguistic resources. The carrier text must remain outwardly «harmless» and meaningful. Encoding is done by replacing a word with another synonym from the synonym group that contains the original word. If the synonyms are absolute, the substitution is made regardless of context. For relative synonyms, every candidate synonym and homonym of the word to be replaced is tested for compatibility with the context. Compatibility means that the candidate can enter the same word combinations (collocations) as the word it replaces. The supporting linguistic resources are therefore a specially prepared synonym dictionary and an extensive database of Russian collocations.


Topical issue

In the last few years linguistic steganography has attracted much attention. Understanding the meaning of words, assembling individual words into meaningful information, responding to humor, symbolism and ambiguity: all of these are still the privilege of the human mind and have no analogue in the world of computers.

Computers cannot learn languages, and speech recognition programs only measure the frequencies of the audible signal and compare them with predefined speech sounds. However much effort is invested in teaching a computer to understand the meaning of words, artificial intelligence is (at least today) far from human. Character recognition is the basis of a popular security method. For example, webmail systems (Hotmail, Yahoo) ask the user to type in a few letters shown in an image on the screen. This method is known as "protection against robots": it protects the system from programs that automatically register email addresses for later spamming, because such programs cannot recognize letters in images. Specialists in artificial intelligence are aware of many unsolved problems in this area, all because no way has yet been found to teach a computer to think intuitively.

At the current stage of its development, linguistic steganography relies on dictionaries of (quasi-)synonyms.

The entire vocabulary of the language is divided into many groups of different sizes. The words inside a group are close both grammatically (they belong to one part of speech) and semantically, up to synonymy. If the next word of the carrier text belongs to a group of m > 1 synonyms, it can hide approximately log2(m) bits of information. For simplicity, assume that all group sizes are powers of two, i.e. each group contains 2^n members, where n = 1, 2, 3, ..., and that the words within a group are numbered in advance, i = 0, ..., 2^n - 1. During encoding, the next n-bit chunk of the message is interpreted as the in-group number i of the synonym that replaces the original word in the text. The text is then searched for the next word that has synonyms, that word is replaced in the same way, and so on until either the message or the carrier text is exhausted [2].

The receiving end uses the same dictionary of quasi-synonyms. For each successive dictionary word in the text, it determines the exponent n of the word's synonym group and appends the word's in-group number, written as an n-bit binary string, to the recovered bit sequence.
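The encoding and decoding procedure described above can be sketched in a few lines of Python. This is only a minimal illustration, not the author's implementation: the dictionary GROUPS, the English sample words and the function names embed and extract are hypothetical, and the sketch assumes that every group size is a power of two and that the receiver knows the message length in bits.

```python
# Hypothetical toy dictionary of quasi-synonym groups (not from the source).
# Each group has 2**n members; a word's position in its list is its number i.
GROUPS = [
    ["big", "large", "huge", "great"],  # 4 members -> n = 2 bits
    ["quick", "fast"],                  # 2 members -> n = 1 bit
]
WORD_TO_GROUP = {word: group for group in GROUPS for word in group}


def embed(words, bits):
    """Replace dictionary words so that their in-group numbers spell `bits`."""
    out, pos = [], 0
    for word in words:
        group = WORD_TO_GROUP.get(word)
        if group is None or pos >= len(bits):
            out.append(word)          # no synonyms, or message exhausted
            continue
        n = len(group).bit_length() - 1             # group size is 2**n
        chunk = bits[pos:pos + n].ljust(n, "0")     # zero-pad the last chunk
        pos += n
        out.append(group[int(chunk, 2)])
    return out


def extract(words, nbits):
    """Read the in-group number of every dictionary word as n binary digits."""
    bits = ""
    for word in words:
        group = WORD_TO_GROUP.get(word)
        if group is not None:
            n = len(group).bit_length() - 1
            bits += format(group.index(word), f"0{n}b")
    return bits[:nbits]               # receiver is assumed to know nbits


cover = "a big dog is quick and a large cat is fast".split()
stego = embed(cover, "10110")
print(" ".join(stego))
print(extract(stego, 5))
```

Note that the sketch performs no context check at all, so it exhibits exactly the weakness discussed in the text: the substituted words may not fit their surroundings.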

Occasionally the number to be hidden coincides with the in-group number of the word already in the text, and then no replacement is needed. With large groups of quasi-synonyms this happens rarely, however, and unrestricted in-group replacement can not only alter the meaning of the text but also make it semantically incoherent and therefore noticeably «suspicious».


Aims and objectives

Some developers of stegoalgorithms based on linguistic methods have solved most of these problems by combining machine and manual steganographic embedding with strong cryptography. They relied on the theoretical work on linguistic steganography by Bergmair and Katzenbeisser (2004) concerning machine recognition of stegotexts, and used Huffman codes to counteract statistical steganalysis.


Scientific novelty

The method proposed in this work is also based on replacing words of the text with synonyms, but each candidate replacement is tested against the context. If the replacement is possible in the context, which is defined by the user, the synonym is kept in the group of potential replacements; otherwise it is excluded from the group.

Word combinations (collocations) are understood here as pairs of syntactically connected and semantically compatible words, for example to express, to have time to answer and others (the grammatically related words were underlined in the original). A collocation may also contain a function word (usually a preposition); together the words form a chain of subordination, for example to transfer -> over -> the radio.

It is further assumed that a very large number (hundreds of thousands) of collocations of the language in question (here, Russian) has been collected in advance, regardless of their frequency or idiomaticity, into a collocation base. Synonyms are looked up in this base as potential components of collocations before they are used for steganographic purposes.
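The context test can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the miniature SYNONYMS dictionary, the COLLOCATIONS set, the English sample words and the function name admissible are hypothetical, whereas the real resources are a Russian synonym dictionary and a base of hundreds of thousands of collocations.

```python
# Hypothetical miniature resources (not from the source).
SYNONYMS = {"powerful": ["powerful", "strong", "mighty"]}
COLLOCATIONS = {
    ("powerful", "engine"), ("strong", "engine"),
    ("powerful", "wind"), ("strong", "wind"), ("mighty", "wind"),
    ("strong", "tea"),
}


def admissible(word, neighbour):
    """Return the synonyms of `word` that form a known collocation
    with the syntactically connected `neighbour`."""
    candidates = SYNONYMS.get(word, [word])
    return [s for s in candidates if (s, neighbour) in COLLOCATIONS]


print(admissible("powerful", "wind"))    # all three synonyms survive
print(admissible("powerful", "engine"))  # "mighty" is filtered out
```

Only the synonyms that survive this filter would remain in the group of potential replacements, so the size of the group, and hence the number of bits a word can carry, depends on the context.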


The practical value of the work

This work will produce a steganographic algorithm that can be used to hide information in various kinds of text: virtual chat rooms, e-mail, forums, etc. In fact, it can be used wherever some information needs to be encoded into text.


Review of existing developments and research on the topic

At the current stage of development of linguistic steganography, several ways of hiding information are already known.


Semagrams

A semagram is a way of hiding information by means of signs or symbols. A hand gesture, objects placed on a table in a certain order, characteristic changes in the design of a website: all these are semagrams. Such signs are inconspicuous and quite ordinary in the modern world. Sometimes visual semagrams are the only way to communicate with friends and colleagues. You only need to agree on such signs in advance so that you can resort to them in moments of danger.

Text semagrams are messages hidden within a text. Capital letters, underlining, peculiarities of handwriting, and spacing between letters and words can all be used to transmit a message. Pre-arranged associations can also serve this purpose if only a very small amount of information has to be transferred. Say you have agreed in advance with your friends that you will exchange harmless-looking e-mails such as weather forecasts. The phrase "prolonged sky clouds" could then signal alarm and a request for international assistance [3].


Open coding

Open coding means placing a message in the text of a letter in such a way that it does not catch the eye of a casual reader. Computers and people have different analytical abilities and recognize steganographic messages differently. The following examples may therefore "fail" if they are analyzed by a human: they use linguistic features of the text to deceive the algorithms of electronic filters and surveillance systems [4]. This is only a demonstration of the imperfection of "non-intelligent" computers. Do not use these examples to transmit sensitive data; they are suitable only for testing the effectiveness of filters.


Typos

Digital filters are tuned to specific words. But how many typos can be made in a word! It is not difficult to preserve the sense of a word while slightly changing its spelling. Look what can be done with the phrase "human rights".

Such variations are numerous. Of course, typing a whole text with words altered in this way is difficult, but the method can be used for particular words, namely those that filters catch [5].


Phonetics

A national filter usually "catches" words in the language predominantly used by residents of the country, and sometimes also in languages that are widespread or common on web sites (English, French). Of course, nothing can be said with certainty about the accuracy of a programmed filter, but phonetically similar words can be used to probe it [6]. This method is most suitable if you use a script that differs from the one accepted in your country (e.g. Latin letters instead of Russian ones).


Jargon

Jargon in a text can baffle an outside reader. The distorted meanings of informal and slang words can hide the content of a letter. It is better to choose words that keep the carrier text clear and readable even if the jargon is taken "as is" [7]. The choice is limited to concepts familiar to all participants in the correspondence.


Figure 1 - Scheme of sections linguistic steganography

Hidden coding

This is a special case of hiding text in an openly accessible message. Sometimes the technique is quite simple.

The advantage of this method is that the "message-container" remains semantically coherent as a letter from one person to another, arouses no suspicion and does not prompt anyone to look for another meaning in it [8].

Another type of hidden coding uses a special formula to extract the hidden message from the message-container [9].

ECIW 2007, the sixth European conference on information warfare and information security, took place in July 2007. It was attended by speakers from Europe and various NATO countries, as well as Israel, Malaysia and China. In addition to issues of global economic security, counter-terrorism, psychological operations and propaganda, and modern methods of protecting and destroying infrastructure in military conflicts, several interesting reports on cryptography and steganography were presented.

Unfortunately, the conference materials are available only for a fee, in hard copy or on CD-ROM, and cannot be posted on the Internet without the consent of the authors. One of the interesting reports is in the public domain, however, and deserves special attention: "Lexical Natural Language Steganography Systems with Human Interaction" by K. Wouters, B. Wyseur and B. Preneel of the Electrical Engineering Department (ESAT) of the Katholieke Universiteit Leuven, Belgium.

As the adversary, the authors consider not only a detector program but also a trained person (e.g. a linguist) who tries to spot any suspicious or unnatural dialogue between the participants that would indicate the presence of a stego channel.

An IRC chat was chosen as the test environment for the protocol: a large number of people can communicate simultaneously, and the notional users Alice and Bob do not send messages directly to each other but address only other users. This makes it impossible to establish at once that there is direct contact between them; moreover, they can make their chat sessions more anonymous by using Tor.

It is assumed that Alice and Bob know each other's nicknames and public keys, as well as the chat channel used for communication. When they enter the chat, Alice places the secret message M in a local queue and announces her readiness to transmit it to Bob. Bob confirms his readiness to receive it. Using the Diffie-Hellman protocol, they agree on a secret key K, which is used to generate substitution tables of synonyms from the pre-selected dictionary. The secret key K is also used to derive a session key S_k for encrypting the message M. Encryption is done with a stream cipher (RC4) so that Bob can decrypt the incoming hidden message immediately, byte by byte.
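The key agreement and byte-by-byte decryption can be sketched as follows. This is a toy illustration under stated assumptions, not the protocol of the cited paper: the Diffie-Hellman parameters P and G and the private exponents are far too small for real use, and the source does not say how S_k is derived from K, so the SHA-256 step is purely an assumption. RC4 itself is implemented in its standard form (key scheduling followed by keystream generation).

```python
import hashlib


def rc4_keystream(key: bytes):
    """Yield RC4 keystream bytes one at a time (KSA followed by PRGA)."""
    S = list(range(256))
    j = 0
    for i in range(256):                      # key-scheduling algorithm
        j = (j + S[i] + key[i % len(key)]) % 256
        S[i], S[j] = S[j], S[i]
    i = j = 0
    while True:                               # pseudo-random generation
        i = (i + 1) % 256
        j = (j + S[i]) % 256
        S[i], S[j] = S[j], S[i]
        yield S[(S[i] + S[j]) % 256]


def rc4(key: bytes, data: bytes) -> bytes:
    """XOR data with the keystream; the same call encrypts and decrypts."""
    return bytes(b ^ k for b, k in zip(data, rc4_keystream(key)))


# Toy Diffie-Hellman exchange. P, G and the private exponents are
# illustrative assumptions and far too small for real use.
P, G = 2**31 - 1, 7
a_priv, b_priv = 123456789, 987654321
A_pub, B_pub = pow(G, a_priv, P), pow(G, b_priv, P)
K_alice, K_bob = pow(B_pub, a_priv, P), pow(A_pub, b_priv, P)

# Assumed derivation of the session key S_k from K (not given in the source).
S_k = hashlib.sha256(K_alice.to_bytes(4, "big")).digest()

ciphertext = rc4(S_k, b"meet at noon")

# Bob decrypts incrementally, one byte as soon as it arrives.
stream = rc4_keystream(hashlib.sha256(K_bob.to_bytes(4, "big")).digest())
received = bytes(c ^ next(stream) for c in ciphertext)
print(received)
```

Because a stream cipher produces its keystream independently of the data, Bob never has to wait for a full block, which matches the byte-by-byte decryption described in the text.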

Every time Alice types text, a window pops up with a table of synonyms so that she can give the text a natural and grammatically correct form. In this simple way the system is protected against both automatic detection and a human observer. Only one bit of hidden text is sent per word. The researchers used the OpenOffice English dictionary. From the session key obtained after the Diffie-Hellman agreement a keystream S is produced, whose bits are interpreted in pairs: 0 ('00' in S), 1 ('11' in S) and NULL ('01' or '10' in S). NULL indicates that the word carries no bit, which lets the user choose the replacement freely and makes analysis even more difficult. Bits are then assigned to words by a deterministic algorithm that improves the distribution of bits with respect to the choice of synonyms and resistance to analysis. The authors have not yet fully developed this algorithm and hope to replace the primitive model with a more adaptable one.
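The pairwise interpretation of the keystream can be shown with a short sketch. Only the mapping itself ('00' -> 0, '11' -> 1, '01' or '10' -> NULL) comes from the text; the function name pair_symbols, the string representation of the keystream and the use of None for NULL are assumptions made for illustration, and the deterministic assignment of bits to particular synonyms is not modelled.

```python
def pair_symbols(keystream_bits: str):
    """Split a bit string into pairs and map '00' -> 0, '11' -> 1,
    and '01' or '10' -> None (the NULL symbol: the word carries no bit)."""
    table = {"00": 0, "11": 1}
    return [
        table.get(keystream_bits[i:i + 2])
        for i in range(0, len(keystream_bits) - 1, 2)
    ]


print(pair_symbols("0011011000"))
```

If the keystream bits are uniformly random, about half of the pairs map to NULL, which is consistent with the low capacity of the channel noted in the text.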

Despite a good distribution in the tables, the authors had to deal with difficulties, for example when the user was forced to choose the same synonym too often. The problem was solved by assigning bits not to a single word but to a set of words and by using Huffman codes, but this reduced the capacity of the stego channel even further.

The limitations of the system stem not only from its low capacity but also from the fact that statistical analysis can detect the use of synonyms uncharacteristic of a person's speech (if that person's identity is known). In addition, even a simple choice among several synonyms can lead to grammatical errors and requires extra care from the user to correct the resulting text.

Nevertheless, the system is well suited for covert transmission of short messages in chat rooms. The X-Chat program was chosen as the test platform; a plugin was written for the steganographic functions, and the OpenSSL library was used for the cryptographic calculations. In the future the authors plan to strengthen their system with tables adjusted to the individual style of speech, error-correcting codes against an active attacker, and support for typographical errors and IRC slang. The work was partly funded by the Institute for the Promotion of Innovation through Science and Technology and by the Interdisciplinary Institute for BroadBand Technology, founded by the Flemish Government in 2004 [10].

Linguistic steganography (LS) is a young field, but it is already well represented at international conferences on information hiding and information security. Unfortunately, the results of these studies are not yet well known in Ukraine, and at DonNTU in particular.


Approbation

The results of the work were presented at the V International Conference of Students, Postgraduates and Young Scientists «Computer monitoring and information technology» (CMIT 2009) and published in its proceedings.


Description and future results

The proposed steganographic algorithm has two inputs.

- Binary-coded information to be hidden.

- The original carrier text in Russian, whose length is at least about 200 times the length of the hidden information.

The algorithm includes the following steps.

1. Search for synonymous words.

The text is scanned to identify the words and fixed multiword expressions that are present in the system's synonym dictionary.

2. Forming combined synonym groups.

The synonymous words are considered one after another. If a synonym group contains only absolute synonyms, the group is accepted as a whole without any additional checks. If the group contains at least one relative synonym, all of its synonyms serve as the starting set for a transitive closure operation.

3. Checking the combined groups against collocations.

If the group contains only absolute synonyms, it is not checked against the context, and any of the synonyms may be used when checking the compatibility of other words. If the group contains relative synonyms, each of them must be checked for compatibility with the content words to the left and right of the word being tested.

4. Encoding.

The sequence of filtered synonym groups is scanned from left to right. Suppose first that all group sizes are powers of two, or are reduced to the nearest lower power of two. Then for the next group of size 2^n, a chunk of n bits is cut from the message, and its binary value is taken as the in-group number of the synonym used in place of the original word. If there are groups whose sizes are not powers of two, the sizes of all groups are multiplied and the largest power 2^N not exceeding the product is taken. A chunk of N bits is then cut from the message, and successive divisions by the group sizes yield remainders that serve as the numbers of the synonyms to substitute; this is done for each sentence separately.

5. Re-agreeing the words with the context.

If a word was replaced by a synonym during encoding, the replacement generally has to be re-inflected so that it agrees morphologically and syntactically with its context.
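The mixed-size case of step 4 (group sizes that are not powers of two) amounts to writing an N-bit number in a mixed-radix system. The sketch below is a minimal illustration of that idea, not the author's code: the function names encode_mixed and decode_mixed and the sample group sizes are hypothetical, and the order in which the divisions are performed (first group first) is an assumption, since the text does not fix it.

```python
from math import prod


def capacity_bits(sizes):
    """Largest N such that 2**N does not exceed the product of group sizes."""
    return prod(sizes).bit_length() - 1


def encode_mixed(sizes, bits):
    """Cut N bits from the message and turn them into one synonym number
    per group by successive division, keeping the remainders."""
    n_bits = capacity_bits(sizes)
    chunk = bits[:n_bits].ljust(n_bits, "0")   # zero-pad a short last chunk
    value = int(chunk, 2) if n_bits else 0
    indices = []
    for m in sizes:
        value, remainder = divmod(value, m)
        indices.append(remainder)
    return n_bits, indices


def decode_mixed(sizes, indices):
    """Rebuild the N-bit chunk from the synonym numbers found in the text."""
    n_bits = capacity_bits(sizes)
    value = 0
    for m, remainder in zip(reversed(sizes), reversed(indices)):
        value = value * m + remainder
    return format(value, f"0{n_bits}b") if n_bits else ""


sizes = [3, 5, 2]                 # hypothetical group sizes in one sentence
n_bits, indices = encode_mixed(sizes, "1011")
print(n_bits, indices)
print(decode_mixed(sizes, indices))
```

For the sizes 3, 5 and 2 the product is 30, so one sentence carries 4 bits; rounding each group down to a power of two separately (2, 4 and 2) would also give 1 + 2 + 1 = 4 bits here, but for other combinations, such as three groups of size 3, the joint encoding gains a bit (4 instead of 3).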



Figure 2 - Illustration of the algorithm scheme, based on the linguistic resources
(figure is animated; loops = 7; size = 10Kb)

Since the original fragment is 206 bytes and the hidden information is 1.5 bytes, the latter is 0.73% of the carrier text. This value is called the steganographic density. It is small, but a sufficiently long text can hide quite meaningful messages; they only need to be about 200 times shorter than the carrier text.
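The density figure can be checked with a few lines of arithmetic. The variable names are chosen for illustration only; the two byte counts are the ones given in the text.

```python
carrier_bytes = 206      # size of the original text fragment
hidden_bytes = 1.5       # size of the hidden information (12 bits)

density = hidden_bytes / carrier_bytes
ratio = carrier_bytes / hidden_bytes

print(f"steganographic density: {density:.2%}")
print(f"carrier is about {ratio:.0f} times longer than the payload")
```

For this particular fragment the carrier is roughly 137 times longer than the payload, which is of the same order as the approximate 200-fold figure quoted for the method in general.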


Conclusions

At this stage the text-hiding algorithm has been developed, the synonym dictionary has been selected, and work on improving the algorithm is in progress. The algorithm is designed for Russian, but in the future it is planned to extend it to a second language, English. This would make it possible to analyze and compare which of the languages is more vulnerable to distortion by synonym-substitution algorithms.

Such applications have already attracted to linguistics the attention of communities far removed from science: software distributors (who need to hide a unique sales number in the product delivered to the buyer), brokers (who need to communicate secretly a change in some rate or rating), diplomats and officials (who need to identify the source of a leak of sensitive information).

In the future, programs will be created that can produce readable texts with other texts hidden in them, using dictionaries and the ambiguity of substitutions. However, experts cannot yet say with certainty whether computers will be able to create readable texts from scratch and hide a message in them using the semantics of language.


Literature

1. Bolshakov I.A. Using synonyms restricted by context collocations for the purposes of linguistic steganography (article, in Russian) / I.A. Bolshakov, 2004. 7 p.

2. Wyseur B. Lexical Natural Language Steganography Systems with Human Interaction (article) / Brecht Wyseur, Karel Wouters, Bart Preneel, 2007. 8 p.

3. Digital Security and Privacy for Human Rights Defenders [Electronic resource]: http://equalit.ie/secbox/esecman/russian/chapter2_8.html

4. OpenPGP in Russia [Electronic resource]: http://www.pgpru.com/novosti/2007/

5. Bolshakov I.A. Two methods of synonymous paraphrasing in linguistic steganography (article, in Russian) [Electronic resource]: http://www.dialog-21.ru/Archive/2004/Bolshakov.htm

6. Information Analysis journal (in Russian), issue 8, 2004: http://www.viniti.ru/cgi-bin/nti/nti.pl?action=show&year=2_2004&issue=8&page=23

7. VINITI Electronic Library [Electronic resource]: http://www.viniti.ru/cgi-bin/nti/nti.pl?action=search&query=%F8

8. An effective system for steganography via chat has been developed (online article, in Russian) [Electronic resource]: http://www.itsec.ru/newstext.php?news_id=38490

9. Lukashevich N.V. Automated analysis of multiword expression for computational dictionaries (article) / N.V. Lukashevich, B.V. Dobrov, D.S. Chuiko

10. Bolshakov I.A. Electronic dictionaries: for people and computers (article, in Russian) / I.A. Bolshakov, A.F. Gelbukh, S.N. Galicia-Haro

Important note

The abstract of the Master's work is not yet complete. Final completion is due on 1 December 2009. The full text of the work and materials on the topic can be obtained from the author or her supervisor after that date.


© Ch. Larionova 2009