
EFFICIENT TEXT COMPRESSION USING SPECIAL CHARACTER REPLACEMENT AND SPACE REMOVAL

Authors: Chakraborty D., Ghosh S., Mukherjee J.
Source: International Journal of Computer Engineering and Technology (IJCET), ISSN 0976 - 6367 (Print), ISSN 0976 - 6375 (Online), Volume 1, Number 2, Sept - Oct (2010), pp. 38-46

Abstract

In this paper, we propose a new text compression/decompression algorithm based on a special character replacement technique. After the initial compression obtained by replacing special characters, we remove the spaces between words in the intermediary compressed file in specific situations to obtain the final compressed text file. Experimental results show that the proposed algorithm is simple to implement, fast in encoding time and high in compression ratio, and gives better compression than existing algorithms such as LZW and existing software such as WINZIP 10.0 and WINRAR 3.93.

Keywords: Lossless compression; Lossy compression; Non-printable ASCII value; Special character; Index; Symbols.

INTRODUCTION

As evident from the name itself, data compression is concerned with the compression of a given set of data [5,6,8]. The primary reason for doing so is to reduce the storage space required to save the data, or the bandwidth required to transmit it. Although storage technology has developed significantly over the past decade, the same cannot be said for transmission capacity. As a result, the concept of compressing data becomes very important. Data compression, or source coding, is the process of encoding information using fewer bits (or other information-bearing units) than an unencoded representation would use, through the specific use of encoding schemes. It follows that the receiver must be aware of the encoding scheme in order to decode the data to its original form. The compression schemes that are designed are basically trade-offs among the degree of data compression, the amount of distortion introduced and the resources (software and hardware) required to compress and decompress the data [5,9].

Data compression schemes may broadly be classified into (1) lossless compression and (2) lossy compression. Lossless compression algorithms usually exploit statistical redundancy in such a way as to represent the sender's data more concisely without error. Lossless compression is possible because most real-world data has statistical redundancy. Another kind of compression, called lossy data compression, is possible if some loss of fidelity is acceptable. It is important to note that in lossy compression the original data cannot be reconstructed from the compressed data, due to rounding off or removal of some parts of the data as a result of redundancies. Lossy compression is also widely used in image compression [10,11,12,13]. The theoretical background of compression is provided by information theory and by rate-distortion theory. There is a close connection between machine learning and compression: a system that predicts the posterior probabilities of a sequence given its entire history can be used for optimal data compression (by using arithmetic coding on the output distribution), while an optimal compressor can be used for prediction (by finding the symbol that compresses best, given the previous history). This equivalence has been used as justification for using data compression as a benchmark for "general intelligence".

Here we focus on the compression of text. Various algorithms have been proposed for text compression [1,2,3,4,7]. We propose an efficient text compression algorithm that yields better compression than the existing Lempel-Ziv-Welch algorithm and existing software such as WinZip 10.0 and WinRAR 3.93, while ensuring that the compression is lossless. Our proposed algorithm is based on a systematic special character replacement technique.

The rest of this paper is organized as follows. Section 1 presents the concept of special character replacement; Section 2 describes the creation and maintenance of the dynamic dictionary; Section 3 describes the removal of spaces between special symbols in the intermediary compressed file; Section 4 gives the proposed algorithm; Section 5 presents the experimental results; and Section 6 concludes the paper.

1. SPECIAL CHARACTER REPLACEMENT

In the proposed algorithm we replace every word in a text with an ASCII character. The extended ASCII set contains two hundred and fifty-four (254) characters. Among these, some represent NULL, Space, Linefeed or the English alphabet. Neglecting them, one hundred and eighty-four (184) ASCII characters are used in this proposed algorithm. One- or two-letter English words in the text file are not replaced with an ASCII character; a non-printable ASCII character replaces only words having more than two letters. For example, the word 'of' remains the same, whereas the non-printable ASCII character with index '1' replaces the word 'name'. Whenever a new word is found, we maintain an index (an integer) and substitute the corresponding special ASCII character for the word in the compressed text file. When the word is repeated in the file, it is replaced by the same ASCII value assigned to it previously.

In this algorithm, we use one hundred and eighty-four symbols for the first one hundred and eighty-four words. Once the number of words exceeds this value, we combine the ASCII characters to generate new symbols for the new words in the text file. When a space is encountered between two words, it is replaced with the integer '0'. To mark the end of a statement the symbol '9' is used, so that the termination of a sentence can be identified while decompressing the text file from the compressed file. For example, suppose there is a line of text: My name is Debashis Chakraborty.

Assuming this is the first sentence in the file, and following the proposed algorithm, the words 'My' and 'is' are kept unchanged in the compressed file. 'name' being the first word to be compressed, we assign the index '1' to it and replace the word with the ASCII character for '1'. A similar process is repeated for the other words whose length is greater than two. We also replace each space between words with '0' and the '.' with '9'. Therefore the corresponding compressed sentence for the above example is: My0$0is0#0&9

where '$', '#' and '&' are non-printable ASCII characters for the integers '1' for 'name', '2' for 'Debashis' and '3' for 'Chakraborty' respectively, each occupying one byte of memory. The original line of text occupies 32 bytes, whereas the compressed line occupies 12 bytes. Thus the proposed method gives substantial compression of text, resulting in better transmission bandwidth management and less storage.
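To make the replacement pass concrete, the following is a minimal C sketch of this step applied to the example sentence. The fixed three-entry dictionary and the printable placeholder symbols '$', '#' and '&' (standing in for the paper's non-printable codes for indices 1, 2 and 3) are assumptions made only for the demonstration.

/* Minimal sketch of the special character replacement pass.
 * Assumption: printable placeholders stand in for non-printable codes. */
#include <stdio.h>
#include <string.h>

struct entry { int index; char symbol; char word[32]; };

static const char placeholders[] = "$#&";   /* stand-ins for codes 1..3 */
static struct entry dict[3];                /* demo-sized dictionary    */
static int dict_size = 0;

/* Return the symbol for a word, adding the word to the dictionary
 * with the next free index if it has not been seen before. */
static char symbol_for(const char *word)
{
    int i;
    for (i = 0; i < dict_size; i++)
        if (strcmp(dict[i].word, word) == 0)
            return dict[i].symbol;          /* repeated word: reuse   */
    if (dict_size >= (int)(sizeof dict / sizeof *dict))
        return '?';                         /* demo table full        */
    dict[dict_size].index = dict_size + 1;
    dict[dict_size].symbol = placeholders[dict_size];
    strcpy(dict[dict_size].word, word);
    return dict[dict_size++].symbol;
}

int main(void)
{
    char buf[] = "My name is Debashis Chakraborty.";
    char *word;

    for (word = strtok(buf, " "); word; word = strtok(NULL, " ")) {
        size_t len = strlen(word);
        int end = (len > 0 && word[len - 1] == '.');
        if (end)
            word[--len] = '\0';             /* strip the full stop    */

        if (len <= 2)
            printf("%s", word);             /* short words kept as-is */
        else
            putchar(symbol_for(word));      /* longer words replaced  */

        putchar(end ? '9' : '0');           /* '.' -> '9', space -> '0' */
    }
    putchar('\n');                          /* prints: My0$0is0#0&9   */
    return 0;
}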

2. CREATION OF DYNAMIC DICTIONARY

Text compression algorithms should always be lossless, preventing any loss of information: the text file regenerated from the compressed file must be identical to the original file. All text compression algorithms maintain a dictionary containing the words that appear in the text file. The text file is regenerated from the compressed file with the help of this dictionary. Dictionaries maintained can be either static or dynamic.

In the proposed algorithm a dynamic dictionary is used. We maintain a table containing the fields 'Index', 'Symbol' and 'Word' to form the dictionary. Initially the table is empty. When a word to be compressed is encountered in the text file, we check whether the word already exists in the table. Every time a new word is found, we assign an integer value to it and tabulate its special symbol, using a single non-printable ASCII character or a combination of such character symbols. The assigned integer is stored under the Index field, the special symbol under the Symbol field and the corresponding word under the Word field. Every time a new word is found in the text, the dictionary is updated using the same procedure. When a word is repeated, the symbol already assigned to that word is used.
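A possible in-memory layout for such a table is sketched below in C. The paper does not specify which byte values the 184 symbols take, nor exactly how characters are combined once the 184 single symbols are exhausted, so the code assignment and the two-byte pairing scheme here are assumptions.

/* Sketch of the dynamic dictionary: one entry per distinct word,
 * holding the three fields Index, Symbol and Word.  The symbol byte
 * values and the pairing rule beyond index 184 are assumptions. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NSYMS   184        /* usable single-byte codes, per the paper */
#define MAXWORD 64

struct entry {
    int  index;            /* running integer assigned to the word    */
    char symbol[3];        /* one or two symbol bytes, NUL-terminated */
    char word[MAXWORD];
};

static struct entry *table = NULL;
static int nentries = 0, capacity = 0;

/* Encode an index as one symbol byte, or as two once 184 is exceeded. */
static void make_symbol(int index, char out[3])
{
    if (index <= NSYMS) {
        out[0] = (char)index;             /* assumed code assignment  */
        out[1] = '\0';
    } else {
        int k = index - NSYMS - 1;
        out[0] = (char)(k / NSYMS + 1);   /* first byte of the pair   */
        out[1] = (char)(k % NSYMS + 1);   /* second byte of the pair  */
        out[2] = '\0';
    }
}

/* Look a word up; insert it with the next free index if absent. */
static const char *lookup_or_insert(const char *word)
{
    int i;
    for (i = 0; i < nentries; i++)
        if (strcmp(table[i].word, word) == 0)
            return table[i].symbol;       /* repeated word            */
    if (nentries == capacity) {
        capacity = capacity ? capacity * 2 : 64;
        table = realloc(table, capacity * sizeof *table);
        if (!table) exit(1);
    }
    table[nentries].index = nentries + 1;
    make_symbol(nentries + 1, table[nentries].symbol);
    strncpy(table[nentries].word, word, MAXWORD - 1);
    table[nentries].word[MAXWORD - 1] = '\0';
    return table[nentries++].symbol;
}

int main(void)
{
    int i;
    lookup_or_insert("name");             /* new word: index 1         */
    lookup_or_insert("Debashis");         /* new word: index 2         */
    lookup_or_insert("name");             /* repeated: index 1 reused  */
    for (i = 0; i < nentries; i++)
        printf("index %d -> \"%s\" (%d symbol byte(s))\n",
               table[i].index, table[i].word,
               (int)strlen(table[i].symbol));
    return 0;
}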

During decompression, each special symbol in the compressed file is looked up to obtain its corresponding integer value, or index, and the corresponding word. Finally the symbols are replaced with their corresponding words to regenerate the original file.

3. REMOVAL OF SPACES FROM THE INTERMEDIARY COMPRESSED FILE

Every text file contains spaces between the words, so that different words can be identified. Here we propose a method to remove these spaces, without losing any information, to obtain better compression.

Using special symbols for every word in the original file reduces the size of the file and yields an intermediary compressed file. We do not remove the spaces from the intermediary file; instead, it contains '0' to mark the location of the spaces between words. Every word in the original file is replaced by either one special symbol or a combination of two special symbols.

In the intermediary compressed file, when a '0' follows a single special symbol, the contents are not modified. However, when a word has been replaced by a combination of two symbols, or the word is a one- or two-letter word (not replaced in the intermediary compressed file), we remove the '0' after it, i.e. the space between the present word and the next word is removed. The '0' after a lone symbol is retained so that, during decompression, the following symbol cannot be misread as the second byte of a two-symbol combination. For example, suppose there is a line of text: My name is Debashis Chakraborty.

Assuming the special symbol for 'name' is '$', for 'Debashis' is '##' and for 'Chakraborty' is '@', after the final compression the output for the above sentence is: My$0is##@9
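The following C sketch applies this removal rule to the intermediary form of the example sentence; the printable placeholder bytes are again an assumption standing in for the non-printable codes.

/* Sketch of the space-removal pass over the intermediary file. */
#include <stdio.h>

static int is_symbol(char c)              /* placeholder classification */
{
    return c == '$' || c == '#' || c == '&' || c == '@';
}

int main(void)
{
    const char *inter = "My0$0is0##0@9";  /* intermediary compressed form */
    char out[128];
    int i = 0, o = 0;

    while (inter[i] != '\0') {
        int start = o;
        while (inter[i] != '\0' && inter[i] != '0' && inter[i] != '9')
            out[o++] = inter[i++];        /* copy one token verbatim    */

        if (inter[i] == '9') {
            out[o++] = inter[i++];        /* keep the sentence marker   */
        } else if (inter[i] == '0') {
            i++;                          /* consume the encoded space  */
            /* keep '0' only after a lone symbol byte, so the next
             * symbol is not misread as part of a two-byte combination */
            if (o - start == 1 && is_symbol(out[start]))
                out[o++] = '0';
        }
    }
    out[o] = '\0';
    printf("%s\n", out);                  /* prints: My$0is##@9         */
    return 0;
}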

4. PROPOSED ALGORITHM

The proposed algorithm takes a text file as input and compresses it to a size comparable to that achieved by the Lempel-Ziv-Welch algorithm, WinZip 10.0 and WinRAR 3.93. The proposed algorithm is:

Algorithm for Compression

Step 1: Read the contents of the text file one word at a time.

Step 2: Create a dictionary containing the fields 'Index', 'Symbol' and 'Word'. The dictionary is initially empty.

Step 3: Length calculation - determine the number of letters in the word just read.

Step 4: Special character replacement - if the word has more than two letters, replace it with its symbol from the dictionary, adding the word to the dictionary first if it is new; otherwise write the word unchanged. Write '0' for each space and '9' for each full stop.

Step 5: Continue the above process of compression and updating of the dynamic dictionary until the end of the original file is reached.

Step 6: Removal of spaces from the intermediary file - delete each '0' that follows a two-symbol combination or an unreplaced one- or two-letter word, as described in Section 3.

Step 7: Continue the above process of compression until the end of the intermediary file is reached.
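Steps 1-7 can also be folded into a single pass, as in the C sketch below: each token is emitted directly in its final form, and '0' is appended only where Section 3 would keep it. The fixed demonstration dictionary and the placeholder symbols are assumptions.

/* One-pass sketch of the compression steps on the example sentence. */
#include <stdio.h>
#include <string.h>

/* Fixed demo dictionary: word -> placeholder symbol string.
 * "##" mimics a two-byte combination symbol. */
static const char *words[]   = { "name", "Debashis", "Chakraborty" };
static const char *symbols[] = { "$",    "##",       "@" };

static const char *symbol_for(const char *w)
{
    size_t i;
    for (i = 0; i < sizeof words / sizeof *words; i++)
        if (strcmp(words[i], w) == 0)
            return symbols[i];
    return NULL;                       /* not in the demo dictionary   */
}

int main(void)
{
    char buf[] = "My name is Debashis Chakraborty.";
    char *w;

    for (w = strtok(buf, " "); w; w = strtok(NULL, " ")) {
        size_t len = strlen(w);
        int stop = (len > 0 && w[len - 1] == '.');
        const char *sym;
        if (stop)
            w[--len] = '\0';           /* strip the full stop          */

        sym = (len > 2) ? symbol_for(w) : NULL;
        if (sym)
            fputs(sym, stdout);        /* replaced word                */
        else
            fputs(w, stdout);          /* short word kept verbatim     */

        if (stop)
            putchar('9');              /* '.' -> '9'                   */
        else if (sym && strlen(sym) == 1)
            putchar('0');              /* keep '0' only after a lone symbol */
    }
    putchar('\n');                     /* prints: My$0is##@9           */
    return 0;
}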

Algorithm for Decompression

Step 1: Read one symbol at a time from the compressed file.

Step 2: If the symbol read is '0', replace it with a space or tab. If the symbol is '9', replace it with a '.' to indicate the end of a sentence.

Step 3: Decoding of special characters - look up each special symbol in the dictionary and replace it with the corresponding word, re-inserting the spaces that were removed after unreplaced words and two-symbol combinations.

Step 4: Continue the above process until the end of the compressed file is reached.
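A matching decompression sketch in C is shown below. The paper does not spell out how the removed spaces are re-inserted, so the rule used here, namely inserting a space wherever two tokens touch without an explicit '0' or '9' between them, is our reading of the lossless requirement; the fixed symbol table and placeholder bytes are again demonstration assumptions.

/* Sketch of decompression on the final compressed form of the example. */
#include <stdio.h>
#include <string.h>
#include <ctype.h>

static int is_symbol(char c)
{
    return c == '$' || c == '#' || c == '&' || c == '@';
}

static const char *lookup(const char *sym)   /* Symbol -> Word */
{
    static const char *syms[]  = { "$", "##", "@" };
    static const char *words[] = { "name", "Debashis", "Chakraborty" };
    size_t i;
    for (i = 0; i < sizeof syms / sizeof *syms; i++)
        if (strcmp(syms[i], sym) == 0)
            return words[i];
    return "?";                              /* unknown symbol          */
}

int main(void)
{
    const char *in = "My$0is##@9";
    int i = 0;

    while (in[i] != '\0') {
        if (in[i] == '0') {                  /* explicit encoded space  */
            putchar(' ');
            i++;
        } else if (in[i] == '9') {           /* sentence terminator     */
            putchar('.');
            i++;
        } else if (is_symbol(in[i])) {
            char sym[3] = { in[i++], '\0', '\0' };
            if (is_symbol(in[i]))            /* adjacent symbol bytes   */
                sym[1] = in[i++];            /* form one combination    */
            fputs(lookup(sym), stdout);
            if (in[i] != '\0' && in[i] != '0' && in[i] != '9')
                putchar(' ');                /* re-insert removed space */
        } else if (isalpha((unsigned char)in[i])) {
            while (isalpha((unsigned char)in[i]))
                putchar(in[i++]);            /* unreplaced short word   */
            if (in[i] != '\0' && in[i] != '0' && in[i] != '9')
                putchar(' ');                /* re-insert removed space */
        } else {
            i++;                             /* skip anything unexpected */
        }
    }
    putchar('\n');                /* prints: My name is Debashis Chakraborty. */
    return 0;
}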

5. EXPERIMENTAL RESULTS

The algorithm developed has been simulated using TURBO C. The input text files considered are .txt, .rtf, .cpp and .c files of different sizes. The compression ratios obtained are tabulated in Table 1. The compression ratio is better than that of the Lempel-Ziv-Welch algorithm, WinZip 10.0 and WinRAR 3.93 for the majority of the text files. All the text files reconstructed from the compressed files are of the same size as the original files; therefore the proposed algorithm is lossless.


Table 1. Compression of text files for different algorithms (compressed size in bytes, with the space saving in parentheses)

Original file   Size (bytes)   LZW            WINRAR 3.93    WINZIP 10.0    Proposed algorithm
sgjm1.txt        5046          3292 (35%)     2056 (59%)     2468 (51%)     1537 (69%)
sgjm2.txt        7061          4357 (38%)     2565 (63%)     2547 (64%)     2129 (70%)
sgjm3.rtf        2891          1842 (36%)     1303 (55%)     1269 (56%)      896 (69%)
sgjm4.txt         431           388 (9%)       260 (39%)      231 (46%)      158 (63%)
sgjm5.rtf        2037          1330 (35%)      859 (58%)      828 (59%)      635 (63%)
sgjm6.txt        3369          2196 (35%)     1545 (54%)     1504 (55%)     1110 (67%)
sgjm7.txt       10549          5457 (47%)     3933 (63%)     3923 (63%)     3492 (66%)
sgjm8.txt        7584          4216 (44%)     3067 (59%)     3048 (60%)     2389 (68%)
sgjm9.rtf        5529          3249 (41%)     2351 (57%)     2324 (58%)     1793 (67%)
sgjm10.rtf       4152          2658 (36%)     1869 (55%)     1831 (56%)     1428 (66%)
sgjm11.cpp        458           421 (8%)       259 (43%)      239 (48%)      134 (70%)
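The percentages in Table 1 are consistent with the space saving (original - compressed) / original, expressed to a whole percent; the formula itself is our reading of the table rather than stated in the paper. A quick C check against the first row:

/* Space saving as apparently used in Table 1. */
#include <stdio.h>

static double saving_percent(long original, long compressed)
{
    return 100.0 * (double)(original - compressed) / (double)original;
}

int main(void)
{
    /* sgjm1.txt: 5046 bytes -> 1537 bytes with the proposed algorithm */
    printf("%.1f%%\n", saving_percent(5046, 1537));   /* 69.5%, listed as 69% */
    return 0;
}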

6. CONCLUSIONS

In this paper, a new text compression algorithm for compressing different types of text files has been introduced. The main advantage of this compression scheme is that it gives better compression than existing algorithms across different text file sizes. In terms of compression ratio, the scheme compares favourably with the Lempel-Ziv-Welch algorithm, WinZip 10.0 and WinRAR 3.93.

REFERENCES

  1. J. Ziv and A. Lempel, "Compression of individual sequences via variable-rate coding", IEEE Transactions on Information Theory, Vol. 24, pp. 530-536, 1978.
  2. J. Ziv and A. Lempel, "A universal algorithm for sequential data compression", IEEE Transactions on Information Theory, Vol. 23, pp. 337-343, May 1977.
  3. Gonzalo Navarro and Mathieu Raffinot, "A General Practical Approach to Pattern Matching over Ziv-Lempel Compressed Text", Proc. CPM'99, LNCS 1645, pp. 14-36.
  4. S. Bhattacharjee, J. Bhattacharya, U. Raghavendra, D. Saha, P. Pal Chaudhuri, "A VLSI architecture for cellular automata based parallel data compression", IEEE-2006, Bangalore, India, Jan 03-06.
  5. Khalid Sayood, "An Introduction to Data Compression", Academic Press, 1996.
  6. David Salomon, "Data Compression: The Complete Reference", Springer, 2000.
  7. M. Atallah and Y. Genin, "Pattern matching text compression: Algorithmic and empirical results", International Conference on Data Compression, Vol. II, pp. 349-352, Lausanne, 1996.
  8. Mark Nelson and Jean-Loup Gailly, "The Data Compression Book", Second Edition, M&T Books.
  9. Timothy C. Bell, "Text Compression", Prentice Hall, 1990.
  10. Ranjan Parekh, "Principles of Multimedia", Tata McGraw-Hill, 2006.
  11. Amiya Halder, Sourav Dey, Soumyodeep Mukherjee and Ayan Banerjee, "An Efficient Image Compression Algorithm Based on Block Optimization and Byte Compression", ICISA-2010, Chennai, Tamilnadu, India, pp. 14-18, Feb 6, 2010.
  12. Ayan Banerjee and Amiya Halder, "An Efficient Image Compression Algorithm Based on Block Optimization, Byte Compression and Run-Length Encoding along Y-axis", IEEE ICCSIT 2010, Chengdu, China, IEEE Computer Society Press, July 9-11, 2010.
  13. Rafael C. Gonzalez and Richard E. Woods, "Digital Image Processing".
  14. Debashis Chakraborty, Sutirtha Ghosh and Joydeep Mukherjee, "An Efficient Data Compression Algorithm Using Differential Feature Extraction", NCETCS, August 26-28, 2010.