DonNTU   Masters' portal

Abstract

Introduction

Today the Internet is an active environment for distributing and collecting information. Data mining makes it possible to discover previously unknown, useful knowledge on Internet pages automatically. As a rule, commercial Internet pages contain many information blocks. Besides the main content blocks, there are navigation panels, copyright and privacy notices, and advertisements. In Web data mining such blocks are called noisy blocks. Recent wireless technology has made it possible to deliver Web pages to mobile phones and personal digital assistants. However, mobile devices have very small screens and limited memory, so transforming Web pages for delivery to a mobile device is an important new problem. Many algorithms exist for cleaning "noisy blocks" and for evaluating content objects within an Internet page. They make it possible to extract only the important parts of a Web page for delivery to mobile devices. In addition, eliminating noisy blocks can improve the analysis of Internet page content. It is therefore important to eliminate noisy blocks in order to improve how information is perceived.

To clean Web pages transparently, an architecture was developed in which a software gateway acts as an intermediary between the client computer and the requested Web page. The gateway works as an ordinary proxy server, and the browsers on client computers are configured to use it. When a Web page is requested, the proxy server analyzes the URL and loads the specified page. At present the gateway passes pages to client browsers unchanged. The architecture and implementation of the gateway allow page-cleaning methods to be introduced.

The software gateway is implemented in Java using non-blocking sockets. A distinctive feature of the gateway is that it supports thousands of simultaneous client connections and transfers Web page content in real time. Real-time transfer here means that Internet pages load for a large number of simultaneous users without a noticeable delay caused by the intermediary server.

1. Architecture of the Internet page content analysis system

The software gateway is an intermediary between the user's computer and the requested Internet page. The user enters a URL in the address bar of a browser whose settings route the Internet connection through the developed proxy server. The gateway then sends a request to the specified URL, receives the content of the Internet page, and processes it in real time. As a result, the client receives the content of the requested Web page, which the browser displays. The content analysis system loads Internet pages for multiple simultaneous users, and the processed pages pass through the gateway to those users in real time.

2. Software gateway for Internet page content analysis

The software gateway is implemented in Java, a cross-platform programming language. This lets the gateway run on more than one hardware platform and operating system.

The developed software gateway is a multi-threaded application. The advantage of multi-threading is that it provides the most effective implementation of time slicing: code that switches between tasks by hand would be more awkward in Java than an independent description of the actions of each thread. The gateway is designed for tasks that require several actions to be performed at once, which implies the use of multi-threading. A public proxy server has to serve many clients simultaneously, and serving several client flows at once is more convenient to implement with threads. This allows the gateway to run in real time for multiple simultaneous users.

The gateway implementation is based on the MINA library. MINA stands for "Multipurpose Infrastructure for Network Applications". Currently MINA supports the TCP and UDP protocols on top of the Java NIO API. It also supports serial port communication and transports based on the Apache Portable Runtime (APR). At the moment three MINA versions have been released: 1.0, 1.1 and 2.0. The developers advise using the latest 2.0 point release (for Java 5 or higher). MINA requires JDK 1.5 or above. MINA 1.0 also works with JDK 1.4, but only if SSLFilter is not used, because SSLFilter relies on the Java 5 SSLEngine. This means that JDK 1.5 or above is required to use SSL with MINA. Secure Sockets Layer (SSL) is a cryptographic protocol that provides a secure connection between the client and the server.

The core MINA module depends on two libraries, SLF4J and backport-util-concurrent (for 1.0). SLF4J (Simple Logging Facade for Java) provides bindings for log4j, the JDK 1.4 logging API and NLOG4J. Backport-util-concurrent is a backport of the java.util.concurrent package, which was introduced in JDK 1.5; MINA 1.0 uses it to stay compatible with JDK 1.4. MINA 1.1 and above depend on java.util.concurrent directly and therefore no longer need the backport. MINA makes it easy to create client and server applications using the IoConnector and IoAcceptor interfaces.

The MINA framework was used to implement the gateway software. Following the MINA conventions, the gateway first needs an object that listens for incoming connections. Because the gateway must support TCP/IP, this role is played by an acceptor of the NioSocketAcceptor class. Next, the filters that process the data are configured. The first filter converts binary or protocol-specific data into message objects and back; this is the job of ProtocolCodecFilter. The second filter keeps a log of all events, such as newly created sessions and messages received and sent; this is done by LoggingFilter. After the filters are configured, the handler is set. It serves client connections and incoming requests. The handler is the ServerHandler class created for the gateway, which implements the IoHandler interface. It is the main class of the server: it contains the main method and manages all incoming requests from clients. A mandatory final step is to initialize the port on which the gateway listens and to specify which operations are possible on an open session. The InetSocketAddress class defines the host and port to listen on, and the bind method binds the acceptor to that address, so that with an open session the gateway can read the data in a request and at the same time send replies to clients. Figure 1 shows how MINA is used to create the gateway.

Figure 1 – Creating a configuration of the gateway structure using MINA
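
The configuration steps described in this section can be sketched roughly as follows. This is only an illustration against the MINA 2.x API, not the author's code: it needs the MINA core and SLF4J jars on the classpath, the port number, buffer size and idle time are assumed values, and TextLineCodecFactory stands in for the gateway's own HTTP decoder and encoder. The ServerHandler here is a placeholder that echoes each message instead of loading a page.

```java
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

import org.apache.mina.core.service.IoAcceptor;
import org.apache.mina.core.service.IoHandlerAdapter;
import org.apache.mina.core.session.IdleStatus;
import org.apache.mina.core.session.IoSession;
import org.apache.mina.filter.codec.ProtocolCodecFilter;
import org.apache.mina.filter.codec.textline.TextLineCodecFactory;
import org.apache.mina.filter.logging.LoggingFilter;
import org.apache.mina.transport.socket.nio.NioSocketAcceptor;

class GatewayConfigSketch {

    private static final int PORT = 8080; // assumed port, not from the original work

    /** Placeholder handler; IoHandlerAdapter implements the IoHandler interface. */
    static class ServerHandler extends IoHandlerAdapter {
        @Override
        public void messageReceived(IoSession session, Object message) {
            // The real gateway would load the requested page here.
            session.write("received: " + message);
        }

        @Override
        public void sessionIdle(IoSession session, IdleStatus status) {
            System.out.println("IDLE " + session.getIdleCount(status));
        }
    }

    public static void main(String[] args) throws Exception {
        // 1. The object that listens for incoming TCP connections.
        IoAcceptor acceptor = new NioSocketAcceptor();

        // 2. Filters: first the codec, then the logger, in the order given in the text.
        acceptor.getFilterChain().addLast("codec",
                new ProtocolCodecFilter(new TextLineCodecFactory(StandardCharsets.UTF_8)));
        acceptor.getFilterChain().addLast("logger", new LoggingFilter());

        // 3. The handler that serves client connections and requests.
        acceptor.setHandler(new ServerHandler());

        // 4. Session settings (assumed values) and the port to listen on.
        acceptor.getSessionConfig().setReadBufferSize(2048);
        acceptor.getSessionConfig().setIdleTime(IdleStatus.BOTH_IDLE, 10);
        acceptor.bind(new InetSocketAddress(PORT));
    }
}
```

With a configuration of this kind, replacing the codec factory is enough to switch the gateway from line-based text to HTTP messages; the acceptor, logger and handler wiring stay the same.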

3. Request analysis algorithms in the software gateway implementation

The software gateway processes the URL request from the user: it splits the request into its parameters and then parses them. The request fills a map with the URI, Method, Context and Protocol parameters. A GET request sends data to the host in the URL. The encoding scheme for data passed in the URL by the GET method is as follows. For each form element that has a name, a pair "name=value" is formed, where value is the value of the element entered by the user or assigned by default. The value may be empty. All pairs are joined into a single string with the character "&" as the separator. Because names and values are ordinary text, they may contain characters that are not valid in a URL. Such characters are replaced by a sequence consisting of the character "%" and their hexadecimal ASCII code. The space character can be replaced not only by the code "%20" but also by the "+" sign.
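
The encoding scheme above can be illustrated by a minimal sketch that reverses it. The names QueryStringSketch and parseQuery are hypothetical and are not taken from the gateway; the sketch assumes UTF-8 and relies on the standard java.net.URLDecoder, which decodes both "%XX" sequences and the "+" sign.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

final class QueryStringSketch {

    /** Splits "name=value&name2=value2" into decoded name/value pairs. */
    static Map<String, String> parseQuery(String query) {
        Map<String, String> params = new LinkedHashMap<>();
        if (query == null || query.isEmpty()) {
            return params;
        }
        for (String pair : query.split("&")) {
            if (pair.isEmpty()) {
                continue; // tolerate "a=1&&b=2"
            }
            int eq = pair.indexOf('=');
            // A pair without "=" is treated as a name with an empty value.
            String name = eq >= 0 ? pair.substring(0, eq) : pair;
            String value = eq >= 0 ? pair.substring(eq + 1) : "";
            params.put(decode(name), decode(value));
        }
        return params;
    }

    /** Replaces "%XX" sequences and "+" with the characters they stand for. */
    private static String decode(String text) {
        try {
            return URLDecoder.decode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseQuery("name=John+Smith&city=New%20York"));
    }
}
```

Both spellings of the space character decode to the same value here, which matches the rule that "+" and "%20" are interchangeable in a query string.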

When reading a request, the HttpRequestDecoder class takes the characters of the request URL into account when splitting a GET request. The HttpRequestMessage class temporarily stores the parsed request and provides access to its parts. The ServerHandler class is the handler that serves client connections and requests. It defines what happens while a session is idle and the time after which a session is considered idle. The same handler is responsible for extracting the HTTP protocol type. Its doHttpUrlConnectionAction method takes as a parameter the domain requested by the user and creates an HttpURLConnection object, from which the HTTP protocol type is extracted. This makes it possible to obtain the content of the Web page. The HttpResponseEncoder class generates the response to the client's request in accordance with the standard response encoding. The HTTP response body is stored in the HttpResponseMessage class, and the page obtained by the doHttpUrlConnectionAction method of the ServerHandler class is then sent to the client's browser.
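
The two steps described above, storing the parsed request and loading the page through HttpURLConnection, might look roughly like this. It is a sketch only: RequestSketch, parseRequestLine and fetch are hypothetical names, not the gateway's classes; "Context" is assumed to mean the request target without its query string; the timeouts and the UTF-8 decoding of the body are assumptions; and fetch expects an absolute http or https URL.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

final class RequestSketch {

    /** Parses a request line such as "GET http://host/path?x=1 HTTP/1.1". */
    static Map<String, String> parseRequestLine(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Malformed request line: " + line);
        }
        String target = parts[1];
        int queryStart = target.indexOf('?');

        Map<String, String> request = new LinkedHashMap<>();
        request.put("Method", parts[0]);
        request.put("URI", target);
        // Assumption: "Context" is the target with the query string removed.
        request.put("Context", queryStart >= 0 ? target.substring(0, queryStart) : target);
        request.put("Protocol", parts[2]);
        return request;
    }

    /** Loads the page at the given URL and returns its body as text. */
    static String fetch(String url) throws IOException {
        HttpURLConnection connection =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        connection.setRequestMethod("GET");
        connection.setConnectTimeout(5000); // assumed timeouts, in milliseconds
        connection.setReadTimeout(5000);
        try (InputStream in = connection.getInputStream()) {
            // Assumption: the body is UTF-8 text; a real gateway would honour Content-Type.
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } finally {
            connection.disconnect();
        }
    }
}
```

A handler could call parseRequestLine on the first line of an incoming request and pass the "URI" entry to fetch, returning the result to the browser unchanged, which corresponds to the current pass-through mode of the gateway.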

At this stage the gateway is configured to relay pages to client browsers unchanged. The architecture and implementation of the software gateway allow methods for cleaning "noisy blocks" from pages to be introduced, with the aim of improving the analysis of Web page data.

Conclusion

Methods for analyzing the content of Internet pages were reviewed. They provide transparent cleaning of Internet pages before the pages are displayed. A subsystem that delivers Web page content to multiple simultaneous users in real time has been implemented.

References

  1. MINA [Electronic resource] - access mode: http://mina.apache.org/.
  2. URLConnection [Electronic resource] - access mode: http://docs.oracle.com/javase/tutorial/networking/urls/readingWriting.html.
  3. Lan Yi, Bing Liu, Xiaoli Li. Eliminating Noisy Information in Web Pages for Data Mining.
  4. Xinyi Yin, Wee Sun Lee. Using Link Analysis to Improve Layout on Mobile Devices.

Important

At the time of writing, the master's work is not yet complete. The estimated completion date is 1 December 2012. The full text of the work, as well as materials on the topic, can be obtained from the author or the author's supervisor only after that date.