Gimadeev Sergey

CV Biografy Abstract

Subject: Methods and algorithms for optimizing access to spatial data in the database HBase

Contents:

Goals and Objectives
Background
Scientific novelty
Expected practical results
Review of geographic information systems
Distributed data storage and processing
Mathematical methods of solving the problem
Conclusion
References

Goals and Objectives

Typical tasks of spatial data mining include clustering, classification and identification of trends. The main difference between the extraction of knowledge in conventional databases and in the space – this is what the neighbors determine the spatial attributes of the objects have an impact on this very subject. Therefore, algorithms for extracting knowledge from spatial data are highly dependent on the efficiency of the process of determining the relationship to the neighborhood of a spatial object. The operation of determining the ratio between the spatial objects of the neighborhood needs a lot of computing time. A simple algorithm to search for nearby objects has the computational complexity, since it is necessary to view all the pairs of objects. Thus, the purpose is to analyze the problem of the relevance of data generated in a GIS, as well as the identification of methods and algorithms that can be used effectively to index spatial data [1].

Background

Recently there have been a large number of information systems generating spatial data. Such systems are commonly referred to as geographic information systems (GIS). This is a system such as sensor networks, remote sensing, satellite monitoring of transport, etc. All of this suggests that the number and amounts of spatial databases will continue to grow rapidly. Intelligent analysis of these data may provide useful information about the real physical world and help in solving their problems to experts such domains as ecology, land management, municipal management, transportation, economics, and many others.

Scientific novelty

scientific innovation is the use of codes of Morton and tree data structures such as a quadtree, which will index the spatial objects for further processing. This approach should provide high performance in distributed database systems using HBase.

Expected practical results

With this work assumes a solution to the problem of indexing spatial data obtained from geographic information systems.

Review of Geographic Information Systems

The significant difference of GIS from other information systems with the spatial localization of the data is included in the description of spatial objects and topological characteristics of the classification on this basis of spatial objects: point, line and area (areal). Temporal aspect, as a rule, consists of three long-term factor, average time and operational. From this aspect is related characteristics of the quality of information – the relevance and the procedure for updating the data. By the time characteristics of the stored information and the SOT, usually divided into:
• long-term (ten years of storage);
• medium-term (years);
• annual and seasonal;
• operational.
However, the temporal aspect of data in GIS is defined and the class of tasks. For example, in the most prompt and determined by the GIS from one week to two to three weeks. In the special GIS (military intelligence, the analysis of emergency) the efficiency can be minute. Thematic information in GIS is not limited. That is what creates the possibility of using GIS as a universal information system solutions for a variety of tasks. Moreover, it is the main thematic information, while the spatial information serves as a link to combine, compare, search, and interpret a variety of related information. In most GIS to determine where to use two classes of data: one class – the position (coordinate) data, and another class – attribute data to determine the parameters of time and thematic focus.

Distributed data storage and processing

To work with large volumes of data in recent years increasingly used a distributed storage and processing. Moreover, there is a shift from the use of high-performance mainframe to the cluster systems consisting of a large number of low-cost servers with x86-64 [2]. Following the publication of articles by Google disclosing the principles of data storage and processing, implemented it for their services [3, 4], there were open source projects that implement these principles. The most successful of them so far is the project of Hadoop. Hadoop is a free Java framework that supports the execution of distributed applications running on large clusters, combining conventional server running open source operating system Linux. For distributed data storage uses a file system Hadoop Distributed File System (HDFS), which is analogous to open Google File System (GFS). On top of that file system databases running HBase, which is analogous to BigTable database of Google [3]. Hadoop is implemented in the computational paradigm, known as MapReduce [4]. According to this paradigm, the application is divided into many small tasks, each of which can be performed on any node in the cluster. All these technologies make it possible to achieve very high aggregate bandwidth of the cluster. This system allows applications to easily scale to thousands of nodes and petabytes of data. Therefore, the development of spatial indices, we will perform for database HBase.

Mathematical Methods for solving the problem

to build a spatial index is proposed to use a data structure quadtree (quadtree). Quadtree – a tree data structure in which the root and each internal node has four child nodes [4]. This data structure has been designed specifically for the breakdown of two-dimensional space by recursively dividing it into four quadrants (Fig. 1).

Figure 1 – Principle of the quadrants of the tree

Figure 2 – Indexing spatial objects, blocking

The root of the tree divides the region into four quadrants NE, SE, SW, SE (in accordance with the card). Each level of the tree quadrants requires two bits of code Morton. To encode geographic coordinates, we propose to use the code length of 64 bits. Since the geographic coordinates of the Earth range from -90° to 90° in latitude and from -180° to 180° of longitude 64-bit code will allow Morton to set the coordinates to an accuracy up to seven decimal places in decimal degrees. Morton codes were chosen due to ease of calculation of the cell numbers in the geometric coordinates. Spatial databases operate on two-dimensional geometric objects such as point, polyline and polygon. The point in the two-dimensional space is represented by two geographical coordinates: longitude and latitude. Consider the algorithm for determining the Morton code for the coordinates of the two-dimensional space: In order to get the code for the node containing Morton's point, it is enough to translate the indices x and y in the binary representation, and then build a new number, in turn, taking on one category of the binary representation of each number [5, 6]. Let us now consider a more complicated case, when indexed by a spatial object represented by a broken line or a polygon. Any two-dimensional object can be replaced by the described rectangle (bounding box) [7, 8], with some loss of accuracy. This allows us to simplify calculations and to significantly increase the processing speed of spatial queries. This approach is common practice in the construction of the indices of spatial data. Thus, an arbitrary two-dimensional object when indexing is described as a rectangle. Quadtree has the property of adaptability. It means. That spatial objects can be indexed nodes at different levels. It all depends on the area of the indexed object. If the point is indexed, the site uses the lower level (Fig. 2, Fig. 3). If the object is represented by a rectangle is indexed, the reference to it is written in a knot the size of which exceeds the minimum size of the rectangle [1].

Figure 3 – Indexing of spatial objects, the tree structure

Mutual influence between spatial objects defined by topological relation, distance and direction between the objects. To determine the topological relations between two spatial objects A and B are commonly used following predicates [9, 10]:
• equals – match object A with object B;
• disjoint – object A spatially separated from the object B;
• intersects – object A intersects object B;
• touches – an object A is in contact with the object B;
• crosses – crosses the spatial object A with object B;
• within – an object A is inside the object B;
• contains – object A contains object B;
• overlaps – object A spatially partially overlaps the object B.
detailed description of these predicates can be found in the standard Open Geospatial Consortium (OGC) [10]. In addition to the topological relations between spatial objects are relations of direction (for example, object A is located north of the object B). To set the direction of the relationship in the development of indices, we used the English-language expressions: south, southwest, west, northwest, north, northeast, east, southeast and any_direction. The distance between two spatial objects is given a real number. They are considered adjacent if the distance between them does not exceed a specified value. In the index tables, together with a topological relation and the ratio of the direction we are going to store the distance between objects [11]. To reduce the size of the index table should be possible to specify the maximum distance between the objects at which they will be considered adjacent. This distance can be constant or depend on the areas of indexed objects [1].

Conclusion

As a result of an analysis of geographic information systems, they identified the major deficiencies in terms of indexing and retrieval of spatial data. To solve this problem, we developed a method of indexing spatial data in the database HBase.

References

1. A Telyatnikov, S. Gimadeev, Indexing spatial data in order to increase mining, ISDMCI '2011 / Materials of international scientific conference. Evpatoria, HNTU, 2011, 317–321 pp.
2. Gartner. Gartner Predicts Video Telepresence Will Replace 2.1 Million Airline Seats Per Year by 2012. http://www.gartner.com/it/page.jsp?id=876512.
3. Chang F., Dean J., Ghemawat S., et al. Bigtable: a distributed storage system for structured data. In Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation – Volume 7 (OSDI '06), Vol. 7. USENIX Association, Berkeley, CA, USA, 15.
4. Dean J., Ghemawat, S. Mapreduce: simplified data processing on large clusters. Commun. ACM, January 2008, 51:107–113 pp.
5. Morton G.M. A computer oriented geodetic data base and a new technique in file sequencing. Technical report, IBM Ltd., Ottawa, Ontario, Mar. 1966.
6. Larsson, Thomas, and Tomas Akenine-Moller, Collision Detection for Continuously Deforming Bodies, Eurographics 2001.
7. Samet H. The Design and Analysis of Spatial Data Structures. Addison-Wesley, Reading, MA, 1990.
8. Finkel R., Bentley J.L. Quad Trees: A Data Structure for Retrieval on Composite Keys. Acta Informatica 4 (1): 1–9 pp., 1974.
9. Ester M., Kriegel H.-P., Sander J. Knowledge Discovery in Spatial Databases, invited paper at 23rd German Conf. on Artificial Intelligence (KI '99), Bonn, Germany, 1999.
10. Cormen, T.H., C.E. Leiserson, and R. Rivest, Introduction to Algorithms, MIT Press, Inc., Cambridge, Massachusetts, 1990.
11. Yiwei Wu, Yingshu Li, Distributed Indexing and Data Dissemination in Large Scale Wireless Sensor Networks, http://www.cs.gsu.edu/yli/papers/icccn09.pdf.

Faculty Computers Sciences and Technologies

Department Automated Control Systems

Speciality "Information Control Systems"

Theme of Master's Work:

Methods and algorithms for optimizing access to spatial data in the database HBase

Scientific Supervisor: Telyatnikov Alexander