Abstract


Introduction

Today, virtually no website or serious application can do without a database. Databases store, organize, and group data, ensuring both convenient access and the integrity of the information. To create, manage, administer, and use databases, specialized programs or suites of programs called database management systems (DBMS) are used.

The development of DBMSs began in the mid-1960s, driven by the needs of space programs. The first full-fledged DBMS was the hierarchical IMS (Information Management System). IMS is still in use today as the main hierarchical DBMS on high-performance servers: it is secure, reliable, high-throughput software for online transaction processing and batch processing.

1. Theme urgency

With the advent of technically more complex software systems, the need to store and process large amounts of information grows in direct proportion. As a result, the amount of information stored in databases exceeds the physical limits of centralized systems, which degrades performance and fault tolerance and compromises the integrity of the stored information.

In modern development approaches, decentralization is becoming increasingly important. Over the past decades, developers have sought to design geographically distributed databases. This approach increases database performance, improves the protection of stored information, and provides effective means of data replication. To achieve these advantages, however, effective methods for designing distributed databases must be applied.

2. Goal and tasks of the research

The purpose of the study is to develop methods for designing a distributed database that decentralize it by breaking it into fragments, which are then placed in the nodes of a computer network. Based on these methods, algorithms for a software package must be created whose results will increase the efficiency of query processing in a distributed database, of data replication, and of the protection of stored data.

The main objectives of the study:

  1. Study of the characteristics of distributed data processing systems.
  2. Study of the main tasks involved in the design of distributed databases.
  3. Investigation of the problems of database fragmentation and of placing fragments in the nodes of a computer network.
  4. Formation of a query execution strategy.
  5. Selection and justification of an effectiveness criterion for a distributed database.
  6. Development of algorithms that describe the fragmentation and fragment placement methods, and formation of the architecture of a distributed database.
  7. Implementation of a distributed database in a software package.

Research object: the distributed database.

Subject of research: models and algorithms of a software package that implement the database decentralization methodology.

As part of the master's work, it is planned to obtain relevant scientific results in the following areas:

  1. Formation of the design task of a distributed database, taking into account the features of database fragmentation and placement of the resulting fragments.
  2. Formation of a query execution strategy.
  3. Selection and justification of a distributed database performance criterion that takes into account the influence of the physical parameters of the system on the speed of request processing and the transaction availability ratio.

As practical results, it is planned to develop a highly loaded distributed database and integrate it into a ready-made software package for online ticketing.

3. Database Fragmentation

Database fragmentation, or sharding, is a database architecture pattern based on horizontal partitioning: dividing the rows of one table among several different tables, called partitions. Each partition has the same schema and columns but different rows, so the data stored in each partition is unique and independent of the data stored in the others.

Horizontal partitioning is usually contrasted with vertical partitioning. In a vertically partitioned table, entire columns are separated out and placed in distinct tables. The data held in one vertical partition is independent of the data in all the others, and each partition holds both distinct rows and distinct columns. Figure 1 shows horizontal and vertical partitioning of a table.


Figure 1 – The process of horizontal and vertical table partitioning
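To make the distinction concrete, here is a minimal Python sketch (the three-column table and the split criteria are invented for illustration) that partitions the same rows both ways:

    # A toy table: every row has the same columns.
    rows = [
        {"id": 1, "name": "Alice", "city": "Kiev"},
        {"id": 2, "name": "Boris", "city": "Donetsk"},
        {"id": 3, "name": "Clara", "city": "Kiev"},
    ]

    # Horizontal partitioning: same columns, disjoint sets of rows.
    horizontal = {
        "part_a": [r for r in rows if r["id"] % 2 == 1],
        "part_b": [r for r in rows if r["id"] % 2 == 0],
    }

    # Vertical partitioning: each partition keeps the key for every row
    # but only a subset of the remaining columns.
    vertical = {
        "part_names":  [{"id": r["id"], "name": r["name"]} for r in rows],
        "part_cities": [{"id": r["id"], "city": r["city"]} for r in rows],
    }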

Fragmentation involves splitting data into two or more smaller parts called logical segments. These are then distributed across separate database nodes, called physical segments, each of which can hold several logical segments. Taken together, the data stored in all the segments represents the complete logical data set.

A segmented database is an example of a shared-nothing architecture: the segments are autonomous and share neither data nor computing resources. In some cases, however, it can make sense to replicate specific tables to every segment for reference. For example, suppose an application depends on fixed conversion factors for weight measurements. By replicating the table of conversion factors to each segment, we can ensure that every segment holds all the data it needs for its queries.
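As a hedged sketch of that replication idea, the following Python snippet uses two local SQLite files to stand in for two physical segments, with an invented weight_factors table as the shared reference data:

    import sqlite3

    # Two local SQLite files stand in for two physical segments
    # (real segments would live on separate servers).
    SHARDS = ["shard_0.db", "shard_1.db"]

    # Fixed conversion factors every segment needs for local queries.
    WEIGHT_FACTORS = [("kg_to_lb", 2.20462), ("lb_to_kg", 0.45359)]

    for path in SHARDS:
        conn = sqlite3.connect(path)
        conn.execute("CREATE TABLE IF NOT EXISTS weight_factors"
                     " (name TEXT PRIMARY KEY, factor REAL)")
        # Replicate the same reference rows into every segment.
        conn.executemany("INSERT OR REPLACE INTO weight_factors VALUES (?, ?)",
                         WEIGHT_FACTORS)
        conn.commit()
        conn.close()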

Fragmentation is often implemented at the application level: the application includes code that determines which segment each read and write should be sent to. However, some database management systems have built-in fragmentation capabilities, allowing it to be implemented directly at the database level.
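A minimal sketch of such application-level routing, assuming a stable hash of the shard key selects one of N segments (the key format and segment count are invented):

    import hashlib

    N_SHARDS = 4  # assumed number of physical segments

    def shard_for(key: str) -> int:
        # Map a shard key to a segment index with a stable hash.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % N_SHARDS

    # The application consults shard_for() before every read or write.
    print(shard_for("customer:1042"))

Note that with a fixed hash function like this, changing N_SHARDS remaps almost every existing key, which is one reason such a scheme is hard to change after the fact.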

3.1 Benefits of database fragmentation

The main advantage of sharding is that it simplifies horizontal scaling (scaling out): adding new machines to the existing stack in order to spread the load and handle more traffic faster. This is often contrasted with vertical scaling (scaling up), which means upgrading the hardware of an existing server, usually by adding more RAM or CPU.

Keeping a relational database on a single machine and scaling it vertically as needed by upgrading its computing resources is relatively simple. Eventually, however, any non-segmented database is limited in storage and processing power, so the ability to scale horizontally makes the installation far more flexible.

Another reason to choose a fragmented database architecture is to reduce query response time. A query against a non-fragmented database may have to scan every row of a table before the desired result set is found, so in an application with a large monolithic database queries can be processed very slowly. In a segmented table a query scans fewer rows, and results are returned much faster.

Fragmentation can also make an application more reliable by mitigating the impact of failures. If an application or website relies on a monolithic database, a failure can make the entire application unavailable. With a fragmented database, an outage is likely to affect only one data segment: some parts of the application or website may become unavailable, but the overall impact is still smaller than if the entire database crashed.

3.2 Disadvantages of database fragmentation

Although database fragmentation can simplify scaling and improve performance, it can also impose certain limitations. In this section, we will discuss some of these limitations and situations in which it is better not to use sharding at all.

The first problem people face with fragmentation is the difficulty of implementing such an architecture correctly. Done incorrectly, fragmentation can lead to data loss or corrupted tables. But even when done correctly, fragmentation has a significant impact on a team's workflows: instead of accessing and managing data at a single point, team members must manage data across several segments, which can disrupt the work of some groups.

A problem that sometimes appears after fragmentation is that the segments end up differing greatly in size. Suppose a database has two segments: one for customers whose last names begin with the letters A through M, and another for those beginning with N through Z. If the application serves many users whose last names begin with G, the first segment gradually accumulates more data than the second, slowing the application for a significant portion of the users. In this case the benefits of fragmentation are negated by slowdowns and crashes, and the database will most likely have to be re-fragmented and reconfigured to ensure a more even distribution of data.
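A quick way to check for such skew is to count rows per segment. The sketch below uses an invented sample of last names and the A–M / N–Z split from the example:

    from collections import Counter

    # Invented sample; in practice the names would come from the real table.
    last_names = ["Garcia", "Gomez", "Gray", "Green", "Nguyen", "Young"]

    def segment(name: str) -> str:
        return "A-M" if name[0].upper() <= "M" else "N-Z"

    print(Counter(segment(n) for n in last_names))
    # Counter({'A-M': 4, 'N-Z': 2}) -- the A-M segment is overloaded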

Another significant drawback is that once a database has been fragmented, it can be very difficult to return it to its previous architecture. Backups made before fragmentation do not contain data written after fragmentation. Therefore, restoring the original architecture requires either merging the new segmented data with the old backups or converting the segmented database back into a single database, both of which are time-consuming and expensive.

3.3 Types of fragmented architectures

When executing queries or distributing incoming data across segmented tables or databases, it is crucial that the data goes to the correct segment; otherwise data may be lost, or queries may become extremely slow. In this section we will look at several common fragmentation architectures, each of which distributes data between segments in a slightly different way.

3.3.1 Interval fragmentation

Interval fragmentation involves segmenting data based on ranges of a given value. Suppose a database stores information about all the products in a catalog: several segments can be created, and each product's record placed in a segment according to the price range it falls into (Figure 2).


Figure 2 – Interval Fragmentation Example

The main advantage of this model is that it is relatively easy to implement. Each segment contains a different set of data, but they all have the same schema as the original database. The application code simply determines which range the data falls into and writes it to the corresponding fragment.
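A hedged sketch of that range lookup (the price bands and segment names are invented, mirroring the idea of Figure 2):

    import bisect

    # Upper bounds of the price ranges and the segment that owns each
    # resulting band: <10, 10-50, 50-100, >=100.
    BOUNDS = [10.0, 50.0, 100.0]
    SHARDS = ["shard_a", "shard_b", "shard_c", "shard_d"]

    def shard_for_price(price: float) -> str:
        # Find the first range whose upper bound exceeds the price.
        return SHARDS[bisect.bisect_right(BOUNDS, price)]

    print(shard_for_price(7.99))   # shard_a
    print(shard_for_price(250.0))  # shard_d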

On the other hand, interval segmentation does not protect the data from uneven distribution, which leads to the hotspots described earlier. Even if each segment in Figure 2 holds the same amount of data, specific products are likely to receive more attention than others, and their segments will in turn receive a disproportionate share of read operations.

3.3.2 Directory fragmentation

To implement this model, you create and maintain a lookup table that uses a segment key to track which segment holds which data. In this context, a lookup table is a table containing a static set of information about where specific data can be found. Figure 3 shows a simplified example of directory segmentation.


Figure 3 – Directory segmentation example

Here, the Delivery Zone column is defined as the segment key. Data from the segment key is written to the lookup table along with the segment each corresponding row should be written to. This resembles interval segmentation, but instead of determining which range the data falls into, each key is bound to its own specific segment. Directory segmentation is preferable to interval segmentation when the segment key has low cardinality, so storing a range of keys per segment makes no sense. Note that this model also differs from key-based segmentation: it does not pass the segment key through a hash function but simply checks the key against the table to see where to write the data.
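A minimal sketch of such a directory lookup, assuming the Delivery Zone key from Figure 3 (the concrete zone-to-segment mapping is invented):

    # The lookup table: each value of the segment key (Delivery Zone)
    # is bound to exactly one segment. In production it would live in
    # its own highly available store, since it is a single point of failure.
    DIRECTORY = {
        "north": "shard_1",
        "south": "shard_2",
        "east":  "shard_3",
        "west":  "shard_3",   # several keys may share one segment
    }

    def shard_for_zone(zone: str) -> str:
        # No hashing: the key is checked directly against the table.
        return DIRECTORY[zone]  # raises KeyError for an unmapped zone

    print(shard_for_zone("east"))  # shard_3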

The main advantage of directory fragmentation is flexibility. Interval segmentation restricts you to ranges of values, and key-based segmentation to a fixed hash function, which, as noted above, can be extremely difficult to change later. Directory segmentation allows any system or algorithm to be used to assign data to segments, and with this approach it is relatively easy to add fragments dynamically.

Directory segmentation is the most flexible of the methods discussed here, but the need to consult the lookup table before every query or write can hurt application performance. Moreover, the lookup table can become a single point of failure: if it is damaged or otherwise malfunctions, the ability to write new data or retrieve existing data is affected.

Conclusion

Within the framework of this work, the main methods of database segmentation were considered and the features of each were identified. In the future, it is planned to use one of the described segmentation methods in the ongoing online ticketing project.

The master's thesis is devoted to the urgent scientific task of researching methods for creating high-performance DBMSs. As part of the research performed:

  1. Methods of database fragmentation were considered.
  2. Based on the analysis of literary sources, the structure of building high-performance DBMSs was highlighted.
  3. The requirements for a high-performance DBMS were formed and analyzed.

Further research focuses on the following steps:

  1. Design and implementation of a distributed database, taking into account the features of database fragmentation.
  2. Formation of a query execution strategy.
  3. Justification of the selected criteria for the effectiveness of a distributed database.

At the time of writing, the master's thesis is not yet complete. Expected completion: June 2020. The full text of the work and materials on the topic can be obtained from the author or his supervisor after that date.
