Abstract

Introduction

Today the IT industry must not only store information and deliver raw performance, but also process specific data of any kind and deliver it to a specific user. It has turned out that traditional relational databases, created in the era of mainframes and Unix-based systems, were designed for transactional processing of tabular data and do not lend themselves to the horizontally scaled systems needed to operate on distributed, heterogeneous data sources of huge volume. In addition, it has become apparent that modern users do not want to spend time converting data into relational form; they prefer to keep data in its original form and structure it only when necessary, for example when solving analytical problems. As a result, developers have begun to speak of a crisis in the formerly stable field of databases, and movements such as NoSQL and NewSQL have emerged[1, 2].

1. Relevance of the topic

We live in the information age. The total volume of electronic data is not easy to measure, but IDC estimated the size of the digital universe at 0.18 zettabytes in 2006 and at 1.8 zettabytes by 2011, a tenfold increase in five years[3]. This wealth of information creates new challenges for organizing its storage and processing. Modern high-load applications have changed the requirements placed on databases: technologies for building specialized solutions with guaranteed response times on large data sets have become relevant. However, the potential of long-established approaches has not yet been fully realized.

2. Goal and tasks of the research

The goal of the master's work is to develop a highly loaded web-based system that supports high traffic and is oriented toward storing large quantities of graphics and multimedia.

Main tasks of the research:

  1. Review the technologies used to optimize relational databases.
  2. Analyze the use of alternative NoSQL solutions in comparison with relational databases.
  3. Explore the characteristics of the best-known high-load system architectures.
  4. Design the architecture of a highly loaded system for an online periodical.
  5. Develop a method for optimizing a high-load application.
  6. Implement the web-based system in practice.

3. Review of research and developments

More than 40 years have passed since the publication of the first work in which the problem was formulated, and many publications on the subject have appeared since then. Alongside academic studies, companies publish reviews devoted to solving this problem within their own proprietary databases. Many conferences have been held on the subject, such as HighLoad++ (2007, 2008, 2009, 2010, 2011, 2012, 2013), NoSQL matters Conference (2013) and NoSQL NOW (2013)[1, 2, 4].

4. Approaches to dealing with large data

The classical relational architecture ran into problems as the amount of information grew, so engineers turned to optimization. Numerous articles and reviews are devoted to query optimization.

There are two main methods of optimization: statistical and algebraic. The statistical method is based on a system of cost estimates, database statistics and the assumptions of the model. Various heuristics narrow the search space so that the optimal execution plan for the query can be selected. The algebraic method is based on representing the query in relational algebra and mathematical logic and transforming it so that the output is an equivalent canonical query[3, 6, 7].
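
As an illustration of the algebraic approach, the following minimal Python sketch applies one well-known equivalent rewrite: pushing a selection below a join. The relations `users` and `posts`, the helper functions and the "pairs compared" counter are all hypothetical and exist only for this example; the counter is a crude stand-in for the cost estimate a statistical optimizer would use, not a model of any real DBMS.

```python
from itertools import product

# Hypothetical toy relations: lists of dicts standing in for tables.
users = [{"uid": i, "country": "UA" if i % 4 == 0 else "PL"} for i in range(20)]
posts = [{"pid": p, "uid": p % 20} for p in range(100)]


def join(left, right, key):
    """Nested-loop equi-join; returns the rows and the number of pairs compared."""
    rows = [{**a, **b} for a, b in product(left, right) if a[key] == b[key]]
    return rows, len(left) * len(right)


def select(rows, predicate):
    """Relational selection: keep only the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def is_ua(row):
    return row["country"] == "UA"


def canonical(rows):
    """Order-independent form of a result, used to compare two plans."""
    return sorted((r["uid"], r["pid"]) for r in rows)


# Plan A: join everything first, then filter the joined rows.
joined, cost_a = join(users, posts, "uid")
plan_a = select(joined, is_ua)

# Plan B: push the selection below the join (an equivalent algebraic rewrite).
plan_b, cost_b = join(select(users, is_ua), posts, "uid")

same_result = canonical(plan_a) == canonical(plan_b)
print(same_result, cost_a, cost_b)
```

Both plans return the same rows, but in this toy setup the rewritten plan compares 500 pairs instead of 2000, which is the kind of difference an optimizer looks for when choosing between equivalent plans.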

Attempts to adapt relational database management systems to work with big data lead to the following:

  1. Rejection of strict consistency.
  2. Abandonment of normalization and the introduction of redundancy.
  3. Loss of the expressiveness of SQL and the need to implement part of its functionality in application code.
  4. Significant complication of the client software.
  5. Difficulty in maintaining the operability and fault tolerance of the resulting solution.
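
The third and fourth consequences can be seen in a minimal sketch of application-side sharding. Everything here is an assumption made for illustration: the shards are in-memory dictionaries standing in for separate database servers, and `NUM_SHARDS`, `save_article` and `views_by_section` are hypothetical names. The point is only that an aggregate a single SQL statement would express now has to be written in client code.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 4  # assumed cluster size for this sketch

# Each "shard" is an in-memory dict standing in for a separate database server.
shards = [dict() for _ in range(NUM_SHARDS)]


def shard_for(key: str) -> dict:
    """Route a key to a shard with a stable hash (crc32 is the same on every run)."""
    return shards[zlib.crc32(key.encode("utf-8")) % NUM_SHARDS]


def save_article(article_id: str, section: str, views: int) -> None:
    shard_for(article_id)[article_id] = {"section": section, "views": views}


def load_article(article_id: str) -> dict:
    return shard_for(article_id)[article_id]


def views_by_section() -> dict:
    """Client-side replacement for SELECT section, SUM(views) ... GROUP BY section."""
    totals = defaultdict(int)
    for shard in shards:  # scatter: visit every shard
        for row in shard.values():
            totals[row["section"]] += row["views"]  # gather: merge partial sums
    return dict(totals)


save_article("a1", "news", 10)
save_article("a2", "sport", 5)
save_article("a3", "news", 7)
print(views_by_section())
```

Lookups by key stay cheap because they touch one shard, while the aggregate has to visit all of them and merge the partial results in the application.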

It should be noted that the vendors of relational database management systems are aware of all these problems and have already started to offer scalable cluster solutions. However, the cost of implementing and maintaining such solutions often does not pay off.

NoSQL began to gain popularity in 2009 in connection with the emergence of a large number of web start-ups whose most important task was to maintain consistently high storage throughput while the volume of data grew without limit. NoSQL does not mean rejecting all the principles of the relational model. Moreover, the term NoSQL was first used in 1998 to describe a relational database that did not use SQL.

The main features of NoSQL:

  1. Elimination of unnecessary complexity.
  2. High throughput.
  3. Unlimited horizontal scaling.
  4. Consistency sacrificed for the sake of performance.

Picture 1 – SQL and NoSQL movement
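
The trade-off between consistency and performance named in the list of features can be sketched as follows. This is a toy model written under stated assumptions, not the behavior of any particular NoSQL product: the class `EventuallyConsistentStore`, its single primary, its replica count and the explicit `replicate` step are all hypothetical.

```python
from collections import deque


class EventuallyConsistentStore:
    """Toy key-value store: one primary, several replicas, asynchronous replication."""

    def __init__(self, replica_count: int = 2):
        self.primary = {}
        self.replicas = [dict() for _ in range(replica_count)]
        self.pending = deque()  # writes not yet applied to the replicas

    def write(self, key, value):
        # Acknowledge as soon as the primary holds the value (the fast path).
        self.primary[key] = value
        self.pending.append((key, value))

    def read(self, key, replica: int = 0):
        # Reads are served by a replica, so they may return stale data.
        return self.replicas[replica].get(key)

    def replicate(self):
        # Background step: apply queued writes to every replica in order.
        while self.pending:
            key, value = self.pending.popleft()
            for replica in self.replicas:
                replica[key] = value


store = EventuallyConsistentStore()
store.write("article:1:views", 100)
stale = store.read("article:1:views")  # the replica has not caught up yet
store.replicate()
fresh = store.read("article:1:views")  # the replicas have converged
print(stale, fresh)
```

The write returns without waiting for the replicas, which keeps latency low; the price is that a read issued before replication completes sees the old state (here `None`) instead of the new value.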

Today there are a large number of NoSQL solutions[2].

MapReduce is an approach to data processing that has two major advantages over traditional solutions. The first and most important is performance: in theory, MapReduce work can be parallelized, which makes it possible to process huge amounts of data across many cores, processors or machines. The second is the ability to describe the processing in ordinary code. Compared with what can be done in SQL, code inside MapReduce is much more expressive and extends what is possible even without the use of specialized solutions. Implementations are written in C#, Ruby, Java and Python[9].
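
A minimal sketch of the idea in plain Python is given below, using the classic word-count task. It is not tied to Hadoop or any other framework; the function names and the sample documents are assumptions made for this example, and the thread pool only stands in for the cluster of machines over which a real system would distribute the map tasks.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(mapped_chunks):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for chunk in mapped_chunks:
        for key, value in chunk:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values collected for one key."""
    return key, sum(values)


def word_count(documents, workers: int = 4):
    # Map calls do not depend on each other, so they can run in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, documents))
    groups = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in groups.items())


docs = ["big data big load", "data store", "big store"]
counts = word_count(docs)
print(counts["big"], counts["data"], counts["store"], counts["load"])
```

Because each map call sees only its own split and each reduce call sees only the values for one key, both phases can be spread over as many workers as are available; the map and reduce bodies are ordinary code and can contain logic that would be awkward to express in SQL.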

Conclusion

As a result of the research, the major approaches to designing highly loaded databases were identified and existing highly loaded systems were analyzed. As part of the master's work, it is proposed to develop a highly loaded web-based system and to design a data store with high throughput that can accommodate unlimited growth in the volume of data.

This master's work is not completed yet. Final completion: December 2013. The full text of the work and materials on the topic can be obtained from the author or his supervisor after this date.

References

  1. N. A. Mendkovich, S. D. Kuznetsov. Review of the development of lexical query optimization methods. Proceedings of the Institute for System Programming, vol. 23, Moscow, ISP RAS, 2012, pp. 195-214 (in Russian).
  2. P. A. Klemenkov, S. D. Kuznetsov. Big data: modern approaches to storage and processing. Proceedings of the Institute for System Programming, vol. 23, Moscow, ISP RAS, 2012, pp. 143-158 (in Russian).
  3. Tom White. Hadoop: The Definitive Guide, 3rd Edition. O'Reilly Media, 2012, 688 p.
  4. D. Volkov. Open Systems. DBMS. Moscow, 2012, no. 02, ISSN 1028-7493 (in Russian).
  5. Rick Cattell. Scalable SQL and NoSQL Data Stores. SIGMOD Record, December 2010, vol. 39, no. 4.
  6. Mark A. Beyer, Douglas Laney. The Importance of Big Data: A Definition. 21 June 2012.
  7. Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber. Bigtable: a distributed storage system for structured data. Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation, vol. 7, pp. 15-15, USENIX Association, Berkeley, CA, USA, 2006.
  8. O. V. Pashinin. Query optimization for databases. Mathematical Structures and Modeling, 2007, issue 17, pp. 100-107 (in Russian).
  9. Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler. The Hadoop Distributed File System. MSST '10: Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), 2010, pp. 1-10.
  10. P. Hunt, M. Konar, F. P. Junqueira, B. Reed. ZooKeeper: wait-free coordination for internet-scale systems. USENIX ATC '10: Proceedings of the 2010 USENIX Annual Technical Conference. USENIX Association, Berkeley, CA, USA, 2010, pp. 11-11.