
Abstract

Note! This abstract refers to a work that has not yet been completed.
Estimated completion date: June 2018.
Contact the author or his scientific adviser after that date to obtain the complete text.


Introduction

Over the past 10 years, Big Data has been one of the most actively discussed topics among the world's IT companies. Today, big data is one of the driving forces of information technology development, because an enormous amount of information has accumulated about Internet users.

The term Big Data causes considerable disagreement. Many assume it refers only to the volume of accumulated information, but the technical side should not be forgotten either: the field also includes storage technologies, computation, and services.

The range of applications of Big Data technologies is extensive. For example, Big Data can be used to learn about customer preferences, measure the effectiveness of marketing campaigns, or conduct risk analysis. It is most widely used in trade, healthcare, telecommunications, financial companies, and public administration.

By using this technology, retail stores can accumulate a great deal of information about customers, inventory management, and the supply of goods. The information obtained makes it possible to predict demand and deliveries and to optimize costs.

In financial companies, big data makes it possible to analyze the creditworthiness of a borrower: based on the identified cash flow, favorable and optimal credit conditions can be selected and additional suitable banking services offered. This approach significantly reduces the time needed to consider applications.

Mobile operators, like financial organizations, have huge databases, which allows them to conduct a detailed analysis of the accumulated information. In addition to using Big Data to provide high-quality services, the technology can be used to detect and prevent fraud.

Enterprises in the mining and fuel-and-energy industries can accumulate information on the quantity of extracted products and, on the basis of these data, draw conclusions about the effectiveness of field development, monitor the state of equipment, and forecast demand for products.

All of the above uses of big data technology require some degree of information protection. For example, a financial company that has just started its business can suffer considerable material damage if a competing firm gains access to its accumulated or processed data. But the greatest damage can be done to fuel and energy enterprises, which are directly related to the state, if they make no attempt to protect their information [1].

1. Relevance of the topic

The relevance of the work stems from the fact that the big data processed by a distributed system can be:

  • confidential;
  • processed by third-party providers that offer cloud infrastructure as a service (IaaS), for example Amazon EC2, Google Compute Engine, Microsoft Azure, etc.

This requires a number of decisions and measures to provide multi-level data protection, with the ability to add or remove a particular level depending on the network infrastructure and the data being processed for a specific task.

2. Goals and objectives, planned practical results

The purpose of the master's work is to study the existing methods and means of protecting information in a distributed system.

The main objectives of the study:

  1. Analyze threats to information security in distributed information systems and methods for their prevention. Identify deficiencies in existing solutions for protecting sensitive data during distributed processing.
  2. Explore the approach to developing distributed algorithms using the Hadoop framework as an example, and analyze the MapReduce model from a security point of view. Study the process of developing an application and then deploying it in a cloud infrastructure.

Object of research: methods and means of information protection.

As part of this work, it is required to:

  • explore Hadoop;
  • analyze and configure security tools for Hadoop in a cloud environment;
  • as a practical result, design and develop a working prototype (Minimum Viable Product): an administrative panel that allows users to use distributed processing capabilities with built-in security features that protect the user's confidential information.

3. Review of research and development

The topic of information protection in distributed information systems, which may reside in a cloud infrastructure, is popular not only in Western but also in national scientific communities.

3.1 Overview of international sources

There are many books and publications of foreign authors on the topic of information protection in distributed systems.

For example, the book Practical Hadoop Security [2] is an excellent guide for system administrators who are going to deploy Hadoop in a production environment and provide protection for this cluster.

The article Review on Big Data Security in Hadoop [3] describes security risks in the Hadoop file system, shows how you can encrypt/decrypt data in HDFS.

The article A Survey on the Data Security System for Cloud Using Hadoop [4] provides a brief overview of the security of Hadoop: a description of the operation of the Kerberos authentication protocol.

3.2 Overview of national sources

In the Russian-speaking scientific community, the following publications on data security can be identified.

The book Protection of information in computer systems and networks [5] is devoted to methods and means of multi-level information protection in computer systems and networks. In this book, the basic concepts of information security are formulated and threats to information security are analyzed. Particular attention is paid to international and domestic standards of information security.

In the book Information Security: Protection and Attack [6], both technical information describing attacks and protection from them, as well as recommendations for organizing the process of ensuring information security are given. Practical examples for the organization of protection of personal data are considered.

In the book Protection of computer information from unauthorized access [7], issues of protection of computer information from unauthorized access to computers within the network are considered. Particular attention is paid to models and mechanisms for managing access to resources, as well as architectural principles for building a protection system.

In the article Investigating the mechanisms for providing secure access to data located in the cloud infrastructure [8], a study was conducted that allows more detailed understanding of the security issues that are encountered in the design of the architecture of cloud environments.

In the article Some Aspects of Information Security in a Distributed Computer System [9], the architecture of a distributed computer system is considered. Particular attention is paid to the information security feature.

3.3 Overview of local sources

Among the publications of DonNTU master's students, the following can be identified.

In the article Analysis of security problems in the architecture of distributed NoSQL applications using the Hadoop framework software example [10], V. Chuprin identified the main characteristics of storage systems for processing large amounts of data, analyzed the features of the architecture of distributed applications using the example of the Hadoop framework, and suggested recommendations for optimizing the security subsystem based on the problems presented.

The work of N. Vorotyntsev, Study of the Approach to Using Distributed Modules to Ensure Information Protection [11], describes the concepts in the field of computer networks and distributed systems.

4. Analysis of the safety of a distributed computing model

Data processing in distributed systems is based on the MapReduce model. The main advantage of this model is simple scalability across multiple computing nodes. MapReduce consists of two main steps: Map and Reduce [12].

The map step performs preliminary processing of the input data. To do this, one of the main nodes (usually called the master or leader node) receives the input data of the task and divides it into independent parts. For example, a log file containing 1000 lines can be divided into 10 parts of 100 lines each. After the data is split, the parts are transferred to worker nodes (slave or follower nodes) for further processing.

At the reduce step, the processed data is aggregated. The node responsible for solving the task receives the responses from the worker nodes, and the final result is formed from them.

In order for all components of MapReduce to perform calculations correctly and jointly, an agreement on a single structure for the data being processed has to be adopted. This structure should be flexible and general enough to meet the needs of most data processing applications. MapReduce uses lists and key/value pairs as its basic primitives. Keys and values can be integers, strings, or compound objects, some of whose fields can be ignored in further processing [13].
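The Map and Reduce steps described above can be sketched in Python as a minimal, single-process imitation of the model (a real Hadoop job would distribute the map and reduce calls across nodes; the word-count task here is only an illustration):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map: emit a (key, value) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle: group intermediate pairs by key, as the framework would.
    grouped = groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))
    # Reduce: collapse each group of values into a single result per key.
    for key, group in grouped:
        yield (key, sum(value for _, value in group))

lines = ["big data needs protection", "big data needs processing"]
result = dict(reduce_phase(map_phase(lines)))
print(result)  # {'big': 2, 'data': 2, 'needs': 2, 'processing': 1, 'protection': 1}
```

Because each map call depends only on its own input part, the framework is free to run the map phase on any number of worker nodes and only the grouped intermediate pairs cross the network.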

Figure 1 shows a simplified diagram of the data flow in the MapReduce model [14].


Figure 1 – MapReduce data flow

As can be seen in Figure 1, this model has many data transfer points and therefore requires some protection of information. For example, when data is transferred over the network after grouping by key, an attacker can add or remove processed data and thereby corrupt the overall result of the task. The situation may be aggravated when processing takes place not in a private local network but in the infrastructure of third-party providers. One obvious and simple measure is to separate confidential data (the user's name and login) from the data being processed (the number of loans taken, etc.). In this case, a hash value of the confidential data can be used as the key. But this approach does not solve the problem if the values of the data being processed are themselves secret. In such a situation, the transmitted data must be encrypted and decrypted with symmetric algorithms during processing by a specific node.
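The key-pseudonymization idea can be illustrated with a minimal sketch using Python's standard hashlib. The record fields and the salt value are hypothetical; in practice the processed values would additionally be protected with a symmetric cipher such as AES, which is omitted here:

```python
import hashlib

SALT = b"per-deployment-secret"  # hypothetical salt, kept outside the cluster

def pseudonymize(login: str) -> str:
    # Replace the confidential identifier with a salted SHA-256 digest,
    # so worker nodes see only an opaque key instead of the real login.
    return hashlib.sha256(SALT + login.encode("utf-8")).hexdigest()

record = {"login": "ivanov", "loans_taken": 3}  # hypothetical input record
safe_record = (pseudonymize(record["login"]), record["loans_taken"])

# The same login always maps to the same key, so grouping by key still works,
# while the login itself never leaves the client.
assert pseudonymize("ivanov") == safe_record[0]
print(safe_record)
```

Since the digest is deterministic, the shuffle/grouping step of MapReduce behaves exactly as before; only the meaning of the key is hidden from the worker nodes.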

5. Analysis of existing protection tools for distributed systems

Many document-oriented databases already support SSL/TLS encryption built into their distributions. For example, CouchDB, starting with version 1.3, supports the HTTPS protocol out of the box (with certain settings) [15]. MongoDB also lets you choose a distribution version either with or without SSL/TLS support [16]. In addition, the commercial version (MongoDB Enterprise Server) provides extra security features: encryption of data at rest, integration with the LDAP protocol, and Kerberos authentication [17]. For other NoSQL databases that do not support built-in SSL/TLS encryption, an SSL tunnel or VPN can be used within your own (trusted) local area network. When using the services of a cloud provider, such as BaaS (Backend as a Service), it is not known how the network infrastructure behind the reverse proxy server is protected. If data transfer and data storage on the server are not protected by additional means, this significantly increases the likelihood of the following risks:

  • data leaks;
  • substitution of data during its processing;
  • complete or partial destruction of data.
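Enabling the built-in TLS support mentioned above is usually a matter of a few configuration lines. As a sketch for a MongoDB 3.x build compiled with SSL support, mongod.conf might look like this (the certificate file paths are hypothetical placeholders and must point at real key material):

```yaml
# mongod.conf (sketch): require TLS on all client connections
net:
  port: 27017
  ssl:
    mode: requireSSL            # reject plain-text connections
    PEMKeyFile: /etc/ssl/mongodb.pem   # hypothetical server certificate + key
    CAFile: /etc/ssl/ca.pem            # hypothetical CA used to verify clients
```

With `mode: requireSSL`, unencrypted clients are refused, which removes the plain-HTTP data transfer point discussed above, but it does not by itself protect data at rest.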

Figure 2 shows an example of a client's interaction with a cloud storage service.

Interaction of the client with the database server through a reverse proxy server

Figure 2 – Interaction of the client with the database server through a reverse proxy server

Like all distributed systems, Hadoop uses a network to communicate between nodes. HTTP is used as the default data transfer protocol, but HTTPS support can also be configured [18]. Hadoop allows you to encrypt data transferred between nodes, and in addition it has solutions designed to protect data using a fine-grained authorization infrastructure.
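The transport-protection settings mentioned above are ordinary Hadoop configuration properties. A minimal sketch (the property names exist in standard Hadoop distributions; the appropriate values depend on the cluster and Hadoop version):

```xml
<!-- core-site.xml: protect RPC traffic between nodes -->
<property>
  <name>hadoop.rpc.protection</name>
  <value>privacy</value> <!-- authentication, integrity and encryption -->
</property>

<!-- hdfs-site.xml: encrypt the HDFS block data transfer protocol -->
<property>
  <name>dfs.encrypt.data.transfer</name>
  <value>true</value>
</property>

<!-- hdfs-site.xml: serve the web interfaces over HTTPS only -->
<property>
  <name>dfs.http.policy</name>
  <value>HTTPS_ONLY</value>
</property>
```

These settings address the transfer of data over the network; access control within the cluster is handled by the separate solutions discussed below.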

The Sentry solution supports the previously created role-based access control model (RBAC), which functions on top of the data presentation layer. The RBAC model has a number of functions designed to protect a corporate big data environment. The first is secure authentication, which provides mandatory control over data access for authenticated users. Users are assigned roles and are then given the appropriate authority to access data. This approach makes the model easier to scale by dividing users into categories according to their roles. Another function allows the administration of user credentials to be distributed among several administrators at the schema level or at the database level. Sentry also implements authentication using the Kerberos protocol integrated into Hadoop.

Project Rhino is an open source project developed by Intel. It was created to improve the Hadoop platform: to provide additional protection mechanisms. The main objective of this project is to eliminate security holes in the Hadoop stack and ensure security at all levels within the Hadoop ecosystem. To this end, Intel is developing security in several areas and is focused on cryptographic capabilities.

Among all the work performed within Project Rhino, the most interesting are the new capabilities for encrypting/decrypting files within several usage models. For example, adding a general abstraction layer for cryptographic codecs provides an API that allows several such codecs to be registered and used in a given environment. To support this capability, a corresponding environment for the distribution and management of keys is being developed.

Apache Knox Gateway is a solution for protecting the Hadoop perimeter. Unlike Sentry, which provides fine-grained control over data access, Knox Gateway controls access to Hadoop platform services. The goal of Knox Gateway is to provide a single point of secure access to Hadoop clusters. The solution is implemented as a gateway that provides access to Hadoop clusters via a REST API [19].

Conclusion

Currently, accumulated information is of great value. With the advent of global computer networks (the Internet), access to information has been greatly simplified, which has increased the threat of data security breaches when no protective measures are in place.

As part of the master's work, it is planned to analyze the distributed data processing model (MapReduce) from a security point of view, analyze the existing protection tools for distributed systems, and assess their effectiveness.

When designing data protection for distributed systems, it must be taken into account that, on the one hand, the system must reliably store sensitive data and, on the other, support multi-level protection with the ability to add or remove a particular level depending on the network infrastructure and the data being processed.

References

1. Аналитический обзор рынка Big Data // Хабрахабр. [Electronic resource]. – Access mode: https://habrahabr.ru/company/moex/blog/256747/
2. Practical Hadoop Security // Amazon. [Electronic resource]. – Access mode: https://www.amazon.com/Practical-Hadoop-Security-Bhushan-Lakhe/dp/1430265442
3. Review on Big Data Security in Hadoop // International Journal Of Engineering And Computer Science. [Electronic resource]. – Access mode: https://www.ijecs.in/issue/v3-i12/28%20ijecs.pdf
4. A Survey on Data Security System for Cloud Using Hadoop // International Journal of Innovative Research in Computer and Communication Engineering. [Electronic resource]. – Access mode: https://www.ijircce.com/upload/2016/november/164_A%20SURVEY.pdf
5. Защита информации в компьютерных системах и сетях // Ozon. [Electronic resource]. – Access mode: https://www.ozon.ru/context/detail/id/28336100/
6. Информационная безопасность. Защита и нападение // Ozon. [Electronic resource]. – Access mode: https://www.ozon.ru/context/detail/id/139249153/
7. Защита компьютерной информации от несанкционированного доступа // Ozon. [Electronic resource]. – Access mode: http://www.ozon.ru/context/detail/id/17981339/
8. Исследование механизмов обеспечения защищенного доступа к данным, размещенным в облачной инфраструктуре // Cyberleninka. [Electronic resource]. – Access mode: https://cyberleninka.ru/article/n/issledovanie-mehanizmov-obespecheniya-zaschischennogo-dostupa-k-dannym-razmeschennym-v-oblachnoy-infrastrukture
9. Некоторые аспекты информационной безопасности в распределенной компьютерной системе // Молодой ученый. [Electronic resource]. – Access mode: https://moluch.ru/archive/25/2709/
10. Анализ проблем безопасности архитектуры распределённых NoSQL приложений на примере программного каркаса Hadoop // Портал магистров ДонНТУ. [Electronic resource]. – Access mode: http://masters.donntu.ru/2014/fknt/chuprin/library/_hadoop-security.htm
11. Исследование подхода использования распределенных модулей для обеспечения защиты информации // Портал магистров ДонНТУ. [Electronic resource]. – Access mode: http://masters.donntu.ru/2005/fvti/vorotyntsev/diss/index.htm
12. MapReduce // Википедия. [Electronic resource]. – Access mode: https://ru.wikipedia.org/wiki/MapReduce
13. Чак Лэм. Hadoop в действии. – М.: ДМК Пресс, 2012. – 424 pp., ill.
14. Introduction to MapReduce // sci2s. [Electronic resource]. – Access mode: http://sci2s.ugr.es/BigData#Big%20Data%20Technologies
15. Native SSL Support // CouchDB. [Electronic resource]. – Access mode: http://docs.couchdb.org/en/1.3.0/ssl.html
16. MongoDB Support // MongoDB. [Electronic resource]. – Access mode: https://docs.mongodb.com/v3.2/tutorial/configure-ssl/#mongodb-support
17. MongoDB Download Center // MongoDB. [Electronic resource]. – Access mode: https://www.mongodb.com/download-center#enterprise
18. Sandeep Karanth. Mastering Hadoop. – Packt Publishing, 2014. – 374 pp.
19. Безопасность данных Hadoop и решение Sentry // IBM developerWorks. [Electronic resource]. – Access mode: http://www.ibm.com/developerworks/ru/library/se-hadoop/
20. Егоров А.А., Чернышова А.В., Губенко Н.Е. Анализ средств защиты больших данных в распределенных системах // Первая международная научно-практическая конференция «Программная инженерия: методы и технологии разработки информационно-вычислительных систем» (ПИИВС-2016). Донецк, 2016. – Сборник научных трудов. – ДонНТУ, Том 2, с. 28-33.
21. Егоров А.А., Чернышова А.В. Исследование инструментов распределенной системы Hadoop // Конференция «Современные информационные технологии в образовании и научных исследованиях» (СИТОНИ-2017). Донецк, 2017. – Сборник научных трудов. – ДонНТУ.
