Faculty of Computer Science
Major: “Computer Systems and Networks”
Dissertation topic: “Development of a distributed storage and information protection method”
Scientific adviser: Malcheva Raisa Victorovna, Ph.D.
The relevance of the topic. The rapid development of telecommunication systems and of local and global computer networks has led to a dramatic increase in the capacity of data transmission channels. As a result, the speed of information exchange between nodes in a computer network often exceeds the speed of access to a local hard disk. Beyond this general equalization of speeds, the advantages of a distributed network should be exploited: reliability, scalability, flexibility, and greater security of information owing to the possibility of geographical dispersion. Thus, work aimed at developing a method for distributed storage of information is relevant.
Related academic programs, plans, topics. The master's work is carried out during 2008-2009 in connection with the "Digital City" project. Donetsk National Technical University takes a leading position in this program.
The purpose and objectives of the research. The goal of this work is to create a technology for storing information that provides an adequate level of security and optimizes the use of disk resources.
The idea of the work is to treat a computer network as a single disk space. Information is divided into blocks, which are distributed to nodes for storage.
The main objectives of development and research. To achieve this goal it is necessary to:
Scientific novelty is determined by the following statements:
The practical significance of the results lies in the development of cross-platform software that implements the proposed method of distributed information storage.
Review of related research and developments. The master's works of Babenko and Dosta can be considered related local research; they studied LAN-based computing clusters. At the national level, a GRID infrastructure is being created. Global research has moved considerably further and has produced an entire paradigm, cloud computing, which provides its resources fully transparently to end users, hiding the internal structure and creating the impression of a single system. Leading research on this topic is carried out at the Department of Computer Science at the University of California.
The introduction justifies the relevance of the topic of the master's work and formulates the goal, motivation and objectives of the research, the idea of the work, and its scientific novelty.
The first section contains an analysis of modern methods of storing information, storage optimization methods, and ways of transmitting information in peer-to-peer networks, and formulates the main research task.
A sufficient number of individual data storage solutions are available today; they differ from one another, corresponding to the various requirements placed on a storage system.
In computer storage, logical volume management or LVM is a method of allocating space on mass storage devices that is more flexible than conventional partitioning schemes. In particular, a volume manager can concatenate, stripe together or otherwise combine partitions into larger virtual ones that can be resized or moved, possibly while they are being used.
Volume management is one of many forms of storage virtualization, the one implemented as a layer in the OS disk driver stack.
Most volume manager implementations share the same basic design and start with physical volumes (PVs), which can be hard disk partitions, RAID devices or a SAN. PVs are split into chunks called physical extents (PEs). Some volume managers (such as that in HP-UX and Linux) will have PEs of a uniform size; others (such as that in Veritas) will have variably-sized PEs that can be split and merged at will.
Normally, PEs are simply mapped one-to-one to logical extents (LEs). With mirroring, multiple PEs are mapped onto each LE. These PEs are drawn from a physical volume group (PVG), a set of same-sized PVs which act similarly to hard disks in a RAID1 array. PVGs are usually laid out so that they reside on different disks and/or data buses for maximum redundancy.
The LEs are pooled into a volume group (VG). The pooled LEs can then be concatenated together into virtual disk partitions called logical volumes or LVs. LVs are usable as raw block devices just as disk partitions are: mountable file systems can be created on them, or they can be used as swap storage.
On striped LVs, each successive LE is allocated from a different PV; depending on the size of the LE, this can improve performance on large sequential reads by bringing to bear the combined read throughput of multiple PVs.
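As an illustration of this mapping, the following Python sketch shows how a striped logical volume could translate a logical extent index into a (physical volume, physical extent) pair; the function name and the round-robin policy are illustrative assumptions, not the behaviour of any particular volume manager.

# Illustrative sketch: mapping a logical extent (LE) index of a striped
# logical volume onto a (physical volume, physical extent) pair.
# The names and the round-robin policy are assumptions for illustration.

def le_to_pe(le_index, pv_count):
    """Return (pv_number, pe_number_within_pv) for a striped LV.

    Successive LEs are placed on successive PVs, so a large sequential
    read touches all PVs and can use their combined throughput.
    """
    pv_number = le_index % pv_count    # round-robin across PVs
    pe_number = le_index // pv_count   # position inside that PV
    return pv_number, pe_number

if __name__ == "__main__":
    # With 3 PVs, LEs 0..5 land on PV0, PV1, PV2, PV0, PV1, PV2.
    for le in range(6):
        print(le, le_to_pe(le, 3))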
The LVs can be grown or shrunk by concatenating more LEs from or returning them to the pool. The concatenated LEs do not have to be contiguous. This allows LVs to be grown without having to move already-allocated LEs. Some volume managers allow LVs to be resized in either direction while online. Changing the size of the LV does not necessarily change the size of a filesystem on it; it merely changes the size of its containing space. A file system that can be resized online is recommended because it allows the system to adjust its storage on-the-fly without interrupting applications.
PVs and LVs cannot be shared between or span different VGs (although some volume managers may allow them to be moved at will between VGs on the same host). This allows VGs to be conveniently brought online, taken offline or moved between host systems as a single administrative unit.
VGs can grow their storage pool by absorbing new PVs or shrink by retracting from PVs. This may involve moving already-allocated LEs out of the PV. Most volume managers can perform this movement online; this allows storage (if the underlying hardware is hot-pluggable) to be upgraded or replaced without system downtime.
Some volume managers also implement snapshots by applying copy-on-write to each LE. In this scheme, the volume manager will copy the LE to a copy-on-write table just before it is written to. This preserves an old version of the LV (the snapshot) which can later be reconstructed by overlaying the copy-on-write table atop the current LV. Snapshots which are read-write are branching snapshots because they implicitly allow diverging versions of an LV.
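The copy-on-write mechanism can be sketched as follows; this is a simplified in-memory model with invented class and method names, not the implementation of any real volume manager.

# Simplified copy-on-write snapshot of a logical volume, modelled as a
# list of extents. Before an extent is overwritten, its old contents are
# saved into a copy-on-write (CoW) table; the snapshot is reconstructed
# by overlaying that table on the current volume.

class SnapshottedVolume:
    def __init__(self, extents):
        self.extents = list(extents)   # current LV contents
        self.cow_table = {}            # extent index -> pre-write contents

    def write(self, index, data):
        # Preserve the original extent only on the first write after the snapshot.
        if index not in self.cow_table:
            self.cow_table[index] = self.extents[index]
        self.extents[index] = data

    def snapshot_view(self):
        # Overlay the CoW table on the current extents to recover the old version.
        return [self.cow_table.get(i, e) for i, e in enumerate(self.extents)]

vol = SnapshottedVolume(["a", "b", "c"])
vol.write(1, "B")
print(vol.extents)          # ['a', 'B', 'c']  -- current volume
print(vol.snapshot_view())  # ['a', 'b', 'c']  -- state at snapshot time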
Snapshots can be useful for backing up self-consistent versions of volatile data like table files from a busy database, or for rolling back large changes in one swoop, such as an operating system upgrade. Some Linux-based Live CD systems also use snapshots to simulate read-write access on a read-only compact disc.
The LVM can:
There is a need to unify all the resources of the information infrastructure into a single environment with an interface that is transparent to the user. A cluster system can serve as a prototype: computers connected by a network transparently offer their combined computing power as if the user were working with a single machine. The connection can be provided either by a local network (LAN) or by a global one. A cluster, in general, provides the CPU time of all its nodes for mathematical calculations. The challenge is to identify and improve, or to build from scratch, a mechanism for storing files in a similar cluster-like structure.
The basis for constructing such a scheme is the concept of information distribution used in P2P networks.
A peer-to-peer (P2P, or decentralized) network is a computer network based on the equality of its participants. In such networks there are no dedicated servers, and each node (peer) acts as both a client and a server. Unlike the client-server architecture, this organization keeps the network operating with any number and any combination of available nodes.
A peer-to-peer (or P2P) computer network uses diverse connectivity between participants in a network and the cumulative bandwidth of network participants rather than conventional centralized resources where a relatively low number of servers provide the core value to a service or application.
An important goal in P2P networks is that all clients provide resources, including bandwidth, storage space, and computing power. Thus, as nodes arrive and demand on the system increases, the total capacity of the system also increases. This is not true of a client-server architecture with a fixed set of servers, in which adding more clients could mean slower data transfer for all users.
The distributed nature of P2P networks also increases robustness in case of failures by replicating data over multiple peers, and—in pure P2P systems—by enabling peers to find the data without relying on a centralized index server. In the latter case, there is no single point of failure in the system.[3]
This nature helps with sharing extremely large files, since every downloader also uploads, so different "pieces" of the file are sent to other users at the same time. However, depending on the type of network, P2P systems are also vulnerable. As with most network systems, insecure and unsigned code may allow remote access to files on a victim's computer or even compromise the entire network. In the past this happened, for example, to the FastTrack network, when anti-P2P companies managed to introduce faked chunks into downloads, so that the downloaded files (mostly mp3 files) were unusable afterwards or even contained malicious code. As a result of this and similar tactics, today's P2P networks have greatly strengthened their security and file verification mechanisms. Modern hashing, chunk verification and various encryption methods have made most networks resistant to almost any type of attack, even when major parts of the respective network have been replaced by faked or nonfunctional hosts.
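The chunk verification mentioned above can be illustrated by a short Python sketch; the choice of SHA-256 and the function names are assumptions made only for illustration.

# Illustrative chunk verification: a peer accepts a downloaded chunk only
# if its SHA-256 digest matches the digest published in the file metadata.
import hashlib

def chunk_digest(chunk: bytes) -> str:
    return hashlib.sha256(chunk).hexdigest()

def verify_chunk(chunk: bytes, expected_digest: str) -> bool:
    # A faked or corrupted chunk fails this check and can be re-requested
    # from another peer instead of being written into the file.
    return chunk_digest(chunk) == expected_digest

good = b"some chunk of the shared file"
meta = chunk_digest(good)                    # published by the source
print(verify_chunk(good, meta))              # True
print(verify_chunk(b"tampered data", meta))  # False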
The second section presents the algorithm of distributed file storage, described as a sequence of steps.
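As a rough illustration of the underlying idea stated above (splitting a file into blocks and handing them to storage nodes), a minimal Python sketch follows; the block size, the round-robin placement and the node interface are hypothetical and do not reproduce the exact algorithm of the thesis.

# Sketch of the general idea only: split a file into fixed-size blocks
# and hand each block to one of the storage nodes. split_file, distribute
# and the Node.store() interface are hypothetical names.

BLOCK_SIZE = 64 * 1024  # assumed block size

def split_file(path):
    """Yield (block_number, block_bytes) for the given file."""
    with open(path, "rb") as f:
        number = 0
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            yield number, block
            number += 1

def distribute(path, nodes):
    """Send blocks to nodes round-robin and return a placement table."""
    placement = {}
    for number, block in split_file(path):
        node = nodes[number % len(nodes)]
        node.store(path, number, block)  # hypothetical node interface
        placement[number] = node
    return placement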
The third section describes possible ways of cryptographic protection.
In cryptography, the one-time pad (OTP) is an encryption algorithm in which the plaintext is combined with a secret random key or pad as long as the plaintext and used only once. A modular addition is typically used to combine plaintext elements with pad elements. (For binary data, the operation XOR amounts to the same thing.) It was invented in 1917 and patented a couple of years later. If the key is truly random, never reused in whole or in part, and kept secret, the one-time pad provides perfect secrecy. It has also been proven that any cipher with the perfect secrecy property must use keys with effectively the same requirements as OTP keys. The key normally consists of a random stream of numbers, each of which indicates the number of places in the alphabet (or number stream, if the plaintext message is in numerical form) by which the corresponding letter or number in the plaintext message should be shifted. For messages in the Latin alphabet, for example, the key will consist of a random string of numbers between 0 and 25; for binary messages the key will consist of a random string of 0s and 1s; and so on.
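For binary data the scheme reduces to a bitwise XOR of the message with an equally long key, as the following Python sketch shows; os.urandom is used here only as a stand-in for the truly random source that a real one-time pad requires.

# One-time pad over binary data: ciphertext = plaintext XOR key,
# plaintext = ciphertext XOR key. The key must be truly random, as long
# as the message, kept secret and never reused.
import os

def xor_bytes(data: bytes, key: bytes) -> bytes:
    assert len(key) >= len(data), "pad must be at least as long as the message"
    return bytes(d ^ k for d, k in zip(data, key))

message = b"ATTACK AT DAWN"
pad = os.urandom(len(message))        # the one-time pad
ciphertext = xor_bytes(message, pad)
recovered = xor_bytes(ciphertext, pad)
assert recovered == message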
The "pad" part of the name comes from early implementations where the key material was distributed as a pad of paper, so the top sheet could be easily torn off and destroyed after use. For easy concealment, the pad was sometimes reduced to such a small size that a powerful magnifying glass was required to use it. Photos accessible on the Internet show captured KGB pads that fit in the palm of one's hand,[13] or in a walnut shell.[14] To increase security, one-time-pads were sometimes printed onto sheets of highly flammable nitrocellulose.
The one-time pad is derived from the Vernam cipher, named after Gilbert Vernam, one of its inventors. Vernam's system was a cipher that combined a message with a key read from a paper tape loop. In its original form, Vernam's system was not unbreakable because the key could be reused. One-time use came a little later, when Joseph Mauborgne recognized that if the key tape was totally random, cryptanalytic difficulty would be increased.
There is some ambiguity to the term due to the fact that some authors use the term "Vernam cipher" synonymously for the "one-time-pad", while others refer to any additive stream cipher as a "Vernam cipher", including those based on a cryptographically secure pseudorandom number generator (CSPRNG).[15]
The fourth section develops fault-tolerance features.
The method used in ECC memory can correct single errors and detect double errors on the fly. A similar approach could be used to protect information blocks against loss caused by node failures.
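As an illustration of the principle, the following Python sketch implements an extended Hamming (SECDED) code over 4 data bits, the family of codes used in ECC memory: it corrects a single flipped bit and detects a double error. Applying such a code across blocks stored on different nodes is only an illustration, not the exact scheme of the thesis.

# Extended Hamming (SECDED) code over 4 data bits: the 8-bit codeword
# allows correction of any single bit error and detection of double errors.

def encode(d):                       # d = [d1, d2, d3, d4], bits 0/1
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    code = [p1, p2, d[0], p3, d[1], d[2], d[3]]   # codeword positions 1..7
    p0 = 0
    for bit in code:
        p0 ^= bit                    # overall parity for double-error detection
    return code + [p0]

def decode(c):                       # c = 8 received bits
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3  # position (1..7) of a single error
    overall = 0
    for bit in c:
        overall ^= bit               # 0 if overall parity still holds
    if syndrome and overall:         # single error: correct it
        c = list(c)
        c[syndrome - 1] ^= 1
    elif syndrome and not overall:   # double error: detected, not correctable
        raise ValueError("double error detected")
    return [c[2], c[4], c[5], c[6]]  # recovered data bits

data = [1, 0, 1, 1]
word = encode(data)
word[5] ^= 1                         # one bit flipped in storage or transit
assert decode(word) == data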
The fifth section contains information about effective practical use of the technology being developed.
Important notice! This work is still in progress. ETA is December 2009. Feel free to contact the author or scientific adviser for more information.