Introductory part:
During my master's degree
work I will study algorithms for processing large data arrays within tight
time constraints, and will analyse architectures such as bus, pipeline,
vector, vector-pipeline, three-dimensional mesh (matrix), hypercube,
hierarchical, cluster, dataflow, and a number of others.
I
will design a system that, in the course of computation, automatically
reconfigures its architecture to match the structure of the problem being
solved or of its fragment, and I will compare the efficiency of the algorithms
intended for this purpose. The field in which I would like to apply these
results is local networks (preferably cellular networks). In particular, I want
to synthesize a filter capable of encoding and decoding a digital signal on a
multiprocessor system built on a reconfigurable element base.
Throughout the history of computing,
attempts have been made to find a general classification under which all
possible directions of development of computer architectures would fall.
None of these classifications has covered the whole variety of architectural
solutions that have appeared, and none has stood the test of time. Nevertheless,
a number of terms have entered scientific usage and are widely used; they are
useful to know not only for computer developers but also for users.
Any computer system (whether a supercomputer
or a personal computer) achieves its highest performance through the use of
high-speed components and the concurrent execution of a large number of operations.
It is precisely the ability of the different devices of the system to work in
parallel (with overlap) that is the basis for speeding up the basic operations.
Parallel computers are often subdivided,
according to Flynn's classification, into machines of the SIMD type (Single
Instruction Multiple Data: one instruction stream over multiple data streams)
and of the MIMD type (Multiple Instruction Multiple Data: multiple instruction
streams over multiple data streams). Like any other, this classification is
imperfect: there are machines that do not fit into it directly, and there are
important characteristics it does not take into account. In particular, vector
processors are often assigned to the SIMD class, although their high performance
depends on another form of parallelism: the pipelined organization of the machine.
Multiprocessor vector systems, such as the Cray Y-MP, consist of several vector
processors and can be regarded as MSIMD (Multiple SIMD) machines.
Flynn's classification also does not
distinguish other characteristics that matter for computational models, for
example the granularity of parallel computations and the methods
of synchronization.
Basic part:
The field of signal processing
requires, first of all, considerable computing power (often in single or double
precision floating-point format), together with the ability to exchange the
results of such computations rapidly. Applications with these requirements
include submarine sonar, seismographic surveys, satellite image processing, and
complex geophysical and scientific systems. All of these applications demand
not only high computing power but also compliance with strict real-time
constraints.
Such high-performance
systems usually have an architecture somewhat different from that of
workstations and other machines. The need to process large multidimensional
data arrays, and the dependence of the size of signal-processing algorithms on
the degree of parallelism, mean that the problems in question are solved by
devices whose processors are united in a network and exchange information over
the most varied communication channels. The best-known architectures of this
type are hypercube, systolic, and multichannel systems built from DSP
processors. In these DSP processors elementary computing power is combined
with high-performance input/output channels. For example, the SHARC processor
from Analog Devices has six link ports with a bandwidth of 40 MB/s each.
Four basic types of parallel processing system architecture can be
distinguished:
1) Pipeline and vector processing.
Pipelined processing is based on executing an
operation in several separate stages, with the data of one stage passed on to
the next. Performance increases because several operations are executed
simultaneously at different stages of the pipeline. Pipelining is effective only
when the pipeline load is close to full and the rate at which new operands
arrive matches the peak throughput of the pipeline. If delays occur, fewer
operations are executed in parallel and overall performance drops. Vector
operations provide an ideal opportunity for fully loading the computational
pipeline.
When a vector instruction is executed,
the same operation is applied to all elements of a vector (or, more often, to
the corresponding elements of a pair of vectors). Setting up the pipeline for a
particular operation may take some adjustment time, but afterwards operands can
enter the pipeline at the maximum rate allowed by the memory. There are then no
pauses either for fetching a new instruction or for determining the branch of
computation at a conditional jump. Thus, the main principle of computation on a
vector machine is to apply some elementary operation, or a combination of a few
elementary operations, repeatedly to a block of data. Such operations correspond
to small, compact loops in the source program.
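As an illustrative sketch (the loop body and names are mine, not from the original text), the kind of compact loop that maps directly onto vector or pipelined hardware looks like this:

```c
#include <stddef.h>

/* A compact loop of the kind a vectorizing compiler maps onto
 * vector/pipeline hardware: the same elementary operation is applied
 * to corresponding elements of two arrays, with no dependence
 * between iterations, so operands stream through the pipeline. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```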
2) Machines of the SIMD type.
Machines of the SIMD type consist of a large number
of identical processing elements, each with its own memory. All processing
elements in such a machine execute the same program. Obviously, a machine
composed of a large number of processors can deliver very high performance
only on those tasks in which all processors can do the same work. The
computational model of a SIMD machine is very similar to that of a vector
processor: a single operation is applied to a large data block.
Unlike the vector processor, whose operation is
restricted to the pipeline, an array processor (a synonym for most SIMD
machines) can be considerably more flexible. The processing elements of such
machines are general-purpose programmable computers, so the task being solved
in parallel can be quite complex. The typical appearance of this computational
model in the source program is roughly the same as for vector operations:
loops over array elements in which the values produced in one iteration of the
loop are not used in another iteration.
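A minimal sketch of this distinction (the arrays and function names are my own illustration): the first loop has independent iterations and suits the SIMD model, while the second carries a dependence from one iteration to the next and does not.

```c
#include <stddef.h>

/* Independent iterations: each output element depends only on the same
 * index of the input, so all processing elements can work at once. */
void scale_independent(size_t n, const float *in, float *out, float k)
{
    for (size_t i = 0; i < n; i++)
        out[i] = k * in[i];
}

/* Loop-carried dependence: iteration i needs the value produced at
 * i - 1, so the iterations cannot all be executed at the same time. */
void prefix_sum_dependent(size_t n, float *a)
{
    for (size_t i = 1; i < n; i++)
        a[i] = a[i] + a[i - 1];
}
```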
The computational models of vector and matrix (array) computers are so
similar that these computers are often discussed as equivalent.
3) Machines of the MIMD type.
The term "multiprocessor" covers most machines
of the MIMD type and (just as the term "array processor" is applied to machines
of the SIMD type) is often used as a synonym for MIMD machines. In a
multiprocessor system every processing element (PE) executes its program
largely independently of the other processing elements. The processing
elements must, of course, communicate with one another somehow, which makes a
more detailed classification of MIMD machines necessary. In multiprocessors
with shared memory there is a common memory of data and instructions,
accessible to all PEs; the PEs are connected to the shared memory by a common
bus or by an interconnection network. In contrast, in multicomputer systems
(machines with local memory) all memory is divided among the processing
elements and each block of memory is accessible only to the processor it
belongs to; an interconnection network links the processing elements with
each other.
On a MIMD multiprocessor the base model of
computation is a collection of independent processes that occasionally access
shared data. There are many variants of this model. At one end of the spectrum
is the distributed-computation model, in which the program is divided into a
fairly large number of parallel tasks consisting of many subroutines. At the
other end of the spectrum is the dataflow model, in which every operation in
the program can be regarded as a separate process. Such an operation waits for
its input data (operands), which must be passed to it by other processes; once
they have arrived, the operation is executed and the resulting value is passed
to the processes that need it. In dataflow models with coarse and medium
granularity, processes contain a large number of operations and are executed in
a dataflow manner.
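As a hedged sketch of the coarse-grained end of this spectrum (the partial-sum task and all names are my own illustration, assuming POSIX threads), a MIMD program can be split into a few independent tasks, each working on its own slice of the data:

```c
#include <pthread.h>
#include <stdio.h>

#define N      1000000
#define NTASKS 4

static double data[N];
static double partial[NTASKS];

/* Each task sums its own slice of the array, independently of the others. */
static void *sum_slice(void *arg)
{
    long id = (long)arg;
    long lo = id * (N / NTASKS), hi = lo + N / NTASKS;
    double s = 0.0;
    for (long i = lo; i < hi; i++)
        s += data[i];
    partial[id] = s;
    return NULL;
}

int main(void)
{
    pthread_t t[NTASKS];
    for (long i = 0; i < N; i++)
        data[i] = 1.0;
    for (long i = 0; i < NTASKS; i++)
        pthread_create(&t[i], NULL, sum_slice, (void *)i);
    double total = 0.0;
    for (long i = 0; i < NTASKS; i++) {
        pthread_join(t[i], NULL);
        total += partial[i];   /* combine the independent results */
    }
    printf("sum = %f\n", total);
    return 0;
}
```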
4) Multiprocessor machines with SIMD processors.
Many modern supercomputers are multiprocessor systems
in which vector processors or processors of the SIMD type are used as the
processing elements. Such machines belong to the MSIMD class.
Programming languages and the corresponding
compilers for machines of the MSIMD type usually provide language constructs
that allow the programmer to describe "coarse-grained" parallel tasks. Within
each task the compiler automatically vectorizes suitable loops. Machines of the
MSIMD type thus make it possible to use the better of the two principles:
vector operations for those parts of the program that suit them, and the
flexible possibilities of the MIMD architecture for the other parts.
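A hedged illustration of this idea in a modern shared-memory setting (OpenMP directives stand in for the language constructs the text mentions; the function and array names are mine): the outer directive describes the coarse-grained parallel task, while the inner one asks the compiler to vectorize the loop.

```c
#include <stddef.h>

/* Outer loop: coarse-grained parallel tasks, one chunk of rows per thread.
 * Inner loop: vectorized by the compiler, one vector operation per row. */
void add_matrices(size_t rows, size_t cols,
                  const float *a, const float *b, float *c)
{
    #pragma omp parallel for
    for (size_t r = 0; r < rows; r++) {
        #pragma omp simd
        for (size_t j = 0; j < cols; j++)
            c[r * cols + j] = a[r * cols + j] + b[r * cols + j];
    }
}
```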
Over the years of development of computing
technology, multiprocessor systems have gone through a number of stages.
Historically, SIMD technology was the first to be mastered. At present, however,
there is steady interest in MIMD architectures. This interest is mainly
determined by two factors:
The MIMD architecture gives great flexibility: with adequate hardware and
software support, MIMD can work as a single-user system providing
high-performance data processing for one application task, as a multiprogrammed
machine executing many tasks in parallel, or as some combination of these
possibilities.
The MIMD architecture can take full advantage of
modern microprocessor technology on the basis of a strict cost/performance
trade-off. In fact, practically all modern multiprocessor systems are built
from the same microprocessors that can be found in personal computers,
workstations, and small multiprocessor servers.
One of the distinguishing features of a multiprocessor computer
system is the interconnection network through which the processors are
connected to each other or to memory. The exchange model is so important for a
multiprocessor system that many performance characteristics and other estimates
are expressed as the ratio of processing time to communication time for the
tasks being solved. There are two basic models of interprocessor communication:
one is based on message passing, the other on the use of shared memory. In a
multiprocessor system with shared memory, one processor writes into a
particular memory cell and another processor reads from that cell. To ensure
data consistency and process synchronization, the exchange is often implemented
on the principle of mutually exclusive access to shared memory, using the
"mailbox" method.
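A minimal sketch of such a mailbox on a shared-memory machine (the structure and names are illustrative, assuming POSIX threads): a mutex serializes access to the shared cell, and a flag tells the reader when a new value is present.

```c
#include <pthread.h>
#include <stdbool.h>

/* A one-slot "mailbox" in shared memory: writers and readers take turns
 * under a mutex, so their accesses to the shared cell never overlap. */
typedef struct {
    pthread_mutex_t lock;
    bool full;          /* true when the slot holds an unread value */
    int value;
} mailbox_t;

/* Deposit a value; returns true only if the slot was empty. */
bool mailbox_put(mailbox_t *mb, int v)
{
    bool ok = false;
    pthread_mutex_lock(&mb->lock);
    if (!mb->full) {
        mb->value = v;
        mb->full = true;
        ok = true;
    }
    pthread_mutex_unlock(&mb->lock);
    return ok;
}

/* Fetch a value into *out; returns true only if the slot was full. */
bool mailbox_get(mailbox_t *mb, int *out)
{
    bool ok = false;
    pthread_mutex_lock(&mb->lock);
    if (mb->full) {
        *out = mb->value;
        mb->full = false;
        ok = true;
    }
    pthread_mutex_unlock(&mb->lock);
    return ok;
}
```

An instance can be initialized statically, e.g. `mailbox_t mb = { PTHREAD_MUTEX_INITIALIZER, false, 0 };`.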
In architectures with local memory,
direct sharing of memory is impossible. Instead, processors access shared data
by passing messages over the interconnection network. The efficiency of the
communication scheme depends on the exchange protocols, the underlying
interconnection networks, and the bandwidth of the memory and of the
communication channels.
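For contrast, a hedged sketch of the message-passing model using MPI (the tag, buffer, and ranks are my own illustration): the data lives in the sender's local memory and reaches the receiver only through an explicit exchange over the network.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* Processor 0 sends a value from its local memory to processor 1. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Processor 1 receives the value over the interconnection network. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```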
Often, and moreover without justification, the costs of
communication in shared-memory machines and vector machines are not taken into
account, because the problems of communication are largely hidden from the
programmer. In reality, however, communication overhead in these machines does
exist and is determined by contention for the buses, the memory, and the
processors. The more processors are added to the system, the more processes
compete for the same data and for the bus, which leads to a state of
saturation. The shared-memory model is very convenient for programming and is
sometimes viewed as a high-level means of assessing the influence of
communication on the operation of the system, even if the underlying system is
in fact implemented with local memory and message passing.
In circuit-switched and packet-switched networks,
as the demands on communication grow, the possibility of network overload must
be taken into account. Here interprocessor communication ties up network
resources: channels, processors, message buffers. The volume of transmitted
data can be reduced by careful functional decomposition of the task and careful
scheduling of the functions executed.
Thus, existing MIMD machines
fall into two basic classes depending on the number of processors they unite,
which in turn determines the way memory is organized and the way the processors
are interconnected.
The first group comprises machines with a common (shared)
main memory, uniting up to a few tens of processors (usually
fewer than 32). The comparatively small number of processors in such machines
makes it possible to have a single centralized shared memory and to connect the
processors and the memory by one bus. If the processors have cache memory of
sufficient size, a high-performance bus and shared memory can satisfy the
memory accesses arriving from several processors. Because there is a single
memory with a uniform access time, such machines are sometimes called UMA
(Uniform Memory Access) machines. This method of organization with a
comparatively small shared memory is currently the most popular.
The second group is made up
of large-scale systems with distributed memory. In order to support a large
number of processors, main memory has to be distributed among them; otherwise
the memory bandwidth may simply not be sufficient to satisfy the requests
coming from a very large number of processors. Naturally, with this approach it
is also necessary to implement communication between the processors.
As the number of processors grows,
it becomes simply impossible to avoid implementing a distributed-memory model
with a high-speed network for processor communication. With the rapid growth of
processor performance and the accompanying tightening of the requirements for
higher memory bandwidth, the system scale (i.e., the number of processors in
the system) for which a distributed memory organization becomes necessary is
shrinking, just as the number of processors that can be supported on a single
shared bus and shared memory is shrinking.
Distributing memory among the
individual nodes of the system has two main advantages. First, it is a
cost-effective way of increasing memory bandwidth, because most accesses can be
performed in parallel to the local memory in each node. Second, it reduces the
latency of access to local memory. These two advantages further reduce the
number of processors for which a distributed-memory architecture makes sense.
Normally the input/output devices, like the memory, are distributed among the
nodes, and in fact a node may consist of a small number (2-8) of processors
interconnected in some other way. Although such clusters of several processors
with memory and a network interface may be quite advantageous in terms of cost
efficiency, this is not very essential to understanding how such a machine
works, so for now we will restrict ourselves to systems with one processor per
node. The main architectural difference that has to be singled out in machines
with distributed memory is how communication is carried out and what the
logical model of memory is.
Conclusion:
The results of
theoretical research and of practical use show that multiprocessor computer systems with programmable architecture and a structurally-procedural organization of computation:
- provide performance close to the peak performance on any class of problems
being solved;
- make it possible to program the architecture, including the direct
communication channels, the sets of macro-operations, the internal high-level
language, and the structure of the distributed memory;
- provide practically linear growth of performance in proportion to the number
of processors operating in parallel;
- provide scalability of the system thanks to original hardware and software
facilities and a modular principle of organization.
All of this opens wide
prospects for creating domestic supercomputers that meet, on the one hand, the
needs of ensuring the information security of our country and, on the other
hand, the demands of competitiveness on the world market.
The key point in implementing both the write-invalidate scheme and the
write-update scheme in multiprocessor systems with a small number of processors
is the use of the bus mechanism to perform these operations. To perform an
update or invalidation, a processor simply acquires the bus and broadcasts on it
the address at which the update or invalidation of data is to take place. All
processors continuously snoop the bus, monitoring the addresses appearing on it.
Each processor checks whether the address appearing on the bus is present in its
cache memory. If it is, the corresponding data in the cache is either
invalidated or updated, depending on the protocol in use. The serial order of
accesses inherent to the bus also enforces a strictly serial order of write
operations: when two processors compete to write to the same cell, one of them
must obtain access to the bus before the other. The processor that obtains
access to the bus first causes the copies in the other processors to be updated
or invalidated. In any case, all writes are performed strictly sequentially.
One conclusion to draw from the analysis of this scheme is that a write to a
shared data element cannot complete until the processor obtains access to the
bus.
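A hedged sketch of the snooping side of this mechanism (the structures, states, and function names are mine, not from the text): every cache controller watches the addresses broadcast on the bus and invalidates or updates its own copy when it finds a match.

```c
#include <stdbool.h>
#include <stdint.h>

#define CACHE_LINES 256

typedef enum { PROTO_INVALIDATE, PROTO_UPDATE } protocol_t;

typedef struct {
    bool     valid;
    uint32_t tag;
    uint32_t data;
} cache_line_t;

/* Called by each cache controller for every address it snoops on the bus.
 * On a matching tag, the copy is either invalidated or updated with the
 * value that accompanies the bus write, depending on the protocol. */
void snoop_bus_write(cache_line_t cache[CACHE_LINES],
                     uint32_t addr, uint32_t new_data, protocol_t proto)
{
    uint32_t index = addr % CACHE_LINES;   /* direct-mapped for simplicity */
    uint32_t tag   = addr / CACHE_LINES;

    if (cache[index].valid && cache[index].tag == tag) {
        if (proto == PROTO_INVALIDATE)
            cache[index].valid = false;    /* drop the stale copy          */
        else
            cache[index].data = new_data;  /* refresh the copy in place    */
    }
}
```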
In addition to invalidating or updating the
corresponding copies of the cache block into which the write was performed, we
must also place the data element in the cache if the write misses. In a
write-through cache, finding the latest value is easy, because all written data
is also sent to memory, from which the most recently written value can always
be fetched (the presence of write buffers may complicate this somewhat).
For a write-back cache, however,
the task of finding the latest value is harder, because that value is most
likely held in some cache rather than in memory. In this case the same snooping
scheme is used as for writes: every processor snoops and monitors the addresses
placed on the bus. If a processor discovers that it holds a modified ("dirty")
copy of the cache block, it must supply that block in response to the read
request and cause the access to main memory to be aborted. Because write-back
caches place far lower demands on memory bandwidth, they are much preferable in
multiprocessors, despite some increase in complexity. Therefore in what follows
we consider the implementation of cache coherence for write-back caches.
The normal cache tags can be used to implement the snooping
process. Moreover, the valid bit mentioned earlier makes invalidation easy to
implement. Read misses, whether caused by an invalidation or by some other
event, are also straightforward, since they simply rely on the snooping
capability. For writes we would additionally like to know whether any other
cached copies of the block exist: if there are none, the write need not be
placed on the bus, which shortens the write and reduces the required bandwidth.
To keep track of whether a block is shared, we
can introduce an additional state bit (shared) associated with every block,
in exactly the same way as was done for the valid and modified (dirty) bits.
By adding this bit, which says whether the block is shared, we can decide
whether a write must generate an invalidation operation in an invalidation-based
protocol, or a broadcast operation when an update protocol is used.
If a write occurs to a block that is in the "shared" state under a
write-invalidate protocol, the cache places an invalidation operation on the
bus and marks the block as private. No further invalidations of this block will
be sent by this processor. The processor holding the sole (exclusive) copy of a
cache block is usually called the "owner" of the cache block.
When a write-update protocol is used,
if a block is in the "shared" state, every write to this block must be
broadcast. In the case of the invalidate protocol, when the invalidation
operation is sent, the state of the block changes from "shared" to "unshared"
(or "private"). Later, if another processor requests this block, the state must
change back to "shared". Since our snooping cache also sees all misses, it
knows when the block is requested by another processor and when its state must
become "shared".
Since every bus transaction checks the address tags of the
cache, it can potentially conflict with the processor's own accesses to the
cache. The number of such potential conflicts can be reduced by one of two
methods: by duplicating the tags, or by using multilevel caches with the
inclusion property, in which the levels closer to the processor are subsets of
the levels farther from it. If the tags are duplicated, processor accesses and
bus snooping can proceed in parallel. Of course, on a miss from the processor
side, the processor must arbitrate with the snooping mechanism in order to
update both sets of tags. Likewise, if the snooping mechanism finds a matching
tag, it must arbitrate for and access both sets of cache tags (to perform the
invalidation or to update the "shared" bit), and possibly also the data array
of the cache in order to find a copy of the block. Thus, with tag duplication
the processor has to stall only if it accesses the cache at the same moment
that the snooping mechanism has found a copy in the cache. Moreover, the
snooping activity is delayed only while the cache is dealing with a miss.
If the processor uses a multilevel cache with the inclusion
property, then every line present in the primary cache is also present in the
secondary cache. Therefore the snooping activity can be directed at the
second-level cache, while most of the processor's activity is directed at the
primary cache. If the snooping mechanism gets a hit in the second-level cache,
it must arbitrate for the primary cache in order to update the state and
possibly fetch the data, which usually stalls the processor. This solution has
been adopted in many modern systems, since a multilevel cache substantially
reduces the bandwidth requirements. Sometimes it may even be useful to
duplicate the tags in the second-level cache, to reduce still further the
number of conflicts between processor activity and the snooping mechanism.
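As a small hedged illustration of the inclusion idea (direct-mapped caches and names of my own), a snoop first probes only the L2 tags, and only on an L2 hit does it disturb the L1 cache and hence the processor:

```c
#include <stdbool.h>
#include <stdint.h>

#define L1_LINES 64
#define L2_LINES 256   /* L2 is larger; with inclusion it covers all of L1 */

typedef struct { bool valid; uint32_t tag; } line_t;

static line_t l1[L1_LINES];
static line_t l2[L2_LINES];

static bool hit(const line_t *c, uint32_t lines, uint32_t addr)
{
    uint32_t i = addr % lines;
    return c[i].valid && c[i].tag == addr / lines;
}

/* With inclusion, a block absent from L2 is guaranteed absent from L1,
 * so most snoops finish after the L2 probe and never arbitrate for L1. */
void snoop_invalidate(uint32_t addr)
{
    if (!hit(l2, L2_LINES, addr))
        return;                              /* inclusion: L1 cannot hold it */
    l2[addr % L2_LINES].valid = false;
    if (hit(l1, L1_LINES, addr))
        l1[addr % L1_LINES].valid = false;   /* processor arbitration here   */
}
```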
There are many variations of cache-coherence schemes in
real systems, depending on whether the scheme is based on invalidation or on
update, whether the cache is built on write-through or write-back principles,
when the update takes place, and whether an "ownership" state exists and how
it is implemented.