Introductory part:
During my master's degree
work I will study algorithms for processing large data arrays within tight
time constraints, and will analyse architectures such as bus, pipeline,
vector, vector-pipeline, three-dimensional mesh (matrix), hypercube,
hierarchical, cluster, dataflow, and a number of others.
I
will design a system that, in the course of computation, automatically
reconfigures its architecture to match the structure of the problem being
solved or of its fragment, and I will compare the efficiency of the algorithms
intended for this purpose. The field in which I would like to apply these
results is local networks (preferably cellular networks). In particular, I want
to synthesize a filter capable of encoding and decoding a digital signal on a
multiprocessor system built on a reconfigurable element base.
Throughout the history of computing,
attempts have been made to find a general classification under which all
possible directions of development of computer architectures would fall.
None of these classifications has covered the whole variety of architectural
solutions that have appeared, and none has stood the test of time. Nevertheless,
a number of terms have entered scientific usage and are widely used; they are
useful to know not only for computer developers but also for users.
Any computer system (whether a supercomputer
or a personal computer) achieves its highest performance through the use of
high-speed components and the concurrent execution of a large number of operations.
It is precisely the ability of the different devices of the system to work in
parallel (with overlap) that is the basis for speeding up the basic operations.
Parallel computers are often subdivided,
according to Flynn's classification, into machines of the SIMD type (Single
Instruction Multiple Data: one instruction stream over multiple data streams)
and of the MIMD type (Multiple Instruction Multiple Data: multiple instruction
streams over multiple data streams). Like any other, this classification is
imperfect: there are machines that do not fit into it directly, and there are
important characteristics it does not take into account. In particular, vector
processors are often assigned to the SIMD class, although their high performance
depends on another form of parallelism: the pipelined organization of the machine.
Multiprocessor vector systems, such as the Cray Y-MP, consist of several vector
processors and can be regarded as MSIMD (Multiple SIMD) machines.
Flynn's classification also does not
distinguish other characteristics that matter for computational models, for
example the granularity of parallel computations and the methods
of synchronization.
Basic part:
The field of signal processing
requires, first of all, considerable computing power (often in single or double
precision floating-point format), together with the ability to exchange the
results of such computations rapidly. Applications with these requirements
include submarine sonar, seismographic surveys, satellite image processing, and
complex geophysical and scientific systems. All of these applications demand
not only high computing power but also compliance with strict real-time
constraints.
Such high-performance
systems usually have an architecture somewhat different from that of
workstations and other machines. The need to process large multidimensional
data arrays, and the dependence of the size of signal-processing algorithms on
the degree of parallelism, mean that the problems in question are solved by
devices whose processors are united in a network and exchange information over
the most varied communication channels. The best-known architectures of this
type are hypercube, systolic, and multichannel systems built from DSP
processors. In these DSP processors elementary computing power is combined
with high-performance input/output channels. For example, the SHARC processor
from Analog Devices has six link ports with a bandwidth of 40 MB/s each.
Four basic types of parallel processing system architecture can be
distinguished:
1) Pipeline and vector processing.
Pipelined processing is based on executing an
operation in several separate stages, with the data of one stage passed on to
the next. Performance increases because several operations are executed
simultaneously at different stages of the pipeline. Pipelining is effective only
when the pipeline load is close to full and the rate at which new operands
arrive matches the peak throughput of the pipeline. If delays occur, fewer
operations are executed in parallel and overall performance drops. Vector
operations provide an ideal opportunity for fully loading the computational
pipeline.
When a vector instruction is executed,
the same operation is applied to all elements of a vector (or, more often, to
the corresponding elements of a pair of vectors). Setting up the pipeline for a
particular operation may take some adjustment time, but afterwards operands can
enter the pipeline at the maximum rate allowed by the memory. There are then no
pauses either for fetching a new instruction or for determining the branch of
computation at a conditional jump. Thus, the main principle of computation on a
vector machine is to apply some elementary operation, or a combination of a few
elementary operations, repeatedly to a block of data. Such operations correspond
to small, compact loops in the source program.
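As an illustrative sketch (the loop body and names are mine, not from the original text), the kind of compact loop that maps directly onto vector or pipelined hardware looks like this:

```c
#include <stddef.h>

/* A compact loop of the kind a vectorizing compiler maps onto
 * vector/pipeline hardware: the same elementary operation is applied
 * to corresponding elements of two arrays, with no dependence
 * between iterations, so operands stream through the pipeline. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```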
2) Machines of the SIMD type.
Machines of the SIMD type consist of a large number
of identical processing elements, each with its own memory. All processing
elements in such a machine execute the same program. Obviously, a machine
composed of a large number of processors can deliver very high performance
only on those tasks in which all processors can do the same work. The
computational model of a SIMD machine is very similar to that of a vector
processor: a single operation is applied to a large data block.
Unlike the vector processor, whose operation is
restricted to the pipeline, an array processor (a synonym for most SIMD
machines) can be considerably more flexible. The processing elements of such
machines are general-purpose programmable computers, so the task being solved
in parallel can be quite complex. The typical appearance of this computational
model in the source program is roughly the same as for vector operations:
loops over array elements in which the values produced in one iteration of the
loop are not used in another iteration.
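A minimal sketch of this distinction (the arrays and function names are my own illustration): the first loop has independent iterations and suits the SIMD model, while the second carries a dependence from one iteration to the next and does not.

```c
#include <stddef.h>

/* Independent iterations: each output element depends only on the same
 * index of the input, so all processing elements can work at once. */
void scale_independent(size_t n, const float *in, float *out, float k)
{
    for (size_t i = 0; i < n; i++)
        out[i] = k * in[i];
}

/* Loop-carried dependence: iteration i needs the value produced at
 * i - 1, so the iterations cannot all be executed at the same time. */
void prefix_sum_dependent(size_t n, float *a)
{
    for (size_t i = 1; i < n; i++)
        a[i] = a[i] + a[i - 1];
}
```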
The computational models of vector and matrix (array) computers are so
similar that these computers are often discussed as equivalent.
3) Machines of the MIMD type.
The term "multiprocessor" covers most machines
of the MIMD type and (just as the term "array processor" is applied to machines
of the SIMD type) is often used as a synonym for MIMD machines. In a
multiprocessor system every processing element (PE) executes its program
largely independently of the other processing elements. The processing
elements must, of course, communicate with one another somehow, which makes a
more detailed classification of MIMD machines necessary. In multiprocessors
with shared memory there is a common memory of data and instructions,
accessible to all PEs; the PEs are connected to the shared memory by a common
bus or by an interconnection network. In contrast, in multicomputer systems
(machines with local memory) all memory is divided among the processing
elements and each block of memory is accessible only to the processor it
belongs to; an interconnection network links the processing elements with
each other.
On a MIMD multiprocessor the base model of
computation is a collection of independent processes that occasionally access
shared data. There are many variants of this model. At one end of the spectrum
is the distributed-computation model, in which the program is divided into a
fairly large number of parallel tasks consisting of many subroutines. At the
other end of the spectrum is the dataflow model, in which every operation in
the program can be regarded as a separate process. Such an operation waits for
its input data (operands), which must be passed to it by other processes; once
they have arrived, the operation is executed and the resulting value is passed
to the processes that need it. In dataflow models with coarse and medium
granularity, processes contain a large number of operations and are executed in
a dataflow manner.
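As a hedged sketch of the coarse-grained end of this spectrum (the partial-sum task and all names are my own illustration, assuming POSIX threads), a MIMD program can be split into a few independent tasks, each working on its own slice of the data:

```c
#include <pthread.h>
#include <stdio.h>

#define N      1000000
#define NTASKS 4

static double data[N];
static double partial[NTASKS];

/* Each task sums its own slice of the array, independently of the others. */
static void *sum_slice(void *arg)
{
    long id = (long)arg;
    long lo = id * (N / NTASKS), hi = lo + N / NTASKS;
    double s = 0.0;
    for (long i = lo; i < hi; i++)
        s += data[i];
    partial[id] = s;
    return NULL;
}

int main(void)
{
    pthread_t t[NTASKS];
    for (long i = 0; i < N; i++)
        data[i] = 1.0;
    for (long i = 0; i < NTASKS; i++)
        pthread_create(&t[i], NULL, sum_slice, (void *)i);
    double total = 0.0;
    for (long i = 0; i < NTASKS; i++) {
        pthread_join(t[i], NULL);
        total += partial[i];   /* combine the independent results */
    }
    printf("sum = %f\n", total);
    return 0;
}
```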
4) Multiprocessor machines with SIMD processors.
Many modern supercomputers are multiprocessor systems
in which vector processors or processors of the SIMD type are used as the
processing elements. Such machines belong to the MSIMD class.
Programming languages and the corresponding
compilers for machines of the MSIMD type usually provide language constructs
that allow the programmer to describe "coarse-grained" parallel tasks. Within
each task the compiler automatically vectorizes suitable loops. Machines of the
MSIMD type thus make it possible to use the better of the two principles:
vector operations for those parts of the program that suit them, and the
flexible possibilities of the MIMD architecture for the other parts.
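A hedged illustration of this idea in a modern shared-memory setting (OpenMP directives stand in for the language constructs the text mentions; the function and array names are mine): the outer directive describes the coarse-grained parallel task, while the inner one asks the compiler to vectorize the loop.

```c
#include <stddef.h>

/* Outer loop: coarse-grained parallel tasks, one chunk of rows per thread.
 * Inner loop: vectorized by the compiler, one vector operation per row. */
void add_matrices(size_t rows, size_t cols,
                  const float *a, const float *b, float *c)
{
    #pragma omp parallel for
    for (size_t r = 0; r < rows; r++) {
        #pragma omp simd
        for (size_t j = 0; j < cols; j++)
            c[r * cols + j] = a[r * cols + j] + b[r * cols + j];
    }
}
```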
Over the years of development of computing
technology, multiprocessor systems have gone through a number of stages.
Historically, SIMD technology was the first to be mastered. At present, however,
there is steady interest in MIMD architectures. This interest is mainly
determined by two factors:
The MIMD architecture gives great flexibility: with adequate hardware and
software support, MIMD can work as a single-user system providing
high-performance data processing for one application task, as a multiprogrammed
machine executing many tasks in parallel, or as some combination of these
possibilities.
The MIMD architecture can take full advantage of
modern microprocessor technology on the basis of a strict cost/performance
trade-off. In fact, practically all modern multiprocessor systems are built
from the same microprocessors that can be found in personal computers,
workstations, and small multiprocessor servers.
One of the distinguishing features of a multiprocessor computer
system is the interconnection network through which the processors are
connected to each other or to memory. The exchange model is so important for a
multiprocessor system that many performance characteristics and other estimates
are expressed as the ratio of processing time to communication time for the
tasks being solved. There are two basic models of interprocessor communication:
one is based on message passing, the other on the use of shared memory. In a
multiprocessor system with shared memory, one processor writes into a
particular memory cell and another processor reads from that cell. To ensure
data consistency and process synchronization, the exchange is often implemented
on the principle of mutually exclusive access to shared memory, using the
"mailbox" method.
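A minimal sketch of such a mailbox on a shared-memory machine (the structure and names are illustrative, assuming POSIX threads): a mutex serializes access to the shared cell, and a flag tells the reader when a new value is present.

```c
#include <pthread.h>
#include <stdbool.h>

/* A one-slot "mailbox" in shared memory: writers and readers take turns
 * under a mutex, so their accesses to the shared cell never overlap. */
typedef struct {
    pthread_mutex_t lock;
    bool full;          /* true when the slot holds an unread value */
    int value;
} mailbox_t;

/* Deposit a value; returns true only if the slot was empty. */
bool mailbox_put(mailbox_t *mb, int v)
{
    bool ok = false;
    pthread_mutex_lock(&mb->lock);
    if (!mb->full) {
        mb->value = v;
        mb->full = true;
        ok = true;
    }
    pthread_mutex_unlock(&mb->lock);
    return ok;
}

/* Fetch a value into *out; returns true only if the slot was full. */
bool mailbox_get(mailbox_t *mb, int *out)
{
    bool ok = false;
    pthread_mutex_lock(&mb->lock);
    if (mb->full) {
        *out = mb->value;
        mb->full = false;
        ok = true;
    }
    pthread_mutex_unlock(&mb->lock);
    return ok;
}
```

An instance can be initialized statically, e.g. `mailbox_t mb = { PTHREAD_MUTEX_INITIALIZER, false, 0 };`.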
In architectures with local memory,
direct sharing of memory is impossible. Instead, processors access shared data
by passing messages over the interconnection network. The efficiency of the
communication scheme depends on the exchange protocols, the underlying
interconnection networks, and the bandwidth of the memory and of the
communication channels.
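For contrast, a hedged sketch of the message-passing model using MPI (the tag, buffer, and ranks are my own illustration): the data lives in the sender's local memory and reaches the receiver only through an explicit exchange over the network.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* Processor 0 sends a value from its local memory to processor 1. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Processor 1 receives the value over the interconnection network. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```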
Often, and moreover without justification, the costs of
communication in shared-memory machines and vector machines are not taken into
account, because the problems of communication are largely hidden from the
programmer. In reality, however, communication overhead in these machines does
exist and is determined by contention for the buses, the memory, and the
processors. The more processors are added to the system, the more processes
compete for the same data and for the bus, which leads to a state of
saturation. The shared-memory model is very convenient for programming and is
sometimes viewed as a high-level means of assessing the influence of
communication on the operation of the system, even if the underlying system is
in fact implemented with local memory and message passing.
In circuit-switched and packet-switched networks,
as the demands on communication grow, the possibility of network overload must
be taken into account. Here interprocessor communication ties up network
resources: channels, processors, message buffers. The volume of transmitted
data can be reduced by careful functional decomposition of the task and careful
scheduling of the functions executed.
Thus, existing MIMD machines
fall into two basic classes depending on the number of processors they unite,
which in turn determines the way memory is organized and the way the processors
are interconnected.
The first group comprises machines with a common (shared)
main memory, uniting up to a few tens of processors (usually
fewer than 32). The comparatively small number of processors in such machines
makes it possible to have a single centralized shared memory and to connect the
processors and the memory by one bus. If the processors have cache memory of
sufficient size, a high-performance bus and shared memory can satisfy the
memory accesses arriving from several processors. Because there is a single
memory with a uniform access time, such machines are sometimes called UMA
(Uniform Memory Access) machines. This method of organization with a
comparatively small shared memory is currently the most popular.
The second group is made up
of large-scale systems with distributed memory. In order to support a large
number of processors, main memory has to be distributed among them; otherwise
the memory bandwidth may simply not be sufficient to satisfy the requests
coming from a very large number of processors. Naturally, with this approach it
is also necessary to implement communication between the processors.
As the number of processors grows,
it becomes simply impossible to avoid implementing a distributed-memory model
with a high-speed network for processor communication. With the rapid growth of
processor performance and the accompanying tightening of the requirements for
higher memory bandwidth, the system scale (i.e., the number of processors in
the system) for which a distributed memory organization becomes necessary is
shrinking, just as the number of processors that can be supported on a single
shared bus and shared memory is shrinking.
Distributing memory among the
individual nodes of the system has two main advantages. First, it is a
cost-effective way of increasing memory bandwidth, because most accesses can be
performed in parallel to the local memory in each node. Second, it reduces the
latency of access to local memory. These two advantages further reduce the
number of processors for which a distributed-memory architecture makes sense.
Normally the input/output devices, like the memory, are distributed among the
nodes, and in fact a node may consist of a small number (2-8) of processors
interconnected in some other way. Although such clusters of several processors
with memory and a network interface may be quite advantageous in terms of cost
efficiency, this is not very essential to understanding how such a machine
works, so for now we will restrict ourselves to systems with one processor per
node. The main architectural difference that has to be singled out in machines
with distributed memory is how communication is carried out and what the
logical model of memory is.
Conclusion:
The results of
theoretical research and of practical use show that multiprocessor computer systems with programmable architecture and a structurally-procedural organization of computation:
- provide performance close to the peak performance on any class of problems
being solved;
- make it possible to program the architecture, including the direct
communication channels, the sets of macro-operations, the internal high-level
language, and the structure of the distributed memory;
- provide practically linear growth of performance in proportion to the number
of processors operating in parallel;
- provide scalability of the system thanks to original hardware and software
facilities and a modular principle of organization.
All of this opens wide
prospects for creating domestic supercomputers that meet, on the one hand, the
needs of ensuring the information security of our country and, on the other
hand, the demands of competitiveness on the world market.
The key point in implementing both the write-invalidate scheme and the
write-update scheme in multiprocessor systems with a small number of processors
is the use of the bus mechanism to perform these operations. To perform an
update or invalidation, a processor simply acquires the bus and broadcasts on it
the address at which the update or invalidation of data is to take place. All
processors continuously snoop the bus, monitoring the addresses appearing on it.
Each processor checks whether the address appearing on the bus is present in its
cache memory. If it is, the corresponding data in the cache is either
invalidated or updated, depending on the protocol in use. The serial order of
accesses inherent to the bus also enforces a strictly serial order of write
operations: when two processors compete to write to the same cell, one of them
must obtain access to the bus before the other. The processor that obtains
access to the bus first causes the copies in the other processors to be updated
or invalidated. In any case, all writes are performed strictly sequentially.
One conclusion to draw from the analysis of this scheme is that a write to a
shared data element cannot complete until the processor obtains access to the
bus.
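A hedged sketch of the snooping side of this mechanism (the structures, states, and function names are mine, not from the text): every cache controller watches the addresses broadcast on the bus and invalidates or updates its own copy when it finds a match.

```c
#include <stdbool.h>
#include <stdint.h>

#define CACHE_LINES 256

typedef enum { PROTO_INVALIDATE, PROTO_UPDATE } protocol_t;

typedef struct {
    bool     valid;
    uint32_t tag;
    uint32_t data;
} cache_line_t;

/* Called by each cache controller for every address it snoops on the bus.
 * On a matching tag, the copy is either invalidated or updated with the
 * value that accompanies the bus write, depending on the protocol. */
void snoop_bus_write(cache_line_t cache[CACHE_LINES],
                     uint32_t addr, uint32_t new_data, protocol_t proto)
{
    uint32_t index = addr % CACHE_LINES;   /* direct-mapped for simplicity */
    uint32_t tag   = addr / CACHE_LINES;

    if (cache[index].valid && cache[index].tag == tag) {
        if (proto == PROTO_INVALIDATE)
            cache[index].valid = false;    /* drop the stale copy          */
        else
            cache[index].data = new_data;  /* refresh the copy in place    */
    }
}
```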
In addition to invalidating or updating the
corresponding copies of the cache block into which the write was performed, we
must also place the data element in the cache if the write misses. In a
write-through cache, finding the latest value is easy, because all written data
is also sent to memory, from which the most recently written value can always
be fetched (the presence of write buffers may complicate this somewhat).
For a write-back cache, however,
the task of finding the latest value is harder, because that value is most
likely held in some cache rather than in memory. In this case the same snooping
scheme is used as for writes: every processor snoops and monitors the addresses
placed on the bus. If a processor discovers that it holds a modified ("dirty")
copy of the cache block, it must supply that block in response to the read
request and cause the access to main memory to be aborted. Because write-back
caches place far lower demands on memory bandwidth, they are much preferable in
multiprocessors, despite some increase in complexity. Therefore in what follows
we consider the implementation of cache coherence for write-back caches.
The normal cache tags can be used to implement the snooping
process. Moreover, the valid bit mentioned earlier makes invalidation easy to
implement. Read misses, whether caused by an invalidation or by some other
event, are also straightforward, since they simply rely on the snooping
capability. For writes we would additionally like to know whether any other
cached copies of the block exist: if there are none, the write need not be
placed on the bus, which shortens the write and reduces the required bandwidth.
To keep track of whether a block is shared, we
can introduce an additional state bit (shared) associated with every block,
in exactly the same way as was done for the valid and modified (dirty) bits.
By adding this bit, which says whether the block is shared, we can decide
whether a write must generate an invalidation operation in an invalidation-based
protocol, or a broadcast operation when an update protocol is used.
If a write occurs to a block that is in the "shared" state under a
write-invalidate protocol, the cache places an invalidation operation on the
bus and marks the block as private. No further invalidations of this block will
be sent by this processor. The processor holding the sole (exclusive) copy of a
cache block is usually called the "owner" of the cache block.
When a write-update protocol is used,
if a block is in the "shared" state, every write to this block must be
broadcast. In the case of the invalidate protocol, when the invalidation
operation is sent, the state of the block changes from "shared" to "unshared"
(or "private"). Later, if another processor requests this block, the state must
change back to "shared". Since our snooping cache also sees all misses, it
knows when the block is requested by another processor and when its state must
become "shared".
Since every bus transaction checks the address tags of the
cache, it can potentially conflict with the processor's own accesses to the
cache. The number of such potential conflicts can be reduced by one of two
methods: by duplicating the tags, or by using multilevel caches with the
inclusion property, in which the levels closer to the processor are subsets of
the levels farther from it. If the tags are duplicated, processor accesses and
bus snooping can proceed in parallel. Of course, on a miss from the processor
side, the processor must arbitrate with the snooping mechanism in order to
update both sets of tags. Likewise, if the snooping mechanism finds a matching
tag, it must arbitrate for and access both sets of cache tags (to perform the
invalidation or to update the "shared" bit), and possibly also the data array
of the cache in order to find a copy of the block. Thus, with tag duplication
the processor has to stall only if it accesses the cache at the same moment
that the snooping mechanism has found a copy in the cache. Moreover, the
snooping activity is delayed only while the cache is dealing with a miss.
If the processor uses a multilevel cache with the inclusion
property, then every line present in the primary cache is also present in the
secondary cache. Therefore the snooping activity can be directed at the
second-level cache, while most of the processor's activity is directed at the
primary cache. If the snooping mechanism gets a hit in the second-level cache,
it must arbitrate for the primary cache in order to update the state and
possibly fetch the data, which usually stalls the processor. This solution has
been adopted in many modern systems, since a multilevel cache substantially
reduces the bandwidth requirements. Sometimes it may even be useful to
duplicate the tags in the second-level cache, to reduce still further the
number of conflicts between processor activity and the snooping mechanism.
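As a small hedged illustration of the inclusion idea (direct-mapped caches and names of my own), a snoop first probes only the L2 tags, and only on an L2 hit does it disturb the L1 cache and hence the processor:

```c
#include <stdbool.h>
#include <stdint.h>

#define L1_LINES 64
#define L2_LINES 256   /* L2 is larger; with inclusion it covers all of L1 */

typedef struct { bool valid; uint32_t tag; } line_t;

static line_t l1[L1_LINES];
static line_t l2[L2_LINES];

static bool hit(const line_t *c, uint32_t lines, uint32_t addr)
{
    uint32_t i = addr % lines;
    return c[i].valid && c[i].tag == addr / lines;
}

/* With inclusion, a block absent from L2 is guaranteed absent from L1,
 * so most snoops finish after the L2 probe and never arbitrate for L1. */
void snoop_invalidate(uint32_t addr)
{
    if (!hit(l2, L2_LINES, addr))
        return;                              /* inclusion: L1 cannot hold it */
    l2[addr % L2_LINES].valid = false;
    if (hit(l1, L1_LINES, addr))
        l1[addr % L1_LINES].valid = false;   /* processor arbitration here   */
}
```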
There are many variations of cache-coherence schemes in
real systems, depending on whether the scheme is based on invalidation or on
update, whether the cache is built on write-through or write-back principles,
when the update takes place, and whether an "ownership" state exists and how
it is implemented.