Rev. C | Page 4 of 48 | December 2006
ADSP-TS201S
The TigerSHARC DSP uses a Static Superscalar™ architecture.
This architecture is superscalar in that the ADSP-TS201S
processor’s core can execute from one to four 32-bit
instructions simultaneously, encoded in a very long instruction
word (VLIW) instruction line, using the DSP’s dual compute
blocks. The order of instructions is static because the DSP
does not reorder instructions at runtime: the programmer
selects which operations will execute in parallel prior to
runtime.
With few exceptions, an instruction line, whether it contains
one, two, three, or four 32-bit instructions, executes with a
throughput of one cycle in a 10-deep processor pipeline.
For optimal DSP program execution, programmers must follow
the DSP’s set of instruction parallelism rules when encoding an
instruction line. In general, the selection of instructions that
the DSP can execute in parallel each cycle depends on the
instruction line resources each instruction requires and on the
source and destination registers used in the instructions. The
programmer has direct control of three core components: the
IALUs, the compute blocks, and the program sequencer.
The ADSP-TS201S processor, in most cases, has a two-cycle
execution pipeline that is fully interlocked, so whenever a
computation result is unavailable for another operation that
depends on it, the DSP automatically inserts one or more stall
cycles as needed. Scheduling instructions so that they do not
depend on results still in the pipeline avoids most stalls
caused by computational and memory transfer data dependencies.
In addition, the ADSP-TS201S processor supports SIMD operations
in two ways: SIMD compute blocks and SIMD computations. The
programmer can load both compute blocks with the same data
(broadcast distribution) or different data (merged
distribution).
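
The two distributions can be pictured with a small Python model.
This is a sketch only: the function names are hypothetical, and
the choice to give the first half of a merged load to block X
and the second half to block Y is an assumption made for
illustration, not the processor's documented load semantics.

```python
def broadcast(words):
    """Broadcast distribution: both compute blocks get the same data."""
    return {"X": list(words), "Y": list(words)}


def merged(words):
    """Merged distribution: each compute block gets different data.

    Assumption: the first half goes to X and the second half to Y.
    """
    if len(words) % 2:
        raise ValueError("merged distribution needs an even number of words")
    half = len(words) // 2
    return {"X": list(words[:half]), "Y": list(words[half:])}


quad = [0x11, 0x22, 0x33, 0x44]   # one quad word of example data
b = broadcast(quad)               # X and Y both hold all four words
m = merged(quad)                  # X holds two words, Y holds the other two
```

Broadcast suits SIMD work where both blocks apply different
coefficients to the same samples; merged suits work where the
data set itself is split across the two blocks.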
DUAL COMPUTE BLOCKS
The ADSP-TS201S processor has two compute blocks that can
execute computations either independently or together as a
single-instruction, multiple-data (SIMD) engine. The DSP can
issue up to two compute instructions per compute block each
cycle, instructing the ALU, multiplier, shifter, or CLU to
perform independent, simultaneous operations. Each compute block
can execute eight 8-bit, four 16-bit, two 32-bit, or one 64-bit
SIMD computations in parallel with the operation in the other
block.
These computation units support IEEE 32-bit single-precision
floating-point, extended-precision 40-bit floating-point, and
8-, 16-, 32-, and 64-bit fixed-point processing.
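
The lane counts quoted above follow from dividing a 64-bit
operand into equal fields. The Python sketch below illustrates
that partitioning with a lane-wise add. It is not TigerSHARC
code: the helper names are hypothetical, and both the lane
ordering (lane 0 in the least significant bits) and the
wraparound on overflow are assumptions of this model.

```python
def split_lanes(value, lane_bits, total_bits=64):
    """Split an integer into total_bits // lane_bits fields, lane 0 lowest."""
    mask = (1 << lane_bits) - 1
    return [(value >> (i * lane_bits)) & mask
            for i in range(total_bits // lane_bits)]


def join_lanes(lanes, lane_bits):
    """Pack a list of fields back into one integer, lane 0 lowest."""
    mask = (1 << lane_bits) - 1
    value = 0
    for i, lane in enumerate(lanes):
        value |= (lane & mask) << (i * lane_bits)
    return value


def simd_add(a, b, lane_bits, total_bits=64):
    """Lane-wise add; carries never cross a lane boundary (wraps on overflow)."""
    mask = (1 << lane_bits) - 1
    lanes_a = split_lanes(a, lane_bits, total_bits)
    lanes_b = split_lanes(b, lane_bits, total_bits)
    return join_lanes([(x + y) & mask for x, y in zip(lanes_a, lanes_b)],
                      lane_bits)


# Number of lanes in a 64-bit operand for each element width.
lane_counts = {bits: len(split_lanes(0, bits)) for bits in (8, 16, 32, 64)}
```

The same two operands give different results depending on the
lane width, because the width decides where carries stop.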
The compute blocks are referred to as X and Y in assembly
syntax, and each block contains four computational units (an
ALU, a multiplier, a 64-bit shifter, and a 128-bit CLU) and a
32-word register file.
• Register File: each compute block has a multiported 32-word,
  fully orthogonal register file used for transferring data
  between the computation units and data buses and for storing
  intermediate results. Instructions can access the registers in
  the register file individually (word-aligned), in sets of two
  (dual-aligned), or in sets of four (quad-aligned).
• ALU: the ALU performs a standard set of arithmetic operations
  in both fixed- and floating-point formats. It also performs
  logic operations.
• Multiplier: the multiplier performs both fixed- and
  floating-point multiplication and fixed-point multiply and
  accumulate.
• Shifter: the 64-bit shifter performs logical and arithmetic
  shifts, bit and bit stream manipulation, and field deposit and
  extraction operations.
• Communications Logic Unit (CLU): this 128-bit unit provides
  trellis decoding (for example, Viterbi and Turbo decoders) and
  executes complex correlations for CDMA communication
  applications (for example, chip-rate and symbol-rate
  functions).
Using these features, the compute blocks can:
• Provide 8 MACS per cycle peak and 7.1 MACS per cycle sustained
  16-bit performance, and 2 MACS per cycle peak and 1.8 MACS per
  cycle sustained 32-bit performance (based on FIR)
• Execute six single-precision floating-point operations or 24
  fixed-point (16-bit) operations per cycle, providing
  3.6 GFLOPS or 14.4 G regular operations per second at 600 MHz
• Perform two complex 16-bit MACS per cycle
• Execute eight trellis butterflies in one cycle
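
The throughput figures in the list above are the per-cycle
counts multiplied by the 600 MHz clock. The short Python
calculation below reproduces them; the variable and function
names are illustrative only, and the MACS-per-second values are
derived here rather than quoted from the text.

```python
CLOCK_HZ = 600_000_000  # 600 MHz core clock, as stated in the text


def ops_per_second(ops_per_cycle, clock_hz=CLOCK_HZ):
    """Scale a per-cycle operation count to operations per second."""
    return ops_per_cycle * clock_hz


flops = ops_per_second(6)                # six single-precision ops per cycle
fixed_ops = ops_per_second(24)           # 24 fixed-point (16-bit) ops per cycle
peak_macs_16 = ops_per_second(8)         # derived: 16-bit peak MACS per second
sustained_macs_16 = ops_per_second(7.1)  # derived: 16-bit sustained MACS per second

print(f"{flops / 1e9:.1f} GFLOPS")
print(f"{fixed_ops / 1e9:.1f} G fixed-point operations per second")
```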
DATA ALIGNMENT BUFFER (DAB)
The DAB is a quad-word FIFO that enables loading of quad-word
data from nonaligned addresses. Normally, load instructions must
be aligned to their data size so that quad words are loaded from
a quad-aligned address. Using the DAB significantly improves the
efficiency of some applications, such as FIR filters.
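
The idea of serving nonaligned quad-word reads from aligned
memory accesses can be sketched as follows. This Python model is
an illustration under stated assumptions, not a description of
the DAB hardware: the class name, its fields, and the policy of
holding one previously fetched quad word are all hypothetical.

```python
QUAD = 4  # words per quad word


def load_aligned_quad(memory, addr):
    """Model of a normal quad-word load: the address must be quad-aligned."""
    if addr % QUAD:
        raise ValueError("quad-word load requires a quad-aligned address")
    return memory[addr:addr + QUAD]


class AlignmentBuffer:
    """Hypothetical buffer that keeps the last aligned quad word fetched."""

    def __init__(self, memory):
        self.memory = memory
        self.base = None    # aligned address of the held quad word
        self.held = None    # the held quad word itself
        self.fetches = 0    # count of aligned memory accesses

    def _fetch(self, base):
        self.fetches += 1
        return load_aligned_quad(self.memory, base)

    def load(self, addr):
        """Return four consecutive words starting at any word address."""
        offset = addr % QUAD
        base = addr - offset
        if self.base != base:            # held data is stale, refill it
            self.held = self._fetch(base)
            self.base = base
        if offset == 0:
            return list(self.held)
        nxt = self._fetch(base + QUAD)   # upper part comes from the next quad
        result = self.held[offset:] + nxt[:offset]
        self.held, self.base = nxt, base + QUAD
        return result


memory = list(range(16))
dab = AlignmentBuffer(memory)
first = dab.load(1)    # spans two aligned quad words: two fetches
second = dab.load(5)   # reuses the held quad word: one more fetch
```

In this model, a sequential stream of nonaligned reads costs one
aligned access per quad word after the first read, which is the
access pattern of a FIR filter sliding over its input samples.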
DUAL INTEGER ALU (IALU)
The ADSP-TS201S processor has two IALUs that provide powerful
address generation capabilities and perform many general-purpose
integer operations. The IALUs are referred to as J and K in
assembly syntax and have the following features:
• Provide memory addresses for data and update pointers
• Support circular buffering and bit-reverse addressing
• Perform general-purpose integer operations, increasing
  programming flexibility
• Include a 31-word register file for each IALU
As address generators, the IALUs perform immediate or indirect
(pre- and post-modify) addressing. They perform modulus and
bit-reverse operations with no constraints on where the modulus
data buffer is placed in memory. Each IALU can specify either a
single-, dual-, or quad-word access from memory.
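
The two addressing modes named above can be illustrated in
Python. This is a generic sketch of circular (modulus)
post-modify addressing and bit-reversed indexing, not the IALU
instruction semantics: the function names and parameters are
hypothetical. The example buffer deliberately starts at an
arbitrary base address to reflect the statement that buffer
placement is unconstrained.

```python
def circular_post_modify(addr, modifier, base, length):
    """Post-modify with wraparound inside [base, base + length).

    Returns the address used for the access and the updated pointer.
    """
    next_addr = base + (addr - base + modifier) % length
    return addr, next_addr


def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of index, as used for FFT data ordering."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


base, length = 1003, 5   # buffer base is not aligned to any power of two
ptr = base
visited = []
for _ in range(7):
    used, ptr = circular_post_modify(ptr, 2, base, length)
    visited.append(used)

fft_order = [bit_reverse(i, 3) for i in range(8)]
```

Because Python's `%` always returns a non-negative result for a
positive length, the same helper also wraps correctly for
negative modifiers.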
Static Superscalar is a trademark of Analog Devices, Inc.