Rev. D | Page 3 of 48 | May 2012
GENERAL DESCRIPTION
The ADSP-TS203S TigerSHARC processor is an ultrahigh performance, static superscalar processor optimized for large signal processing tasks and communications infrastructure. The processor combines very wide memory widths with dual computation blocks, which support floating-point (IEEE 32-bit and extended precision 40-bit) and fixed-point (8-, 16-, 32-, and 64-bit) processing, to set a new standard of performance for digital signal processors. The TigerSHARC static superscalar architecture lets the processor execute up to four instructions each cycle, performing 24 fixed-point (16-bit) operations or six floating-point operations.
Four independent 128-bit wide internal data buses, each connecting to the four 1M bit memory banks, enable quad-word data, instruction, and I/O access and provide 28G bytes per second of internal memory bandwidth. Operating at 500 MHz, the ADSP-TS203S processor’s core has a 2.0 ns instruction cycle time. Using its single-instruction, multiple-data (SIMD) features, the processor can perform four billion 40-bit MACS or one billion 80-bit MACS per second. Table 1 shows the processor’s performance benchmarks.
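The figures quoted in this paragraph can be cross-checked with simple arithmetic. The sketch below derives the cycle time from the 500 MHz clock, then the MAC counts and bytes per cycle implied by the quoted rates. The per-cycle values are derived here and are not stated in the text; the variable names are illustrative only.

```python
# Figures quoted in the text above.
CLOCK_HZ = 500e6             # core clock frequency
MACS_40BIT_PER_S = 4e9       # "four billion 40-bit MACS" per second
MACS_80BIT_PER_S = 1e9       # "one billion 80-bit MACS" per second
INTERNAL_BW_BYTES_PER_S = 28e9   # "28G bytes per second"

# Derived values (not quoted in the text).
cycle_time_ns = 1e9 / CLOCK_HZ
macs_40_per_cycle = MACS_40BIT_PER_S / CLOCK_HZ
macs_80_per_cycle = MACS_80BIT_PER_S / CLOCK_HZ
bytes_per_cycle = INTERNAL_BW_BYTES_PER_S / CLOCK_HZ

print(f"instruction cycle time: {cycle_time_ns:.1f} ns")
print(f"40-bit MACs per cycle:  {macs_40_per_cycle:.0f}")
print(f"80-bit MACs per cycle:  {macs_80_per_cycle:.0f}")
print(f"internal bytes per cycle: {bytes_per_cycle:.0f}")
```

The derived cycle time matches the 2.0 ns stated in the text.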
The ADSP-TS203S processor is code compatible with the other
TigerSHARC processors.
The Functional Block Diagram on Page 1 shows the processor’s architectural blocks. These blocks include
• Dual compute blocks, each consisting of an ALU, multiplier, 64-bit shifter, and 32-word register file and associated data alignment buffers (DABs)
• Dual integer ALUs (IALUs), each with its own 31-word register file for data addressing and a status register
• A program sequencer with instruction alignment buffer (IAB) and branch target buffer (BTB)
• An interrupt controller that supports hardware and software interrupts, supports level- or edge-triggers, and supports prioritized, nested interrupts
• Four 128-bit internal data buses, each connecting to the four 1M-bit memory banks
• On-chip DRAM (4M-bit)
• An external port that provides the interface to host processors, multiprocessing space (DSPs), off-chip memory-mapped peripherals, and external SRAM and SDRAM
• A 10-channel DMA controller
• Two full-duplex LVDS link ports
• Two 64-bit interval timers and timer expired pin
• An 1149.1 IEEE-compliant JTAG test access port for on-chip emulation
The TigerSHARC uses a Static Superscalar™* architecture. This architecture is superscalar in that the ADSP-TS203S processor’s core can execute simultaneously from one to four 32-bit instructions encoded in a very large instruction word (VLIW) instruction line using the processor’s dual compute blocks. Because the processor does not perform instruction reordering at runtime (the programmer selects which operations will execute in parallel prior to runtime), the order of instructions is static.
With few exceptions, an instruction line, whether it contains
one, two, three, or four 32-bit instructions, executes with a
throughput of one cycle in a 10-deep processor pipeline.
For optimal processor program execution, programmers must follow the processor’s set of instruction parallelism rules when encoding an instruction line. In general, the selection of instructions that the processor can execute in parallel each cycle depends both on the instruction line resources each instruction requires and on the source and destination registers used in the instructions. The programmer has direct control of three core components: the IALUs, the compute blocks, and the program sequencer.
The ADSP-TS203S processor, in most cases, has a two-cycle execution pipeline that is fully interlocked, so whenever a computation result is unavailable for another operation that depends on it, the processor automatically inserts one or more stall cycles as needed. Efficient programming with dependency-free instructions can eliminate most computational and memory transfer data dependencies.
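The effect of interlocking can be illustrated with a toy model. The sketch below is not a description of the actual ADSP-TS203S pipeline: it assumes one instruction line issues per cycle, that every result becomes readable a fixed number of cycles after issue (`result_latency`, set to 2 here to mirror the two-cycle execution pipeline), and it uses hypothetical register names. It only counts the stall cycles that such a simplified interlock would insert.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """One instruction line: the registers it writes and the registers it reads."""
    writes: tuple
    reads: tuple


def count_stalls(lines, result_latency=2):
    """Count stall cycles in a toy in-order, fully interlocked pipeline.

    Assumption: a register written by a line issued at cycle t can first be
    read by a line issued at cycle t + result_latency.
    """
    ready_at = {}   # register name -> first cycle at which it can be read
    cycle = 0       # cycle at which the next line would issue without stalls
    stalls = 0
    for line in lines:
        # The line cannot issue until all of its source registers are ready.
        earliest = max((ready_at.get(r, 0) for r in line.reads), default=0)
        if earliest > cycle:
            stalls += earliest - cycle
            cycle = earliest
        for r in line.writes:
            ready_at[r] = cycle + result_latency
        cycle += 1
    return stalls


# A consumer placed directly after its producer.
back_to_back = [
    Line(writes=("r0",), reads=("r1", "r2")),
    Line(writes=("r3",), reads=("r0",)),
]

# The same pair with an independent line scheduled in between.
interleaved = [
    Line(writes=("r0",), reads=("r1", "r2")),
    Line(writes=("r4",), reads=("r5",)),
    Line(writes=("r3",), reads=("r0",)),
]

print(count_stalls(back_to_back))
print(count_stalls(interleaved))
```

In this model the back-to-back pair incurs a stall, while scheduling an independent line between producer and consumer hides the latency, which is the kind of reordering the paragraph above leaves to the programmer.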
In addition, the processor supports SIMD operations in two ways: SIMD compute blocks and SIMD computations. The programmer can load both compute blocks with the same data (broadcast distribution) or different data (merged distribution).
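The difference between the two distributions can be sketched in a few lines. This is a conceptual illustration only: the block names "X" and "Y", the function names, and the choice to split the data into a lower and an upper half for merged distribution are assumptions made for the example, not details taken from the text.

```python
def broadcast(words):
    """Broadcast distribution: both compute blocks receive the same data."""
    return {"X": list(words), "Y": list(words)}


def merged(words):
    """Merged distribution: each compute block receives different data.

    Assumption for this sketch: the first half goes to one block and the
    second half to the other.
    """
    half = len(words) // 2
    return {"X": list(words[:half]), "Y": list(words[half:])}


quad_word = [10, 20, 30, 40]   # four words fetched in one access

same_data = broadcast(quad_word)
split_data = merged(quad_word)

print(same_data)
print(split_data)
```

Broadcast suits cases where both blocks apply the same coefficients to different operands, and merged suits cases where the blocks work on different parts of one data stream.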
Table 1. General-Purpose Algorithm Benchmarks at 500 MHz

Benchmark                                       Speed           Clock Cycles
32-bit algorithm, 1 billion MACS/s peak performance
  1K point complex FFT¹ (Radix 2)               18.8 μs         9419
  64K point complex FFT¹ (Radix 2)              2.8 ms          1397544
  FIR filter (per real tap)                     1 ns            0.5
  [8 × 8][8 × 8] matrix multiply
  (complex, floating-point)                     2.8 μs          1399
16-bit algorithm, 4 billion MACS/s peak performance
  256 point complex FFT¹ (Radix 2)              1.9 μs          928
I/O DMA transfer rate
  External port                                 500M bytes/s    n/a
  Link ports (each)                             500M bytes/s    n/a

¹ Cache preloaded.
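The Speed and Clock Cycles columns of Table 1 are related by the 2.0 ns cycle time. The sketch below converts each cycle count to a time at 500 MHz and compares it with the published speed. The 64K point FFT cycle count is read here as 1397544, an assumption based on that row's 2.8 ms speed at 500 MHz; the dictionary keys and the 3% tolerance (to allow for rounding in the published speeds) are illustrative choices.

```python
CLOCK_HZ = 500e6   # benchmark clock from the table title

# name -> (clock cycles, published speed in seconds), values from Table 1
benchmarks = {
    "1K point complex FFT (32-bit)": (9419, 18.8e-6),
    "64K point complex FFT (32-bit)": (1397544, 2.8e-3),
    "FIR filter, per real tap": (0.5, 1e-9),
    "8x8 complex matrix multiply": (1399, 2.8e-6),
    "256 point complex FFT (16-bit)": (928, 1.9e-6),
}


def cycles_to_seconds(cycles, clock_hz=CLOCK_HZ):
    """Convert a clock-cycle count to execution time at the given clock."""
    return cycles / clock_hz


def matches_published(cycles, published_s, tolerance=0.03):
    """True if the computed time is within the tolerance of the published speed."""
    computed = cycles_to_seconds(cycles)
    return abs(computed - published_s) / published_s <= tolerance


for name, (cycles, published_s) in benchmarks.items():
    computed_us = cycles_to_seconds(cycles) * 1e6
    status = "ok" if matches_published(cycles, published_s) else "mismatch"
    print(f"{name}: {computed_us:.4f} us computed, "
          f"{published_s * 1e6:.4f} us published ({status})")
```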
* Static Superscalar is a trademark of Analog Devices, Inc.