
TM1100 Preliminary Data Book
Philips Semiconductors
PRELIMINARY INFORMATION
File: cstm.fm5, modified 7/26/99
arate registers if the original matrix operands were needed
for further computations (the TM1100 optimizing C compiler
performs this analysis automatically). In this example, the
transpose matrix is placed in registers R18, R19, R20, and
R21. The final four store-word operations put the transposed
matrix back into memory.
Thus, using the TM1100 custom operations, the byte-matrix
transposition requires four load-word operations and four
store-word operations (the minimum possible) and eight
register-to-register data-manipulation operations. The result
is 16 operations, or byte-matrix transposition at the rate of
one operation per byte.
While the advantage of the custom-operation-based algorithm
over the brute-force code that uses 24 load- and store-byte
instructions seems to be only eight operations (a 33%
reduction), the advantage is actually much greater. First,
using custom operations, the number of memory references is
reduced from 24 to eight (a factor of three). Since memory
references are slower than register-to-register operations
(such as the custom operations in this example), the
reduction in memory references is significant.
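
For comparison, the following is a minimal sketch of one way the brute-force count of 24 memory references can arise. It assumes an in-place transposition that leaves the four diagonal bytes untouched and swaps the six off-diagonal pairs. The counters and the helper names load_byte, store_byte, and transpose_bytewise are hypothetical and exist only to tally references; this is not code from the TM1100 tools.

```c
#include <stdint.h>

/* Hypothetical counters, used only to tally memory references. */
static unsigned byte_loads;
static unsigned byte_stores;

static uint8_t load_byte(const uint8_t *p)
{
    byte_loads++;
    return *p;
}

static void store_byte(uint8_t *p, uint8_t v)
{
    byte_stores++;
    *p = v;
}

/* In-place transposition of a 4x4 byte matrix, one byte at a time.
   The diagonal bytes do not move, so only the six off-diagonal
   pairs are exchanged: two loads and two stores per pair. */
static void transpose_bytewise(uint8_t m[4][4])
{
    for (int r = 0; r < 4; r++) {
        for (int c = r + 1; c < 4; c++) {
            uint8_t upper = load_byte(&m[r][c]);
            uint8_t lower = load_byte(&m[c][r]);
            store_byte(&m[r][c], lower);
            store_byte(&m[c][r], upper);
        }
    }
}
```

Calling transpose_bytewise once on a 4x4 matrix leaves byte_loads and byte_stores at 12 each, for 24 memory references in total, against the eight word-sized references of the custom-operation version.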
Further, the ability of the TM1100 compiling system to
exploit the performance potential of the TM1100
microprocessor hardware is enhanced by the
custom-operation-based code. This is because it is easier for
the compiling system to produce an optimal schedule
(arrangement) of the code when the number of memory
references is in balance with the number of
register-to-register operations. The TM1100 CPU (like all
high-performance microprocessors) has a limit on the number
of memory references that can be processed in a single cycle
(two is the current limit). A long sequence of code that
contains only memory references can result in empty operation
slots in the long TM1100 instructions. Empty operation slots
waste the performance potential of the TM1100 hardware.
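
The balance argument can be made concrete with a rough lower bound on the number of long instructions a code sequence needs. The sketch below is only an illustration: it assumes five operation slots per instruction (a figure not stated in this section) together with the two-memory-references-per-cycle limit given above, and it ignores operation latencies and data dependences, so a real schedule may be longer. The names ISSUE_SLOTS, MEM_OPS_PER_CYCLE, min_instructions, and empty_slots are hypothetical.

```c
/* Assumed machine parameters. The memory limit is the one quoted in
   the text; the slot count is an assumption made for this sketch. */
enum { ISSUE_SLOTS = 5, MEM_OPS_PER_CYCLE = 2 };

static unsigned ceil_div(unsigned n, unsigned d)
{
    return (n + d - 1) / d;
}

/* Lower bound on the number of long instructions: the larger of the
   bound imposed by the total slot count and the bound imposed by the
   per-cycle memory-reference limit. */
static unsigned min_instructions(unsigned mem_ops, unsigned reg_ops)
{
    unsigned by_slots = ceil_div(mem_ops + reg_ops, ISSUE_SLOTS);
    unsigned by_memory = ceil_div(mem_ops, MEM_OPS_PER_CYCLE);
    return by_slots > by_memory ? by_slots : by_memory;
}

/* Operation slots left empty when the bound above is met exactly. */
static unsigned empty_slots(unsigned mem_ops, unsigned reg_ops)
{
    return min_instructions(mem_ops, reg_ops) * ISSUE_SLOTS
         - (mem_ops + reg_ops);
}
```

Under these assumptions, the brute-force code (24 memory references, no register-to-register operations) needs at least 12 instructions and leaves 36 slots empty, while the custom-operation code (8 memory references, 8 register-to-register operations) can in principle fit in 4 instructions with 4 slots empty.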
As this example has shown, careful use of custom operations
can not only reduce the absolute number of operations needed
to perform a computation but also help the compiling system
produce code that fully exploits the performance potential of
the TM1100 CPU.
4.3 EXAMPLE 2: MPEG IMAGE RECONSTRUCTION
The complete MPEG video decoding algorithm is composed of
many different phases, each with computationally intensive
kernels. One important kernel deals with reconstructing a
single image frame given that the forward- and
backward-predicted frames and the inverse discrete cosine
transform (IDCT) results have already been computed. This
kernel provides an excellent opportunity to illustrate the
power of TM1100's specialized custom operators.
In the code fragments that follow, the backward-predicted
block is assumed to have been computed into an array back[],
the forward-predicted block is assumed to
Row Major:
    a b c d
    e f g h
    i j k l
    m n o p

mergemsb:
    a e b f
    i m j n

mergelsb:
    c g d h
    k o l p

pack16msb, pack16lsb, pack16msb, pack16lsb

Column Major:
    a e i m
    b f j n
    c g k o
    d h l p
Figure 4-2. Application of merge and pack instructions to the byte-matrix transposition of Figure 4-1.

    ld32d(0) r100 → r10
    ld32d(4) r100 → r11
    ld32d(8) r100 → r12
    ld32d(12) r100 → r13
    mergemsb r10 r11 → r14
    mergemsb r12 r13 → r15
    mergelsb r10 r11 → r16
    mergelsb r12 r13 → r17
    pack16msb r14 r15 → r18
    pack16lsb r14 r15 → r19
    pack16msb r16 r17 → r20
    pack16lsb r16 r17 → r21
    st32d(0) r101 r18
    st32d(4) r101 r19
    st32d(8) r101 r20
    st32d(12) r101 r21

    char matrix[4][4];
    .
    int *m = (int *) matrix;
    temp0 = MERGEMSB(m[0], m[1]);
    temp1 = MERGEMSB(m[2], m[3]);
    temp2 = MERGELSB(m[0], m[1]);
    temp3 = MERGELSB(m[2], m[3]);
    m[0] = PACK16MSB(temp0, temp1);
    m[1] = PACK16LSB(temp0, temp1);
    m[2] = PACK16MSB(temp2, temp3);
    m[3] = PACK16LSB(temp2, temp3);
    .

Figure 4-3. On the left is a complete list of operations to perform the byte-matrix transposition of Figure 4-1 and Figure 4-2. On the right is an equivalent C-language fragment.
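
To experiment with the sequence of Figure 4-3 on a machine without the custom operations, the four operations can be approximated in portable C. The sketch below is an emulation under stated assumptions, not the TM1100 definition of these operations: the byte selection of mergemsb, mergelsb, pack16msb, and pack16lsb is inferred from the byte positions shown in Figure 4-2, and the first byte of each matrix row is assumed to occupy the most significant byte of the loaded word. The helpers load_word and store_word and the function transpose_custom are hypothetical names introduced for illustration.

```c
#include <stdint.h>

/* Emulated custom operations; byte selection inferred from Figure 4-2.
   Bits 31..24 hold the most significant byte. */
static uint32_t mergemsb(uint32_t a, uint32_t b)
{
    /* Interleave the two upper bytes of a and b: a3 b3 a2 b2. */
    return (a & 0xFF000000u) | ((b >> 8) & 0x00FF0000u)
         | ((a >> 8) & 0x0000FF00u) | ((b >> 16) & 0x000000FFu);
}

static uint32_t mergelsb(uint32_t a, uint32_t b)
{
    /* Interleave the two lower bytes of a and b: a1 b1 a0 b0. */
    return ((a << 16) & 0xFF000000u) | ((b << 8) & 0x00FF0000u)
         | ((a << 8) & 0x0000FF00u) | (b & 0x000000FFu);
}

static uint32_t pack16msb(uint32_t a, uint32_t b)
{
    /* Upper halfword of a, then upper halfword of b. */
    return (a & 0xFFFF0000u) | (b >> 16);
}

static uint32_t pack16lsb(uint32_t a, uint32_t b)
{
    /* Lower halfword of a, then lower halfword of b. */
    return (a << 16) | (b & 0x0000FFFFu);
}

/* Hypothetical helpers: assemble a word with the first byte most
   significant, so the result does not depend on host byte order. */
static uint32_t load_word(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
         | ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static void store_word(uint8_t *p, uint32_t w)
{
    p[0] = (uint8_t)(w >> 24);
    p[1] = (uint8_t)(w >> 16);
    p[2] = (uint8_t)(w >> 8);
    p[3] = (uint8_t)w;
}

/* Same sequence as Figure 4-3: four loads, eight register-to-register
   operations, four stores. */
static void transpose_custom(uint8_t m[4][4])
{
    uint32_t r0 = load_word(m[0]), r1 = load_word(m[1]);
    uint32_t r2 = load_word(m[2]), r3 = load_word(m[3]);
    uint32_t t0 = mergemsb(r0, r1), t1 = mergemsb(r2, r3);
    uint32_t t2 = mergelsb(r0, r1), t3 = mergelsb(r2, r3);
    store_word(m[0], pack16msb(t0, t1));
    store_word(m[1], pack16lsb(t0, t1));
    store_word(m[2], pack16msb(t2, t3));
    store_word(m[3], pack16lsb(t2, t3));
}
```

The explicit load_word and store_word helpers replace the (int *) cast of the C fragment in Figure 4-3. The cast reproduces the byte positions of Figure 4-2 only when the first byte in memory lands in the most significant byte of the word, so the sketch builds each word by hand rather than relying on the byte order of the host.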