executed 1280 operations (including loads, adds, subtracts, and absolute
values); in the restructured version, there are only 256 operations: 128 loads,
64 ume8uu operations, and 64 additions. This is a factor of five reduction in
the number of operations executed. Also, the overhead of the inner loop has
been eliminated, further increasing the performance advantage.
4.4.2 More Unrolling
The code transformations of the previous section
achieved impressive performance improvements, but
given the VLIW nature of the TM1100 CPU, more can be
done to exploit TM1100’s parallelism.
The code in Figure 4-14 has a loop containing only four operations (not
counting the loop overhead). Since TM1100’s branches have a delay of three
instructions, the smallest possible loop body is four instructions: the branch
plus its three delay slots. With up to five operations per instruction, that
gives 20 operation slots, so a fully utilized minimum-sized loop can contain
16 operations (20 minus the loop overhead).
The TM1100 compiling system performs a wide variety of powerful code
transformation and scheduling optimizations to ensure that the VLIW
capabilities of the CPU are exploited. It is still wise, however, to make
program parallelism explicit in source code when possible. Explicit
parallelism can only help the compiler produce a fast-running program.
To this end, we can unroll the loop of Figure 4-14 some number of times to
create explicit parallelism and help the compiler create a fast-running loop.
In this case, where the number of iterations is a power of two, it makes sense
to unroll by a power-of-two factor to keep the code clean.
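For reference, the loop of Figure 4-14 is, in sketch form (not the data book’s
own listing), the nested loops of Figure 4-13 collapsed into a single flat loop
over the 64 words of each array:

unsigned int *IA = (unsigned int *) A;
unsigned int *IB = (unsigned int *) B;
for (i = 0; i < 64; i += 1)
    cost += UME8UU(IA[i], IB[i]);   /* load, load, ume8uu, add: four operations */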
Figure 4-15 shows the loop unrolled by a factor of eight. The compiler can
apply common subexpression elimination and other optimizations to eliminate
extraneous operations in the array indexing, but, again, improvements in the
source code can only help the compiler produce the best possible code and
fastest-running program.
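As a rough illustration (a sketch, not the data book’s code) of what common
subexpression elimination does to the indexing of Figure 4-15, the base
addresses IA + i and IB + i can be formed once per iteration, with the eight
accesses then made at small constant offsets from those bases:

unsigned int *pa = &IA[i];    /* hypothetical temporaries holding the shared bases */
unsigned int *pb = &IB[i];
cost0 = UME8UU(pa[0], pb[0]); /* same value as UME8UU(IA[i+0], IB[i+0]) */
cost1 = UME8UU(pa[1], pb[1]);
/* ... cost2 through cost7 follow the same pattern ... */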
Figure 4-12. The loop of Figure 4-11 recoded with one-dimensional array accesses and simpler array indexing.

unsigned char A[16][16];
unsigned char B[16][16];
...
unsigned char *CA = (unsigned char *) A;
unsigned char *CB = (unsigned char *) B;
for (row = 0; row < 16; row += 1)
{
    int rowoffset = row * 16;
    for (col = 0; col < 16; col += 4)
    {
        cost0 = abs(CA[rowoffset + col+0] - CB[rowoffset + col+0]);
        cost1 = abs(CA[rowoffset + col+1] - CB[rowoffset + col+1]);
        cost2 = abs(CA[rowoffset + col+2] - CB[rowoffset + col+2]);
        cost3 = abs(CA[rowoffset + col+3] - CB[rowoffset + col+3]);
        cost += cost0 + cost1 + cost2 + cost3;
    }
}
Figure 4-13. The loop of Figure 4-12 recoded with 32-bit array accesses and the ume8uu custom operation.

unsigned int *IA = (unsigned int *) A;
unsigned int *IB = (unsigned int *) B;
for (row = 0; row < 16; row += 1)
{
    int rowoffset = row * 4;
    for (col4 = 0; col4 < 4; col4 += 1)
        cost += UME8UU(IA[rowoffset + col4], IB[rowoffset + col4]);
}
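For readers who want to check the results of the code above, the following is
a plain-C reference model of the ume8uu operation (a sketch written for this
purpose, not part of the data book): it treats each 32-bit operand as four
unsigned bytes and returns the sum of the absolute differences of the
corresponding bytes.

unsigned int ume8uu_model(unsigned int a, unsigned int b)
{
    unsigned int sum = 0;
    int k;
    for (k = 0; k < 4; k += 1)
    {
        unsigned int ba = (a >> (8 * k)) & 0xFF;   /* byte k of a */
        unsigned int bb = (b >> (8 * k)) & 0xFF;   /* byte k of b */
        sum += (ba > bb) ? (ba - bb) : (bb - ba);  /* |ba - bb|   */
    }
    return sum;
}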
unsigned int *IA = (unsigned int *) A;
unsigned int *IB = (unsigned int *) B;
for (i = 0; i < 64; i += 8)
{
    cost0 = UME8UU(IA[i+0], IB[i+0]);
    cost1 = UME8UU(IA[i+1], IB[i+1]);
    cost2 = UME8UU(IA[i+2], IB[i+2]);
    cost3 = UME8UU(IA[i+3], IB[i+3]);
    cost4 = UME8UU(IA[i+4], IB[i+4]);
    cost5 = UME8UU(IA[i+5], IB[i+5]);
    cost6 = UME8UU(IA[i+6], IB[i+6]);
    cost7 = UME8UU(IA[i+7], IB[i+7]);
    cost += cost0 + cost1 + cost2 + cost3 +
            cost4 + cost5 + cost6 + cost7;
}
Figure 4-15. The loop of Figure 4-14 unrolled by a factor of eight. The resulting code makes good use of TM1100’s VLIW capabilities.
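To show how the pieces fit together, here is a minimal self-contained sketch
(not from the data book) that packages the unrolled loop of Figure 4-15 as a
complete routine. The function name block_sad and the local declarations are
introduced only for this example, and UME8UU() is assumed to be available as
the TriMedia compiler’s custom-op intrinsic.

unsigned int block_sad(unsigned char A[16][16], unsigned char B[16][16])
{
    /* The 16x16 byte arrays are accessed four bytes at a time as 64 words. */
    unsigned int *IA = (unsigned int *) A;
    unsigned int *IB = (unsigned int *) B;
    unsigned int cost = 0;
    int i;

    for (i = 0; i < 64; i += 8)
    {
        unsigned int cost0 = UME8UU(IA[i+0], IB[i+0]);
        unsigned int cost1 = UME8UU(IA[i+1], IB[i+1]);
        unsigned int cost2 = UME8UU(IA[i+2], IB[i+2]);
        unsigned int cost3 = UME8UU(IA[i+3], IB[i+3]);
        unsigned int cost4 = UME8UU(IA[i+4], IB[i+4]);
        unsigned int cost5 = UME8UU(IA[i+5], IB[i+5]);
        unsigned int cost6 = UME8UU(IA[i+6], IB[i+6]);
        unsigned int cost7 = UME8UU(IA[i+7], IB[i+7]);
        cost += cost0 + cost1 + cost2 + cost3 +
                cost4 + cost5 + cost6 + cost7;
    }
    return cost;
}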