Vector Processors and Data Level Parallelism
Introduction
A complex number consists of a real and an imaginary component and is usually written in the form a + bi, where a and b are either integer or floating-point values and i (the imaginary unit) = √(-1). Sometimes in engineering, the letter j is used in place of i because i is used for other values, such as electric current.
Multiplying two complex numbers is done by applying the FOIL (Firsts, Outers, Inners and Lasts) method, similar to that of binomial multiplication. For example, multiplying (a + bi)(c + di) is accomplished as follows:
Firsts: a * c
Outers: a * di
Inners: bi * c
Lasts: bi * di
This produces (a + bi)(c + di) = ac + adi + bci + bdi^2. Because i^2 = -1, the terms combine to give the product back in the form a + bi: (ac - bd) + (ad + bc)i.
An example using actual values: (2.5 + 3i)(4.0 + 2i)
Firsts: 2.5 * 4.0
Outers: 2.5 * 2i
Inners: 3i * 4.0
Lasts: 3i * 2i
This produces 10 + 5i + 12i + 6i^2 = 10 + 17i + 6(-1) = 4 + 17i.
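The FOIL steps above can be checked with a short Python sketch. The function name complex_multiply is hypothetical and chosen only for illustration; the result is compared against Python's built-in complex type.

```python
def complex_multiply(a, b, c, d):
    """Multiply (a + bi)(c + di) with the FOIL method; return (real, imag)."""
    firsts = a * c   # real term
    outers = a * d   # coefficient of i
    inners = b * c   # coefficient of i
    lasts = b * d    # coefficient of i^2
    real = firsts - lasts    # i^2 = -1 makes the Lasts term negative
    imag = outers + inners
    return real, imag


product = complex_multiply(2.5, 3.0, 4.0, 2.0)
print(product)  # (4.0, 17.0)

# Cross-check against Python's native complex arithmetic.
native = complex(2.5, 3.0) * complex(4.0, 2.0)
print(native == complex(*product))  # True
```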
Some contemporary programming languages natively support complex numbers (Python, MATLAB). Newer revisions of some older languages (C, FORTRAN) have added support for complex numbers. Some programming languages have no native support for complex numbers.
Assignment Definition
Consider the following high-level language code which multiplies two vectors that contain single-precision complex numbers:
Values a, b and c are vectors; _re is the real component element and _im is the imaginary component element in each vector.
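As an illustration only, a loop of this kind can be sketched in Python as shown here. This is not the assignment's own listing: it assumes that c receives the element-wise product of a and b, the function name complex_vector_multiply is hypothetical, and the vector length is taken from the inputs rather than from the assignment.

```python
def complex_vector_multiply(a_re, a_im, b_re, b_im):
    """Element-wise complex product c = a * b on split real/imaginary arrays."""
    n = len(a_re)  # vector length comes from the inputs (an assumption)
    c_re = [0.0] * n
    c_im = [0.0] * n
    for i in range(n):
        # Real part: Firsts minus Lasts (because i^2 = -1).
        c_re[i] = a_re[i] * b_re[i] - a_im[i] * b_im[i]
        # Imaginary part: Outers plus Inners.
        c_im[i] = a_re[i] * b_im[i] + a_im[i] * b_re[i]
    return c_re, c_im


# Element 0 is (2.5 + 3i)(4.0 + 2i); element 1 is (1 + 0i)(0 + 1i).
c_re, c_im = complex_vector_multiply([2.5, 1.0], [3.0, 0.0], [4.0, 0.0], [2.0, 1.0])
print(c_re, c_im)  # [4.0, 0.0] [17.0, 1.0]
```

Each iteration performs four multiplies, one subtract, and one add, and touches four input elements and two output elements.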
1. Convert this loop into pseudo RV64V assembly code using strip mining assuming the following architectural features:
   Register s0 = loop counter & array index [i]
   Vector registers: v0 – v31
   MVL (maximum vector length) = 64
   Instructions:
      vld (vector load)
      vst (vector store)
      vadd (vector add)
      vsub (vector subtract)
      vmul (vector multiply)
      bne (branch if not equal)*
      blt (branch if less than)*
      j (unconditional jump)*
      addi (integer add immediate)*
      ori (logical or immediate)*
   Note: instructions with an asterisk indicate the instructions are used only for setting initial index value and increments, and for loop control.
2. If the vector processor implements chaining with two lanes and has a single vector load/store unit, using the pseudo assembly code from question 1, show how convoys would be constructed to execute in the vector pipeline. How many chimes are required to execute the convoys?
3. Assume in the vector processor, the functional units have the following startup overhead: load/store unit: 12 cycles, multiply unit: 7 cycles, and the add/subtract unit: 6 cycles. How many clock cycles are required for each iteration of the loop, including startup overhead?
4. How many iterations are required to complete processing the vectors?
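Strip mining, which the conversion question asks for, splits a long vector into pieces no longer than MVL so that each piece fits in the vector registers. A minimal Python sketch of the index arithmetic follows. The function name strip_mine and the vector length of 200 are hypothetical values chosen for illustration, not taken from the assignment, and the short strip is placed last here although an implementation may equally process it first.

```python
MVL = 64  # maximum vector length, as stated in the architectural features


def strip_mine(n, mvl=MVL):
    """Yield (start, length) for each strip of an n-element vector."""
    start = 0
    while start < n:
        length = min(mvl, n - start)  # the last strip may be shorter than MVL
        yield start, length
        start += length


# Hypothetical vector length of 200 elements.
strips = list(strip_mine(200))
print(strips)       # [(0, 64), (64, 64), (128, 64), (192, 8)]
print(len(strips))  # 4
```

Each (start, length) pair corresponds to one pass through the strip-mined loop body, so the number of strips is the number of loop iterations for that vector length.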
Instruction Formats
vld (vector load): vld vD, vec_ref
vst (vector store): vst vD, vec_ref
vadd (vector add): vadd vD, vS1, vS2
vsub (vector subtract): vsub vD, vS1, vS2
vmul (vector multiply): vmul vD, vS1, vS2
bne (branch if not equal): bne x1, x2, target_label
blt (branch if less than): blt x1, x2, target_label
j (unconditional jump): j target_label
addi (integer add immediate): addi xD, xS1, const
ori (logical or immediate): ori xD, xS1, const
Format Definitions
vD = destination vector register
vS1 = first source vector register
vS2 = second source vector register
vec_ref = vector reference
x1 = first general purpose register for comparison
x2 = second general purpose register for comparison
xD = destination general purpose register
xS1 = first source general purpose register
xS2 = second source general purpose register
target_label = label of the target instruction for branch
const = an integer constant