### Flynn's Taxonomy of Computers

- Mike Flynn, "Very High-Speed Computing Systems," Proc. of IEEE, 1966
- **SISD:** Single instruction operates on single data element
- **SIMD**: Single instruction operates on multiple data elements
  - Array processor
  - Vector processor
- **MISD:** Multiple instructions operate on single data element
  - Closest form: systolic array processor, streaming processor
- MIMD: Multiple instructions operate on multiple data elements (multiple instruction streams)
  - Multiprocessor
  - Multithreaded processor

# SIMD Processing

Single instruction operates on multiple data elements

- In time or in space
- Multiple processing elements
- Time-space duality
  - Array processor: Instruction operates on multiple data elements at the same time
  - Vector processor: Instruction operates on multiple data elements in consecutive time steps

### Array vs. Vector Processors



### Vector Processors

A vector is a one-dimensional array of numbers

- Many scientific/commercial programs use vectors for (i = 0; i<=49; i++) C[i] = (A[i] + B[i]) / 2
- A vector processor is one whose instructions operate on vectors rather than scalar (single data) values

#### Basic requirements

- Need to load/store vectors -> vector registers (contain vectors)
- Need to operate on vectors of different lengths -> vector length register (VLEN)
- Elements of a vector might be stored apart from each other in memory -> vector stride register (VSTR)

Stride: distance between two elements of a vector

### Vector Processors (II)

- A vector instruction performs an operation on each element in consecutive cycles
  - **Vector functional units are pipelined**
  - Each pipeline stage operates on a different data element
- Vector instructions allow deeper pipelines
  - No intra-vector dependencies -> no hardware interlocking within a vector
  - **No control flow within a vector**
  - **Known stride allows prefetching of vectors into cache/memory**

### Vector Processor Advantages

### + No dependencies within a vector

- •• Pipelining, parallelization work well
- Can have very deep pipelines, no dependencies!
- + Each instruction generates a lot of work
  - **Reduces instruction fetch bandwidth**
- + Highly regular memory access pattern
  - Interleaving multiple banks for higher memory bandwidth
  - •• Prefetching

### + No need to explicitly code loops

**Fewer branches in the instruction sequence** 

### Vector Processor Disadvantages

- -- Works (only) if parallelism is regular (data/SIMD parallelism) ++ Vector operations
  - -- Very inefficient if parallelism is irregular

To program a vector machine, the compiler or hand coder must make the data structures in the code fit nearly exactly the regular structure built into the hardware. That's hard to do in first place, and just as hard to change. One tweak, and the low-level code has to be rewritten by a very smart and dedicated programmer who knows the hardware and often the subtleties of the application area. Often the rewriting is

Fisher, "Very Long Instruction Word architectures and the ELI-512," ISCA 1983. <sup>13</sup>

### Vector Processor Limitations

- -- Memory (bandwidth) can easily become a bottleneck, especially if
  - 1. compute/memory operation balance is not maintained
  - 2. data is not mapped appropriately to memory banks

### Vector Registers

- Each vector data register holds N M-bit values
- Vector control registers: VLEN, VSTR, VMASK
- Vector Mask Register (VMASK)
  - □□ Indicates which elements of vector to operate on
  - Set by vector test instructions

•• e.g., VMASK[i] = 
$$(V_k[i] == 0)$$

Maximum VLEN can be N

□□ Maximum number of elements stored in a vector register



### Vector Functional Units

- Use deep pipeline (=> fast clock) to execute element operations
- Simplifies control of deep pipeline because elements in vector are independent



V3 <- v1 \* v2

### Vector Machine Organization



- CRAY-1
- Russell, "The CRAY-1 computer system," CACM 1978.
- Scalar and vector modes
- 8 64-element vector registers
- 64 bits per element
- 16 memory banks
- 8 64-bit scalar registers
  - 8 24-bit address registers

### Memory Banking

Example: 16 banks; can start one bank access per cycle

Bank latency: 11 cycles

Can sustain 16 parallel accesses if they go to different banks



### Vector Memory System



### Scalar Code Example

For I = 0 to 49
 C[i] = (A[i] + B[i]) / 2

#### Scalar code

| MOVI R0 = 50       | 1                                                                                                                                                 |                                                                                                                                  |
|--------------------|---------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------|
| MOVA R1 = A        | 1                                                                                                                                                 | 304 dynamic instructions                                                                                                         |
| MOVA R2 = B        | 1                                                                                                                                                 |                                                                                                                                  |
| MOVA R3 = C        | 1                                                                                                                                                 |                                                                                                                                  |
| LD R4 = MEM[R1++]  | 11                                                                                                                                                | ;autoincrement addressing                                                                                                        |
| LD R5 = MEM[R2++]  | 11                                                                                                                                                |                                                                                                                                  |
| ADD $R6 = R4 + R5$ | 4                                                                                                                                                 |                                                                                                                                  |
| SHFR R7 = R6 >> 1  | 1                                                                                                                                                 |                                                                                                                                  |
| ST MEM[R3++] = R7  | 11                                                                                                                                                |                                                                                                                                  |
| DECBNZR0, X        | 2                                                                                                                                                 | ;decrement and branch if NZ                                                                                                      |
|                    | MOVA R1 = A<br>MOVA R2 = B<br>MOVA R3 = C<br>LD R4 = MEM[R1++]<br>LD R5 = MEM[R2++]<br>ADD R6 = R4 + R5<br>SHFR R7 = R6 >> 1<br>ST MEM[R3++] = R7 | MOVA R1 = A1MOVA R2 = B1MOVA R3 = C1LD R4 = MEM[R1++]11LD R5 = MEM[R2++]11ADD R6 = R4 + R54SHFR R7 = R6 >> 11ST MEM[R3++] = R711 |

### Scalar Code Execution Time

- Scalar execution time on an in-order processor with 1 bank
   First two loads in the loop cannot be pipelined: 2\*11 cycles
   4 + 50\*40 = 2004 cycles
- Scalar execution time on an in-order processor with 16 banks (word-interleaved)
  - □□ First two loads in the loop can be pipelined
  - □□ 4 + 50\*30 = 1504 cycles
- Why 16 banks?
  - □□ 11 cycle memory access latency
  - Having 16 (>11) banks ensures there are enough banks to overlap enough memory operations to cover memory latency

# Vectorizable Loops

A loop is vectorizable if each iteration is independent of any other

```
■ For I = 0 to 49
    C[i] = (A[i] + B[i]) / 2
  Vectorized loop:
        MOVI VLEN = 50
                                          1
        MOVI VSTR = 1
                                          1
        VLD VO = A
                                          11 + VLN - 1
                                          11 + VLN - 1
        VLD V1 = B
        VADD V2 = V0 + V1
                                          4 + VLN - 1
        VSHFR V3 = V2 >> 1
                                          1 + VLN - 1
        VST C = V3
                                          11 + VLN - 1
```

7 dynamic instructions

```
21
```

### Vector Code Performance

### No chaining

- i.e., output of a vector functional unit cannot be used as the input of another (i.e., no vector data forwarding)
- One memory port (one address generator)
- 16 memory banks (word-interleaved)





Vector Chaining

Vector chaining: Data forwarding from one vector functional unit to another



### Vector Code Performance - Chaining

Vector chaining: Data forwarding from one

vector functional unit to another



### Vector Code Performance – Multiple Memory Ports

Chaining and 2 load ports, 1 store port in each bank



# Questions (I)

What if # data elements > # elements in a vector register?

- Need to break loops so that each iteration operates on
  - # elements in a vector register
  - E.g., 527 data elements, 64-element VREGs
  - 8 iterations where VLEN = 64
  - 1 iteration where VLEN = 15 (need to change value of VLEN)
- Called vector stripmining

What if vector data is not stored in a strided fashion in memory? (irregular memory access to a vector)

- Use indirection to combine elements into vector registers
- Called scatter/gather operations

### Gather/Scatter Operations

Want to vectorize loops with indirect accesses:

```
for (i=0; i<N; i++)
    A[i] = B[i] + C[D[i]]</pre>
```

Indexed load instruction (Gather)

LV vD, rD # Load indices in D vector LVI vC, rC, vD # Load indirect from rC base LV vB, rB # Load B vector ADDV.D vA,vB,vC # Do add SV vA, rA # Store result

## Conditional Operations in a Loop

What if some operations should not be executed on a vector (based on a dynamically-determined condition)?

```
loop: if (a[i] != 0) then b[i]=a[i]*b[i]
goto loop
```

- Idea: Masked operations
  - VMASK register is a bit mask determining which data element should not be acted upon
     VLD V0 = A
     VLD V1 = B
     VMASK = (V0 != 0)
     VMUL V1 = V0 \* V1
     VST B = V1

### Another Example with Masking

```
for (i = 0; i < 64; ++i)
if (a[i] >= b[i]) then c[i] = a[i]
else c[i] = b[i]
```

| А  | В  | VMASK |
|----|----|-------|
| 1  | 2  | 0     |
| 2  | 2  | 1     |
| 3  | 2  | 1     |
| 4  | 10 | 0     |
| -5 | -4 | 0     |
| 0  | -3 | 1     |
| 6  | 5  | 1     |
| -7 | -8 | 1     |

Steps to execute loop

- 1. Compare A, B to get VMASK
- 2. Masked store of A into C
- 2. Complement VMASK
- 2. Masked store of B into C

### Masked Vector Instructions

#### Simple Implementation

 execute all N operations, turn off result writeback according to mask



#### **Density-Time Implementation**

 scan mask vector and only execute elements with non-zero masks



### Some Issues

### Stride and banking

As long as they are relatively prime to each other and there are enough banks to cover bank access latency, consecutive accesses proceed in parallel

#### Storage of a matrix

- Row major: Consecutive elements in a row are laid out consecutively in memory
- Column major: Consecutive elements in a column are laid out consecutively in memory
- You need to change the stride when accessing a row versus column

Matrix multiplication A& B, ban m raw major order BODI Aolo 5 6 8 2 3 4 5 2 10 11 12 13 14 15 16 17 18 19 7 8 10 9 h 20 30 40 50 VI Age BBNIO -> CGNIO (dot products of rows & columns of A&B) Load Ao mto a vector register V1 A: -> each time you need to increment the address by 1 to access the next column - First motion accesses have a stride of 1 4 cad Boonto a vector register V2 B. to each time you need to movement by 10 -> stride of 10 Different strides can lead to bank conflicts. -> How do you momore from?

### Array vs. Vector Processors, Revisited

Array vs. vector processor distinction is a "purist's" distinction

Most "modern" SIMD processors are a combination of both
 They exploit data parallelism in both time and space

# Array vs. Vector



### Vector Instruction



### Vector Instruction Level Parallelism

Can overlap execution of multiple vector instructions

- example machine has 32 elements per vector register and 8 lanes
- Complete 24 operations/cycle while issuing 1 short instruction/cycle



### Automatic Code Vectorization





### Vector/SIMD Processing Summary

- Vector/SIMD machines good at exploiting regular data-level parallelism
  - Same operation performed on many data elements
  - Improve performance, simplify design (no intra-vector dependencies)

Performance improvement limited by vectorizability of code

- Scalar operations limit vector machine performance
- Amdahl's Law
- **CRAY-1** was the fastest SCALAR machine at its time!

Many existing ISAs include (vector-like) SIMD operations Intel MMX/SSEn/AVX, PowerPC AltiVec, ARM Advanced SIMD

# MMX Example: Image Overlaying (I)



Figure 8. Chroma keying: image overlay using a background color.

#### PCMPEQB MM1, MM3 Blue Blue Blue MM1 Blue Blue Blue Blue Blue X3!=blue X2!=blue X1=blue X0=blue X7!=blue X6!=blue X5=blue X4=blue MM3 **0xFFFF** 0x0000 0x0000 0xFFFF 0xFFFF 0x0000 0x0000 0xFFFF MM1 Bitmask

#### Figure 9. Generating the selection bit mask.