Tutorial 9

Problem 1
The control unit of a CPU needs to decide whether one of two signed numbers is greater than another. In order to accomplish this, the following logic expression needs to be evaluated:

\[ \text{GT} = \text{NOT} \ (\text{Z OR (S XOR V)}) \]

In this expression, S, V, and Z are single bit input signals from the ALU, and GT (greater-than) is a single bit valued output signal.

a) What is the truth table of this expression?
b) How can this combinational circuit be implemented using NAND gates?
c) Draw a circuit using n-switch und p-switch.

Problem 2
You are debugging a piece of software on an embedded processor. This processor has a word size of 32 bits, and it is using the big-endian data arrangement in a byte-addressed memory. At a debug port, you can access to single bytes in its memory.

You intend to write the word with value 1 to address 0x10. Which memory accesses do you have to perform at the debug port?

Problem 3
The data cache of a 32 bit processor is 16KB of size. The frame size is 256 bytes, and the cache is 4-fold set associative.

a) How many words are within one frame?
b) How many frames does the cache have?
c) How many bits are required to select words and frames? How many bits need to be taken into account for tag compare?
d) How many comparators does the cache use?

Problem 4
Sketch a function in C that adds two vectors using either SSE or AVX extensions. The function signature shall look like follows:

```c
void add_vector (float * c, float * a, float * b, int n);
```

The vectors shall be aligned at 256-bit boundaries. The length of the vectors shall be arbitrary.

Problem 5
Sketch a function on a GPU that adds a scalar value to a vector. The function signature shall be as follows:

```c
void add_scalar (float s, float * v, int n);
```

The length of the vectors shall be arbitrary.
Problem 6
Amdahl’s law states that the total performance of a parallel program consists of the portions of the program that can be parallelized and the sequential portion.

a) The vector unit of a CPU is 8 times faster than the scalar unit. Which part of the program needs to be parallelizable in order to make the program twice as fast?
b) Which portion of the program needs to be parallelizable in order achieve at least half of the theoretically possible speed up?
c) Assume 75% of the program is parallelizable. The hardware department says, it can develop of larger vector unit that is twice as fast. For how much would the software development department improve vectorization in the compiler, in order to match the hardware speed up?

Problem 7
a) Which computer was the first one with 8-bit wide bytes?
b) What are the design principles of the RISC architecture?
c) What is the difference between Stanford RISC and Berekeley RISC?
d) How many transistors does a D-RAM require to store one bit?
e) What is the difference between row-major and column-major layout for two-dimensional arrays?
f) Which flags are there at the output of an ALU, and what do they mean?

Problem 8
In a cache, there shall be 4 cache lines with one bit frame size. The cache shall be fully associative. The replacement strategy shall be least-recently-used (LRU).

The cache shall be accessed at the following addresses:
5 2 6 4 1 9 6 9 6 1 9 3 2 3 9 6

a) Determine the cache state for every time step: i.e. contents of the memories, tag addresses, and recently-used information.
b) Mark evictions from the cache.