Chapter 5

 

Overriding concern: performance

Execution Time = # instructions * CPI * CycleTime

# instructions depends on instruction set and compiler

CPI and clock cycle time depend on implementation of processor
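
To make the trade-off concrete, here is a minimal sketch of the execution time formula. The instruction count and the two designs are hypothetical numbers chosen only for illustration.

```python
def execution_time_ns(instruction_count, cpi, cycle_time_ns):
    """Execution time = # instructions * CPI * cycle time (in ns)."""
    return instruction_count * cpi * cycle_time_ns


# Hypothetical program of one million instructions on two hypothetical designs:
# a low CPI with a long clock cycle, and a higher CPI with a short clock cycle.
slow_clock = execution_time_ns(1_000_000, cpi=1.0, cycle_time_ns=8.0)
fast_clock = execution_time_ns(1_000_000, cpi=4.0, cycle_time_ns=2.0)
print(slow_clock, fast_clock)
```

Neither CPI nor cycle time alone determines performance; only the product matters.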

 

Implement subset of MIPS: memory reference, arithmetic-logical and beq/jump

 

· Understand the general process for designing the datapath (part of processor that does arithmetic) and control (part of processor that commands the datapath, memory, and I/O devices according to program instructions)

· Understand connection between instruction set and implementation

· Understand how implementation choices affect clock rate and CPI

 

 

Processor Design Overview
 

· Identify steps to be taken for each type of instruction

· Consider the hardware requirements for each step

· Connect the hardware with appropriate controls to achieve desired result

 

Figure 5.1

 

Logic Design: Background

 

Working with bits: on/off values

Implement in electronics: high/low

Asserted - signal that is logically high. Assert - make high.

 

Two types of logic elements: combinational and state

 

Combinational: outputs depend only on current inputs (ALU)

State/Sequential: has some internal storage (memory, registers). 

 Output depends on both current inputs and internal state. 

 

Clocking methodology determines when signals can be read/written.

 

Edge-triggered - update only on clock edge
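
The difference between the two kinds of elements can be sketched in software. This is only a behavioral model with hypothetical names (EdgeTriggeredRegister, adder), assuming the register updates on the rising clock edge.

```python
def adder(a, b):
    """Combinational: the output depends only on the current inputs."""
    return (a + b) & 0xFFFFFFFF


class EdgeTriggeredRegister:
    """State element: holds a value and updates only on a rising clock edge."""

    def __init__(self, value=0):
        self.value = value       # internal state, always visible on the output
        self._prev_clock = 0

    def tick(self, clock, data_in, write_enable=True):
        rising_edge = self._prev_clock == 0 and clock == 1
        if rising_edge and write_enable:
            self.value = data_in
        self._prev_clock = clock
        return self.value


reg = EdgeTriggeredRegister()
outputs = [reg.tick(clk, data) for clk, data in [(0, 5), (1, 5), (1, 9), (0, 9), (1, 9)]]
print(outputs)
```

The input change from 5 to 9 while the clock stays high is ignored until the next rising edge.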

 

Figure 5.2

 

Processor Implementation

Initial implementation - 1 long clock cycle. 

Refinement - multiple clock cycles/instruction

 

First step for every instruction type is to fetch the next instruction to be executed. The program instructions are stored in memory. A register called the program counter (PC) keeps track of the address of the current instruction in memory. 
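
A minimal sketch of the fetch step, assuming instruction memory is modeled as a dictionary from byte address to 32-bit word (a simplification for illustration, not how the hardware is organized):

```python
def fetch(instruction_memory, pc):
    """Return the instruction at address pc and the incremented PC."""
    instruction = instruction_memory[pc]
    next_pc = (pc + 4) & 0xFFFFFFFF   # each MIPS instruction is 4 bytes
    return instruction, next_pc


# 0x012A4020 encodes add $t0, $t1, $t2
memory = {0x00400000: 0x012A4020}
instruction, pc = fetch(memory, 0x00400000)
print(hex(instruction), hex(pc))
```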

 

Figure 5.5

 

R-format Instructions

 

Register file - collection of 32 registers. A register is read or written by specifying its register number.
 

ALU - performs operation after values read from registers. 

 

R-format specifies 3 registers (two source, one destination). Register file has two outputs (contents of two specified sources) and four inputs (3 register numbers plus data for write operation). Also has clock signal to control timing of write. 
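
The register file interface described above can be sketched as follows. The class and parameter names are hypothetical; the model assumes reads are always available while writes happen only when a write signal is asserted, and it keeps register 0 at zero as MIPS does.

```python
class RegisterFile:
    """Behavioral model: 2 read ports, 1 write port, 32 registers."""

    def __init__(self):
        self.regs = [0] * 32

    def read(self, read_reg1, read_reg2):
        # Two outputs: contents of the two specified source registers.
        return self.regs[read_reg1], self.regs[read_reg2]

    def write(self, write_reg, write_data, reg_write):
        # Called once per clock edge; the write happens only if reg_write is asserted.
        if reg_write and write_reg != 0:   # $zero is hardwired to 0
            self.regs[write_reg] = write_data & 0xFFFFFFFF


rf = RegisterFile()
rf.write(9, 7, reg_write=True)
rf.write(10, 5, reg_write=True)
rf.write(8, 99, reg_write=False)   # not asserted: register 8 is unchanged
print(rf.read(9, 10), rf.read(8, 0))
```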

 

 

Figure 5.7

 

Memory Access/I-format Instructions

 

lw $t1, offset($base) 

sw $t1, offset($base)

 

ALU - computes memory address

 

Data memory - reads/writes specified memory location

 

Sign extend - needed for 16-bit offset field
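
A short sketch of why sign extension is needed: the 16-bit offset may be negative, so it must be widened before the ALU adds it to the base register. The function names are hypothetical.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed two's complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def effective_address(base, offset16):
    """Address used by lw/sw: base register + sign-extended offset (32-bit wrap)."""
    return (base + sign_extend16(offset16)) & 0xFFFFFFFF


print(hex(effective_address(0x1000, 8)))        # positive offset
print(hex(effective_address(0x1000, 0xFFFC)))   # 0xFFFC is -4 as a 16-bit value
```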

 

Figure 5.9
 
Branching/I-format Instructions

 

beq $t1, $t2, offset

 

Adder - computes branch target address, using PC + 4 and the offset (the ALU itself compares the two registers)

 

Sign extend - needed for 16-bit offset field. Also need to shift left by 2 bits (the offset counts words, but the target is a byte address).

 

Branch control - determines whether branch is actually taken
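
The pieces above can be sketched together. This is an illustrative model with hypothetical function names, assuming the offset is taken relative to PC + 4.

```python
def sign_extend16(value):
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc, offset16):
    """PC + 4 plus the sign-extended offset shifted left by 2 (words to bytes)."""
    return (pc + 4 + (sign_extend16(offset16) << 2)) & 0xFFFFFFFF


def next_pc_for_beq(pc, offset16, reg_a, reg_b):
    """The ALU subtracts; a zero result means the registers are equal."""
    zero = ((reg_a - reg_b) & 0xFFFFFFFF) == 0
    return branch_target(pc, offset16) if zero else (pc + 4) & 0xFFFFFFFF


print(hex(next_pc_for_beq(0x100, 3, 5, 5)))   # taken
print(hex(next_pc_for_beq(0x100, 3, 5, 6)))   # not taken
```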

 

Figure 5.10
 
Creating a Single Datapath

 

Simplest: execute all instructions in one clock cycle

 

Restriction: no datapath element can be used more than once per instruction. Must duplicate any resources needed more than once. Examples: have separate memories for instructions and data. Have an adder for updating the PC that is separate from the main ALU. 

 

Requirement: allow input to datapath elements to come from different sources. Examples: ALU may add two register values (R-type instruction) or may add a register and a sign-extended portion of the instruction itself (I-type instruction). Value to be written into a register may come from the ALU (R-type result) or memory (load). Solution: use multiplexors with appropriate control signals. 
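
A two-input multiplexor is simple to model. The sketch below uses hypothetical function names and assumes a select value of 0 picks the first input and 1 picks the second.

```python
def mux2(select, input0, input1):
    """2-to-1 multiplexor: select = 0 passes input0, select = 1 passes input1."""
    return input1 if select else input0


def alu_second_operand(alu_src, register_value, sign_extended_immediate):
    # R-type uses the register; memory access uses the immediate.
    return mux2(alu_src, register_value, sign_extended_immediate)


def register_write_data(mem_to_reg, alu_result, memory_data):
    # R-type writes the ALU result; a load writes the data read from memory.
    return mux2(mem_to_reg, alu_result, memory_data)


print(alu_second_operand(0, 11, 22), alu_second_operand(1, 11, 22))
```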

 

Datapath for memory and arithmetic-logical instructions: 

Figure 5.12
 

Exercise 1 - handout in class

 
 

Additional requirements to handle branch instructions:

Need adder for computing branch target address (ALU is used for branch comparison)

 

Need to select between branch target address and normal sequential address (PC + 4)

 

Resulting datapath:

Figure 5.13

 

Implementing the Control Unit

Selection of registers is determined by simply decoding the instruction and using the appropriate fields to select registers from the register file. 
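
Decoding amounts to slicing fixed bit fields out of the 32-bit instruction word. A minimal sketch using the standard MIPS field positions (the dictionary layout is just for illustration):

```python
def decode(instruction):
    """Split a 32-bit MIPS instruction into its fields."""
    return {
        "opcode": (instruction >> 26) & 0x3F,     # bits 31-26
        "rs": (instruction >> 21) & 0x1F,         # bits 25-21
        "rt": (instruction >> 16) & 0x1F,         # bits 20-16
        "rd": (instruction >> 11) & 0x1F,         # bits 15-11 (R-format)
        "shamt": (instruction >> 6) & 0x1F,       # bits 10-6  (R-format)
        "funct": instruction & 0x3F,              # bits 5-0   (R-format)
        "immediate": instruction & 0xFFFF,        # bits 15-0  (I-format)
        "address": instruction & 0x03FFFFFF,      # bits 25-0  (J-format)
    }


fields = decode(0x012A4020)   # add $t0, $t1, $t2
print(fields["rs"], fields["rt"], fields["rd"], hex(fields["funct"]))
```

All fields are extracted in parallel regardless of format; the control unit decides which ones are meaningful for the current opcode.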

 

Based on the instruction being executed, the control unit must then determine:

· ALU control - what operation to be performed by the ALU

· Multiplexor controls - what signal to select for each mux

· Write signals needed for state elements (memory, register file)

 

ALU Control

Five possible functions for ALU: see table on page 353

 

Strategy: multiple levels of decoding.

1. Main control unit determines whether ALU control is based on the function field of the instruction (for R-type instructions) OR is an add (for memory access instructions) OR is a subtract (for beq instructions). This portion of the control is called ALUOp. Values generated are 00 (for memory access), 01 (for beq) or 10 (use funct field).

2. ALU control unit generates actual signals to ALU, to match table above. 

 

First step in designing ALU control: create a truth table of ALUOp/funct field combinations and the resulting ALU control signals. 

 

Figure 5.14 - original table with all possibilities

Figure 5.15 - revised table with additional don't cares, showing only entries that must be asserted. 
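
The two-level decoding can be sketched as a lookup. The ALUOp values come from the notes above; the 3-bit ALU control codes and the funct values are an assumption, since the referenced tables are not reproduced here (they follow the encoding commonly used with this datapath).

```python
# Assumed 3-bit ALU control codes: AND=000, OR=001, add=010, subtract=110, slt=111
FUNCT_TO_ALU_CONTROL = {
    0b100000: 0b010,   # add
    0b100010: 0b110,   # sub
    0b100100: 0b000,   # and
    0b100101: 0b001,   # or
    0b101010: 0b111,   # slt
}


def alu_control(alu_op, funct):
    """Second-level decode: combine ALUOp with the funct field."""
    if alu_op == 0b00:            # memory access: always add
        return 0b010
    if alu_op == 0b01:            # beq: always subtract
        return 0b110
    if alu_op == 0b10:            # R-type: the funct field decides
        return FUNCT_TO_ALU_CONTROL[funct & 0x3F]
    raise ValueError("unused ALUOp value")


print(format(alu_control(0b10, 0b100010), "03b"))
```

For ALUOp 00 and 01 the funct field is a don't care, which is what lets the truth table shrink.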

 

Optimizing the truth table and converting to hardware gates is a mechanical process that is best left to a computer program.

 

Multiplexor Control

 

ALUSrc - determines second operand for ALU. Will be either a register (R-type) or the sign-extended immediate field (memory access). 

 

PCSrc - determines what address to use for next instruction (what address to load into PC). Will be either PC+4 (normal operation; next instruction) or PC+4+[sign-extended, left-shifted] immediate field (branch target). 

 

MemtoReg - determines what value is written into the register file. Will be either the output of the ALU (R-type) or data read from memory (memory access). 

 

RegDst (not in Figure 5.13) - determines what register to write. Will be specified by either the rt field (memory access) or rd field (R-type). 

 

Most of these decisions can be based solely on the opcode field of the instruction. PCSrc is based on both the opcode AND the result of the compare. 

 

Read/Write (State) Control

 

RegWrite - assert for R-type and load instructions

MemRead - assert to read data from memory for load instructions

MemWrite - assert to write data to memory during store instructions

 

These signals are also based on the opcode field of the instruction. 

 

Figure 5.20

 

PLA Implementation

 

Programmable Logic Array - array of AND gates followed by array of OR gates. Inputs to the AND gates are the function inputs and their inverses. Inputs to the OR gates are the outputs of the AND gates.
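
A PLA evaluates a sum-of-products form, which can be modeled in a few lines. The data layout here (and_plane, or_plane) is a hypothetical representation chosen for illustration.

```python
def pla(inputs, and_plane, or_plane):
    """Evaluate a PLA.

    and_plane: list of product terms; each term maps an input index to the
               bit it requires (1 = true input, 0 = inverted input).
    or_plane:  one list of product-term indices per output.
    """
    products = [all(inputs[i] == bit for i, bit in term.items())
                for term in and_plane]
    return [int(any(products[p] for p in terms)) for terms in or_plane]


# Example: a single output computing XOR of two inputs = A.B' + A'.B
and_plane = [{0: 1, 1: 0}, {0: 0, 1: 1}]
or_plane = [[0, 1]]
print([pla([a, b], and_plane, or_plane) for a in (0, 1) for b in (0, 1)])
```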

 

Figure C.5

 

 

Datapath with Control Unit

 

Figure 5.19

Exercise 2 - handout in class

 

Operation of the Datapath

 

Four steps to execute an R-type instruction (Ex: add $t1, $t2, $t3)

1. Fetch instruction from memory, increment PC.

2. Read source registers from register file

3. ALU performs operation (based on funct field)

4. Result is written into register file

 

Five steps to execute memory access (I-type) instruction (Ex: lw $t1, offset($t2))

1. Fetch instruction from memory, increment PC.

2. Read base register from register file

3. ALU computes sum of base + sign-extended offset

4. Sum is used as address for data memory

5. Data from memory is written into register file

 

Four steps to execute branch (I-type). (Ex: beq $t1, $t2, offset)

1. Fetch instruction from memory, increment PC.

2. Read compare registers from register file. 

3. ALU performs subtraction. PC+4 is added to sign-extended, left-shifted offset to calculate target branch address. 

4. Zero result from ALU is used to decide which result to store in PC. 

 

Finalizing the Control

 

To complete the control function, need to create a truth table for each output. 

 

See table on page 367 and Figure 5.27
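
As a sketch, the control truth table can be written as a lookup keyed by opcode. The signal values below follow the descriptions in these notes; the Branch signal name and the use of None for don't cares are assumptions made for illustration, since the referenced table is not reproduced here.

```python
X = None   # don't care

CONTROL = {
    0b000000: dict(RegDst=1, ALUSrc=0, MemtoReg=0, RegWrite=1,
                   MemRead=0, MemWrite=0, Branch=0, ALUOp=0b10),   # R-format
    0b100011: dict(RegDst=0, ALUSrc=1, MemtoReg=1, RegWrite=1,
                   MemRead=1, MemWrite=0, Branch=0, ALUOp=0b00),   # lw
    0b101011: dict(RegDst=X, ALUSrc=1, MemtoReg=X, RegWrite=0,
                   MemRead=0, MemWrite=1, Branch=0, ALUOp=0b00),   # sw
    0b000100: dict(RegDst=X, ALUSrc=0, MemtoReg=X, RegWrite=0,
                   MemRead=0, MemWrite=0, Branch=1, ALUOp=0b01),   # beq
}


def pc_src(opcode, alu_zero):
    """PCSrc depends on the opcode (Branch) AND the result of the compare."""
    return CONTROL[opcode]["Branch"] & int(alu_zero)


print(CONTROL[0b100011]["MemtoReg"], pc_src(0b000100, True), pc_src(0b000100, False))
```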

 

Implementing Jumps

 

Jumps add another option for the PC: the jump address is formed by concatenating the upper 4 bits of PC + 4 with the 26-bit address field from the instruction, shifted left by 2. Requires another mux and shifter. 
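
A minimal sketch of the jump address calculation, using a hypothetical function name:

```python
def jump_target(pc, instruction):
    """Upper 4 bits of PC + 4, then the 26-bit field, then two zero bits."""
    pc_plus_4 = (pc + 4) & 0xFFFFFFFF
    address_field = instruction & 0x03FFFFFF
    return (pc_plus_4 & 0xF0000000) | (address_field << 2)


# 0x08100004 is a j instruction (opcode 2) with address field 0x0100004
print(hex(jump_target(0x00400000, 0x08100004)))
```

No adder is involved: the bits are simply wired together, which is why only a shifter and a mux are required.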

 

Figure 5.29

 

Single-Cycle Performance Issues

 

Clock cycle has same length for every instruction. CPI is 1. 

BUT, clock cycle is determined by the longest possible path, so the clock cycle must be long enough for the slowest instruction - in this case a load instruction. 

 
 

Assume the following operation times:
Memory units - 2 nanoseconds (ns)

ALU and adders - 2 ns

Register file access - 1 ns

 

Required length of various instructions: 
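
The lengths can be worked out by summing the delays of the units each instruction class uses. The sketch below uses the operation times listed above and assumes that multiplexors, control, PC access, sign extension, and wires contribute no delay; the paths are the ones described in these notes.

```python
MEMORY_NS = 2
ALU_NS = 2
REGFILE_NS = 1

# Units on each instruction's path, in order of use.
PATHS = {
    "R-type": [MEMORY_NS, REGFILE_NS, ALU_NS, REGFILE_NS],          # fetch, read, ALU, write
    "lw": [MEMORY_NS, REGFILE_NS, ALU_NS, MEMORY_NS, REGFILE_NS],   # fetch, read, ALU, memory, write
    "sw": [MEMORY_NS, REGFILE_NS, ALU_NS, MEMORY_NS],               # fetch, read, ALU, memory
    "beq": [MEMORY_NS, REGFILE_NS, ALU_NS],                         # fetch, read, ALU
    "j": [MEMORY_NS],                                               # fetch only
}

lengths = {name: sum(path) for name, path in PATHS.items()}
clock_cycle_ns = max(lengths.values())
print(lengths, clock_cycle_ns)
```

Under these assumptions the load sets the clock cycle at 8 ns, even though a jump needs only 2 ns.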

 
Cannot easily vary the clock cycle time. Better alternative: vary the number of clock cycles for different instruction classes. This would be even more important for machines with more powerful operations and addressing modes, which might have many more functional unit delays for some types of instructions. 

 

Multicycle Implementation

 

In a multicycle implementation, each step in the execution will take 1 clock cycle. 

 

Different instructions have different number of steps (and therefore different number of clock cycles).

 

Functional units may be used more than once per instruction. This reduces the number of potentially costly elements such as memory or ALUs, but sometimes requires the addition of storage units to preserve results for use later in an instruction. Registers and multiplexors are smaller and cheaper. 

 

Differences between single- and multi-cycle implementations

· Single memory can be used for both instructions and data

· There is a single ALU, rather than an ALU and two adders

· Registers are added after every major functional unit. These store data that will be used later in the same instruction. Data that will be used by subsequent instructions is stored into one of the major functional units (register file, PC, memory). 

 

Specific registers added to implementation:

· Instruction Register (IR) and Memory Data Register (MDR) save output of memory.

· A and B registers hold values read from register file.

· ALUOut holds the output of the ALU.

 

Additional multiplexors:

· ALU Input 1 - may be either a register or the PC

· ALU Input 2 - may be a register, the sign-extended immediate field (for memory access), the constant 4 (for updating the PC) or the sign-extended, left-shifted offset field (for branches).

 

Additional control signals:

· PCWrite and PCWriteCond - to write PC unconditionally (in normal operation) and conditionally (during branch)

 

Multicycle Datapath and Control

 

Figure 5.33

 

Breaking the Instruction Execution into Clock Cycles

 

Goal: Balance the amount of work to minimize the clock cycle time

 

Strategy: Break execution of an instruction into steps, with each step taking one clock cycle. To obtain roughly even step sizes, restrict each step to contain at most one major operation (ALU operation, register file access, memory access). Clock cycle is then as short as the longest of these operations. NOTE: Register file access has enough overhead to be considered a separate operation, but just storing a result into a single stand-alone register is short enough to be part of a step.
 

Exercise 4 - handout in class

 

 

Defining the Control

 

Control must specify both the signals required for that step and what step to take next in the sequence. 

 

Two techniques/representations for control: finite state machines and microprogramming. The actual hardware implementation can be synthesized from either of these formats by an automated CAD system (Appendix C).

 

Finite state machine - set of states and directions on how to change states [state = step]. Next state function maps the current state and inputs to a new state. Each state specifies outputs [control signals] that are asserted during that state. Implementation assumes that all signals not explicitly asserted are deasserted (i.e., not don't cares). Finite state control essentially corresponds to five steps of execution. Each state is represented by a circle. Outputs for that state are listed within the circle. Arcs between states are labeled with conditions. 
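
A next-state function can be sketched as a table. The state names below are hypothetical, and the transitions are an assumption based on the step sequences in these notes rather than a copy of the figure.

```python
# Each state maps an instruction class to the next state ("any" = unconditional arc).
TRANSITIONS = {
    "fetch": {"any": "decode"},
    "decode": {"lw": "mem_address", "sw": "mem_address", "r_type": "execute",
               "beq": "branch_complete", "j": "jump_complete"},
    "mem_address": {"lw": "mem_read", "sw": "mem_write"},
    "mem_read": {"any": "write_back"},
    "write_back": {"any": "fetch"},
    "mem_write": {"any": "fetch"},
    "execute": {"any": "r_complete"},
    "r_complete": {"any": "fetch"},
    "branch_complete": {"any": "fetch"},
    "jump_complete": {"any": "fetch"},
}


def next_state(state, instruction_class):
    arcs = TRANSITIONS[state]
    return arcs.get(instruction_class, arcs.get("any"))


def cycles(instruction_class):
    """Count the states visited before control returns to fetch."""
    state, count = "fetch", 0
    while True:
        count += 1
        state = next_state(state, instruction_class)
        if state == "fetch":
            return count


print({c: cycles(c) for c in ("lw", "sw", "r_type", "beq", "j")})
```

Since each state takes one clock cycle, the number of states visited is the cycle count for that instruction class.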

 

Figure 5.36

 

Figure 5.42

 
 

Microprogramming and Control

 

Graphical representations (i.e., finite state machines) can be unwieldy with larger, more varied instruction sets. As an alternative, think about establishing a set of rules or instructions for determining what control signals to assert and in what order. These low-level control instructions are called microinstructions. The process of designing the set of control instructions is called microprogramming. 

 

Remember: a microprogram is a symbolic representation of what must happen. The underlying hardware will still be implemented with gates, ROMs or PLAs. The goal is to develop a format that makes it easy to write and understand the microprogram. It should also be difficult to write inconsistent microinstructions (two different values for same control signal). 

 

Each microinstruction contains a set of fields. Each field specifies a nonoverlapping set of control signals OR is like a directive that provides guidance on how to interpret the microprogram. 

 

Figure 5.45, 5.46

 

Exercise 5 - handout in class

Timing Issues - Branch

1. Read instruction, calculate PC + 4

2. Decode, read registers, calculate branch target

3. Subtract, use Zero
 
 
CPI for the Multicycle Implementation

 

Assume (based on gcc instruction mix):

 Loads:     5 cycles   23%

 Stores:    4 cycles   13%

 ALU:       4 cycles   43%

 Branches:  3 cycles   19%

 Jumps:     3 cycles    2%

 

CPI = 0.23 * 5 + 0.13 * 4 + 0.43 * 4 + 0.19 * 3 + 0.02 * 3 = 4.02
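
The same weighted average, written as a short check using the cycle counts and instruction mix above:

```python
# (cycles, fraction of instructions) for each class, from the mix above
MIX = {
    "loads": (5, 0.23),
    "stores": (4, 0.13),
    "ALU": (4, 0.43),
    "branches": (3, 0.19),
    "jumps": (3, 0.02),
}

cpi = sum(cycle_count * fraction for cycle_count, fraction in MIX.values())
print(round(cpi, 2))
```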
 
 

Exceptions

 

One of the hardest parts of control is to implement exceptions and interrupts - events that change the normal flow of execution. It's much harder to add exception handling later. 

 

External events (I/O device request) - interrupt

Internal events (invoking the operating system, arithmetic overflow, divide by 0, undefined instruction) - exception

Hardware malfunction - could be either

 

Two ways to communicate the reason for an exception: status register (Cause register) has field that indicates reason, or vectored interrupt - address of interrupt service routine (ISR) depends on cause of exception. 

 

MIPS architecture includes EPC (address of the instruction that caused the exception; obtained by subtracting 4 from the PC, which has already been incremented) and a Cause register that records the reason. 
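
A rough sketch of what the hardware records when an exception occurs. The function name, the cause codes, and the handler address are hypothetical placeholders, not values taken from these notes.

```python
# Hypothetical cause codes, for illustration only
CAUSE_UNDEFINED_INSTRUCTION = 0
CAUSE_ARITHMETIC_OVERFLOW = 1


def take_exception(pc, cause_code, handler_address):
    """Save the offending instruction's address and reason, then redirect the PC.

    pc is assumed to have been incremented already, so the offending
    instruction is at pc - 4.
    """
    return {
        "EPC": (pc - 4) & 0xFFFFFFFF,
        "Cause": cause_code,
        "PC": handler_address,
    }


state = take_exception(0x00400010, CAUSE_ARITHMETIC_OVERFLOW, handler_address=0x1000)
print(hex(state["EPC"]), state["Cause"], hex(state["PC"]))
```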

 

Figure 5.50