5-STEP DESIGN METHODOLOGY FOR A GENERAL PURPOSE CPU USING STANDARD CPLDs/FPGAs

Karim Salman, Michael B. Anderton
Middle Tennessee State University

Abstract

We present a novel hardware Central Processing Unit (CPU) design methodology based on a 5-step approach. The method starts with a definition of the target CPU internal components and data and address size. The method is applicable to a higher level of abstraction and complexity. However, for ease of illustration a basic CPU with a minimum size instruction set is selected. The instruction set complies with the instruction set completeness criteria. The instruction format is likewise chosen to be simple and illustrates the way our methodology is implemented. The CPU is implemented on an Altera FPGA/CPLD Flex10K device using schematic approach with the Altera MAX+Plus II software CAD. The design was simulated and tested using Altera UP2 board.

Introduction

CPU design for engineering/engineering technology students varied widely in objectives and approach1-7. For a long time, block diagrams of simple CPUs have been used in beginning computer courses, mainly to allow students to visualize how a CPU functions. To meet this need, many textbook authors1,6,7 have devised simple CPUs at the block diagram level to illustrate how instructions are executed and data are manipulated. Obviously, omitting many of the circuit details allows an overall understanding that is usually sufficient for students with little or no experience with digital circuits. However, students in accredited Electronics and Computer Engineering and Engineering Technology programs, have a more thorough background in digital circuit design. They are able to understand how instructions are decoded, what control signals are required for datapath operation, and how those control signals are generated. By examining this extra level of detail, students can better tie the new material to principles they have already learned. The low end approach is mainly descriptive of either a hypothetical ad hoc designs that remain to be implemented and tested4,6,8, or description of operation and design criteria of available well established designs10. On the high end, the approach usually departs and concentrates on describing complex architectures. This paper describes an architecture that is amenable to hardware implementation. A 5-step design methodology is presented. Although the paper adopts an already described architecture6,
the 5-steps described here allow a novice to be able to tackle, design and implement more challenging and advanced CPU architectures.

The design methodology

We begin by establishing some requirements for the CPU in mind. However, we are attempting to finally design and implement a simple architecture without sacrificing the important and basic issues in such design. For example we need first to address the following issues: word length, memory size, and registers. These can also be restated or modified without in another design once the methodology is adhered to. Since we are adopting a previously published design, the first two steps in our methodology would be to define our basic computer:

1. 16-bit data wordlength and 12-bit memory address. The wordlength size can address double precision data more readily than 8-bit designs, yet it adds little overhead in our methodology. The memory size will not be too prohibitive for hardware implementation. Our choice for the target hardware platform is Altera Flex10K FPGA/CPLD device\footnote{11}. The memory size can therefore be increased if a newer device is selected.

2. The hardware components will consist of:
   a. A memory unit with 2k words of 16-bits each.
   b. Ten registers.
   c. Two decoders: a 3x8 operation decoder and a 4x16 timing decoder.
   d. A 16-bit multiplexer based common bus.
   e. Control Logic Unit.
   f. Arithmetic and Logic Unit (ALU).

In addition to a 4-bit Sequence Counter (SC), there are 9-registers as shown in Fig. 1:

![Fig. 1 Basic computer registers and memory.](image)

- Two address registers, the Program Counter (PC), and the Address Register (AR). Both are 12-bit wide. The program counter takes charge of the instruction normal flow whereas the Address Register main duty is to...
address the memory. This is an important concept in the design because it makes the design easy to implement from the hardware point of view, and allows interrupt handling without adding additional address registers.

- A 16-bit instruction register (IR). The instruction format that dictates this choice will be addressed later.
- A 16-bit accumulator (AC). The input to this accumulator is connected directly to the output of the Arithmetic and Logic Unit (ALU). The AC output permanently feeds and shares one leg of the ALU with the input register (INPR).
- Two additional 16-bit data registers, a Data Register (DR), and a Temporary Register (TR). The DR is permanently connected to the second leg of the ALU. Therefore, the ALU is physically separated from the main bus. This feature is important to avoid metastabilities in the sequential design.
- Two 8-bit registers. One is the Input Register (INPR), and the other one is the Output Register (OUTR). These facilitate communications with external input and output devices.
- A collection of 7 disjoint flip flops that collectively act as the CPU status register. The Interrupt Register (R), The Stop Register (S), The Carry Register (E), the Addressing Mode Register (I), the Interrupt Enable Register (IEN), the Flag Input Register (FGI), and the Flag Output Register (FGO). These are individually controlled as will be detailed later.

3. A simple instruction format, as shown below. By dividing the instruction word into three fields allows the choice of three basic references: Memory, Register, and Input/Output. The memory instruction format can be subdivided into Direct and Indirect through the I-bit. The register instruction format makes use of the 12-least significant bits of the instruction word. Without decoding, these can address 11 register instructions and the Halt instruction. No memory reference is used here. The input/output addressing mode utilizes only part of the available code in the 12-least significant bits when the 4-most significant bits are all 1’s. With three bit IR13-14 we are allowed to address 7-memory reference instructions. An instruction decoder is used to produce 8 control signals D0, D1, ..., D7. The last signal determines whether the instruction is memory reference when D7 = 0 or non-memory reference when D7 = 1. However, these can also be in a direct or indirect mode depending on the status of the IR3 bit which is the I-bit. The instruction set, though looks too few, but they are sufficient to satisfy the general instruction-set-completeness requirements, that requires minimum set of instructions in each of the following categories:

4. Arithmetic (ADD, INC, CLA), Logical (AND, CMA, CME, CLE), and Shift instructions (CIR, CIL). The three arithmetic instructions are sufficient to address all necessary arithmetic operations. For example, subtraction can be achieved by 2’s arithmetic through 1’s complementing AC (CMA), and incrementing (INC). Multiplication and Division is achieved through successive additions and subtractions. Logical operations are likewise achieved through the Boolean two primitives AND, and any of the two instructions CMA, CME.
Arithmetic and logical shift operations are achieved through the two instructions CIR or CIL and proper choice of the flip flop. Signed or unsigned arithmetic operations are delegated to a software task by testing and controlling the most significant bit of the accumulator.

- **Memory Reference Instructions.** Moving information to and from memory and processor registers can be achieved through the instructions AND, ADD, LDA, and STA.
- **Program Control Instructions** BUN, BSA, ISZ, SPA, SNA, and SZA, as well as processor status check instructions SZE, SKI, SKO, ION, and IOF.
- **Input and Output instructions** INP and OUT.

![Instruction format diagram](image)

**Fig. 2 Instruction format.**

5. **Control Unit Design.** This step represents the bulk and the heart of the CPU design. In order to be able to address the complex structure and timing constraints associated with the control unit a sequential design approach must be attempted. This is due to the fact that unlike the ALU, which is essentially a combinational circuit, most of the operations in the control unit are basically sequential because they deal with data in or out of the registers at the transitions of the CPU clock. Using the Register Transfer Language can describe more easily the sequential operations entailed. The normal Fetch, Decode, and Execute phases can also be addressed quite effectively with this simple language. The use of this language can transform the assembly code instructions and operations down to the microprogrammable level in accordance with the transition of the master clock. Since each instruction has to pass through the above three phases in Table 1, and in order to restrict the operations of each instructions within specified time windows, an instruction sequence counter (SC) is used that will provide a number of timing pulses \( T_0, T_1, ..., T_9 \), Fig 3, depending on phase and the instruction being processed.

“Proceedings of the 2005 American Society for Engineering Education Annual Conference & Exposition
Copyright ©2005, American Society for Engineering Education”
Table 1 Control Functions and Micro Operations for the Basic Computer.

**Fetch**

\[ RT_i: \ AR \leftarrow PC \]
\[ RT_j: \ IR \leftarrow M[AR], \ PC \leftarrow PC + 1 \]

**Decode**

\[ RT_k: \ D_0, \ldots, D_7 \leftarrow \text{Decode IR}(12 \ldots 14), \]
\[ AR \leftarrow IR(0 \ldots 11), \ I \leftarrow IR(15) \]

**Indirect**

\[ D_{IT}: \ AR \leftarrow M[AR] \]

**Interrupt**

\[ T, T_T (IEN)(FGI + FGO): \ R \leftarrow 1 \]
\[ RT_x, \ AR \leftarrow 0, \ TR \leftarrow PC \]
\[ RT_y: \ M[AR] \leftarrow TR, \ PC \leftarrow 0 \]
\[ RT_z: \ PC \leftarrow PC + 1, \ IEN \leftarrow 0, \ R \leftarrow 0, \ SC \leftarrow 0 \]

**Memory Reference:**

**AND**

\[ D_{IT}_x: \ Dr \leftarrow M[AR] \]
\[ D_{IT}_y: \ AC \leftarrow AC \land DR, \ SC \leftarrow 0 \]

**ADD**

\[ D_{IT}_z: \ DR \leftarrow M[AR] \]
\[ D_{IT}_a: \ AC \leftarrow AC + DR, \ E \leftarrow C_m, \ SC \leftarrow 0 \]

**LDA**

\[ D_{IT}_b: \ DR \leftarrow M[AR] \]
\[ D_{IT}_c: \ AC \leftarrow DR, \ SC \leftarrow 0 \]

**STA**

\[ D_{IT}_d: \ M[AR] \leftarrow AC, \ SC \leftarrow 0 \]

**BUN**

\[ D_{IT}_e: \ PC \leftarrow AR, \ SC \leftarrow 0 \]

**BSA**

\[ D_{IT}_f: \ M[AR] \leftarrow PC, \ AR \leftarrow AR + 1 \]
\[ D_{IT}_g: \ PC \leftarrow AR, \ SC \leftarrow 0 \]

**ISZ**

\[ D_{IT}_h: \ DR \leftarrow M[AR] \]
\[ D_{IT}_i: \ DR \leftarrow DR + 1 \]

**Register-Reference:**

\[ D_{IT}_j = r \ (\text{common to all register – reference instructions}) \]
\[ IR(i) = B_i \ (i = 0, 1, 2, \ldots, 11) \]
\[ r = SC \leftarrow 0 \]

**CLA**

\[ rB_1: \ AC \leftarrow 0 \]

**CLE**

\[ rB_2: \ E \leftarrow 0 \]

**CMA**

\[ rB_3: \ AC \leftarrow \overline{AC} \]

**CME**

\[ rB_4: \ E \leftarrow \overline{E} \]

**CIR**

\[ rB_5: \ AC \leftarrow shrAC, \ AC(15) \leftarrow E, \ E \leftarrow AC(0) \]

**CIL**

\[ rB_6: \ AC \leftarrow shlAC, \ AC(0) \leftarrow E, \ E \leftarrow AC(15) \]

**INC**

\[ rB_7: \ AC \leftarrow AC + 1 \]

**SPA**

\[ rB_8: \ \text{If} \ (AC(15) = 0) \ \text{then} \ (PC \leftarrow PC + 1) \]

**SNA**

\[ rB_9: \ \text{If} \ (AC(15) = 1) \ \text{then} \ (PC \leftarrow PC + 1) \]

**SZA**

\[ rB_{10}: \ \text{If} \ (AC = 0) \ \text{then} \ (PC \leftarrow PC + 1) \]

**SZE**

\[ rB_{11}: \ \text{If} \ (E = 0) \ \text{then} \ (PC \leftarrow PC + 1) \]

**HLT**

\[ rB_{12}: \ S = 0 \]

**Input-Output:**

\[ D_{IT}_k = p \ (\text{common to all input – output instructions}) \]
\[ IR(i) = B_i \ (i = 6, 7, 8, 9, 10, 11) \]
\[ p = SC \leftarrow 0 \]

**INP**

\[ pB_{13}: \ AC(0 \ldots 7) \leftarrow INPR, \ FGI \leftarrow 0 \]

**OUT**

\[ pB_{14}: \ OUTR \leftarrow AC(0 \ldots 7), \ FGO \leftarrow 0 \]

**SKI**

\[ pB_{15}: \ \text{If} \ (FGI = 1) \ \text{then} \ (PC \leftarrow PC + 1) \]

**SKO**

\[ pB_{16}: \ \text{If} \ (FGO = 1) \ \text{then} \ (PC \leftarrow PC + 1) \]

**ION**

\[ pB_{17}: \ IEN \leftarrow 1 \]

**IOF**

\[ pB_{18}: \ IEN \leftarrow 0 \]
Fig. 3 T timing signals.

Each instruction will be translated into the microprogrammable level through the register transfer language. Table 1 describes all the operations needed for each instruction for the whole processor. Hence this table can be considered as the blueprint for this processor. Once this table is formed the rest of the design is simply logical equation formation and implementation. Associated with this table is the internal CPU architecture. As depicted in Fig. 4, the bus system has a separate control unit.

Fig. 4 Basic computer registers connected to a common bus.
Out of three approaches for the bus design, namely Open Collector, Tri State, or Multiplexer, the latter is adopted. Any of the others can be equally selected for the design without undue degradation. At any time only one device can assume control of the bus system. Hence an encoder system with three select signals will ensure that any device can control the bus during any clock cycle, Fig. 5. For example when \( S_2 S_1 S_0 = 111 \) the memory is being read and therefore the memory will assume control of the bus.

The data in the bus will be transferred to the destination register at the correct clock transition when the particular load signal is active. Several steps are also required to facilitate easy hardware implementation. For example, seven registers share the Load control signal (LD) feature, five registers share the Increment control signal (INR) feature, and six registers share the Clear control signal (CLR) feature. Hence it will be beneficial if all the three control signals are embedded in one 16-bit register model. Enabling / disabling these signals can represent any of the eight registers. Fig. 6 depicts the implementation of 1-bit of this register using Altera MAX+PLUS II FPGA/CPLD CAD\(^9\) schematic approach.

---

Footnotes:

1. Altera MAX+PLUS II FPGA/CPLD CAD\(^9\) schematic approach.
As an example, if the IR register was selected by the instruction during the fetch cycle, then the memory contents in the bus system should be loaded to this register when the LD of the register is active. Since the fetch cycle is common to all instructions, then we can see from Table 1 that two timing signals are required for this phase, \( T_0 \) and \( T_1 \). During \( T_0 \) the load signal \( LD \) of the AR register, i.e. \( LD_{AR} \), has to be active. The logic equation for this signal is derived by scanning Table 1 at the fetch cycle and during \( T_0 \). This results in the following equation: \( LD_{AR} = \overline{R}T_0 \). However, \( LD_{AR} \) can also be asserted during \( T_1 \) or even \( T_3 \) if we have indirect reference to the memory. Hence the total logic equation for \( LD_{AR} \) would be: \( LD_{AR} = \overline{R}T_0 + \overline{R}T_1 + \overline{D}IT_3 \). In the same way the other two signals can be derived. The \( INR_{AR} \) is asserted during the execution phase of the BSA instruction, thus: \( INR_{AR} = D_7T_1 \) and the \( CLR_{AR} \) signal is asserted during interrupt only: \( CLR_{AR} = RT_0 \). Similarly, we can scan for the AC:

\[
LD_{AC} = D_0T_5 + D_1T_5 + D_2T_3 + (D_4IT_3 + IR_1) + \overline{D}_1\overline{R}_7T_4 + IR_0T_4 + IR_3 + IR_5 + IR_11 \\
\]

but

\[
IN_{AC} = D_7\overline{R}T_5 + IR_4 \quad \text{and} \quad CLR_{AC} = D_7RT_5 + IR_11.
\]

The memory read signal can be derived by scanning Table 1 for the following RTL: \( \overline{M}[\overline{AR}] \) to obtain:

\[
\text{Read} = \overline{R}T_1 + \overline{D}_7IT_3 + (D_0 + D_1 + D_2 + D_4)T_4. \quad \text{The Boolean equation for the memory write signal is obtained in a similar way but scanning for} \quad \overline{M}[AR] \rightarrow (\text{any register})\quad \text{instead. On the other hand, the bus control logic is derived from Tables 1 and Table 2.}
\]

Table 2 Encoder for Bus Selection Circuit.

<table>
<thead>
<tr>
<th>Inputs</th>
<th>Outputs</th>
<th>Register selected for bus</th>
</tr>
</thead>
<tbody>
<tr>
<td>( x_1 )</td>
<td>( x_2 )</td>
<td>( x_3 )</td>
</tr>
<tr>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>1</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>0</td>
<td>1</td>
<td>0</td>
</tr>
<tr>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
</tbody>
</table>

One of the following control signals \( x_1, x_2, ..., x_7 \) are assigned to each device that may control the bus. For example the memory would require \( x_7 \) to be asserted in order to carry out a read operation. The logic equation for this would be:

\[
x_7 = \overline{R}T_1 + \overline{D}_7IT_3 + (D_0 + D_1 + D_2 + D_4)T_4. \quad \text{The others are similarly derived. The ALU is designed by embedding Full Adders with logic between the DR and the AC, as shown in Fig. 7.}
\]
As the design was progressing higher hierarchies were developed. For example, Fig. 8 depicts the control logic for the PC Register. This logic was then modularized into a symbol, Fig. 9. This module was then placed with all the other control modules associated with the other registers and timing signals, Fig. 10, and modularized again into a master Control Module as shown in Fig. 11. This hierarchal approach continues into Fig. 12 to include the Control Module, Register Module, memory unit, S flip-flop, and the bus multiplexer, and then finalized in Fig. 13, the top level module representing a complete CPU.
Fig. 8 Control logic for the PC Register.

Fig. 9 Control Module for the PC Register.
Fig. 10 Flip-flops, timing signals, and a module for each register’s control signals.

Fig. 11 Control Module.

“Proceedings of the 2005 American Society for Engineering Education Annual Conference & Exposition
Copyright ©2005, American Society for Engineering Education”
Fig. 12 Control Module, Register Module, memory unit, S flip-flop, and the bus multiplexer.

Fig. 13 Complete CPU Module.
The design was tested in two ways. First, it was simulated, Fig. 14, using MAX+Plus II software with a small program listed in Table 3. After a successful simulation, the design was downloaded onto the Altera UP2 board, Fig. 15, and tested with the same program listed in Table 3. The LED’s and 7-segment displays on the UP2 were utilized to observe the testing.

Fig. 14 Simulation of program in Table 3.

Fig. 15 Altera UP2 board
As depicted in Fig. 14, the program starts when System_Start is pulsed, which sets the S flip-flop. The PC Register is transferred to the AR Register when \( T_0 = 1 \). When \( T_1 = 1 \) the instruction in memory pointed to by AR is loaded into the IR Register to be decoded, and the PC Register was incremented to 001h in preparation for the next time \( T_0 = 1 \). The PC Register started at 000h, when loaded into AR pointed to the first location in memory. At that location is the opcode 2004h; load AC with the operand, 0053h, located at address 004h. The next instruction adds a second operand from memory at location 005h to the first operand already in AC. With the next instruction the result, 003Ch, is then stored in memory at location 006h. The write signal \( \text{we} \) can be clearly seen while IR contains 3006h, AR point to 006h, and AC contains the sum 003Ch. The last instruction is HALT, 7001h, which clears the S flip-flop and halts the CPU.

Table 3 Assembly Language program to add two numbers

<table>
<thead>
<tr>
<th>Location</th>
<th>Label</th>
<th>Instruction</th>
<th>Opcode</th>
<th>Comments</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>Org 0</td>
<td>LDA A</td>
<td>2004</td>
<td>/Load operand from location A</td>
</tr>
<tr>
<td>A</td>
<td></td>
<td>ADD B</td>
<td>1005</td>
<td>/Add operand from location B</td>
</tr>
<tr>
<td>B</td>
<td></td>
<td>STA C</td>
<td>3006</td>
<td>/Store sum in location C</td>
</tr>
<tr>
<td>003</td>
<td></td>
<td>HLT</td>
<td>7001</td>
<td>/Halt computer</td>
</tr>
<tr>
<td>004</td>
<td>A,</td>
<td>DEC 83</td>
<td>0053</td>
<td>/Decimal operand</td>
</tr>
<tr>
<td>005</td>
<td>B,</td>
<td>DEC -23</td>
<td>FFE9</td>
<td>/Decimal operand</td>
</tr>
<tr>
<td>006</td>
<td>C,</td>
<td>DEC 0</td>
<td>0000</td>
<td>/Sum stored in location C</td>
</tr>
<tr>
<td>END</td>
<td></td>
<td></td>
<td></td>
<td>/End of symbolic program</td>
</tr>
</tbody>
</table>

Conclusions

We have presented a new 5-step methodology for hardware CPU design. The computer was first defined with internal registers, bus and memory size. The instruction set was selected with minimum number of instructions and simple instruction addressing formats but complies with instruction set completeness requirements. In order to move to the lower microprogrammable level, where signals can be generated, a micro instruction table was created. The table detailed all the actions taken for each instruction and for the whole instruction set. This was possible through utilization of the Register Transfer Language constructs. This language was used since it transforms into hardware details the register operations inside the CPU that is essentially sequential in nature. The table also explains the usual three phases encountered by each instruction. The fourth step in the methodology was reduced to a simple and repetitive process of scanning the micro instruction table for each register and instruction actions. A general model for the registers was designed such that all the possible operations were confined to the assertion/de-assertion of three control signals, the load signal LD, the increment signal
INC and the clear signal CLR. Examples of logic equations formation were illustrated. The control unit was formed from the aggregate of all the logic equations for all the registers and their control signals. The complexity of the hardware was further reduced to a hierarchical design using the industry standard Altera MAX+Plus II software design environment. The design was downloaded to an FPGA/CPLD platform. The platform was a prototyping board UP2 developed by Altera Inc. The design was tested through simulation runs of a representative program. The CPU hardware was also tested successfully for the same program on the UP2 board with 7-segment displays and LEDs representing inputs and outputs. The entire design utilized about 13% of the available Logic Cells (LC) count that amount to about 70,000 gates. The rest of the LCs were available to use to enhance the design with more logic designs or peripherals without having to add additional external logic. The design can be considered a very successful embedded microprocessor soft core design or System on a Chip (SoC). Of course much denser FPGAs are now available where more powerful soft microprocessors can be implemented. The 5-step methodology introduced can be used for more complex designs. The main emphasis again is on the creation of the micro instruction table. Hence, there are no restrictions as to the size or number of registers used, or the size of the memory.

References:


K. SALMAN is a Professor in the department of ETIS at Middle Tennessee State University. He holds Ph.D. in electrical engineering from Brunel University, UK, and a member of IEEE. His main line of research is in embedded microprocessors and ciphering.
MICHAEL B. ANDERTON is a Graduate Assistant at Middle Tennessee State University where he received his B.S. in Engineering Technology. His undergraduate senior project involved work with the Residue Number System. His graduate research interest includes System-on-chip (SoC) technology.