Charles Gamiz, Justin Romberg, Wayne Vuong
Electrical & Computer Engineering Department
Rice University


The jwC Mini-Microprocessor is a two-phase clocked, 4-bit, accumulator-based microprocessor which interfaces to an 8-bit addressed external memory. Its functions include: addition and subtraction, logical operations, load immediate, load from and store to memory, comparisons, and branches.

In general, we chose to design the microprocessor to be as simple and minimal as possible, leaving more complex functionality to the compiler. Therefore, we chose an accumulator-based design with one general purpose register. We believe that our design was an excellent tradeoff between simplicity and performance, with the emphasis on simplicity.


(NOTE: The content of this webpage is drawn from our report and is intended to be a descriptive source that emphasizes our design approach and decisions. For more detailed information, including additional layout diagrams and test plots, access to the full original hardcopy report is necessary.)


Instruction Set

Our instruction set is simple yet comprehensive. Since our data bus (see Datapath) is only 4 bits wide, a one-nibble opcode can encode at most 16 instructions, so we kept the instruction set to 16 for easier implementation. The following is our detailed instruction set:

Opcode  Mnemonic  Description

Arithmetic (ALU ops)
1000    add       add, A = A + B, C = overflow
1001    sub       subtract, A = A - B, C = underflow
1111    shr       shift A right, MSB of A = LSB of B, S = LSB of A
1100    and       logical AND, A = B & A (bitwise)
1101    or        logical OR, A = B | A (bitwise)
1110    not       logical NOT, A = !A (bitwise)
1010    gt        greater than, Z = (B > A) (boolean)
1011    eq        equal to, Z = (B == A) (boolean)

Program Flow
0001    bz        branch if comparison flag (Z) is set
0010    bc        branch if carry flag (C) is set
0011    bs        branch if shift overflow flag (S) is set

Data
0101    lma       load accumulator (with data at given memory address)
0110    sma       store accumulator (into given memory address)
0100    lva       load accumulator (with given immediate)
0111    mov       move accumulator to register B (B = A)

Other
0000    nop       null operation

Note that we use variable-length instructions depending on whether the instruction requires a memory address or immediate value. Specifically, lma, sma, bz, bc, and bs are all 3-nibble instructions (1 nibble for the opcode and 2 nibbles for an 8-bit memory address), lva is a 2-nibble instruction (1 nibble for the opcode and 1 nibble for a 4-bit immediate value), and all other instructions are 1-nibble instructions containing only the opcode. Handling of variable-length instructions is described in detail in the Control Unit section.
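The length rules above can be sketched as a small decoder. This is an illustrative Python sketch and not part of the original design files: the names OPCODES and disassemble are hypothetical, and the ordering of the two address nibbles (high nibble first) is an assumption taken from the order in which the Control Unit section loads ADRH and ADRL.

```python
# Opcode table transcribed from the instruction set above:
# opcode nibble -> (mnemonic, total instruction length in nibbles).
OPCODES = {
    0b0000: ("nop", 1), 0b0001: ("bz", 3), 0b0010: ("bc", 3), 0b0011: ("bs", 3),
    0b0100: ("lva", 2), 0b0101: ("lma", 3), 0b0110: ("sma", 3), 0b0111: ("mov", 1),
    0b1000: ("add", 1), 0b1001: ("sub", 1), 0b1010: ("gt", 1), 0b1011: ("eq", 1),
    0b1100: ("and", 1), 0b1101: ("or", 1), 0b1110: ("not", 1), 0b1111: ("shr", 1),
}


def disassemble(nibbles):
    """Split a flat list of 4-bit values into (mnemonic, operand) pairs."""
    out, i = [], 0
    while i < len(nibbles):
        name, length = OPCODES[nibbles[i]]
        operand_nibbles = nibbles[i + 1:i + length]
        if len(operand_nibbles) != length - 1:
            raise ValueError(f"truncated {name} at nibble {i}")
        operand = None
        if length > 1:
            operand = 0
            for n in operand_nibbles:  # high nibble first (assumed ordering)
                operand = (operand << 4) | n
        out.append((name, operand))
        i += length
    return out
```

For example, the nibble stream 4 3 7 4 5 8 6 2 0 decodes as lva 3, mov, lva 5, add, sma 0x20.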

Datapath

Our basic architecture to support the instruction set is outlined in the datapath figure. The datapath supports a 4-bit data bus and an 8-bit address bus. There are two data registers, A and B, which are the accumulator and general purpose register, respectively. (An additional data register, ALUin, mirrors the contents of A and sits in front of the ALU for the sole purpose of buffering the accumulator to avoid a direct feedback loop through the ALU. This is accomplished by having registers A and ALUin latch on different clocks, as discussed in the Registers and Other Components section.) All ALU operations take A (through ALUin) and B as inputs, and write the results to A. Register A can be loaded through the load from memory (lma) and load value (lva) instructions and can be written to memory through sma. Register B can only be loaded with the contents of A through a mov instruction.

The ALU also maintains a register of flags or condition codes. The first flag is the comparison flag (Z). Whenever the ALU performs a comparison (gt or eq), it sets the comparison flag with the result. bz (branch on comparison) checks this flag to decide whether to perform the jump to the specified address. The next flag is the add overflow, or carry flag (C). When an add instruction produces a carry-out signaling overflow, or when a sub instruction produces an underflow signaling a negative result, this flag is set. bc (branch carry) checks this flag and branches conditionally. The third flag is the shift overflow flag (S). This flag contains the shifted-out bit on a shift right operation (the LSB of A before the shr instruction). bs (branch shift) checks this flag before jumping.

The last four registers are the program counter (PC), the instruction register (IR), a high address register (ADRH) for the upper nibble, and a low address register (ADRL) for the lower nibble. The PC holds the address of the next instruction to execute or the location of the remaining instruction nibbles (for the address or immediate) for multi-nibble instructions. The program counter is auto-incremented for simplicity and overwritten as necessary on a branch. Note that the PC is an 8-bit (2-nibble) register. The instruction register (IR) holds the opcode for the instruction that is currently being executed. This opcode is sent to the control unit, which sends out signals to the rest of the datapath. ADRH and ADRL are only used when executing the load and store to memory and the branch instructions (lma, sma, bz, bc, bs). The (8-bit) address is loaded into these registers whenever we want to access a memory location.
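The register and flag behavior described in this section can be illustrated with an instruction-level model. The following Python sketch is only a behavioral approximation under stated assumptions, not a description of the actual hardware timing: the function name run, the use of a 256-entry list as the external memory, execution starting at address 0, the high-nibble-first address order, and the policy that each flag is simply overwritten by the instructions that produce it are all assumptions. Because the instruction set has no halt, the model runs for a fixed number of instructions.

```python
def run(mem, steps):
    """Execute `steps` instructions, starting at address 0, on a 256-nibble memory."""
    a = b = pc = 0
    z = c = s = False  # comparison, carry, and shift flags

    def fetch():
        nonlocal pc
        nib = mem[pc] & 0xF
        pc = (pc + 1) & 0xFF  # PC auto-increments after every nibble fetched
        return nib

    for _ in range(steps):
        op = fetch()
        if op in (0x1, 0x2, 0x3, 0x5, 0x6):  # bz, bc, bs, lma, sma
            adrh = fetch()                   # upper address nibble
            adrl = fetch()                   # lower address nibble
            addr = (adrh << 4) | adrl
            if op == 0x5:                    # lma: A = mem[addr]
                a = mem[addr] & 0xF
            elif op == 0x6:                  # sma: mem[addr] = A
                mem[addr] = a
            elif (z, c, s)[op - 1]:          # branch taken only if its flag is set
                pc = addr
        elif op == 0x4:                      # lva: A = immediate
            a = fetch()
        elif op == 0x7:                      # mov: B = A
            b = a
        elif op == 0x8:                      # add: C = carry-out
            total = a + b
            a, c = total & 0xF, total > 0xF
        elif op == 0x9:                      # sub: C = underflow (A < B)
            c = a < b
            a = (a - b) & 0xF
        elif op == 0xA:                      # gt: Z = (B > A)
            z = b > a
        elif op == 0xB:                      # eq: Z = (B == A)
            z = b == a
        elif op == 0xC:                      # and
            a &= b
        elif op == 0xD:                      # or
            a |= b
        elif op == 0xE:                      # not
            a = ~a & 0xF
        elif op == 0xF:                      # shr: S = old LSB of A, new MSB = LSB of B
            s = bool(a & 1)
            a = ((b & 1) << 3) | (a >> 1)
        # op == 0x0 is nop: nothing to do
    return {"A": a, "B": b, "PC": pc, "Z": z, "C": c, "S": s}
```

Running five instructions of the program lva 3, mov, lva 5, add, sma 0x20 leaves 8 in memory location 0x20, which is also how results become visible on the data bus of the real chip.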

Control Unit

The state machine that controls the jwC consists of 8 states, as outlined in the control unit figure. There are 6 processing states, a reset state (that the controller goes to when the restart signal is asserted), and an error state (that the controller goes to when the contents of the IR are changed during an instruction; ideally, this should never happen). Our finite state machine performs a memory access in every processing state. The outputs of the control unit are VbSa. This means that the control unit changes state on clock b (φb), and the registers (A, B, IR, etc.) are latched on clock a (φa). The outputs to the address bus are also VbSa, and we assume that the data on the data bus will be Va on the same cycle.

For a 1-nibble instruction, the opcode is fetched from memory into the IR and decoded, and then appropriate latching and/or multiplexing signals are asserted by the control unit to complete the instruction, all within a single getMemPCIR state. No more memory accesses are required for this instruction, so we return to the same getMemPCIR state on the next state transition to begin the next instruction fetch, at which time the results of this current instruction become latched.

Some instructions, however, are longer than a single nibble, as mentioned previously. The load immediate (lva) instruction consists of two nibbles: one for the opcode and one for the 4-bit immediate value. This instruction requires two memory accesses and therefore two states: one to load the opcode into IR and one to load the value from memory into A. The branch instructions (bz, bs, and bc) are 3 nibbles long: one for the opcode and two more for the 8-bit target address. The branch instructions thus take 3 memory accesses (and 3 states) to execute: one to load the opcode into IR, one to load the high part of the target address into ADRH, and one to load the low part of the target address into ADRL and decide whether or not to load the PC with the target address based on the flags. The final multi-nibble instructions, loading from memory into A (lma) and storing to memory out of A (sma), both require 4 states to execute. They need the same initial three states as the branch instructions (loading the opcode and the address) but also require a fourth state to perform the read or write to memory.
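The state counts above (1 for single-nibble instructions, 2 for lva, 3 for branches, 4 for lma and sma) can be tallied for a program. The sketch below is illustrative only: the names NIBBLES, EXTRA_STATES, and processing_states are hypothetical, the program is walked linearly without following branches, and the reset and error states are not counted.

```python
# Instruction length in nibbles, keyed by opcode; unlisted opcodes are 1 nibble.
NIBBLES = {0x1: 3, 0x2: 3, 0x3: 3, 0x4: 2, 0x5: 3, 0x6: 3}
# lma and sma need one extra state for the data read or write itself.
EXTRA_STATES = {0x5: 1, 0x6: 1}


def processing_states(program):
    """Count processing states for a nibble sequence, walked in address order."""
    total = i = 0
    while i < len(program):
        op = program[i]
        length = NIBBLES.get(op, 1)
        total += length + EXTRA_STATES.get(op, 0)  # one state per memory access
        i += length
    return total
```

For lva 3, mov, lva 5, add, sma 0x20 this gives 2 + 1 + 2 + 1 + 4 = 10 processing states.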

Arithmetic and Logical Unit (ALU)

The ALU is capable of performing eight different operations: addition, subtraction, shift right, logical bitwise AND, logical bitwise OR, logical bitwise NOT, greater-than comparison, and equal-to comparison. Subtraction is performed as A - B, not B - A, while greater-than comparison is performed as B > A. Logical bitwise NOT operates on A alone, and shift right operates on A, taking its new MSB from the LSB of B. The other ALU operations are on A and B and are symmetric (see instruction set above).

Our 4-bit adder/subtractor is built from 1-bit adders that are cascaded through their carry-in and carry-out. This scheme is known as ripple-carry addition. Each 1-bit adder with carry-in and carry-out involves 3 AND gates and 2 XOR gates. We determined that the best approach was to implement the AND gates (which determine the carry-out) at the transistor level. Originally, we built the XOR gates from a specialized design using transmission gates and inverters. These devices are smaller and faster, but are subject to switch-level simulation failures. After many simulation problems, we abandoned them in favor of gate-level XOR units. Although this increased the size and required inverted input signals, we believe that it was worth the effort. In the end, we had a very compact design that lent itself well to the design principles of alternate mirroring and Vdd/GND routing.

We also found that we could implement subtraction by merely inverting B, setting carry-in to 1, and adding as normal. This method requires that Bbar be available, plus a multiplexer to select between B for addition and Bbar for subtraction. Despite this additional hardware, it appeared to be the simplest and least hardware-intensive approach.

The chain of multiplexers at the input of the 4-bit adder selects the complement of B for subtraction. Carry-in is tied to the same signal that is used to indicate a subtraction, so that it is conveniently zero for addition and one for subtraction. We also added buffers to each of the B and Bbar input signals, to compensate for high fan-out and low drive strength.
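The adder/subtractor scheme described above can be checked with a bit-level model. This Python sketch is illustrative only; the names full_adder and add_sub are hypothetical, and the B/Bbar multiplexer is modeled as an XOR with the subtract signal, which is functionally equivalent but is not claimed to match the actual layout.

```python
def full_adder(a, b, cin):
    """1-bit adder: sum from two XORs, carry-out from three ANDs ORed together."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout


def add_sub(a, b, subtract):
    """4-bit ripple A + B (subtract=0) or A - B (subtract=1).

    Returns (result, carry_out).
    """
    carry = subtract  # carry-in is tied to the subtract signal
    result = 0
    for i in range(4):
        a_i = (a >> i) & 1
        b_i = ((b >> i) & 1) ^ subtract  # selects B for add, Bbar for subtract
        s, carry = full_adder(a_i, b_i, carry)
        result |= s << i
    return result, carry
```

Note that on subtraction the carry-out is 1 when A >= B and 0 when A < B, which is the property the comparison operations rely on.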

The rest of the ALU operations are fairly straightforward. The shift operation is merely a feed-through, matching the appropriate bits of A. Bitwise logical functions are implemented with their respective gates. The greater-than comparison is done by implementing a subtraction and then setting the comparison flag to be the inverse of the subtraction's resulting carry-out (since carry-out is zero on subtraction when B > A). The equal comparison is also implemented through subtraction. We test the subtraction result against zero with a 4-input NOR gate and set the comparison flag to the NOR output. This is a simpler technique than using a dedicated equality comparator.
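The way both comparisons fall out of a single subtraction can be verified exhaustively. The following is a minimal Python sketch under the assumption that the subtraction is computed as A + Bbar + 1; the function name compare is hypothetical.

```python
def compare(a, b):
    """Return (gt, eq) flag values for 4-bit A and B, derived from A - B."""
    total = a + (~b & 0xF) + 1   # A - B computed as A + Bbar + 1
    diff = total & 0xF           # 4-bit subtraction result
    carry_out = total >> 4       # 1 when A >= B, 0 when A < B
    gt = carry_out == 0          # Z for gt: inverse of carry-out, i.e. B > A
    eq = diff == 0               # Z for eq: 4-input NOR of the result bits
    return gt, eq
```

For example, compare(3, 5) yields (True, False) because 3 - 5 produces no carry-out, and compare(6, 6) yields (False, True) because the difference is zero.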

Program Counter (PC)

Reliable instruction sequencing is crucial to a quality microprocessor. Our program counter is the largest sub-block and second only to the control unit in complexity. It has an eight-bit register in a master-slave configuration and performs only two functions: incrementing and loading. For most instructions, the PC is simply incremented in preparation for the following instruction or following instruction nibbles. The exceptions are the branches, in which the PC is conditionally loaded with a value from ADRL and ADRH.

The inputs to the PC are AX[0:7] (the values to be loaded on a branch), PCpp (the increment signal), LatchPC (the load signal), and the two clocks. The incrementing function requires a good deal of logic, which was implemented with compound gates as often as possible. The master latch is latched with a φa-qualified signal (PCpp OR LatchPC), and the slave latch is always latched with φb. The counter determines its next values from its current ones, so the master-slave design was needed to avoid feedback and oscillations. The PC also needs to be able to reset itself to zero on startup. The most straightforward way to implement this is to load the program counter with zeros. Four 4-input multiplexers are added for this purpose.
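The role of the master-slave arrangement can be illustrated with a behavioral model. This Python sketch is a simplification under stated assumptions: the class and method names are hypothetical, the reset is modeled as a separate flag that forces a load of zeros, and the priority among reset, load, and increment is assumed rather than taken from the design.

```python
class ProgramCounter:
    """Two-phase, master-slave model of the 8-bit program counter."""

    def __init__(self):
        self.master = 0
        self.slave = 0  # the value visible to the rest of the datapath

    def clock_a(self, pcpp=False, latch_pc=False, ax=0, reset=False):
        """Master latches on clock a, and only when a qualifying signal is asserted."""
        if reset:
            self.master = 0                        # load zeros on restart
        elif latch_pc:
            self.master = ax & 0xFF                # branch: load AX[0:7]
        elif pcpp:
            self.master = (self.slave + 1) & 0xFF  # increment from the slave value

    def clock_b(self):
        """Slave always latches on clock b, copying the master."""
        self.slave = self.master
        return self.slave
```

Because the next value is computed from the slave while only the master is open, the increment never feeds back into itself within a single clock phase.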

Registers and Other Components

We put special effort into designing a robust and compact register. The generic 4-bit level-sensitive latch was modified into a clock-qualified 4-bit latch. This latch is used 6 times, plus one additional time with a slight modification, so layout optimization was a worthwhile investment of time. The IR, ADRH, ADRL, A, B, and ALUin all use the same register sub-block. They are qualified with φa, with the exception of ALUin, which is qualified with φb. The flags register is a collection of 3 one-bit registers, each with its own φa-qualified latch signal.

In addition to the main sub-blocks (control, PC, ALU, and registers), there are several more components that are required. There are two 8-bit transmission gates between the address bus and both the PC and the ADRX registers that select between the ADRX registers and the PC for control of the address bus. An 8-to-4 multiplexer is also required in front of register A, to select between the output of the ALU and the data bus.

Input / Output (I/O)

The padframe is supplied to us by MOSIS as 2-micron SCN (N-well), referred to as the "TinyChip Pads Set". The corners of the padframe (#5 & #15 - Vdd, #25 & #35 - GND) contain power pads for use by the padframe itself. In addition, the padframe includes a Vdd pad (#10) and GND pad (#30) for use within the circuit. The remainder of the pads are for I/O connections. Below is a "complete" pin list:

Pin #  Type          Name          Description
3      Out           MemReadOut    Read signal for external memory
4      Out           MemEnableOut  Enable signal for external memory
5      (pad power)   (Vdd)         (Pad Vdd)
6      Out           ErrorOut      Error signal
8      In            Clkbin        Clock B (φb)
9      In            Clkain        Clock A (φa)
10     (chip power)  GND           Internal GND
11     Out           Q0out         State bit 0
12     Out           Q1out         State bit 1
13     Out           Q2out         State bit 2
15     (pad power)   (Vdd)         (Pad Vdd)
16     In            RestartIn     Restart signal
17     Out           ABUS7         Address bus bit 7
18     Out           ABUS6         Address bus bit 6
19     Out           ABUS5         Address bus bit 5
20     Out           ABUS4         Address bus bit 4
21     Out           ABUS3         Address bus bit 3
22     Out           ABUS2         Address bus bit 2
23     Out           ABUS1         Address bus bit 1
24     Out           ABUS0         Address bus bit 0
25     (pad power)   (GND)         (Pad GND)
30     (chip power)  Vdd           Internal Vdd
35     (pad power)   (GND)         (Pad GND)
36     In/Out        Data_Bus0     Data bus bit 0
37     In/Out        Data_Bus1     Data bus bit 1
38     In/Out        Data_Bus2     Data bus bit 2
39     In/Out        Data_Bus3     Data bus bit 3
40     Out           MemWriteOut   Write signal for external memory
* The unlisted pin numbers are unused and are set as inputs.

The address bus pins and the MemReadOut, MemWriteOut, and MemEnableOut pins are output pins connected to the external memory for communication. The data bus pins are bi-directional and are also connected to the external memory. As input pins, they allow programs to be loaded during instruction fetch and allow data to be read from memory on a load. As output pins, they allow data to be written to memory on a store.

All clock and power signals are supplied by the chip user. RestartIn sets the system to the initial state, and ErrorOut asserts if there is a program flow error in the execution of instructions. The state bits, Q0out, Q1out, and Q2out, can be used to monitor the internal state of the control unit. This will be invaluable for troubleshooting the chip.

Floorplan and Other Design Decisions

For the actual layout, we chose to design all individual subunits as compound gate blocks instead of assembling pre-made logical gates. This meant smaller and faster devices, possibly at the expense of some layout clarity. We laid out the large sub-blocks in such a manner as to minimize routing, while preserving some of the spatial coherence of our datapath schematic. Due to superior layout techniques, design, and planning, we had ample space with which to work.

Performance and Testing

We were careful to check the functionality of each component as the project progressed. The three simulation tools at our disposal were Irsim, Crystal, and SPICE. Irsim proved the most valuable for testing the functionality of our circuits. We used Crystal to compare path lengths and to find the longest path, which determines the maximum clock frequency that our chip will support. SPICE proved useful for obtaining more detailed electrical and timing data than Irsim afforded us.

Due to the complexity of the circuits, it is not feasible to test every possible functional case. We used a "random data sample" technique to test the functions of each of the different sub-blocks until no errors were found with any particular data sample. Hopefully, the tested data samples were sufficient in finding all errors.

Once every unit is in place and wired up, it is necessary to flatten the entire layout and test the chip from pad to pad, as if the layout were the chip itself. This way, pad errors and other quirks will be revealed. Note that the only way that we could "see" our results was by implementing an sma command, which eventually places the value of register A on the data bus. This is easy to see in the waveforms as the assertion of the MemWriteOut signal.

(NOTE: Irsim test plots of various components and other test related material can be found in the hardcopy report and are not included here.)

In-Depth Performance Analysis

The two critical sub-circuits in our chip are the program counter and the arithmetic logic unit. Both of these have multiple layers of gates with many inputs. We ran Crystal on the PC and the ALU, and the ALU turned out to have a much greater maximum propagation delay (173 ns compared to 27 ns). So, we concentrated our performance analysis on the ALU.

The longest Crystal path is from the SADD input (the signal from the control unit telling the ALU that we are doing an add/subtract or comparison operation) to the flagZ output (the flag that holds the result of a comparison). The path had a delay of 173.73 ns. Note that we have to multiply the maximum delay time by 2 because we are only guaranteed a half of a clock cycle in which to do the calculation. Also, the delay found was within the ALU and does not include latch delays, etc. This resultant path length means that our maximum clocking speed is below 2.87 MHz.
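The arithmetic behind this bound can be written out explicitly. The sketch below is illustrative; the function name max_clock_mhz is hypothetical, and, as noted above, latch delays are not included, so the real limit would be somewhat lower.

```python
def max_clock_mhz(critical_path_ns, phases_per_cycle=2):
    """Upper bound on clock frequency when the path must settle within one phase."""
    period_ns = critical_path_ns * phases_per_cycle  # 173.73 ns * 2 = 347.46 ns
    return 1e3 / period_ns                           # 1000 / ns gives MHz


bound = max_clock_mhz(173.73)
print(f"{bound:.2f} MHz")
```

A minimum period of 347.46 ns corresponds to a frequency just under 2.88 MHz, in line with the figure quoted above.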

We tried to re-create in SPICE the longest path that Crystal found, but it would not converge for this case. We ran a different SPICE test on the ALU in which A started at value 1000, B started at value 0111, and then A0 was changed from 0 to 1 on an addition. This causes a ripple through all four bits (and the carry bit), changing each along the way. The time that it took from when A0 changed to when flagC (the carry bit) was stable was 80 ns. The delay between each stage in the 4-bit adder is about 15 ns. The lower value returned by SPICE leads us to suspect that Crystal may be slightly pessimistic, and our operating frequency could be higher than 2.87 MHz.
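The choice of test vector can be checked with a short logical model, which shows why flipping A0 exercises the whole carry chain. This Python sketch models logic values only, not timing; the function name ripple_trace is hypothetical.

```python
def ripple_trace(a, b):
    """Return (sum_bits, carries) for a 4-bit ripple add, both listed LSB first."""
    carry, sums, carries = 0, [], []
    for i in range(4):
        a_i, b_i = (a >> i) & 1, (b >> i) & 1
        sums.append(a_i ^ b_i ^ carry)
        carry = (a_i & b_i) | (a_i & carry) | (b_i & carry)
        carries.append(carry)
    return sums, carries


before = ripple_trace(0b1000, 0b0111)  # A0 = 0: sum is 1111, no carries
after = ripple_trace(0b1001, 0b0111)   # A0 = 1: sum is 0000, every stage carries
```

With A0 = 0 every sum bit is 1 and every carry is 0; with A0 = 1 every sum bit is 0 and every carry is 1, so each stage's outputs change in sequence, ending with the final carry-out.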

References

Assorted "man" pages and tutorials printed in the Rice University "ELEC422 - VLSI Design I" course packet, 1996.

Wakerly, John F., Digital Design: Principles and Practices. Prentice Hall Inc., 1994.

Weste, N. H., and Eshraghian, K., Principles of CMOS VLSI Design. Addison-Wesley Publishing, 1994.

Last but not least, we would like to thank Dr. Aria Nosratinia for his generous support and encouragement throughout the semester.


Created by : Charles Gamiz, Justin Romberg, and Wayne Vuong
Last Updated: 12/23/96