Abstract
Superscalar microprocessors can initiate several instructions simultaneously and execute them independently during the same clock cycle. The limiting factor is the handling of data dependencies: if they are not handled effectively, an execution rate of more than one instruction per cycle is difficult to achieve. This case study uses a multi-bit scoreboard architecture to handle data conflicts for out-of-order execution and completion of instructions. The paper analyses the performance of the superscalar microprocessor using two simulation models, which use benchmark programs, and one calculation model, which uses queuing networks to derive a formula for the degradation from peak performance caused by data dependencies.
Introduction
A single-bit scoreboard is sufficient to detect dependencies in processors with only one pipeline; it stops the flow of instructions until the conflict is cleared. A multi-bit scoreboard is used in processors that issue multiple instructions per cycle.
Here, the multi-bit scoreboard, in combination with temporary result registers, maintains the flow of instructions, and a branch prediction unit is included to achieve peak performance.
Multi-Bit Scoreboard Architecture Model
This model implements a pipelined architecture with four stages: instruction fetch, instruction decode, execution, and write-back. Instructions are fetched from the external memory or the cache memory into the instruction buffers and then transferred to the decoding units.
A set of temporary registers is used as renaming registers for instructions with output dependencies and anti-dependencies. The branch prediction unit predicts the next stream of instructions. Data needed by load/store instructions is handled by the data cache. In case of an interrupt, the retire unit restores the proper processor state; it also keeps track of the instructions in the pipe. The execution unit has several functional units, each of which handles a different class of operations: branch, load/store, integer, ALU, shifter. Instructions are executed with the help of a queue buffer.
The buffer holds instructions when more than one instruction is dispatched from the decoding units, or when the functional unit is busy executing a previous instruction.
To handle data dependencies, instructions reference the register file directly, and only load/store instructions can access external memory for data. A set of scoreboard bits in the register file indicates how each register is being used by the current instructions. READ is a multi-bit field which indicates that the register is a source operand of an instruction. WRITE is a single bit which indicates that an instruction will store its result data into the register.
TEMP is a single bit which indicates that a conflict has occurred with a prior instruction; the result data is stored in a temporary register until the conflict is over. The decoding unit checks the scoreboard for dependencies and sets the scoreboard bits accordingly while accessing the register. The algorithm for setting the status bits is as follows.
READ OPERAND: If TEMP is set, the instruction must wait in decode; it is said to have a true dependency. Else if WRITE is set, the instruction must wait in decode, and READ is also incremented for anti-dependency checking. Else READ is incremented and the instruction can be dispatched.
WRITE OPERAND: If TEMP is set, the instruction must wait in decode, since only one level of temporary register is allowed for each register. Else if WRITE or READ is set, TEMP is set, a temporary register is assigned to the instruction, and the instruction can be dispatched. Else WRITE is set and the instruction can be dispatched.
Evaluation Method
For simulation we build a model as a C program. The input is generated by the HighC29K compiler, and decoding is based on the AM29K instruction set. To rearrange instructions for better efficiency we use list scheduling and loop unrolling.
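The READ, WRITE and TEMP rules for source and destination operands can be sketched in a few lines. This is only an illustrative Python sketch, not the C simulation model itself; the names RegisterStatus, check_source and check_destination are hypothetical, and READ is modelled as an unbounded counter because the text does not give its width.

```python
from dataclasses import dataclass


@dataclass
class RegisterStatus:
    """Scoreboard bits kept for one register (hypothetical layout)."""
    read: int = 0        # READ: count of pending source-operand uses
    write: bool = False  # WRITE: an instruction will write this register
    temp: bool = False   # TEMP: result is diverted to a temporary register


def check_source(reg: RegisterStatus) -> bool:
    """Apply the READ OPERAND rules; return True if dispatch is allowed."""
    if reg.temp:
        return False     # wait in decode: true dependency
    if reg.write:
        reg.read += 1    # recorded for anti-dependency checking
        return False     # still has to wait in decode
    reg.read += 1
    return True


def check_destination(reg: RegisterStatus) -> bool:
    """Apply the WRITE OPERAND rules; return True if dispatch is allowed."""
    if reg.temp:
        return False     # only one level of temporary register per register
    if reg.write or reg.read > 0:
        reg.temp = True  # rename: result goes to a temporary register
        return True
    reg.write = True
    return True
```

A fuller model would increment READ only once per waiting instruction and clear the bits at write-back; both are left out here to keep the sketch short.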
The machine configuration includes features which can be disabled for evaluation. The machine variables are chosen so that the simulation model achieves its highest performance; they are the sizes of the caches, register file, temporary registers, decoding units, instruction buffers, and branch history table. Integer benchmark programs are used for evaluation. The results table is divided into two sections. The first section shows the integer throughput of the processor with a single-bit scoreboard; as the number of decoding units increases, the overall percentage change also increases. The average maximum execution rate is 1.98 instructions per clock cycle. The second section shows the performance of the multi-bit scoreboard, whose performance gain over the single-bit scoreboard is 29.2%.
Queuing Network Model
This model consists of a single queue of arriving customers and one or more servers, each servicing a customer in a fixed time. The decoding units and functional units form the structure of the model, with instructions as the customers. The arrival time of an instruction depends on the instruction cache hit ratio and the access time of the external memory. As in the multi-bit scoreboard model, instructions are fetched from the cache or the external memory.
After the decoding unit, an instruction enters a functional unit. The service time is one clock cycle for integer operations, and one clock cycle plus the external memory access time for a load/store if the data cache misses. Random numbers are used as inputs for instruction opcodes and operands. Random selection is a good technique for selecting operands. Inputs to the program are: the external memory access time, the instruction cache hit ratio, the number of decoding units, the number of registers in the register file, the number of read buses, the number of result buses, the data cache hit ratio, and the prediction correctness ratio.
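To illustrate how random operand selection translates into a true-dependency ratio, the following Python sketch draws random register numbers for a group of instructions decoded together. It then counts how often a source register matches the destination of an earlier instruction in the same group. It is a simplified assumption and not the queuing program described here: the function name, the two-source and one-destination instruction format, and the uniform register choice are all hypothetical.

```python
import random


def true_dependency_ratio(num_decoders: int, num_registers: int,
                          trials: int = 10000, seed: int = 0) -> float:
    """Estimate the fraction of instructions that read a register written
    by an earlier instruction in the same decode group."""
    rng = random.Random(seed)  # seeded so the estimate is repeatable
    dependent = 0
    total = 0
    for _trial in range(trials):
        written = set()  # destinations of earlier instructions in the group
        for _slot in range(num_decoders):
            src1 = rng.randrange(num_registers)
            src2 = rng.randrange(num_registers)
            dest = rng.randrange(num_registers)
            if src1 in written or src2 in written:
                dependent += 1
            written.add(dest)
            total += 1
    return dependent / total
```

With uniform random choices the ratio falls as the register file grows, which is one way to see why random register references may not match the dependency behaviour of real programs.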
The AM29K has a branch-target cache hit ratio of 0.6 and an external memory access time of two clock cycles. For the model, the instruction cache hit ratio is 95%, the data cache hit ratio is 90%, and the prediction correctness ratio is 80%.
Maximum Instruction Issue Rate
In realistic models, the maximum issue rate is affected by instruction fetch bandwidth, data bandwidth, disruption in the flow control, and data dependencies. The instruction bandwidth is enhanced by the instruction cache and instruction prefetch buffer. The data bandwidth is enhanced by the data cache. Branch prediction reduces the disruption in the flow control.
Since the machine variables are set to very large values, data dependency should be the only factor delaying the execution of instructions in the ideal model. The ideal maximum instruction issue rate is the number of decoding units (D). The register renaming technique solves the problem of anti-dependencies and output dependencies. The four types of dependency which can occur are true dependency (Dtrue), temporary dependency for renaming, bus contention, and resource contention. For maximum performance, only true dependency should stop the pipe, while the processor should minimize the effect of the other dependencies.
Dtrue is a linear function of the number of decoding units, which makes the data dependency penalty a linear function of Dtrue for any D. If all instructions execute in one cycle, the penalty is one clock cycle. The extra penalty comes from instructions that take longer than one cycle to execute, such as a load that misses the data cache. To calculate the data dependency penalty, we need the data cache miss ratio (Dmiss), the percentage of load instructions (Linst), and the external memory access time (Tmem).
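The text names the inputs to the penalty but does not give the formula. One plausible reading, offered here only as an assumption, is a base penalty of one cycle plus the expected extra cycles from load misses, Linst × Dmiss × Tmem. The Python sketch below encodes that reading; the function name is hypothetical and the paper's actual formula may differ.

```python
def data_dependency_penalty(load_fraction: float,
                            dcache_miss_ratio: float,
                            mem_access_cycles: float) -> float:
    """Assumed form: 1 cycle base + expected load-miss cycles.

    load_fraction     -- Linst, fraction of instructions that are loads
    dcache_miss_ratio -- Dmiss, data cache miss ratio
    mem_access_cycles -- Tmem, external memory access time in cycles
    """
    return 1.0 + load_fraction * dcache_miss_ratio * mem_access_cycles


# Dmiss = 0.10 and Tmem = 2 follow from the model parameters given in the
# text; Linst = 0.20 is a made-up value used only for illustration.
penalty = data_dependency_penalty(0.20, 0.10, 2)
```

Under this assumed form the extra term is small whenever the data cache hit ratio is high, so the one-cycle base penalty dominates.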
Conclusion
The multi-bit scoreboard uses a simple algorithm for checking data dependencies. To enhance its performance, the processor is equipped with the following features: a branch-history table, an instruction cache, a data cache, simple temporary registers to handle branch and some read/write contentions, multiple decoding units, instruction buffers, an instruction-retire buffer to keep track of the sequential state, a large register file, and many functional units. The multi-bit scoreboard handles data dependencies effectively and so reduces pipe stalls.
The multi-bit scoreboard is an extension of the single-bit scoreboard. For the cost of 5 additional bits, performance increases by 29% with four decoding units. The queuing network model is more realistic. A major problem with the queuing network model is its use of a random number generator for the register file references, which gives a lower true dependency ratio than the actual simulation. The analysis shows that the main factor in performance degradation is true dependency.
This problem can only be reduced by software. The compiler should efficiently schedule instructions to take full advantage of the hardware. Future studies may provide a superscalar-optimization compiler which can better exploit parallelism of scalar code.
Solution
With advanced technologies it is possible to implement a single-chip multiprocessor in the same area as a wide-issue superscalar processor. For applications with little parallelism, the performance of the two microarchitectures is comparable.
For applications with large amounts of parallelism, the multiprocessor microarchitecture outperforms the superscalar architecture by a significant margin. Single-chip multiprocessor architectures have the advantage that they offer a localized implementation of a high-clock-rate processor for inherently sequential applications and low-latency interprocessor communication for parallel applications. Microarchitectural innovations employed by recent microprocessors include multiple instruction issue, dynamic scheduling, speculative execution, and non-blocking caches.