LIMITATION OF SUPERSCALAR MICROPROCESSOR PERFORMANCE
By: Akshita Banthia (11BCE0475)
Superscalar is a form of microprocessor in which several instructions can be initiated simultaneously and executed independently during the same clock cycle. The main limitation of this approach is the handling of data dependencies: if they are not handled effectively, an execution rate of more than one instruction per cycle is difficult to achieve. This case study uses a multi-bit scoreboard architecture to handle data conflicts for out-of-order execution and completion of instructions. The paper analyses the performance of the superscalar microprocessor using two simulation models, which use benchmark programs, and one calculation model, which uses queuing networks to derive a formula for the shortfall from peak performance.
Introduction
A single-bit scoreboard is sufficient to detect dependencies in processors with only one pipeline, where it stops the flow of instructions until the dependency is cleared. A multi-bit scoreboard is used in processors that handle multiple instructions at once. Here the multi-bit scoreboard, in combination with temporary result registers, maintains the flow of instructions, and a branch prediction unit is included to help achieve peak performance.
Multi bit scoreboard architecture
In this model a pipeline architecture is implemented. It consists of four stages: instruction fetch, instruction decode, execution, and write-back. Instructions are fetched from external memory or cache memory into the instruction buffers and then transferred to the decoding units. A set of temporary registers is used as renaming registers for instructions with output dependencies and anti-dependencies. The branch prediction unit predicts the next stream of instructions. Data needed by load/store instructions is handled by the data cache. In case of an interrupt, the retire unit restores the proper processor state; it also keeps track of the instructions in the pipe. The execution unit has several functional units, each of which handles a different class of operations: branch, load/store, integer, ALU, shifter. Instructions are executed with the help of a queue buffer. The buffer holds instructions when more than one instruction is dispatched from the decoding units, or when the functional unit is busy executing a previous instruction.
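The role of the queue buffer in front of a functional unit can be illustrated with a small Python sketch. This is only an illustration under stated assumptions: the class name `FunctionalUnit`, the fixed `latency` per instruction, and the one-instruction-at-a-time behaviour are hypothetical choices made here and are not taken from the paper.

```python
from collections import deque


class FunctionalUnit:
    """Hypothetical functional unit with a queue buffer in front of it."""

    def __init__(self, name, latency):
        self.name = name
        self.latency = latency  # assumed fixed number of cycles per instruction (>= 1)
        self.queue = deque()    # queue buffer holding dispatched instructions
        self.current = None     # instruction currently executing, if any
        self.remaining = 0      # cycles left for the current instruction

    def dispatch(self, instr):
        # Decode may dispatch several instructions in one cycle;
        # they wait here while the unit is busy.
        self.queue.append(instr)

    def tick(self):
        """Advance one clock cycle; return the instruction that completes, if any."""
        if self.current is None and self.queue:
            self.current = self.queue.popleft()
            self.remaining = self.latency
        if self.current is None:
            return None
        self.remaining -= 1
        if self.remaining == 0:
            done, self.current = self.current, None
            return done
        return None


alu = FunctionalUnit("alu", latency=2)
alu.dispatch("add")  # two instructions dispatched in the same cycle
alu.dispatch("sub")
completed = [alu.tick() for _ in range(4)]
print(completed)
```

In this sketch the second instruction simply waits in the queue until the first one has finished, which is the buffering behaviour the paragraph describes.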
Handling of data dependencies
In this architecture, instructions reference the register file directly, and only load/store instructions can access external memory for data. A set of scoreboard bits in the register file indicates how each register is being used by the current instructions. READ is a multi-bit field which indicates that the register is a source operand of an instruction. WRITE is a single bit which indicates that an instruction will store its result data into the register. TEMP is a single bit which indicates that a conflict has occurred with a prior instruction and that the result data will be stored in a temporary register until the conflict is over. The decoding unit checks the scoreboard for dependencies and sets the scoreboard bits accordingly while accessing the register. The algorithm for setting the status bits is as follows:
For a source register: if TEMP is set, the instruction must wait in decode; this is a true dependency. Else if WRITE is set, the instruction must wait in decode, and READ is increased for anti-dependency checking. Else READ is increased and the instruction can be dispatched.
For a destination register: if TEMP is set, the instruction must wait in decode, because only one level of temporary register is allowed for each register. Else if WRITE or READ is set, TEMP is set, a temporary register is assigned to the instruction, and the instruction can be dispatched. Else WRITE is set and the instruction can be dispatched.
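The two rule sets above can be sketched in Python as follows. This is a minimal sketch, not the paper's C model: it reads the first rule set as applying to a source register and the second to a destination register, which is an interpretation of the text. The names `ScoreboardEntry`, `check_source`, `check_destination` and the returned action strings are hypothetical, and READ is modelled as a plain counter.

```python
from dataclasses import dataclass

# Hypothetical action labels returned to the decode stage.
WAIT = "wait"
DISPATCH = "dispatch"
DISPATCH_TEMP = "dispatch_with_temp"


@dataclass
class ScoreboardEntry:
    read: int = 0        # READ: multi-bit, modelled here as a count of pending reads
    write: bool = False  # WRITE: a prior instruction will write this register
    temp: bool = False   # TEMP: a result is parked in a temporary register


def check_source(entry):
    """Apply the first rule set to a register used as a source operand."""
    if entry.temp:
        return WAIT          # true dependency: wait in decode
    if entry.write:
        # Per the text, READ is also increased here. How a retry in a later
        # cycle avoids counting twice is not specified and is left out.
        entry.read += 1
        return WAIT
    entry.read += 1
    return DISPATCH


def check_destination(entry):
    """Apply the second rule set to a register used as a destination."""
    if entry.temp:
        return WAIT          # only one level of temporary register per register
    if entry.write or entry.read > 0:
        entry.temp = True    # result goes to a temporary (renaming) register
        return DISPATCH_TEMP
    entry.write = True
    return DISPATCH


r1 = ScoreboardEntry()
actions = [
    check_destination(r1),  # first writer of r1
    check_source(r1),       # reader while the write is pending
    check_destination(r1),  # second writer gets a temporary register
    check_destination(r1),  # third writer must wait
]
print(actions)
```

Clearing the bits when an instruction completes or retires is not covered by the rules quoted above, so the sketch leaves it out.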
For simulation we build a model as a C program. The input is generated by the HighC29K compiler...