The Virtual Register Mapping Problem: Can We Combine Superscalar And EPIC Ideas? 1. Introduction The method of register renaming has been in use for several decades, based on the idea of a new identification tag being used for each new result from a load/arithmetic unit to distinguish it from previous values and correctly establish the producer-consumer relationship between instructions. In superscalar execution, the tags correspond to virtual registers, and each result goes directly to the consumer instructions holding its tag in their operand fields; in EPIC machines each tag represents a real register, and consumer instructions acquire their operands from the register file after sufficient time has elapsed for each required new result to have been stored into its tag-designated register. The mapping of virtual register tags in superscalar execution is dynamic, i.e., tags are determined at runtime depending on what tag is free (or "graduated"); in current EPIC designs, tags are determined at compile time and static during execution. This results in a very significant difference relating to the impact of a load cache miss: in superscalar execution, the consumer instructions simply get delayed, and the operand tag for the load result has its graduation delayed too and is not available for mapping till later, but there is no further impact on the execution of other instructions. In current EPIC machines, a cache miss causes execution to stall because the hardware does not have information about the consumer-producer relations contained in the instruction bundles, which were statically produced by the compiler, not mapped at runtime by the hardware. It is certainly possible to partially adopt superscalar methods in EPIC machines, by assigning register tags and generating instruction bundles at runtime. Instead of actual EPIC instructions, the object code stored in memory would be some kind of abstract machine code (e.g. Java bytecode), which is converted to instructions carrying virtual register numbers by a tagging unit, e.g., given abstract machine code load A load B * store C the tagging unit would produce T1 load A load acquires T1 and pushes copy on tag stack T2 load B load acquires T2 and pushes copy on tag stack T3 * T1 T2 * pops T1 T2 from tag stack, acquires T3 pushing copy T3 store store pops T3 from tag stack assuming that all the tags are free. The execution would then have load A into register 1, load B into register 2, * acquiring the two operands from the registers when both are Ready, putting its result into register 3, from which the store instruction would acquire the value. If either A or B suffers a cache miss, * and store would be delayed and the tags involved would graduate later, but the tagging and execution of other instructions are not affected. In superscalar execution, the tagged instructions would be directly dispatched to their execution units and will wait for any unavailable operands to be sent to them before execution. In EPIC-like execution, those instructions with tags corresponding to ready registers can be placed into execution bundles and queue up for dispatching, while those whose registers are still waiting for load/arithmetic results will wait. This would achieve a kind of compromise between superscalar and EPIC modes of operation. A second advantage of generating EPIC code at runtime is that abstract machine code is generally much shorter than VLIW code, which tends to contain a high portion of empty slots. Shorter object code reduces not only space, but also the time taken to fetch the code and the chance of cache misses with a larger portion of the program being in the CPU. The particular form of abstract machine code used to represent the source (high level language) program has just a minor impact on our idea's implementation details, as would be pointed out below. The main concern of EPIC designers against such compromises is over the time and circuit complexities: tagging appears to be a sequential process, and the register-readiness waiting and operand acquisition with N tagged instructions and M registers requires hardware complexity of order NM. In this study, we wish to establish that both processes can be made scalable. 2. Multiple tagging units: We first assume that the object code is stack-based abstract machine code. In typical programs, there are various sections that can start and finish with an empty stack, e.g., C = A+B; If C==D X = 1 Else X = -1; For I = 1 To 10 Y[I] = Z[I]+X; the arithmetic expression, if-then-else and loop can each start executing on an empty stack and will finish with an empty stack. Object code from the three sections can undergo the tagging process in separate units: they do not pass execution results to the following code sections via the stack; any data transfers occur through the memory so their executions are sequential, but their tagging need not be. The compiler can insert markers before each such code section so that it can be sent to a separate tagging unit, without adding significantly to the burden of compiling and code optimization. Note that each iteration of a loop requires a separate tagging process to enable iterations to execute in parallel, since they must use different registers to hold their load and arithmetic results. What if the program is such that such convenient sections cannot be found? This can occur in certain complex arithmetic expressions in which each part leaves some result on the stack for the next part to use, or in a form of loop carrying in which each iteration leaves some results on the stack for the subsequent iteration(s) to use; it would seem to us that the compiler can assist, such as breaking such stack-carried links by storing these results into memory locations and having the later sections bring them back, creating separately taggable sections. We have also devised a tag conversion scheme whereby each tagging unit can pass on a set of tags to the next unit to identify the elements at the stack base needed by the next section of the program, but before a tagging unit receives the information it has already performed the tagging process representing the unknown tags using a set of dummy tags, which are converted to actual tags when these are received from the previous tagging unit. For object code based on a register architecture (abstract or real) the situation is not much different; the compiler can still identify/create separately taggable sections from the structure of the high level language program, each loading its needed data from memory without "inheriting" register operands from earlier sections. The dummy tags method can again be used to handle unavoidable inherited register references. The tagging process itself is largely the same as the commonly used register renaming process. To operate multiple tagging units, the set of register tags would be partitioned into a number of private tag pools, one for each tagging unit, plus a shared pool that serves two purposes: first as a backup pool from which a tagging unit can "borrow" tags when it uses up its own, then for use as "inherited" tags that are shared between different tagging units. (A third use of the shared pool will be discussed below.) Thus, the instructions produced by each tagging unit would have the unit's private tags and common pool tags only. 3. Operand acquisition: The partitioning of tags necessarily results in a partitioning of the operand acquisition process: previously N tagged instructions could acquire operands from any of M registers, but if there are t tagging units, tags are divided into t+1 pools and each tagging unit produces ~N/t instructions (the distribution is not necessarily even as code sections are of different sizes and not all the sections start tagging at the same time), each group only acquires operands from 2M/(t+1) registers, those in its private pool and from the shared pool. The circuit complexity of the acquisition is therefore o(2MN/t). However, whereas for each private pool, a crossbar of N/t x M/(t+1) is used, the shared pool uses a larger crossbar of N x M/(t+1) that may have a greater level of actual complexity and operate more slowly than the t smaller crossbars. The important point is however that the design is scalable: given a greater degree of instruction level parallelism and larger number of registers, more tagging units and tag pools would serve to keep down both tagging latency and operand acquisition complexity. A comment on register graduation is in order. When a tag represents an operand placed on the stack, the operand is used once only, so the register graduates immediately after its content has been acquired by the consumer instruction. It is however possible for the compiler to identify repeatedly loaded operands and cause them to be "retained", i.e., the register does not immediately graduate after use if the Retain flag is set. The compiler then converts a subsequent load/arithmetic operations involving this variable into an arithmetic operation using this register instead, the conversion being similar to part of a stack folding operation. Because the use of such retained values typically extends over multiple separately tagged code sections, the registers involved must be from the common pool. If the producer instruction initially used a tag from the private pool, a tag conversion is again required. Again the situation for a register-based abstract machine is not much different, with the compiler needing to distinguish between use-once and use multiple times register allocations, adding somewhat to its burden of register use optimization. We feel that there is potential to implement virtual register mapping by combining the features of superscaler and EPIC execution and benefit from the superior features of each. Superscalar Execution of Stack Programs Using Reorder Buffer C K Yuen School of Computing National University of Singapore Kent Ridge, Singapore 119260 Abstract It is shown that stack machine programs can be executed with a reorder buffer and have the same instruction level parallelism as register machine programs with register renaming, and the operation of the buffer is simpler because of single use stack operands. Related issues of pushing stack into memory, multithreading and branch predication are also discussed. 1. Introduction In the past 20 years, while phenomenal progress was taking place in microprocessor architecture, stack machines, which once enjoyed some commercial success (Burroughs 6700, HP3000, ICL2900, Data General Eclipse) and favour of some academicians [1][2], were all but forgotten. It is generally thought that the superscalar techniques, which have been used in register machines especially those with RISC architectures, are not applicable to stack architectures, because stack programs lack instruction level parallelism, and register renaming does not apply. With instructions taking operands from the top of the stack and leaving results there, stack programs appear to have a high level of data dependency, and with instructions displaying no source and destination register references, data dependency relations are thought to be difficult to analyze. In this paper we seek to dispel these notions, by showing that the method of the reorder buffer already provides the solution. Typical stack program examples reveal the same amount of instruction level parallelism as equivalent register machine programs, and are easier to parallelize because operands are erased once used by an operator. Hence, an operand need only be supplied to one operator, uniquely identified by its reorder buffer tag, and there is no need for a common data bus that seeks out all instructions that require a new result emerging from an execution unit. Once used, a new result is immediately discarded without being actually stored into the stack, in contrast to new register contents which must be written back to registers from the reorder buffer even if they may already have been superseded by later writes. Attaching operand tags to an operator is also easily done, using a stack of tags, whereas a register machine has to search the reorder buffer to find the tags of the required source registers. To start, consider the follow two expressions with corresponding stack programs: g = a*b+(c+d) - LD A, LD B, *, LD C, LD D, +, +, ST G g = a*(b+c*d) - LD A, LD B, LD C, LD D, *, +, *, ST G The first has some instruction level parallelism: if the fetching of a or b is slow (e.g., cache miss), a superscalar machine would proceed with the fetch of c and d into other registers and produce the result cd out of order, and dispatch the + instruction for execution as soon as ab emerges from the multiply unit and is forwarded to the instruction. The second has little parallelism, though if the fetch of b is delayed, c*d would proceed in parallel. We now show how the same parallelism can be achieved in a stack machine that uses a reorder buffer to "rename" stack location with tags, and an operand tag stack to identify source operands with operators that consume them. Using the first example above, the renaming unit would use a new tag for every instruction tha leaves a result on the stack and leave this tag on the tag stack for attachment to a later instruction that consumes the operand: instr naming unit tag stack ld a T1 ld a T1 ld b T2 ld b T1 T2 * T3 * T1 T2 T3 ;* picks up two tags leaving its own tag ld c T4 ld c T3 T4 ld d T5 ld d T3 T4 T5 + T6 + T4 T5 T3 T6 + T7 + T3 T6 T7 st e st T7 e ;store consumes tag without leaving one After an instruction executes in a load/store or arithmetic unit, its result is delivered to the later instruction that carries its tag, e.g., the first two Load instructions would deliver their operands to the *, and the last two loads to +, which will execute and deliver results to the second +, in the same manner as other superscalar machines. This can be achieved using a common data bus or in the reorder buffer itself. In the following sections, the reorder buffer method will be explained as it only requires a table rather than a series of diagrams, though the principle is the same. 2. Simple Examples Below is an 8-slot stack reorder buffer. Tags 0 to 7 are used from bottom upward, and each entry may be Idle or Busy depending on whether it is occupied by an operator. The result of this operator may be Committed to a destination operator (e.g., in the first expression above, the results of both * operators would be Committed to the +), which may use it as the Left or Right operand if the operator is binary. The tag of the destination operator appears in the DTag field. The operator result is Free if the destination operator has not yet appeared in the reorder buffer. A Free result may be Available now, if the operator has already executed, and the result would be stored in the Value field, or it may be Waiting, either because its own operands have not yet arrived, or if they have arrived earlier and the operator has been dispatched for execution, but the result has not yet been returned. If a binary operator has already received one operand, it is stored in the Value field, and the R or L bit set depending on whether it is the Right or Left operand. When the second operand arrives, R and L bits are both set, and the operator is dispatched, so that there is no need to store the second value. Note that a Committed operator cannot be Available, because when the result arrives, it is immediately sent to the destination operator, and the source operator will be deleted from the reorder buffer; the result would not be stored. This is why the W/A flag for a Free entry may be reused as the R/L flag when the entry become Committed. Before any execution has taken place, the reorder buffer has the following state: tag flags DTag Op L R Value 7 I F W -- - 0 0 -- 6 I F W -- - 0 0 -- 5 I F W -- - 0 0 -- 4 I F W -- - 0 0 -- 3 I F W -- - 0 0 -- 2 I F W -- - 0 0 -- 1 I F W -- - 0 0 -- F-> 0 I F W -- - 0 0 -- Operand tag stack: {} All entries are Idle, Free and Waiting, with no destination tags, opcodes or operands. The operand tag stack is empty, while the Free Entry pointer is at the bottom. Now two load instructions from the stack program ab*cd*+ arrive, changing the situation to: tag flags DTag Op L R Value 7 I F W -- - 0 0 -- 6 I F W -- - 0 0 -- 5 I F W -- - 0 0 -- 4 I F W -- - 0 0 -- 3 I F W -- - 0 0 -- F-> 2 I F W -- - 0 0 -- 1 B F W -- LD 1 1 @B 0 B F W -- LD 1 1 @A OTS: {0,1} LD has one operand, the address, which is already present; hence, it can be immediately dispatched. Its R and L bits are both set by default. The two loads are allocated the entries 0 and 1, and these tags appear in the Operand Tag Stack. The two entries are now Busy, but still Free, and have no DTag values. The * operator now arrives, and takes the two operand tags from the OTS. It is given entry 2 of the buffer, and 2 is entered as the Destination instruction tags of entries 0 and 1, which are now Committed, one being the Right operand and the other the Left operand of the * operator. The Operand Tag Stack now has the tag 2 because the * operator would erase the two operands and leave behind its own result, so that only one operand is available on the stack: tag flags DTag Op L R Value 7 I F W -- - 0 0 -- 6 I F W -- - 0 0 -- 5 I F W -- - 0 0 -- 4 I F W -- - 0 0 -- F-> 3 I F W -- - 0 0 -- 2 B F W -- * 0 0 -- 1 B C R 2 LD 1 1 @B 0 B C L 2 LD 1 1 @A OTS: {2} Suppose the two loads achieve cache hits and return values now, each will use its own entry to find the DTag 2, and will then be sent to the * operator. Their entries are now Idle again. Assuming that the * also completes quickly, then we would have tag flags DTag Op L R Value 7 I F W -- - 0 0 -- 6 I F W -- - 0 0 -- 5 I F W -- - 0 0 -- 4 I F W -- - 0 0 -- F-> 3 I F W -- - 0 0 -- 2 B F A -- * 1 1 (a*b) 1 I F W -- - 0 0 -- 0 I F W -- - 0 0 -- OTS: {2} The arrival of the next two loads and * would produce a similar process, giving: tag flags DTag Op L R Value 7 I F W -- - 0 0 -- F-> 6 I F W -- - 0 0 -- 5 B F A -- * 1 1 (c*d) 4 I F W -- - 0 0 -- 3 I F W -- - 0 0 -- 2 B F A -- * 1 1 (a*b) 1 I F W -- - 0 0 -- 0 I F W -- - 0 0 -- OTS: {2,5} When + arrives, it can immediately obtain the operand values in 2 and 5 and proceed to execution, leaving its own tag in the Operand Tag Stack: tag flags DTag Op L R Value F-> 7 I F W -- - 0 0 -- 6 B F W -- + 1 1 -- 5 I F W -- - 0 0 -- 4 I F W -- - 0 0 -- 3 I F W -- - 0 0 -- 2 I F W -- - 0 0 -- 1 I F W -- - 0 0 -- 0 I F W -- - 0 0 -- OTS: {6} If in the mean time ST arrives, it would take tag 6 as its right operand (the left operand of a store is the store address), putting its Dtag 7 into entry 6, and will proceed to execution when the result of + is returned: tag flags DTag Op L R Value 7 I F W -- ST 1 0 @G 6 B C W 7 + 1 1 -- 5 I F W -- - 0 0 -- 4 I F W -- - 0 0 -- 3 I F W -- - 0 0 -- 2 I F W -- - 0 0 -- 1 I F W -- - 0 0 -- F-> 0 I F W -- - 0 0 -- OTS: {} Note that the stack is empty again since the store removes the final result. Now consider an alternative scenario, that the fetches all encounter cache misses so that no load succeeds before the arrival of ST: tag flags DTag Op L R Value 7 B F W -- ST 1 0 @G 6 B C R 7 + 0 0 -- 5 B C R 6 * 0 0 -- 4 B C R 5 LD 1 1 @D 3 B C L 5 LD 1 1 @C 2 B C L 6 * 0 0 -- 1 B C R 2 LD 1 1 @B F-> 0 B C L 2 LD 1 1 @A OTS: {} None of the arithmetic operators have started executing, but the stack is empty since all the operands are committed and no further operators can be accepted because there will be nothing for them to process. Now suppose A, C and D are loaded before B; the second * is dispatched but the first is still waiting: tag flags DTag Op L R Value 7 B F W -- ST 1 0 @G 6 B C R 7 + 0 0 -- 5 B C R 6 * 1 1 -- 4 I F W -- -- 0 0 -- 3 I F W -- -- 0 0 -- 2 B C L 6 * 1 0 (A) 1 B C R 2 LD 1 1 @B F-> 0 I F W -- -- 0 0 -- OTS: {} After the second * completes and the first starts, we have tag flags DTag Op L R Value 7 B F W -- ST 1 0 @G 6 B C R 7 + 0 1 (C*D) 5 I F W -- -- 0 0 -- 4 I F W -- -- 0 0 -- 3 I F W -- -- 0 0 -- 2 B C L 6 * 1 1 -- 1 I F W -- -- 0 0 -- F-> 0 I F W -- -- 0 0 -- OTS: {} Eventually, we will have the + executing, followed by the ST, as we have seen before. The second scenario shows that we achieve the same instruction level parallelism as the case of the register machine, without requiring the compiler to make any special effort. In contrast, in a register machine it is necessary to ensure that c*d is using different registers from a*b, which can only be done if there are free registers available. Otherwise, one may need to store away some registers beforehand. Such compiler optimizations are vital for the good performance of register machines, but are automatic in a stack machine because each subsequent subexpression would be evaluated in a higher part of the stack and have different reorder buffer entries. Note that if at the time an operator is sent for execution, it is already Committed, then the DTag is attached to it so that its result could be directly sent to its consumer operator and its entry may be deleted. A Free operator which remains Free when the result returns will store its result to wait for the consumer operator. An operator that becomes Committed after being sent for execution retains its entry until the result returns and the DTag is used to forward the result to to consumer operator. 3. Complications Now consider the second expression abcd*+*, with deeper stack nestings: tag flags DTag Op L R Value 7 B F W -- ST 1 0 @G 6 B C R 7 * 0 0 -- 5 B C R 6 + 0 0 -- 4 B C R 5 * 0 0 -- 3 B C R 4 LD 1 1 @C 2 B C L 4 LD 1 1 @D 1 B C L 5 LD 1 1 @B F-> 0 B C L 6 LD 1 1 @A OTS: {} Again the same instruction level parallelism is achieved: if the fetch on B is delayed by a cache miss, but C and D can be fetched immediately, then the first * executes in parallel with the wait for B, after which +, second * and ST would proceed sequentially. Note however that deeply nested expressions tend to fill up the reorder buffer quickly: the operands must be fetched and stacked up, before any operators can execute to free entries for later operands and operators. For register machines, this causes one to run out of free registers, but one can often avoid storing and retrieving registers by shuffling things around them. This cannot be done with the stack, and the only solution is to store the oldest part of the stack into memory and retrieve it later. In the above example this is still not forced, because we can wait for operators to execute and free the entries, but if the example is changed to abcdefgh*+*+*+*, then waiting would not solve the problem because the reorder buffer is filled with operands but there are no operators to consume them: tag flags DTag Op L R Value 7 B F W -- LD 1 1 @H 6 B F W -- LD 1 1 @G 5 B F W -- LD 1 1 @F 4 B F W -- LD 1 1 @E 3 B F W -- LD 1 1 @D 2 B F W -- LD 1 1 @C 1 B F W -- LD 1 1 @B F-> 0 B F W -- LD 1 1 @A OTS: {0,1,2,3,4,5,6,7} SIM: Base-1 How does one detect this condition? We note that the Operand Tag Stack is also full: there can be at most 8 entries in the stack since there are only 8 different tags. If some of the entries are occupied by binary operators that consume two operands to produce one result, then the OTS would have been reduced, and it can only become full if all entries are operands. When this happens, we need to cache the entry indicated by the bottom of OTS in memory because it must be the oldest. Hence, each OTS must have be associated with a Stack In Memory pointer which shows the top of the cached portion of the stack. In the above example, SIM is Base- 1 since that part is currently empty. For the example abcdefgh*+*+*+*, we would then have, assuming that A, G and H have been fetched but none of the others have: tag flags DTag Op L R Value 7 I F W -- - 0 0 -- 6 I F W -- - 0 0 -- 5 B F W -- LD 1 1 @F 4 B F W -- LD 1 1 @E 3 B F W -- LD 1 1 @D 2 B F W -- LD 1 1 @C F-> 1 B F W -- LD 1 1 @B 0 B F W -- * 1 1 -- OTS: {1,2,3,4,5,0} SIM: Base Note that SIM now points to the base of the stack frame, where A has been stored to release entry 0 for the operator *, while H and G, having been committed to the *, have been fetched and * is being executed, and there are two free entries in the reorder buffer for later instructions. Then tag flags DTag Op L R Value F-> 7 I F W -- -- 0 0 -- 6 B F W -- + 0 1 (F) 5 I F W -- -- 0 0 -- 4 B F A -- LD 1 1 (E) 3 B F A -- LD 1 1 (D) 2 B F A -- LD 1 1 (C) 1 B F A -- LD 1 1 (B) 0 B C L 6 * 1 1 -- OTS: {1,2,3,4,6} SIM: Base which depicts the situation after B-F were all loaded while the first * was in execution, but in the mean time operator + arrives, taking operand tags 0 and 5, freeing the entry for F after taking the value already Available. Note that there are 6 operands (a in memory, b,c,d,e, and g*h+f) still available. Eventually we will have the situation tag flags DTag Op L R Value F-> 7 I F W -- - 0 0 -- 6 B F W -- + 1 1 -- 5 I F W -- - 0 0 -- 4 I F W -- - 0 0 -- 3 I F W -- - 0 0 -- 2 I F W -- - 0 0 -- 1 I F W -- - 0 0 -- 0 I F W -- - 0 0 -- OTS: {6} SIM: Base with the last + operator still executing, and last * arriving. It will take tag 6 for its right operand, but there is no tag in the operand tag stack for the left operand, meaning that the operand must be read from the memory, presumably from the cache if not much time has elapsed since the last save. After this is done, SIM would return to the value Base-1. The process of save and retrieve is similar to the caching of top-of-stack registeres used in stack machines. It is clear that deeply nested expressions are undesirable, though with considerably larger reorder buffers than in the example, the overflow situation would not occur so quickly. For mathematical expressions, programmers and compilers can modify the code to reduce nesting depth, but the problem would still arise because of function calls within expressions, e.g., x = a+b*(c+F(exp1,exp2...)). A stack frame in memory would be created for F, and the argument expressions must be evaluated. These would then stored into the argument locations in the stack frame, and the body of F is entered, with new variable values loaded onto the stack to execute the code therein. All this will be done with a,b and c already in the reorder buffer, and if functions call functions, the stack nestings become inevitably deep, so that moving from reorder buffer to SIM and later retrieving would be quite normal. Their actual impact on performance would need to be further studied and quantitatively analyzed. 4. Reservation stations In the above discussion, instructions with missing operands wait in reorder buffer itself. It is also possible to have reservation stations with each arithmetic unit to hold the waiting instructions and have a common data bus that searches for instructions having a particular result tag. When a result emerges from an arithmetic unit, it must be delivered to the consumer instruction with a matching tag. Doing this at high speed can be expensive. With a stack architecture each result has only one consumer instruction, and a tag conversion scheme, changing the result tag of an instruction to the reservation station address of the consumer instruction, if the consumer arrives for execution while the producer instruction is still in a reservation station, allows the result to be directly delivered to the consumer. specifically: 1. when an instruction is dispatched to a reservation station, its reservation station address is left on the stack; the stack location is attached to the instruction as well as its unique tag 2. if the consumer instruction comes in for execution while the producer instruction is still in the reservation station, the consumer instruction inserts its own reservation station address to replace the tag of the producer instruction; when the producer instruction is executed, its result can be directly sent to the consumer instruction; the producer reservation station address left on the stack is erased (just like an operand is erased by the consumer instruction) 3. if the producer instruction goes into the arithmetic unit for execution before the arrival of the consumer instruction, its tag has not been converted; the reservation station address left on the stack is replaced by the tag so that the consumer instruction, if it arrives while its operand is being computed, would pick up the tag; when the result is produced, the tag is used to search for the consumer instruction among the reservation stations; if the consumer instruction still has not arrived for execution, the stack address is used to put the value into the stack so that the consumer instruction will later pick up a value rather than a reservation station address or a tag 4. if an instruction finds both its operands waiting on the stack, it does not get a reservation station but goes directly for execution; it need to go through the tag matching process like in 3b 5. If we set up a table with one entry per tag, a consumer instruction can put its own reservation station address into the table when it finds that its operand producer has already gone for execution; when the result emerges, the table is used to convert the tag to the address then so that the operand can be directly delivered; the table can be quite large however, and would be sparsely used, so the "tag conversion after execution" method can be rather expensive 6. note that in the scheme, the reorder buffer is merged with the stack, which may contain reservation station addresses, tags, or values Note that tag conversion does not work for register machines, because there may be multiple consumer instructions for a result We can still have a reorder buffer, which only holds operands not picked up by instructions, just as the stack holds tags not yet picked up. It then becomes natural to integrate the two. That is, each entry of the stack can hold either a tag or a value, and a new instruction coming from the fetch queue would pick up either operand values or operand tags from the stack, leaving behind its result tag. When the instruction has executed, the common data bus would deliver the value to an instruction in a reservation station or to the stack, dependidng on where the tag is found, after which the value need not be retained. The main advantage of this arrangement is simpler handling of branch prediction: in the event of an incorrect prediction, the stack must be restored to the state just before the branch. This can be achieved by placing a branch marker on the stack, and whenever the stack is popped below the marker, the items popped off are saved on a buffer stack. If the prediction is confirmed, the marker is removed from the stack and any items in the buffer stack above the marker may be erased. If the prediction is incorrect, the stack content above the marker is erased and the items saved on the buffer stack are returned to the execution stack. If prediction within prediction occurs, the second branch is handled similarly. If the first prediction is incorrect, only items saved for the first prediction are brought back to the execution stack. It is simpler to place a marker and forbid popping below it, but this prevents such optimized code as ld x blt a st y br b a: st z b: ... for if x<0 then z=x else y=x; and for if x<0 then a=b+c else a=b-c the code has to be compiled "as is" instead of being optimized to ld b, ld c, if x<0 then + else -, st a If we stick to the method used in the last section, each entry need to have a prediction indicator (say 2-bits to permit three levels of prediction), and an entry produced during a predicted branch is marked accordingly so that if the prediction fails, all entries with indicators at that predication level or higher are marked as "free". An entry consumed during branch prediction is not immediately freed, but only after the prediction is confirmed. This is achieved by reusing free entries only if they have zero prediction indicators. A branch is somewhat like a store, in that it consumes one operand from stack and makes use of an address, except that it changes the program counter instead of a memory location. To take an example: LD A, LD B, CMP, BEQ X, we have tag flags DTag Op L R Value 7 I F W -- - 0 0 -- 6 I F W -- - 0 0 -- 5 I F W -- - 0 0 -- F-> 4 I F W -- - 0 0 -- 3 B F W -- BEQ 1 0 @X 2 B C W 3 CMP 0 0 -- 1 B C R 2 LD 1 1 @B 0 B C L 2 LD 1 1 @A OTS: {} which depicts the situation when all four instructions have arrived, but neither A nor B have been fetched. The operand stack is empty as the code leaves no result on the stack for later operators. Branch prediction table is not affected by changing to a stack machine, since it uses past branch history without involving registers or stack. The stack reorder buffer idea is also compatible with multithreading, since the operand and operator entries in the reorder buffer for each thread are linked together using tags, and can mix with entries for other threads. Creating a new thread requires the allocation of memory for its stack frame, which contains its variables, together with space for working storage that is only used when a thread need to save entries from the reorder buffer. Each thread's context consists of a program counter, an SIM pointer referring to its own stack frame (which may be linked to stack frames of other threads if they are part of the same multi tasking program), and an operand tag stack, which will have a size limit much smaller than the size of the reorder buffer, so that no thread can easily monopolize the whole buffer. A deeply nested thread would soon strike the OTS size limit, triggering memory saves and retrievals, thus slow down its own execution, hence encouraging code revisions. A shallowly nested thread which quickly consumes operands from the stack can have more reorder buffer entries than OTS entries however. One way leading to multithreading is that, when a function call is encountered in the source program F(exp1,exp2...), the compiler may decide after analysis that the argument expressions are complex but free of side effects, and spawn separate threads for each. The threads share the stack frame of the original thread global variable access, but must have individual stack frames for working storage, which is pointed to by the SIM. Because the threads must have no side effects, writing into the shared stack frame is not permitted, nor can they communicate with each other through messages or shared space unless the source language has suitable facilities that permit thread spawning without requiring them to be pure. It is possible for the operating system to automatically save the oldest entries listed at bottom of the OTS of a thread into memory when it is blocked, in order to free entries for other threads to use, provided values are already Available. These are the operands which have not yet committed to a destination instruction, and are placed into the SIM portion. Committed ones cannot be saved because they will be needed by an operator already in the reorder buffer, but they will be freed after use in due course, so that their non-removal does not have a long term effect. Uncommitted unAvailable entries also cannot be removed because values may arrive later and have nowhere to go, and saving into SIM must wait until then. Because operators and operands of a suspended thread can still be in the reorder buffer, its execution does not completely stop until later, but no new instructions will be allocated entries and the OTS need not be available in the CPU. Being compatible with both multithreading and branches, the stack reorder buffer idea is also compatible with branch predication, which creates new threads by executing both paths of a conditional branch, and confirming one while cancelling the other when the Boolean condition is known. While they are executing unconfirmed (speculative execution), the two threads must share the same stack frame that the original thread was using before the branch, and saving reorder buffer entries into the memory need to be forbidden. To cancel a thread, its context (PC, SIM and OTS) is removed so that no further allocation of reorder buffer entries will be made, previous entries can be cancelled by using the content of OTS, which shows all the uncommitted entries, and each entry may also be used to recursively free additional entries that committed to it. Consider the simple example IF AB THEN X = Y ELSE X = Z; which has the object code LD A, LD B, CMP, BEQ else, LD Y, ST X, BR, else: LD Z, ST X, ... . IF A and B cause cache misses, it would obviously be beneficial to start the fetching of Y and Z, but without executing the store into X. Thus we have tag flags DTag Op L R Value 7 I F W -- - 0 0 -- F-> 6 I F W -- - 0 0 -- 5 B F W -- LD 1 1 @Y 4 B F W -- LD 1 1 @Z 3 B F W -- BEQ 1 0 @else 2 B C W 3 CMP 0 0 -- 1 B C R 2 LD 1 1 @B 0 B C L 2 LD 1 1 @A then context - OTS1: {4} PC -> ST X, BR else context - OTS2: {5} PC -> ST X, ... If BEQ produces FALSE, then the second context is cancelled, with deletion of entry 5 (there are no other entries committed to 5), while the first context is allowed to proceed further. Another reason for holding back store (and also branch) instructions is that they leaves no result on the stack, hence no tag on the OTS, and cannot be so readily identified for cancellation. All these elaborations need to be further considered. 5. EPIC It is also possible to let the language processor go through the ILP identifications process, equivalent to what was described earlier under reorder buffer or reservations station/common data bus, identifying instructions that can fill explicitly parallel slots in a multi-instruction bundle while enforcing the true dependencies and avoiding false ones (which arise from the reuse of tags freed after operand consumption). Such a machine would have real instead of virtual registers, with each tag representing a register. When the load unit/arithmetic unit produces a new value, it is delivered to the register numbered by the tag, from which the consumer instruction retrieves it, releasing the register for use by future instructions. As the registers are released immediately after use, register allocation is automatically optimal. Because in a stack virtual machine each tag/register value is used once only, we can apply the tag conversion method allowing results to be direct delivered to the consumer instructions instead of the instructions suspending waiting for the registers to receive the values, which can occur because of cache misses delaying the loading of registers from memory. If compilation has been done using a stack architecture, each operand from memory only has one consumer instruction, and in the event of a instruction unable to execute owing to a cache miss, the instruction is diverted to a temporary instruction buffer, and the target of the load is converted to the buffer operand address. The original target register is then freed for use by later instructions. Alternative to the tag conversion idea is to simply keep the register busy till the cache miss has been rectified, at which point the instructions shunted aside to the wait buffer can resume, but this assumes enough other registers are available so that keeping a number of them busy would not stall later execution because of lack of free tags. The more difficult issue is how to get the stalled instruction back from the wait buffer into the execution pipelines without delaying ongoing execution. Predication tags can be used to deal with branches in a way similar to what was done in the Itanium. Each branch grows its own stack on a common trunk existing before the branch is encountered. As each pushes new operands on the stack, different tags are deployed, corresponding to different CPU registers being allocated to the instructions as destinations. Hoisting load instructions in front of branch needs more complex handling however, because the compiler must first assign a tag to the load as its destination, then move the load forward replacing it in the branch part by a check instruction, making sure that it selects a tag that is free between the branch part and the forward part. This implies that an EPIC machine would need a larger number of registers than the number of tags deployed in a machine using reservations stations. There is also a basic issue here: if the stack is used merely as a way to assign registers to instructions, do we really have a stack architecture at all, or merely an abstract machine design as a means of compiler optimization? The other issue is: EPIC provides a fixed parallel instruction window to reduce the amount of work the hardware does at execution time but there may be periods when not all the slots can be filled; the common data bus approach adapts to varying conditions by buffering instructions in reservation stations for later execution. To some extent, this is also helped by speculative loads when data fetches causing hard to predict delays can be confirmed before use, but at the expense of tying up registers for longer periods. EPIC instructions use registers assigned by compilers, whereas in a reservation station/reorder buffer machine tags are assigned dynamically. If a subroutine call occurs, a tag stack machine would assign unused tags to values loaded from memory and results of instructions as they are pushed on the stack, different from those used in the calling program. It is not necessary to save values waiting in the reorder buffer or in the instructions in the reservation stations to allow the tags to be reused in the subroutine. If registers are assigned at compile time, then it is difficult to ensure that the subroutine uses different registers from all its callers, and saving/restoring registers is generally necessary. As mentioned earlier, the use of a stack architecture as the intermediary abstract machine before compiling into EPIC code, permits a limited form of out-of-order execution through tag conversion. This removes a major problem of EPIC performance due to the uncertain delay for data loads from memory because of possible cache misses. When a load is hoisted before a branch, a check instruction takes its place within the branch thread to confirm a load, so that any exception caused by the pre-fetch takes effect only with the check. In addition to poison flags, the registers can also have ready flags similar to those used in scoreboard machines, so that the check instructions hang till any pending loads complete. However, this hangs the whole pipeline, if we do not send later instructions to reservations stations out of order. More importantly, the use of stack code can provide solution to the frequent vacant slots problem of EPIC/VLIW machines: space is used up to store and time take to fetch instruction bundles with only partly filled slots. The idea of using data compression algorithms to reduce code size stored in memory, to be decompressed when fetched for execution, causes the problem of branch pre-fetching: the branch destination address no longer matches the instruction's location in' the memory. If we store the stack program in the memory, and use the tagging scheme to convert the instructions to the equivalent, virtual register referencing form, then a branch instruction in the pre-conversion code would refer to the actual branch destination address in the memory and the code there can be pre-fetched. In the reorder buffer/ reservation station scheme, the conversion takes place in the execution unit, and registers are virtual since producer instructions send their results directly to consumer instructions. In the EPIC scheme, the conversion occurs before instructions are dispatched for execution, and registers are real, to which producer instructions send results for consumer instructions to retrieve. In short, the relative merits of the real versus virtual registers approaches can only be determined after considerable actual experience, which we hope to gain from the new Itanium as well as its various enhancements. 6. tag retention scheme Our solution to the repeatedly needed variable problem mentioned earlier is to retain reused values by retaining their tags, instead of freeing a tag after use. suppose we want to compute y=(((ax+b)x+c)x+d)x+e the value of x is repeatedly needed; the compiler inserts after the Load x instruction a Retain instruction requesting that tag for the loaded value be retained as Retained Tag 1, and the tag allocated by the execution unit to represent the value, which is placed on the stack after the load instruction has been issued, be copied to entry 1 of the Retained Tags (RT) store; the tag is picked up by the consumer instruction in the usual way and the loaded value will be delivered to the consumer as usual. However, later uses of the value then require the execution of the ReUseTag instruction, which would put a previously allocated tag to be read from the RT store and placed on the tag stack the consumer instruction will pick up the tag and then retrieve the value from the reorder buffer or physical register depending on whether we are using virtual or real registers; each reorder buffer entry will have a Retain flag as well as a Busy flag, and the flag is set when a Retain instruction is executedi on a newly allocated tag which has just been placed on the top of the stack; the flag is cleared, and the tag returned to th free tags set and deleted from the RT store, when a FreeTag instruction is executed the method is slightly more complex if reorder buffer and stack are integrated, though we are not implementing this at the moment: instead of a retained tags store, we will need a retained values store, and value instead of tag is copied from store to stack for re-use; the first load causes the allocated tag to be put into the retained values store, and when the value is read from the memory, it goes to the stack (or consumer instruction if it has already gone to the reservation station) as well as the retained values stack for the previous expression the instructions are: L a L x Retain 1 sets Retain flag for x's ROB entry/register and copies x's tag into entry 1 of RT store * L b + RUT 1 gets tag stored in entry 1 of RT store and puts it on stack * this will consume result of + and retained x value L c + RUT 1 * L d + RUT 1 * L e + S y FT 1 clears Retain flag for x's value and frees tag; entry 1 of RT store no long in use in fact the need for stack folding disappears: if there are two retained values a b whose tags are in RT store entries 0 1, and we want to c = a+b, we simply compile the instructions RUT 0 RUT 1 + S c if the result is not needed in the future but is needed repeatedly now, then instead of storing it, a Retain instruction is executed: RUT 0 RUT 1 + Retain 2 which causes the tag allocated to the result of + to be retained in entry 2 of the RT store and the Retain flag for the reorder buffer entry/register to be set, and RUT 2 can then be used in further execution till the tag is freed 7. Stack and Vector Processing Turning to another front, with the increasing complexity of superscalar microprocessors, it is practical to consider the inclusion of vector support in such systems soon, though one would expect that the methods used in the vector supercomputers [6][7][8][9] need to be modified inorder to be applied to microprocessors. In particular, it may be difficult to adopt the vector registers that such supercomputers use, in view of the demands they make on the chip surface area. It seems more realistic to have schemes whereby source and destination vectors can be accommodated in the cache rather than vector registers. That is, input data to the arithmetic pipelines are pumped out of the cache, and results are returned to the cache, deploying new hardware structures to control the loading of vectors from memory to cache and the buffering of computed results in the cache, with or without writing back to the main memory. As we know, the on-chip caches of current microprocessors are both fast and quite large. It is entirely realistic to expect that caches of future generation processors can accommodate several moderately long vectors, e.g., 8 x 64 floating point numbers without straining the overall capacity. While the speed of the cache might still be lower than that of supercomputer vector registers, this is not a significant drawback since we would expect the microprocessor arithmetic pipelines to be simpler and slower than their supercomputer counterparts too, and pumping vectors from the cache fast enough to fill a computation pipeline seems quite realistic. Before studying the issue further, let us first ask: what is the benefit of having such architectural support? Consider the vector operation C = A + B, which is typically compiled into the following loop: initialization Loop LD R1 A(R0) LD R2 B(R0) ADD R1 R2 ST R2 C(R0) INC R0 CMP R0 Limit BLT Loop Within each iteration, several instructions are fetched and executed, but only one, the ADD, does something productive. Further, the loop ties up several CPU registers (Limit will probably need a register, though this can be avoided by counting downwards), and the code has low instruction level parallelism. Unless the CPU is multithreaded, the code would lead to a period of low utilization in the execution units, because the branch prediction for the BLT would generally predict TRUE, so that the code following the loop would not be executed, even if it has no data dependency on the code in the loop. In contrast, if we could do the following touch A ! to bring vector A into cache from memory touch B init C ! to get cache space for C without memory read VADD A B C scalar code then we not only avoid the high overheads of the add loop, but also make it possible for the subsequent scalar instructions to be dispatched in parallel, since they have no data dependency on the vector instructions. In other words, though higher vector computation speed would be an incentive, there is an equally strong incentive to have architectural support for vector arithmetic in order to improve scalar computation (or rather, to minimize the slowdown caused by vector arithmetic). In machines like the Cray, the content of a vector register represents a vector frame, i.e., a set of elements selected according to some criteria, such as consecutive elements within a longer vector, elements spaced at some constant interval, or a less regular selection pattern using a mask or index vector (Fig 1). A number of frames are brought from the memory into vector registers, for use as source operands to be fed into the arithmetic pipelines, and results are returned to vector registers, before being written back to memory in some cases. If operands and results of vector pipelines are stored in the cache, the CPU would need architectural support to retrieve and store vector frames. We call these frame registers, and a simple frame register structure might have the following parts: |Base|Interval|Index|Limit|Size|...| Such a frame register would be able to specify consecutive or evenly spaced elements of a vector to participate in the execution of a vector instruction, which would have the form op VFx VFy VFz i.e., input frame x and frame y into the op pipeline and return results to frame z. Suppose we have VFx - |10000| 1|ix|64|4|...| VFy - |20000|128|iy|64|4|...| VFz - |30000| 1|iz|64|4|...| then the instruction + VFx VFy VFz fetches 64 32-bit operands from cache locations for addresses 10000 to 10063, and corresponding operands from cache locations for addresses 20000, 20128, 20256, etc., add the two vectors, and store the results in locations for addresses 30000 to 30063, with the index/increment of each frame register used to select the individual elements. But before such an instruction can be executed, the required elements of the two source arrays must be in the cache, and space must be already available for the result array. This is achieved by loading the array information into the frame registers: Touch Vx A Touch Vy B Init Vz C + Vx Vy Vz Touch causes information about an array to be loaded into a frame register, and a series of Read signals are sent to the memory to read one element per cache block that make up this frame. This ensures that all the wanted elements of the vector frame will be in the cache, but which particular elements one reads is unimportant. Init also loads the frame information, but causes one write (the value written is also unimportant) per cache block. In most cache systems, this would not cause any reads from the memory into the cache, since the content of the blocks are being overwritten. The actual execution of + would just involve the two source frame registers sending successive Read operations to the cache, using the information in the register to compute individual addresses, and sending pairs of elements to the vector arithmetic pipe, returning output to the cache with successive Write operations. The important point to note is that having frame registers and vector pipelines does not require new cache and memory mechanisms: they simply operate in the way most cache/ memory systems are designed to work. However, if a frame consists of non-consecutive elements, then a Touch brings into cache all the in-between elements not needed in the frame, and in fact, in a set-associative cache for which the vector frame interval happens to be a multiple of the number of sets, too many blocks might map to the same set. A programmer may have to rearrange array structures to produce the right frames taking part in vector instructions, and to maximize utilization of data read into the cache. Given such architectural support, a vector expression like d = (a+b)*c could be programmed as Touch VF0 A Touch VF1 B Init VF2 Dummy + VF0 VF1 VF2 Touch VF3 C Init VF4 D Wait VF2 * VF2 VF3 VF4 That is, even if there are separate + and * pipelines, the results of the first operation must be returned to the cache in some temporary array, and one need to wait for this vector to become complete, before pumping out the elements to participate in the following step. In other words, the frame registers would have some kind of status flags (Incomplete: index < limit), but instead of having Wait instructions, one could implement the capability in the dispatch unit to hold back an instruction if the input frame registers have the Incomplete status, like in superscalar execution. It is sometimes desirable to store results back in one of the source vectors, e.g. a = a+b. This need to be programmed as Touch VF0 A Touch VF1 B Copy VF0 VF2 + VF0 VF1 VF2 i.e., separate source and destination registers must be used even though they refer to the same frame, since the two have different index values during execution - if the + pipeline has n stages, then while A[i] is being stored, A[i+n] is being fetched, and the fetching and storing have to be controlled separately. Now in the earlier program, if we point to Dummy using two frame registers, one for destination and one for source, it might be possible to fetch the result of the + as soon as they are stored and send them to take part in the * operation, rather than waiting for Dummy to become Complete. This is the idea of chaining used in the Cray. However, it is necessary to impose the restriction that the index in the source frame is always less than the index in the destination frame. This is tricky to implement because such a relation may occur between any pair of frame registers, requiring complex interconnection between them. Another way to introduce chaining is to modify the capabilities of the individual frame registers without involving interconnections. We would like to have a program like: Touch VF0 A Touch VF1 B Touch VF3 C Init VF4 D Reserve VF2 + VF0 VF1 VF2 * VF2 VF3 VF4 Note that VF2 was reserved but is not "engaged" with any vector frame in the cache, in contrast to the other four frame registers, because VF2 is only a pathway between two instructions, the destination of the +, and the source of the *. Like superscalar processors, vector instructions can start and complete out of order if they use separate arithmetic units and have no data dependency. Consider a system with 8 frame registers plus one pipeline each for add, multiply and divide, and the following program a = b+c d = e*f g = h/i The first two statements can execute in parallel, and six frame registers are used, leaving insufficient registers for the third, even though the wanted pipeline is free. So it need to wait for one of the two to finish and reuse a register, but which will finish first? This is unpredictable because it depends on factors like cache misses and previous usage of pipelines. The same problem arises in superscalar machines with respect to CPU registers, but it is possible to provide a large number of registers so that we could usually choose one not in use, but frame registers being complex, it might be difficult to provide a large number of them. Further, supposed we have a multithreaded machine, so that each thread has its own context, requiring separate program counters and registers. But again, the complexity of the frame registers makes it undesirable to provide each thread with a separate set, but if all the thread share the same set, there must be some way of dynamically allocating available registers and establishing the correct links between them. The suitable method turns out to be a stack, which allows all operators to find their operands at the top of the stack, without requiring knowledge of addresses or register numbers. In fact, looking at the final program example in the last section, we may note that the result of each operation only goes to one "consumer", which could be another instruction or the cache, and this fits the stack behaviour that an operator erases its operands, so that each operand is only used once. In this arrangement, each step of a program is allocated a new frame register to channel its result, and the number of the register is left on the stack for the next operator to use as source. For example d = (a+b)*c is first compiled into a stack program Init D Touch A Touch B + Touch C *finish That is, since the frame registers are assigned at runtime by hardware, in the same way as reorder buffer entries in superscalr machines are allocated to instructions that change registers, sourcecode does not need to show any register numbers. Suppose all 8 frame registers are free, then Init D is given VF0, and the number is left on the stack. Touch A gets VF1, and Touch B VF2. + is allocated VF3, but since it requires two source operands, VF2 and VF1 are popped from the stack, leaving VF3 and VF0. The number 3 are inserted into VF1 and VF2 as their destination, while 1 and 2 are inserted into VF3 as its left and right sources. The next instruction, Touch C, is allocated VF4, and * takes VF4 and VF3 as its left/right sources, but being the last in the chain, it does not get a new frame, but takes VF0, the last remaining item on the stack. Hence, 0 is inserted into VF3 and VF4 as their destination, and 3 and 4 appear in VF0 as the left and right sources. Hence we have the frame register contents below: VF0 ..|* |Cache|3 |4 | VF1 ..|Read|3L |Cache|----| VF2 ..|Read|3R |Cache|----| VF3 ..|+ |0L |1 |2 | VF4 ..|Read|0R |Cache|----| To execute in a multithreading machine, with competition for the arithmetic units, chaining requires all the needed units to be reserved before execution can start. Also, the required frames must be in the cache. In the example, VF1, VF2 and VF4 would start Touch operations as soon as frame information is loaded, and after each frame is in the cache, would read the first element, but wait for the consumer to accept it before reading further elements. In the mean time the final instruction, *, makes its reservation, and transmits "ready" to its right source VF3 after verifying that the first element of C from the right source VF4 is already waiting. VF3, after reserving the + pipeline, accepts the first source elements already read out by VF1 and VF2 and send them to the + unit, at the same time causing a new cache read by VF1 and VF2 for the next pair of operands. The + output, when it arrives at VF0, causes the first element of C to be accepted and dispatched to the * pipeline, causing VF4 to read the subsequent element. When the * output is returned to VF0, it is stored into D. In this arrangement, each operator is allocated any free frame register available, transparently to the sourcecode programmer. It does not make any difference whether the frame registers are used by earlier instructions in the same thread or by some other thread. For the example a = b+c, d = e*f, g = h/i, we would have VF0 ..|+ |Cache|1 |2 | VF1 ..|Read |0L |Cache|-----| VF2 ..|Read |0R |Cache|-----| VF3 ..|* |Cache|4 |5 | VF4 ..|Read |3L |Cache|-----| VF5 ..|Read |3R |Cache|-----| VF6 ..| |Cache| | | VF7 ..|Read | |Cache|-----| Stack {VF6,VF7} PC -> Touch i, /finish during the execution of + and * , with only two frame registers allocated to g = h/i for the moment. Suppose * finishes first and VF3 becomes free first, then it is allocated to Touch i so that we have VF0 ..|+ |Cache|1 |2 | VF1 ..|Read |0L |Cache|-----| VF2 ..|Read |0R |Cache|-----| VF3 ..|Read | |Cache|-----| VF4 ..| | | | | VF5 ..| | | | | VF6 ..| |Cache| | | VF7 ..|Read | |Cache|-----| Stack {VF6,VF7,VF3} PC -> /finish The last operator / takes VF7 and VF3 as its operands, but does not get a new frame register, and instead uses the remaining register indicated in the stack, VF6. This is entered into VF3 and VF7 as their destination, giving VF0 ..|+ |Cache|1 |2 | VF1 ..|Read |0L |Cache|-----| VF2 ..|Read |0R |Cache|-----| VF3 ..|Read |6R |Cache|-----| VF4 ..| | | | | VF5 ..| | | | | VF6 ..|* |Cache|7 |3 | VF7 ..|Read |6R |Cache|-----| Stack {} PC -> The stack and the instruction queue are both empty now as there are no more operands waiting to be allocated to instructions, nor instructions remaining in the program. Note that if a scalar is to be combined with a vector, its ID is simply pushed on the stack for the relevant instruction to pick up, in the same way as a frame register ID. In a multithreading system, each thread's context must have its stack along with PC, condition codes, processor status, and other hardware supported information. While the vector frame registers are clearly valuable for pipelined whole vector operations, they can also be used to program the processing of individual vector elements, they contain such needed information as index, limit and increment which are used to access each element and control looping, e.g., to take the absolute value of a vector we would use Set VF0 A Loop VLD R0 VF0 BGE Skip Negate R0 VST R0 VF0 Skip IncIndex VF0 TestIndex VF0 BLT Loop The Vector Load and Vector Store instructions read or write a vector element using the base and Index values in a frame register, and the IncIndex instruction adds the Increment field of a frame register to the Index field, while the TestIndex instruction compares the Index field with the Limit field. Implementing such operations require adding a complete set of frame register instructions alongside general purpose register operations, though the variety of operations and numbers of registers would be smaller. Further, a simplification results from the fact that typical vector operations involve certain common instruction sequences that use the same frame register, e.g., in the above example, VST, IncIndex and TestIndex are all on VF0. This makes it possible to employ the idea of context dependent machine instructions [5][6] that cater for typical sequences of instructions using the same register: Set VF0 A Loop VLD R0 VF0 BGE Skip Negate R0 VST R0 VF0 Skip IncIndex TestIndex BLT Loop We economize on instruction bits by retaining the frame register number in the condition codes each time an instruction carrying a VF number, like VLD or VST, is executed, and following it with other instructions that use the same register. This brings again to the ideas of 10 years ago[5]. We see that there are various levels at which one could commit to cache based vector processing support. We would only expect a very gradual evolvement towards the more radical architectural features in time. Before the proposed methods could be implemented, it would be necessary to study the demand the frame registers place on the cache when executing typical vector instruction sequences, using past experience from Cray and similar machines for guidance. In particular, the set arrangements of the cache may need to be modified to efficiently accommodate the new demands. 7. Discussion We have explored the possibility of superscalar execution of stack programs, and found the situation promising. We believe it is high time that the potential of stack machine be considered again because of their ability for supporting well structured programs that produce compact object code. We are aware of the shortcomings of stack machines for array processing, but feel solutions for this can also be found[3]. With the maturing of the current technology for superscalar processors, injection of new ideas is called for and stack architectures should be seriously looked at. The execution behaviour of a stack program with reorder buffer is data driven, and the buffer plays a role similar to that of a data token matching unit of dataflow machines. There is however no reuse of an operator with different pairs of matched data tokens. The possibility of array operators that receive pairs of vector addresses, and initiating whole vector operations through pipelines, is another idea to be further explored. References [1] E I Organick, Computer System Organization, Academic Press, 1973. [2] P M Kogge, The Architecture of Symbolic Computers, McGraw-Hill, 1991. [3] W F Wong and C K Yuen, SARC - A stack and register computer, International Symposium on Computer Architecture and Digital Signal Processing, October 1989, Hong Kong, pp. 194-199. 00000xxx set current frame no. to xxx 00001xxx compare current index with xxx index leave result on stack (current unchanged; use with BGT/BLT/BEQ) 00010xxx load element to stack from frame xxx, base+index*size, sets current to xxx 00011xxx store element from stack to frame xxx 00100000 add current frame inc to index 00100001 cmp index to limit leaving result on stack 00100010 subtract current frame inc from index 00100011 store from stack to current frame index 00100100 load current frame index to stack 00100101 load current frame no. to stack 00100110 load previous frame no. 00100111 switch current and previous frame nos 00101000 set current to global frame 00101001 set current to caller frame 00101010 set current to host frame 00101011 set current to own stack frame 00101100+B load word to stack using byte offset from current frame base 00101101+B store with byte offset 00101110+H load with 16 bit offset 00101111+H store with 16 bit offset 00110000+W change current frame to new base & clear index 00110001+W change to new base & index=limit 00110010+H add H to base (i.e., change to another array/record in same frame) 00110011+B add B to base 00110100 stack a new frame(add limit to base saving old base, set frame using infomation on stack) 00110101 pop frame (recover old base and limit=current-old) 00110110+B chain frame (get new base from (base+B) ) 00110111+H chain frame (get new base from (base+H) ) 00111000+B load next frame base (from base+B) to stack 00111001+B load next frame base with current index, limit, inc/size to stack 00111010 load current frame information to stack 00111011 load current frame information setting index=0 00111100 store info on stack into current frame 00111101 store info on stack into current frame setting index=limit 00111110 load info from frame addressed from stack 00111111 make frame using frame information on stack 0100000x LDI x (0 or 1; F/T if used for BT/BF or Boolean op) 01000010+B LDI B with leading 0s 01000011+B LDI B not changing rest of word 01000100+H LDI H with leading 0s 01000101+H LDI H not changing leading half 01000110+W LDI W 01000111+D LDI D 010010xx+F LD from Full Address, size xx=00 B, 01 H, 10 W, 11 D 010011xx+F ST to F size as above 0101000x add integer, size x=0 W; 1 D 0101001x sub 01010100 mul (W operands D result) 01010101 div (W dividend, divisor, quotient, remainder) 01010110 inc 01010111 dec 01011000 = integers to Boolean (use with BT/BF) 01011001 /= 01011010 > 01011011 < 01011100 >= 01011101 <= 01011110 same sign 01011111 diff sign 0110000x fadd 0110001x fsub 0110010x fmul 0110011x fdiv 01101ccc fcmpW (ccc: = /= > < >= <= ss ds) float to Boolean 01110ccc fcmpD double float to Boolean 01111000 divD (Integer D dividend, quotient; W divisor, remainder) 01111001 neg integer 01111010 AND on all 32 bits 01111011 OR 01111100 XOR 01111101 EQ 01111110 Invert 01111111 mask 1000000x+A BR T/F or 1/0 A is byte offset 1000001x+H BR T/F or 1/0 16 bit offset 100001bx+A BGT/LT integer, x: single/double 100010bx+A BGE/LE 100011bx+A BEQ/NE 1001000x+A DFBGT/BLT testing DFP mantissa sign 10010ccc+A FPB (ccc 010-111 GT LT GE LE EQ NE) 100110xx+B+A BR if B matches byte selected by xx 1001110x+H+A BR if H matches halfword selected by x 10011110+A BR 10011111+F Long Jump 1010000000xbbbbb+A BR if bit bbbbb = x 1010000001xxbbbb+A BR if bit pair = xx 101000001xxxxbbb+A BR if hex digit = xxxx 101000xx+B Trap (xx=01-11: type of trap; B selects interrupt vector) 101001xx+AA JSR (xx: 00 AA - B; 01 H; 10 W; 11 full) 10101000 Return from subroutine 10101001 RTI 101xxxxx unspecified xxxxx > 9 110000000sdxxxxx shift d: l/r; s: W/D 110000001sdxxxxx arithmetic shift 110000010sdxxxxx rotate 110000011sdxxxxx rotate with carry 11000xxx erase xxx-1 words xxx = 010-111 11001000 reverse top 2 words 11001001 reverse top 2 D words 1100101d cycle top 3 D word d: direction 110011xd cycle top x+3 words d: direction 110100xx replicate stack top word xx+1 times 110101xx replicate stack top D word xx+1 times 11011000 duplicate halfword within one word 11011001 quadruplicate byte within one word 11011010 split word into four bytes each occupying word 11011011 split word into two half words each occupying word 110111xx count 1 bits in byte, halfword, word, D word 11100000 set all current frame elements to = operand on stack 11100001 search current frame using operand on stack return index 11100010 search current frame and return total no. found 11100011 search current frame and return T/F bits on stack with limit 11100100 copy to current frame from previous frame returning smaller limit 11100101 compare current/previous frames and return index of first = 11100110 compare frames and return index of first /= 11100111 compare frames and return T/F bits plus smaller limit 111010xx add/sub/mul/div to current frame from previous frame returning smaller limit to stack 111011xx floating add/sub/mul/div to current frame from previous returning smaller limit to stack 11110000 000fffdl use displacement from stack with frame base fff to load (l=0) or store (l=1) element; d 0 B 1 H 001DDfdl f 0 current frame 1 previous frame DD - data size 01DDssdl ss 00 own frame 01 host 10 caller 11 global 1111xxxx unspecified The machine is assume to have 8 vector frame registers, and instruction have been provided to bring data to the top of stack for processing from either vectors or records. Registers pointing to the top of stack, the current stack frame, and the frames of the caller, the global code segment and the code segment that encloses the current segment, are also assumed to exist. All the opcodes are byte sized, with a limited number that require immediate operands, addresses or parameters extending the instructions beyond one byte.