FOR EACH PAPER. Summary: Problem paper is trying to solve, key ideas/insights, mechanism, implementation. You will include key results and...


FOR EACH PAPER.
1. Summary: Problem paper is trying to solve, key ideas/insights, mechanism, implementation. You will include key results and implementations.
2. Strengths: Most important ones, does it solve the problem well?

Coming Challenges in Microarchitecture and Architecture

RONNY RONEN, SENIOR MEMBER, IEEE, AVI MENDELSON, MEMBER, IEEE, KONRAD LAI, SHIH-LIEN LU, MEMBER, IEEE, FRED POLLACK, AND JOHN P. SHEN, FELLOW, IEEE

Invited Paper

In the past several decades, the world of computers and especially that of microprocessors has witnessed phenomenal advances. Computers have exhibited ever-increasing performance and decreasing costs, making them more affordable and, in turn, accelerating additional software and hardware development that fueled this process even more. The technology that enabled this exponential growth is a combination of advancements in process technology, microarchitecture, architecture, and design and development tools. While the pace of this progress has been quite impressive over the last two decades, it has become harder and harder to keep up this pace. New process technology requires more expensive megafabs and new performance levels require larger die, higher power consumption, and enormous design and validation effort. Furthermore, as CMOS technology continues to advance, microprocessor design is exposed to a new set of challenges. In the near future, microarchitecture has to consider and explicitly manage the limits of semiconductor technology, such as wire delays, power dissipation, and soft errors. In this paper, we describe the role of microarchitecture in the computer world, present the challenges ahead of us, and highlight areas where microarchitecture can help address these challenges.

Keywords: Design tradeoffs, microarchitecture, microarchitecture trends, microprocessor, performance improvements, power issues, technology scaling.

I. INTRODUCTION

Microprocessors have gone through significant changes during the last three decades; however, the basic computational model has not changed much. A program consists of instructions and data. The instructions are encoded in a specific instruction set architecture (ISA). The computational model is still a single instruction stream, sequential execution model, operating on the architecture states (memory and registers). It is the job of the microarchitecture, the logic, and the circuits to carry out this instruction stream in the "best" way. "Best" depends on intended usage (servers, desktop, and mobile), usually categorized as market segments. For example, servers are designed to achieve the highest performance possible while mobile systems are optimized for best performance for a given power. Each market segment has different features and constraints.

Manuscript received January 1, 2000; revised October 1, 2000.
R. Ronen and A. Mendelson are with the Microprocessor Research Laboratories, Intel Corporation, Haifa 31015, Israel.
K. Lai and S.-L. Lu are with the Microprocessor Research Laboratories, Intel Corporation, Hillsboro, OR 97124 USA.
F. Pollack and J. P. Shen are with the Microprocessor Research Laboratories, Intel Corporation, Santa Clara, CA 95052 USA.
Publisher Item Identifier S 0018-9219(01)02069-2.

A. Fundamental Attributes

The key metrics for characterizing a microprocessor include: performance, power, cost (die area), and complexity.

Performance is measured in terms of the time it takes to complete a given task. Performance depends on many parameters such as the microprocessor itself, the specific workload, system configuration, compiler optimizations, operating systems, and more.
A concise characterization of microprocessor performance was formulated by a number of researchers in the 1980s; it has come to be known as the "iron law" of central processing unit performance and is shown below:

$$\text{Performance} = \frac{1}{\text{Execution Time}} = \frac{\text{IPC} \times \text{Frequency}}{\text{Instruction Count}}$$

where IPC is the average number of instructions completed per cycle, Frequency is the number of clock cycles per second, and Instruction Count is the total number of instructions executed. Performance can be improved by increasing IPC and/or frequency or by decreasing instruction count. In practice, IPC varies depending on the environment: the application, the system configuration, and more. Instruction count depends on the ISA and the compiler used. For a given executable program, where the instruction stream is invariant, the relative performance depends only on IPC × Frequency. Performance here is measured in million instructions per second (MIPS).

0018-9219/01$10.00 © 2001 IEEE
PROCEEDINGS OF THE IEEE, VOL. 89, NO. 3, MARCH 2001

Commonly used benchmark suites have been defined to quantify performance. Different benchmarks, such as SPEC [1] and SysMark [2], target different market segments. A benchmark suite consists of several applications. The time it takes to complete this suite on a certain system reflects the system performance.

Power is energy consumption per unit time, in watts. Higher performance requires more power. However, power is constrained due to the following.
• Power density and Thermal: The power dissipated by the chip per unit area is measured in watts/cm². Increases in power density cause more heat to be generated. In order to keep transistors within their operating temperature range, the heat generated has to be dissipated from the source in a cost-effective manner. Power density may soon limit performance growth due to thermal dissipation constraints.
• Power Delivery: Power must be delivered to a very large scale integration (VLSI) component at a prescribed voltage and with sufficient amperage for the component to run.
Very precise voltage regulator/transformer controls current supplies that can vary within nanoseconds. As the current increases, the cost and complexity of these voltage regulators/transformers increase as well.
• Battery Life: Batteries are designed to support a certain number of watt-hours. The higher the power, the shorter the time that a battery can operate.

Until recently, power efficiency was a concern only in battery powered systems like notebooks and cell phones. Recently, increased microprocessor complexity and frequency have caused power consumption to grow to the level where power has become a first-order issue. Today, each market segment has its own power requirements and limits, making power limitation a factor in any new microarchitecture. Maximum power consumption increases with the microprocessor operating voltage ($V_{cc}$) and frequency (Frequency) as follows:

$$\text{Power} = C \times V_{cc}^{2} \times \text{Frequency}$$

where $C$ is the effective load capacitance of all devices and wires on the microprocessor. Within some voltage range, frequency may go up with supply voltage ($\text{Frequency} \propto V_{cc}$). This is a good way to gain performance, but power is also increased (proportional to $V_{cc}^{3}$). Another important power related metric is the energy efficiency. Energy efficiency is reflected by the performance/power ratio and measured in MIPS/watt.

Cost is primarily determined by the physical size of the manufactured silicon die. Larger area means higher (even more than linear) manufacturing cost. Bigger die area usually implies higher power consumption and may potentially imply lower frequency due to longer wires. Manufacturing yield also has direct impact on the cost of each microprocessor.

Complexity reflects the effort required to design, validate, and manufacture a microprocessor. Complexity is affected by the number of devices on the silicon die and the level of aggressiveness in the performance, power and die area targets. Complexity is discussed only implicitly in this paper.

B. Enabling Technologies

The microprocessor revolution owes its phenomenal growth to a combination of enabling technologies: process technology, circuit and logic techniques, microarchitecture, architecture (ISA), and compilers.

Process technology is the fuel that has moved the entire VLSI industry and the key to its growth. A new process generation is released every two to three years. A process generation is usually identified by the length of a metal-oxide-semiconductor gate, measured in micrometers ($10^{-6}$ m, denoted as µm). The most advanced process technology today (year 2000) is 0.18 µm [3].

Every new process generation brings significant improvements in all relevant vectors. Ideally, process technology scales all physical dimensions of devices (transistors) and wires (interconnects), including those vertical to the surface, and all voltages pertaining to the devices by a factor of 0.7 [4]. With such scaling, typical improvement figures are the following:
• 1.4-1.5 times faster transistors;
• two times smaller transistors;
• 1.35 times lower operating voltage;
• three times lower switching power.

Theoretically, with the above figures, one would expect potential improvements such as the following.
• Ideal Shrink: Use the same number of transistors to gain 1.5 times performance, two times smaller die, and two times less power.
• Ideal New Generation: Use two times the number of transistors to gain three times performance with no increase in die size and power.

In both ideal scenarios, there is three times gain in MIPS/watt and no change in power density (watts/cm²).

In practice, it takes more than just process technology to achieve such performance improvements and usually at much higher costs. However, process technology is the single most important technology that drives the microprocessor industry.
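As a rough check on the ideal scaling figures above, the following Python sketch plugs the quoted per-generation factors into the relation Power = C × Vcc² × Frequency. It is only an illustration: the function and variable names are made up for this example, and the assumption that effective capacitance shrinks by the same 0.7 factor as the linear dimensions is ours, not a figure stated in the paper.

```python
def dynamic_power(capacitance, vcc, frequency):
    """Switching power from the relation Power = C * Vcc^2 * Frequency."""
    return capacitance * vcc ** 2 * frequency


# Baseline generation, normalized to 1.0 in every quantity.
base_power = dynamic_power(1.0, 1.0, 1.0)

# Per-generation factors. VCC_SCALE and FREQ_SCALE use the figures quoted
# in the text; CAP_SCALE is an assumption of this sketch.
CAP_SCALE = 0.7        # assumed: capacitance tracks the 0.7 linear shrink
VCC_SCALE = 1 / 1.35   # 1.35 times lower operating voltage
FREQ_SCALE = 1.5       # 1.5 times faster transistors

# "Ideal shrink": the same number of transistors on the new process.
shrink_power = dynamic_power(CAP_SCALE, VCC_SCALE, FREQ_SCALE)
shrink_perf = FREQ_SCALE  # same IPC assumed, so performance follows frequency
shrink_mips_per_watt = shrink_perf / (shrink_power / base_power)

# Switching energy per transition is C * Vcc^2 (power with frequency removed).
switching_energy = CAP_SCALE * VCC_SCALE ** 2

print(f"relative power:        {shrink_power:.2f}")
print(f"relative MIPS/watt:    {shrink_mips_per_watt:.2f}")
print(f"relative switch energy: {switching_energy:.2f}")
```

Under these assumptions the results land close to the rounded figures in the text: power drops to a bit under 0.6 of the baseline, while switching energy falls and MIPS/watt rises by a factor between 2.5 and 3.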
Growing 1000 times in frequency (from 1 MHz to 1 GHz) and integration (from 10k to 10M devices) in 25 years would not have been possible without process technology improvements.

Innovative circuit implementations can provide better performance or lower power. New logic families provide new methods to realize logic functions more effectively.

Microarchitecture attempts to increase both IPC and frequency. A simple frequency boost applied to an existing microarchitecture can potentially reduce IPC and thus does not achieve the expected performance increase. For example, memory access latency does not scale with microprocessor frequency. Microarchitecture techniques such as caches, branch prediction, and out-of-order execution can increase IPC. Other microarchitecture ideas, most notably pipelining, help to increase frequency beyond the increase provided by process technology.

Fig. 1. Impact of different pipeline stalls on the execution flow.

Modern architecture (ISA) and good optimizing compilers can reduce the number of dynamic instructions executed for a given program. Furthermore, given knowledge of the underlying microarchitecture, compilers can produce optimized code that leads to higher IPC.

This paper deals with the challenges facing architecture and microarchitecture aspects of microprocessor design. A brief tutorial/background on traditional microarchitecture is given in Section II, focusing on frequency and IPC tradeoffs. Section III describes the past and current trends in microarchitecture and explains the limits of the current approaches and the new challenges. Section IV suggests potential microarchitectural solutions to these challenges.

II. MICROARCHITECTURE AT A GLANCE

Microprocessor performance depends on its frequency and IPC. Higher frequency is achieved with process, circuit, and microarchitectural improvements. New process technology reduces gate delay time, thus cycle time, by 1.5 times.
Microarchitecture affects frequency by reducing the amount of work done in each clock cycle, thus allowing shortening of the clock cycle.

Microarchitects tend to divide the microprocessor's functionality into three major components [5].
• Instruction Supply: Fetching instructions, decoding them, and preparing them for execution;
• Execution: Choosing instructions for execution, performing actual computation, and writing results;
• Data Supply: Fetching data from the memory hierarchy into the execution core.

A rudimentary microprocessor would process a complete instruction before starting a new one. Modern microprocessors use pipelining. Pipelining breaks the processing of an instruction into a sequence of operations, called stages. For example, in Fig. 1, a basic four-stage pipeline breaks the instruction processing into fetch, decode, execute, and write-back stages. A new instruction enters a stage as soon as the previous one completes that stage. A pipelined microprocessor with $N$ pipeline stages can overlap the processing of $N$ instructions in the pipeline and, ideally, can deliver $N$ times the performance of a nonpipelined one.

Pipelining is a very effective technique. There is a clear trend of increasing the number of pipe stages and reducing the amount of work per stage. Some microprocessors (e.g., Pentium Pro microprocessor [6]) have more than ten pipeline stages. Employing many pipe stages is sometimes termed deep pipelining or super pipelining.

Unfortunately, the number of pipeline stages cannot increase indefinitely.
• There is a certain clocking overhead associated with each pipe stage (setup and hold time, clock skew).
As cycle time becomes shorter, further increase in pipeline length can actually decrease performance [7].
• Dependencies among instructions can require stalling certain pipe stages and result in wasted cycles, causing performance to scale less than linearly with the number of pipe stages.

For a given partition of pipeline stages, the frequency of the microprocessor is dictated by the latency of the slowest pipe stage. More expensive logic and circuit optimizations help to accelerate the speed of the logic within the slower pipe stage, thus reducing the cycle time and increasing frequency without increasing the number of pipe stages.

Table 1. Out-Of-Order Execution Example

It is not always possible to achieve linear performance increase with deeper pipelines. First, scaling frequency linearly with the number of stages requires good balancing of the overall work among the stages, which is difficult to achieve. Second, with deeper pipes, the number of wasted cycles, termed pipe stalls, grows. The main reasons for stalls are resource contention, data dependencies, memory delays, and control dependencies.
• Resource contention causes pipeline stall when an instruction needs a resource (e.g., execution unit) that is currently being used by another instruction in the same cycle.
• Data dependency occurs when the result of one instruction is needed as a source operand by another instruction. The dependent instruction has to wait (stall) until all its sources are available.
• Memory delays are caused by memory related data dependencies, sometimes termed load-to-use delays. Accessing memory can take between a few cycles to hundreds of cycles, possibly requiring stalling the pipe until the data arrives.
• Control dependency stalls occur when the control flow of the program changes. A branch instruction changes the address from which the next instruction is fetched.
The pipe may stall and instructions are not fetched until the new fetch address is known.

Fig. 1 shows the impact of different pipeline stalls on the execution flow within the pipeline.

In a 1-GHz microprocessor, accessing main memory can take about 100 cycles. Such accesses may stall a pipelined microprocessor for many cycles and seriously impact the overall performance. To reduce memory stalls at a reasonable cost, modern microprocessors take advantage of the locality of references in the program and use a hierarchy of memory components. A small, fast, and expensive (in $/bit) memory called a cache is located on-die and holds frequently used data. A somewhat bigger, but slower and cheaper cache may be located between the microprocessor and the system bus, which connects the microprocessor to the main memory. The main memory is yet slower, but bigger and inexpensive.

Initially, caches were small and off-die; but over time, they became bigger and were integrated on chip with the microprocessor. Most advanced microprocessors today employ two levels of caches on chip. The first level is 32-128 kB; it typically takes two to three cycles to access and typically catches about 95% of all accesses. The second level is 256 kB to over 1 MB; it typically takes six to ten cycles to access and catches over 50% of the misses of the first level. As mentioned, off-chip memory accesses may take about 100 cycles.

Note that a cache miss that eventually has to go to the main memory can take about the same amount of time as executing 100 arithmetic and logic unit (ALU) instructions, so the structure of memory hierarchy has a major impact on performance. Much work has been done in improving cache performance. Caches are made bigger and heuristics are used to make sure the cache contains those portions of memory that are most likely to be used [8], [9].

Change in the control flow can cause a stall. The length of the stall is proportional to the length of the pipe.
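The cache figures quoted earlier in this section can be combined into an average access time per memory reference. The sketch below is a back-of-the-envelope model rather than anything taken from the paper: the function name is hypothetical, it assumes a miss pays the latency of the level it missed in plus the latency of the next level, and it picks mid-range values from the quoted ranges (a 3-cycle first level with a 95% hit rate, an 8-cycle second level catching 50% of first-level misses, and 100 cycles for main memory).

```python
def average_access_cycles(l1_cycles, l1_hit, l2_cycles, l2_hit, mem_cycles):
    """Average cycles per access for a two-level cache in front of memory.

    Assumes every access pays the L1 latency, L1 misses additionally pay
    the L2 latency, and L2 misses additionally pay the memory latency.
    """
    l1_miss_cost = l2_cycles + (1.0 - l2_hit) * mem_cycles
    return l1_cycles + (1.0 - l1_hit) * l1_miss_cost


# Illustrative mid-range values chosen from the ranges quoted in the text.
two_level = average_access_cycles(3, 0.95, 8, 0.50, 100)

# Same first-level cache with no second level: every L1 miss goes to memory.
one_level = average_access_cycles(3, 0.95, 0, 0.0, 100)

print(f"two-level hierarchy: {two_level:.1f} cycles per access")
print(f"single-level cache:  {one_level:.1f} cycles per access")
```

Even with a 95% first-level hit rate, the rare trips to main memory dominate the average in this model, which is why adding a second level pays off.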
In a super-pipelined machine, this stall can be quite long. Modern microprocessors partially eliminate these stalls by employing a technique called branch prediction. When a branch is fetched, the microprocessor speculates the direction (taken/not taken) and the target address where a branch will go and starts speculatively executing from the predicted address. Branch prediction uses both static and runtime information to make its predictions. Branch predictors today are very sophisticated. They use an assortment of per-branch (local) and all-branches (global) history information and can correctly predict over 95% of all conditional branches [10], [11]. The prediction is verified when the predicted branch reaches the execution stage and if found wrong, the pipe is flushed and instructions are fetched from the correct target, resulting in some performance loss. Note that when a wrong prediction is made, useless work is done on processing instructions from the wrong path.

The next step in performance enhancement beyond pipelining calls for executing several instructions in parallel. Instead of "scalar" execution, where in each cycle only one instruction can be resident in each pipe stage, superscalar execution is used, where two or more instructions can be at the same pipe stage in the same cycle. Superscalar designs require significant replication of resources in order to support the fetching, decoding, execution, and writing back of multiple instructions in every cycle. Theoretically, an $N$-way superscalar pipelined microprocessor can improve performance by a factor of $N$ over a standard scalar pipelined microprocessor. In practice, the speedup is much smaller. Interinstruction dependencies and resource contentions can stall the superscalar pipeline.

The microprocessors described so far execute instructions in-order. That is, instructions are executed in the program order. In an in-order processing, if an instruction cannot continue, the entire machine stalls.
For example, a cache miss delays all following instructions even if they do not need the results of the stalled load instruction. A major breakthrough in boosting IPC is the introduction of out-of-order execution, where instruction execution order depends on data flow, not on the program order. That is, an instruction can execute if its operands are available, even if previous instructions are still waiting. Note that instructions are still fetched in order. The effect of superscalar and out-of-order processing is shown in an example in Table 1 where two memory words mem1 and mem3 are copied into two other memory locations mem2 and mem4.

Out-of-order processing hides some stalls. For example, while waiting for a cache miss, the microprocessor can execute newer instructions as long as they are independent of the load instructions. A superscalar out-of-order microprocessor can achieve higher IPC than a superscalar in-order microprocessor. Out-of-order execution involves dependency analysis and instruction scheduling. Therefore, it takes a longer time (more pipe stages) to process an instruction in an out-of-order microprocessor. With a deeper pipe, an out-of-order microprocessor suffers more from branch mispredictions. Needless to say, an out-of-order microprocessor, especially a wide-issue one, is much more complex and power hungry than an in-order microprocessor [12].

Fig. 2. Processor frequencies over years. (Source: V. De, Intel, ISLPED, Aug. 1999.)

Historically, there were two schools of thought on how to achieve higher performance. The "Speed Demons" school focused on increasing frequency. The "Brainiacs" focused on increasing IPC [13], [14]. Historically, DEC Alpha [15] was an example of the superiority of "Speed Demons" over the "Brainiacs." Over the years, it has become clear that high performance must be achieved by progressing in both vectors (see Fig. 4).

To complete the picture, we revisit the issues of performance and power.
A microprocessor consumes a certain amount of energy, $\varepsilon$, in processing an instruction. This amount increases with the complexity of the microprocessor. For example, an out-of-order microprocessor consumes more energy per instruction than an in-order microprocessor.

When speculation is employed, some processed instructions are later discarded. The ratio of useful to total number of processed instructions is $\eta$. The total IPC including speculated instructions is therefore $\text{IPC}/\eta$. Given these observations a number of conclusions can be drawn. The energy per second, hence power, is proportional to the amount of processed instructions per second and the amount of energy consumed per instruction, that is $(\text{IPC}/\eta) \times \text{Frequency} \times \varepsilon$. The energy efficiency, measured in MIPS/watt, is proportional to $\eta/\varepsilon$. This value deteriorates as speculation increases and complexity grows.

One main goal of microarchitecture research is to design a microprocessor that can accomplish a group of tasks (applications) in the shortest amount of time while using minimum amount of power and incurring the least amount of cost. The design process involves evaluating many parameters and balancing these three targets optimally with given process and circuit technology.

III. MICROPROCESSORS: CURRENT TRENDS AND CHALLENGES

In the past 25 years, chip density and the associated computer industry have grown at an exponential rate. This phenomenon is known as "Moore's Law" and characterizes almost every aspect of this industry, such as transistor density, die area, microprocessor frequency, and power consumption. This trend was possible due to the improvements in fabrication process technology and microprocessor microarchitecture. This section focuses on the architectural and the microarchitectural improvements over the years and elaborates on some of the current challenges the microprocessor industry is facing.

A. Improving Performance

As stated earlier, performance can be improved by increasing IPC and/or frequency or by decreasing the instruction count.
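To make the relations recalled here concrete, the following Python sketch restates the iron law from Section I and the energy-efficiency argument from the end of Section II. It is a minimal illustration under stated assumptions: all function and parameter names are hypothetical, `useful_fraction` stands for the ratio of useful to total processed instructions, `energy_per_instr_j` for the energy consumed per processed instruction, and the numeric inputs are invented values rather than measurements from the paper.

```python
def execution_time_s(instruction_count, ipc, frequency_hz):
    """Iron law: execution time = instruction count / (IPC * frequency)."""
    return instruction_count / (ipc * frequency_hz)


def mips(ipc, frequency_hz):
    """Useful instructions completed per second, in millions."""
    return ipc * frequency_hz / 1e6


def power_w(ipc, frequency_hz, useful_fraction, energy_per_instr_j):
    """Power = (processed instructions per second) * (energy per instruction).

    Processed instructions include speculated ones that are later
    discarded, hence the division by useful_fraction.
    """
    return (ipc / useful_fraction) * frequency_hz * energy_per_instr_j


def mips_per_watt(ipc, frequency_hz, useful_fraction, energy_per_instr_j):
    """Energy efficiency; reduces to useful_fraction / energy per instruction."""
    return mips(ipc, frequency_hz) / power_w(
        ipc, frequency_hz, useful_fraction, energy_per_instr_j
    )


# Invented example: 1e9 instructions, IPC of 1.5, 1 GHz clock,
# 80% of processed instructions useful, 10 nJ per processed instruction.
time_s = execution_time_s(1e9, 1.5, 1e9)
efficiency = mips_per_watt(1.5, 1e9, 0.8, 10e-9)

# Doubling IPC alone leaves MIPS/watt unchanged in this model, because
# both performance and power scale with IPC.
efficiency_wide = mips_per_watt(3.0, 1e9, 0.8, 10e-9)
```

In this model IPC and frequency cancel out of the efficiency ratio, so MIPS/watt changes only when the fraction of useful work or the energy per instruction changes.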
Several architecture directions have been taken to improve performance. Reduced instruction set computer (RISC) architecture seeks to increase both frequency and IPC via pipelining and use of cache memories at the expense of increased instruction count. Complex instruction set computer (CISC) microprocessors employ RISC-like internal representation to achieve higher frequency while maintaining lower instruction count. Recently, the very long instruction word (VLIW) [16] concept was revived with the Explicitly Parallel Instruction Computing (EPIC) [17]. EPIC uses the compiler to schedule instructions statically. Exploiting parallelism statically can enable simpler control logic and help EPIC to achieve higher IPC and higher frequency.

1) Improving Frequency via Pipelining: Process technology and microarchitecture innovations enable the frequency to double every process generation. Fig. 2 presents the contribution of both: as the process improves, the frequency increases and the average amount of work done in pipeline stages decreases. For example, the number of gate delays per pipe stage was reduced by about three times over a period of ten years. Reducing the stage length is achieved by improving design techniques and increasing the number of stages in the pipe. While in-order microprocessors used four to five pipe stages, modern out-of-order microprocessors can use over ten pipe stages. With frequencies higher than 1 GHz, we can expect over 20 pipeline stages.

Fig. 3. Frequency and performance improvements: synthetic model. (Source: E. Grochowski, Intel, 1997.)

Improvement in frequency does not always improve performance. Fig. 3 measures the impact of increasing the number of pipeline stages on performance using a synthetic model of an in-order superscalar machine. Performance scales less than frequency (e.g., going from 6 to 12 stages yields only a 1.75 times speedup, from 6 to 23 yields only 2.2 times).
Performance improves less than linearly due to cache misses and branch mispredictions. There are two interesting singular points in the graph that deserve special attention. The first (at pipeline depth of 13 stages) reflects the point where the cycle time becomes so short that two cycles are needed to reach the first level cache. The second (at pipeline depth of 24 stages) reflects the point where the cycle time becomes extremely short so that two cycles are needed to complete even a simple ALU operation. Increasing the latency of basic operations introduces more pipeline stalls and impacts performance significantly. Please note that these trends are true for any pipeline design though the specific data points may vary depending on the architecture and the process. In order to keep the pace of performance growth, one of the main challenges is to increase the frequency without negatively impacting the IPC. The next sections discuss some IPC related issues.

2) Instruction Supply Challenges: The instruction supply is responsible for feeding the pipeline with useful instructions. The rate of instructions entering the pipeline depends on the fetch bandwidth and the fraction of useful instructions in that stream. The fetch rate depends on the effectiveness of the memory subsystem and is discussed later along with data supply issues. The number of useful instructions in the instruction stream depends on the ISA and the handling of branches. Useless instructions result from: 1) control flow change within a block of fetched instructions, leaving the rest of the cache block unused; and 2) branch misprediction, which brings in instructions from the wrong path that are later discarded. On average, a branch occurs every four to five instructions. Hence, appropriate fetch bandwidth and accurate branch prediction are crucial.

Once instructions are fetched into the machine they are decoded. RISC architectures, using fixed length instructions, can easily decode instructions in parallel.
Parallel decoding is a major challenge for CISC architectures, such as IA32, that use variable length instructions. Some implementations [18] use speculative decoders to decode from several potential instruction addresses and later discard the wrong ones; others [19] store additional information in the instruction cache to ease decoding. Some IA32 implementations (e.g., the Pentium II microprocessor) translate the IA32 instructions into an internal representation (micro-operations), allowing the internal part of the microprocessor to work on simple instructions at high frequency, similar to RISC microprocessors.

3) Efficient Execution: The front-end stages of the pipeline prepare the instructions in either an instruction window [20] or reservation stations [21]. The execution core schedules and executes these instructions. Modern microprocessors use multiple execution units to increase parallelism. Performance gain is limited by the amount of parallelism found in the instruction window. The parallelism in today's machines is limited by the data dependencies in the program and by memory delays and resource contention stalls.

Fig. 4. Landscape of microprocessor families.

Studies show that in theory, high levels of parallelism are achievable [22]. In practice, however, this parallelism is not realized, even when the number of execution units is abundant. More parallelism requires higher fetch bandwidth, a larger instruction window, and a wider dependency tracker and instruction scheduler. Enlarging such structures involves polynomial complexity increase for less than a linear performance gain (e.g., scheduling complexity is on the order of $O(N^2)$ of the scheduling window size [23]). VLIW architectures [16] such as IA64 EPIC [17] avoid some of this complexity by using the compiler to schedule instructions.

Accurate branch prediction is critical for deep pipelines in reducing misprediction penalty. Branch predictors have become larger and more sophisticated.
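As a point of reference for the predictor designs discussed in this section, the simplest runtime scheme is a table of 2-bit saturating counters. The Python sketch below is a generic textbook-style illustration, not a model of any particular microprocessor: the class and function names are hypothetical, and indexing the table by the branch address modulo the table size, the 256-entry default, and the initial counter value are all assumptions made for this example.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters (0-1 predict not taken, 2-3 taken)."""

    def __init__(self, entries=256):
        self.entries = entries
        self.counters = [1] * entries  # assumed start: weakly not taken

    def _index(self, branch_address):
        # Assumed indexing scheme: low-order part of the branch address.
        return branch_address % self.entries

    def predict(self, branch_address):
        return self.counters[self._index(branch_address)] >= 2

    def update(self, branch_address, taken):
        i = self._index(branch_address)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def accuracy(predictor, branch_address, outcomes):
    """Fraction of outcomes predicted correctly, updating after each branch."""
    correct = 0
    for taken in outcomes:
        if predictor.predict(branch_address) == taken:
            correct += 1
        predictor.update(branch_address, taken)
    return correct / len(outcomes)


# A loop-closing branch: taken nine times, then not taken, repeated ten times.
loop_outcomes = ([True] * 9 + [False]) * 10
loop_accuracy = accuracy(TwoBitPredictor(), 0x400, loop_outcomes)
print(f"loop branch accuracy: {loop_accuracy:.2f}")
```

After warm-up, this predictor misses only the loop exit, because a single not-taken outcome moves the counter from strongly taken to weakly taken without flipping the prediction. Using local or global history, as the larger predictors described here do, is what allows such regular exits to be predicted as well.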
The Pentium microprocessor [18] uses 256 entries of 2-bit predictors (the predictor and the target arrays consume 15 kB) that achieve 85% correct prediction rate. The Pentium III microprocessor [24] uses 512 entries of two-level local branch predictor (consuming 30 kB) and yields 90% prediction rate. The Alpha 21264 [25] uses a hybrid multilevel selector predictor with 5120 entries (consuming 80 kB) and achieves 94% accuracy.

As pipelines become deeper and fetch bandwidth becomes wider, microprocessors will have to predict multiple branches in each cycle and use bigger multilevel branch prediction structures similar to caches.

B. Accelerating Data Supply

All modern microprocessors employ memory hierarchy. The growing gap between microprocessor frequency, which doubles every two to three years, and main memory access time, which improves by only 7% per year, poses a major challenge. The latency of today's main memory is 100 ns, which approximately equals 100 microprocessor cycles. The efficiency of the memory hierarchy is highly dependent on the software and varies widely for different applications.

The size of cache memories increases according to Moore's Law. The main reason for bigger caches is to support a bigger working set. New applications such as multimedia and communication applications use larger data structures, hence bigger working sets, than traditional applications. Also, the use of multiprocessing and multithreading in modern operating systems such as Windows NT and Linux causes frequent switches among applications. This results in further growth of the active working set.

Increasing the cache memory size increases its access time. Fast microprocessors, such as the Alpha or the Pentium III microprocessors, integrate two levels of caches on the microprocessor die to get improved average access time to the memory hierarchy. Embedded microprocessors integrate bigger, but slower dynamic random access memory (DRAM) on the die.
DRAM on die involves higher latency, manufacturing difficulty, and software complexity and is, therefore, not attractive for use in current generation general-purpose microprocessors. Prefetching is a different technique to reduce access time to memory. Prefetching anticipates the data or instructions the program will access in the near future and brings them to the cache ahead of time. Prefetching can be implemented as a hardware mechanism or can be instrumented with software. Many microprocessors use a simple hardware prefetching [26] mechanism to bring ahead "one instruction cache line" into the cache. This mechanism is very efficient for manipulating instruction streams, but less effective in manipulating data due to cache pollution. A different approach uses ISA extensions; e.g., the Pentium III microprocessor's prefetch instruction hints to the hardware to prefetch a cache line. To implement prefetching, the microarchitecture has to support a "nonblocking" access to the cache memory hierarchy.

C. Frequency Versus IPC

SPEC rating is a standard measure of performance based on total execution time of a SPEC benchmark suite. Fig. 4 plots the "landscape" of microprocessors based on their performance. The horizontal axis is the megahertz rating of a microprocessor's frequency. The vertical axis is the ratio of SpecINT/MHz, which roughly corresponds to the IPC assuming instruction count remains constant. The different curves represent different levels of performance with increasing performance as we move toward curves in the upper right corner. All points on the same curve represent the same performance level, i.e., SPEC rating. Performance can be increased by either increasing the megahertz rating (moving toward the right) or by increasing the SpecINT/MHz ratio (moving toward the top) or by increasing both.
For a given family of microprocessors with the same ISA (and, hence, the same instruction count), the SpecINT/MHz ratio is effectively the measure of their relative IPC.

For example, let us examine the curve that represents the Intel IA32 family of microprocessors. The first point in the curve represents the Intel386 microprocessor. The next point represents the Intel486 microprocessor. The


  • Written in: 17-Oct-2019
  • Paper ID: 363004