Microarchitectural concepts
1 Microarchitectural concepts
1.1 Instruction cycle
1.2 Increasing execution speed
1.3 Instruction set choice
1.4 Instruction pipelining
1.5 Cache
1.6 Branch prediction
1.7 Superscalar
1.8 Out-of-order execution
1.9 Register renaming
1.10 Multiprocessing and multithreading
Microarchitectural concepts

Instruction cycle

In general, all CPUs, whether single-chip microprocessors or multi-chip implementations, run programs by performing the following steps:

1. Read an instruction and decode it.
2. Find any associated data that is needed to process the instruction.
3. Process the instruction.
4. Write the results out.

The instruction cycle is repeated continuously until the power is turned off.
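The four steps above can be sketched as a tiny interpreter loop. This is a hypothetical illustration for a made-up accumulator machine, not a real instruction set; the opcode names and memory layout are invented for the example.

```python
def run(program, memory):
    """Repeat the instruction cycle until a HALT instruction is reached."""
    acc = 0   # accumulator register
    pc = 0    # program counter
    while True:
        # Step 1: read the instruction and decode it.
        op, operand = program[pc]
        pc += 1
        # Step 2: find any associated data (here, a memory read).
        if op in ("LOAD", "ADD"):
            data = memory[operand]
        # Step 3: process the instruction.
        if op == "LOAD":
            acc = data
        elif op == "ADD":
            acc += data
        elif op == "STORE":
            # Step 4: write the results out.
            memory[operand] = acc
        elif op == "HALT":
            return memory

memory = {0: 5, 1: 7, 2: 0}
run([("LOAD", 0), ("ADD", 1), ("STORE", 2), ("HALT", None)], memory)
print(memory[2])  # 5 + 7 = 12
```

Every technique discussed in the rest of this article is, at bottom, a way to overlap or accelerate iterations of this loop.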
Increasing execution speed

Complicating this simple-looking series of steps is the fact that the memory hierarchy, which includes caching, main memory and non-volatile storage like hard disks (where the program instructions and data reside), has always been slower than the processor itself. Step (2) often introduces a lengthy (in CPU terms) delay while the data arrives over the computer bus. A considerable amount of research has been put into designs that avoid these delays as much as possible. Over the years, a central goal was to execute more instructions in parallel, thus increasing the effective execution speed of a program. These efforts introduced complicated logic and circuit structures. Initially, these techniques could only be implemented on expensive mainframes or supercomputers due to the amount of circuitry needed. As semiconductor manufacturing progressed, more and more of these techniques could be implemented on a single semiconductor chip. See Moore's law.
Instruction set choice

Instruction sets have shifted over the years, from originally very simple to sometimes very complex (in various respects). In recent years, load-store architectures, VLIW and EPIC types have been in fashion. Architectures that deal with data parallelism include SIMD and vector processors. Some labels used to denote classes of CPU architectures are not particularly descriptive, especially the CISC label; many designs retroactively denoted CISC are in fact simpler than modern RISC processors (in several respects).

However, the choice of instruction set architecture may greatly affect the complexity of implementing high-performance devices. The prominent strategy, used to develop the first RISC processors, was to simplify instructions to a minimum of individual semantic complexity combined with high encoding regularity and simplicity. Such uniform instructions were easily fetched, decoded and executed in a pipelined fashion, and a simple strategy to reduce the number of logic levels in order to reach high operating frequencies; instruction cache memories compensated for the higher operating frequency and inherently low code density, while large register sets were used to factor out as many of the (slow) memory accesses as possible.
Instruction pipelining

One of the first, and most powerful, techniques to improve performance is the use of the instruction pipeline. Early processor designs would carry out all of the steps above for one instruction before moving onto the next. Large portions of the circuitry were left idle at any one step; for instance, the instruction decoding circuitry would be idle during execution, and so on.

Pipelines improve performance by allowing a number of instructions to work their way through the processor at the same time. In the same basic example, the processor would start to decode (step 1) a new instruction while the last one was still waiting for results. This would allow up to four instructions to be in flight at one time, making the processor look four times as fast. Although any one instruction takes just as long to complete (there are still four steps), the CPU as a whole retires instructions much faster.
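The speedup can be seen in a back-of-the-envelope cycle count. This sketch assumes an idealized four-step machine with one cycle per step and no stalls; real pipelines never achieve this ideal.

```python
STEPS = 4  # decode, fetch data, execute, write back: one cycle each

def sequential_cycles(n):
    # Each instruction finishes all four steps before the next begins.
    return n * STEPS

def pipelined_cycles(n):
    # After the pipeline fills (STEPS cycles for the first instruction),
    # one instruction retires every cycle.
    return STEPS + (n - 1)

print(sequential_cycles(1000))  # 4000
print(pipelined_cycles(1000))   # 1003: roughly four times the throughput
```

Note that per-instruction latency is unchanged; only throughput improves, which is exactly the "retires instructions faster" claim above.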
RISC makes pipelines smaller and far easier to construct by cleanly separating each stage of the instruction process and making them take the same amount of time: one cycle. The processor as a whole operates in an assembly line fashion, with instructions coming in one side and results out the other. Due to the reduced complexity of the classic RISC pipeline, the pipelined core and an instruction cache could be placed on the same size die that would otherwise fit the core alone on a CISC design. This was the real reason that RISC was faster. Early designs like the SPARC and MIPS often ran over 10 times as fast as Intel and Motorola CISC solutions at the same clock speed and price.

Pipelines are by no means limited to RISC designs. By 1986 the top-of-the-line VAX implementation (VAX 8800) was a heavily pipelined design, slightly predating the first commercial MIPS and SPARC designs. Most modern CPUs (even embedded CPUs) are now pipelined, and microcoded CPUs with no pipelining are seen only in the most area-constrained embedded processors. Large CISC machines, from the VAX 8800 to the modern Pentium 4 and Athlon, are implemented with both microcode and pipelines. Improvements in pipelining and caching are the two major microarchitectural advances that have enabled processor performance to keep pace with the circuit technology on which they are based.
Cache

It was not long before improvements in chip manufacturing allowed even more circuitry to be placed on the die, and designers started looking for ways to use it. One of the most common was to add an ever-increasing amount of cache memory on-die. Cache is simply very fast memory, memory that can be accessed in a few cycles as opposed to the many needed to talk to main memory. The CPU includes a cache controller which automates reading and writing from the cache; if the data is already in the cache it simply appears, whereas if it is not, the processor is stalled while the cache controller reads it in.
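The cache controller's hit/miss decision can be sketched as a minimal direct-mapped cache in front of a slow "main memory". The line count and access pattern here are invented for illustration; real caches have many more lines, set associativity, and write policies.

```python
LINES = 4  # number of cache lines (tiny, for illustration)

class Cache:
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}        # line index -> (tag, value)
        self.hits = self.misses = 0

    def read(self, addr):
        index, tag = addr % LINES, addr // LINES
        line = self.lines.get(index)
        if line and line[0] == tag:
            self.hits += 1               # data already in the cache: it "simply appears"
            return line[1]
        self.misses += 1                 # the processor would stall here
        value = self.memory[addr]        # cache controller reads it in
        self.lines[index] = (tag, value)
        return value

memory = list(range(100))
cache = Cache(memory)
for addr in [0, 1, 2, 0, 1, 2]:   # repeated accesses hit after the first pass
    cache.read(addr)
print(cache.hits, cache.misses)   # 3 hits, 3 misses
```

The example shows why caches pay off: programs reuse the same addresses, so after the first (slow) access the data is served in a few cycles.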
RISC designs started adding cache in the mid-to-late 1980s, often only 4 KB in total. This number grew over time, and typical CPUs now have at least 512 KB, while more powerful CPUs come with 1, 2, 4, 6, 8 or even 12 MB, organized in multiple levels of a memory hierarchy. Generally speaking, more cache means more performance, due to reduced stalling.

Caches and pipelines were a perfect match for each other. Previously, it did not make much sense to build a pipeline that could run faster than the access latency of off-chip memory. Using on-chip cache memory instead meant that a pipeline could run at the speed of the cache access latency, a much smaller length of time. This allowed the operating frequencies of processors to increase at a much faster rate than those of off-chip memory.
Branch prediction

One barrier to achieving higher performance through instruction-level parallelism stems from pipeline stalls and flushes due to branches. Normally, whether a conditional branch will be taken is not known until late in the pipeline, as conditional branches depend on results coming from a register. From the time that the processor's instruction decoder has figured out that it has encountered a conditional branch instruction to the time that the deciding register value can be read out, the pipeline needs to be stalled for several cycles, or, if it is not stalled and the branch is taken, the pipeline needs to be flushed. As clock speeds increase, the depth of the pipeline increases with it, and some modern processors may have 20 stages or more. On average, every fifth instruction executed is a branch, so without any intervention that is a high amount of stalling.

Techniques such as branch prediction and speculative execution are used to lessen these branch penalties. Branch prediction is where the hardware makes educated guesses on whether a particular branch will be taken. In reality one side or the other of the branch will be called much more often than the other. Modern designs have rather complex statistical prediction systems, which watch the results of past branches to predict the future with greater accuracy. The guess allows the hardware to prefetch instructions without waiting for the register read. Speculative execution is a further enhancement in which the code along the predicted path is not just prefetched but also executed before it is known whether the branch should be taken or not. This can yield better performance when the guess is good, but risks a huge penalty when the guess is bad, because instructions need to be undone.
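One classic scheme for "watching the results of past branches" is a two-bit saturating counter per branch address. The sketch below is the textbook baseline, not any particular CPU's predictor; the branch address and outcome pattern are made up.

```python
class TwoBitPredictor:
    def __init__(self):
        self.counters = {}  # branch address -> state 0..3

    def predict(self, addr):
        # States 2 and 3 predict "taken"; states 0 and 1 predict "not taken".
        return self.counters.get(addr, 1) >= 2

    def update(self, addr, taken):
        # Saturating counter: move one step toward the actual outcome.
        state = self.counters.get(addr, 1)
        state = min(state + 1, 3) if taken else max(state - 1, 0)
        self.counters[addr] = state

# A loop branch taken 9 times, then not taken once at loop exit. The
# predictor is wrong only during warm-up and at the final exit.
p = TwoBitPredictor()
outcomes = [True] * 9 + [False]
correct = 0
for taken in outcomes:
    if p.predict(0x40) == taken:
        correct += 1
    p.update(0x40, taken)
print(correct)  # 8 of 10 predictions correct
```

The two-bit hysteresis is the point of the design: a single surprising outcome (like a loop exit) does not immediately flip the prediction for the next run of the loop.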
Superscalar

Even with all of the added complexity and gates needed to support the concepts outlined above, improvements in semiconductor manufacturing soon allowed even more logic gates to be used.

In the outline above the processor processes parts of a single instruction at a time. Computer programs could be executed faster if multiple instructions were processed simultaneously. This is what superscalar processors achieve, by replicating functional units such as ALUs. The replication of functional units was only made possible when the die area of a single-issue processor no longer stretched the limits of what could be reliably manufactured. By the late 1980s, superscalar designs started to enter the marketplace.

In modern designs it is common to find two load units, one store unit (many instructions have no results to store), two or more integer math units, two or more floating point units, and often a SIMD unit of some sort. The instruction issue logic grows in complexity by reading in a huge list of instructions from memory and handing them off to the different execution units that are idle at that point. The results are then collected and re-ordered at the end.
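The issue logic's job can be sketched as matching queued instructions to idle functional units each cycle. The unit counts below mirror the mix described above, but the instruction sequence is invented, and the sketch deliberately ignores data dependencies (those are the subject of the next two sections).

```python
UNITS = {"load": 2, "store": 1, "int": 2, "fp": 2}  # replicated functional units

def cycles_to_issue(instructions):
    """Cycles needed to hand every instruction to an idle unit of the
    right kind, issuing as many as possible per cycle."""
    cycles = 0
    queue = list(instructions)
    while queue:
        free = dict(UNITS)               # all units idle at cycle start
        remaining = []
        for kind in queue:
            if free.get(kind, 0) > 0:
                free[kind] -= 1          # unit busy for this cycle
            else:
                remaining.append(kind)   # no idle unit: wait a cycle
        queue = remaining
        cycles += 1
    return cycles

program = ["load", "int", "int", "fp", "load", "store", "int"]
print(cycles_to_issue(program))  # 2 cycles for 7 instructions
```

A single-issue machine would need seven cycles for the same sequence; replication of units is what buys the difference.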
Out-of-order execution

The addition of caches reduces the frequency or duration of stalls due to waiting for data to be fetched from the memory hierarchy, but does not get rid of these stalls entirely. In early designs a cache miss would force the cache controller to stall the processor and wait. Of course there may be some other instruction in the program whose data is available in the cache at that point. Out-of-order execution allows that ready instruction to be processed while an older instruction waits on the cache, then re-orders the results to make it appear that everything happened in the programmed order. This technique is also used to avoid other operand dependency stalls, such as an instruction awaiting a result from a long-latency floating-point operation or other multi-cycle operations.
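The core idea can be sketched as a scheduler that each cycle issues any instruction whose operands are available, oldest first. The three-instruction program and the 100-cycle miss latency are invented; real schedulers track dozens of instructions with reservation stations and a reorder buffer.

```python
def schedule(instrs):
    """instrs: list of (name, dependency indices, latency in cycles).
    Issues one instruction per cycle, choosing the oldest whose operands
    are available, and returns the resulting execution order."""
    finish = {}                  # index -> cycle its result becomes available
    order = []
    cycle = 0
    pending = set(range(len(instrs)))
    while pending:
        ready = [i for i in sorted(pending)
                 if all(finish.get(d, float("inf")) <= cycle
                        for d in instrs[i][1])]
        if ready:
            i = ready[0]                     # oldest ready instruction
            pending.remove(i)
            order.append(instrs[i][0])
            finish[i] = cycle + instrs[i][2]
        cycle += 1
    return order

program = [
    ("load r1",   [],  100),  # long cache miss
    ("add r2,r1", [0],   1),  # must wait for the load's result
    ("add r3,r4", [],    1),  # independent: executes during the miss
]
print(schedule(program))  # ['load r1', 'add r3,r4', 'add r2,r1']
```

The independent add executes under the shadow of the miss instead of after it; retirement hardware (not modeled here) would still commit results in program order.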
Register renaming

Register renaming refers to a technique used to avoid unnecessary serialized execution of program instructions caused by the reuse of the same registers by those instructions. Suppose we have two groups of instructions that use the same register. One set of instructions would have to be executed first to leave the register to the other set, but if the other set is assigned to a different, equivalent register, both sets of instructions can be executed in parallel rather than in series.
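A minimal sketch of the renaming step: each write to an architectural register is assigned a fresh physical register, and later reads are redirected to the current mapping. The register names and four-instruction program are hypothetical.

```python
def rename(instrs):
    """instrs: list of (destination register, source registers).
    Returns the same program with destinations mapped to fresh
    physical registers p0, p1, ..."""
    mapping = {}             # architectural -> current physical register
    next_phys = 0
    renamed = []
    for dest, srcs in instrs:
        srcs = [mapping.get(s, s) for s in srcs]  # reads use current mapping
        mapping[dest] = f"p{next_phys}"           # fresh register per write
        next_phys += 1
        renamed.append((mapping[dest], srcs))
    return renamed

# Group A writes r1 then uses it; group B reuses r1 for unrelated work.
program = [
    ("r1", ["r2"]),   # A: r1 = f(r2)
    ("r3", ["r1"]),   # A: r3 = f(r1)
    ("r1", ["r4"]),   # B: r1 = f(r4)  -- reuses r1
    ("r5", ["r1"]),   # B: r5 = f(r1)
]
print(rename(program))
# [('p0', ['r2']), ('p1', ['p0']), ('p2', ['r4']), ('p3', ['p2'])]
```

After renaming, group B (p2, p3) shares no registers with group A (p0, p1), so an out-of-order core can run the two groups in parallel; only the true data dependencies within each group remain.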
Multiprocessing and multithreading

Computer architects have become stymied by the growing mismatch between CPU operating frequencies and DRAM access times. None of the techniques that exploited instruction-level parallelism (ILP) within one program could make up for the long stalls that occurred when data had to be fetched from main memory. Additionally, the large transistor counts and high operating frequencies needed for the more advanced ILP techniques required power dissipation levels that could no longer be cheaply cooled. For these reasons, newer generations of computers have started to exploit higher levels of parallelism that exist outside of a single program or program thread.

This trend is sometimes known as throughput computing. The idea originated in the mainframe market, where online transaction processing emphasized not just the execution speed of one transaction, but the capacity to deal with massive numbers of transactions. With transaction-based applications such as network routing and web-site serving greatly increasing in the last decade, the computer industry has re-emphasized capacity and throughput issues.

One technique for achieving this parallelism is multiprocessing: computer systems with multiple CPUs. Once reserved for high-end mainframes and supercomputers, small-scale (2–8) multiprocessor servers have become commonplace in the small business market. For large corporations, large-scale (16–256) multiprocessors are common. Even personal computers with multiple CPUs have appeared since the 1990s.

With further transistor size reductions made available by semiconductor technology advances, multi-core CPUs have appeared, in which multiple CPUs are implemented on the same silicon chip. These were initially used in chips targeting embedded markets, where simpler and smaller CPUs would allow multiple instantiations to fit on one piece of silicon. By 2005, semiconductor technology allowed dual high-end desktop CPU chips (CMPs) to be manufactured in volume. Some designs, such as Sun Microsystems' UltraSPARC T1, have reverted to simpler (scalar, in-order) designs in order to fit more processors on one piece of silicon.
Another technique that has become more popular recently is multithreading. In multithreading, when the processor has to fetch data from slow system memory, instead of stalling while the data arrives, the processor switches to another program or program thread which is ready to execute. Though this does not speed up any particular program/thread, it increases overall system throughput by reducing the time the CPU is idle.

Conceptually, multithreading is equivalent to a context switch at the operating system level. The difference is that a multithreaded CPU can do a thread switch in one CPU cycle instead of the hundreds or thousands of CPU cycles a context switch normally requires. This is achieved by replicating the state hardware (such as the register file and program counter) for each active thread.
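The throughput benefit can be shown with a rough back-of-the-envelope model, not a cycle-accurate simulation. All latencies below are invented: each work item executes for a fixed number of cycles and then stalls on a memory miss, and each additional ready thread hides another thread's stall time by executing during it.

```python
MISS_LATENCY = 50   # cycles a thread waits for memory (invented)

def total_cycles(num_threads, work_items, compute=10):
    """Cycles to finish work_items items, each costing `compute` execute
    cycles followed by one memory miss of MISS_LATENCY cycles."""
    busy = work_items * compute            # cycles spent actually executing
    stalls = work_items * MISS_LATENCY     # total time waiting on memory
    # Each extra ready thread can hide up to `busy` cycles of stalls,
    # because the CPU executes that thread instead of idling.
    hidden = min(stalls, (num_threads - 1) * busy)
    return busy + stalls - hidden

print(total_cycles(1, 100))  # 6000 cycles: the CPU idles through every miss
print(total_cycles(4, 100))  # 3000 cycles: half the stall time is hidden
```

No individual thread finishes sooner, matching the text: multithreading raises system throughput by filling the idle miss cycles, not by accelerating any one program.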
A further enhancement is simultaneous multithreading. This technique allows superscalar CPUs to execute instructions from different programs/threads simultaneously in the same cycle.