高等計算機系統結構

上傳人：e*** IP屬地：湖北上傳時間：2021-10-12 格式：PPT 頁數：67 大小：549.50KB 積分：19 舉報 版權申訴

已閱讀5頁，還剩62頁未讀，繼續免費閱讀

版權說明：本文檔由用戶提供并上傳，收益歸屬內容提供方，若內容存在侵權，請進行舉報或認領

文檔簡介

1、現代計算機體系結構1 現代計算機體系結構現代計算機體系結構通信郵箱：提交作業郵箱：tju_ 2013年 2 The Main Contents課程主要內容 Chapter 1. Fundamentals of Quantitative Design and Analysis Chapter 2. Memory Hierarchy Design Chapter 3. Instruction-Level Parallelism and Its Exploitation Chapter 4. Data-Level Parallelism in Vector, SIMD, and GPU Arch

2、itectures Chapter 5. Thread-Level Parallelism Chapter 6. Warehouse-Scale Computers to Exploit Request-Level and Data-Level Parallelism Appendix A. Pipelining: Basic and Intermediate Concepts 現代計算機體系結構3 Instruction-Level Parallelism (ILP) Instructions are evaluated in parallel. Pipelining Two approac

3、hes to exploiting ILP Dynamic F0=array element ADD.DF4, F0, F2;add scalar in F2 S.DF4, 0(R1);store the result DADDIUR1, R1, #-8;decrement pointer 8 bytes BNER1, R2, LOOP ;branch R1 != R2 Floating-point data dependences Integer data dependence Data Dependences 現代計算機體系結構12 The order must be preserved

4、for correct execution. If two instructions are data dependent, they cannot execute simultaneously or be completely overlapped. Data dependence between DADDIU and BNE = Branch test for the MIPS pipeline in the ID stage (2nd stage). Data Dependences 現代計算機體系結構13 Data Hazards Consider the instruction se

5、quence: ADDR1,R2,R3 SUBR4,R1,R3 ANDR6,R1,R7 ORR8,R1,R9 XORR10,R1,R11 The result in R1 is produced after it is required by the last three instructions. 現代計算機體系結構14 I n s t r. O r d e r add r1,r2,r3 sub r4,r1,r3 and r6,r1,r7 or r8,r1,r9 xor r10,r1,r11 Reg ALU DMemIfetch Reg Reg ALU DMemIfetch Reg Reg

6、ALU DMemIfetch Reg Reg ALU DMemIfetch Reg Reg ALU DMemIfetch Reg Data Hazard on R1 Time (clock cycles) IFID/RF EXMEMWB 現代計算機體系結構15 Pipelined Datapath PC Instruction memory 4 Registers M u x M u x M u x ALU EX M WB M WB WB ID/EX 0 EX/MEM MEM/WB Data memory M u x Hazard detection unit Forwarding unit

7、IF.Flush IF/ID Sign extend Control M u x = Shift left 2 M u x 現代計算機體系結構16 A data dependence conveys three things indicates the possibility of a hazard, determines the order in which results must be calculated, and sets an upper bound on how much parallelism can possibly be exploited. Data Dependence

8、s 現代計算機體系結構17 How to Overcome a Dependence Maintaining the dependence but avoiding a hazard Code scheduling (by the compiler or by the hardware) Eliminating a dependence by transforming the code Eliminating a dependence by forwarding 現代計算機體系結構18 Time (clock cycles) I n s t r. O r d e r lw r1, 0(r2)

9、sub r4,r1,r6 and r6,r1,r7 or r8,r1,r9 Data Hazard Even with Forwarding Reg ALU DMemIfetch Reg Reg ALU DMemIfetch Reg Reg ALU DMemIfetch Reg Reg ALU DMemIfetch Reg 現代計算機體系結構19 Data Hazard Even with Forwarding Time (clock cycles) or r8,r1,r9 I n s t r. O r d e r lw r1, 0(r2) sub r4,r1,r6 and r6,r1,r7

10、Reg ALU DMemIfetch Reg RegIfetch ALU DMem Reg Bubble Ifetch ALU DMem RegBubbleReg Ifetch ALU DMemBubbleReg 現代計算機體系結構20 Name Dependence Occurs when two instructions use the same register or memory location (name), but there is no flow of data between the instructions associated with that name. When i

11、 precedes j in program order: Antidependence: Instruction j writes a register or memory location that instruction i reads. Output dependence: Instructions i and j write the same register or memory location. No value transmitted between instructions. 現代計算機體系結構21 Register Renaming Instructions involve

12、d in a name dependence can execute simultaneously or be reordered, if the name (register number or memory location) used in the instructions is changed so the instructions do not conflict. (Especially for register operands) Statically by a compiler or dynamically by the hardware. 現代計算機體系結構22 Registe

13、r Renaming example Antidependence R3:=R3 + R5; (I1) R4:=R3 + 1; (I2) R3:=R5 + 1; (I3) R7:=R3 + R4; (I4) I3 can not complete before I2 starts as I2 needs a value in R3 and I3 changes R3 現代計算機體系結構23 Register Renaming example Register renaming R3b:=R3a + R5a (I1) R4b:=R3b + 1 (I2) R3c:=R5a + 1 (I3) R7b

14、:=R3c + R4b (I4) Without subscript refers to logical register in instruction With subscript is hardware register allocated Note R3a R3b R3c 現代計算機體系結構24 Hazards A hazard is created whenever there is a dependence between instructions, and they are close enough that the overlap during execution, caused

15、 by pipelining, or other reordering of instructions, would change the order of access to the operand involved in the dependence. 現代計算機體系結構25 3 Data Hazards The goal of S/W and H/W Techniques in our course is to preserve the program order only where it affects the outcome of the program to maximize I

16、LP. When instruction i occurs before instruction j in program order, RAW (Read after Write): j tries to read a source before i writes it. WAW (Write after Write): j tries to write an operand before it is written by i. WAR (Write after Read): j tries to write a destination before it is read by i. 現代計

17、算機體系結構26 Read After Write (RAW) InstrJ tries to read operand before InstrI writes it Caused by a “Dependence” (in compiler nomenclature). This hazard results from an actual need for communication. Three Generic Data Hazards I: add r1,r2,r3 J: sub r4,r1,r3 現代計算機體系結構27 Write After Read (WAR) InstrJ wr

18、ites operand before InstrI reads it Called an “anti-dependence” by compiler writers.This results from reuse of the name “r1”. Cant happen in MIPS 5 stage pipeline because: All instructions take 5 stages, and Reads are always in stage 2, and Writes are always in stage 5 I: sub r4,r1,r3 J: add r1,r2,r

19、3 K: mul r6,r1,r7 Three Generic Data Hazards 現代計算機體系結構28 Three Generic Data Hazards Write After Write (WAW) InstrJ writes operand before InstrI writes it. Called an “output dependence” by compiler writers This also results from the reuse of name “r1”. Cant happen in MIPS 5 stage pipeline because: Al

20、l instructions take 5 stages, and Writes are always in stage 5 I: sub r1,r4,r3 J: add r1,r2,r3 K: mul r6,r1,r7 現代計算機體系結構29 Control Dependences Caused by branch instructions For example If p1 S1; ; If p2 S2; ; S1 is control dependent on p1,and S2 is control dependent on p2 but not on p1. 現代計算機體系結構30

21、Control Dependences There are two constraints imposed by control dependences An instruction that is control dependent on a branch cannot be moved before the branch. An instruction that is not control dependent on a branch cannot be moved after the branch. Consider this code sequence: DADDU R2,R3,R4

22、BEQZ R2,L1 LW R1,0(R2) L1: 現代計算機體系結構31 Control dependence is not the critical property that must be preserved. We can violate the control dependences, if we can do so without affecting the correctness of the program. (e.g. branch prediction) Control Dependences 現代計算機體系結構32 Basic Compiler Techniques

23、for Exposing ILP 現代計算機體系結構33 To avoid a pipeline stall, a dependent instruction must be separated from the source instruction by a distance in clock cycles equal to the pipeline latency of that source instruction. Goal: to keep a pipeline full. 現代計算機體系結構34 Basic Pipeline Scheduling and Loop Unrollin

24、g 現代計算機體系結構35 Latencies Inst. producing result Inst. using result Latency in cycles FP ALU opAnother FP op3 FP ALU opStore double2 Load doubleFP ALU op1 Load doubleStore double0 Branch: 1, Integer ALU op branch: 1 Integer load: 1 Integer ALU - integer ALU: 0 Latency: the number of clock cycles neede

25、d to avoid a stall between a producer and a consumer Functional units are fully pipelined or replicated. = No structural hazard 現代計算機體系結構36 Example for ( i = 1000; i 0; i = i 1) xi = xi + s; Loop: L.DF0, 0(R1) ADD.DF4, F0, F2 S.DF4, 0(R1) DADDIUR1, R1, # -8 BNER1, R2, LOOP MIPS Assembly code 現代計算機體系

26、結構37 Without Any Scheduling Clock cycle issued Loop: L.DF0, 0(R1)1 stall2 ADD.DF4, F0, F23 stall4 stall5 S.DF4, 0(R1)6 DADDIUR1, R1, # -87 stall8 BNER1, R2, LOOP9 stall10 現代計算機體系結構38 With Scheduling Clock cycle issued Loop: L.DF0, 0(R1)1 DADDIUR1, R1, # -82 ADD.DF4, F0, F23 stall4 BNER1, R2, LOOP5 S

27、.DF4, 8(R1)6 not trivial delayed branch 現代計算機體系結構39 The actual work of operating on the array element takes 3 cycles (load, add, store). The remaining 3 cycles Loop overhead (DADDIU, BNE) Stall To eliminate the 3 cycles, we need to get more operations within the loop relative to the number of overhe

28、ad instructions. = Loop Unrolling With Scheduling 現代計算機體系結構40 Reducing Loop Overhead Loop Unrolling Simple scheme for increasing the number of instructions relative to the branch and overhead instructions Simply replicates the loop body multiple times, adjusting the loop termination code. Improves s

29、cheduling It allows instructions from different iterations to be scheduled together. Uses different registers for each iteration. 現代計算機體系結構41 Unrolled Loop (No Scheduling) Clock cycle issued Loop:L.DF0, 0(R1)12 ADD.DF4, F0, F234 5 S.DF4, 0(R1)6 L.DF6, -8(R1)78 ADD.DF8, F6, F2910 11 S.DF8, -8(R1)12 L

30、.DF10, -16(R1)1314 ADD.DF12, F10, F21516 17 S.DF12, -16(R1)18 L.DF14, -24(R1)1920 ADD.DF16, F14, F22122 23 S.DF16, -24(R1)24 DADDIUR1, R1, # -322526 BNER1, R2, LOOP2728 DADDIU and BNE dropped 現代計算機體系結構42 Loop Unrolling Loop unrolling is normally done early in the compilation process, so that redunda

31、nt computations can be exposed and eliminated by the optimizer. Unrolling improves the performance of the loop by eliminating overhead instructions. 現代計算機體系結構43 Loop Unrolling (Scheduling) Clock cycle issued Loop:L.DF0, 0(R1)1 L.DF6, -8(R1)2 L.DF10, -16(R1)3 L.DF14, -24(R1)4 ADD.DF4, F0, F25 ADD.DF8

32、, F6, F26 ADD.DF12, F10, F27 ADD.DF16, F14, F28 S.DF4, 0(R1)9 S.DF8, -8(R1)10 DADDIUR1, R1, # -3211 S.DF12, 16(R1)12 BNER1, R2, LOOP13 S.DF16, 8(R1)14 現代計算機體系結構44 Summary The key to most hardware and software ILP techniques is to know when and how the ordering among instructions may be changed. This

33、 process must be performed in a methodical fashion either by a compiler or by hardware. 現代計算機體系結構45 To obtain the final unrolled code, Determine that it is legal to move the S.D after the DADDIU and BNE, and find the amount to adjust the S.D offset. Determine that unrolling the loop will be useful b

34、y finding that the loop iterations are independent, except for the loop maintenance code. Use different registers to avoid unnecessary constraints. Eliminate the extra test and branch instructions and adjust the loop termination and iteration code. Summary 現代計算機體系結構46 Determine that the loads and st

35、ores in the unrolled loop can be interchanged by observing that the loads and stores from different iterations are independent. This transformation requires analyzing the memory addresses and finding that they do not refer to the same address. Schedule the code, preserving any dependences needed to

36、yield the same result as the original code. Summary 現代計算機體系結構47 Loop Unrolling I (Unoptimized, No Delayed Branch) Loop:L.DF0, 0(R1) ADD.DF4, F0, F2 S.DF4, 0(R1) DADDIUR1, R1, #-8 L.DF0, 0(R1) ADD.DF4, F0, F2 S.DF4, 0(R1) DADDIUR1, R1, #-8 L.DF0, 0(R1) ADD.DF4, F0, F2 S.DF4, 0(R1) DADDIUR1, R1, #-8 L

37、.DF0, 0(R1) ADD.DF4, F0, F2 S.DF4, 0(R1) DADDIUR1, R1, # -8 BNER1, R2, LOOP By symbolically computing the intermediate value of R1 現代計算機體系結構48 Loop Unrolling I (Unoptimized, No Delayed Branch) Loop:L.DF0, 0(R1) ADD.DF4, F0, F2 S.DF4, 0(R1) L.DF0, -8(R1) ADD.DF4, F0, F2 S.DF4, -8(R1) L.DF0, -16(R1) A

38、DD.DF4, F0, F2 S.DF4, -16(R1) L.DF0, -24(R1) ADD.DF4, F0, F2 S.DF4, -24(R1) DADDIUR1, R1, # -32 BNER1, R2, LOOP name dependence true dependence Remove name dependences using Register Renaming 現代計算機體系結構49 Loop Unrolling II (Register Renaming) Loop:L.DF0, 0(R1) ADD.DF4, F0, F2 S.DF4, 0(R1) L.DF6, -8(R

39、1) ADD.DF8, F6, F2 S.DF8, -8(R1) L.DF10, -16(R1) ADD.DF12, F10, F2 S.DF12, -16(R1) L.DF14, -24(R1) ADD.DF16, F14, F2 S.DF16, -24(R1) DADDIUR1, R1, # -32 BNER1, R2, LOOP true dependence 現代計算機體系結構50 With the renaming, the copies of each loop body become independent and can be overlapped or executed in

40、 parallel. Problem: potential shortfall in registers Register pressure It arises because scheduling code to increase ILP causes the number of live values to increase. It may not be possible to allocate all the live values to registers. The combination of unrolling and aggressive scheduling can cause

41、 this problem. 現代計算機體系結構51 Loop unrolling is a simple but useful method for increasing the size of straight- line code fragments that can be scheduled effectively. 現代計算機體系結構52 One Memory Port/Structural Hazards I n s t r. O r d e r Time (clock cycles) Load Instr 1 Instr 2 Instr 3 Instr 4 Reg ALU DMe

42、mIfetch Reg Reg ALU DMemIfetch Reg Reg ALU DMemIfetch Reg Reg ALU DMemIfetch Reg Cycle 1 Cycle 2 Cycle 3 Cycle 4Cycle 6 Cycle 7Cycle 5 Reg ALU DMemIfetch Reg 現代計算機體系結構53 One Memory Port/Structural Hazards I n s t r. O r d e r Time (clock cycles) Load Instr 1 Instr 2 Stall Instr 3 Reg ALU DMemIfetch

43、Reg Reg ALU DMemIfetch Reg Reg ALU DMemIfetch Reg Cycle 1 Cycle 2 Cycle 3 Cycle 4Cycle 6 Cycle 7Cycle 5 Reg ALU DMemIfetch Reg BubbleBubbleBubbleBubbleBubble 現代計算機體系結構54 Reducing Branch Costs with Prediction 現代計算機體系結構55 Static Branch Prediction Delayed branch? To reorder code around branches, need t

44、o predict branch statically when compile There are several different methods to statically predict branch behavior Simplest scheme is to predict a branch as taken Average misprediction = untaken branch frequency = 34% SPEC 現代計算機體系結構56 CSCE 430/830, Instruction Level Parallelism56 Static Branch Predi

45、ction More accurate scheme predicts branches using profile information collected from earlier runs, and modify prediction based on last run: IntegerFloating Point 現代計算機體系結構57 Dynamic Branch Prediction Why does prediction work? Underlying algorithm has regularities Data that is being operated on has

46、regularities Instruction sequence has redundancies that are artifacts of way that humans/compilers think about problems Is dynamic branch prediction better than static branch prediction? Seems to be There are a small number of important branches in programs which have dynamic behavior 現代計算機體系結構58 Dy

47、namic Branch Prediction Performance = (accuracy, cost of misprediction) The simplest dynamic branch-prediction scheme is a branch-prediction buffer or branch history table Branch History Table (BHT) : Lower bits of PC address index table of 1-bit values Says whether or not branch taken last time No

48、address check 現代計算機體系結構59 Dynamic Branch Prediction Problem: in a loop, 1-bit BHT will cause two mispredictions End of loop case, when it exits instead of looping as before First time through loop on next time through code, when it predicts exit instead of looping 現代計算機體系結構60 Solution: 2-bit scheme

49、where change prediction only if get misprediction twice Adds hysteresis to decision making process Dynamic Branch Prediction T TNT NT Predict Taken Predict Not Taken Predict Taken Predict Not Taken T NT T NT 現代計算機體系結構61 BHT Accuracy Mispredict because either: Wrong guess for that branch Got branch history of wrong branch when index the table 4096 entry table: Integer

人人文庫> 全部分類> 行業資料 > 信息產業

溫馨提示

1. 本站所有資源如無特殊說明，都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請下載最新的WinRAR軟件解壓。
2. 本站的文檔不包含任何第三方提供的附件圖紙等，如果需要附件，請聯系上傳者。文件的所有權益歸上傳用戶所有。
3. 本站RAR壓縮包中若帶圖紙，網頁內容里面會有圖紙預覽，若沒有圖紙預覽就沒有圖紙。
4. 未經權益所有人同意不得將文件中的內容挪作商業或盈利用途。
5. 人人文庫網僅提供信息存儲空間，僅對用戶上傳內容的表現方式做保護處理，對用戶上傳分享的文檔內容本身不做任何修改或編輯，并不能對任何下載內容負責。
6. 下載文件中如有侵權或不適當內容，請與我們聯系，我們立即糾正。
7. 本站不保證下載資源的準確性、安全性和完整性, 同時也不承擔用戶因使用這些下載資源對自己和他人造成任何形式的傷害或損失。

老太爷的乳妓h开裆裤,久久久久久精品国产三级非禁歌 ,久久久久久久99精品国产片,免费观看交性大片

高等計算機系統結構

文檔簡介

溫馨提示

最新文檔

評論

老太爷的乳妓h开裆裤,久久久久久精品国产三级非禁歌 ,久久久久久久99精品国产片,免费观看交性大片

高等計算機系統結構

文檔簡介

溫馨提示

最新文檔

評論

相關文檔