




版權說明:本文檔由用戶提供并上傳,收益歸屬內容提供方,若內容存在侵權,請進行舉報或認領
文檔簡介
1、現代計算機體系結構1 現代計算機體系結構現代計算機體系結構 通信郵箱: 提交作業郵箱:tju_ 2013年 2 The Main Contents課程主要內容 Chapter 1. Fundamentals of Quantitative Design and Analysis Chapter 2. Memory Hierarchy Design Chapter 3. Instruction-Level Parallelism and Its Exploitation Chapter 4. Data-Level Parallelism in Vector, SIMD, and GPU Arch
2、itectures Chapter 5. Thread-Level Parallelism Chapter 6. Warehouse-Scale Computers to Exploit Request-Level and Data-Level Parallelism Appendix A. Pipelining: Basic and Intermediate Concepts 現代計算機體系結構3 Instruction-Level Parallelism (ILP) Instructions are evaluated in parallel. Pipelining Two approac
3、hes to exploiting ILP Dynamic F0=array element ADD.DF4, F0, F2;add scalar in F2 S.DF4, 0(R1);store the result DADDIUR1, R1, #-8;decrement pointer 8 bytes BNER1, R2, LOOP ;branch R1 != R2 Floating-point data dependences Integer data dependence Data Dependences 現代計算機體系結構12 The order must be preserved
4、for correct execution. If two instructions are data dependent, they cannot execute simultaneously or be completely overlapped. Data dependence between DADDIU and BNE = Branch test for the MIPS pipeline in the ID stage (2nd stage). Data Dependences 現代計算機體系結構13 Data Hazards Consider the instruction se
5、quence: ADDR1,R2,R3 SUBR4,R1,R3 ANDR6,R1,R7 ORR8,R1,R9 XORR10,R1,R11 The result in R1 is produced after it is required by the last three instructions. 現代計算機體系結構14 I n s t r. O r d e r add r1,r2,r3 sub r4,r1,r3 and r6,r1,r7 or r8,r1,r9 xor r10,r1,r11 Reg ALU DMemIfetch Reg Reg ALU DMemIfetch Reg Reg
6、ALU DMemIfetch Reg Reg ALU DMemIfetch Reg Reg ALU DMemIfetch Reg Data Hazard on R1 Time (clock cycles) IFID/RF EXMEMWB 現代計算機體系結構15 Pipelined Datapath PC Instruction memory 4 Registers M u x M u x M u x ALU EX M WB M WB WB ID/EX 0 EX/MEM MEM/WB Data memory M u x Hazard detection unit Forwarding unit
7、IF.Flush IF/ID Sign extend Control M u x = Shift left 2 M u x 現代計算機體系結構16 A data dependence conveys three things indicates the possibility of a hazard, determines the order in which results must be calculated, and sets an upper bound on how much parallelism can possibly be exploited. Data Dependence
8、s 現代計算機體系結構17 How to Overcome a Dependence Maintaining the dependence but avoiding a hazard Code scheduling (by the compiler or by the hardware) Eliminating a dependence by transforming the code Eliminating a dependence by forwarding 現代計算機體系結構18 Time (clock cycles) I n s t r. O r d e r lw r1, 0(r2)
9、sub r4,r1,r6 and r6,r1,r7 or r8,r1,r9 Data Hazard Even with Forwarding Reg ALU DMemIfetch Reg Reg ALU DMemIfetch Reg Reg ALU DMemIfetch Reg Reg ALU DMemIfetch Reg 現代計算機體系結構19 Data Hazard Even with Forwarding Time (clock cycles) or r8,r1,r9 I n s t r. O r d e r lw r1, 0(r2) sub r4,r1,r6 and r6,r1,r7
10、Reg ALU DMemIfetch Reg RegIfetch ALU DMem Reg Bubble Ifetch ALU DMem RegBubbleReg Ifetch ALU DMemBubbleReg 現代計算機體系結構20 Name Dependence Occurs when two instructions use the same register or memory location (name), but there is no flow of data between the instructions associated with that name. When i
11、 precedes j in program order: Antidependence: Instruction j writes a register or memory location that instruction i reads. Output dependence: Instructions i and j write the same register or memory location. No value transmitted between instructions. 現代計算機體系結構21 Register Renaming Instructions involve
12、d in a name dependence can execute simultaneously or be reordered, if the name (register number or memory location) used in the instructions is changed so the instructions do not conflict. (Especially for register operands) Statically by a compiler or dynamically by the hardware. 現代計算機體系結構22 Registe
13、r Renaming example Antidependence R3:=R3 + R5; (I1) R4:=R3 + 1; (I2) R3:=R5 + 1; (I3) R7:=R3 + R4; (I4) I3 can not complete before I2 starts as I2 needs a value in R3 and I3 changes R3 現代計算機體系結構23 Register Renaming example Register renaming R3b:=R3a + R5a (I1) R4b:=R3b + 1 (I2) R3c:=R5a + 1 (I3) R7b
14、:=R3c + R4b (I4) Without subscript refers to logical register in instruction With subscript is hardware register allocated Note R3a R3b R3c 現代計算機體系結構24 Hazards A hazard is created whenever there is a dependence between instructions, and they are close enough that the overlap during execution, caused
15、 by pipelining, or other reordering of instructions, would change the order of access to the operand involved in the dependence. 現代計算機體系結構25 3 Data Hazards The goal of S/W and H/W Techniques in our course is to preserve the program order only where it affects the outcome of the program to maximize I
16、LP. When instruction i occurs before instruction j in program order, RAW (Read after Write): j tries to read a source before i writes it. WAW (Write after Write): j tries to write an operand before it is written by i. WAR (Write after Read): j tries to write a destination before it is read by i. 現代計
17、算機體系結構26 Read After Write (RAW) InstrJ tries to read operand before InstrI writes it Caused by a “Dependence” (in compiler nomenclature). This hazard results from an actual need for communication. Three Generic Data Hazards I: add r1,r2,r3 J: sub r4,r1,r3 現代計算機體系結構27 Write After Read (WAR) InstrJ wr
18、ites operand before InstrI reads it Called an “anti-dependence” by compiler writers.This results from reuse of the name “r1”. Cant happen in MIPS 5 stage pipeline because: All instructions take 5 stages, and Reads are always in stage 2, and Writes are always in stage 5 I: sub r4,r1,r3 J: add r1,r2,r
19、3 K: mul r6,r1,r7 Three Generic Data Hazards 現代計算機體系結構28 Three Generic Data Hazards Write After Write (WAW) InstrJ writes operand before InstrI writes it. Called an “output dependence” by compiler writers This also results from the reuse of name “r1”. Cant happen in MIPS 5 stage pipeline because: Al
20、l instructions take 5 stages, and Writes are always in stage 5 I: sub r1,r4,r3 J: add r1,r2,r3 K: mul r6,r1,r7 現代計算機體系結構29 Control Dependences Caused by branch instructions For example If p1 S1; ; If p2 S2; ; S1 is control dependent on p1,and S2 is control dependent on p2 but not on p1. 現代計算機體系結構30
21、Control Dependences There are two constraints imposed by control dependences An instruction that is control dependent on a branch cannot be moved before the branch. An instruction that is not control dependent on a branch cannot be moved after the branch. Consider this code sequence: DADDU R2,R3,R4
22、BEQZ R2,L1 LW R1,0(R2) L1: 現代計算機體系結構31 Control dependence is not the critical property that must be preserved. We can violate the control dependences, if we can do so without affecting the correctness of the program. (e.g. branch prediction) Control Dependences 現代計算機體系結構32 Basic Compiler Techniques
23、for Exposing ILP 現代計算機體系結構33 To avoid a pipeline stall, a dependent instruction must be separated from the source instruction by a distance in clock cycles equal to the pipeline latency of that source instruction. Goal: to keep a pipeline full. 現代計算機體系結構34 Basic Pipeline Scheduling and Loop Unrollin
24、g 現代計算機體系結構35 Latencies Inst. producing result Inst. using result Latency in cycles FP ALU opAnother FP op3 FP ALU opStore double2 Load doubleFP ALU op1 Load doubleStore double0 Branch: 1, Integer ALU op branch: 1 Integer load: 1 Integer ALU - integer ALU: 0 Latency: the number of clock cycles neede
25、d to avoid a stall between a producer and a consumer Functional units are fully pipelined or replicated. = No structural hazard 現代計算機體系結構36 Example for ( i = 1000; i 0; i = i 1) xi = xi + s; Loop: L.DF0, 0(R1) ADD.DF4, F0, F2 S.DF4, 0(R1) DADDIUR1, R1, # -8 BNER1, R2, LOOP MIPS Assembly code 現代計算機體系
26、結構37 Without Any Scheduling Clock cycle issued Loop: L.DF0, 0(R1)1 stall2 ADD.DF4, F0, F23 stall4 stall5 S.DF4, 0(R1)6 DADDIUR1, R1, # -87 stall8 BNER1, R2, LOOP9 stall10 現代計算機體系結構38 With Scheduling Clock cycle issued Loop: L.DF0, 0(R1)1 DADDIUR1, R1, # -82 ADD.DF4, F0, F23 stall4 BNER1, R2, LOOP5 S
27、.DF4, 8(R1)6 not trivial delayed branch 現代計算機體系結構39 The actual work of operating on the array element takes 3 cycles (load, add, store). The remaining 3 cycles Loop overhead (DADDIU, BNE) Stall To eliminate the 3 cycles, we need to get more operations within the loop relative to the number of overhe
28、ad instructions. = Loop Unrolling With Scheduling 現代計算機體系結構40 Reducing Loop Overhead Loop Unrolling Simple scheme for increasing the number of instructions relative to the branch and overhead instructions Simply replicates the loop body multiple times, adjusting the loop termination code. Improves s
29、cheduling It allows instructions from different iterations to be scheduled together. Uses different registers for each iteration. 現代計算機體系結構41 Unrolled Loop (No Scheduling) Clock cycle issued Loop:L.DF0, 0(R1)12 ADD.DF4, F0, F234 5 S.DF4, 0(R1)6 L.DF6, -8(R1)78 ADD.DF8, F6, F2910 11 S.DF8, -8(R1)12 L
30、.DF10, -16(R1)1314 ADD.DF12, F10, F21516 17 S.DF12, -16(R1)18 L.DF14, -24(R1)1920 ADD.DF16, F14, F22122 23 S.DF16, -24(R1)24 DADDIUR1, R1, # -322526 BNER1, R2, LOOP2728 DADDIU and BNE dropped 現代計算機體系結構42 Loop Unrolling Loop unrolling is normally done early in the compilation process, so that redunda
31、nt computations can be exposed and eliminated by the optimizer. Unrolling improves the performance of the loop by eliminating overhead instructions. 現代計算機體系結構43 Loop Unrolling (Scheduling) Clock cycle issued Loop:L.DF0, 0(R1)1 L.DF6, -8(R1)2 L.DF10, -16(R1)3 L.DF14, -24(R1)4 ADD.DF4, F0, F25 ADD.DF8
32、, F6, F26 ADD.DF12, F10, F27 ADD.DF16, F14, F28 S.DF4, 0(R1)9 S.DF8, -8(R1)10 DADDIUR1, R1, # -3211 S.DF12, 16(R1)12 BNER1, R2, LOOP13 S.DF16, 8(R1)14 現代計算機體系結構44 Summary The key to most hardware and software ILP techniques is to know when and how the ordering among instructions may be changed. This
33、 process must be performed in a methodical fashion either by a compiler or by hardware. 現代計算機體系結構45 To obtain the final unrolled code, Determine that it is legal to move the S.D after the DADDIU and BNE, and find the amount to adjust the S.D offset. Determine that unrolling the loop will be useful b
34、y finding that the loop iterations are independent, except for the loop maintenance code. Use different registers to avoid unnecessary constraints. Eliminate the extra test and branch instructions and adjust the loop termination and iteration code. Summary 現代計算機體系結構46 Determine that the loads and st
35、ores in the unrolled loop can be interchanged by observing that the loads and stores from different iterations are independent. This transformation requires analyzing the memory addresses and finding that they do not refer to the same address. Schedule the code, preserving any dependences needed to
36、yield the same result as the original code. Summary 現代計算機體系結構47 Loop Unrolling I (Unoptimized, No Delayed Branch) Loop:L.DF0, 0(R1) ADD.DF4, F0, F2 S.DF4, 0(R1) DADDIUR1, R1, #-8 L.DF0, 0(R1) ADD.DF4, F0, F2 S.DF4, 0(R1) DADDIUR1, R1, #-8 L.DF0, 0(R1) ADD.DF4, F0, F2 S.DF4, 0(R1) DADDIUR1, R1, #-8 L
37、.DF0, 0(R1) ADD.DF4, F0, F2 S.DF4, 0(R1) DADDIUR1, R1, # -8 BNER1, R2, LOOP By symbolically computing the intermediate value of R1 現代計算機體系結構48 Loop Unrolling I (Unoptimized, No Delayed Branch) Loop:L.DF0, 0(R1) ADD.DF4, F0, F2 S.DF4, 0(R1) L.DF0, -8(R1) ADD.DF4, F0, F2 S.DF4, -8(R1) L.DF0, -16(R1) A
38、DD.DF4, F0, F2 S.DF4, -16(R1) L.DF0, -24(R1) ADD.DF4, F0, F2 S.DF4, -24(R1) DADDIUR1, R1, # -32 BNER1, R2, LOOP name dependence true dependence Remove name dependences using Register Renaming 現代計算機體系結構49 Loop Unrolling II (Register Renaming) Loop:L.DF0, 0(R1) ADD.DF4, F0, F2 S.DF4, 0(R1) L.DF6, -8(R
39、1) ADD.DF8, F6, F2 S.DF8, -8(R1) L.DF10, -16(R1) ADD.DF12, F10, F2 S.DF12, -16(R1) L.DF14, -24(R1) ADD.DF16, F14, F2 S.DF16, -24(R1) DADDIUR1, R1, # -32 BNER1, R2, LOOP true dependence 現代計算機體系結構50 With the renaming, the copies of each loop body become independent and can be overlapped or executed in
40、 parallel. Problem: potential shortfall in registers Register pressure It arises because scheduling code to increase ILP causes the number of live values to increase. It may not be possible to allocate all the live values to registers. The combination of unrolling and aggressive scheduling can cause
41、 this problem. 現代計算機體系結構51 Loop unrolling is a simple but useful method for increasing the size of straight- line code fragments that can be scheduled effectively. 現代計算機體系結構52 One Memory Port/Structural Hazards I n s t r. O r d e r Time (clock cycles) Load Instr 1 Instr 2 Instr 3 Instr 4 Reg ALU DMe
42、mIfetch Reg Reg ALU DMemIfetch Reg Reg ALU DMemIfetch Reg Reg ALU DMemIfetch Reg Cycle 1 Cycle 2 Cycle 3 Cycle 4Cycle 6 Cycle 7Cycle 5 Reg ALU DMemIfetch Reg 現代計算機體系結構53 One Memory Port/Structural Hazards I n s t r. O r d e r Time (clock cycles) Load Instr 1 Instr 2 Stall Instr 3 Reg ALU DMemIfetch
43、Reg Reg ALU DMemIfetch Reg Reg ALU DMemIfetch Reg Cycle 1 Cycle 2 Cycle 3 Cycle 4Cycle 6 Cycle 7Cycle 5 Reg ALU DMemIfetch Reg BubbleBubbleBubbleBubbleBubble 現代計算機體系結構54 Reducing Branch Costs with Prediction 現代計算機體系結構55 Static Branch Prediction Delayed branch? To reorder code around branches, need t
44、o predict branch statically when compile There are several different methods to statically predict branch behavior Simplest scheme is to predict a branch as taken Average misprediction = untaken branch frequency = 34% SPEC 現代計算機體系結構56 CSCE 430/830, Instruction Level Parallelism56 Static Branch Predi
45、ction More accurate scheme predicts branches using profile information collected from earlier runs, and modify prediction based on last run: IntegerFloating Point 現代計算機體系結構57 Dynamic Branch Prediction Why does prediction work? Underlying algorithm has regularities Data that is being operated on has
46、regularities Instruction sequence has redundancies that are artifacts of way that humans/compilers think about problems Is dynamic branch prediction better than static branch prediction? Seems to be There are a small number of important branches in programs which have dynamic behavior 現代計算機體系結構58 Dy
47、namic Branch Prediction Performance = (accuracy, cost of misprediction) The simplest dynamic branch-prediction scheme is a branch-prediction buffer or branch history table Branch History Table (BHT) : Lower bits of PC address index table of 1-bit values Says whether or not branch taken last time No
48、address check 現代計算機體系結構59 Dynamic Branch Prediction Problem: in a loop, 1-bit BHT will cause two mispredictions End of loop case, when it exits instead of looping as before First time through loop on next time through code, when it predicts exit instead of looping 現代計算機體系結構60 Solution: 2-bit scheme
49、where change prediction only if get misprediction twice Adds hysteresis to decision making process Dynamic Branch Prediction T TNT NT Predict Taken Predict Not Taken Predict Taken Predict Not Taken T NT T NT 現代計算機體系結構61 BHT Accuracy Mispredict because either: Wrong guess for that branch Got branch history of wrong branch when index the table 4096 entry table: Integer
溫馨提示
- 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請下載最新的WinRAR軟件解壓。
- 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請聯系上傳者。文件的所有權益歸上傳用戶所有。
- 3. 本站RAR壓縮包中若帶圖紙,網頁內容里面會有圖紙預覽,若沒有圖紙預覽就沒有圖紙。
- 4. 未經權益所有人同意不得將文件中的內容挪作商業或盈利用途。
- 5. 人人文庫網僅提供信息存儲空間,僅對用戶上傳內容的表現方式做保護處理,對用戶上傳分享的文檔內容本身不做任何修改或編輯,并不能對任何下載內容負責。
- 6. 下載文件中如有侵權或不適當內容,請與我們聯系,我們立即糾正。
- 7. 本站不保證下載資源的準確性、安全性和完整性, 同時也不承擔用戶因使用這些下載資源對自己和他人造成任何形式的傷害或損失。
最新文檔
- 移動辦公設備貸款協議
- 《網絡廣告互動性研究》課件
- 雙語列車長車票的發售規定課件
- 雙語列車長火災爆炸事故的應急處理課件
- 中醫與傳統文化課件
- 家居設計合同范本
- 版個人房產轉讓合同樣本
- 四位創始股東合作合同書
- 【課件】電荷+課件+-高二上學期物理人教版(2019)必修第三冊+
- 景德鎮藝術職業大學《中醫養生與康復學》2023-2024學年第二學期期末試卷
- 《大數據導論(第2版)》全套教學課件
- 新疆能源(集團)有限責任公司招聘筆試題庫2024
- AECOPD合并呼吸衰竭護理查房
- 2024年全國高中數學聯賽北京賽區預賽一試試題(解析版)
- 2025屆新高考化學熱點精準復習 高三化學復習備考的方法與策略
- 新高考II卷01(含聽力)2024年高考英語一輪復習測試卷(考試版)
- 西游記閱讀指導課評課
- 2024年鄭州信息科技職業學院單招職業適應性測試題庫學生專用
- 2023-2024學年安徽省合肥八中高一(下)期中數學試卷(含解析)
- CHT 9008.2-2010 基礎地理信息數字成果1:500 1:1 000 1:2 000數字高程模型
- 測量學-第五版-配套課件
評論
0/150
提交評論