Overview of the Development of High-Performance Computing

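High-Performance Computing and Applications

Instructors
  • Wang Yunlan (王云嵐), email: wangyl
  • Zhao Tianhai (趙天海), email: zhaothnwpu.
  • Center for High Performance Computing Research and Development (高性能計算研究與發展中心)
  • Office: 3rd floor, Yongzi Building (勇字樓); phone: 88493434 (O)

Course objective
Master the programming tools of high-performance computing and use them to solve related problems.

Course content
The course introduces the architecture of high-performance computing systems, methods for designing high-performance parallel programs, and the latest directions in high-performance computing technology. Topics include:
  • High-performance processors and multiprocessor systems
  • Cluster computing systems, Linux cluster configuration, cluster resource management and job scheduling, multithreaded programming and performance optimization
  • Parallel programming tools: OpenMP, MPI, CUDA, MapReduce

Discussion platform
QQ group for the 2013 high-performance computing course: 158463721

Reference books
  • John L. Hennessy, David A. Patterson; 賈洪峰 (translator). 計算機體系結構:量化研究方法(第5版) [Computer Architecture: A Quantitative Approach, 5th edition].
  • 李靜梅, 吳艷霞 (editors). 新一代計算機體系結構 [New-Generation Computer Architecture].
  • 楊曉東, 陸松, 牟勝梅. 并行計算機體系結構技術與分析 [Parallel Computer Architecture: Techniques and Analysis]. 科學出版社, January 2009.
  • 劉鵬. 云計算(第二版) [Cloud Computing, 2nd edition]. 電子工業出版社, May 2011.
  • 曾宇 et al. 高效能計算機系統-若干關鍵技術分析 [High-Productivity Computer Systems: Analysis of Several Key Technologies]. 高等教育出版社, January 2010.
  • 張武生, 薛巍, 李建江, 鄭緯民. MPI并行程序設計實例教程 [MPI Parallel Programming: A Tutorial with Examples]. 清華大學出版社, 2009.
  • Michael J. Quinn; 陳文光, 武永衛 et al. (translators). MPI與OpenMP并行程序設計:C語言版 [Parallel Programming in C with MPI and OpenMP]. 清華大學出版社, 2004.10.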

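Assignments
  • Technical report on a current research topic in high-performance computing: cloud computing, CPU/GPU technology, or virtualization
  • Lab reports: building a cluster environment; parallel application programming with MPI, OpenMP, and CUDA

High-Performance Computing and Applications
Lecture 1: Overview of the Development of High-Performance Computing

Outline
  • Application demand
  • The evolution of computer architecture
  • The core technology of high-performance computing: parallel computing
  • The importance of parallel programming

Application demand
High-performance computing: what research and industry need from it, and why it matters
  • Computing needs in basic research: physics, chemistry, biology, materials
  • Needs in industry: banking, computer-aided design, pharmaceuticals, petroleum, meteorology, online services, information security

Traditional scientific research can be
  • difficult, e.g. building large wind tunnels
  • expensive, e.g. building prototypes
  • slow, e.g. waiting for the climate to change or for celestial bodies to evolve
  • dangerous, e.g. weapons development, pharmaceuticals, atmospheric tests, power-system analysis

Research based on computational science combines physical principles and numerical methods with theoretical analysis, design, and experiment.

Challenging computational problems appear in every area of science and engineering
  • Science: global climate modeling; astrophysical modeling; biology (genomics, protein folding, drug design); computational chemistry; computational materials science and nanoscience
  • Engineering: crash simulation; semiconductor design; earthquake and structural modeling; computational fluid dynamics (airplane design); combustion (engine design); oil field applications
  • Business: financial and economic modeling; transaction processing, web services, and search engines
  • Defense: nuclear weapons (test by simulation); cryptography

Units of high-performance computing: computing capability and storage capacity

Global climate modeling
  • Problem: compute f(longitude, latitude, elevation, time) → temperature, pressure, humidity, wind velocity
  • Approach: discretize the domain, e.g. a measurement point every 10 km, then devise an algorithm that predicts the weather at time t+dt given the state at time t
  • Uses: predict major events, e.g. El Nino; help set air emissions standards
  • Source: /chammp/chammp.html

Atmospheric circulation modeling
  • Requires solving the Navier-Stokes equations
  • 1-minute time step, roughly 100 floating-point operations per grid point

Computational requirements
  • To keep up with real time: 5 x 10^11 flops per simulated minute, i.e. about 8 Gflop/s
  • Weather prediction (7 simulated days per day of computing): 56 Gflop/s
  • Climate prediction (50 simulated years per month of computing): 4.8 Tflop/s
  • 50 simulated years per 12 hours of computing: 288 Tflop/s
  • Raising the grid resolution increases the computation by 8x to 16x
  • More accurate prediction models must also couple the atmosphere, oceans, ice, and land, plus geochemistry
  • Thousand-year climate models cannot currently be computed effectively

High-performance computing has become an essential tool for complex systems engineering.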
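
The requirements listed above follow from simple arithmetic, which can be reproduced as a sanity check. The sketch below is illustrative only: the function name `required_flop_rate` and the constants are assumptions introduced here. Because it uses the unrounded real-time rate (about 8.3 Gflop/s rather than the rounded 8 Gflop/s) and a 365-day year, its results come out slightly higher than the quoted 56 Gflop/s, 4.8 Tflop/s, and 288 Tflop/s.

```python
# Sketch: reproduce the climate-modeling compute requirements.
# Assumption (taken from the figures above): simulating one minute of
# atmosphere costs about 5e11 floating-point operations in total.
FLOPS_PER_SIMULATED_MINUTE = 5e11

MINUTE, HOUR, DAY = 60.0, 3600.0, 86400.0
YEAR = 365 * DAY  # assumption: leap days ignored


def required_flop_rate(simulated_seconds: float, wall_clock_seconds: float) -> float:
    """Sustained flop/s needed to simulate `simulated_seconds` of weather
    within `wall_clock_seconds` of computing time."""
    total_flops = FLOPS_PER_SIMULATED_MINUTE * (simulated_seconds / MINUTE)
    return total_flops / wall_clock_seconds


# (simulated time, computing time) for each scenario in the list above
SCENARIOS = {
    "real time": (MINUTE, MINUTE),
    "7 days in 1 day": (7 * DAY, DAY),
    "50 years in 30 days": (50 * YEAR, 30 * DAY),
    "50 years in 12 hours": (50 * YEAR, 12 * HOUR),
}

if __name__ == "__main__":
    for name, (simulated, wall) in SCENARIOS.items():
        rate = required_flop_rate(simulated, wall)
        print(f"{name:>22}: {rate / 1e9:12.1f} Gflop/s")
```

Each scenario is just the real-time rate multiplied by how much faster than real time the simulation has to run.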

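In aviation, high-end demand for high-performance computing is concentrated in CAE:
  • Aerodynamic computation
  • Structural computation
  • Aeroelastic analysis
  • Multidisciplinary design optimization
  • Flight-load computation
  • Stealth design computation
  • Stability and control computation
  • Flight simulation

Other high-performance computing needs: digital assembly, digital prototypes

Main characteristics
  • Computing capability vs. problem scale
  • Exploratory research vs. engineering application
  • Supersonic cruise, high-angle-of-attack maneuvering, internal-bay weapon launch

The ultimate goal of CFD: virtual flight testing
  • Design approaches shown: virtual wind tunnel (CFD), design experience, wind-tunnel testing, virtual flight testing

[Figure: computing devices, users, and content, today vs. 2015. Source: IDF2012]

The big data phenomenon
  • "Data are becoming the new raw material of business: an economic input almost on a par with capital and labor" (The Economist, 2010)
  • "Information will be the oil of the 21st century" (Gartner, 2010)
  • Source: IDF2012

2015 Cloud Vision: coexistence of opportunities and challenges (Source: IDF2012)

Trends to exascale performance
Performance grows roughly 10x every 4 years, which predicts that we'll hit exascale performance in 2018-19. (Source: IDF2012)

The evolution of computer architecture

Trends in computer architecture
  • Architectural improvements turn technological innovation into processing performance
  • History of computer hardware: vacuum tubes, transistors, integrated circuits, large-scale integration
  • The development of very-large-scale integration (VLSI) can be viewed as an exploration of parallel processing
  • Parallel processing is the core technique for raising processing performance

Architectural evolution: exploring forms of parallelism
The greatest trend in the VLSI generation is the increase in parallelism.
  • 1970 to 1985: bit-level parallelism. 4-bit → 8-bit → 16-bit; progress slows after 32-bit; adoption of 64-bit is now under way, and 128-bit is far off (not a performance issue)
  • Mid-1980s to mid-1990s: instruction-level parallelism. Pipelining and simple instruction sets plus compiler advances (RISC); on-chip caches and functional units enable superscalar execution; greater sophistication (out-of-order execution, speculation, prediction) deals with control transfer and latency problems
  • Now: thread-level parallelism

Three phases of VLSI: bit-level, instruction-level, thread-level

VLSI technology trends
  • Intel announced that it has reached 1.7 billion transistors with an Itanium processor
  • Gigascale integration (GSI) = 1 billion transistors per chip
  • Source: /jeff/ece4420/technology.pdf

Growth in single-processor performance
  • VAX: 25%/year, 1978 to 1986
  • RISC + x86: 52%/year, 1986 to 2002
  • RISC + x86: ?%/year, 2002 to present

Trends in processor power
  • Clock frequency is no longer being raised; growth has shifted to the number of CPUs per chip
  • The bottleneck is the maximum power of an air-cooled chip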
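
To make the growth rates quoted above more concrete, the following sketch compounds them over their stated periods and converts the "10x every 4 years" trend into an annual rate. It is a back-of-the-envelope illustration, not part of the original slides; the function names are assumptions chosen here.

```python
import math


def cumulative_factor(annual_rate: float, years: int) -> float:
    """Total performance gain after compounding `annual_rate` for `years` years."""
    return (1.0 + annual_rate) ** years


def annual_rate_from(factor: float, years: float) -> float:
    """Equivalent annual growth rate for a gain of `factor` every `years` years."""
    return factor ** (1.0 / years) - 1.0


def years_to_gain(target_factor: float, factor: float, years: float) -> float:
    """Years needed to gain `target_factor` when performance grows by
    `factor` every `years` years."""
    return years * math.log(target_factor) / math.log(factor)


if __name__ == "__main__":
    vax_era = cumulative_factor(0.25, 1986 - 1978)   # VAX: 25%/year
    risc_era = cumulative_factor(0.52, 2002 - 1986)  # RISC + x86: 52%/year
    print(f"1978-1986 at 25%/year: about {vax_era:.1f}x")
    print(f"1986-2002 at 52%/year: about {risc_era:.0f}x")
    print(f"10x every 4 years = {annual_rate_from(10, 4):.1%} per year")
    print(f"Years for a 1000x gain at that pace: {years_to_gain(1000, 10, 4):.1f}")
```

The 1000x figure corresponds to the step from petascale to exascale, which is why a 10x-per-4-years trend implies roughly 12 years between the two.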

Recent Intel processors
"We are dedicating all of our future product development to multicore designs. We believe this is a key inflection point for the industry." (Intel President Paul Otellini, IDF 2005)

| Processor | Year | Fabrication (nm) | Clock (GHz) | Power (W) |
|---|---|---|---|---|
| Pentium 4 | 2000 | 180 | 1.80-4.00 | 35-115 |
| Pentium M | 2003 | 90/130 | 1.00-2.26 | 5-27 |
| Core 2 Duo | 2006 | 65 | 2.60-2.90 | 10-65 |
| Core 2 Quad | 2006 | 65 | 2.60-2.90 | 45-105 |
| Core i7 (Quad) | 2008 | 45 | 2.93-3.60 | 95-130 |
| Core i5 (Quad) | 2009 | 45 | 3.20-3.60 | 73-95 |
| Pentium Dual-Core | 2010 | 45 | 2.80-3.33 | 65-130 |
| Core i3 (Duo) | 2010 | 32 | 2.93-3.33 | 18-73 |
| 2nd Gen i3 (Duo) | 2011 | 32 | 2.50-3.40 | 35-65 |
| 2nd Gen i5 (Quad) | 2011 | 32 | 3.10-3.80 | 45-95 |
| 2nd Gen i7 (Quad/Hexa) | 2011 | 32 | 3.80-3.90 | 65-130 |
| 3rd Gen i3 (Duo) | 2012 | 22/32 | 2.80-3.40 | 35-55 |
| 3rd Gen i5 (Quad) | 2012 | 22/32 | 3.20-3.80 | 35-77 |
| 3rd Gen i7 (Quad/Hexa) | 2012 | 22/32 | 3.70-3.90 | 45-77 |
| Xeon E5 (8 cores) | 2013 | 22 | 1.80-2.90 | 60-130 |
| Xeon Phi (60 cores) | 2013 | 22 | 1.10 | 300 |

Intel's many-core and multi-core
  • Intel 80-core TeraScale Processor (Vangal et al. 2008)
  • A single-precision solver developed for this chip ran at 1 TFLOP with only 97 watts
  • Source: Tim Mattson, Intel Labs

Trends are putting everything onto one chip
  • The future belongs to the heterogeneous, many-core SoC (system on a chip) as the standard building block of computing
  • Source: Tim Mattson, Intel Labs

Trends in cluster systems
Large-scale cluster computing systems

Franklin (NERSC-5): Cray XT4
  • 9,532 compute nodes; 38,128 cores
  • Each node has an AMD quad-core processor and 8 GB of memory
  • 25 Tflop/s on applications; 352 Tflop/s peak

HPSS archival storage
  • 40 PB capacity
  • 4 tape libraries

NERSC Global Filesystem (NGF)
  • Uses IBM's GPFS
  • 1.5 PB; 5.5 GB/s

Clusters (105 Tflops total)
  • Carver: IBM iDataplex cluster
  • PDSF (HEP/NP): Linux cluster (1K cores)
  • Magellan cloud testbed: IBM iDataplex cluster

Analytics
  • Euclid (512 GB shared memory)
  • Dirac GPU testbed (48 nodes)

Hopper (NERSC-6): Cray XE6
  • Phase 1: Cray XT5, 668 nodes, 5,344 cores
  • Phase 2: 1 Pflop/s peak (2 sockets/node, 12 cores/socket)

Tianhe-I(A)
  • 6,144 compute nodes; 24,576 cores
  • 2,560 AMD Radeon HD 4870*2 GPUs
  • 98 TB memory in total
  • Rpeak: 4.700 Pflops; Rmax: 2.566 Pflops

Jaguar (Cray XT5)
  • 224,256 x86-based AMD Opteron processor cores
  • Rpeak: 2.331 Pflops; Rmax: 1.759 Pflops

High-Performance Computing Center of Northwestern Polytechnical University (西工大高性能計算中心)
High-performance cluster equipment
  • Inspur Tiansuo (浪潮天梭) TS10000 with NX5440 blade compute nodes (Inspur TS10K cluster)
  • Computing capability: 73 Tflops total
  • 153 compute blades
  • 3 MIC accelerator nodes
  • 4 GPU accelerator nodes
  • Parallel storage: 179 TB
  • Fibre-channel storage system: 40 TB
  • Linux operating system

Basic components of the cluster
  • Fibre-channel storage system
  • Management, login, and I/O nodes
  • Compute nodes
  • Parallel storage

Top 10 list in June 2012

| Rank | Site | Computer |
|---|---|---|
| 1 | DOE/NNSA/LLNL, United States | Sequoia - BlueGene/Q, Power BQC 16C 1.60 GHz, Custom; IBM |
| 2 | RIKEN Advanced Institute for Computational Science (AICS), Japan | K computer, SPARC64 VIIIfx 2.0GHz, Tofu interconnect; Fujitsu |
| 3 | DOE/SC/Argonne National Laboratory, United States | Mira - BlueGene/Q, Power BQC 16C 1.60GHz, Custom; IBM |
| 4 | Leibniz Rechenzentrum, Germany | SuperMUC - iDataPlex DX360M4, Xeon E5-2680 8C 2.70GHz, Infiniband FDR; IBM |
| 5 | National Supercomputing Center in Tianjin, China | Tianhe-1A - NUDT YH MPP, Xeon X5670 6C 2.93 GHz, NVIDIA 2050; NUDT |
| 6 | DOE/SC/Oak Ridge National Laboratory, United States | Jaguar - Cray XK6, Opteron 6274 16C 2.200GHz, Cray Gemini interconnect, NVIDIA 2090; Cray Inc. |
| 7 | CINECA, Italy | Fermi - BlueGene/Q, Power BQC 16C 1.60GHz, Custom; IBM |
| 8 | Forschungszentrum Juelich (FZJ), Germany | JuQUEEN - BlueGene/Q, Power BQC 16C 1.60GHz, Custom; IBM |
| 9 | CEA/TGCC-GENCI, France | Curie thin nodes - Bullx B510, Xeon E5-2680 8C 2.700GHz, Infiniband QDR; Bull |
| 10 | National Supercomputing Centre in Shenzhen (NSCS), China | Nebulae - Dawning TC3600 Blade System, Xeon X5650 6C 2.66GHz, Infiniband QDR, NVIDIA 2050; Dawning |
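
The Rpeak and Rmax figures quoted earlier for Tianhe-I(A) and Jaguar can be compared directly. In TOP500 terminology Rpeak is the theoretical peak and Rmax is the performance achieved on the LINPACK benchmark, so their ratio gives a rough measure of how much of the peak a system sustains. The helper name `linpack_efficiency` is an assumption made for this sketch.

```python
# System name -> (Rmax, Rpeak) in Pflop/s, as quoted in the text above.
SYSTEMS = {
    "Tianhe-I(A)": (2.566, 4.700),
    "Jaguar (Cray XT5)": (1.759, 2.331),
}


def linpack_efficiency(rmax: float, rpeak: float) -> float:
    """Fraction of the theoretical peak achieved on LINPACK."""
    return rmax / rpeak


if __name__ == "__main__":
    for name, (rmax, rpeak) in SYSTEMS.items():
        print(f"{name}: {linpack_efficiency(rmax, rpeak):.1%} of peak")
```

With these numbers Jaguar sustains a noticeably larger share of its peak than Tianhe-I(A), even though its peak is lower.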

June 2011: Chinese high-performance computers in the TOP500

| Rank | Site | Vendor | Interconnect family | Interconnect |
|---|---|---|---|---|
| 2 | National Supercomputing Center in Tianjin | NUDT | Proprietary | Proprietary |
| 4 | National Supercomputing Centre in Shenzhen (NSCS) | Dawning | Infiniband | Infiniband QDR |
| 33 | Institute of Process Engineering, Chinese Academy of Sciences | IPE, Nvidia, Tyan | Infiniband | Infiniband QDR |
| 40 | Shanghai Supercomputer Center | Dawning | Infiniband | Infiniband DDR |
| 82 | Computer Network Information Center, Chinese Academy of Science | Lenovo | Infiniband | Infiniband |
| 97 | Tsinghua University | Inspur | Infiniband | Infiniband QDR |
| 143 | Network Company | IBM | Gigabit Ethernet | Gigabit Ethernet |
| 164 | Internet Service | IBM | Gigabit Ethernet | Gigabit Ethernet |
| 199 | Web Company (C) | Hewlett-Packard | Gigabit Ethernet | Gigabit Ethernet |
| 201 | Internet Service | IBM | Gigabit Ethernet | Gigabit Ethernet |
| 202 | Internet Service | IBM | Gigabit Ethernet | Gigabit Ethernet |

IPE: Institute of Process Engineering, Chinese Academy of Sciences (formerly the Institute of Chemical Metallurgy).

June 2012: some of the Chinese supercomputers in the TOP500 (/sublist)

| Rank | Site | System | Cores | Rmax (TFlop/s) | Rpeak (TFlop/s) | Power (kW) |
|---|---|---|---|---|---|---|
| 10 | National Supercomputing Centre in Shenzhen (NSCS), China | Nebulae - Dawning TC3600 Blade System, Xeon X5650 6C 2.66GHz, Infiniband QDR, NVIDIA 2050; Dawning | 120640 | 1271.0 | 2984.3 | 2580 |
| 26 | National Supercomputing Center in Jinan, China | Sunway Blue Light - Sunway BlueLight MPP, ShenWei processor SW1600 975.00 MHz, Infiniband QDR; National Research Center of Parallel Computer Engineering & Technology | 137200 | 795.9 | 1070.2 | 1074 |
| 37 | Institute of Process Engineering, Chinese Academy of Sciences, China | Mole-8.5 - Mole-8.5 Cluster, Xeon X5520 4C 2.27 GHz, Infiniband QDR, NVIDIA 2050; IPE, Nvidia, Tyan | 29440 | 496.5 | 1012.6 | 540 |
| 94 | Shanghai Supercomputer Center, China | Magic Cube - Dawning 5000A, QC Opteron 1.9 Ghz, Infiniband, Windows HPC 2008; Dawning | 30720 | 180.6 | 233.5 | |
| 122 | Government, China | Sunway 4000H Cluster, Xeon X56xx (Westmere-EP) 2.93 GHz, Infiniband QDR; National Research Center of Parallel Computer Engineering & Technology | 14280 | 145.6 | 167.4 | |
| 127 | Research Center, China | Cluster Platform SL250s Gen8, Xeon E5-2660 8C 2.200GHz, Infiniband FDR, NVIDIA 2090; Hewlett-Packard | 8064 | 135.4 | 270.7 | |
| 132 | Internet Service, China | xSeries x3650 Cluster, Xeon E5649 6C 2.530GHz, Gigabit Ethernet; IBM | 23316 | 131.4 | 236.0 | 707.3 |

Clusters in the TOP500 (June 2011)
  • Constellations: include a very-high-capacity switching system that can manage high-speed data transfer among thousands of compute engines at once
  • Massively parallel processors (MPP): built from many loosely coupled processing units; the CPUs in each unit have their own private resources such as bus, memory, and disk, and each processing unit runs only a microkernel
  • Cluster: every node runs a complete operating system

According to the June 2012 data, 407 of the TOP500 systems are clusters.

| Architecture | Count | Share % | Rmax Sum (GF) | Rpeak Sum (GF) | Processor Sum |
|---|---|---|---|---|---|
| Constellations | 2 | 0.40 % | 94970 | 112947 | 17648 |
| MPP | 87 | 17.40 % | 19293725 | 25550429 | 2984630 |
| Cluster | 411 | 82.20 % | 39541331 | 59516573 | 4777646 |
| Totals | 500 | 100% | 58930025.59 | 85179949.00 | 7779924 |
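
The share column in the architecture table above can be recomputed from the counts, and the same calculation applied to the Rmax sums shows how much of the list's aggregate performance each architecture contributes. This is a small consistency check written for illustration; the dictionary and function names are assumptions.

```python
# Architecture -> (system count, Rmax sum in Gflop/s), copied from the table above.
ARCHITECTURES = {
    "Constellations": (2, 94970),
    "MPP": (87, 19293725),
    "Cluster": (411, 39541331),
}


def shares(column: int) -> dict:
    """Percentage share of each architecture for the given column
    (0 = system count, 1 = Rmax sum), rounded to two decimals."""
    total = sum(row[column] for row in ARCHITECTURES.values())
    return {name: round(100.0 * row[column] / total, 2)
            for name, row in ARCHITECTURES.items()}


if __name__ == "__main__":
    print("Share of systems:", shares(0))
    print("Share of Rmax:   ", shares(1))
```

By count, clusters make up 82.2% of the list, but by aggregate Rmax their share is lower, because the MPP systems in the list are larger on average.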

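[Figure: TOP500 distribution by country]
[Figure: evolution of TOP500 architectures over the past 19 years]

June 2013: 417 clusters, 83 MPP systems.

What the TOP500 says about cluster systems
  • In high-performance computing, the cluster has become the mainstream system architecture, and its share will keep growing
  • Clusters account for the overwhelming majority of the TOP500, which shows that clustering is the main way to build very-large-scale computing systems

Trends in cluster systems
  • 64-bit systems are becoming mainstream
  • A variety of commercial high-speed interconnects
  • SAN systems as cluster storage

64-bit: breaking the 2 GB system memory bottleneck
  • Scientific computing: large-scale simulation; the memory needed by 3D grid simulations easily exceeds 2 GB
  • Bioinformatics: gene assembly and similar applications need large amounts of memory, and running out of memory is one of the main problems in practice
  • Prime-number computation needs many 64-bit integer operations and a large memory
  • Commercial applications: massive data processing, in-memory databases, media streaming servers
  • Large memory and high memory bandwidth reduce the number of disk accesses and can improve performance by nearly an order of magnitude

64-bit: new design ideas
  • Many existing algorithms were designed around a shortage of memory, so much effort has gone into trading time for space
  • 64-bit systems offer access to much more memory, so many applications may need to be redesigned around new ideas to gain the benefits of 64-bit

64-bit: not a panacea
  • Not every user needs to move to 64-bit right now
  • Code size grows, and performance may actually drop
  • Analyze your own application: does it need more than 2 GB of memory? Does it perform many 64-bit integer operations?
  • If the answer to both questions is no, the expected benefits of a 64-bit system may not materialize
  • Some applications gain a great deal from a particular 64-bit processor, but that comes from the specific processor rather than from 64-bit itself, so each case needs its own analysis

Interconnects for cluster systems
Metrics for evaluating an interconnect: latency, bandwidth, feature support, price

| Interconnect | Interface | MPI latency (us) | Unidirectional bandwidth (MB/s) | Note |
|---|---|---|---|---|
| GB Ether | PCI | 30-50 | 100 | Cheapest |
| Myrinet | PCI-X | 6 | 248 | |
| SCI | PCI | 1.4 | 326 | Lowest latency |
| Quadrics III | PCI | 5 | 340 | |
| InfiniBand 4x | PCI-X | 7.5 | 805 | Highest bandwidth |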
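
Latency and bandwidth matter for different message sizes, which the table above does not show directly. A common first-order model, used here as a simplifying assumption rather than something stated in the slides, estimates the time to send a message as latency plus size divided by bandwidth. The sketch applies it to the table's values; the 40 us latency for Gigabit Ethernet is an assumed midpoint of the quoted 30-50 us range.

```python
# Interconnect -> (MPI latency in microseconds, unidirectional bandwidth in MB/s),
# taken from the table above.
INTERCONNECTS = {
    "GB Ether": (40.0, 100.0),  # assumption: midpoint of the 30-50 us range
    "Myrinet": (6.0, 248.0),
    "SCI": (1.4, 326.0),
    "Quadrics III": (5.0, 340.0),
    "InfiniBand 4x": (7.5, 805.0),
}


def transfer_time_us(message_bytes: float, latency_us: float,
                     bandwidth_mb_per_s: float) -> float:
    """Estimated one-way transfer time in microseconds.

    With 1 MB taken as 10**6 bytes, 1 MB/s equals 1 byte per microsecond,
    so bytes divided by MB/s is already a time in microseconds.
    """
    return latency_us + message_bytes / bandwidth_mb_per_s


def fastest(message_bytes: float) -> str:
    """Interconnect with the smallest estimated transfer time."""
    return min(INTERCONNECTS,
               key=lambda name: transfer_time_us(message_bytes, *INTERCONNECTS[name]))


if __name__ == "__main__":
    for size in (8, 1_000, 1_000_000):
        print(f"{size:>9} bytes: fastest is {fastest(size)}")
```

Under this model the lowest-latency network wins for very small messages and the highest-bandwidth network wins for large ones, which is why both metrics appear in the evaluation criteria.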

Interconnects for cluster systems: feature support
  • All of them support MPI, and all except Gigabit Ethernet implement efficient communication protocols
  • SCI and Quadrics also support shared memory, but their remote-access latency is still on the order of microseconds, so fine-grained shared-memory programs are still not well supported (for comparison, remote-access latency on the SGI Altix series is below 200 ns)

Challenges facing cluster systems
  • Energy consumption: not just a cluster problem; it has to be tackled jointly at the chip, single-machine, and cluster levels
  • Manageability: monitoring, self-healing, filtering and extraction of management information, partitioning

Execution is not just about hardware
  • The modern programmer does not see assembly language
  • Many do not even see "low-level" languages like C

What is parallel programming? (Why parallel programming)

What is parallel computing?
Traditionally, software has been written for serial computation:
  • To be run on a single computer having a single Central Processing Unit (CPU)
  • A problem is broken into a discrete series of instructions
  • Instructions are executed one after another
  • Only one instruction may execute at any moment in time
  • For example: a payroll program

Parallel computing uses multiple compute resources at the same time to handle one computational task:
  • To be run using multiple CPUs
  • A problem is broken into discrete parts that can be solved concurrently
  • Each part is further broken down to a series of instructions
  • Instructions from each part execute simultaneously on different CPUs

[Figure: example]

The compute resources might be:
  • A single computer with multiple processors
  • An arbitrary number of computers connected by a network
  • A combination of both
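
The payroll example mentioned above can be sketched in both styles. The code below is an illustration, not the course's own program: the employee records, field layout, and function names are all hypothetical. The serial version processes one record after another; the parallel version breaks the problem into discrete parts and hands each part to a separate worker process using Python's standard `concurrent.futures` module.

```python
from concurrent.futures import ProcessPoolExecutor

# Hypothetical records: (name, hours worked, hourly rate in cents).
EMPLOYEES = [
    ("Ann", 160, 2500),
    ("Bo", 120, 3000),
    ("Cy", 168, 2200),
    ("Di", 80, 4100),
    ("Ed", 150, 2750),
]


def pay_one(employee):
    """Compute the pay for a single employee, in cents."""
    name, hours, rate_cents = employee
    return name, hours * rate_cents


def payroll_serial(employees):
    """Serial version: one record after another on a single CPU."""
    return [pay_one(e) for e in employees]


def split(employees, parts):
    """Break the problem into `parts` discrete pieces (round-robin)."""
    return [employees[i::parts] for i in range(parts)]


def pay_chunk(chunk):
    """Work done by one worker: the serial algorithm on its own piece."""
    return [pay_one(e) for e in chunk]


def payroll_parallel(employees, workers=2):
    """Parallel version: each piece is processed by a separate process."""
    chunks = split(employees, workers)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(pay_chunk, chunks))
    return [item for chunk_result in results for item in chunk_result]


if __name__ == "__main__":
    serial = payroll_serial(EMPLOYEES)
    parallel = payroll_parallel(EMPLOYEES, workers=2)
    # Same answers; only the order of the records may differ.
    assert sorted(serial) == sorted(parallel)
    print(sorted(parallel))
```

The records are independent of one another, which is what makes this problem easy to split. For a workload this small the cost of starting processes outweighs any gain, so the sketch shows the decomposition rather than a real speedup.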

The computational problem should be able to:
  • Be broken apart into discrete pieces of work that can be solved simultaneously
  • Execute multiple program instructions at any moment in time
  • Be solved in less time with multiple compute resources than with a single compute resource

Speedup
  • The goal of applications in using parallel machines: speedup
  • For a fixed problem size (input data set), performance = 1/time
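
With performance defined as 1/time for a fixed problem size, speedup is usually taken to be the ratio of the time on one processor to the time on p processors. That textbook definition is assumed here, since the text above does not spell the formula out. Parallel efficiency, also an addition for illustration, divides the speedup by p. The timings in the example are hypothetical.

```python
def performance(time_seconds: float) -> float:
    """Performance for a fixed problem size, defined as 1/time."""
    return 1.0 / time_seconds


def speedup(time_serial: float, time_parallel: float) -> float:
    """How many times faster the parallel run is than the serial run."""
    return time_serial / time_parallel


def efficiency(time_serial: float, time_parallel: float, processors: int) -> float:
    """Speedup per processor; 1.0 would be ideal linear scaling."""
    return speedup(time_serial, time_parallel) / processors


if __name__ == "__main__":
    # Hypothetical timings: 120 s on one processor, 40 s on four.
    t1, t4 = 120.0, 40.0
    print(f"speedup    = {speedup(t1, t4):.2f}")
    print(f"efficiency = {efficiency(t1, t4, 4):.2f}")
```

In this hypothetical case four processors give a speedup of 3, so a quarter of the added capacity is lost to overheads such as communication or work that cannot be parallelized.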

The importance of parallel programming (Why parallel programming)

Now we can get: a single-source approach to multi- and many-core (Source: IDF2012)

However, parallelizing compilers have had only limited success in parallelism detection and program transformations after 30 years of intensive research:
  • Instruction-level parallelism at the basic-block level can be detected
  • Parallelism in nested for-loops containing arrays with simple index expressions can be analyzed
  • Analysis techniques, such as data dependence analysis, pointer analysis, flow-sensitive analysis, abstract interpretation, ..., when applied across procedure boundaries often take far too long and tend to be fragile, i
