存儲系統容錯編碼簡介ppt課件_第1頁
存儲系統容錯編碼簡介ppt課件_第2頁
存儲系統容錯編碼簡介ppt課件_第3頁
存儲系統容錯編碼簡介ppt課件_第4頁
存儲系統容錯編碼簡介ppt課件_第5頁
已閱讀5頁,還剩114頁未讀 繼續免費閱讀

下載本文檔

版權說明:本文檔由用戶提供并上傳,收益歸屬內容提供方,若內容存在侵權,請進行舉報或認領

文檔簡介

1、存儲系統容錯編碼簡介內容mRAID、容錯編碼mReed-Solomon編碼m二進制線性碼m陣列碼m利用組合數學工具構造容錯編碼內容mRAID、容錯編碼mReed-Solomon編碼m二進制線性碼m陣列碼m利用組合數學工具構造容錯編碼RAIDmRedundant Arrays of Inexpensive DisksRedundant Arrays of Independent Disksm容量m性能m可靠性mChen, P. M., Lee, E. K., Gibson, G. A., Katz, R. H., and Patterson, D. A. “RAID: high-performa

2、nce, reliable secondary storage. ACM Computing Surveys 26(2), pp. 143-185, June 1994.RAID構造mData Stripingstripe unitstripe04KB-14KB8KB-18KB12KB-112KB16KB-116KB20KB-1RAID構造mRedundancy04KB-1RAID構造m編碼:d1 XOR d2 XOR XOR dn = pm解碼:di = p XOR d1 XOR XOR di-1 XOR di-1 m解碼:pnew = p XOR diold XOR dinewRAID構造

3、mdata unit、parity unitmRAID5:更新負載均勻分布RAID構造RAID5的讀寫RAID5的讀寫RAID5規劃mEdward K. Lee, Randy H. Katz, “The Performance of Parity Placements in Disk Arrays, IEEE Transactions on Computers, vol. 42 no. 6, pp. 651-664, 1993.RAID5規劃RAID5規劃RAID0mHui-I Hsiao and David J. DeWitt, “Chained declustering: A new av

4、ailability strategy for multiprocessor database machines, Technical Report CS TR 854, University of Wisconsin, Madison, June 1989. RAID0mGang Wang, Xiaoguang Liu, Sheng Lin, Guangjun Xie, Jing Liu, “Constructing Double- and Triple-erasure-correcting Codes with High Availability Using Mirroring and P

5、arity Approaches, ICPADS2019.What is an Erasure Code?mJ. S. Plank, “Erasure Codes for Storage Applications, Tutorial of the 4th Usenix Conference on File and Storage Technologies, San Francisco, CA, Dec, 2019.When are they useful?mAnytime you need to tolerate failures.When are they useful?mAnytime y

6、ou need to tolerate failures.When are they useful?mAnytime you need to tolerate failures.When are they useful?mAnytime you need to tolerate failures.When are they useful?mAnytime you need to tolerate failures.When are they useful?mAnytime you need to tolerate failures.When are they useful?mAnytime y

7、ou need to tolerate failures.Terms & DefinitionsmNumber of data disks:nmNumber of coding disks: mmRate of a code:R = n/(n+m)mIdentifiable Failure: “ErasureThe problem, once againIssues with Erasure CodingmPerformancemEncodingmTypically O(mn), but not always.mUpdatemTypically O(m), but not always

8、.mDecodingmTypically O(mn), but not always.Issues with Erasure CodingmSpace UsagemQuantified by two of four:mData Devices: nmCoding Devices:mmSum of Devices:(n+m)mRate:R = n/(n+m)mHigher rates are more space efficient,but less fault-tolerant.Issues with Erasure CodingmFlexibilitymCan you arbitrarily

9、 add data / coding nodes?m(Can you change the rate)?mHow does this impact failure coverage?Trivial Example: ReplicationmMDSmExtremely fast encoding/decoding/update.mRate: R = 1/(m+1) - Very space inefficientmThere are many replication/based systemsm(P2P especially).Less Trivial Example: Simple Parit

10、ymPatterson D A, Gibson G A, Katz R H, “A case for redundant arrays of inexpensive disks (RAID), ACM International Conference on Management of Data, Chicago, ACM Press, 1988, pp. 109-116.mP. M. Chen, E. K. Lee, G. A. Gibson, R. H. Katz, and D. A. Patterson. RAID: High-performance, reliable secondary

11、 storage. ACM Computing Surveys, 26(2):145185, June 1994.Evaluating ParitymMDSmRate: R = n/(n+1) - Very space efficientmOptimal encoding/decoding/update:mn-1 XORs to encode & decodem2 XORs to updatemExtremely popular (RAID Level 5).mDownside: m = 1 is limited.UnfortunatelymThose are the last eas

12、y things youll see.mFor (n 1, m 1), there is no consensus on the best coding technique.mThey all have tradeoffs.Why is this such a pain?mCoding theory historically has been the purview of coding theorists.mTheir goals have had their roots elsewhere (noisy communication lines, byzantine memory system

13、s, etc).mThey are not systems programmers.m(They dont care)內容mRAID、容錯編碼mReed-Solomon編碼m二進制線性碼內容mRAID、容錯編碼mReed-Solomon編碼m二進制線性碼m陣列碼m利用組合數學工具構造容錯編碼Reed-Solomon CodesmThe only MDS coding technique for arbitrary n & m.mThis means that m erasures are always tolerated.mHave been around for decades.mE

14、xpensive.mJ. S. Plank. A tutorial on Reed-Solomon coding for fault-tolerance in RAID-like systems. Software Practice& Experience, 27(9):9951012, September 2019.Reed-Solomon CodesmOperate on binary words of data, composed of w bits, where 2w n+m.Reed-Solomon CodesmOperate on binary words of data,

15、 composed of w bits, where 2w n+m.Reed-Solomon CodesmThis means we only have to focus on words, rather than whole devices.mWord size is an issue:mIf n+m 256, we can use bytes as words.mIf n+m 65,536, we can use shorts as words.Reed-Solomon CodesmCodes are based on linear algebra.mFirst, consider the

16、 data words as a column vector D:Reed-Solomon CodesmCodes are based on linear algebra.mNext, define an (n+m)*n “Distribution Matrix B, whose first n rows are the identity matrix:Reed-Solomon CodesmCodes are based on linear algebra.mB*D equals an (n+m)*1 column vector composed of D and C (the coding

17、words):Reed-Solomon CodesmThis means that each data and coding word has a corresponding row in the distribution matrix.Reed-Solomon CodesmSuppose m nodes fail.mTo decode, we create B by deleting the rows of B that correspond to the failed nodes.Reed-Solomon CodesmSuppose m nodes fail.mTo decode, we

18、create B by deleting the rows of B that correspond to the failed nodes.mYoull note that B*D equals the survivors.Reed-Solomon CodesmNow, invert B:Reed-Solomon CodesmNow, invert B:mAnd multiply both sides of the equation by B-1Reed-Solomon CodesmNow, invert B:mAnd multiply both sides of the equation

19、by B-1mSince B-1*B = I, You have just decoded D!Reed-Solomon CodesmNow, invert B:mAnd multiply both sides of the equation by B-1mSince B-1*B = I, You have just decoded D!Reed-Solomon CodesmTo Summarize: EncodingmCreate distribution matrix B.mMultiply B by the data to create coding words.mTo Summariz

20、e: DecodingmCreate B by deleting rows of B.mInvert B.mMultiply B-1 by the surviving words to reconstruct the data.Reed-Solomon CodesmTwo Final Issues:m1: How to create B?mAll square submatrices must be invertible.mDerive from a Vandermonde MatrixJ. S. Plank and Y. Ding. Note: Correction to the 2019

21、tutorial on Reed-Solomon coding. Software Practice & Experience, 35(2):189194,2019.m#2: Will modular arithmetic work?mNO! (no multiplicative inverses)mInstead, you must use Galois Field arithmetic.Reed-Solomon Codesm(n+m)n的范德蒙矩陣m根本變換m恣意兩列可交換 m任何一列可以乘以一個非0數 m恣意兩列可做如下變換:Ci=Ci+c*Cj,c非0 Reed-Solomon

22、 PerformancemEncoding: O(mn)mMore specifically: mS (n-1)/BXOR + n/BGFMult mS = Size of a devicemBXOR = Bandwith of XOR (3 GB/s)mBGFMult = Bandwidth of Multiplication over GF(2w)mGF(28): 800 MB/smGF(216): 150 MB/sReed-Solomon PerformancemUpdate: O(m)mMore specifically: m+1 XORs and m multiplications.

23、Reed-Solomon PerformancemDecoding: O(mn) or O(n3)mLarge devices: dS (n-1)/BXOR + n/BGFMult mWhere d = number of data devices to reconstruct.mYes, theres a matrix to invert, but usually thats in the noise because dSn n3.Reed-Solomon Bottom LinemSpace Efficient: MDSmFlexible:mWorks for any value of n

24、and m.mEasy to add/subtract coding devices.mPublic-domain implementations.mSlow:mn-way dot product for each coding device.mGF multiplication slows things down.Cauchy Reed-Solomon CodesJ. Blomer, M. Kalfane, M. Karpinski, R. Karp, M. Luby, and D. Zuckerman. An XOR-based erasure-resilient coding schem

25、e. Technical Report TR-95-048, International Computer Science Institute, August 2019. #1: Use a Cauchy matrix instead of a Vandermonde matrix: Invert in O(n2).#2: Use neat projection to convert Galois Field multiplications into XORs.Kind of subtle, so well go over it.Cauchy Reed-Solomon Codesm取GF(2w

26、)中m+n個不同元素,構成X=x1, , xm,Y=y1, , ym mCauchy矩陣:元素(i, j)為1/(xi+yj)GF(2w)上運算Cauchy Reed-Solomon Codesm例:X=1, 2,Y=0, 3, 4, 5, 6 Cauchy Reed-Solomon CodesCauchy Reed-Solomon Codesmn、m固定,w增長,GC的優勢明顯參考文獻m參考資料mJ. S. Plank. Enumeration of optimal and good Cauchy matrices for Reed-Solomon coding. Technical Rep

27、ort CS-05-570, University of Tennessee, December 2019.m通訊協議防止重新發送mL. Rizzo. Effective erasure codes for reliable computer communication protocols. ACM SIGCOMM Computer Communication Review, 27(2):2436, 2019.mL. Rizzo and L. Vicisano. RMDP: an FEC-based reliable multicast protocol for wireless enviro

28、nments. Mobile Computer and Communication Review, 2(2), April 2019.參考文獻m曾經正在成為IETF規范mM. Luby, L. Vicisano, J. Gemmell, L. Rizo, M. Handley, and J. Crowcroft. Forward error correction (FEC) building block. IETF RFC 3452 (/rfc/rfc3452.txt), December 2019.mM. Luby, L. Vicisano, J. Gemmell, L. R

29、izo, M. Handley, and J. Crowcroft. The use of forward error correction(FEC) in reliable multicast. IETF RFC 3453 (/rfc/rfc3453.txt), December 2019.m加密mC. S. Jutla. Encryption modes with almost free message integrity. Lecture Notes in Computer Science, 2045, 2019.參考文獻m分布式數據構造mW. Litwin and T.

30、 Schwarz. Lh*rs: a high-availability scalable distributed data structure using Reed Solomon codes. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 237248. ACM Press, 2000.m降低無線通訊能耗mP. J. M. Havinga. Energy efficiency of error correction on wireless systems

31、, 2019.m廣域網、對等網存儲系統mJ. Kubiatowicz, D. Bindel, Y. Chen, P. Eaton, D. Geels, R. Gummadi, S. Rhea, H. Weatherspoon, W. Weimer, C. Wells, and B. Zhao. Oceanstore: An architecture for global-scale persistent storage. In Proceedings of ACM ASPLOS. ACM, November 2000.參考文獻m用于cache而不是冗余mJ. Byers, M. Luby, M

32、. Mitzenmacher, and A. Rege. A digital fountain approach to reliable distribution of bulk data. In ACM SIGCOMM 98, pages 5667, Vancouver, August 2019.mJ.W. Byers, M. Luby, and M. Mitzenmacher. Accessing multiple mirror sites in parallel: Using tornado codes to speed up downloads. In IEEE INFOCOM, pa

33、ges 275283, New York, NY, March 2019.mI. T. Rowstron and P. Druschel. Storage management and caching in PAST, a large-scale, persistent peer-topeer storage utility. In Symposium on Operating Systems Principles, pages 188201, 2019.參考文獻mTornado碼mM. Luby, M. Mitzenmacher, and A. Shokrollahi. Analysis o

34、f random processes via and-or tree evaluation. In 9th Annual ACM-SIAM Symposium on Discrete Algorithms, January 2019.mM. Luby, M. Mitzenmacher, A. Shokrollahi, D. Spielman, and V. Stemann. Practical loss-resilient codes. In 29th Annual ACM Symposium on Theory of Computing, pages 150159, 2019.mM. Lub

35、y. Benchmark comparisons of erasure codes. /luby/erasure.html, 2019.性能比RS碼好參考文獻m傳統單節點用于容錯和提高性能mS. Frolund, A. Merchant, Y. Saito, S. Spence, and A. Veitch. A decentralized algorithm for erasure-coded virtual disks. In DSN-04: International Conference on Dependable Systems and Networ

36、ks, Florence, Italy, 2019. IEEE.mG. R. Goodson, J. J. Wylie, G. R. Ganger, and M. K. Reiter. Efficient byzantine-tolerant erasure-coded storage. In DSN-04: International Conference on Dependable Systems and Networks, Florence, Italy, 2019. IEEE.mW.Wilcke et al. The IBM intelligent brick project peta

37、bytes and beyond. IBM Journal of Research and Development, to appear, April 2019.m歸檔系統mS. Rhea, C. Wells, P. Eaton, D. Geels, B. Zhao, H. Weatherspoon, and J. Kubiatowicz. Maintenance-free global data storage. IEEE Internet Computing, 5(5):4049, 2019.參考文獻m廣域網存儲:mS. Atchley, S. Soltesz, J. S. Plank,

38、M. Beck, and T. Moore. Fault tolerance in the network storage stack. In IEEE Workshop on Fault-Tolerant Parallel and Distributed Systems, Ft. Lauderdale, FL, April 2019.mR. L. Collins and J. S. Plank. Assessing the performance of erasure codes in the wide-area. In DSN-05: International Conference on

39、 Dependable Systems and Networks, Yokohama, Japan, 2019.IEEE.mH. Xia and A. A. Chien. RobuSTore: Robust performance for distributed storage systems. Technical Report CS2019-0838, University of California at San Diego, October 2019.參考文獻m對等網:mZ. Zhang and Q. Lian. Reperasure: Replication protocol usin

40、g erasure-code in peer-to-peer storage network. In 21st IEEE Symposium on Reliable Distributed Systems (SRDS02), pages 330339, October 2019.mJ. Li. PeerStreaming: A practical receiver-driven peer-to-peer media streaming system. Technical Report MSR-TR-2019-101, Microsoft Research, September 2019.mW.

41、 K. Lin, D. M. Chiu, and Y. B. Lee. Erasure code replication revisited. In PTP04: 4th International Conference on Peer-to-Peer Computing. IEEE, 2019.mL. Dairaine, J. Lacan, L. Lancerica, and J. Fimes. Content-access QoS in peer-to-peer networks using a fast MDS erasure code. Computer Communications,

42、 28(15):17781790, September 2019.參考文獻m內容分發系統:mJ. Byers, M. Luby, M. Mitzenmacher, and A. Rege. A digital fountain approach to reliable distribution of bulk data. In ACM SIGCOMM 98, pages 5667, Vancouver, August 2019.mM. Mitzenmacher. Digital fountains: A survey and look forward,.In 2019 IEEE Informa

43、tion Theory Workshop, San Antonio, October 2019.內容mRAID、容錯編碼mReed-Solomon編碼m二進制線性碼m陣列碼m利用組合數學工具構造容錯編碼內容mRAID、容錯編碼mReed-Solomon編碼m二進制線性碼m陣列碼m利用組合數學工具構造容錯編碼Binary Linear CodesmLisa Hellerstein, Garth A Gibson, Richard M Karp, Randy H Katz, David A Patterson, “Coding techniques for handling failures in

44、 large disk arrays, Algorithmica, vol. 12, no. 2/3, pp. 182-208, 1994. m數據單元劃分為重疊的校驗組,每個校驗組相當于一個RAID4/5Binary Linear Codesm容錯編碼評價規范mMTTDLmcheck disk overheadmupdate penaltymgroup sizem擴展性t維編碼m磁盤陳列為t維方陣,每維最后一行為校驗盤m1維:RAID4、RAID5,校驗磁盤開銷,1/Gm2維:2/G,但留意,G是校驗組大小,不是磁盤數,G2=N,因此是2/sqrt(N)mt維:t/G,t/N1/t,t大于3

45、時,冗余率太差 t維編碼校驗矩陣表示法m數據位和校驗位組成一個向量碼字mparity check matrixH=P | Imc(k + c),k數據位數,c校驗位數mI為cc的單位矩陣,P為ck的0/1矩陣m表示校驗計算公式,對碼字X,應有HX=0mP的列數據盤,I的列校驗盤m行校驗組m元素:1參與校驗組,0未參與校驗矩陣表示法容錯才干的斷定m以下命題是等價的mH能恢復任何的t擦除缺點mH能檢測恣意的t錯誤m恣意兩個碼字的間隔 distance不一樣的位的數目,至少為t+1mH的恣意t列在GF2上線性無關m因此,缺點盤能否可恢復對應列能否線性無關!m線性無關一切列向量的和或恣意非空子集的和不

46、為0 校驗矩陣還可以表示其他目的m校驗開銷:c/km校驗組大小:行的權重1的個數m更新代價:列的權重m一種常見的擴展方式m新添加的磁盤全部清0無需重新計算校驗,而H相應的添加一列mMTTDL用校驗矩陣表示比較困難重構m假定m個盤缺點,重排校驗矩陣H=A | B,X=d | y,其中B和y對應缺點磁盤m重構求解ymHX=0Ad+By=0求解線性方程組Ad=Bym此方程組有獨一解可正確重構的充要條件是B的各列線性無關m無需解整個方程組,抽取出缺點校驗組對應的行Ad=By,計算y=(B)-1Ad即可 雙容錯和三容錯編碼 m最優冗余full-2和full-3:一切能夠的權重為23的列組成的校驗矩陣mb

47、ad t+1缺點:一個數據單元及其t個校驗單元m高可靠性編碼:一個t-erasure-correcting code可恢復除bad t+1缺點之外的一切t+1缺點m圖表示法mfull-2、full-3:完全圖m二維:完全二部圖2擦除碼對比線性碼比較線性碼比較t3mfull-t不具備t容錯才干,t維碼冗余率太差m定理:H=P | I 為校驗矩陣,假設P的一切列分量為t,且對于P的恣意兩行,P至多有1列在兩行上均為1,那么編碼能容t錯mS為P的j列的集合(j2,數據方陣中選取斜率為0、1、t-1這t個方向組織校驗組mM. Blaum, J. Bruck, and A. Vardy, “MDS ar

48、ray codes with independent parity symbols, IEEE Trans. on Information Theory, Vol. 42, No. 2, Mar, 2019, pp. 529-542.m并非對一切的n、m=t可保證t容錯才干,上文中給出了適用的參數RDPm雙容錯程度碼,(p-1)*(p-1)的數據陣列,p為素數m校驗相關程度校驗單元作為“數據參與對角線校驗陣列碼和線性碼的關系00010203041011121314202122232430313233344041424344Q0Q1Q2Q3Q4P0P1P2P3P4Liberation碼J. S.

49、 Plank, “The RAID-6 Liberation Codes, 6th USENIX Conference on File and Storage Technologies, San Francisco, 2019, pp. 97110.編碼矩陣1的個數最少,但不意味著校驗計算性能最優其他程度碼Cheng Huang, Lihao Xu, “STAR: An Efficient Coding Scheme for Correcting Triple Storage Node Failures, 4th USENIX Conference on File and Storage Te

50、chnologies San Francisco, 2019, pp. 197-210.Chong-Won Park and Jin-Won Park, “A multiple disk failure recovery scheme in RAID systems, Journal of Systems Architecture, vol. 50, pp. 169175, 2019.B-CODEL. Xu, V. Bohossian, J. Bruck, and D.G. Wagner, Low-Density MDS Codes and Factors of Complete Graphs

51、, IEEE Trans. Information Theory, pages 1817-1826, IEEE, 2019.雙容錯MDS垂直碼基于完全圖的完全1-因子分解P1F無素數限制圖論領域的一個猜測:對一切偶數n,Kn存在P1F,每個P1F又能構造兩個規模的B-CODE,一切規模的陣列都能構造B-CODE基于full-2碼B-CODEX-CodeL. Xu and J. Bruck, “X-Code: MDS Array Codes with Optimal Encoding, IEEE Trans. on Information Theory, Vol. 45, No. 1, Jan,

52、 2019, pp.272-276.雙容錯“垂直碼每個磁盤都是既放置數據單元,又放置校驗單元EVENODD是“程度碼數據單元和校驗單元放置在不同磁盤上校驗方向:1和-1X-Codem構造表示其他垂直碼RM2:C. Park, “Efficient placement of parity and data to tolerate two disk failures in disk array systems, IEEE Trans. Parallel Distribut.Syst., vol. 6, no. 11, pp. 1177-1184, 2019.不保證MDS構造方法不是確定的,要進展搜

53、索WEAVER: J. L. Hafner, “WEAVER Codes: Highly Fault Tolerant Erasure Codes for Storage Systems, FAST-2019: 4th Usenix Conference on File and Storage Technologies, December, 2019.非MDS碼,冗余率最好50%!條紋組小,部分性好,分布式存儲系統下缺點方式性能好也是搜索可行編碼,最高容錯12混合碼mDH1/DH2mNam-Kyu Lee, Sung-Bong Yang, Kyoung-Woo Lee, Efficient parity placement schemes for tolerating up to two disk failures in disk arrays, Journal of Systems Architecture, 2000, 46(15): 3-1402.mDH1混合碼mDH2HDD1/HDD2Chih-Shing Tau and Tzone-I Wang, “Efficient parity placement sch

溫馨提示

  • 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請下載最新的WinRAR軟件解壓。
  • 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請聯系上傳者。文件的所有權益歸上傳用戶所有。
  • 3. 本站RAR壓縮包中若帶圖紙,網頁內容里面會有圖紙預覽,若沒有圖紙預覽就沒有圖紙。
  • 4. 未經權益所有人同意不得將文件中的內容挪作商業或盈利用途。
  • 5. 人人文庫網僅提供信息存儲空間,僅對用戶上傳內容的表現方式做保護處理,對用戶上傳分享的文檔內容本身不做任何修改或編輯,并不能對任何下載內容負責。
  • 6. 下載文件中如有侵權或不適當內容,請與我們聯系,我們立即糾正。
  • 7. 本站不保證下載資源的準確性、安全性和完整性, 同時也不承擔用戶因使用這些下載資源對自己和他人造成任何形式的傷害或損失。

評論

0/150

提交評論