




Clustering and Trees
2024/9/9

Outline
- Microarrays
- Hierarchical Clustering
- K-Means Clustering

Applications of Clustering
- Viewing and analyzing vast amounts of biological data as a whole set can be perplexing.
- It is easier to interpret the data if they are partitioned into clusters combining similar data points.

Inferring Gene Functionality
- Researchers want to know the functions of newly sequenced genes.
- Simply comparing the new gene sequences to known DNA sequences often does not give away the function of a gene.
- For 40% of sequenced genes, functionality cannot be ascertained by only comparing to sequences of other known genes.
- Microarrays allow biologists to infer gene function even when sequence similarity alone is insufficient to infer function.

Microarrays and Expression Analysis
- Microarrays measure the activity (expression level) of the genes under varying conditions/time points.
- Expression level is estimated by measuring the amount of mRNA for that particular gene.
  - A gene is active if it is being transcribed.
  - More mRNA usually indicates more gene activity.

Microarray Experiments
- Produce cDNA from mRNA (DNA is more stable).
- Attach a phosphor (fluorescent label) to the cDNA to see when a particular gene is expressed.
- Different color phosphors are available to compare many samples at once.
- Hybridize cDNA over the microarray.
- Scan the microarray with a phosphor-illuminating laser.
- Illumination reveals transcribed genes.
- Scan the microarray multiple times for the different color phosphors.

Microarray Experiments (cont'd)
[Figure]
- Phosphors can be added here instead.
- Then, instead of staining, laser illumination can be used.

Using Microarrays
- Each box represents one gene's expression over time.
- Track the sample over a period of time to see gene expression over time.
- Track two different samples under the same conditions to see the difference in gene expressions.

Using Microarrays (cont'd)
- Green: expressed only from control
- Red: expressed only from experimental cell
- Yellow: equally expressed in both samples
- Black: NOT expressed in either control or experimental cells

Microarray Data
- Microarray data are usually transformed into an intensity matrix (below).
- The intensity matrix allows biologists to make correlations between different genes (even if they are dissimilar) and to understand how gene functions might be related.
- Clustering comes into play.

Time:     Time X   Time Y   Time Z
Gene 1    10       8        10
Gene 2    10       0        9
Gene 3    4        8.6      3
Gene 4    7        8        3
Gene 5    1        2        3

Intensity (expression level) of gene at measured time

Clustering of Microarray Data
- Plot each datum as a point in N-dimensional space.
- Make a distance matrix for the distance between every two gene points in the N-dimensional space.
- Genes with a small distance share the same expression characteristics and might be functionally related or similar.
- Clustering reveals groups of functionally related genes.

Clustering of Microarray Data (cont'd)
[Figure: clusters]

Homogeneity and Separation Principles
- Homogeneity: Elements within a cluster are close to each other.
- Separation: Elements in different clusters are further apart from each other.
- ...clustering is not an easy task!
- Given these points, a clustering algorithm might make two distinct clusters as follows.
[Figure]

Bad Clustering
- This clustering violates both Homogeneity and Separation principles.
- Close distances from points in separate clusters
- Far distances from points in the same cluster

Good Clustering
This clustering satisfies both
Homogeneity and Separation principles.

Clustering Techniques
- Agglomerative: Start with every element in its own cluster, and iteratively join clusters together.
- Divisive: Start with one cluster and iteratively divide it into smaller clusters.
- Hierarchical: Organize elements into a tree; leaves represent genes, and the length of the paths between leaves represents the distances between genes. Similar genes lie within the same subtrees.

Hierarchical Clustering
[Figure]

Hierarchical Clustering: Example
[Figures: example shown over five slides]

Hierarchical Clustering (cont'd)
- Hierarchical Clustering is often used to reveal evolutionary history.

Hierarchical Clustering Algorithm
HierarchicalClustering(d, n)
  Form n clusters, each with one element
  Construct a graph T by assigning one vertex to each cluster
  while there is more than one cluster
    Find the two closest clusters C1 and C2
    Merge C1 and C2 into a new cluster C with |C1| + |C2| elements
    Compute the distance from C to all other clusters
    Add a new vertex C to T and connect it to vertices C1 and C2
    Remove the rows and columns of d corresponding to C1 and C2
    Add a row and column to d corresponding to the new cluster C
  return T
- The algorithm takes an n x n distance matrix d of pairwise distances between points as an input.
- Different ways to define distances between clusters may lead to different clusterings.
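The agglomerative loop in the pseudocode can be sketched in Python as follows. This is a minimal illustration, not the slides' own implementation: it returns the list of merges instead of building the graph T, and it recomputes cluster distances from the original matrix on every pass rather than updating rows and columns of d in place, which is simpler but slower. The function names, the two cluster-distance choices (`d_min`, the smallest pairwise distance, and `d_avg`, the average pairwise distance) and the parsing of the intensity values from the flattened Microarray Data table are assumptions made for the example.

```python
import math
from itertools import combinations

# Intensity matrix as read from the flattened "Microarray Data" table
# (rows: genes 1-5; columns: times X, Y, Z). The parsing is an assumption.
intensity = [
    (10, 8, 10),
    (10, 0, 9),
    (4, 8.6, 3),
    (7, 8, 3),
    (1, 2, 3),
]

def distance_matrix(points):
    """n x n matrix of pairwise Euclidean distances between points."""
    return [[math.dist(p, q) for q in points] for p in points]

def d_min(c1, c2, d):
    """Smallest distance between any pair of elements of the two clusters."""
    return min(d[x][y] for x in c1 for y in c2)

def d_avg(c1, c2, d):
    """Average distance over all pairs of elements of the two clusters."""
    return sum(d[x][y] for x in c1 for y in c2) / (len(c1) * len(c2))

def hierarchical_clustering(d, n, cluster_distance=d_min):
    """Merge the two closest clusters until one remains.

    Clusters are tuples of point indices. Returns the merge history as a
    list of (C1, C2, distance) triples, one per iteration.
    """
    clusters = [(i,) for i in range(n)]
    merges = []
    while len(clusters) > 1:
        # Find the two closest clusters under the chosen cluster distance.
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(pair[0], pair[1], d),
        )
        merges.append((c1, c2, cluster_distance(c1, c2, d)))
        # Replace C1 and C2 by the merged cluster C.
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
    return merges

d = distance_matrix(intensity)
merges = hierarchical_clustering(d, len(intensity))
for c1, c2, dist in merges:
    print(c1, c2, round(dist, 3))
```

On this data both choices first merge genes 3 and 4 (indices 2 and 3) and then add gene 5. In the third step `d_min` joins gene 1 to that cluster, whereas `d_avg` joins genes 1 and 2 instead, so the cluster distance does change the resulting tree.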
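Hierarchical Clustering: Recomputing Distances
- $d_{\min}(C, C^*) = \min_{x \in C,\, y \in C^*} d(x, y)$
  Distance between two clusters is the smallest distance between any pair of their elements.
- $d_{avg}(C, C^*) = \frac{1}{|C^*|\,|C|} \sum_{x \in C,\, y \in C^*} d(x, y)$
  Distance between two clusters is the average distance between all pairs of their elements.

Hierarchical Clustering Example: Microarray Data
[Figure]

Assessing the Similarity of Expression Patterns
- How can the similarity, or distance, between two rows of data be quantified?
- One option is the Euclidean distance.
- Here the Pearson correlation coefficient r is used; it lies between -1 and 1.
- If r = 1 for two genes, their expression patterns match closely.
- If r = -1 for two genes, their expression patterns also match closely, but in opposite directions (one rises while the other falls).
- If r = 0, the expression patterns are uncorrelated.

Data Standardization
- Compute the mean and standard deviation for genes 2 and 10.
- Subtract the mean, then divide by the standard deviation, to obtain the standardized data for each row.

Computing the Pearson Correlation Coefficient
- Take the inner product of the two standardized vectors, then divide by the number of elements.

Analysis
- The expression ratios of genes 2 and 11, and of genes 6 and 10, are exactly opposite to each other, so the correlation coefficients r(2,11) and r(6,10) should be -1.

Pairwise Correlation Coefficients Between Genes After Standardization
[Figure]

Clustering by Correlation Coefficient (Hierarchical Clustering)
1. Compute the distance (correlation coefficient) between every pair of elements and build a distance matrix. Each element is a cluster containing only itself.
2. Find the two clusters with the smallest distance (largest correlation coefficient).
3. Merge these two clusters into a new cluster. The new cluster replaces the two, and all distances are recomputed to update the similarity matrix.
4. Repeat steps 2 and 3 until all clusters have been merged into one.

Iteration
- First, r(5,10) = 1 is found, so genes 5 and 10 are merged into one cluster; the distance matrix must then be recomputed.

Dendrogram
[Figure]

Squared Error Distortion
Given a data point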
v and a set of points X, define the distance from v to X,
$d(v, X)$, as the (Euclidean) distance from v to the closest point in X.
Given a set of n data points $V = \{v_1, \ldots, v_n\}$ and a set of k points X, define the Squared Error Distortion
$d(V, X) = \frac{1}{n} \sum_{i=1}^{n} d(v_i, X)^2$
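As a small illustration of the two definitions, the following Python sketch computes d(v, X) and the squared error distortion d(V, X) directly. The function names and the toy points are hypothetical, chosen only for this example.

```python
import math

def distance_to_set(v, centers):
    """d(v, X): Euclidean distance from v to the closest point of X."""
    return min(math.dist(v, x) for x in centers)

def squared_error_distortion(points, centers):
    """d(V, X): mean squared distance from each point to its closest center."""
    return sum(distance_to_set(v, centers) ** 2 for v in points) / len(points)

# Hypothetical toy data: the four corners of a 10 x 2 rectangle.
V = [(0, 0), (0, 2), (10, 0), (10, 2)]

# Two centers, one at the middle of each short side.
two_centers = squared_error_distortion(V, [(0, 1), (10, 1)])
# A single center at the middle of the rectangle.
one_center = squared_error_distortion(V, [(5, 1)])
print(two_centers, one_center)
```

With two centers every point is at distance 1 from its closest center, so the distortion is about 1. With the single center every squared distance is 26, so the distortion is far larger.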
K-Means Clustering Problem: Formulation
- Input: A set, V, consisting of n points and a parameter k
- Output: A set X consisting of k points (cluster centers) that minimizes the squared error distortion d(V, X) over all possible choices of X.

1-Means Clustering Problem: an Easy Case
- Input: A set, V, consisting of n points
- Output: A single point x (cluster center) that minimizes the squared error distortion d(V, x) over all possible choices of x.
- The 1-Means clustering problem is easy.
- However, it becomes very difficult (NP-hard in general) for more than one center.
- An efficient heuristic method for K-Means clustering is the Lloyd algorithm.
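A short derivation suggests why the single-center case is easy. It is sketched here under the assumption that d is the Euclidean distance, so that d(V, x) is the mean of squared norms. Setting the gradient with respect to x to zero gives the center of gravity of V:

```latex
\begin{align*}
d(V, x) &= \frac{1}{n} \sum_{i=1}^{n} \lVert v_i - x \rVert^2, \\
\nabla_x\, d(V, x) &= -\frac{2}{n} \sum_{i=1}^{n} (v_i - x) = 0
\quad\Longrightarrow\quad
x = \frac{1}{n} \sum_{i=1}^{n} v_i .
\end{align*}
```

Since d(V, x) is strictly convex in x, this stationary point is the unique global minimum.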
K-Means Clustering: Lloyd Algorithm
Lloyd Algorithm
  Arbitrarily assign the k cluster centers
  while the cluster centers keep changing
    Assign each data point to the cluster Ci corresponding to the closest cluster representative (center) (1 ≤ i ≤ k)
    After the assignment of all data points, compute new cluster representatives according to the center of gravity of each cluster, that is, the new cluster representative is $\frac{1}{|C|} \sum_{v \in C} v$ for every cluster C
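The two alternating steps can be sketched in Python as below. This is a minimal illustration under stated assumptions, not a reference implementation: the initial centers are passed in by the caller, a `max_iter` cap is added as a safeguard, and a cluster that receives no points keeps its previous center. None of these details come from the slides, and the toy points are hypothetical.

```python
import math

def centroid(cluster):
    """Center of gravity of a non-empty list of points."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def lloyd(points, centers, max_iter=100):
    """Alternate assignment and center updates until the centers stop changing."""
    centers = list(centers)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its closest center.
        clusters = [[] for _ in centers]
        for v in points:
            i = min(range(len(centers)), key=lambda j: math.dist(v, centers[j]))
            clusters[i].append(v)
        # Update step: move each center to the center of gravity of its cluster.
        # Assumption: an empty cluster keeps its old center.
        new_centers = [
            centroid(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Hypothetical toy data: the four corners of a 10 x 2 rectangle.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
centers_a, _ = lloyd(points, [(0, 0), (0, 2)])
centers_b, _ = lloyd(points, [(0, 0), (10, 0)])
print(centers_a, centers_b)
```

The two runs differ only in their starting centers. The first stops at centers (5, 0) and (5, 2), while the second reaches (0, 1) and (10, 1), which fit the points much more tightly, so the outcome depends on the initial choice.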
*This may lead to merely a locally optimal clustering.

[Figures: four slides labeled x1, x2, x3]

Conservative K-Means Algorithm
- Lloyd algorithm is fast, but in each iteration it moves many data points, not necessarily causing better convergence.
- A more conservative method would be to move one point at a time, only if it improves the overall clustering cost.
- The smaller the clustering cost of