




版權說明:本文檔由用戶提供并上傳,收益歸屬內容提供方,若內容存在侵權,請進行舉報或認領
文檔簡介
1、August 14, 2022Data Mining: Concepts and Techniques1Data Mining: Concepts and Techniques Chapter 3 Jiawei HanDepartment of Computer Science University of Illinois at Urbana-Champaign 2006 Jiawei Han and Micheline Kamber, All rights reservedAugust 14, 2022Data Mining: Concepts and Techniques2August 1
2、4, 2022Data Mining: Concepts and Techniques3Chapter 3: Data Warehousing and OLAP Technology: An OverviewWhat is a data warehouse? A multi-dimensional data modelData warehouse architectureData warehouse implementationFrom data warehousing to data miningAugust 14, 2022Data Mining: Concepts and Techniq
3、ues4What is Data Warehouse?Defined in many different ways, but not rigorously.A decision support database that is maintained separately from the organizations operational databaseSupport information processing by providing a solid platform of consolidated, historical data for analysis.“A data wareho
4、use is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of managements decision-making process.”W. H. InmonData warehousing:The process of constructing and using data warehousesAugust 14, 2022Data Mining: Concepts and Techniques5Data WarehouseSubject-Orient
5、edOrganized around major subjects, such as customer, product, salesFocusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processingProvide a simple and concise view around particular subject issues by excluding data that are not useful in the decis
6、ion support processAugust 14, 2022Data Mining: Concepts and Techniques6Data WarehouseIntegratedConstructed by integrating multiple, heterogeneous data sourcesrelational databases, flat files, on-line transaction recordsData cleaning and data integration techniques are applied.Ensure consistency in n
7、aming conventions, encoding structures, attribute measures, etc. among different data sourcesE.g., Hotel price: currency, tax, breakfast covered, etc.When data is moved to the warehouse, it is converted. August 14, 2022Data Mining: Concepts and Techniques7Data WarehouseTime VariantThe time horizon f
8、or the data warehouse is significantly longer than that of operational systemsOperational database: current value dataData warehouse data: provide information from a historical perspective (e.g., past 5-10 years)Every key structure in the data warehouseContains an element of time, explicitly or impl
9、icitlyBut the key of operational data may or may not contain “time element”August 14, 2022Data Mining: Concepts and Techniques8Data WarehouseNonvolatileA physically separate store of data transformed from the operational environmentOperational update of data does not occur in the data warehouse envi
10、ronmentDoes not require transaction processing, recovery, and concurrency control mechanismsRequires only two operations in data accessing: initial loading of data and access of dataAugust 14, 2022Data Mining: Concepts and Techniques9Data Warehouse vs. Heterogeneous DBMSTraditional heterogeneous DB
11、integration: A query driven approachBuild wrappers/mediators on top of heterogeneous databases When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for individual heterogeneous sites involved, and the results are integrated into a global a
12、nswer setComplex information filtering, compete for resourcesData warehouse: update-driven, high performanceInformation from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysisAugust 14, 2022Data Mining: Concepts and Techniques10Data Warehouse vs. Op
13、erational DBMSOLTP (on-line transaction processing)Major task of traditional relational DBMSDay-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.OLAP (on-line analytical processing)Major task of data warehouse systemData analysis and decision m
14、akingDistinct features (OLTP vs. OLAP):User and system orientation: customer vs. marketData contents: current, detailed vs. historical, consolidatedDatabase design: ER + application vs. star + subjectView: current, local vs. evolutionary, integratedAccess patterns: update vs. read-only but complex q
15、ueriesAugust 14, 2022Data Mining: Concepts and Techniques11OLTP vs. OLAPAugust 14, 2022Data Mining: Concepts and Techniques12Why Separate Data Warehouse?High performance for both systemsDBMS tuned for OLTP: access methods, indexing, concurrency control, recoveryWarehousetuned for OLAP: complex OLAP
16、queries, multidimensional view, consolidationDifferent functions and different data:missing data: Decision support requires historical data which operational DBs do not typically maintaindata consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sourcesdata
17、 quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciledNote: There are more and more systems which perform OLAP analysis directly on relational databasesAugust 14, 2022Data Mining: Concepts and Techniques13Chapter 3: Data Warehousing
18、 and OLAP Technology: An OverviewWhat is a data warehouse? A multi-dimensional data modelData warehouse architectureData warehouse implementationFrom data warehousing to data miningAugust 14, 2022Data Mining: Concepts and Techniques14From Tables and Spreadsheets to Data CubesA data warehouse is base
19、d on a multidimensional data model which views data in the form of a data cubeA data cube, such as sales, allows data to be modeled and viewed in multiple dimensionsDimension tables, such as item (item_name, brand, type), or time(day, week, month, quarter, year) Fact table contains measures (such as
20、 dollars_sold) and keys to each of the related dimension tablesIn data warehousing literature, an n-D base cube is called a base cuboid. The top most 0-D cuboid, which holds the highest-level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.August 14, 2022Data Mi
21、ning: Concepts and Techniques15Cube: A Lattice of Cuboidstime,itemtime,item,locationtime, item, location, supplieralltimeitemlocationsuppliertime,locationtime,supplieritem,locationitem,supplierlocation,suppliertime,item,suppliertime,location,supplieritem,location,supplier0-D(apex) cuboid1-D cuboids2
22、-D cuboids3-D cuboids4-D(base) cuboidAugust 14, 2022Data Mining: Concepts and Techniques16Conceptual Modeling of Data WarehousesModeling data warehouses: dimensions & measuresStar schema: A fact table in the middle connected to a set of dimension tables Snowflake schema: A refinement of star schema
23、where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflakeFact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation August 14, 2022Data Minin
24、g: Concepts and Techniques17Example of Star Schema time_keydayday_of_the_weekmonthquarteryeartimelocation_keystreetcitystate_or_provincecountrylocationSales Fact Table time_key item_key branch_key location_key units_sold dollars_sold avg_salesMeasuresitem_keyitem_namebrandtypesupplier_typeitembranch
25、_keybranch_namebranch_typebranchAugust 14, 2022Data Mining: Concepts and Techniques18Example of Snowflake Schematime_keydayday_of_the_weekmonthquarteryeartimelocation_keystreetcity_keylocationSales Fact Table time_key item_key branch_key location_key units_sold dollars_sold avg_salesMeasuresitem_key
26、item_namebrandtypesupplier_keyitembranch_keybranch_namebranch_typebranchsupplier_keysupplier_typesuppliercity_keycitystate_or_provincecountrycityAugust 14, 2022Data Mining: Concepts and Techniques19Example of Fact Constellationtime_keydayday_of_the_weekmonthquarteryeartimelocation_keystreetcityprovi
27、nce_or_statecountrylocationSales Fact Tabletime_key item_key branch_key location_key units_sold dollars_sold avg_salesMeasuresitem_keyitem_namebrandtypesupplier_typeitembranch_keybranch_namebranch_typebranchShipping Fact Tabletime_key item_key shipper_key from_location to_location dollars_cost units
28、_shippedshipper_keyshipper_namelocation_keyshipper_typeshipperAugust 14, 2022Data Mining: Concepts and Techniques20Cube Definition Syntax (BNF) in DMQLCube Definition (Fact Table)define cube : Dimension Definition (Dimension Table)define dimension as ()Special Case (Shared Dimension Tables)First tim
29、e as “cube definition”define dimension as in cube August 14, 2022Data Mining: Concepts and Techniques21Defining Star Schema in DMQLdefine cube sales_star time, item, branch, location:dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)define dimension time a
30、s (time_key, day, day_of_week, month, quarter, year)define dimension item as (item_key, item_name, brand, type, supplier_type)define dimension branch as (branch_key, branch_name, branch_type)define dimension location as (location_key, street, city, province_or_state, country)August 14, 2022Data Mini
31、ng: Concepts and Techniques22Defining Snowflake Schema in DMQLdefine cube sales_snowflake time, item, branch, location:dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)define dimension time as (time_key, day, day_of_week, month, quarter, year)define dimen
32、sion item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type)define dimension branch as (branch_key, branch_name, branch_type)define dimension location as (location_key, street, city(city_key, province_or_state, country)August 14, 2022Data Mining: Concepts and Techniques23Def
33、ining Fact Constellation in DMQLdefine cube sales time, item, branch, location:dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)define dimension time as (time_key, day, day_of_week, month, quarter, year)define dimension item as (item_key, item_name, brand
34、, type, supplier_type)define dimension branch as (branch_key, branch_name, branch_type)define dimension location as (location_key, street, city, province_or_state, country)define cube shipping time, item, shipper, from_location, to_location:dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
35、define dimension time as time in cube salesdefine dimension item as item in cube salesdefine dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)define dimension from_location as location in cube salesdefine dimension to_location as location in cube sale
36、sAugust 14, 2022Data Mining: Concepts and Techniques24Measures of Data Cube: Three CategoriesDistributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioningE.g., count(), sum(), min(), max()A
37、lgebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate functionE.g., avg(), min_N(), standard_deviation()Holistic: if there is no constant bound on the storage size needed to describe a
38、subaggregate. E.g., median(), mode(), rank()August 14, 2022Data Mining: Concepts and Techniques25A Concept Hierarchy: Dimension (location)allEuropeNorth_AmericaMexicoCanadaSpainGermanyVancouverM. WindL. Chan.allregionofficecountryTorontoFrankfurtcityAugust 14, 2022Data Mining: Concepts and Technique
39、s26View of Warehouses and HierarchiesSpecification of hierarchiesSchema hierarchyday month quarter; week yearSet_grouping hierarchy1.10 = minsupMotivationOnly a small portion of cube cells may be “above the water in a sparse cubeOnly calculate “interesting” cellsdata above certain thresholdAvoid exp
40、losive growth of the cubeSuppose 100 dimensions, only 1 base cell. How many aggregate cells if count = 1? What about count = 2?August 14, 2022Data Mining: Concepts and Techniques47Indexing OLAP Data: Bitmap IndexIndex on a particular columnEach value in the column has a bit vector: bit-op is fastThe
41、 length of the bit vector: # of records in the base tableThe i-th bit is set if the i-th row of the base table has the value for the indexed columnnot suitable for high cardinality domainsBase tableIndex on RegionIndex on TypeAugust 14, 2022Data Mining: Concepts and Techniques48Indexing OLAP Data: J
42、oin IndicesJoin index: JI(R-id, S-id) where R (R-id, ) S (S-id, )Traditional indices map the values to a list of record idsIt materializes relational join in JI file and speeds up relational join In data warehouses, join index relates the values of the dimensions of a start schema to rows in the fac
43、t table.E.g. fact table: Sales and two dimensions city and productA join index on city maintains for each distinct city a list of R-IDs of the tuples recording the Sales in the city Join indices can span multiple dimensionsAugust 14, 2022Data Mining: Concepts and Techniques49Efficient Processing OLA
44、P QueriesDetermine which operations should be performed on the available cuboidsTransform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection + projectionDetermine which materialized cuboid(s) should be selected for OLAP op.Let the query to be processed be on bran
45、d, province_or_state with the condition “year = 2004”, and there are 4 materialized cuboids available:1) year, item_name, city 2) year, brand, country3) year, brand, province_or_state4) item_name, province_or_state where year = 2004Which should be selected to process the query?Explore indexing struc
46、tures and compressed vs. dense array structs in MOLAPAugust 14, 2022Data Mining: Concepts and Techniques50Chapter 3: Data Warehousing and OLAP Technology: An OverviewWhat is a data warehouse? A multi-dimensional data modelData warehouse architectureData warehouse implementationFrom data warehousing
47、to data miningAugust 14, 2022Data Mining: Concepts and Techniques51Data Warehouse UsageThree kinds of data warehouse applicationsInformation processingsupports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphsAnalytical processingmultidimensional analysis
48、 of data warehouse datasupports basic OLAP operations, slice-dice, drilling, pivotingData miningknowledge discovery from hidden patterns supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization toolsAugust
49、14, 2022Data Mining: Concepts and Techniques52From On-Line Analytical Processing (OLAP) to On Line Analytical Mining (OLAM)Why online analytical mining?High quality of data in data warehousesDW contains integrated, consistent, cleaned dataAvailable information processing structure surrounding data w
50、arehousesODBC, OLEDB, Web accessing, service facilities, reporting and OLAP toolsOLAP-based exploratory data analysisMining with drilling, dicing, pivoting, etc.On-line selection of data mining functionsIntegration and swapping of multiple mining functions, algorithms, and tasksAugust 14, 2022Data M
51、ining: Concepts and Techniques53An OLAM System ArchitectureData WarehouseMeta DataMDDBOLAMEngineOLAPEngineUser GUI APIData Cube APIDatabase APIData cleaningData integrationLayer3OLAP/OLAMLayer2MDDBLayer1Data RepositoryLayer4User InterfaceFiltering&IntegrationFilteringDatabasesMining queryMining resu
52、ltAugust 14, 2022Data Mining: Concepts and Techniques54Chapter 3: Data Warehousing and OLAP Technology: An OverviewWhat is a data warehouse? A multi-dimensional data modelData warehouse architectureData warehouse implementationFrom data warehousing to data miningSummaryAugust 14, 2022Data Mining: Co
53、ncepts and Techniques55Summary: Data Warehouse and OLAP TechnologyWhy data warehousing?A multi-dimensional model of a data warehouseStar schema, snowflake schema, fact constellationsA data cube consists of dimensions & measuresOLAP operations: drilling, rolling, slicing, dicing and pivotingData ware
54、house architectureOLAP servers: ROLAP, MOLAP, HOLAPEfficient computation of data cubesPartial vs. full vs. no materializationIndexing OALP data: Bitmap index and join indexOLAP query processing From OLAP to OLAM (on-line analytical mining)August 14, 2022Data Mining: Concepts and Techniques56References (I)S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S. Sarawagi. On the computation of multidimensional aggregates. VLDB96D. Agrawal, A. E. Abbadi, A. Singh, and T. Yurek. Efficient vie
溫馨提示
- 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請下載最新的WinRAR軟件解壓。
- 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請聯系上傳者。文件的所有權益歸上傳用戶所有。
- 3. 本站RAR壓縮包中若帶圖紙,網頁內容里面會有圖紙預覽,若沒有圖紙預覽就沒有圖紙。
- 4. 未經權益所有人同意不得將文件中的內容挪作商業或盈利用途。
- 5. 人人文庫網僅提供信息存儲空間,僅對用戶上傳內容的表現方式做保護處理,對用戶上傳分享的文檔內容本身不做任何修改或編輯,并不能對任何下載內容負責。
- 6. 下載文件中如有侵權或不適當內容,請與我們聯系,我們立即糾正。
- 7. 本站不保證下載資源的準確性、安全性和完整性, 同時也不承擔用戶因使用這些下載資源對自己和他人造成任何形式的傷害或損失。
最新文檔
- 《深入了解硫酸生產》課件
- 《阿里巴巴商業模式分析》課件
- 鐵路旅客運輸服務高鐵時代的客運服務課件
- 《三峽人家風光覽》課件
- 房屋買賣糾紛調解協議
- 鐵道機車專業教學鄭州鐵路毛乾亞課件
- 鐵路班組管理建設班組創新文化課件
- 鐵路市場營銷產品生命周期概述課件
- 鐵路線路安全防護邵鵬飛年課件
- 河底固定電纜施工方案
- 應急急救知識課件
- 慢性病管理與護理方法試題及答案
- 定向培養協議書模板
- 基于CRISPR-Cas9技術探索敲除RAB7A增強肺癌對吉西他濱敏感性的機制研究
- 社區文化活動服務行業跨境出海戰略研究報告
- 汽車背戶協議合同
- 碳中和目標下的公路建設策略-全面剖析
- 2025年山東省東營市廣饒縣一中中考一模英語試題(原卷版+解析版)
- 形勢與政策(貴州財經大學)知到智慧樹章節答案
- 3號鋼筋加工場桁吊安裝方案
- 部編版(統編)六年級語文下冊文學常識及文化常識(共4頁)
評論
0/150
提交評論