




An Introduction to Data Mining
Discovering hidden value in your data warehouse

Overview

Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by retrospective tools typical of decision support systems. Data mining tools can answer business questions that traditionally were too time-consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations.

Most companies already collect and refine massive quantities of data. Data mining techniques can be implemented rapidly on existing software and hardware platforms to enhance the value of existing information resources, and can be integrated with new products and systems as they are brought on-line. When implemented on high performance client/server or parallel processing computers, data mining tools can analyze massive databases to deliver answers to questions such as, "Which clients are most likely to respond to my next promotional mailing, and why?"

This white paper provides an introduction to the basic technologies of data mining. Examples of profitable applications illustrate its relevance to today's business environment as well as a basic description of how data warehouse architectures can evolve to deliver the value of data mining to end users.

The Foundations of Data Mining

Data mining techniques are the result of a long process of research and product development. This evolution began when business data was first stored on computers, continued with improvements in data access, and more recently, generated technologies that allow users to navigate through their data in real time. Data mining takes this evolutionary process beyond retrospective data access and navigation to prospective and proactive information delivery. Data mining is ready for application in the business community because it is supported by three technologies that are now sufficiently mature:

- Massive data collection
- Powerful multiprocessor computers
- Data mining algorithms

Commercial databases are growing at unprecedented rates. A recent META Group survey of data warehouse projects found that 19% of respondents are beyond the 50 gigabyte level, while 59% expect to be there by the second quarter of 1996 [1]. In some industries, such as retail, these numbers can be much larger. The accompanying need for improved computational engines can now be met in a cost-effective manner with parallel multiprocessor computer technology. Data mining algorithms embody techniques that have existed for at least 10 years, but have only recently been implemented as mature, reliable, understandable tools that consistently outperform older statistical methods.

In the evolution from business data to business information, each new step has built upon the previous one. For example, dynamic data access is critical for drill-through in data navigation applications, and the ability to store large databases is critical to data mining. From the user's point of view, the four steps listed in Table 1 were revolutionary because they allowed new business questions to be answered accurately and quickly.

| Evolutionary Step | Business Question | Enabling Technologies | Product Providers | Characteristics |
|---|---|---|---|---|
| Data Collection (1960s) | "What was my total revenue in the last five years?" | Computers, tapes, disks | IBM, CDC | Retrospective, static data delivery |
| Data Access (1980s) | "What were unit sales in New England last March?" | Relational databases (RDBMS), Structured Query Language (SQL), ODBC | Oracle, Sybase, Informix, IBM, Microsoft | Retrospective, dynamic data delivery at record level |
| Data Warehousing & Decision Support (1990s) | "What were unit sales in New England last March? Drill down to Boston." | On-line analytic processing (OLAP), multidimensional databases, data warehouses | Pilot, Comshare, Arbor, Cognos, Microstrategy | Retrospective, dynamic data delivery at multiple levels |
| Data Mining (Emerging Today) | "What's likely to happen to Boston unit sales next month? Why?" | Advanced algorithms, multiprocessor computers, massive databases | Pilot, Lockheed, IBM, SGI, numerous startups (nascent industry) | Prospective, proactive information delivery |

Table 1. Steps in the Evolution of Data Mining.

The core components of data mining technology have been under development for decades, in research areas such as statistics, artificial intelligence, and machine learning. Today, the maturity of these techniques, coupled with high-performance relational database engines and broad data integration efforts, makes these technologies practical for current data warehouse environments.

The Scope of Data Mining

Data mining derives its name from the similarities between searching for valuable business information in a large database (for example, finding linked products in gigabytes of store scanner data) and mining a mountain for a vein of valuable ore. Both processes require either sifting through an immense amount of material, or intelligently probing it to find exactly where the value resides. Given databases of sufficient size and quality, data mining technology can generate new business opportunities by providing these capabilities:

- Automated prediction of trends and behaviors. Data mining automates the process of finding predictive information in large databases. Questions that traditionally required extensive hands-on analysis can now be answered directly from the data, and quickly. A typical example of a predictive problem is targeted marketing. Data mining uses data on past promotional mailings to identify the targets most likely to maximize return on investment in future mailings. Other predictive problems include forecasting bankruptcy and other forms of default, and identifying segments of a population likely to respond similarly to given events.
- Automated discovery of previously unknown patterns. Data mining tools sweep through databases and identify previously hidden patterns in one step. An example of pattern discovery is the analysis of retail sales data to identify seemingly unrelated products that are often purchased together. Other pattern discovery problems include detecting fraudulent credit card transactions and identifying anomalous data that could represent data entry keying errors.

Data mining techniques can yield the benefits of automation on existing software and hardware platforms, and can be implemented on new systems as existing platforms are upgraded and new products developed. When data mining tools are implemented on high performance parallel processing systems, they can analyze massive databases in minutes. Faster processing means that users can automatically experiment with more models to understand complex data. High speed makes it practical for users to analyze huge quantities of data. Larger databases, in turn, yield improved predictions.

Databases can be larger in both depth and breadth:

- More columns. Analysts must often limit the number of variables they examine when doing hands-on analysis due to time constraints. Yet variables that are discarded because they seem unimportant may carry information about unknown patterns. High performance data mining allows users to explore the full depth of a database, without preselecting a subset of variables.
- More rows. Larger samples yield lower estimation errors and variance, and allow users to make inferences about small but important segments of a population.

A recent Gartner Group Advanced Technology Research Note listed data mining and artificial intelligence at the top of the five key technology areas that "will clearly have a major impact across a wide range of industries within the next 3 to 5 years." [2] Gartner also listed parallel architectures and data mining as two of the top 10 new technologies in which companies will invest during the next 5 years. According to a recent Gartner HPC Research Note, "With the rapid advance in data capture, transmission and storage, large-systems users will increasingly need to implement new and innovative ways to mine the after-market value of their vast stores of detail data, employing MPP [massively parallel processing] systems to create new sources of business advantage (0.9 probability)." [3]

The most commonly used techniques in data mining are:

- Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
- Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi Square Automatic Interaction Detection (CHAID).
- Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of evolution.
- Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbor technique.
- Rule induction: The extraction of useful if-then rules from data based on statistical significance.

Many of these technologies have been in use for more than a decade in specialized analysis tools that work with relatively small volumes of data. These capabilities are now evolving to integrate directly with industry-standard data warehouse and OLAP platforms. The appendix to this white paper provides a glossary of data mining terms.

How Data Mining Works

How exactly is data mining able to tell you important things that you didn't know or what is going to happen next? The technique that is used to perform these feats in data mining is called modeling. Modeling is simply the act of building a model in one situation where you know the answer and then applying it to another situation where you don't. For instance, if you were looking for a sunken Spanish galleon on the high seas, the first thing you might do is to research the times when Spanish treasure had been found by others in the past. You might note that these ships often tend to be found off the coast of Bermuda and that there are certain characteristics to the ocean currents, and certain routes that have likely been taken by the ships' captains in that era. You note these similarities and build a model that includes the characteristics that are common to the locations of these sunken treasures. With these models in hand you sail off looking for treasure where your model indicates it most likely might be, given a similar situation in the past. Hopefully, if you've got a good model, you find your treasure.

This act of model building is thus something that people have been doing for a long time, certainly before the advent of computers or data mining technology. What happens on computers, however, is not much different from the way people build models. Computers are loaded up with lots of information about a variety of situations where an answer is known, and then the data mining software on the computer must run through that data and distill the characteristics of the data that should go into the model. Once the model is built it can then be used in similar situations where you don't know the answer. For example, say that you are the director of marketing for a telecommunications company and you'd like to acquire some new long distance phone customers. You could just randomly go out and mail coupons to the general population, just as you could randomly sail the seas looking for sunken treasure. In neither case would you achieve the results you desired, and of course you have the opportunity to do much better than random: you could use your business experience stored in your database to build a model.

As the marketing director you have access to a lot of information about all of your customers: their age, sex, credit history and long distance calling usage. The good news is that you also have a lot of information about your prospective customers: their age, sex, credit history etc. Your problem is that you don't know the long distance calling usage of these prospects (since they are most likely now customers of your competition). You'd like to concentrate on those prospects who have large amounts of long distance usage. You can accomplish this by building a model. Table 2 illustrates the data used for building a model for new customer prospecting in a data warehouse.

| | Customers | Prospects |
|---|---|---|
| General information (e.g. demographic data) | Known | Known |
| Proprietary information (e.g. customer transactions) | Known | Target |

Table 2 - Data Mining for Prospecting

The goal in prospecting is to make some calculated guesses about the information in the lower right hand quadrant based on the model that we build going from Customer General Information to Customer Proprietary Information. For instance, a simple model for a telecommunications company might be:

98% of my customers who make more than $60,000/year spend more than $80/month on long distance

This model could then be applied to the prospect data to try to tell something about the proprietary information that this telecommunications company does not currently have access to. With this model in hand new customers can be selectively targeted.

Test marketing is an excellent source of data for this kind of modeling. Mining the results of a test market representing a broad but relatively small sample of prospects can provide a foundation for identifying good prospects in the overall market. Table 3 shows another common scenario for building models: predict what is going to happen in the future.

| | Yesterday | Today | Tomorrow |
|---|---|---|---|
| Static information and current plans (e.g. demographic data, marketing plans) | Known | Known | Known |
| Dynamic information (e.g. customer transactions) | Known | Known | Target |

Table 3 - Data Mining for Predictions

If someone told you that he had a model that could predict customer usage, how would you know if he really had a good model? The first thing you might try would be to ask him to apply his model to your customer base, where you already knew the answer. With data mining, the best way to accomplish this is by setting aside some of your data in a vault to isolate it from the mining process. Once the mining is complete, the results can be tested against the data held in the vault to confirm the model's validity. If the model works, its observations should hold for the vaulted data.

An Architecture for Data Mining

To best apply these advanced techniques, they must be fully integrated with a data warehouse as well as flexible interactive business analysis tools. Many data mining tools currently operate ou
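
As a rough illustration of the modeling and vault-testing ideas from How Data Mining Works, the following Python sketch measures how often the example income-and-spend rule holds on a mining set, checks it against records held back in a vault, and then applies it to prospects. Everything in it is an assumption made for illustration: the customer records are synthetic, the income-to-spend relationship is invented, and the names `make_customers`, `split_vault` and `rule_confidence` are hypothetical rather than taken from the white paper or from any product.

```python
import random


def make_customers(n, seed=0):
    """Generate synthetic customer records (illustrative data only)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        income = rng.uniform(20_000, 120_000)
        # Invented relationship: spend rises with income, plus noise.
        spend = 20 + income / 1000 + rng.gauss(0, 10)
        rows.append({"income": income, "spend": spend})
    return rows


def split_vault(rows, vault_fraction=0.3, seed=1):
    """Set aside a random share of records in a vault before mining."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * vault_fraction)
    return shuffled[cut:], shuffled[:cut]  # (mining set, vault)


def rule_confidence(rows, income_min=60_000, spend_min=80):
    """Share of records with income > income_min whose spend > spend_min."""
    matched = [r for r in rows if r["income"] > income_min]
    if not matched:
        return 0.0
    return sum(r["spend"] > spend_min for r in matched) / len(matched)


customers = make_customers(2000)
mining_set, vault = split_vault(customers)

# Measure the rule on the mining set, then confirm it on the vaulted data.
mined = rule_confidence(mining_set)
vaulted = rule_confidence(vault)
print(f"rule confidence on mining set: {mined:.2f}")
print(f"rule confidence on vault:      {vaulted:.2f}")

# Apply the rule to prospects, for whom only general information is known.
prospects = [
    {"name": "A", "income": 75_000},
    {"name": "B", "income": 40_000},
]
targets = [p["name"] for p in prospects if p["income"] > 60_000]
print("prospects to target:", targets)
```

If the two confidence figures are close, the rule's observations hold for the vaulted data, which is the check the text describes. A large gap would suggest the rule reflects quirks of the mining set rather than a real pattern.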