An Introduction to Data Mining

Discovering hidden value in your data warehouse

Overview

Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by retrospective tools typical of decision support systems. Data mining tools can answer business questions that traditionally were too time consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations.

Most companies already collect and refine massive quantities of data. Data mining techniques can be implemented rapidly on existing software and hardware platforms to enhance the value of existing information resources, and can be integrated with new products and systems as they are brought on-line. When implemented on high performance client/server or parallel processing computers, data mining tools can analyze massive databases to deliver answers to questions such as, "Which clients are most likely to respond to my next promotional mailing, and why?"

This white paper provides an introduction to the basic technologies of data mining. Examples of profitable applications illustrate its relevance to today's business environment as well as a basic description of how data warehouse architectures can evolve to deliver the value of data mining to end users.

The Foundations of Data Mining

Data mining techniques are the result of a long process of research and product development. This evolution began when business data was first stored on computers, continued with improvements in data access, and more recently, generated technologies that allow users to navigate through their data in real time. Data mining takes this evolutionary process beyond retrospective data access and navigation to prospective and proactive information delivery. Data mining is ready for application in the business community because it is supported by three technologies that are now sufficiently mature:

• Massive data collection
• Powerful multiprocessor computers
• Data mining algorithms

Commercial databases are growing at unprecedented rates. A recent META Group survey of data warehouse projects found that 19% of respondents are beyond the 50 gigabyte level, while 59% expect to be there by second quarter of 1996 [1]. In some industries, such as retail, these numbers can be much larger. The accompanying need for improved computational engines can now be met in a cost-effective manner with parallel multiprocessor computer technology. Data mining algorithms embody techniques that have existed for at least 10 years, but have only recently been implemented as mature, reliable, understandable tools that consistently outperform older statistical methods.

In the evolution from business data to business information, each new step has built upon the previous one. For example, dynamic data access is critical for drill-through in data navigation applications, and the ability to store large databases is critical to data mining. From the user's point of view, the four steps listed in Table 1 were revolutionary because they allowed new business questions to be answered accurately and quickly.

| Evolutionary Step | Business Question | Enabling Technologies | Product Providers | Characteristics |
|---|---|---|---|---|
| Data Collection (1960s) | "What was my total revenue in the last five years?" | Computers, tapes, disks | IBM, CDC | Retrospective, static data delivery |
| Data Access (1980s) | "What were unit sales in New England last March?" | Relational databases (RDBMS), Structured Query Language (SQL), ODBC | Oracle, Sybase, Informix, IBM, Microsoft | Retrospective, dynamic data delivery at record level |
| Data Warehousing & Decision Support (1990s) | "What were unit sales in New England last March? Drill down to Boston." | On-line analytic processing (OLAP), multidimensional databases, data warehouses | Pilot, Comshare, Arbor, Cognos, Microstrategy | Retrospective, dynamic data delivery at multiple levels |
| Data Mining (Emerging Today) | "What's likely to happen to Boston unit sales next month? Why?" | Advanced algorithms, multiprocessor computers, massive databases | Pilot, Lockheed, IBM, SGI, numerous startups (nascent industry) | Prospective, proactive information delivery |

Table 1. Steps in the Evolution of Data Mining.

The core components of data mining technology have been under development for decades, in research areas such as statistics, artificial intelligence, and machine learning. Today, the maturity of these techniques, coupled with high-performance relational database engines and broad data integration efforts, make these technologies practical for current data warehouse environments.

The Scope of Data Mining

Data mining derives its name from the similarities between searching for valuable business information in a large database (for example, finding linked products in gigabytes of store scanner data) and mining a mountain for a vein of valuable ore. Both processes require either sifting through an immense amount of material, or intelligently probing it to find exactly where the value resides. Given databases of sufficient size and quality, data mining technology can generate new business opportunities by providing these capabilities:

• Automated prediction of trends and behaviors. Data mining automates the process of finding predictive information in large databases. Questions that traditionally required extensive hands-on analysis can now be answered directly from the data, and quickly. A typical example of a predictive problem is targeted marketing. Data mining uses data on past promotional mailings to identify the targets most likely to maximize return on investment in future mailings. Other predictive problems include forecasting bankruptcy and other forms of default, and identifying segments of a population likely to respond similarly to given events.
• Automated discovery of previously unknown patterns. Data mining tools sweep through databases and identify previously hidden patterns in one step. An example of pattern discovery is the analysis of retail sales data to identify seemingly unrelated products that are often purchased together. Other pattern discovery problems include detecting fraudulent credit card transactions and identifying anomalous data that could represent data entry keying errors.

Data mining techniques can yield the benefits of automation on existing software and hardware platforms, and can be implemented on new systems as existing platforms are upgraded and new products developed. When data mining tools are implemented on high performance parallel processing systems, they can analyze massive databases in minutes. Faster processing means that users can automatically experiment with more models to understand complex data. High speed makes it practical for users to analyze huge quantities of data. Larger databases, in turn, yield improved predictions.

Databases can be larger in both depth and breadth:

• More columns. Analysts must often limit the number of variables they examine when doing hands-on analysis due to time constraints. Yet variables that are discarded because they seem unimportant may carry information about unknown patterns. High performance data mining allows users to explore the full depth of a database, without preselecting a subset of variables.
• More rows. Larger samples yield lower estimation errors and variance, and allow users to make inferences about small but important segments of a population.

A recent Gartner Group Advanced Technology Research Note listed data mining and artificial intelligence at the top of the five key technology areas that "will clearly have a major impact across a wide range of industries within the next 3 to 5 years" [2]. Gartner also listed parallel architectures and data mining as two of the top 10 new technologies in which companies will invest during the next 5 years. According to a recent Gartner HPC Research Note, "With the rapid advance in data capture, transmission and storage, large-systems users will increasingly need to implement new and innovative ways to mine the after-market value of their vast stores of detail data, employing MPP [massively parallel processing] systems to create new sources of business advantage (0.9 probability)" [3].

The most commonly used techniques in data mining are:

• Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
• Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi Square Automatic Interaction Detection (CHAID).
• Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of evolution.
• Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbor technique.
• Rule induction: The extraction of useful if-then rules from data based on statistical significance.

Many of these technologies have been in use for more than a decade in specialized analysis tools that work with relatively small volumes of data. These capabilities are now evolving to integrate directly with industry-standard data warehouse and OLAP platforms. The appendix to this white paper provides a glossary of data mining terms.

How Data Mining Works

How exactly is data mining able to tell you important things that you didn't know or what is going to happen next? The technique that is used to perform these feats in data mining is called modeling. Modeling is simply the act of building a model in one situation where you know the answer and then applying it to another situation that you don't. For instance, if you were looking for a sunken Spanish galleon on the high seas the first thing you might do is to research the times when Spanish treasure had been found by others in the past. You might note that these ships often tend to be found off the coast of Bermuda and that there are certain characteristics to the ocean currents, and certain routes that have likely been taken by the ships' captains in that era. You note these similarities and build a model that includes the characteristics that are common to the locations of these sunken treasures. With these models in hand you sail off looking for treasure where your model indicates it most likely might be given a similar situation in the past. Hopefully, if you've got a good model, you find your treasure.

This act of model building is thus something that people have been doing for a long time, certainly before the advent of computers or data mining technology. What happens on computers, however, is not much different than the way people build models. Computers are loaded up with lots of information about a variety of situations where an answer is known and then the data mining software on the computer must run through that data and distill the characteristics of the data that should go into the model. Once the model is built it can then be used in similar situations where you don't know the answer. For example, say that you are the director of marketing for a telecommunications company and you'd like to acquire some new long distance phone customers. You could just randomly go out and mail coupons to the general population - just as you could randomly sail the seas looking for sunken treasure. In neither case would you achieve the results you desired and of course you have the opportunity to do much better than random - you could use your business experience stored in your database to build a model.

As the marketing director you have access to a lot of information about all of your customers: their age, sex, credit history and long distance calling usage. The good news is that you also have a lot of information about your prospective customers: their age, sex, credit history etc. Your problem is that you don't know the long distance calling usage of these prospects (since they are most likely now customers of your competition). You'd like to concentrate on those prospects who have large amounts of long distance usage. You can accomplish this by building a model. Table 2 illustrates the data used for building a model for new customer prospecting in a data warehouse.

| | Customers | Prospects |
|---|---|---|
| General information (e.g. demographic data) | Known | Known |
| Proprietary information (e.g. customer transactions) | Known | Target |

Table 2 - Data Mining for Prospecting

The goal in prospecting is to make some calculated guesses about the information in the lower right hand quadrant based on the model that we build going from Customer General Information to Customer Proprietary Information. For instance, a simple model for a telecommunications company might be:

98% of my customers who make more than $60,000/year spend more than $80/month on long distance

This model could then be applied to the prospect data to try to tell something about the proprietary information that this telecommunications company does not currently have access to. With this model in hand new customers can be selectively targeted.

Test marketing is an excellent source of data for this kind of modeling. Mining the results of a test market representing a broad but relatively small sample of prospects can provide a foundation for identifying good prospects in the overall market. Table 3 shows another common scenario for building models: predict what is going to happen in the future.

| | Yesterday | Today | Tomorrow |
|---|---|---|---|
| Static information and current plans (e.g. demographic data, marketing plans) | Known | Known | Known |
| Dynamic information (e.g. customer transactions) | Known | Known | Target |

Table 3 - Data Mining for Predictions

If someone told you that he had a model that could predict customer usage how would you know if he really had a good model? The first thing you might try would be to ask him to apply his model to your customer base - where you already knew the answer. With data mining, the best way to accomplish this is by setting aside some of your data in a vault to isolate it from the mining process. Once the mining is complete, the results can be tested against the data held in the vault to confirm the model's validity. If the model works, its observations should hold for the vaulted data.

An Architecture for Data Mining

To best apply these advanced techniques, they must be fully integrated with a data warehouse as well as flexible interactive business analysis tools. Many data mining tools currently operate ou…
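
As a rough illustration of the vault test described under How Data Mining Works, the sketch below measures how often the example rule (income above $60,000/year implies long distance spend above $80/month) holds on the data used for mining, then checks the same rule against records that were set aside. Everything here is an assumption made for illustration: the customer records are synthetic, the income-to-spend relationship is invented, and the function names (make_customers, split_vault, rule_confidence) are hypothetical rather than part of any data mining product.

```python
import random


def make_customers(n, seed=0):
    """Build synthetic (annual_income_usd, monthly_long_distance_usd) records.

    The relationship between income and spend is invented for this sketch.
    """
    rng = random.Random(seed)
    customers = []
    for _ in range(n):
        income = rng.uniform(20_000, 120_000)
        spend = income / 1000 + rng.uniform(0, 40)
        customers.append((income, spend))
    return customers


def split_vault(records, vault_percent=30, seed=1):
    """Shuffle the records and set aside vault_percent of them as the vault."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    vault_size = len(shuffled) * vault_percent // 100
    return shuffled[vault_size:], shuffled[:vault_size]


def rule_confidence(records, income_floor=60_000, spend_floor=80):
    """Share of records above income_floor that also spend above spend_floor."""
    matching = [r for r in records if r[0] > income_floor]
    if not matching:
        return 0.0
    hits = sum(1 for _, spend in matching if spend > spend_floor)
    return hits / len(matching)


customers = make_customers(2000)
mining_set, vault = split_vault(customers)

# The rule is measured on the mining data only; the vault stays untouched
# until the rule is fixed, then serves as an independent check.
mined = rule_confidence(mining_set)
vaulted = rule_confidence(vault)
print(f"rule confidence on mining data: {mined:.1%}")
print(f"rule confidence on vault data:  {vaulted:.1%}")
```

If the two percentages are close, the rule's observations hold for the vaulted data; a large gap would suggest the rule reflects quirks of the mining sample rather than a real pattern.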
