


Using Wiktionary to build an Italian part-of-speech tagger

Tom De Smedt (Computational Linguistics Research Group, University of Antwerp), Marfia (Dipartimento di Elettronica, Politecnico di Milano)

Pattern (/pages/pattern) contains part-of-speech taggers for a number of languages (including English, Spanish, German, French and Dutch). Part-of-speech tagging is useful in many data mining tasks. A part-of-speech tagger takes a string of text and identifies the sentences and the words in the text along with their word type. The word type or part-of-speech can vary according to a word's role in the sentence. For example, in English, can can be a verb (the first can in "Can I have a can of soda?") or a noun (the second can in the same sentence). The output takes the following form:

    Can/MD I/PRP have/VB a/DT can/NN of/IN soda/NN ?/.

POS-tag MD indicates a modal verb, PRP a personal pronoun, VB a verb, DT a determiner, NN a noun and IN a preposition. The tags are part of the Penn Treebank II tagset (/pages/penn-treebank-tagset).

Pattern uses Brill's algorithm to construct its part-of-speech taggers. Other algorithms are more robust, but a Brill tagger is fast and compact (i.e., 1MB of data) so it makes a good candidate for Pattern. There are many languages for which Pattern does (or did) not have a tagger, for example Italian.

Brill's algorithm essentially produces a lexicon of known words and their part-of-speech tag, along with some rules for unknown words, or rules that change the tag according to a word's role in the sentence. In the past, written text (e.g., 1 million words) had to be tagged manually by human annotators, and then fed to the algorithm. Manual annotation is expensive and time-consuming. Today many resources are freely available. One such resource is Wiktionary, where many people collaborate to produce a free multilingual dictionary.

1. Mining Wiktionary for part-of-speech tags

If you take a look at Wiktionary's index of Italian words, you'll see a list of thousands of Italian words that start with a-, together with their part-of-speech tag. Since Wiktionary's content is free, we can mine the HTML of the page to automatically populate a lexicon. We can also mine the pages for words starting with b-, c-, and so on, to expand our lexicon.

The following script uses the pattern.web (/pages/pattern-web) module to accomplish this. The URL class has a download() method that retrieves the HTML from a given web address. The DOM class takes a string of HTML and transforms it into a tree of nested elements. We can then search the tree with CSS selectors for the elements we need, i.e., the words and their type:

    from pattern.web import URL, DOM

    url = "..."  # Base address of the Wiktionary index pages (not preserved in this copy).
    lexicon = {}
    for ch in "abcdefghijklmnopqrstuvwxyz":
        print(ch, len(lexicon))
        # Download the HTML source of each Wiktionary page (a-z).
        html = URL(url + ch).download(throttle=10, cached=True)
        # Parse the HTML tree.
        dom = DOM(html)
        # Iterate through the list of words and parse the part-of-speech tags.
        # Each word is a list item:
        # <li><a href="/wiki/additivo">additivo</a> <i>n adj</i></li>
        for li in dom("li"):
            try:
                word = li("a")[0].content
                pos = li("i")[0].content.split(" ")
            except IndexError:
                continue  # Not a word entry (no link or no tag).
            if word not in lexicon:
                lexicon[word] = []
            lexicon[word].extend(pos)

We end up with a lexicon dictionary that contains about 100,000 words, each linked to a list of part-of-speech tags. For example: la → DT, PRP, NN. We don't have any tags for punctuation marks, but we can add them:

    # The tags for "&", "/", "%" and "..." are not preserved in this copy,
    # so those entries are left out here.
    for punctuation, tag in (
            (u".", "."), (u'"', '"'), (u"+", "SYM"), (u"#", "#"),
            (u"?", "."), (u'“', '"'), (u"-", "SYM"), (u"$", "$"),
            (u"!", "."), (u'”', '"'), (u"*", "SYM"),
            (u"(", "("), (u"=", "SYM"),
            (u":", ":"), (u")", ")"), (u"<", "SYM"),
            (u";", ":"), (u",", ","), (u">", "SYM"), (u"@", "IN")):
        lexicon[punctuation] = [tag]

2. Mining Wiktionary for word inflections

In many languages, words inflect according to tense, mood, person, gender and number. This is true for verbs (discussed later) and often for nouns and adjectives. In Italian, the plural form of the noun affetto (affection) is affetti, while the plural feminine form of the adjective affetto (affected) is affette.

Unfortunately, the inflected forms are not always in the Wiktionary index. We need to mine deeper to retrieve them. This is a time-consuming process. We need to set a high throttle between requests to avoid being blacklisted by Wiktionary's servers.

This script defines an inflect() function. Given a word, it returns a dictionary of word inflections:

    from pattern.web import URL, DOM, plaintext
    import re

    def inflect(word, language="italian"):
        inflections = {}
        # Address of the word's Wiktionary page (base URL not preserved in this copy).
        url = "..." + word.replace(" ", "_")
        dom = DOM(URL(url).download(throttle=10, cached=True))
        pos = ""
        # Search the header that marks the start for the given language:
        # <h2><span class="mw-headline" id="Italian">Italian</span></h2>
        e = dom("#" + language)[0].parent
        while e is not None:
            if e.type == "element":
                if e.tag == "hr":  # Horizontal line = next language.
                    break
                if e.tag == "h3":  # <h3>Adjective [edit]</h3>
                    pos = plaintext(e.content.lower())
                    pos = pos.replace("[edit]", "").strip()[:3].rstrip("ouer") + "-"
                # Parse inflections, using regular expressions.
                s = plaintext(e.content)
                # affetto m (f affetta, m plural affetti, f plural affette)
                if s.startswith(word):
                    # (gender, pattern, index of the group that holds the word form)
                    for gender, regexp, i in (
                            ("m",  r"(" + word + r") m", 1),
                            ("f",  r"(" + word + r") f", 1),
                            ("m",  r"(" + word + r") (mf|m and f)", 1),
                            ("f",  r"(" + word + r") (mf|m and f)", 1),
                            ("m",  r"masculine:? (\S*?)(,|\))", 1),
                            ("f",  r"feminine:? (\S*?)(,|\))", 1),
                            ("m",  r"(\(|, )m(asculine)? (\S*?)(,|\))", 3),
                            ("f",  r"(\(|, )f(eminine)? (\S*?)(,|\))", 3),
                            ("mp", r"(\(|, )m(asculine)? plural (\S*?)(,|\))", 3),
                            ("fp", r"(\(|, )f(eminine)? plural (\S*?)(,|\))", 3),
                            ("p",  r"(\(|, )plural (\S*?)(,|\))", 2),
                            ("p",  r"m and f plural (\S*?)(,|\))", 1)):
                        m = re.search(regexp, s, re.I)
                        if m is not None:
                            # {"adj-m": "affetto", "adj-fp": "affette"}
                            inflections[pos + gender] = m.group(i)
                # print(s)
            e = e.next_sibling
        return inflections

We can add a call to inflect() for each noun, adjective or verb in the inner loop of our miner (see step 1):

            if any(tag in pos for tag in ("n", "v", "adj")):
                for pg, w in inflect(word).items():
                    p, g = pg.split("-")  # pos + gender: ("adj", "m")
                    if w not in lexicon:
                        lexicon[w] = []
                    if p not in lexicon[w]:
                        lexicon[w].append(p)

3. Mining Wikipedia for texts

The lexicons bundled in Pattern are about 500KB to 1MB in file size. If we save our Italian lexicon as a file, it is about 2MB (or 4MB with the inflections from step 2). We may want to reduce it by removing less important words. Which words to remove? We don't want to remove la; it looks important in Italian. We can assess a word's importance by counting how many times it occurs in written text.

The following script uses the pattern.web (/pages/pattern-web) module to retrieve Italian texts from Wikipedia. The Wikipedia class has a search() method that returns a WikipediaArticle. We then use the pattern.vector (/pages/pattern-vector) module to count the words in articles:

    from pattern.web import Wikipedia
    from pattern.vector import words

    frequency = {}

    # Spreading activation.
    # Parse links from seed article & visit those articles.
    links, seen = set(["Italia"]), {}
    while len(links) > 0:
        article = Wikipedia(language="it").search(links.pop(), throttle=10)
        if article is None:
            continue  # No article found for this link.
        seen[article.title] = True
        # Parse links from article.
        for link in article.links:
            if link not in seen:
                links.add(link)
        # Parse words from article. Count words.
        for word in words(article.string):
            if word not in frequency:
                frequency[word] = 0
            frequency[word] += 1
        print(sum(frequency.values()), article.title)
        # Collect a reliable amount of words (e.g., 1M).
        if sum(frequency.values()) > 1000000:
            break

    # top = sorted((count, word) for word, count in frequency.items())
    # top = top[-1000:]
    # print(top)

We should also boost our miner by including contemporary newspaper articles:

    from glob import glob

    # Up-to-date newspaper articles.
    for f in glob("repubblica-*"):  # The file pattern is truncated in this copy.
        for word in words(open(f).read()):
            if word not in frequency:
                frequency[word] = 0
            frequency[word] += 1

We end up with a frequency dictionary with about 1,000,000 words (115,000 unique words) and their word count. For example, di occurs 70,000 times, la occurs 30,000 times and indecifrabilmente (indecipherable) a single time. This is a word that we could remove and replace with a morphological rule -mente → RB (adverb). Morphological rules are discussed in step 5.

4. Preprocessing a CSV-file

This is a good time to store the data (so we don't need to rerun the miner). We map Wiktionary's word tags to Penn Treebank II, and combine the entries in lexicon and frequency. We then use pattern.db (/pages/pattern-db) to store the result as a CSV-file.

    from pattern.db import Datasheet

    PENN = {
              "n": "NN",
              "v": "VB",
            "adj": "JJ",
            "adv": "RB",
        "article": "DT",
           "prep": "IN",
           "conj": "CC",
            "num": "CD",
            "int": "UH",
        "pronoun": "PRP",
         "proper": "NNP"
    }
    PUNCTUATION = ("SYM", ".", ",", ":", "\"", "(", ")", "#", "$")

    SPECIAL = ["abbr", "contraction"]
    special = set()

    csv = Datasheet()
    for word, pos in lexicon.items():
        if " " not in word:
            f = frequency.get(word, frequency.get(word.lower(), 0))
            # Map to Penn Treebank II tagset.
            penn = [PENN[tag] for tag in pos if tag in PENN]
            penn += [tag for tag in pos if tag in PUNCTUATION]
            penn = ", ".join(penn)
            # Collect tagged words in the .csv file.
            csv.append((f, word, penn))
            # Collect special words for post-processing.
            for tag in SPECIAL:
                if tag in pos:
                    special.add(word)

    # Sort by frequency (most frequent first) and store the result.
    csv.columns[0].sort(reverse=True)
    csv.save("it-lexicon.csv")

    print(special)

We end up with a CSV-file of Italian words and their part-of-speech tags, sorted by frequency.

[Table: frequency, word and parts of speech of the most frequent words; the rows are not recoverable.]

[Figure: Distribution of Italian words. Top five is di, e, il, la, ...]

As shown, the distribution of words approximates Zipf's law. The most frequent word appears nearly twice as much as the second most frequent word, and so on. The top 10% most frequent words cover 90% of Italian language use. This implies that we can remove part of "Zipf's long tail" (words that occur only once). If we have a lexicon that covers the top 10%, and tag all unknown words as NN, we have a tagger that is about 90% accurate. This is the baseline. We can improve it by 1-5% by determining good morphological and contextual rules for unknown words.

5. Morphological rules based on word suffixes

When we remove words from the lexicon (to reduce file size), the tagger may no longer recognize some words. By default, it will tag unknown words as NN. We can improve the tags of unknown words using morphological rules. Examine the English en-morphology.txt (/clips/pattern/blob/master/pattern/text/en/en-morphology.txt) to see the rule format.

One way to predict tags is to look at word suffixes. For example, English adverbs usually end in -ly. In Italian they end in -mente. The following script determines the most frequent tag for each word suffix:

    from pattern.db import Datasheet
    from collections import defaultdict

    lexicon = {}
    for frequency, word, tags in Datasheet.load("it-lexicon.csv"):
        lexicon[word] = tags.split(", ")

    # {"mente": {"RB": 2956.0, "JJ": 8.0, "NN": ...}}
    suffix = defaultdict(lambda: defaultdict(float))
    for w in lexicon:
        if len(w) > 5:  # Last 5 characters.
            x = w[-5:]
            for tag in lexicon[w]:
                suffix[x][tag] += 1.0

    # Map the dictionary to a list sorted by total tag count.
    suffix = [(sum(tags.values()), x, tags) for x, tags in suffix.items()]
    suffix = sorted(suffix, reverse=True)

    for n, x, tags in suffix:
        # Relative count per tag (0.0-1.0).
        # This shows the tag distribution per suffix more clearly.
        tags = [("%.3f" % (i / n), tag) for tag, i in tags.items()]
        tags = sorted(tags, reverse=True)
        print(x, n, tags)

The output looks as follows (the frequency column, most suffix names and the last tag of each row are not recoverable):

    Suffix    Parts of speech
    -mente    99% RB + 0.5% JJ + 0.5% NN
    -...      99% NN + 0.5% JJ + 0.5% ...
    -...      97% JJ + 2% NN + 0.5% RB + 0.5% ...
    -...      99% NN + 0.5% VB + 0.5% ...
    -...      84% NN + 16% ...

We can also run the script for suffixes of 4 or 3 characters. We then manually construct an it-morphology.txt file with interesting rules. For example: -mente → RB has a high coverage (2,969 words in the lexicon) and a high precision (99%). We add this rule to the rule set:

    NN mente fhassuf 5 RB x

6. Contextual rules

When we constructed a CSV-file (see step 4), we saw that some words can have multiple tags, depending on their role in the sentence. In English, in "I can", "you can" or "we can", can is a verb. In "a can" and "the can" it is a noun. We could generalize this in two contextual rules: PRP + can → VB, and DT + can → NN. Examine the English en-context.txt (/clips/pattern/blob/master/pattern/text/en/en-context.txt) to see the rule format.

We can create contextual rules by hand. We can also analyze a corpus of tagged texts (a treebank) to predict how word tags change according to the surrounding words. However, a corpus of tagged texts implies that it was tagged with another part-of-speech tagger. It is a thin line between using someone else's tagger and plagiarizing someone else's tagger. We should contact the authors and/or cite their work.

For Italian, we can use the freely available WaCKy corpus (Baroni, Bernardini, Ferraresi & Zanchetta, 2009). The following script reads 1 million words from the WaCKy MultiTag Wikipedia corpus. For words that can have multiple tags, it records the tag of the preceding word and its frequency:

    from pattern.db import Datasheet
    from codecs import open
    from collections import defaultdict

    ambiguous = {}
    for frequency, word, tags in Datasheet.load("it-lexicon.csv"):
        tags = tags.split(", ")
        tags = [tag for tag in tags if tag]
        if len(tags) != 1 and int(frequency) > 100:
            ambiguous[word] = (int(frequency), tags)

    # Map TANL tags to Penn Treebank II.
    # medialab.di.unipi.it/wiki/Tanl_POS_Tagset
    TANL = {
         "A": "JJ",   "B": "RB",   "C": "CC",  "CC": "CC",
        "CS": "IN",   "D": "DT",   "E": "IN",  "FF": ",",
        "FS": ".",   "FB": "(",    "I": "UH",   "N": "CD",
         "P": "PRP", "PP": "PRP$", "R": "DT",   "S": "NN",
        "SP": "NNP",  "T": "DT",   "V": "VB",  "VM": "MD"
    }

    # Word tags linked to frequency of preceding word tag:
    # {"le": {"DT": {"IN": 1580}, "PRP": {"VB": 105}}}
    context = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

    window = []  # [(word1, tag1), (word2, tag2), (word3, tag3)]
    for i, s in enumerate(open("/downloads/wikiMT", encoding="utf-8")):
        s = s.split("\t")
        if i > 1000000:
            break
        if i > 1 and len(s) >= 3:
            word, tag = s[0:2]  # ("l'", "RD", "il")
            tag = TANL.get(tag[:2]) or \
                  TANL.get(tag[:1]) or tag
            window.append((word, tag))
        if len(window) > 3:
            window.pop(0)
        if len(window) == 3 and window[1][0] in ambiguous:
            w1, tag1 = window[0]  # word left
            w2, tag2 = window[1]  # word that can have multiple tags
            w3, tag3 = window[2]  # word right
            context[w2][tag2][tag1] += 1

We can then examine the output, sorted by word frequency:

    for word in reversed(sorted(ambiguous, key=lambda k: ambiguous[k])):
        print(word)
        for tag in context[word]:
            left = context[word][tag]
            s = float(sum(left.values()))
            left = [("%.2f" % (n / s), x) for x, n in left.items()]
            left = sorted(left, reverse=True)
            print("\t", int(s), tag, left)

The output table lists, for each word and tag, the tags of the preceding word (the word, tag and frequency columns and some tags are not recoverable):

    Preceding part-of-speech
    31% VB + 27% IN + 9% ...
    58% VB + 12% CC + 11% ...
    43% NN + 43% IN + 14% ...
    38% NN + 28% ... + 12% ...
    46% VB + 17% RB + 13% ...
    32% NN + 24% ... + 8% ...
    33% JJ + 17% NN + 15% ...

What can we learn from the output? The word la (DT, PRP or NN?) we can simply tag as DT in our lexicon, since the other cases are negligible (2%). The word che (PRP, IN, DT or CC?) we can tag as PRP in our lexicon (covering 72% of all cases) and create a rule VB + che → IN (covering another 12%). We manually add this rule to the rule set:

    PRP IN WDPREVTAG VB che

We can run variations of the above script to look at the words after, or both before and after.

References

Baroni, M., Bernardini, S., Ferraresi, A. & Zanchetta, A. (2009). The WaCky Wide Web: A Collection of Very Large Linguistically Processed Web-Crawled Corpora. Language Resources and Evaluation, 43(3), 209–226.

7. Subclassing the pattern.text Parser class

In summary, we constructed an it-lexicon.csv with the frequency and part-of-speech tags of known words (steps 1-4) together with an it-morphology.txt (step 5) and an it-context.txt (step 6). We can use these to create a parser for Italian by subclassing the base Parser in the pattern.text module.

The pattern.text module has base classes for Parser, Lexicon, Morphology, etc. Take a moment to review the source code, and the source code of other parsers in Pattern. You'll notice that all parsers follow the same simple steps. A template for new parsers is included in pattern.text.xx.

The Parser base class has the following methods with default behavior:

- Parser.find_tokens() finds sentence markers (.?!) and splits punctuation marks from words,
- Parser.find_tags() finds word part-of-speech tags,
- Parser.find_chunks() finds words that belong together (e.g., the black cats),
- Parser.find_lemmata() finds word base forms (cats → cat),
- Parser.parse() executes the above steps on a given string.

We will need to redefine find_tokens() with rules for Italian abbreviations and contractions (e.g., dell'anno = di + l'anno). Remember the special set in step 4? It contains the data we need.

    from pattern.text import Parser

    ABBREVIATIONS = [
        "a.C.", "all.", "apr.", "b.c.", "c.m.", "C.V.", "Dott.", "ecc.", "egr.",
        "giu.", "Ing.", "orch.", "p.es.", "Prof.", "prof.", "ql.co.", "Spett."
    ]

    # Each contraction is replaced by itself plus a trailing space,
    # so that it is split from the word that follows.
    CONTRACTIONS = {
         "all'": "all' ",
        "anch'": "anch' ",
           "c'": "c' ",
        "coll'": "coll' ",
         "com'": "com' ",
        "dall'": "dall' ",
        "dell'": "dell' ",
         "dev'": "dev' ",
         "dov'": "dov' ",
          "mo'": "mo' ",
        "nell'": "nell' ",
        "sull'": "sull' "
    }

    class ItalianParser(Parser):

        def find_tokens(self, tokens, **kwargs):
            kwargs.setdefault("abbreviations", ABBREVIATIONS)
            kwargs.setdefault("replace", CONTRACTIONS)
            return Parser.find_tokens(self, tokens, **kwargs)

We can then create an instance of the ItalianParser and feed it our data. We need to convert it-lexicon.csv to an it-lexicon.txt file in the right format (a word and its tag on each line). This only needs to happen one time, of course.

    from pattern.db import Datasheet

    w = []
    for frequency, word, tags in Datasheet.load("it-lexicon.csv"):
        if int(frequency) >= 1:  # Adjust to tweak file size.
            for tag in tags.split(", "):
                if tag:
                    w.append("%s %s" % (word, tag))
                    break  # One tag per word: keep the first one.

Load the lexicon an...
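To see how the three resources from the summary (lexicon, morphological rules, contextual rules) could work together at tagging time, here is a minimal sketch in plain Python. It is not Pattern's implementation and does not use Pattern's API: the names LEXICON, SUFFIX_RULES, CONTEXT_RULES and tag() are hypothetical, and the lexicon entries are toy data made up for the illustration. Only the two rules (-mente → RB and VB + che → IN) come from steps 5 and 6.

```python
# Toy lexicon (hypothetical entries, not the mined it-lexicon.csv).
LEXICON = {"la": "DT", "che": "PRP", "dice": "VB", "ragazza": "NN", ".": "."}

# Morphological rule from step 5: unknown words ending in -mente are adverbs.
SUFFIX_RULES = [("mente", "RB")]

# Contextual rule from step 6: che tagged PRP becomes IN after a verb.
# Format (an assumption for this sketch): (word, old tag, previous tag, new tag).
CONTEXT_RULES = [("che", "PRP", "VB", "IN")]


def tag(tokens, default="NN"):
    """Tag a list of tokens: lexicon first, then suffix rules, then context rules."""
    tagged = []
    for token in tokens:
        word = token.lower()
        pos = LEXICON.get(word)
        if pos is None:
            # Unknown word: fall back to NN unless a suffix rule applies.
            pos = default
            for suffix, suffix_tag in SUFFIX_RULES:
                if word.endswith(suffix):
                    pos = suffix_tag
                    break
        tagged.append([token, pos])
    # Second pass: change tags according to the tag of the preceding word.
    for i in range(1, len(tagged)):
        for word, old, previous, new in CONTEXT_RULES:
            if (tagged[i][0].lower() == word
                    and tagged[i][1] == old
                    and tagged[i - 1][1] == previous):
                tagged[i][1] = new
    return [tuple(pair) for pair in tagged]


result = tag(["La", "ragazza", "dice", "che", "parla", "lentamente", "."])
for token, pos in result:
    print("%s/%s" % (token, pos))
```

In this toy run, che is changed from PRP to IN because it follows a verb, lentamente is tagged RB by the suffix rule, and the unknown verb parla falls back to NN, which is the kind of baseline error that a larger lexicon or further rules would have to correct.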