




Large Language Models
Introduction to Large Language Models
Language models
• Remember the simple n-gram language model:
  • Assigns probabilities to sequences of words
  • Generates text by sampling possible next words
  • Is trained on counts computed from lots of text
• Large language models are similar and different:
  • Assign probabilities to sequences of words
  • Generate text by sampling possible next words
  • Are trained by learning to guess the next word
Large language models
• Even though they are pretrained only to predict words,
• they learn a lot of useful language knowledge,
• since they are trained on a lot of text.
Three architectures for large language models
• Decoders: GPT, Claude, Llama, Mixtral
• Encoders: BERT family, HuBERT
• Encoder-decoders: Flan-T5, Whisper
Encoders
Many varieties!
• Popular: Masked Language Models (MLMs)
• BERT family
• Trained by predicting words from surrounding words on both sides
• Are usually finetuned (trained on supervised data) for classification tasks
Encoder-decoders
• Trained to map from one sequence to another
• Very popular for:
  • machine translation (map from one language to another)
  • speech recognition (map from acoustics to words)
Large Language Models
Large Language Models: What tasks can they do?
Big idea
Many tasks can be turned into tasks of predicting words!

This lecture: decoder-only models
Also called:
• Causal LLMs
• Autoregressive LLMs
• Left-to-right LLMs
• Predict words left to right

Conditional generation: generating text conditioned on previous text!
[Figure: Conditional generation with a decoder-only transformer. The prefix text "So long and thanks for" is encoded (token embedding E plus position i), passed through the transformer blocks and the language modeling head (unembedding layer U → logits → softmax), and the completion word "all" is sampled; each generated word is appended to the context to predict the next word ("the", ...).]
Many practical NLP tasks can be cast as word prediction!
Sentiment analysis: "I like Jackie Chan"
1. We give the language model this string:
   The sentiment of the sentence "I like Jackie Chan" is:
2. And see which word it thinks comes next (see the sketch below):
   P(positive | The sentiment of the sentence "I like Jackie Chan" is:)
   P(negative | The sentiment of the sentence "I like Jackie Chan" is:)
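A minimal sketch of this scoring step, assuming the Hugging Face transformers library and GPT-2 as the underlying causal LM (the lecture itself is model- and toolkit-agnostic): we compare the probabilities the model assigns to the tokens " positive" and " negative" immediately after the prompt.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = 'The sentiment of the sentence "I like Jackie Chan" is:'
ids = tok(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits[0, -1]        # scores for the next token
probs = torch.softmax(logits, dim=-1)        # distribution over the vocabulary

for label in [" positive", " negative"]:
    token_id = tok.encode(label)[0]          # first subword of the label
    print(f"P({label.strip()} | prompt) = {probs[token_id].item():.4f}")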
Framing lots of tasks as conditional generation
QA: "Who wrote The Origin of Species?"
1. We give the language model this string:
   Q: Who wrote the book "The Origin of Species"? A:
2. And see which word it thinks comes next:
   P(w | Q: Who wrote the book "The Origin of Species"? A:)
3. And iterate (see the sketch below):
   P(w | Q: Who wrote the book "The Origin of Species"? A: Charles)
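A minimal sketch of this iteration, again assuming GPT-2 via the transformers library: take the most probable next token, append it to the context, and repeat.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok('Q: Who wrote the book "The Origin of Species"? A:',
          return_tensors="pt").input_ids

for _ in range(10):                                    # generate up to 10 tokens
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    next_id = torch.argmax(logits)                     # greedy choice of the next word
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)  # append it and iterate
    if next_id.item() == tok.eos_token_id:
        break

print(tok.decode(ids[0]))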
Summarization
Original
The only thing crazier than a guy in snowbound Massachusetts boxing up the powdery white stuff and offering it for sale online? People are actually buying it. For $89, self-styled entrepreneur Kyle Waring will ship you 6 pounds of Boston-area snow in an insulated Styrofoam box – enough for 10 to 15 snowballs, he says.
But not if you live in New England or surrounding states. "We will not ship snow to any states in the northeast!" says Waring's website, ShipSnowY. "We're in the business of expunging snow!"
His website and social media accounts claim to have filled more than 133 orders for snow – more than 30 on Tuesday alone, his busiest day yet. With more than 45 total inches, Boston has set a record this winter for the snowiest month in its history. Most residents see the huge piles of snow choking their yards and sidewalks as a nuisance, but Waring saw an opportunity.
According to B, it all started a few weeks ago, when Waring and his wife were shoveling deep snow from their yard in Manchester-by-the-Sea, a coastal suburb north of Boston. He joked about shipping the stuff to friends and family in warmer states, and an idea was born. [...]
Summary
Kyle Waring will ship you 6 pounds of Boston-area snow in an insulated Styrofoam box – enough for 10 to 15 snowballs, he says. But not if you live in New England or surrounding states.
LLMs for summarization (using tl;dr)
[Figure: The original story is followed by the delimiter "tl;dr"; the model then autoregressively generates the summary ("Kyle Waring will …") one word at a time through the transformer layers and the LM head (unembedding U), appending each generated word to the context.]
Large Language Models
Sampling for LLM Generation
Decoding and sampling
The task of choosing a word to generate based on the model's probabilities is called decoding.
The most common method for decoding in LLMs is sampling: choosing random words according to the probabilities the model assigns them.
After each token we sample the next word according to its probability conditioned on our previous choices; a transformer language model gives us exactly this conditional probability p(w_t | w_<t).
Random sampling
i ← 1
w_i ~ p(w)
while w_i != EOS:
    i ← i + 1
    w_i ~ p(w_i | w_<i)
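A minimal runnable version of this loop, assuming a hypothetical next_word_probs(context) function that returns the model's distribution over the vocabulary (any causal LM could supply it):

import numpy as np

EOS = "<eos>"

def random_sample(next_word_probs, vocab, max_len=100):
    """Ancestral sampling: draw each next word from p(w_i | w_<i)."""
    context = []
    while len(context) < max_len:
        probs = next_word_probs(context)        # distribution over vocab, sums to 1
        w = np.random.choice(vocab, p=probs)    # sample according to the model
        context.append(w)
        if w == EOS:
            break
    return context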
Random sampling doesn't work very well
Even though random sampling mostly generates sensible, high-probability words:
• There are many odd, low-probability words in the tail of the distribution.
• Each one is low-probability, but added up they constitute a large portion of the distribution.
• So they get picked often enough to generate weird sentences.
Factors in word sampling: quality and diversity
Emphasize high-probability words:
  + quality: more accurate, coherent, and factual
  − diversity: boring, repetitive
Emphasize middle-probability words:
  + diversity: more creative, diverse
  − quality: less factual, incoherent
Top-k sampling:
1. Choose a number of words k.
2. For each word in the vocabulary V, use the language model to compute the likelihood of this word given the context, p(w_t | w_<t).
3. Sort the words by likelihood and keep only the top k most probable words.
4. Renormalize the scores of these k words to form a legitimate probability distribution.
5. Randomly sample a word from within these remaining k most-probable words according to its probability.
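A minimal numpy sketch of these five steps, assuming probs is the model's next-word distribution p(w_t | w_<t) over the vocabulary (no particular library's API is implied):

import numpy as np

def top_k_sample(probs, k, rng=np.random.default_rng()):
    """Sample a word index from the k most probable words, renormalized."""
    top_ids = np.argsort(probs)[-k:]            # step 3: indices of the k best words
    top_probs = probs[top_ids]
    top_probs = top_probs / top_probs.sum()     # step 4: renormalize
    return rng.choice(top_ids, p=top_probs)     # step 5: sample within the top k

# toy usage: a 5-word vocabulary
probs = np.array([0.5, 0.2, 0.15, 0.1, 0.05])
print(top_k_sample(probs, k=3))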
Top-p sampling (Holtzman et al., 2020)
Problem with top-k: k is fixed, so it may cover very different amounts of probability mass in different situations.
Idea: instead, keep the top p percent of the probability mass, i.e., the smallest set of words V^(p) such that
    Σ_{w ∈ V^(p)} P(w | w_<t) ≥ p
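A minimal numpy sketch of top-p (nucleus) sampling, under the same assumption that probs is the model's next-word distribution:

import numpy as np

def top_p_sample(probs, p, rng=np.random.default_rng()):
    """Sample from the smallest set of words whose total probability mass >= p."""
    order = np.argsort(probs)[::-1]             # words from most to least probable
    sorted_probs = probs[order]
    cum = np.cumsum(sorted_probs)
    cutoff = np.searchsorted(cum, p) + 1        # smallest prefix with mass >= p
    nucleus_probs = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()  # renormalize
    return rng.choice(order[:cutoff], p=nucleus_probs)

probs = np.array([0.5, 0.2, 0.15, 0.1, 0.05])
print(top_p_sample(probs, p=0.9))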
Temperature sampling
Reshape the distribution instead of truncating it.
Intuition from thermodynamics:
• a system at high temperature is flexible and can explore many possible states,
• a system at lower temperature is likely to explore a subset of lower-energy (better) states.
In low-temperature sampling (τ ≤ 1) we smoothly
• increase the probability of the most probable words
• decrease the probability of the rare words.
Temperature sampling
Divide the logits by a temperature parameter τ before passing them through the softmax.
Instead of  y = softmax(u)
we compute  y = softmax(u/τ),  with 0 < τ ≤ 1
Temperature sampling: y = softmax(u/τ)
Why does this work?
• When τ is close to 1 the distribution doesn't change much.
• The lower τ is, the larger the scores being passed to the softmax.
• Softmax pushes high values toward 1 and low values toward 0.
• Larger inputs push high-probability words higher and low-probability words lower, making the distribution more greedy.
• As τ approaches 0, the probability of the most likely word approaches 1.
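A minimal numpy sketch of temperature sampling, applying τ to raw logits u (assumed to come from the model) before the softmax:

import numpy as np

def softmax(u):
    e = np.exp(u - u.max())                     # subtract max for numerical stability
    return e / e.sum()

def temperature_sample(logits, tau, rng=np.random.default_rng()):
    """Sample after reshaping the distribution with temperature tau (0 < tau <= 1)."""
    probs = softmax(logits / tau)               # lower tau -> larger scores -> greedier
    return rng.choice(len(probs), p=probs)

logits = np.array([2.0, 1.0, 0.5, -1.0])
for tau in (1.0, 0.5, 0.1):
    print(tau, softmax(logits / tau).round(3))  # the distribution sharpens as tau shrinks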
Large Language Models
Pretraining Large Language Models: Algorithm
Pretraining
The big idea that underlies all the amazing performance of language models:
First pretrain a transformer model on enormous amounts of text.
Then apply it to new tasks.
Self-supervised training algorithm
We just train them to predict the next word!
1. Take a corpus of text.
2. At each time step t:
   i. ask the model to predict the next word
   ii. train the model using gradient descent to minimize the error in this prediction
"Self-supervised" because it just uses the next word as the label!
Intuition of language model training: loss
• Same loss function: cross-entropy loss
• We want the model to assign a high probability to the true word w
• = want the loss to be high if the model assigns too low a probability to w
• CE loss: the negative log probability that the model assigns to the true next word w
• If the model assigns too low a probability to w,
• we move the model weights in the direction that assigns a higher probability to w.
Cross-entropy loss for language modeling
L_CE(ŷ_t, y_t) = − Σ_{w∈V} y_t[w] log ŷ_t[w]
The correct distribution y_t knows the next word, so it is 1 for the actual next word and 0 for the others.
So in this sum, all terms get multiplied by zero except one: the log probability the model assigns to the correct next word, so:
L_CE(ŷ_t, y_t) = − log ŷ_t[w_{t+1}]
Teacher forcing
• At each token position t, the model sees the correct tokens w_1:t,
• and computes the loss (−log probability) for the next token w_{t+1}.
• At the next token position t+1 we ignore what the model predicted for w_{t+1};
• instead we take the correct word w_{t+1}, add it to the context, and move on.
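A minimal PyTorch sketch of one teacher-forced training step, assuming a hypothetical model that maps token ids to next-token logits (the lecture does not prescribe a framework): the inputs are the correct tokens w_1:t and the labels are the same sequence shifted by one.

import torch
import torch.nn.functional as F

def training_step(model, optimizer, token_ids):
    """One step of next-word prediction with teacher forcing and cross-entropy loss."""
    inputs = token_ids[:, :-1]                   # the model always sees the correct prefix
    targets = token_ids[:, 1:]                   # the true next word at every position
    logits = model(inputs)                       # (batch, seq_len, vocab_size)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))  # -log prob of the correct next word, averaged
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()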
Training a transformer language model
[Figure: The input tokens "So long and thanks for" are embedded (input encoding E), passed through the stacked transformer blocks and the language modeling head (U → logits), and at every position the loss is the negative log probability assigned to the true next token ("long", "and", "thanks", "for", "all"), e.g. −log y_and, −log y_thanks.]
Large Language Models
Pretraining data for LLMs
LLMs are mainly trained on the web
• Common Crawl: snapshots of the entire web produced by the non-profit Common Crawl, with billions of pages
• Colossal Clean Crawled Corpus (C4; Raffel et al. 2020): 156 billion tokens of English, filtered
• What's in it? Mostly patent text documents, Wikipedia, and news sites
The Pile: a pretraining corpus drawn from academic text, the web, books, and dialog.
Filtering for quality and safety
Quality is subjective:
• Many LLMs attempt to match Wikipedia, books, particular websites
• Need to remove boilerplate, adult content
• Deduplication at many levels (URLs, documents, even lines)
Safety is also subjective:
• Toxicity detection is important, although that has mixed results
• It can mistakenly flag data written in dialects like African American English
What does a model learn from pretraining?
• There are canines everywhere! One dog in the front room, and two dogs
• It wasn't just big it was enormous
• The author of "A Room of One's Own" is Virginia Woolf
• The doctor told me that he
• The square root of 4 is 2
Big idea
Text contains enormous amounts of knowledge.
Pretraining on lots of text with all that knowledge is what gives language models their ability to do so much.
But there are problems with scraping from the web
Copyright: much of the text in these datasets is copyrighted
• Not clear if the fair use doctrine in the US allows for this use
• This remains an open legal question
Data consent:
• Website owners can indicate they don't want their site crawled
Privacy:
• Websites can contain private IP addresses and phone numbers
Large Language Models
Finetuning

Finetuning for adaptation to new domains
What happens if we need our LLM to work well on a domain it didn't see in pretraining?
• Perhaps some specific medical or legal domain?
• Or maybe a multilingual LM needs to see more data on some language that was rare in pretraining?
Finetuning
[Figure: Pretraining data → pretraining → pretrained LM; then fine-tuning data → fine-tuning → fine-tuned LM.]
"Finetuning"means4differentthings
We'lldiscuss1here,and3inlaterlectures
Inallfourcases,finetuningmeans:
takingapretrainedmodelandfurtheradaptingsomeorallofitsparameterstosomenewdata
1.Finetuningas"continuedpretraining"onnewdata
?Furthertrainalltheparametersofmodelonnewdata
?usingthesamemethod(wordprediction)andlossfunction(cross-entropyloss)asforpretraining.
?asifthenewdatawereatthetailendofthepretrainingdata
?Hencesometimescalledcontinuedpretraining
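A minimal sketch of continued pretraining, assuming the Hugging Face transformers and datasets libraries and a hypothetical in-domain corpus file legal_corpus.txt; the lecture itself does not tie finetuning to any particular toolkit.

from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import load_dataset

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

# hypothetical new-domain data, tokenized for the causal LM
data = load_dataset("text", data_files="legal_corpus.txt")["train"]
data = data.map(lambda ex: tok(ex["text"], truncation=True, max_length=512),
                remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-legal", num_train_epochs=1),
    train_dataset=data,
    # mlm=False selects the causal-LM objective: the same next-word prediction
    # loss used in pretraining, now applied to the new data
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()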
Large Language Models
Evaluating Large Language Models
Perplexity
Just as for n-gram grammars, we use perplexity to measure how well the LM predicts unseen text.
The perplexity of a model θ on an unseen test set is the inverse probability that θ assigns to the test set, normalized by the test set length.
For a test set of n tokens w_1:n the perplexity is:
Perplexity_θ(w_1:n) = P_θ(w_1:n)^(−1/n) = ( ∏_{i=1}^{n} 1 / P_θ(w_i | w_{<i}) )^(1/n)
Why perplexity instead of raw probability of the test set?
• Probability depends on the size of the test set
• Probability gets smaller the longer the text
• Better: a metric that is per-word, normalized by length
• Perplexity is the inverse probability of the test set, normalized by the number of words
(The inverse comes from the original definition of perplexity from the cross-entropy rate in information theory.)
Probability range is [0,1]; perplexity range is [1,∞).
Perplexity
• The higher the probability of the word sequence, the lower the perplexity.
• Thus the lower the perplexity of a model on the data, the better the model.
• Minimizing perplexity is the same as maximizing probability.
Also: perplexity is sensitive to length/tokenization, so it is best used when comparing LMs that use the same tokenizer.
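A minimal sketch of computing perplexity from per-token conditional probabilities P(w_i | w_<i) (assumed to come from whatever LM is being evaluated); working in log space avoids underflow on long texts:

import numpy as np

def perplexity(token_probs):
    """token_probs[i] = P(w_i | w_<i) that the model assigns to each test-set token."""
    log_probs = np.log(token_probs)
    return np.exp(-np.mean(log_probs))   # exponentiated average negative log probability

# toy example: 4 tokens, each given probability 0.1 -> perplexity 10
print(perplexity([0.1, 0.1, 0.1, 0.1]))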
Many other factors that we evaluate, like:
• Size: big models take lots of GPUs and time to train, and memory to store
• Energy usage: can measure kWh or kilograms of CO2 emitted
• Fairness: benchmarks measure gendered and racial stereotypes, or decreased performance for language from or about some groups
Large Language Models
Dealing with Scale

Scaling Laws
LLM performance depends on:
• Model size: the number of parameters, not counting embeddings
• Dataset size: the amount of training data
• Compute: the amount of compute (in FLOPs, etc.)
We can improve a model by adding parameters (more layers, wider contexts), adding more data, or training for more iterations.
The performance of a large language model (the loss) scales as a power law with each of these three.
Scaling Laws
The loss L as a function of the number of non-embedding parameters N, dataset size D, or compute budget C (when the other two are held constant) follows a power law of roughly the form (Kaplan et al., 2020):
L(N) = (N_c / N)^{α_N}    L(D) = (D_c / D)^{α_D}    L(C) = (C_c / C)^{α_C}
Scaling laws can be used early in training to predict what the loss would be if we were to add more data or increase model size.
The number of non-embedding parameters N is approximately
N ≈ 12 n_layer d²
Thus GPT-3, with n_layer = 96 layers and dimensionality d = 12288, has 12 × 96 × 12288² ≈ 175 billion parameters.
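A quick check of that arithmetic:

n_layer, d = 96, 12288
print(12 * n_layer * d**2)   # 173946175488, i.e. roughly 175 billion parameters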
KV Cache
In training, we can compute attention very efficiently in parallel.
But not at inference! We generate the next tokens one at a time!
For a new token x_i, we need to multiply by W^Q, W^K, and W^V to get its query, key, and value vectors.
But we don't want to recompute the key and value vectors for all the prior tokens x_<i.
Instead, we store the key and value vectors in memory in the KV cache, and then we can just grab them from the cache.
KV Cache
[Figure: The attention computation A = softmax(mask(Q Kᵀ)) V, with Q (N × d_k), Kᵀ (d_k × N), the masked score matrix Q Kᵀ of entries q_i·k_j (N × N), and V (N × d_v). At inference, the keys k_1…k_N and values v_1…v_N for already-generated tokens are read from the KV cache rather than recomputed; only the new token's query, key, and value are computed.]
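A minimal numpy sketch of single-head attention with a KV cache, using assumed dimensions and hypothetical weight matrices Wq, Wk, Wv; a real implementation would also handle batching and multiple heads:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class KVCache:
    """Store keys and values of already-generated tokens so they are never recomputed."""
    def __init__(self):
        self.keys, self.values = [], []

    def attend(self, x_new, Wq, Wk, Wv):
        q = x_new @ Wq                             # project only the NEW token
        k = x_new @ Wk
        v = x_new @ Wv
        self.keys.append(k)                        # cache the new key ...
        self.values.append(v)                      # ... and the new value
        K = np.stack(self.keys)                    # (t, d_k): cached keys incl. the new one
        V = np.stack(self.values)                  # (t, d_v)
        scores = (q @ K.T) / np.sqrt(K.shape[-1])  # attend over all tokens so far
        return softmax(scores) @ V                 # attention output for the new token

# toy usage: 3 decoding steps with d_model = d_k = d_v = 4
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
cache = KVCache()
for _ in range(3):
    x_new = rng.normal(size=4)                     # embedding of the newly generated token
    out = cache.attend(x_new, Wq, Wk, Wv)
print(out.shape)                                   # (4,)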