
Current Best Practices for Training LLMs from Scratch

Authors: Rebecca Li, Andrea Parker, Justin Tenuto

Weights & Biases

Table of Contents

Introduction
Build vs. Buy Pre-trained LLM Models
The Scaling Laws
Hardware
Memory vs. Compute Efficiency
Techniques for Parallelization
Dataset Collection
Dataset Pre-processing
Dataset Handling
Tokenization
Pre-training Steps
Model Evaluation
Bias and Toxicity
Instruction Tuning
Reinforcement Learning through Human Feedback (RLHF)
Conclusion
References
Appendix
LLM Overview
Transformer Model Architecture
The Original LLM Scaling Laws


Introduction

Although we're only a few years removed from the transformer breakthrough, LLMs have already grown massively in performance, cost, and promise. At W&B, we've been fortunate to see more teams try to build LLMs than anyone else. But many of the critical details and key decision points are often passed down by word of mouth.

The goal of this whitepaper is to distill the best practices for training your own LLM from scratch. We'll cover everything from scaling and hardware to dataset selection and model training, letting you know which tradeoffs to consider and flagging some potential pitfalls along the way. This is meant to be a fairly exhaustive look at the key steps and considerations you'll make when training an LLM from scratch.

The first question you should ask yourself is whether training one from scratch is right for your organization. As such, we'll start there:

BUILD VS. BUY PRE-TRAINED LLM MODELS

Before starting LLM pre-training, the first question you need to ask is whether you should pre-train an LLM by yourself or use an existing one. There are three basic approaches:

• Option 1: Use the API of a commercial LLM, e.g. GPT-3 (OpenAI, 2020), Cohere APIs, AI21 J-1

• Option 2: Use an existing open-sourced LLM, e.g. GPT-J (EleutherAI, 2021), GPT-NeoX (EleutherAI, 2022), Galactica (Meta AI), UL2 (Google, 2022), OPT (Meta AI, 2022), BLOOM (BigScience, 2022), Megatron-LM (NVIDIA, 2021), CodeGen (Salesforce, 2022)

• Option 3: Pre-train an LLM by yourself or with consultants: you can either manage your own training or hire LLM consultants and platforms. For example, MosaicML provides training services focusing on LLMs.

That said, there are a lot of details to consider when making your choice. Here are the pros, cons, and applicable scenarios for each option:

Pros

Option 1: Use the API of a commercial LLM

• Requires the least LLM training technical skills.
• Minimum upfront training/exploration cost, given the main cost is incurred at inference time.
• The least data-demanding option. Only a few examples (or no examples) are needed for models to perform inference.
• Can leverage the best-performing LLMs in the market and build a superior experience.
• Reduce time-to-market of your apps and de-risk your project with a working LLM model.

Option 2: Use an existing open-sourced LLM

• A good way to leverage what LLMs have learned from a vast amount of internet data and build on top of it without paying for the IP at inference.
• Compared to option one, you are less dependent on the future direction of LLM service providers and thus have more control regarding roadmap & backwards compatibility.
• Compared to option three, you have a much faster time-to-value given you are not building LLMs from scratch, also leading to less data, training time, and training budget needed.

Option 3: Pre-train an LLM by yourself or with consultants

• Compared to options one and two, you have the most control of your LLM's performance and future direction, giving you lots of flexibility to innovate on techniques and/or customize to your downstream tasks.
• Gain full control of the training datasets used for pre-training, which directly impacts model quality, bias, and toxicity issues. In comparison, those issues are less controllable in option one or two.
• Training your own LLM also gives you a deep moat: superior LLM performance either across horizontal use cases or tailored to your vertical, allowing you to build a sustaining advantage, especially if you create a positive data/feedback loop with LLM deployments.


Cons

Option 1: Use the API of a commercial LLM

• Commercial LLM services can get expensive with a high volume of fine-tuning or inference tasks. It comes down to LLM total cost of ownership (TCO) amortized over each inference.
• Many industries/use cases forbid the use of commercial LLM services, as sensitive data/PII data cannot be seen by the service for compliance reasons (healthcare use cases, for example).
• If building external apps, you'll need to find other moats and de-risk your business if you're highly reliant on external LLM service technology.
• Less flexible downstream: doesn't support edge inference, limited ability to customize the model (fine-tuning gets expensive), limited ability for ongoing model improvements.

Option 2: Use an existing open-sourced LLM

• Not as demanding as building your own, but still requires lots of domain expert skills to train, fine-tune, and host an open-sourced LLM. LLM reproducibility is still a significant issue, so the amount of time and work needed cannot be underestimated.
• Slower time-to-market and less agile if you are building downstream apps, due to a more vertical tech stack.
• Open-sourced models typically lag in performance compared to commercial models by months/years. If your competitor leverages commercial models, they have an advantage on LLM tech and you'll need to find other competitive advantages.

Option 3: Pre-train an LLM by yourself or with consultants

• Very expensive endeavor with high risks. Needs cross-domain knowledge spanning NLP/ML, subject matter expertise, and software and hardware expertise. If not done well, you could end up in a situation where you've spent thousands or even millions of dollars with a suboptimal model. Mistakes, especially late into training stages, are hard to fix/unwind.
• Less efficient than option two. Option two leverages existing LLMs, which have learned from an entire internet's worth of data and can provide a solid starting point. With option three, you start from scratch and need lots of high-quality/diverse datasets for your models to gain generalized capabilities.

When to consider each option

Option 1: Use the API of a commercial LLM

• Best if you either have less technical teams but want to leverage LLM techniques to build downstream apps, or you want to leverage the best-in-class LLMs for performance reasons (outsourcing the LLM tech).
• Good if you have very limited training datasets and want to leverage an LLM's capability to do zero/few-shot learning.
• Good for prototyping apps and exploring what is possible with LLMs.

Option 2: Use an existing open-sourced LLM

• Between options two and three, if you aren't trying to change the model architecture, it is almost always better to either directly take an existing pre-trained LLM and fine-tune it, or take the weights of an existing pre-trained LLM as a starting point and continue pre-training. The reason is that a good pre-trained LLM like GPT-NeoX has already seen a vast amount of data and thus has learned general capabilities from that data. You can leverage that learning, especially if your own training dataset is not huge or diverse.
• Another typical scenario is that you operate in a regulatory environment or have user/sensitive data that cannot be fed to commercial LLM services. Or you need edge deployment of the model for latency or locational reasons.

Option 3: Pre-train an LLM by yourself or with consultants

• Best if you need to change the model architecture or training dataset from existing pre-trained LLMs. For example, if you want to use a different tokenizer, change the vocabulary size, or change the number of hidden dimensions, attention heads, or layers.
• Typically, in this case the LLM is a core part of your business strategy and technological moat. You are taking on some or a lot of innovations in LLM training, and have a large investment appetite to train and maintain expensive models on an ongoing basis.
• Typically, you have or will have lots of proprietary data associated with your LLM to create a continuous model improvement loop for sustainable competitive advantage.

It is also worth mentioning that if you only have a very targeted set of use cases and don't need the general-purpose or generative capabilities of LLMs, you might want to consider training or fine-tuning a much smaller transformer or other, much simpler deep learning models. That could result in much less complexity, less training time, and less ongoing cost.


THE SCALING LAWS

Before you dive into training, it's important to cover how LLMs scale. Understanding scaling lets you effectively balance the size and complexity of your model and the size of the data you'll use to train it.

Some relevant history here: OpenAI originally introduced "the LLM scaling laws" in 2020. They suggested that increasing model size was more important than scaling data size. This held for about two years before DeepMind suggested almost the polar opposite: that previous models were significantly undertrained and that increasing your foundational training datasets actually leads to better performance.

That changed in 2022. Specifically, DeepMind put forward an alternative approach in their Training Compute-Optimal Large Language Models paper. They found that current LLMs are actually significantly undertrained. Put simply: these large models weren't trained on nearly enough data.

DeepMind showcased this with a model called Chinchilla, which is a fourth the size of the Gopher model above but trained on 4.6x more data. At that reduced size but with far more training data, Chinchilla outperformed Gopher and other LLMs.

DeepMind claims that the model size and the number of training tokens* should instead increase at roughly the same rate to achieve optimal performance. If you get a 10x increase in compute, you should make your model 3.1x bigger and the data you train over 3.1x bigger; if you get a 100x increase in compute, you should make your model 10x bigger and your data 10x bigger.

*Note: Tokenization in NLP is an essential step of separating a piece of text into smaller units called tokens. Tokens can be either words, characters, or subwords. The number of training tokens is the size of the training data in token form after tokenization. We will dive into detailed tokenization methods a little later.
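To make the token-counting idea concrete, here is a minimal sketch using an off-the-shelf BPE tokenizer. The Hugging Face transformers library and the GPT-2 vocabulary are illustrative assumptions, not tools prescribed by this whitepaper.

```python
# Minimal sketch: counting training tokens with an off-the-shelf BPE tokenizer.
# Assumes the Hugging Face `transformers` package; GPT-2's BPE vocabulary is used
# here only as an example -- the whitepaper does not prescribe a specific tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

corpus = [
    "Large language models are trained on token sequences.",
    "The number of training tokens is the corpus size after tokenization.",
]

total_tokens = sum(len(tokenizer.encode(doc)) for doc in corpus)
print(f"Training tokens in this toy corpus: {total_tokens}")
```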

DeepMind provides the following chart showing how much training data and compute you'd need to optimally train models of various sizes.

[Figure: Estimated optimal training FLOPs and training tokens for various model sizes, from Training Compute-Optimal Large Language Models]

That said, most existing LLMs are still undertrained:

[Figure: Data/compute-optimal (Chinchilla) heatmap, from "Chinchilla data-optimal scaling laws: In plain English"]

In summary, the current best practices in choosing the size of your LLM models are largely based on two rules:

• Decide on your dataset and find the Chinchilla-optimal model size based on data size (or close to Chinchilla-optimal within the boundary of your data collection limitations)

• Determine the data and model size combination that's best for your model, based on your training compute budget and inference latency requirements

To the left of the minima on each curve, models are too small: a larger model trained on less data would be an improvement. To the right of the minima on each curve, models are too large: a smaller model trained on more data would be an improvement. The best models are at the minima.
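As a rough illustration of the "scale model and data together" rule, the sketch below sizes a model and dataset from a compute budget. It assumes the common C ≈ 6·N·D approximation for training FLOPs and a ratio of roughly 20 training tokens per parameter inferred from the Chinchilla results; both constants are assumptions, not figures taken from this whitepaper.

```python
# Minimal sketch of Chinchilla-style "scale model and data together" sizing.
# Assumptions (not from this whitepaper): training compute C ~= 6 * N * D FLOPs,
# and a compute-optimal ratio of roughly 20 training tokens per parameter.
def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Return (parameters, training_tokens) that roughly balance model and data size."""
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

for c in (1e21, 1e22, 1e23):  # each 10x step in compute ...
    n, d = chinchilla_optimal(c)
    print(f"C={c:.0e} FLOPs -> ~{n / 1e9:.1f}B params, ~{d / 1e9:.0f}B tokens")
# ... grows both the model and the dataset by ~sqrt(10) ~= 3.1x, matching the rule above.
```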


HARDWARE

It should come as no surprise that pre-training LLMs is a hardware-intensive effort. The following examples of current models are a good guide here:

• PaLM (540B, Google): 6144 TPU v4 chips used in total, made of two TPU v4 Pods connected over data center network (DCN), using a combination of model and data parallelism

• OPT (175B, Meta AI): 992 80GB A100 GPUs, utilizing fully sharded data parallelism with Megatron-LM tensor parallelism

• GPT-NeoX (20B, EleutherAI): 96 40GB A100 GPUs in total

• Megatron-Turing NLG (530B, NVIDIA & MSFT): 560 DGX A100 nodes, where each cluster node has 8 NVIDIA 80GB A100 GPUs

Training LLMs is challenging from an infrastructure perspective for two big reasons. For starters, it is simply no longer possible to fit all the model parameters in the memory of even the largest GPU (e.g. NVIDIA 80GB A100), so you'll need some parallel architecture here. The other challenge is that a large number of compute operations can result in unrealistically long training times if you aren't concurrently optimizing your algorithms, software, and hardware stack (e.g. training GPT-3 with 175B parameters would require about 288 years with a single V100 NVIDIA GPU).

Memory vs. Compute Efficiency

To achieve the full potential of thousands of distributed GPUs, it is crucial to design parallelism into your architecture to balance memory and compute efficiency.

Memory efficiency

Training an LLM requires terabytes of aggregate memory for model weights, gradients, and optimizer states - far beyond what is available on a single GPU. One typical mitigation strategy is gradient accumulation, in which the full training batch is split into micro-batches that are processed in sequence, with their resulting gradients accumulated before updating the model weights. That means your training batch size can scale without increasing the peak resident activation memory.

Compute efficiency

While large GPU clusters can have thousands of high-throughput GPUs, achieving high compute efficiency at this scale is challenging. A large batch size can be an effective way to increase compute efficiency, because it increases the arithmetic intensity of a GPU kernel and helps amortize the time spent stalled on communication and synchronization. However, using too large of a batch size can have negative effects on model quality.

While parallelization is paramount, there are many different ways to do it. We'll get into the most common in our next section.

Techniques for Parallelization

Parallelization refers to splitting up tasks and distributing them across multiple processors or devices, such as GPUs, so that they can be completed simultaneously. This allows for more efficient use of compute resources and faster completion times compared to running on a single processor or device.

Parallelized training across multiple GPUs is an effective way to reduce the overall time needed for the training process. There are several different strategies that can be used to parallelize training, including gradient accumulation, micro-batching, data parallelization, tensor parallelization, pipeline parallelization, and more. Typical LLM pre-training employs a combination of these methods. Let's define each:

Data Parallelism

Data parallelism is the best and most common approach for dealing with large datasets that cannot fit into a single machine in a deep learning workflow.

More specifically, data parallelism divides the training data into multiple shards (partitions) and distributes them to various nodes. Each node first works with its local data to train its sub-model, and then communicates with the other nodes to combine their results at certain intervals in order to obtain the global model. The parameter updates for data parallelism can be either asynchronous or synchronous.

The advantage of this method is that it increases compute efficiency and that it is relatively easy to implement. The biggest downside is that during the backward pass you have to pass the whole gradient to all other GPUs. It also replicates the model and optimizer across all workers, which is rather memory inefficient.
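For illustration, here is a minimal synchronous data-parallel training loop using PyTorch's DistributedDataParallel. The tiny linear model, random data, and launch setup are placeholders; the whitepaper does not mandate a particular framework.

```python
# Minimal data-parallelism sketch (PyTorch DistributedDataParallel).
# Placeholder model/data; launch with e.g. `torchrun --nproc_per_node=8 train.py`.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    dist.init_process_group("nccl")                   # one process per GPU
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda()        # stand-in for a transformer
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    data = TensorDataset(torch.randn(4096, 1024), torch.randn(4096, 1024))
    sampler = DistributedSampler(data)                # each rank sees its own shard
    loader = DataLoader(data, batch_size=32, sampler=sampler)

    for x, y in loader:
        x, y = x.cuda(), y.cuda()
        loss = torch.nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()                               # gradients are all-reduced across ranks
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```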


Tensor Parallelism

Tensor parallelism divides large matrix multiplications into smaller submatrix calculations which are then executed simultaneously using multiple GPUs.

This allows for faster training times due to its asynchronous nature and the ability to reduce communication overhead between nodes. The benefit of this method is that it is memory-efficient. The downside, however, is that it introduces additional communication of activations in each forward and backward propagation, and therefore requires high communication bandwidth to be efficient.
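The core idea can be sketched with a column-parallel matrix multiply. The example below runs on CPU purely to show the math; in a real system (e.g. Megatron-LM) each shard lives on its own GPU and the final concatenation becomes an all-gather collective.

```python
# Sketch of tensor (intra-layer) parallelism: a column-parallel matmul.
# Shown on CPU for clarity; in practice each shard sits on a different GPU and the
# concatenation is replaced by a communication collective.
import torch

def column_parallel_matmul(x, full_weight, num_shards=2):
    """Split the weight by output columns, compute partial results, then gather."""
    shards = full_weight.chunk(num_shards, dim=1)   # each shard: [d_in, d_out / num_shards]
    partials = [x @ shard for shard in shards]      # would run on separate GPUs
    return torch.cat(partials, dim=1)               # gather along the output dimension

x = torch.randn(8, 1024)
w = torch.randn(1024, 4096)
assert torch.allclose(column_parallel_matmul(x, w), x @ w, atol=1e-5)
```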

Pipeline parallelism and model parallelism

Pipeline parallelism improves both the memory and compute efficiency of deep learning training by partitioning the layers of a model into stages that can be processed in parallel.

This helps significantly with overall throughput while adding the smallest communication overhead. You can think of pipeline parallelism as "inter-layer parallelism" (where tensor parallelism can be thought of as "intra-layer parallelism").

Similar to pipeline parallelism, model parallelism is when you split the model among GPUs and use the same data for each model, so each GPU works on a part of the model rather than a part of the data. The downside of pipeline and model parallelism is that it cannot scale infinitely, given that the degree of pipeline parallelism is bounded by the depth of the model.
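Here is a minimal, purely illustrative sketch of the pipeline idea: layers are partitioned into stages and micro-batches flow through them. It runs sequentially on CPU; a real pipeline schedule overlaps stages across GPUs to keep every device busy.

```python
# Sketch of pipeline (inter-layer) parallelism: layers are partitioned into stages
# and micro-batches flow through them. Shown sequentially on CPU; real systems
# place each stage on its own GPU and overlap their execution.
import torch

stage1 = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU())  # e.g. on GPU 0
stage2 = torch.nn.Sequential(torch.nn.Linear(512, 512))                   # e.g. on GPU 1

batch = torch.randn(32, 512)
micro_batches = batch.chunk(4)         # splitting the batch lets stages overlap in practice

outputs = []
for mb in micro_batches:
    activations = stage1(mb)           # stage 0 forward (would be sent to the next device)
    outputs.append(stage2(activations))

result = torch.cat(outputs)
print(result.shape)  # torch.Size([32, 512])
```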

As mentioned at the start of this section, it's not uncommon for teams to leverage a combination of parallelism techniques during training. For example, PaLM (Google Brain, 2022) and OPT (Meta AI, 2022) both used a combination of tensor model parallelism and data parallelism.

NVIDIA approached things a little differently in the Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM paper. They proposed a PTD-P technique that combines pipeline, tensor, and data parallelism to achieve state-of-the-art computational performance (52% of peak device throughput) on thousands of GPUs.

Specifically, PTD-P leverages a combination of pipeline parallelism across multi-GPU servers, tensor parallelism within a multi-GPU server, and data parallelism to practically train models with a trillion parameters. The method also employs graceful scaling in an optimized cluster environment with high-bandwidth links between GPUs on the same server and across servers.

Using these techniques to train LLMs requires not only the highest-performing GPUs to be efficient, but also high-bandwidth networking for optimal communication: InfiniBand is often used to move data between nodes.

But this of course comes with a cost. Leveraging thousands of high-performing GPUs and high-bandwidth networks to train LLMs is infrastructure-intensive. For example, a back-of-the-envelope calculation estimated that the cost of the PaLM model (540B, Google) might be as high as $23MM (see detailed analysis).

To implement distributed deep learning training systems, software toolkits such as Distributed TensorFlow, Torch Distributed, and Horovod, and libraries such as DeepSpeed and Megatron, are often needed. There is implementation complexity here, so it requires systems expertise if you're going to be successful.

In addition, the following techniques and strategies are commonly employed to achieve parallelism:

Gradient accumulation

Gradient accumulation involves adding up gradients from multiple batches before performing one weight update step on all accumulated gradients at once.

This approach reduces communication overhead between GPUs by allowing them to work independently on their own local batch of data until they have synchronized with each other again, after accumulating enough gradients for a single optimization step.
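A minimal sketch of gradient accumulation in PyTorch, with a placeholder model and random micro-batches, might look like this:

```python
# Minimal gradient accumulation sketch: accumulate over k micro-batches
# before a single optimizer step (placeholder model and data).
import torch

model = torch.nn.Linear(512, 512)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
accumulation_steps = 8                       # effective batch = 8 micro-batches

opt.zero_grad()
for step in range(64):
    x = torch.randn(4, 512)                  # one micro-batch
    y = torch.randn(4, 512)
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / accumulation_steps).backward()   # scale so the accumulated gradient is an average
    if (step + 1) % accumulation_steps == 0:
        opt.step()                           # one weight update per accumulated batch
        opt.zero_grad()
```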

Asynchronous stochastic gradient descent optimization

Asynchronous stochastic gradient descent optimization methods can also be employed when performing model optimization over multiple GPUs.


This method uses small subsets (micro-batches) of data from each node instead of loading all data at once, which helps reduce memory requirements while still allowing for fast convergence rates due to its asynchronous nature. It works like this:

• First, we fetch the most up-to-date parameters of the model needed to process the current mini-batch from the parameter servers.

• We then compute gradients of the loss with respect to these parameters.

• Finally, these gradients are sent back to the parameter servers, which then update the model accordingly.

Micro-batching

Micro-batching combines small mini-batches into larger ones so that more batches can be processed in less time and with fewer synchronization points between devices during backpropagation operations. It has become increasingly popular for training very large models across many GPUs due to its ability to reduce memory consumption and improve scalability. Overall, micro-batching is an effective way to leverage distributed deep learning techniques when dealing with very large datasets or models that require significant amounts of processing power.

Now that we've gone through scaling, hardware, and some techniques for parallelizing your training runs, let's look at what your LLM will actually learn from: data.

DATASET COLLECTION

Bad data leads to bad models. But careful processing of high-quality, high-volume, diverse datasets directly contributes to model performance in downstream tasks, as well as model convergence.

Dataset diversity is especially important for LLMs. That's because diversity improves the cross-domain knowledge of the model, as well as its downstream generalization capability. Training on diverse examples effectively broadens the ability of your LLM to perform well on myriad nuanced tasks.

A typical training dataset is comprised of textual data from diverse sources, such as crawled public data, online publication or book repositories, code data from GitHub, Wikipedia, news, social media conversations, etc.
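As a small illustration of assembling such a mixture, the sketch below interleaves a few local text sources with sampling weights using the Hugging Face datasets library. The file paths and weights are hypothetical; the whitepaper does not prescribe specific tooling.

```python
# Illustrative sketch: building a diverse pre-training mixture by interleaving sources.
# File paths and sampling weights are hypothetical placeholders.
from datasets import load_dataset, interleave_datasets

web = load_dataset("text", data_files={"train": "web_crawl/*.txt"}, split="train", streaming=True)
code = load_dataset("text", data_files={"train": "github_code/*.txt"}, split="train", streaming=True)
books = load_dataset("text", data_files={"train": "books/*.txt"}, split="train", streaming=True)

# Oversample higher-quality prose relative to raw web text (weights are illustrative only).
mixture = interleave_datasets([web, code, books], probabilities=[0.5, 0.2, 0.3], seed=42)

for i, example in enumerate(mixture):
    if i >= 3:
        break
    print(example["text"][:80])
```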

For example, consider The Pile. The Pile is a popular text corpus created by EleutherAI for large-scale language modeling. It contains data from 22 data sources, coarsely broken down into five broad categories:

• Academic Writing: PubMed Abstracts and PubMed Central, arXiv, FreeLaw, USPTO Backgrounds, PhilPapers, NIH Exporter

• Online or Scraped Resources: CommonCrawl, OpenWebText2, Stack Exchange, Wikipedia

• Prose: BookCorpus2, Bibliotik, Project Gutenberg

• Dialog: YouTube subtitles, Ubuntu IRC, OpenSubtitles, Hacker News, Europarl

• Miscellaneous: GitHub, the DeepMind Mathematics dataset, Enron emails

Note that The Pile dataset is one of the very few large-scale text datasets that is free for the public. For most of the existing models like GPT-3, PaLM, and Galactica, their training and evaluation datasets are not publicly available. Given the large-scale effort it takes to compile and pre-process these datasets for LLM training, most companies have kept them in-house to maintain a competitive advantage. That makes datasets like The Pile and a few datasets from AllenAI extremely valuable for public large-scale NLP research purposes.

Another thing worth mentioning is that, during dataset collection, general data can be collected by non-experts, but data for specific domains normally ne
