




warwick.ac.uk/lib-publications
A Thesis Submitted for the Degree of PhD at the University of Warwick
Permanent WRAP URL:
http://wrap.warwick.ac.uk/179925
Copyright and reuse:
This thesis is made available online and is protected by original copyright. Please scroll down to view the document itself.
Please refer to the repository record for this item for information to help you to cite it. Our policy information is available from the repository homepage.
For more information, please contact the WRAP Team at: wrap@warwick.ac.uk
THE UNIVERSITY OF WARWICK
Learning to Communicate in Cooperative Multi-Agent Reinforcement Learning
by
Emanuele Pesce
Thesis
Submitted to the University of Warwick in partial fulfilment of the requirements for admission to the degree of Doctor of Philosophy in Engineering
Warwick Manufacturing Group
February 2023
Contents
List of Tables
List of Figures
Acknowledgments
Declarations
1 Publications
2 Sponsorships and grants
Abstract
Acronyms
Chapter 1 Introduction
1.1 Research objectives
1.2 Contributions
1.3 Outline
Chapter 2 Literature Review
2.1 Reinforcement learning
2.2 Deep reinforcement learning
2.3 Multi-agent deep reinforcement learning
2.4 Cooperative methods
2.5 Emergence of communication
2.6 Communication methods
2.6.1 Attention mechanisms to support communication
2.6.2 Graph-based communication mechanisms
Chapter 3 Memory-driven communication
3.1 Introduction
3.2 Memory-driven MADDPG
3.2.1 Problem setup
3.2.2 Memory-driven communication
3.2.3 MD-MADDPG decentralised execution
3.3 Experimental settings
3.3.1 Environments
3.4 Experimental results
3.4.1 Main results
3.4.2 Implementation details
3.4.3 Increasing the number of agents
3.5 Communication analysis
3.6 Ablation studies
3.6.1 Investigating the memory components
3.6.2 Corrupting the memory
3.6.3 Multiple seeds
3.6.4 Multiple memory sizes
3.7 Summary
Chapter 4 Connectivity-driven communication
4.1 Introduction
4.2 Connectivity-driven communication
4.2.1 Problem setup
4.2.2 Learning the dynamic communication graph
4.2.3 Learning a time-dependent attention mechanism
4.2.4 Heat kernel: additional details and an illustration
4.2.5 Reinforcement learning algorithm
4.3 Experimental settings
4.3.1 Environments
4.3.2 Implementation details
4.4 Experimental results
4.4.1 Main results
4.4.2 Varying the number of agents
4.5 Communication analysis
4.6 Ablation studies
4.6.1 Investigating the heat-kernel components
4.6.2 Heat-kernel threshold
4.7 Summary
Chapter 5 Benchmarking MARL methods for cooperative missions of unmanned aerial vehicles
5.1 Introduction
5.2 Proposed drone environment
5.3 Competing algorithms
5.4 Experimental settings
5.5 Experimental results
5.6 Discussion
5.7 Summary
Chapter 6 Conclusions and future work
6.1 Conclusion
6.2 Future work
6.3 Ethical implications
List of Tables
3.1 Comparing MD-MADDPG with other baselines
3.2 Increasing the number of agents - Cooperative Navigation
3.3 Increasing the number of agents - PO Cooperative Navigation
3.4 Ablation study on MD-MADDPG components
3.5 Corrupting the memory content
4.1 Comparing CDC with other baselines
4.2 Comparative summary of MARL algorithms
4.3 Varying the number of agents
4.4 Graph analysis
4.5 Heat-kernel threshold
5.1 Common parameters of the environments
5.2 Summary of selected MARL algorithms
5.3 Environment parameters
5.4 Benchmarking results
List of Figures
2.1 Reinforcement learning
2.2 MADDPG
3.1 The MD-MADDPG framework
3.2 Environment illustrations
3.3 Learned communication strategies - write
3.4 Learned communication strategy - read
3.5 Changing seeds on Swapping Cooperative Navigation
3.6 Changing seeds on Sequential Cooperative Navigation
3.7 Investigating different memory dimensions
4.1 The CDC framework
4.2 An edge selection example
4.3 Environment illustrations
4.4 Learning curves - Navigation Control and Line Control
4.5 Learning curves - Formation Control and Dynamic Pack Control
4.6 Communication networks - Navigation Control and Line Control
4.7 Communication networks - Formation Control and Dynamic Pack Control
4.8 Average communication graphs - Navigation Control and Line Control
4.9 Average communication graphs - Formation Control and Dynamic Pack Control
4.10 Ablation study on CDC components
5.1 A UAV representation
5.2 Learning curves
To my beloved Simona, for the endless love, care, and support throughout all these years together, and for believing in me more than anyone else.
Acknowledgments
This thesis would not have been possible without the help and support of many people. Firstly, I would like to express my gratitude to Professor Giovanni Montana and the WMG department for granting me the opportunity to pursue a fully-funded PhD at the University of Warwick. I am thankful to Giovanni for his guidance throughout this journey, consistently providing me with useful advice and contributing to the revision of my manuscripts. Additionally, I am grateful to Kurt Debattista for his support over the years. I would also like to extend my thanks to Luke Owen and Ramon Dalmau-Codina for their contributions to the development of the UAV environment. I would like to acknowledge Jeremie Houssineau and Raúl Santos-Rodríguez, my examiners, for their valuable advice to enhance this thesis. A big thank you goes to Professor Tony McNally for his assistance in coordinating the examination process. I also wish to extend my thanks to Professor Roberto Tagliaferri for being a source of inspiration and instilling in me a love for this field.
Furthermore, I wish to thank all the friends I have had the privilege of meeting along this path. Ruggiero, with whom I have shared countless memorable moments and technical discussions. Demetris, for his encouragement and inspiration every time I needed it. I am also thankful to Kevin, Massimo, Ozsel, Saad and Francesco, all of whom have played significant roles in making this PhD journey a more enjoyable and sociable experience.
Finally, I am incredibly grateful to my parents, Rocco and Rita, for always believing in me and encouraging me to be who I am. A special thanks goes to Simona, my beloved partner, who is the most caring person I have met and has consistently been a source of support during both joyful and challenging times.
Declarations
This thesis is submitted to the University of Warwick in support of my application for the degree of Doctor of Philosophy. It has been composed by myself and has not been submitted in any previous application for any degree.
1 Publications
Parts of this thesis have been previously published by the author in the following:
[125] Emanuele Pesce and Giovanni Montana. Improving coordination in small-scale multi-agent deep reinforcement learning through memory-driven communication. Machine Learning, 109(9):1727–1747, 2020
[126] Emanuele Pesce and Giovanni Montana. Learning multi-agent coordination through connectivity-driven communication. Machine Learning, 2022. doi:10.1007/s10994-022-06286-6
[127] Emanuele Pesce, Ramon Dalmau, Luke Owen, and Giovanni Montana. Benchmarking multi-agent deep reinforcement learning for cooperative missions of unmanned aerial vehicles. In Proceedings of the International Workshop on Citizen-Centric Multiagent Systems, pages 49–56. CMAS, 2023
All the work published [125–127] is licensed under a Creative Commons Attribution 4.0 International License. To view a copy of this licence visit /licenses/by/4.0/.
2 Sponsorships and grants
This research was funded by the University of Warwick.
Abstract
Recent advances in deep reinforcement learning have produced unprecedented results. The success obtained on single-agent applications led to exploring these techniques in the context of multi-agent systems, where several additional challenges need to be considered. Communication has always been crucial to achieving cooperation in multi-agent domains, and learning to communicate represents a fundamental milestone for multi-agent reinforcement learning algorithms. In this thesis, different multi-agent reinforcement learning approaches are explored. These provide architectures that are learned end-to-end and capable of achieving effective communication protocols that can boost the system performance in cooperative settings. Firstly, we investigate a novel approach where inter-agent communication happens through a shared memory device that can be used by the agents to exchange messages through learnable read and write operations. Secondly, we propose a graph-based approach where connectivities are shaped by exchanging pairwise messages which are then aggregated through a novel form of attention mechanism based on a graph diffusion model. Finally, we present a new set of environments with real-world inspired constraints that we utilise to benchmark the most recent state-of-the-art solutions. Our results show that communication can be a fundamental tool to overcome some of the intrinsic difficulties that characterise cooperative multi-agent systems.
Acronyms
CDC Connectivity-driven communication.
CLDE Centralised learning decentralised execution.
CN Cooperative navigation.
DDPG Deep deterministic policy gradient.
DNN Deep neural network.
DPG Deterministic policy gradient.
DQN Deep Q-network.
DRL Deep reinforcement learning.
ER Experience replay.
GNN Graph neural network.
HK Heat kernel.
KL Kullback-Leibler.
LSTM Long short term memory.
MA Multi-agent.
MADRL Multi-agent deep reinforcement learning.
MA-MADDPG Meta-agent MADDPG.
MADDPG Multi-agent DDPG.
MARL Multi-agent reinforcement learning.
MD-MADDPG Memory-driven MADDPG.
MDP Markov decision process.
NN Neural network.
PC Principal component.
PCA Principal component analysis.
PG Policy gradient.
PO Partial observability.
PPO Proximal policy optimization.
RL Reinforcement learning.
RNN Recurrent neural network.
TRPO Trust region policy optimisation.
UAS Unmanned aerial system.
VDN Value decomposition network.
Chapter 1
Introduction
Reinforcement Learning (RL) allows agents to learn how to map observations to actions through feedback reward signals [157]. Recently, deep neural networks (DNNs) [89, 141] have had a noticeable impact on RL [94]. They provide flexible models for learning value functions and policies, overcome difficulties related to large state spaces, and eliminate the need for hand-crafted features and ad-hoc heuristics [29, 121, 122]. Deep reinforcement learning (DRL) algorithms, which usually rely on deep neural networks to approximate functions, have been successfully employed in single-agent systems, including video game playing [111], robot locomotion [97], object localisation [18] and data-center cooling [38]. Following the uptake of DRL in single-agent domains, there is now a need to develop improved learning algorithms for multi-agent (MA) systems where additional challenges arise. Multi-agent reinforcement learning (MARL) extends RL to problems characterized by the interplay of multiple agents operating in a shared environment. This is a scenario that is typical of many real-world applications including robot navigation [162], autonomous vehicles coordination [15], traffic management [36], and supply chain management [90]. Compared to single-agent systems, MARL presents additional layers of complexity. Early approaches started exploring how deep reinforcement learning techniques can be utilised in multi-agent settings [23, 53, 155], where a need emerged for novel techniques specifically designed to tackle MA challenges.
Markov Decision Processes (MDPs), upon which DRL methods rely, assume that the reward distribution and dynamics are stationary [58]. When multiple learners interact with each other, this property is violated because the reward that an agent receives also depends on other agents' actions [86]. This issue, known as the moving-target problem [166], removes convergence guarantees and introduces additional learning instabilities. Further difficulties arise from environments characterized by partial observability [23, 128, 151], whereby the agents do not have full access to the world state, and where coordination skills are essential.
An important challenge in multi-agent deep reinforcement learning (MADRL) is how to facilitate communication among interacting agents. Communication is widely known to play a critical role in promoting coordination between humans [159]. Humans have been proven to excel at communicating even in the absence of a conventional code [32]. When coordination is required and no common languages exist, simple communication protocols are likely to emerge [144]. Human communication involves more than sending and receiving messages; it requires specialized interactive intelligence, where receivers have the ability to recognize intentions and senders can properly design messages [178]. The emergence of communication has been widely investigated [47, 163]; for example, new signs and symbols can emerge when it comes to representing real concepts. Fusaroli et al. [46] demonstrated that language can be seen as a social coordination device learnt through reciprocal interaction with the environment for optimizing coordinative dynamics. The relation between communication and coordination has been widely discussed [34, 71, 109, 170]. Communication is an essential skill in many tasks. For instance, emergency response organizations must establish a clear way of communicating in order to properly manage critical and urgent situations; this is achieved by sharing information amongst the different agents involved and is usually accomplished through years of common training [28]. In multiplayer video games, it is often essential to reach the sufficiently high level of coordination required to succeed, which is often acquired via communicating [20]. We believe that communication is a promising tool that needs to be exploited by MADRL models in order to enhance their performance in multi-agent environments. When this research was started, we noticed a lack of methods to enable inter-agent communication, so we decided to explore this area to contribute to filling a gap that had the potential for improving the collaboration process in a MA system.
1.1 Research objectives
The aim of this research is to explore novel communication models to enhance the performance of existing MARL methods. In particular, we focus on cooperative scenarios, which is where communication is needed the most by the agents in order to properly succeed and complete the assigned tasks. We investigate different approaches to achieving effective ways of communicating to boost the level of cooperation in multi-agent settings. The resulting communication protocols are learned end-to-end so that, at training time, they can be adapted by the agents to overcome the difficulties posed by the underlying environmental configuration. In addition, we also aim to analyse the content of the learned communication protocols.
1.2 Contributions
The main contributions made in this thesis are summarised as follows:
• in Chapter 3, we propose a novel multi-agent approach where inter-agent communication is obtained by providing a centralised shared memory that each agent has to learn to use in order to read and write messages for the others in sequential order;
• in Chapter 4, we discuss a novel multi-agent model that first constructs a graph of connectivities to encode pair-wise messages, which are then used to generate an agent-specific set of encodings through a proposed attention mechanism that utilises a diffusion model such as the heat kernel (HK); a minimal numerical sketch of this operator is given after this list;
• in Chapter 5, we propose an environment to simulate drone behaviours in realistic settings and present a range of experiments in order to evaluate the performance of several state-of-the-art methods in such scenarios.
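To give a concrete sense of the diffusion model mentioned above, the sketch below computes the heat kernel K(t) = exp(-tL) of a small toy communication graph; the adjacency matrix, the diffusion time t and the row-normalisation step are illustrative assumptions, not the CDC architecture described in Chapter 4.

```python
# Minimal sketch of a graph heat kernel, the diffusion model referenced above.
# The toy adjacency matrix, diffusion time `t` and normalisation step are
# illustrative assumptions only.
import numpy as np
from scipy.linalg import expm

# Toy communication graph over 4 agents (symmetric adjacency matrix).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

D = np.diag(A.sum(axis=1))   # degree matrix
L = D - A                    # combinatorial graph Laplacian
t = 0.5                      # diffusion time (hypothetical value)

K = expm(-t * L)             # heat kernel: K[i, j] measures how much
                             # information diffuses from node j to node i
print(np.round(K, 3))

# Row-normalising K gives attention-like weights an agent could use to
# aggregate the pairwise messages received from its neighbours.
weights = K / K.sum(axis=1, keepdims=True)
print(np.round(weights, 3))
```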
1.3 Outline
This section provides an outline of this thesis. The rest of this document is structured as follows. Chapter 2 reviews the existing MADRL models that relate to this work, with a special focus on cooperative algorithms. Chapter 3 introduces the first research contribution, which proposes a novel form of communication based on a shared memory cell. Chapter 4 presents the second research contribution, in which a graph-based architecture is exploited by a diffusion model to generate agent-specific messages. Chapter 5 proposes a novel environment to simulate a realistic scenario of drone navigation and discusses an extensive comparison of several state-of-the-art MADRL models. Chapter 6 concludes this work with a discussion of the results obtained and recommendations for future work.
Chapter 2
Literature Review
In this chapter, we introduce the RL setting and review the existing works related to multi-agent reinforcement learning. In Section 2.1, we discuss significant milestones in single-agent reinforcement learning to establish the foundational knowledge behind the basic learning techniques that are later extended or utilized. Section 2.2 presents deep learning extensions of the previously mentioned approaches, serving as a connection between single-agent and multi-agent methodologies. Moving on to Section 2.3, we focus on how these approaches have been expanded to operate in multi-agent scenarios, with a particular emphasis on the training phases that are commonly employed in state-of-the-art works. We then categorize the multi-agent literature into the following groups:
• Cooperative methods (Section 2.4): works that concentrate on achieving cooperation between agents;
• Emergence of communication (Section 2.5): works that investigate how autonomous agents can learn languages;
• Communication methods (Section 2.6): works where agents must learn to communicate to enhance system performance.
In this review, we intentionally omitted specific research areas such as traditional game theory approaches [120, 123, 145], microgrid systems [27, 70, 72], and programming for parallel executions of agents [24, 45, 136]. Our primary focus was on multi-agent works based on reinforcement learning approaches, with a particular emphasis on communication methodologies.
Some of the methods mentioned in this review have also been chosen as baselines for the experiments presented in the subsequent chapters, particularly in Chapter 5, which plays a crucial role as it serves as a practical context for the multi-agent approaches proposed in Chapters 3 and 4. By introducing a specifically designed environment to simulate drone behaviours in realistic settings, Chapter 5 indeed provides a practical platform for evaluating the performance of state-of-the-art MADRL models that employ the different communication and coordination methods discussed in this chapter.
2.1 Reinforcement learning
Reinforcement learning methods formalise the interaction of an agent (or actor) with its environment using a Markov decision process [129]. An MDP is defined as a tuple 〈S, A, R, T, γ〉, where S is the set that contains all the states of a given environment, A is a set of finite actions that can be selected by the agent, and the reward function R : S × A → ℝ defines the reward received by an agent when executing the action a ∈ A while being in a state s ∈ S. A transition function T : S × A → S describes how the environment determines the next state when starting from a state s ∈ S and given an action a ∈ A. The discount factor γ balances the trade-off between current and future rewards. As represented in Figure 2.1, an agent interacts with the environment by producing an action given the current state and receiving a reward in return.

Figure 2.1: A reinforcement learning setting. The environment provides an observation while the agent produces an action and receives a reward in return.

MDPs are suitable models to take decisions in fully observable environments, where a complete description of all their elements is available to the agents and can be exploited by techniques such as the value iteration algorithm [9], which iteratively computes a value function that estimates the potential reward of each state. A state-action value is instead calculated when the potential reward is estimated using both the state and the action. When an MDP is solved, a stochastic policy π : S × A → [0, 1] is obtained to map states into actions. RL algorithms often make use of the agents' past experience of interacting with the environment. A well-known algorithm is Q-learning [176], a tabular approach that keeps track of the Q-functions Q(s, a), which estimate the discounted sum of future rewards for a given state-action pair. Every time the agent moves from a state s into a state s' by taking an action a and receiving a reward r, the respective tabular entry is updated as follows:

$$Q(s,a) = Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right] \qquad (2.1)$$

where α ∈ [0, 1] is the learning rate.
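To make the update in Eq. 2.1 concrete, the following minimal sketch applies it to a toy two-state MDP; the environment, the random exploration scheme and the hyperparameter values are illustrative assumptions rather than part of any experiment in this thesis.

```python
# Minimal sketch of the tabular Q-learning update of Eq. 2.1.
# The toy 2-state, 2-action MDP and the hyperparameters are illustrative
# assumptions chosen only to show the update rule in action.
import numpy as np

n_states, n_actions = 2, 2
Q = np.zeros((n_states, n_actions))     # tabular Q-function
alpha, gamma = 0.1, 0.9                 # learning rate and discount factor

# transition[s][a] = (next_state, reward) for the toy environment
transition = {0: {0: (0, 0.0), 1: (1, 1.0)},
              1: {0: (0, 0.0), 1: (1, 1.0)}}

rng = np.random.default_rng(0)
s = 0
for _ in range(1000):
    a = int(rng.integers(n_actions))                 # exploratory action
    s_next, r = transition[s][a]
    # Eq. 2.1: Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

print(np.round(Q, 2))   # action 1, which yields the reward, gets the higher values
```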
Policy gradient (PG) methods [157] represent an alternative approach to Q-learning, where the parameters θ of the policy are directly adjusted to maximise an objective function J(θ), defined as the γ-discounted sum of rewards, by taking steps in the direction of its gradient, where t ∈ {1, ..., T} is the time-step of the environment. Such gradient is calculated as follows:

$$\nabla_\theta J(\theta) = \mathbb{E}_{a \sim \pi_\theta}\left[ \nabla_\theta \log \pi_\theta(a|s) \, Q(s,a) \right] \qquad (2.2)$$
The REINFORCE algorithm [179] utilises Eq. 2.2 in conjunction with a Monte Carlo estimation over fully sampled trajectories to learn the policy parameters in the following way:

$$\theta \leftarrow \theta + \alpha \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \, G_t, \qquad G_t = \sum_{k=t}^{T} \gamma^{\,k-t} r_k \qquad (2.3)$$
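The following minimal sketch shows how such a Monte Carlo policy gradient is estimated from a single sampled trajectory; the linear softmax policy, the synthetic trajectory and the learning rate are illustrative assumptions, not a reference implementation.

```python
# Minimal sketch of a REINFORCE-style update: the Monte Carlo estimate of
# Eq. 2.2 computed over one sampled trajectory. The linear softmax policy,
# the toy trajectory and the learning rate are illustrative assumptions.
import numpy as np

n_actions, obs_dim = 3, 4
theta = np.zeros((n_actions, obs_dim))        # policy parameters
alpha, gamma = 0.01, 0.99

def policy(theta, s):
    logits = theta @ s
    p = np.exp(logits - logits.max())
    return p / p.sum()

# One sampled trajectory of (state, action, reward) tuples (assumed given).
rng = np.random.default_rng(0)
trajectory = [(rng.normal(size=obs_dim), int(rng.integers(n_actions)), float(rng.normal()))
              for _ in range(5)]

# Returns G_t: gamma-discounted sum of rewards from time-step t onwards.
returns, G = [], 0.0
for _, _, r in reversed(trajectory):
    G = r + gamma * G
    returns.append(G)
returns.reverse()

# Accumulate grad log pi(a_t|s_t) * G_t and take one gradient ascent step.
grad = np.zeros_like(theta)
for (s, a, _), G_t in zip(trajectory, returns):
    p = policy(theta, s)
    grad_logits = -p[:, None] * s[None, :]    # d log pi / d theta for all actions
    grad_logits[a] += s                        # extra term for the taken action
    grad += grad_logits * G_t
theta += alpha * grad
```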
Policy gradient algorithms are renowned to suffer from high variance, which can significantly slow the learning process [79]. This issue is often mitigated by adding a baseline, such as the average reward or the state value function, that aims to correct the high variation at training time. Actor-critic methods [79] are composed of an actor module that selects the actions to take and a critic that provides the feedback necessary for the learning process. When the critic is able to learn both the state-action and the value functions, an advantage function can be calculated as the difference between these two estimates. A popular actor-critic algorithm is the Deterministic Policy Gradient (DPG) [149], in which the actor is updated through the gradient of the policy, while the critic utilises the standard Q-learning approach. In DPG the policy is assumed to be a deterministic function μ_θ : S → A and the gradient that maximises the objective function can be written as:

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \mathcal{D}}\left[ \nabla_\theta \mu_\theta(s) \, \nabla_a Q(s,a)\big|_{a=\mu_\theta(s)} \right] \qquad (2.4)$$

where D is an experience replay (ER) buffer that stores the historical transitions, and μ_θ and Q(s,a) represent the actor and the critic, respectively.
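The following minimal sketch spells out the chain rule in Eq. 2.4 for one sampled state, using a linear actor and a fixed quadratic critic as illustrative assumptions; in practice both would be neural networks trained jointly, as in DDPG, with states drawn from the replay buffer D.

```python
# Minimal sketch of the deterministic policy gradient of Eq. 2.4 for a single
# sampled state. The linear actor, the fixed quadratic critic and the step
# size are illustrative assumptions only.
import numpy as np

obs_dim, act_dim = 3, 2
rng = np.random.default_rng(0)
theta = rng.normal(scale=0.1, size=(act_dim, obs_dim))   # actor parameters
a_star = np.array([0.5, -0.2])                            # critic's preferred action

def actor(theta, s):
    return theta @ s                     # deterministic policy mu_theta(s)

def critic_grad_a(s, a):
    # Assumed critic Q(s, a) = -||a - a_star||^2, hence dQ/da = -2 (a - a_star).
    return -2.0 * (a - a_star)

s = rng.normal(size=obs_dim)             # a state sampled from the replay buffer
a = actor(theta, s)
# Chain rule of Eq. 2.4: dJ/dtheta_ij = (dQ/da_i at a = mu(s)) * s_j,
# because d mu_i / d theta_ij = s_j for a linear actor.
grad_theta = np.outer(critic_grad_a(s, a), s)
theta += 1e-2 * grad_theta               # one gradient ascent step on J(theta)
```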
2.2 Deep reinforcement learning
Deep learning techniques [89] have widely been adopted to overcome the major limitations of traditional reinforcement learning algorithms, such as learning in environments with large state spaces or having to provide hand-specified features [158]. Deep neural networks (DNNs) as function approximators have indeed made it possible to approximate value functions and agents' policies [12]. In DQN [110], the Q-learning framework is extended with DNNs, which approximate the Q-values from the state provided by the environment, while still keeping the historical experience in an experience replay buffer which is used to sample data at training time. DQN learns to approximate t