[0]Abstractabstract
Visualprocessingincontextof
reinforcementlearningHlynurDavíðHlynssonAthesispresentedforthedegreeof
DoctorofEngineeringFacultyofElectricalEngineeringandInformationTechnology
Ruhr-UniversityBochum
Germany
July2021
Visualprocessingincontextofreinforcementlearning
DissertationforthedegreeofDoctorofEngineeringoftheFacultyofElectricalEngineeringandInformationTechnologyattheRuhr-UniversitätBochum
Nameoftheauthor: | HlynurDavíðHlynsson |
---|---|
Placeofbirth: | Reykjavík |
Firstsupervisor: | Prof.Dr.LaurenzWiskott |
Ruhr-UniversitätBochum,Germany | |
Secondsupervisor: | Prof.Dr.TobiasGlasmachers |
Ruhr-UniversitätBochum,Germany | |
Yearofthesissubmission: | 2021 |
Dateoftheoralexamination: | March17th2022 |
Abstract
Althoughdeepreinforcementlearning(RL)hasrecentlyenjoyedmanysuccesses,itsmethodsarestilldatainefficient,whichmakessolvingnumerousproblemsprohibitivelyexpensiveintermsofdata.Weaimtoremedythisbytakingadvantageoftherichsupervisorysignalinunlabeleddataforlearningstaterepresentations.ThisthesisintroducesthreedifferentrepresentationlearningalgorithmsthathaveaccesstodifferentsubsetsofthedatasourcesthattraditionalRLalgorithmsuse:(i)GRICAisinspiredbyindependentcomponentanalysis(ICA)andtrainsadeepneuralnetworktooutputstatisticallyindependentfeaturesoftheinput.GrICAdoessobyminimizingthemutualinformationbetweeneachfeatureandtheotherfeatures.Additionally,GrICAonlyrequiresanunsortedcollectionofenvironmentstates.(ii)LatentRepresentationPrediction(LARP)requiresmorecontext:inadditiontorequiringastateasaninput,italsoneedsthepreviousstateandanactionthatconnectsthem.Thismethodlearnsstaterepresentationsbypredictingtherepresentationoftheenvironment’snextstategivenacurrentstateandaction.Thepredictorisusedwithagraphsearchalgorithm.(iii)RewPredlearnsastaterepresentationbytrainingadeepneuralnetworktolearnasmoothedversionoftherewardfunction.TherepresentationisusedforpreprocessinginputstodeepRL,whiletherewardpredictorisusedforrewardshaping.Thismethodneedsonlystate-rewardpairsfromtheenvironmentforlearningtherepresentation.Wediscoverthateverymethodhastheirstrengthsandweaknesses,andconcludefromourexperimentsthatincludingunsupervisedrepresentationlearninginRLproblem-solvingpipelinescanspeeduplearning.
[0]KurzfassungderDissertationkurzfassung
KurzfassungderDissertation
ObwohltiefesVerstärkungslernen(VL)indenletztenJahrengroßeErfolgeerzielthat,sinddessenMethodenimmernochdatenineffizient,wasdieLösungvielerProblemeunerschwinglichmacht.WiruntersuchendieMöglichkeit,dieszubeheben,indemwirdasinformationsreicheÜberwachungssignalinnichtgekennzeichneteDatenfürdieDarstellungvonLernzuständennutzen.IndieserArbeitwerdendreiverschiedeneRepräsentationslernalgorithmenvorgestellt,dieZugriffaufverschiedeneTeilmengenderDatenquellenhaben,dieherkömmlicheVL-AlgorithmenzumLernenverwenden:(i)GrICAistvonderunabhängigenKomponentenanalyse(ICA)inspiriertundtrainierteintiefesneuronalesNetzwerk,umstatistischunabhängigeKomponentenderEingabeauszugeben.GrICAminimiertdiegemeinsamenInformationenvoneinzelnenMerkmalenmitdenjeweilsanderenMerkmalen.ZusätzlicherfordertGrICAlediglicheineunsortierteSammlungvonUmgebungszuständen.(ii)LatentRepresentationPrediction(LARP)erfordertmehrKontextdaten:AlsEingabebenötigtsiezusätzlichzueinemZustandauchdenentsprechendenvorherigenZustandundeineHandlung,welchedieseverbindet.DieMethodelerntZustandsdarstellungen,indemsiedieDarstellungdesnächstenZustandsderUmgebungmithilfeeinesaktuellenZustandsundeineraktuellenAktionvorhersagt.DerPrädiktorwirdzusammenmiteinemGraphensuchalgorithmusverwendet.(iii)RewPredlerntdieZustandsdarstellung,indemeintiefesneuronalesNetzwerktrainiertwirdeinegeglätteteVersionderBelohnungsfunktionzulernen.DieDarstellungwirdzurVorverarbeitungvonEingabenimtiefenVLverwendet,währendderBelohnungsprädiktoralsBelohnungsformungdient.DieseMethodebenötigteinzigStatus-Belohnungs-PaareausderUmgebung,umdieDarstellungzulernen.Wirstellenfest,dassjedeMethodeihreStärkenundSchwächenhat,undschließenausunserenExperimenten,dassdasEinbeziehenvonunbeaufsichtigtemRepräsentationslerneninVL-ProblemlösungspipelinesdasLernenbeschleunigenkann.\pdfbookmark[0]Dedicationdedication
\pdfbookmark[0]Acknowledgementsacknowledgements
Acknowledgements
IwanttofirstthankmysupervisorProf.LaurenzWiskottforgivingmetheopportunitytoresearchthenichesofmachinelearningthatIfindinteresting.Hisinsightfulfeedbackandoutstandingenthusiasmandintuitionforthefieldprovedinvaluabletomeandothersinthisfastgrowingareaofresearch.Theadviceofmysecondsupervisor,TobiasGlasmachers,alsoprovedextremelyhelpful,especiallyonthetopicofreinforcementlearning.I’mgratefulfortheendlessloveandsupportfrommypartnerLisaSchmitz,whomadethetimeofmyPhDstudiesthebestinmylife–sofar.Myspecialthanksgoalsotoherparents,RosemarieSchmitzandGeorgSchmitz,fortheirsupportduringthistime.I’mthankfulforthehelpfulandpleasantenvironmentcreatedbytheotherPhDstudentsoftheInstitutfürNeuroinformatik(INI):MerlinSchüler,RobinSchiewer,ZahraFayyaz,EddieSeabrook,MortizLange,FrederickBaucks,JanBollenbacherandJanTekülve.IwouldalsoliketothanktwoalumnioftheINI,AlbertoEscalanteandFabianSchönfeld,forthesupporttheyhavegivenme.IalsowanttothanktheINIstaffoutsidethegroupfortheirhelpovertheyears:ArnoBerg,AngelikaWilleandKathleenSchmidt.Lastbutnotleast,IwanttothankmylovingfamilyforcreatingthecircumstancesthatgavemeroomtotraintheskillsthatIneededtopursueaPhDtobeginwith:ÓlöfIngibjörgEinarsdóttir,HlynurHöskuldsson,ÓlafurHlynssonandHöskuldurHlynsson.
Contents
- 1 Introduction
- 2 Background
- 3 Learninggradient-basedICAbyneurallyestimatingmutualinformation
- 4 Latentrepresentationpredictionnetworks
- 5 Rewardpredictionforrepresentationlearningandrewardshaping
- 6 Comparisonofourthreemethods
- 7 Summaryandconclusion
- A
List of Figures
- 2.1 Illustration:Afully-connectedneuralnetwork
- 2.2 Illustration:Aconvolutionlayer
- (a) Inputwith(heightwidthdepth)dimensionsof().
- (b) Inputwith(heightwidthdepth)dimensionsof().
- 2.3 Example:DimensionalityreductionbyPCAandt-SNE
- 2.4 Illustration:Anautoencoder
- 3.1 Example:Thelavafieldenvironment
- 3.2 Illustration:Ourindependentfeaturelearningsystem
- 3.3 Result:Noisysignalrecovery
- (a) Theoriginalsources.
- (b) Linearmixtureofsources.
- (c) Sourcesrecoveredbyourmethod.
- (d) SourcesrecoveredbyFastICA.
- 3.4 Result:Rewardduringtrainingonthelavafieldenvironment
- 3.5 Result:Trajectoriesinthelavafieldenvironment
- 3.6 Result:Trajectorieswithshiftedlavafields
- 3.7 Result:Reconstructionbyautoencoder
- 3.8 Result:Rewardduringtrainingonthelavafieldenvironment(convolutionalautoencoder)
- 4.1 Illustration:Latentrepresentationpredictionnetwork
- 4.2 Illustration:Predictiverepresentationlearningwithspheringregularization
- 4.3 Illustration:Predictiverepresentationlearningwithcontrastivelossregularization
- 4.4 Illustration:Predictiverepresentationlearningwithdecoderlossregularization
- 4.5 Result:LaplacianEigenmaprepresentationspaceofaNORBtoy
- (a) *
- (b) *
- (c) *
- (d) *
- (e) *
- (f) *
- 4.6 Result:HeatmapofLaplacianeigenmaplatentspacesimilarity
- 4.7 Result:HeatmapofVGG16latentspacesimilarity
- 4.8 Result:AggregateheatmapsofVGG16representationsimilaritiesontestdata
- 4.9 Result:Histogramsofelevation-wiseandazimuth-wiseVGG16errors
- 4.10 Result:LARPreinforcementlearningcomparison
- 4.11 Result:LARPre-trainingafterplacingobstaclesinacheckerboardpattern
- 4.12 Example:Toysfortransferlearningexperiments
- 4.13 Result:LARPLatentspacevisualization
- (a) LARPelevationt-SNE
- (b) CAEelevationt-SNE
- (c) VGG16elevationt-SNE
- (d) LARPazimutht-SNE
- (e) CAEazimutht-SNE
- (f) VGG16azimutht-SNE
- (g) LARPlightingt-SNE
- (h) CAElightingt-SNE
- (i) VGG16lightingt-SNE
- 5.1 Illustration:Reward-maximizingvs.reward-predictiverepresentations
- 5.2 Illustration:Learningandusingtherepresentation
- 5.3 Example:Thetwo-roomenvironment
- (a) Fullworldstates.
- (b) Agent’spointofview.
- (c) Goalobservations.
- 5.4 Example:Thelavagapenvironment
- 5.5 Example:Four-roomenvironment
- 5.6 Result:Predictedrewards,two-roomenvironment
- (a) Rawrewardprediction.
- (b) Smoothedrewardprediction.
- 5.7 Result:Two-roomenvironment
- 5.8 Result:Predictedrewards,lavagap
- 5.9 Result:Lavagapexperiment
- 5.10 Result:Re-learningexperiment
- 5.11 Result:Lavagaptrajectories
- (a) Sixsuccessfulepisodes.
- (b) Sixfailedepisodes.
- 5.12 Result:Fullfour-roomenvironment
- 5.13 Result:Four-roomtrajectories
- (a) Threesuccessfultrajectories
- (b) Threefailedtrajectories
- 6.1 Example:Visualcart-pole
- 6.2 Example:Obstacleavoidanceenvironment
- 6.3 Results:Visualcart-pole
- 6.4 Results:Gridworldcomparison,singlegoallocation
- 6.5 Results:Gridworldcomparison,four-roomgoal-finding
- 6.6 Results:Obstacleavoidance
- 7.1 Illustration:SupervisionsourceVenndiagram
List of Tables
- 1.1 Result:Inputdatatypepermethod
- 3.1 Result:OurconvolutionalICAnetwork
- 3.2 Result:Convolutionalautoencodernetworkarchitecture
- 4.1 Result:LARPrepresentationnetworkarchitecture
- 4.2 Result:LARPregularizingdecoderarchitecture
- 4.3 Result:LARPrepresentationpredictorarchitecture
- 4.4 Result:Ablationstudyoftherepresentationdimensionality
- 4.5 Result:LARPtransferlearningperformance
- 5.1 Result:Representationnetwork
- 5.2 Result:Policynetwork
- 5.3 Result:Rewardpredictionnetwork
Chapter 1 Introduction
Mankindhasbeeninterestedintheconceptofinfusinginanimateobjectswithitsintellectforthousandsofyears,withstoriesofartificiallyintelligentbeingsreachingbackthousandsofyears.ThisinterestmanifestsitselfforexampleinthestorytoldbyancientGreeksofthegreatautomataTalos,whowascraftedoutofbronzebythesmithinggodHephaestustoprotectthemythologicalqueenEuropa(rhodios2008argonautika).Automatahavebeenbuiltbycraftsmenfromdifferentculturesthroughouttheages,buttheyhavebeensimplemechanicalbeings(mccorduck1979machines).Thepossibilityofsatisfyingthehumandesiretocraftintelligentbeingshasonlyariseninthemiddleofthe19thcenturywiththefoundingofartificialintelligence(AI)asanacademicdiscipline.ThevastscopeofAIhasgivenrisetodifferentfields,eachwithitsownapplicationdomains,methodologiesandphilosophies.Onesuchexampleismachinelearning(ML),anareaofAIthatisconcernedwithalgorithmsthatleveragedatafordecision-making.Thisfieldhasexperiencedgreatsuccessinbothacademiaandindustryinthelastdecadewithincreasedaccesstopowerfulcomputersandlargedatabasesinadditiontoadelugeofadvancedcomputationaltechniquesandflexiblesoftwaresolutions(clark20152015).Thefieldisnotwithoutitsdrawbacks,however.Machinelearningalgorithmsneeddatacorrespondingtomonthsoryearsofhumanexperiencetogetcompetenceintasksthatapersoncanmasterinminutes.Peoplehavetheadvantagethattheycomeequippedwithastrongerunderstandingoftheworld.Inthisdissertation,weaimtoleveltheplayingfieldforMLmodelsthatperformsequentialdecision-makingbyexploringdifferentwaysforthemto"understand"theworldthroughrepresentations.InSection1.1,wediscussoneofthemostpromisingfieldsofartificialintelligence,deepreinforcementlearning,andmentionitsvictories.ThecurrentsituationisevaluatedinSection1.2wherewedescribethechallengesanddisadvantagesofthefield.ThevalueoftheresearchdirectionweputforwardisunderlinedinSection1.3aswellasourresearchobjectiveandhypothesis.ThechapterconcludeswithSection1.4whereweoutlinetheorderofcontentinthedissertation.
1.1 Deepreinforcementlearning
Awatershedmomentforartificialintelligencehappenedwhenkrizhevsky2012imagenetcombinedseveraltechniquesfromtheliteratureandconstructedadeep111Neuralnetworksthatprocesstheinputhierarchicallyusingatleastmorethantwolayersofcomputationallayersarecalleddeepneuralnetworktooutperformthecompetitionbyasignificantmargininanimageclassificationcontest.Thiswasthecatalystoftheso-calleddeeplearningrevolution(sejnowski2018deep)whichhasimpactedfieldssuchasnaturallanguageprocessing(wolf2020transformers),bioinformatics(li2019deep),computervision(khan2018guide),frauddetectionandmanyothers(alom2019state).Theareaofmachinelearningconcernedwiththetrainingofdeepneuralnetworksiscalleddeeplearning.Deeplearningmethodshavetheadvantagethattheyautonomouslylearnpatternsinthedatainahierarchicalmanner.Forexample,edgesareusefulpatternsforpicturesofshapessuchassquares,trianglesandcircles(patrick2010ai).Theedgescanbecombinedtocornersandthenumberofcornerscanbecountedfordistinguishingbetweenthedifferentshapes.Sincedeeplearningencompassesabroadsetofmachinelearningalgorithms,itcanbereadilycombinedwithotherareasofmachinelearning.Onesuchareaisreinforcementlearning,wheregeneralgoal-directeddecision-makingproblemsarestudied.Reinforcementlearning(RL)methodstrainmodelsthatareinteractingsequentiallywiththeirenvironmentstomaximizearewardsignal.DeeplearningisfrequentlycombinedwithRLtechniques,allowingthemodelstomaptheinputs,suchashigh-dimensionalimagedata,directlytoactions.Thiscombinationhasyieldedpromisingresultsindifferentareas,rangingfromrecommendersystems(zhang2019deep)overautonomousdriving(kiran2021deep)toplayinggames(silver2017mastering).
1.2 Openproblems
Aknownproblemofdeepneuralnetworksisthattheyrequirealargeamountofdataforadequateperformance.Thisdatacanbeprohibitivelyexpensivetoobtain–eitherintermsoftimeneededtocreatedataandtrainthemodelsforreinforcementlearningmethodsormonetarycostofacquiringhuman-labeledtrainingdataforsupervisedlearningmethods–whichhasencouragedthedevelopmentofmethodsthatlearnrepresentationsfromstreamsofmorereadilyavailable,unlabeleddata.Thisproblemincreasesinseverityindeepreinforcementlearning(DRL).Intheuncompromisinglytitledblogpost,Deepreinforcementlearningdoesn’tworkyet,irpan2018deepidentifiesseveralfundamentalproblemsofDRL.Oneofthemistheproblemofsampleinefficiency,wheremanyhighlypublicizedstate-of-the-artresultsonvideogamesrequirehundredsofmillionsofframesofexperiencetoachieveperformancethathumansreachinamatterofminutes.Anotherproblemistheoneofinstability.Deepneuralnetworksarehighlyexpressiveandoptimizelargenumbersofparameters.ThismakesthedesignofDRLmodelsdifficult,asthesearchofhyperparameters222Ahyperparameterisbroadlyspeakinganydesignchoicemadebytheprogrammerbeforethelearningofthe"regular"parametersstarts.thatsolvetheproblemcanbequitetime-consuming.Evenwhenapromisingsetofhyperparametersisfound,thedifferencebetweentheperformanceofdifferentmodelslearnedfromscratchcanbesignificant,dependingontherandomseed.ThisincreasedvariancecomesfromthenewsourceofrandomnessthatisintroducedtoRLmodels,comparedtoregularregressionlearning:theagentsactionsarestochastic,increasinglysointhebeginningoflearning333Thereisatradeoffbetweenexploringtheenvironmentandexploitingtheexpectedrewardsignal.AcommonstrategyforRLagentsistostartthelearningwithahighchanceofperformingrandomactionstoexploredifferentstatesoftheenvironmentandthendecreasethischanceasthelearningprogresses..
1.3 Researchaim
Inthisdissertation,weproposemethodsforunsupervisedandself-supervisedlearningofrepresentationsforgoal-directedbehavior.Self-supervisedlearningmethodsuseasubsetoftheinputtopredicttherestofit,foregoingtheneedofannotationswhiletakingadvantageofthepowerfulmachineryofsupervisedlearningmethods.Tacklingtheopenproblemsofdatainefficiencyandinstabilityoutlinedaboveinordertofurtherthefieldisourintentionwiththisthesis.Wedosobydevelopingandinvestigatingthreedifferentapproaches:(i)unsupervisedlearningofarepresentationforRLagents,(ii)amethodofjointlylearningapredictorforplanningarepresentationthatisgoodforthetransitionprediction,and(iii)learningarepresentationforRLagentsasthebyproductofrewardprediction.WerelatethedataneededtolearntherepresentationsforourmethodstotheavailabledatainthecontextofRLinTable1.1.Ourhypothesisisthatsuitablestaterepresentationsthatreducethecomplexityofhigh-dimensionalinputsinRLsettingscansupportamorestableanddataefficientlearningthanhavingdeepRLalgorithmslearnstaterepresentationsfromscratch.
Subsetofrequiredforlearning | Method | Chapter |
---|---|---|
GrICA | 3 | |
LARP | 4 | |
RewPred | 5 |
1.4 Thesisoutline
Hereweoutlinethestructureofthethesis.Threeofthechaptersareadaptedfromtheworkthatwerepublishedoverthecourseofthedoctoralwork.\nobibliography*
-
Chapter2:Background.Inthischapter,wegoinfurtherdetailsonthemaintopicsinthisthesisanddiscusstheirfundamentals.WeintroducetheformalismofMarkovdecisionprocessesandexplainthedifferencebetweenmodel-basedandmodel-freereinforcementlearningalgorithms.Themachinerybehinddeeplearningisthenexplainedandthemainbuildingblocksofdeepneuralnetworksareillustrated.Thechapterconcludeswithadiscussionofthemainrepresentationlearningmethods.
-
Chapter3:Learninggradient-basedICAbyneurallyestimatingmutualinformation.Thischapterdiscussesanadaptionofindependentcomponentanalysis(ICA)forDL.Weintroduceanovelapplicationofaneuralmethodformutualinformationestimationtolearnarepresentationwithstatisticallyindependentfeatures.Thechapterisanadaptedversionof
-
\bibentry
hlynsson2019learning(hlynsson2019learning)
-
\bibentry
-
Chapter4:Latentrepresentationpredictionnetworks.Thischapterdiscussesamethodformanipulableenvironmentsforjointlylearningarepresentationofobservationsandamodelforpredictingthenextrepresentation,givenanaction.Welearntherepresentationinaself-supervisedmanner,withouttheneedofarewardsignal.Weintroduceanewenvironmentthatisakintomanipulatingtoyobjectsforaviewpointmatchingtask.Therepresentationiscombinedwithagraph-searchalgorithmtofindthegoalviewpoint.Thechapterisanadaptedversionof
-
\bibentry
hlynsson2020latent(hlynsson2020latent)
-
\bibentry
-
Chapter5:Rewardpredictionforrepresentationlearningandrewardshaping.Thischapterdiscussesaself-supervisedlearningmethodtomaphigh-dimensionalinputstoalowerdimensionalspaceforRLagents.Weintroduceatechniquewherearepresentationlearnedforarewardpredictorisusedtoshapetherewardfortheagents.Thechapterisanadaptedversionof
-
\bibentry
hlynsson2021reward(hlynsson2021reward)
-
\bibentry
-
Chapter6:Comparisonofourmethods.Inthischapter,wedirectlycomparethethreedifferentmethodstostate-of-the-artdeepRLmethodsonfourdifferentenvironments:avisualpole-balancingenvironment,twogoal-findingenvironmentandanobstacleavoidanceenvironment
-
Chapter7:Summaryandconclusion.Thischapterclosesthedissertationwithabriefsummaryofthethesis,concludingremarksandpossiblefuturework.Thefollowingworkwasalsopublishedoverthecourseofthedoctoralstudies:
-
\bibentry
hlynsson2019measuring(hlynsson2019measuring)
Thepapercomparessupervisedlearningmethods,butitistoodissimilarintopicfromtherestoftheworkandisthuschosentobeomittedfromthisdissertation.
-
\bibentry
Chapter 2 Background
Thischapterlaysoutthefundamentalconceptsofmachinelearningthatformsthefocalpointoftherestofthedissertation.InSection2.1,welayoutthemainobjectofstudyinreinforcementlearning(RL),partiallyobservableMarkovdecisionprocesses,andpresentabrieftaxonomyofreinforcementlearningalgorithms.WeexplainthebasicsofartificialneuralnetworksanddeeplearninginSection2.2.Themostcommonlyusedtypeofneuralnetworkusedforprocessingvisualdata,theconvolutionalneuralnetwork,isthendescribed,alongwithsomeofitsmainbuildingblocks.InSection2.3,wediscussrepresentationlearning(alsoknownasfeaturelearning)andmotivateitinthecontextofreinforcementlearning.Foramorein-depthdiscussionofthesetopics,wereferthereadertothecomprehensivetextbookonRLbysutton2018reinforcement,thedeeplearningbookbyGoodfellow-et-al-2016andtheexcellentsurveybybengio2013representationonrepresentationlearning.
2.1 Reinforcementlearning
Inthissection,weformalizeRLfortherestofthethesis.RLisoneofthemaindisciplinesofmachinelearning,anditcovershowagentscanlearntobehaveoptimallyinanenvironmenttomaximizeacumulativereward.
2.1.1 PartiallyobservableMarkovdecisionprocesses
Apartially-observableMarkovdecisionprocess(POMDP)isageneralframeworkformodelingsequentialdecisionprocessesinenvironmentsthatcanbestochastic,complexandcontainhiddeninformation.Formally,itisatuple
(2.1) |
whichwealsorefertoastheenvironment.Thetupleismadeupofthefollowingelements:
-
Thestatespacedefinesthepossibleconfigurationsoftheenvironment
-
Theactionspacedescribeshowtheagentisabletointeractwiththeenvironment
-
Thetransitionfunctiondictatestheeffectsofdifferentactionsindifferentstates
-
Therewardfunctiondeterminestheimmediaterewardgiventotheagentfortransitioningbetweenanytwostateswithanyaction
-
Theinitialstatedistribution
-
Theobservationspacedefinestheaspectsoftheenvironmentthattheagentcanperceive
-
Theobservationfunctiondefineswhat(potentiallytransformed)subsetoftheenvironmenttheagentreceivesafteractinginagivenstate
-
Therewarddiscountfactor
Theenvironmentstartsinastatedrawnfrom,fromwhichtheagentinteractssequentiallywiththeenvironmentbychoosingactionfromactionspaceattimesteps.Theagentreceivesanobservationandarewardaftereachaction.TheobjectiveofanRLagentistolearnapolicythatdeterminesthebehavioroftheagentintheenvironmentbymappingstatestoaprobabilitydistributionover,written.AdiscountfactorisusuallyincludedinthedefinitionofPOMDPs,anditcomesintoplayintheoptimizationfunctionoftheagent.Namely,thepolicyshouldmaximizetheexpecteddiscountedfuturesumofrewards,ortheexpectedreturn,wherethereturnisdefinedas
(2.2) |
Thevaluefunctionisdefinedastheexpectationofthereturn(Eq.2.2),givenapolicyandaninitialstate
(2.3) |
Thereisatleastoneoptimalpolicythatisbetterthanorequaltoothers:forallstatesandallotherpolicies.
2.1.2 Model-freealgorithms
Model-freereinforcementlearninglearnsthepolicyoravaluefunctiondirectlyfromexperiencewithoutattemptingtoapproximatethedynamicsoftheenvironment.Twopopularclassesofmodel-freemethodsarevalue-basedmethodsandpolicy-basedmethods.Value-basedmethodsapproximateeitherthevaluefunction(sutton1988learning)oranotherusefulfunctionthatissimilartothevaluefunction,theaction-valuefunction.Thisfunctionisdefinedastheexpectedreturnoffollowingthepolicyaftertakinganactioninastate:
(2.4) |
Estimatingtheaction-valuefunctionisapivotalstepforalgorithmssuchasQ-learning(watkins1992q).Asimpleone-stepQ-learningupdatingruleis(sutton2018reinforcement):
(2.5) |
whereisapositivelearningrateparameterandtheinitialvaluesofarechosenarbitrarily.Q-learningisguaranteedtoconvergetotheoptimalpolicy’saction-valuefunction,undercertainconditions111Thisdependsonagoodlearningratescheduleandexplorationtechniques,whicharedifficulttodetermineinpractice,whichinturnyieldstheoptimalpolicy:.Thismethodtabulatesthevaluesandthusworkswithdiscreteactionsandstatespaces.Q-learninghasbeencombinedwithdeepneuralnetworkstoworkforactionsandstatespacesofhigherdimensions(mnih2015human).Policy-basedmethodsdonotlearnavaluefunction,butratherlearnthepolicydirectlybyoptimizinganobjectivefunctionwithrespectto.Wedescribetwoofthosemethodsthatweemployinthiswork:(1)proximalpolicyoptimization(PPO)(schulman2017proximal)and(2)actorcriticusingKronecker-factoredtrustregion(ACKTR)(wu2017scalable).PPOoptimizestheobjectivefunction
(2.6) |
whereisahyperparameter,andisanestimatoroftheadvantagefunction.Thecliptermreturnsif,otherwisethevalueisclippedtothecloserboundaryvalue.ThefullPPOalgorithmisshowninAlgorithm1.
ACKTRappliesthepolicygradientupdates
(2.7) |
where,isthelog-likelihoodoftheoutputdistributionofthepolicyandthelearningrateiscontrolleddynamicallywiththetrustregionparametertopreventthepolicyfromconvergingprematurelytoapoorpolicy.
2.1.3 Model-basedalgorithms
Model-basedreinforcementlearning(MBRL)algorithmslearntheoptimalpolicybyfirstestimatingthetransitionfunctionandtherewardfunction.Thesefunctionsareusuallycalledtheenvironmentdynamicsorworldmodelandarelearnedinasupervisedfashionfromadatasetofobservedtransitions,.Theworldmodelscanbeusedinmultipledifferentways,dependingonthealgorithm,toderivetheoptimalpolicy.Forexample,sampling-basedplanningalgorithmsuseandtosampleactionsequencesandcalculatetheirexpectedvalues:
(2.8) |
TheagentfollowstheactionsequenceassociatedwiththehighestexpectedrewardinEquation2.8.Thisisoftencombinedwithmodel-predictivecontrol(MPC),whereanewactionsequenceiscalculatedaftertakingthefirstactioninthelastsequence.Therearedifferentwaysofchoosingcandidateactionsequences,withthesimplestbeingtherandomshootingalgorithm(richards2005robust),thatdrawstheactionsfromauniformdistribution.
2.2 Deeplearning
Forthelastfewyears,deeplearninghasbeenonthecenterstageofmachinelearningresearch.Wemakeextensiveuseofdeeplearninginthisworkbecauseofitsparallelizability,itsefficientscalingwithlargedatasetsanditscapabilitytoapproximatecomplexfunctions.Themostbasictypeofdeeplearningmethodisthefeedforwarddeepnetwork,whichcompriseslayersofartificialneurons.Thetheoreticalcapabilitiesofartificialneuralnetworkswereguaranteedbycybenko1989approximation:hisUniversalApproximationTheoremhastheimplicationthatanycontinuousfunctionofrealnumberswithvaluesinaEuclideanspacecanbeapproximatedbyaneuralnetworkwithonehiddenlayer.Unfortunately,itisapureexistencetheorem,leavingthetaskofconstructingthenetworktotheengineer.
2.2.1 Theartificialneuron
Anartificialneuron(rosenblatt1958perceptron)isamathematicalfunctionthatmultiplieseachinputwithaconstant,addsabiastothelinearcombinationandthenappliesanon-linearitytotheoutcome:
(2.9) |
Thenon-linearityisknownastheactivationfunction,thecoefficientsareknownastheweightsandthetermisknownasthebias.Wenowbrieflydiscusssomecommonlyusedactivationfunctionsthatareemployedinthisthesis.Foramorecomprehensiveoverviewofrecenttrendsintheusageofactivationfunctions,weencouragethereadertolookatacomparisonbynwankpa2018activation.Thelogisticfunctioncanbeusedforbinaryclassification.
(2.10) |
Thisfunction"squashes"theinputstoliebetweenand,givingtheoutputaprobabilisticinterpretation.Thesoftmaxfunctionisanextensionofthelogisticfunctionforseveralclasses
(2.11) |
Theoutputofthesoftmaxfunctionisavectorofthesamedimensionalityastheinputvectorandsumsto.Thehyperbolictangentfunction(tanh)squashestheinputtoliebetween-1and1
(2.12) |
Thishasthecomputationaladvantageoverthelogisticfunctionthatbiasesinthegradientsareavoidedand0-centereddatagivesrisetolargerderivativesduringoptimizationofthenetworks(lecun2012efficient),makingthemamorefrequentchoiceasanactivationfunctioninhiddenlayers.Themostpopularnonlinearityfordeepneuralnetworksistherectifierfunction
(2.13) |
Therectifierfunctionoffersthesameadvantagesasthetanhfunctionbutatalowercost,asevaluatingexponentialsandperformingdivisionisavoided.Therectifierfunctionisalsocalledarectifiedlinearunit,anditiscommonlyabbreviatedas"ReLU".
2.2.2 Feedforwardneuralnetworks
ComputationalunitsimplementingthefunctioninEq.2.9canbearrangedhierarchically,withtheinputofaneuronconsistingoftheoutputofotherneurons.Anexamplefeedforwardneuralnetworkormultilayerperceptron(MLP)222Feedforwardneuralnetworksaresometimeslooselyreferredtoasmulti-layerperceptrons(MLPs),namedafteranearlyartificialneuronmodelcalledtheperceptron.However,perceptronsuseahardthresholdactivationfunctionwhilemodernMLPscanuseanydifferentiableactivation,sotheyareoftennotperceptrons,inthestrictmeaningoftheword.isdepictedinFigure2.1.
Thefigureshowsanetworkwithonehiddenlayer,butitcaninprinciplehaveanynumberofhiddenlayers.Thesameistrueforthenumberofinputsandoutputs.
2.2.3 Optimizingneuralnetworks
Trainingadeepneuralnetworkinvolvestrainingdataandalossfunction.Fortraininganartificialneuralnetwork,anappropriatelossfunctionhastobefoundtomatchboththetaskathandalongwiththefinallayer’sactivationfunction.Thelossfunctionmeasuresthedifferencebetweentheoutputofthenetwork,whenthedataispassedthroughit,andthedesiredoutcome.Theparametersofthenetworkarethenadjustedtowardtheoptimalthatminimizethelossfunctionoverthedata
(2.14) |
whereisthenumberofdatapointsandisthepredictionofaneuralnetworkwithparameters,forsamplewiththetruevalue.Thequantityisalsoknownastheempiricalrisk.Onesuchexampleisthemean-squarederror(MSE)loss
(2.15) |
MaximizingthelikelihoodofGaussiandatawithrespecttotheparametersoftheassumedmodelthatgeneratedthedataisequivalenttominimizingtheMSE,makingitapopularchoiceforregressiontasks(i.e.whentheoutputlayeractivationislinearorReLU).Forclassificationnetworkswithalogisticorsigmoidactivationoutput,asuitablelossfunctionisthecross-entropylossfunction
(2.16) |
whereisthenumberofclasses333If,thenthelabelvectorcould,forexample,taketheform..SimilarlytoMSE,thislossfunctionisalsomotivatedbythefactthatminimizingthecross-entropylossisequivalenttomaximizingthelikelihoodofuniformlydistributedi.i.d.data(yao2019negative).Sofar,thelossfunctionswehaveseenrequirealabelasapartoftheinput.Thismakesthemsupervisedlearninglosses.Manycommonlyusedlossfunctionsexistthatdonotrequirelabels,thosearecalledunsupervisedlearninglosses.Oncewehavedecidedonalossfunctiontominimize,thenextstepistochoosetheoptimizationalgorithm.ThemostpopularonesareimplementedinsoftwarelibrariessuchasKeras(chollet2015keras),MXNet(chen2015mxnet),Tensorflow(abadi2016tensorflow),Pytorch(NEURIPS2019_9015)andseveralothers.Themostcommonwayoftrainingdeepneuralnetworksisbyemployingavariationofthegradientdescent(curry1944method)algorithm.Gradientdescentmethodstakeadvantageofthefactthatafunctiondecreasesthefastestinthenegativedirectionofitsgradient,convergingatalocalminimum.Analgorithmcalledbackpropagation(linnainmaa1970representation)computesthegradientofthelossfunctionwithrespecttoeachparameter(e.g.weightsandbiases)viathechainrulefromcalculus.Thesegradientsarethenusedforanupdatestepforeachparameter:
(2.17) |
wherekeepstrackoftheindexoftheiteration.Theparameterisknownasthelearningrateoftheoptimizationalgorithm.TheclassicalgradientdescentmethodinEquation2.17calculatestheaveragelossovertheentiredataset.Thiscanbemadefaster,withoutlosingconvergenceguarantees,byperformingaweightupdateusingthegradientfromonlyasubsetofthetrainingdata–oratrainingbatch–ineachiteration.Thisstochasticapproximationofgradientdescentiscalledstochasticgradientdescent.Agoodlearningrateisimportantforthepracticalconvergenceofstochasticgradientdescent:ifitisverysmall,thenthetimeittakestoconvergecanbetoolong.However,ifitistoolarge,thenthereisariskofovershootingthelocalminima.Itisgenerallygoodtostartoffwithalargerlearningrateandthenmakeitsmallerwithtime.Determiningexactlywhentodecreasethesizeofthelearningrate,andbyhowmuch,canbelaboriousinpractice.Forthisreason,therehavebeenproposedseveralgradientdescentmethodsthatautomaticallyfindthislearningrateschedulewithadaptivelearningrates,forexample,rmsprop(tieleman2012lecture)andAdam(kingma2014adam).
2.2.4 Convolutionalneuralnetworks
ThenetworkinFig.2.1isafully-connectedordenseneuralnetwork,becauseeveryunitisconnectedtoeveryunitintheprecedinglayer.Thereareother,morespecialized,neuralnetworksthatarenotfully-connected,oneofthemostimportantclassbeingconvolutionalneuralnetworks.Forinputdatawithaspatialstructure,forexampleimages,convolutionalneuralnetworksareveryefficient.Incontrasttodenseneuralnetworks,eachunitinconvolutionalneuralnetworksonlyreceivesasinputasubsetoftheoutputsfromthepreviouslayer.Morespecifically,eachunitonlyreceivesinputsfromunitsthatareinspatialproximityofoneanother.WaldoTobler’sFirstLawofGeographycapturessuccinctlythemotivationbehindconvolutionalnetworks(tobler1970computer):"everythingisrelatedtoeverythingelse,butnearthingsaremorerelatedthandistantthings".Anotherkeypropertyofconvolutionalneuralnetworksistheoneofsharedweights–eachcomputationalunitinthesamelayerhasthesamesetofweights,eventhoughtheyreceivedifferentinputs.Thesepropertieshavethepracticalconsequencethatthenumberofparametersiscutdownsubstantially:eachneuronprocessingagrayscaleimagewouldrequireweights,whichisthenmultipliedagainbythenumberofneuronsinthelayerforthetotalparametercount.Ontheotherhand,aconvolutionallayerwhereeachneuronprocessesawindow(arelativelylargewindowsize)aroundapixelwouldrequireparameters–forthewholelayer,duetothesharedweights.Thus,processingtheimageinputinthisexamplewithaconvolutionallayerinsteadofadenselayerwithasingleneuronreducesthenumberofparametersbyafactorof.Inadditiontotheactivationfunction,wespecifythevalueoffourhyperparametersforconvolutionallayerswhenwedescribespecificnetworkarchitecturesinthisthesis:thenumberoffilterstoslidealongtheheightandwidthoftheinput,thesize,orthereceptivefield,ofthefilters,thestrideandhowmuchzeropaddingtouse.Thereceptivefielddictatesthesizesofthespatialdimensions(height,width)oftheinputthattheneurontakes.InFigure2.1(b),forinstance,wewouldsaythatthefiltersizeis(),despitethedimensionoftheweightsbeing()–thisisduetothesizeoftheinputdepth.Notethateventhoughtheseheightandwidthvaluesareconstrained,theneuronalwaysprocessesthefulldepthoftheinput.Thestridecontrolshowmanystepsthefilterstakeastheinputisprocessedalongitsspatialdimensions.Zeropaddingamountstoaddingrowsandcolumnsaroundtheinput,composedentirelyofzeros,alsoalongthedepth.Paddingisoftendonetomaketheoutputvalid,forinstance,ifthestrideorfiltersizewouldpotentiallycauseaneurontoprocessinputsthatare"outofbounds".Thisisoftencalledvalidpadding.Anotherpurposeofpaddingistopadtheinputwithzerostokeeptheoriginalspatialdimensionsunchanged,thisiscalledsamepadding.Forexample,zeroscanbeaddedaroundanimageofsizebeforeitisprocessedbyanetworkthatrequiresaninputof.TherelationshipbetweentheinputandoutputofaconvolutionalfilterisillustratedinFigure 2.1(a).InFigure 2.1(b)weaddadepthdimension.Inourexample,thestrideis1.However,intheexample,ifwewouldwanttoincreasethestridevaluethenwewouldhavetointroducezeropadding.
Generally,anydeepneuralnetworkiscalledaconvolutionalneuralnetworkifithasoneormoreconvolutionallayers.Thisholdstrueevenifnotallthelayersareconvolutionallayers.Somepopularlayertypesinclude:
-
Subsamplinglayersthatkeeponlyeverythrowandcolumntoreducethecomputationalcomplexity.Thisisusuallyonlydoneasafirstpreprocessingstepforveryhighdimensionalinputs,wherethrowingawaytheinformationisnotasharmfulasintheintermediatelayersofthenetwork.
-
Maxpoolinglayersslidealongthewidthandheightofeachdepthsliceintheinputandreturnthelargestsinglevalueintheirwindow.Theyreducethecomputationalcomplexitybyreducingtheinputdimension,andtheymaketherepresentationapproximatelyinvarianttosmalltranslations.
-
Flatteninglayersaretechnicallayersthatre-shapetensororarrayinputstovectors.
-
Normalizationlayersforre-centeringandre-scalinginputstolayers.Theyhelpspeedingupandstabilizingthelearningprocess.
Deepneuralnetworksoftenhaveanumberofparametersinthethousandsorbillions.Thismakestheinterpretationofthecalculationsdifficult,especiallyduetothenumberoflayers.Visualizationsofthefirstfewconvolutionallayersintrainednetworkshasbeendone(zeiler2014visualizing),withtheresultthatthefirstlayer’sfiltersusuallycaptureedges,corners,andcolorcombinations.Thesecondlayerthencombinesthesefeaturesintomorecomplicatedpatterns.Higherfiltersthencombinethesefeaturesfurtherintotextures,objectpartsorevenwholeobjects.
2.3 Representationlearning
Incomputerscienceingeneral,andmachinelearninginparticular,thechoiceoftherepresentationofthedatathatisbeingprocessediscrucial.Thiscouldmeanchoosingtherightdatastructureforthetask,suchasdesigningadatabaseforfastsearching.Thiscouldalsomeanchoosingtherightindependentvariablesforastatisticalmodel.Theextractionofusefulinformationaboutthedataisthusanimportanttask.Thisisespeciallytrueiftheinputisfromahigh-dimensionalspace,withthetermcurseofdimensionality(bellman1957dynamic)beingusedsincethelate1950sfordescribingproblemsofthisnature:theamountofdataneededtomakestatisticallysignificantclaimsgrowsexponentiallywiththedimensionalityofthespacethatthedataresidesin.Thismakesthediscoveryofmethodsforreducingthedimensionalityoftheinput,withoutdiscardingimportantinformation,anattractiveprospect.
2.3.1 Supervisedrepresentationlearning
Representationsariseinartificialneuralnetworks(ANNs)whentheyaretrainedforaregressionorclassificationobjective.OneviewofANNsisthatthehiddenlayersperformfeatureextractionontheinput,transformingittoamoresuitableformfortheoutputlayerthatperformsthefinalcalculationsforthetask.sharif2014cnnmadeuseofthisinsightbypre-processinginputsforsupervisedlearningmodelswiththeintermediatelayeroutputsofaconvolutionalnetworkthatwaspre-trainedonanobjectclassificationtask.Theyachievedimpressiveresultsonadiverserangeoftasks,suchasimageretrievalandscenerecognition.ThehierarchicalstructureofANNsalsohasthetheoreticalandpracticaladvantagethatthefeaturesateachlevelarere-usedforthedifferentfeaturesatthehigherlevel.
2.3.2 Unsupervisedrepresentationlearning
ForhierarchicalmethodslikeANNs,representationsaregeneratedatthesametimeasthewholesystemistrainedtominimizeerroronhumantaggeddata.Mostotherrepresentationlearningmethodsareunsupervisedandareabletolearnusefulfeaturesonunlabeleddata.
Pca
Principalcomponentanalysis(PCA)isanunsupervisedlearningmethodinventedbypearson1901liii.Themethodfindsanorthogonallineartransformationforzero-mean,-dimensionaldata.Thecolumnsofaretheprincipalcomponentsofthedata,whichpointtothedirectionofthegreatestvarianceinthedata.Thefirstprincipalcomponentisthesolutiontotheequation
(2.18) |
ThesecondcomponentisfoundbyapplyingEquation2.18againtothetransformeddatathatisgivenbyremovingthecontributionofthefirstcomponentfrom,,andsoon.Allthecomponentscanalsobefoundsimultaneouslybycomputingtheeigendecompositionofthedata’scovariancematrix,asithasbeenshownthattheprincipalcomponentsareequaltotheresultingeigenvectors(shlens2014tutorial).Dimensionalityreductioncanbedonebycreatingamatrix,consistingonlyofthefirstprincipalcomponents,andapplyingittothedata.Thisyieldsthelower-dimensional,transformeddatamatrix.Bydoingthis,thedataisprojectedontothesubspacewiththemaximumvariance.Thismatrixofprincipalcomponentshasthepropertyofminimizingthereconstructionerror.InFigure2.3,weshowavisualizationoftheUCIMLdigitsdataset,whichconsistsofgrayscaleimagesofhand-writtendigits.Thefigureshowstheresultafterthedataisprojectedonitsfirsttwoprincipalcomponents,showingclearclusteringofthedigits.Wealsoshowtheclusteringfoundbyt-distributedstochasticneighborembedding(discussedbelow),whichseparatestheclustersmorecleanlyforthisdataset.
t-SNE
Overthelastfewyears,t-distributedstochasticneighborembedding(t-SNE)(maaten2008visualizing)hasbecomeoneofthemostpopulardimensionalityreductiontechniquesforvisualization(arora2018analysis).Theassumptionbehindthealgorithmisthatthehigh-dimensionalinputdataliesonalocallyconnectedmanifold.First,anauxiliaryasymmetricmeasurebetweeneachpairofdatapointsiscalculatedaccordingtotheequation
(2.19) |
whereasitisofprimaryinteresttomodelpairwisesimilarities.TheconstantisthevarianceofaGaussianthatiscenteredaroundandcontrolshowinfluentialnearbydatapointsareincontrasttofarawaydatapoints.Thenthepairwisesimilaritiesarecomputedusingtheformula
(2.20) |
thesesimilaritiesaredefinedtobesymmetrizedconditionalprobabilitiestoensurethateachdatapointmakesasignificantcontributiontothecostfunction.Next,alower-dimensionalvectorofdatapointsYiscreatedandinitializedrandomly.EachelementinYcorrespondstoanelementinX:similaritiesbetweendatapointsanarecalculatedaccordingtotheformula
(2.21) |
Thelow-dimensionaldatapointsarethenmovedaroundtominimizetheKLdivergence--ameasureofthedifferencebetweentwoprobabilitydistributions444Notethatandcanbeinterpretedasprobabilitiessinceandandforalland.–betweenand
(2.22) |
Equation2.22isminimizedusinggradientdescent,ensuringthatpointsthataresimilarinthehigh-dimensionalspacearealsosimilarinthenew,low-dimensionalspace.
Autoencoders
Theautoencoderisatypeofneuralnetwork555Autoencoderscanconsistofanytypesofneuralnetworklayers.Forexample,anautoencodermadeupofconvolutionallayersiscalledaconvolutionalautoencoders.thatconsistsofanencoderpart,whichmapstheinputtoanencoding(usuallyofasmallersizethantheinput),andadecoderpart,thatoutputsareconstructionoftheinput(Figure2.4).
Theobjectivefunctionisthesquarederrorbetweentheinputandtheoutput
(2.23) |
Usefulrepresentationsariseinthisprocessifthecorrectconstraintsareplacedonthesystem.Withoutconstraints,thesystemcouldenduplearningtheidentityfunction,thattriviallysatisfiestheobjectivefunction:.Thisproblemcanbeovercomebyincludingahiddenlayerintheautoencoderofalowerdimensionalitythantheinputspace.Inthiscase,theautoencoderissaidtobeundercomplete.Undercomplete,single-layerautoencoderswithlinearactivationfunctionsarealmostequivalenttoPCA.The-dimensionalhiddenlayerspansthesamesubspaceasthefirstprincipalcomponents,ortheprincipalsubspace,ofthedata(baldi1989neural).UnlikePCA,however,theweightsofthehiddenlayerarenotguaranteedtobeorthonormalnorordered.Ifthesmallestdimensionalityofahiddenlayerislargerthanthesizeoftheinput,theautoencoderissaidtobeovercomplete.Overcompleteautoencoderscanbepreventedfromlearningtheidentityfunctioniftheobjectivefunction(Equation2.23)iscombinedwitharegularizationterm.Forexample,insteadofreconstructingtheoriginalinputasitis,vincent2008extractingproposethatthegoalcouldbetorecovertheinputafterithasbeencorruptedwithnoise(e.g.Gaussiannoiseorsalt-and-peppernoise).Theideabehindthisisthattheautoencoderhastolearnrepresentationsthatarestableandrobustunderthecorruptionoftheinput,andthatthedenoisingtaskextractsausefulstructureoftheinputdistribution(vincent2010stacked).
2.3.3 Self-supervisedlearning
Approachesthatlearnrepresentationsbywayofsolvingauxiliarytasks,inthesensethattherepresentationthatarisesfromtheoptimizationismoreimportantthanachievingagoodperformanceonthetaskitself,issometimescalledself-supervisedlearningintheliterature(gogna2016semi).Forexample,ha2018worldtrainanautoencodertoreconstructtheobservationsinanRLenvironment,buttheydonottakeadvantageofthereconstructivecapabilitiesofthenetworkwhentheytraintheirRLpolicies,andtheyuseonlytheencoderpartofthenetworkforvisualpre-processing.Anotherexampleisthedenoisingautoencoderfromthepreviouschapter.Onedirectionofself-supervisedlearningthathasbeenfollowedintheliteratureistoinventataskthatrequiresanunderstandingofthedomaintosolvecorrectly,suchasthereconstructivenetworkusedbyha2018world.Anotherexampleistheworkbygidaris2018unsupervised,wholearnrepresentationsbyapplyingrandomrotationstonaturalimagesandtrainanetworktopredictwhichrotationwasapplied.Thisencouragesthenetworktolearnhigh-levelconcepts,suchasbeaks,wingsandtalons,andtheirrelativepositions.Morecommonly,self-supervisedlearningmethodsconsistofobscuringsomepartoftheinputandtrainamodeltopredictthatpartgivensomeothersubsetoftheinput,aswedoinChapter4andChapter5.Somevariationsofthisideainclude,forexample,colorization(zhang2016colorful),wherecolorfulimagesareconvertedtograyscalewiththegoalofpredictingtheoriginalcolors.
Chapter 3 Learninggradient-basedICAbyneurallyestimatingmutualinformation
Inthischapter111Thischapterisadaptedfrom(hlynsson2019learning),whichwaspublishedintheJointGerman/AustrianConferenceonArtificialIntelligence(KünstlicheIntelligenz).,weintroduceanovelmethodoftrainingneuralnetworksinanunsupervisedmannertooutputstatisticallyindependentcomponents,amethodwecallGrICA.Weuseamutualinformationneuralestimation(MINE)network(belghazi2018mine)toguidethelearningofanencodertoproducestatisticallyindependentoutputs.Thisisarecentmethodofestimatingthemutualinformationofrandomvariablesinadeeplearningsetting,andweapplyittogetaqualitativelyequalsolutiontoFastICAonblind-source-separationofnoisysources.WeinvestigatetheusefulnessofourmethodincontrasttoarepresentationlearnedbyaconvolutionalautoencoderforpreprocessingvisualinputsforanRLagent,butthecomparisonisunfavorableforourapproach.Therestofthischapterhasthefollowingorganization:Section 3.1motivatesthedesignofrepresentationswithindependentcomponents.Section 3.2explainstheICAproblemformulation.Section 3.3brieflydiscussesrelatedwork.Section3.4introducesourmethodofusingamutualinformationneuralestimatortoteachaneuralnetworktooutputindependentcomponents.Section3.5showstheexperimentalevaluationofourmethod.Finally,weconcludewithadiscussioninSection 3.6.
3.1 Introduction
Thegeneralobjectiveoftraininganencodertolearnstatisticallyindependent,factorialcodesofthedatahasbeencalledthe"holygrail"ofunsupervisedlearning(schmidhuberunsupervised).Wesuggestthatlearningtorecoverfew,statisticallyindependent,latentvariablesofanRLenvironmentcanspeedupthetrainingofRLagents.Forenvironmentswherehigh-dimensionalobservationsarecreatedfromasmallsetofstatisticallyindependentlatentvariables,thistechniquecouldreducethedimensionalityoftheobservationswithoutdiscardingunnecessaryinformation.AnothertheoreticaladvantageofusingthiskindofapproachinRLsettings,comparedtotheothermethodswedevelopinthisPhDdissertation,isthatitrequiresonlyout-of-contextobservationdatafromtheenvironment.LearningtheGrICArepresentationdoesnotrequirefulltransitionstuples222Weusetodenotethestate,todenotetheaction,todenotetherewardandtodenotethenextstate.andcanthusbeusedwhenthetransitionorrewarddynamicsoftheenvironmentschange.Learningrepresentationsthatoutputstatisticallyindependentfeaturescanbedoneinanynumberofways,forexample,bytryingtomakeeachoutputasunpredictableaspossiblegiventheotheroutputunits(schmidhuber1992learning).Wetaketheapproachofminimizingthemutualinformation,asestimatedbyaMINEnetwork,betweentheoutputunitsofadifferentiableencodernetwork.Thisisdonebysimplealternateoptimizationofthetwonetworks.
3.2 Background
Independentcomponentanalysis(ICA)aimsatestimatingunknownsourcesthathavebeenmixedtogetherintoanobservation.TheusualassumptionsarethatthesourcesarestatisticallyindependentandnomorethanoneisGaussian(jutten2003advances).Thenow-cementedmetaphorisoneofacocktailpartyproblem:severalpeople(sources)arespeakingsimultaneously,andtheirspeechhasbeenmixedtogetherinarecording(observation).Thetaskistounmixtherecordingsuchthatalldialoguescanbelistenedtoclearly.InlinearICA,wehaveadatamatrixwhoserowsaredrawnfromstatisticallyindependentdistributions,amixingmatrix,andanobservationmatrix:
andwewanttofindanunmixingmatrixofthatrecoversthesourcesuptoapermutationandscaling:
Thegeneralnon-linearICAproblemisill-posed(hyvarinen1999nonlinear; darmois1953analyse)asthereisaninfinitenumberofsolutionsifthespaceofmixingfunctionsisunconstrained.However,post-linear(taleb1999source)(PNL)ICAissolvable.Thisisaparticularcaseofnon-linearICAwheretheobservationstaketheform
whereoperatescomponentwise,i.e..Theproblemissolvedefficientlyifisatleastapproximatelyinvertible(ziehe2003blind)andthereareapproachestooptimizetheproblemfornon-invertibleaswell(ilin2004post).Forsignalswithtime-structure,however,theproblemisnotill-posedeventhoughitisfori.i.d.samples(blaschke2007independent; sprekeler2014extension).ToframeICAasanoptimizationproblem,wemustfindawaytomeasurethestatisticalindependenceoftheoutputcomponentsandminimizethisquantity.Therearetwomainwaystoapproachthis:eitherminimizethemutualinformationbetweenthesources(amari1996new; bell1995non; cardoso1997infomax),ormaximizethesources’non-Gaussianity(hyvarinen2000independent; blaschke2004cubica).
3.3 Relatedwork
TherehasbeenaninterestincombiningneuralnetworkswiththeprinciplesofICAforseveraldecades.InPredictabilityMaximization(schmidhuber1992learning),agameisplayedwhereoneagenttriestopredictthevalueofoneoutputcomponentgiventheothers,andtheothertriestomaximizetheunpredictability.Morerecently,DeepInfoMax(DIM)(hjelm2018learning),GraphDeepInfoMax(velivckovic2018deep)andGenerativeadversarialnetworks(goodfellow2014generative),utilizetheworkofBrakeletal.(brakel2017learning)todeeplylearnICA.Ourworkdiffersfromtheseadversarialtrainingmethodsintherulesoftheminimaxgamebeingplayedtoachievethis:oneagentdirectlyminimizesthelower-boundofthemutualinformation,asderivedfromtheDonsker-VaradhancharacterizationoftheKL-Divergence,astheothertriestomaximizeit.
3.4 Method
3.4.1 Reinforcementlearningenvironment
Ourrepresentationistestedona2Denvironmentwheretheagentissupposedtoavoidafieldoflavaandreachagoalontheothersideoftheroom.Thefullstateoftheenvironmentisthewholeroom(Fig3.1,left)andtheobservationisanisometricviewoftheagentanditspointofview(Fig3.1,right).TheobservationsareRGBimagesandtheagentcantakeastepforward,turnleftorturnright.
Theepisodeterminatesiftheagentstepstowardlavaorthegoal.Theagentreceivesapositiverewardifitreachesthegoal,buttheepisodeterminateswithzerorewardifitstepsintothelava.Thereisnochangeiftheagentisfacedtowardthewallandtakesastepforward.
3.4.2 Learningtheindependentcomponents
Wetrainanencodertogenerateanoutputsuchthatanyoneoftheoutputcomponentsisstatisticallyindependentoftheunionoftheothers,i.e.,where
Thestatisticalindependenceofandcanbemaximizedbyminimizingtheirmutualinformation
(3.1) |
Thisquantityishardtoestimate,particularlyforhigh-dimensionaldata.NotethatEquation 3.1canbemoresuccinctlyastheKLdivergencebetweenand:
(3.2) |
donsker1975asymptoticfamouslyprovedthattheKLDivergenceadmitstherepresentation
(3.3) |
wherethedomainisaclosedandboundedsubsetof.belghazi2018mineintroduceamethodofusingtheDonsker-Varadhanrepresentationtoestimatemutualinformationwithneuralnetworks,withanarchitecturetheycallmutualinformationneuralestimation(MINE)networks.Tolearnrepresentationswithindependentcomponents,wethereforeestimatethelowerboundofEq. (3.1)usingaMINEnetwork:
(3.4) |
whereindicatesthattheexpectedvalueistakenoverthejointandsimilarlyfortheproductofmarginals.Thenetworksandareparameterizedbyand.TheencodertakestheobservationsasinputandtheMINEnetworktakestheoutputoftheencoderasaninput.Thenetworkminimizesinorderfortheoutputstohavelowmutualinformationandthereforebestatisticallyindependent.Inordertogetafaithfulestimationofthelowerboundofthemutualinformation,thenetworkmaximizes.Thus,inapush-pullfashion,thesystemasawholeconvergestoindependentoutputcomponentsoftheencodernetwork.Inpractice,ratherthantrainingtheandnetworkssimultaneouslyitprovedusefultotrainfromscratchforafewiterationsaftereachiterationoftraining,sincethelossfunctionsofandareatoddswitheachother.Whentheencoderistrained,theMINEnetwork’sparametersarefrozenandviceversa.
3.5 Results
Wetryourmethodontwoscenarios:(1)wecompareittocanonicalimplementationsofICAonatextbookexampleofestimatingsourcesfromnoisydataand(2)weuseourmethodwithamorecomplexfunctionapproximatorforpreprocessingobservationsinanRLsetting.
3.5.1 Recoveringnoisysignals
Wevalidatethemethod333Fullcodeforthenoisysignalrecoveryexperimentisavailableatgithub.com/wiskott-lab/gradient-based-ica/blob/master/bss3.ipynbaforlinearnoisyICAexample(sklearn).Threeindependent,noisysources—sinewave,squarewaveandsawtoothsignal(Fig. 3.2(a))—aremixedlinearly(Fig. 3.2(b)):
Theencoderisasingle-layerneuralnetworkwithlinearactivation,withadifferentiablewhiteninglayer(schuler2018gradient)beforetheoutput.Thewhiteninglayerisakeycomponentforperformingsuccessfulblindsourceseparationforourmethod.Statisticallyindependentrandomvariablesarenecessarilyuncorrelated,sowhiteningtheoutputbyconstructionbeforehandsimplifiestheoptimizationproblemsignificantly.TheMINEnetworkisaseven-layerneuralnetwork.Eachlayerbutthelastonehas64unitswitharectifiedlinearactivationfunction.Eachtrainingepochoftheencoderisfollowedbyseventrainingepochsof.Estimatingtheexactmutualinformationisnotessential,sofewiterationssufficeforagoodgradientdirection.SincetheMINEnetworkisappliedtoeachcomponentindividually,toestimatemutualinformation(Eq. 3.4),weneedtopasseachsamplethroughtheMINEnetworktimes—onceforeachcomponent.Equivalently,onecouldconceptualizethisashavingcopiesoftheMINEnetworkandfeedingthesamplestoitinparallel,withdifferentcomponentssingledout.Thus,forsamplewefeedin,foreach.BothnetworksareoptimizedusingNesterovmomentumADAM(dozat2016incorporating)withalearningrateof.Forthissimpleexample,ourmethod(Fig. 3.2(c))isequivalentlygoodatunmixingthesignalsasFastICAasimplementedinthescikit-learnpackage(scikit-learn)(Fig. 3.2(d)).Notethat,ingeneral,thesourcescanonlyberecovereduptopermutationandscaling.
3.5.2 Lavafieldenvironment
Fortheseexperiments,welearnourICAfeaturesusingaconvolutionalneuralnetwork.Werollout100episodeswithafullyrandompolicytogatherdataforlearningtheindependentfeatures:anagentisplacedintheupperrightcorneroftheenvironmentandturnsleft,rightortakesastepforwardwithequalprobabilities.Theresultobservationsarethengathereduntiltheepisodeterminates.Thisgivesus4130observationstotraintherepresentationonforthisexperiment.Therepresentationwelearnis32-dimensional.TheMINEnetworkisthesameasabove,buttheencodernetworkisafive-layerconvolutionalneuralnetwork.Thefirsttwolayersareconvolutionallayerseachwith32filters,arectifiedlinearunitactivationandnopadding.Thefirstlayerhasastrideof4andthesecondoneastrideof3.Theoutputoflayer2isthenpassedtoaflatteninglayer,reshapingthetensoroutputtoavectorinputforalineardenselayerwith32units.Theoutputofthedenselayeristhenfinallypassedtoaspheringlayer,givingustheencoding.ThenetworkdescriptionissummarizedinTable 3.1.
Layer | Filters | Kernel | Stride | Padding | Output | Learnable |
Shape | Parameters | |||||
Input | - | - | - | - | 0 | |
Conv. | 32 | 3x3 | 4 | None | 896 | |
ReLU | - | - | - | - | 0 | |
Conv. | 32 | 3x3 | 3 | None | 9248 | |
ReLU | - | - | - | - | 0 | |
Flatten | - | - | - | - | 512 | 0 |
Dense | - | - | - | - | 32 | 16416 |
ReLU | - | - | - | - | 0 | |
Sphering | - | - | - | - | 32 | 0 |
ThetrainingofourICArepresentationfollowsthesameschemeasbefore:wetraintheestimatorforsevenepochsaftereachtrainingepochoftheencoder.Wetrainedtheencoderfor100epochsandtheestimatorfor700epochsforthisexperiment.OurtrainedrepresentationisusedtopreprocessthevisualinputforaRLagent.WechooseActorCriticusingKronecker-FactoredTrustRegion(ACKTR)asimplementedbyStableBaselineswithdefaultparametersandmodel.TheACKTRdefaultmodelisafully-connectedneuralnetworkwithtwolayersof64unitseachandatanhactivationfunction.WetrainedanACKTRmodelfromscratchtwentytimesonourICArepresentation,andshowtheresultsinFig 3.4.Thisindicatesthatweareabletolearntheenvironmentusingourmethodtopreprocesstheinputforareinforcementlearningmethod.
Tovisualizethebehaviorofouragent(Figure 3.5),wechoosethreesuccessfulepisodesfromafully-trainedmodelafter100thousandtimestepsoftrainingandthreeunsuccessfulonesfromamodelwith80thousandtimestepsoftraining.Itisnoteworthythattheagentprefersawidemarginbetweenitselfandthelavafieldasitpassesit,eventhoughamoreoptimalstrategywouldhavetheagentwalktotherightwiththelavaleftimmediatelyonitsleft-handside.Theagentalsosometimesdoublesbackbeforecontinuingtowardthegoalagain.
Wealsotriedtoseewhetherourmethodgeneralizestoavariantoftheenvironmentwherethelowerrowoflavaismovedtothebottom,punishingourstrategy.Therewereonly3successesinathousandtestiterations,showninFigure 3.6,alongwiththreeoftheunsuccessfulepisodes.
Forcomparison,wealsotrainedaconvolutionalautoencoder(CAE)forreconstructiononthesamedatasetweusedtotrainourICArepresentation.WerantheexperimentagainwiththeresultingencodingaftertheCAEwastrained.Theencoderisthesameasthenetworkusedforourrepresentation,exceptthatitdoesnothavethespheringlayer.Thedecodingportionofthenetworkconsistsofa392-unitdenselayerwhoseoutputisreshapedtoatensorandpassedtoaconvolutionallayerwith32filtersofsize.Theoutputisthenup-sampledtoquadruplethewidthandheightofthetensor.Thisisthenfollowedbyanotherconvolutionallayer,ofthesamekindasthepreviousone,andanotherup-samplinglayerthatdoublesthewidthandheightofthetensor.Theoutputthenfinallygoesthroughaconvolutionallayerwith3filtersofsize.EachlayerinthedecoderhasaReLUactivation,exceptforthelastwhichhasalogisticactivationtoreconstructpixelvaluesthathavebeennormalizedlieintherange.Eachconvolutionallayerdoeszero-paddingtopreservethewidthandtheheightoftheinputtensor.SeeTable3.2foranoverviewofthearchitecture.
Layer | Filters | Kernel | Stride | Padding | Output | Learnable |
Shape | Parameters | |||||
\rowcolorlightblueInput | - | - | - | - | 0 | |
\rowcolorlightblueConv. | 32 | 3x3 | 4 | None | 896 | |
\rowcolorlightblueReLU | - | - | - | - | 0 | |
\rowcolorlightblueConv. | 32 | 3x3 | 3 | None | 9248 | |
\rowcolorlightblueReLU | - | - | - | - | 0 | |
\rowcolorlightblueFlatten | - | - | - | - | 512 | 0 |
\rowcolorlightblueDense | - | - | - | - | 32 | 16416 |
\rowcolorlightblueReLU | - | - | - | - | 32 | 0 |
\rowcolorlightredDense | - | - | - | - | 392 | 12936 |
\rowcolorlightredReLU | - | - | - | - | 0 | |
\rowcolorlightredReshape | - | - | - | - | 0 | |
\rowcolorlightredConv. | 32 | 3x3 | 1 | Same | 2336 | |
\rowcolorlightredReLU | - | - | - | - | 0 | |
\rowcolorlightredUpsampling | - | 4x4 | - | - | 0 | |
\rowcolorlightredConv. | 32 | 3x3 | 1 | Same | 9248 | |
\rowcolorlightredReLU | - | - | - | - | 0 | |
\rowcolorlightredUpsampling | - | 4x4 | - | - | 0 | |
\rowcolorlightredConv. | 3 | 3x3 | 1 | Same | 9248 | |
\rowcolorlightredTanh | - | - | - | - | 867 |
WetraintheCAEfor50epochs.Eventhoughthisisalownumberofepochscomparedtothetrainingforourmethod,itsreconstructivepropertiesarealreadyquitegood(Figure3.7).
Werepeattheexperimentasbefore,butnowwiththeCAEfeaturesinsteadofourICArepresentation.ThetrainingcurveisshowninFigure3.8.Thisstraightforwardbaselinealgorithmlearnstosolvetheenvironmenttwiceasfastasourmethod.
3.6 Conclusion
WehaveintroducedanoveltechniquefortrainingadifferentiablefunctiontoperformICA.Themethodconsistsofalternatingtheoptimizationofanencoderandaneuralmutualinformationneuralestimation(MINE)network.Themutualinformationestimatebetweeneachencoderoutputandtheunionoftheothersisminimizedwithrespecttotheencoder’sparameters.ThesolutionlearnedbyourapproachagreeswiththeonelearnedbythecanonicalICAalgorithm,FastICA.Anadvantageofourmethod,however,isthatitistriviallyextendedforovercompleteorundercompleteICAbychangingthenumberofoutputunitsoftheneuralnetwork.Weapplyouralgorithmonhigh-dimensionaldatatotesttherepresentationlearnedbyourmethodfordimensionalityreductionofvisualinputsforanRLagent.Theagentisabletouseourrepresentationtolearnhowtosolveasimplenavigationtask,butthepreprocessingofferedbyourapproachisoutperformedbyaconvolutionalautoencoder.Ourmethodworksinprinciple,ascanbeseenbythenoisysignalrecoveryexperiment,butitseffectivenessforlearningrepresentationsforRLagentsremainsunproven.Eventhoughtheobservationsofthelavafieldenvironmentarefullydeterminedbythreelatentvariablesthatarestatisticallyindependent,theagent’sxandypositions,alongwithitsdirection,ourrepresentationwasstillnotusefulenoughtobeattherelativelysimplebaselines.
Chapter 4 Latentrepresentationpredictionnetworks
Inthischapter111Thischapterisadaptedfrom(hlynsson2020latent).,weintroducearepresentationlearningtechniqueforRLsettingsthatwenameLatentRepresentationPrediction(LARP).ThisnovelsystemtakesadvantageofmoreinformationgivenbytheenvironmentthanourGrICAmethodfromthepreviouschapter,thatonlylearnedfromstaticobservationswithouttakingadvantageoftheknowledgethatthesystemwillbeusedinadynamicsetting.Thatistosay,wewillnowutilizethetripletsfortraining.Ouralgorithmlearnsastaterepresentation,alongwithafunctionthatpredictshowtherepresentationchangeswhentheagenttakesgivenactionsintheenvironment.Insteadofusingoursystemtopreprocessinputsforamodel-freereinforcementlearner,aswedidinthepreviouschapter,nowwetakeadvantageofapredictionfunction,whichisusedasaforwardmodelforsearchonagraphinaviewpoint-matchingtask.Usingarepresentationthatislearnedtobemaximallypredictableforthepredictorisfoundtooutperformpretrainedrepresentations.Thedata-efficiencyandoverallperformanceofourapproachisshowntorivalstandardreinforcementlearningmethods,andourlearnedrepresentationtransferssuccessfullytonovelenvironments.Therestofthechapterisorganizedasfollows:inSection 4.1,wemotivatetheusefulnessofrepresentationsthatarepredictableinthescopeofvisualplanning,andwementionhowweintendtoovercomeacommonpitfallintheirdesign.WemoveonwithdiscussingthemainclassesofrelatedworkinSection 4.2,andsummarizethemostrelevantarticlesfromamongthem.InSection 4.3,wediscussinconcretedetailthedesignofourrepresentation,howweuseitforplanning,andweintroduceanexperimentalenvironmentofourowndesign.TheresultsofourexperimentsarepresentedinSection 4.4,andweconcludewithadiscussionoftheproposedmethodologyandprospectsforfutureworkinSection 4.5.
4.1 Introduction
Deeply-learnedplanningmethodsareoftenbasedonlearningrepresentationsthatareoptimizedforunrelatedtasks.Forexample,theymightbetrainedtoreconstructobservationsoftheenvironment,suchastheconvolutionalautoencoderfromthepreviouschapter.Theserepresentationsarethencombinedwithpredictorfunctionsforsimulatingrolloutstonavigatetheenvironment.Weproposetoratherlearnrepresentationssuchthattheyaredirectlyoptimizedforthetaskathand:tobemaximallypredictableforthepredictorfunction.Thisresultsinrepresentationsthatarewell-suited,bydesign,forthedownstreamtaskofplanning,wherethelearnedpredictorfunctionisusedasaforwardmodel.Whilemodernreinforcementlearningalgorithmsreachsuper-humanperformanceontaskssuchasgameplaying,theyremainwoefullysampleinefficientcomparedtohumans.Analgorithmthatisdata-efficient (hlynsson2019measuring)requiresonlyfewsamplesforgoodperformanceandthestudyofdata-efficientcontroliscurrentlyanactiveresearcharea(corneil2018efficient; buckman2018sample; du2019good; saphal2020seerl).Dimensionalityreductionisapowerfultoolforincreasingthedata-efficiencyofmachinelearningmethods.Therehasbeenmuchrecentworkonmethodsthattakeadvantageofcompact,low-dimensionalrepresentationsofstatesforsearchandexploration (kurutach2018learning; corneil2018efficient; xu2019regression).Oneoftheadvantagesofthisapproachisthatagoodrepresentationaidsinfasterandmoreaccurateplanning.Thisholdsinparticularwhenthelatentspaceisofmuchlowerdimensionalitythanthestatespace(hamilton2014efficient).Forhigh-dimensionalinputs,suchasimagedata,arepresentationfunctionisfrequentlylearnedtoreducethecomplexityforacontroller.Indeepreinforcementlearning,therepresentationandthecontrollerarelearnedsimultaneously.Similarly,arepresentationcaninprinciplebelearnedalongwithaforwardmodelforclassicalplanninginhigh-dimensionalspace.WedothiswithourLARPnetwork,whichisaneuralnetwork-basedmethodforlearningastaterepresentationandatransitionfunctionforplanningwithinthelearnedlatentspace(Fig. 4.1).
Duringtraining,therepresentationandthepredictorarelearnedsimultaneouslyfromtransitionsinaself-supervisedmanner.Wetrainthepredictortopredictthemostlikelyfuturerepresentation,givenacurrentrepresentationandanaction.Thepredictoristhenusedforplanningbynavigatingthelatentspacedefinedbytherepresentationtoreachagoalstate.Optimizingcontrolinthismanner,afterlearninganenvironmentmodel,hastheadvantageofallowingforlearningnewrewardfunctionsinafastanddata-efficientmanner.Aftertherepresentationislearned,wefindsaidgoalstatebyconventionalpathplanning.Disentanglingtherewardfromthetransitionfunctioninsuchawayishelpfulwhenlearningformultipleorchangingrewardfunctions,andaidswithlearningwhenthereisnorewardavailableatall.Thus,itisalsogoodforasparseoradelayed-rewardsetting.Aproblemthatcanariseinrepresentationlearningistheoneoftrivialfeatures.Thiscanhappenwhenthemethodisoptimizinganobjectivefunctionthathasastraightforward,butuseless,solution.Forexample,SlowFeatureAnalysis(SFA)(wiskott2002slow)hastheobjectiveofextractingthefeaturesoftimeseriesdatathatvarytheleastwithtime.Thisiseasilyfulfilledbyconstantfunctions,soSFArequiresthattherepresentationshaveavarianceof–whichconstantfunctionscannotfulfill.Constantfeatureswouldsimilarlybemaximallypredictablerepresentationsforoursystem.Therefore,westudythreedifferentapproachestopreventthistrivialrepresentationfrombeinglearned,weeither:(i)designthearchitecturesuchthattheoutputissphered,(ii)regularizeitwithacontrastivelossterm,or(iii)includeareconstructionlosstermalongwithanadditionaldecodermodule.Wecomparetheseapproachesandvalidateourmethodexperimentallyonavisualenvironment:aviewpoint-matchingtaskusingtheNORBdataset(lecun2004learning),wheretheagentispresentedwithastartingviewpointofanobjectandthetaskistoproduceasequenceofactionssuchthattheagentendsupwiththegoalviewpoint.AstheNORBdatasetisembeddableonacylinder(hadsell2006dimensionality; schuler2018gradient)orasphere(wang2018toybox),wecanvisualizetheactionsastraversingtheembeddedmanifold.Ourapproachcomparesfavorablytostate-of-the-artmethodsonourtestbedwithrespecttodata-efficiency,butourasymptoticperformanceisstilloutclassedbyotherapproaches.
4.2 Relatedwork
Mostoftherelatedworkfallsintothecategoriesofreinforcementlearning,visualplanning,orrepresentationlearning.Theprimarydifferencebetweenoursandothermodel-basedmethodsisthattherepresentationislearnedbyoptimizingauxiliaryobjectiveswhicharenotdirectlyusefulforsolvingthemaintask.
4.2.1 Reinforcementlearning
Therearemanyworksintheliteraturethatalsoapproximatethetransitionfunctionofenvironments,forinstancebyperformingexplicitlatent-spaceplanningcomputations(tamar2016value; gal2016improving; henaff2017model; srinivas2018universal; chua2018deep; hafner2019learning)aspartoflearningandexecutingpolicies.gelada2019deepmdptrainanRLagenttosimultaneouslypredictrewardsaswellasfuturelatentstates.Ourworkisdistinctfromthese,aswearenotassumingarewardsignalduringtraining.ha2018worldcombinevision,memory,andcontrollerforlearningamodeloftheworldbeforelearningadecisionmodel.Apredictivemodelistrainedinanunsupervisedmanner,permittingtheagenttolearnpoliciescompletelywithinitslearnedlatentspacerepresentationoftheenvironment.Themaindifferenceisthattheyfirstapproximatethestatedistributionusingavariationalautoencoder,producingtheencodedlatentspace.Incontrast,ourrepresentationislearnedsuchthatitismaximallypredictableforthepredictornetwork.Similartoourtrainingsetup,oh2015actionpredictfutureframesinATARIenvironmentsconditionedonactions.Thepredictedframesareusedforlearningthetransitionfunctionoftheenvironment,e.g.forimprovingexplorationbyinformingagentsofwhichactionsaremorelikelytoresultinunseenstates.Ourworkdiffersasweareactingwithinalearnedlatentspaceandnotthefullinputspace,andourrepresentationsareusedinaclassicalplanningparadigmwithstartandgoalstatesinsteadofareinforcementlearningone.
4.2.2 Visualplanning
Wedefinevisualplanningastheproblemofsynthesizinganactionsequencetogenerateatargetstatefromaninitialstate,andallthestatesareobservedasimages.VariationalStateTabulations (corneil2018efficient)learnastaterepresentationinadditiontoatransferfunctionoverthelatentspace.However,theirobservationspaceisdiscretizedintoatableusingavariationalapproach,asopposedtoourcontinuousrepresentation.Acontinuousrepresentationcircumventstheproblemofhavingtodeterminethesizeofsuchatableinadvanceorduringtraining.Similarly,cuccu2018playingdiscretizevisualinputusingunsupervisedvectorquantizationandusethatrepresentationforlearningcontrollersforAtarigames.Inspiredbyclassicsymbolicplanning,RegressionPlanningNetworks(xu2019regression)createaplanbackwardfromasymbolicgoal.Wedonothaveaccesstohigh-levelsymbolicgoalinformationforourmethod,andweassumethatonlyhigh-dimensionalvisualcuesarereceivedfromtheenvironment.TopologicalmemoriesoftheenvironmentarebuiltinSemi-parametricTopologicalMemories(savinov2018semi)afterbeingprovidedwithobservationsequencesfromhumansexploringtheenvironment.Nodesareconnectedifapredictorestimatesthattheyareclose.Themethodhasproblemswithgeneralization,whicharereducedinHallucinativeTopologicalMemories(liu2020hallucinative),wherethemethodalsoadmitsadescriptionoftheenvironment,suchasamaporalayoutvector,whichtheagentcanuseduringplanning.Ourvisualplanningmethoddoesnotreceiveanyadditionalinformationonunseenenvironmentsanddoesnotdependonmanualexplorationduringtraining.CausalInfoGAN(kurutach2018learning)andrelatedmethods(wang2019learning)arebasedongenerativeadversarialnetworks(GANs)(goodfellow2014generative),inspiredbyInfoGANinparticular(chen2016infogan),forlearningaplannablerepresentation.AGANistrainedforencodingstartandgoalstates,andtheyplanatrajectoryintherepresentationspaceaswellasreconstructingintermediateobservationsintheplan.Ourmethodisdifferentasitdoesnotneedtoreconstructtheobservationsandtheforwardmodelisdirectlyoptimizedforprediction.
4.2.3 Prediction-basedrepresentationlearning
InPredictableFeatureAnalysis(richthofer2015predictable),representationsarelearnedthatarepredictablebyautoregressionprocesses.Ourmethodismoreflexibleandscalesbettertohigherdimensionsasthepredictorcanbeanydifferentiablefunction.Usingtheoutputofothernetworksaspredictiontargetsinsteadoftheoriginalpixelsisnotnew.Thecasewheretheoutputofalargermodelisthetargetforasmallermodelisknownasknowledgedistillation(bucilua2006model; hinton2015distilling).Thisisusedforcompressingamodelensembleintoasinglemodel.vondrick2016anticipatinglearntomakehigh-levelsemanticpredictionsoffutureframesinvideodata.Givenacurrentframe,aneuralnetworkpredictstherepresentationofafutureframe.Ourapproachisnotconstrainedonlytopretrainedrepresentations,welearnourrepresentationtogetherwiththepredictionnetwork.Moreover,weextendthisgeneralideabyalsoadmittinganactionastheinputtoourpredictornetwork.
4.3 Materialsandmethods
Inthiswork,westudydifferentrepresentationsforlearningthetransitionfunctionofapartiallyobservableMDP(POMDP)andproposeanetworkthatjointlylearnsarepresentationwithapredictionmodelandapplyitforlatentspaceplanning.WesummarizeherethedifferentingredientsoftheLARPnetwork–ourproposedsolution.Moredetaileddescriptionswillfollowinlatersections.Trainingthepredictornetwork:Weuseatwo-streamfullyconnectedneuralnetworktopredicttherepresentationofthefuturestategiventhecurrentstate’srepresentationandtheactionbridgingthosetwostates.Thepredictormoduleistrainedwithasimplemean-squarederrorterm.Handlingconstantsolutions:Therepresentationcouldbetransferredfromotherdomainsorlearnedfromscratchonthetask.IftherepresentationislearnedsimultaneouslywithanestimateofaMarkovdecisionprocess’s(MDP)transitionfunction,precautionsmustbetakensuchthatthepredictionlossisnottriviallyminimizedbyarepresentationthatisconstantoverallstates.Weconsiderthreeapproachesfortacklingtheproblem:spheringtheoutput,regularizingwithacontrastivelossterm,andregularizingwithareconstructivelossterm.Searchinginthelatentspace:Combiningtherepresentationwiththepredictornetwork,wecansearchinthelatentspaceuntilanodeisfoundthathasthelargestsimilaritytotherepresentationofthegoalviewpointusingamodifiedbest-firstsearchalgorithm.NORBenvironment:WeusetheNORBdataset (lecun2004learning)forourexperiments.Thisdatasetconsistsofimagesofobjectsfromdifferentviewpoints,andwecreateviewpoint-matchingtasksfromthedataset.
4.3.1 Ongoodrepresentations
Werelyonheuristicstoprovidesufficientevidenceforagood—albeitnotnecessarilyoptimal—decisionateverytimesteptoreachthegoal.Here,weusetheEuclideandistanceinrepresentationspace:asequenceofactionsispreferrediftheirendlocationisclosesttothegoal.TheusefulnessofthisheuristicsdependsonhowwellandhowcoherentlytheEuclideandistanceencodestheactualdistancetothegoalstateintermsofthenumberofactions.Alearnedpredictornetworkapproximatesthetransitionfunctionoftheenvironmentforplanninginthelatentspacedefinedbysomerepresentation.Thisraisesthequestion:whatistheidealrepresentationforlatentspaceplanning?Ourexperimentsshowthatanopenlyavailable,general-purposerepresentation,suchasapretrainedVGG16(simonyan2014very),canalreadyprovidesufficientguidancetoapplysuchheuristicseffectively.Betterstillarerepresentationmodelsthataretrainedonthedataathand,forexample,uniformmanifoldapproximationandprojection(UMAP)(mcinnes2018umap)orvariationalauto-encoders(VAEs)(kingma2013auto).Onemight,however,askwhataparticularlysuitedrepresentationmightlooklikewhenattainabilityisignored.Itwouldneedtotakethetopologicalstructureoftheunderlyingdatamanifoldintoaccount,sothattheEuclideandistancebecomesagoodproxyforthegeodesicdistance.Oneclassofmethodsthatsatisfythisarespectralembeddings,suchasLaplacianEigenmaps(LEMs)(belkin2003laplacian).Theirrepresentationsaresmoothanddiscriminativewhichisidealforourpurpose.However,theydonoteasilyproduceout-of-sampleembeddings,sotheywillonlybeappliedinanin-samplefashiontoserveasacontrolexperiment,yieldingoptimalperformance.
4.3.2 Predictornetwork
Astherepresentationisusedbythepredictornetwork,wewantittobepredictable.Thus,weoptimizetherepresentationlearnersimultaneouslywiththepredictornetwork,inanend-to-endmanner.Supposewehavearepresentationmapandatrainingsetoflabeleddatatuples,whereistheobservationattimestepandisanactionresultinginastatewithobservation.Wetrainthepredictor,parameterizedby,byminimizingthemean-squarederrorlossover’sparameters:
(4.1) |
whereisoursetoftrainingdata.Weconstructasatwo-stream,fullyconnected,neuralnetwork.Usingthispredictorwecancarryoutplanninginthelatentspacedefinedby.Byplanning,wemeanthatthereisastartstatewithobservationandagoalstatewithandwewanttofindasequenceofactionsconnectingthem.Thenetworkoutputstheexpectedrepresentationafteracting.Usingthis,wecanformulateplanningasaclassicalpathfindingorgraphtraversalproblem.
4.3.3 Avoidingtrivialsolutions
Inthecasewhereistrainableandparameterizedby,thelossforthewholesystemthatonlycaresaboutmaximizingpredictabilityis
(4.2) |
foragivendataset.Withnoconstraintsonthefamilyoffunctionsthatcanbelongto,weruntheriskthattherepresentationcollapsestoaconstant.Constantfunctionstriviallyyieldzerolossforanysetifoutputstheinputstateagainforany,i.e:Constantrepresentationsareoptimalwithrespecttopredictability,buttheyareunfortunatelyuselessforplanning,asweneedtodiscriminatedifferentstates.ThisobjectiveisnotpresentintheproposedlossfunctioninEq. (4.2)andwemustthusaddaconstraintoranotherlosstermtofacilitatedifferentiatingthedifferentstates.Thereareseveralwaystolimitthefunctionspacesuchthatconstantfunctionsarenotincluded,forexamplewithdecoder(goroshin2015learning)oradversarial(denton2017unsupervised)lossterms.Inthiswork,wedothiswith(i)aspheringlayer,(ii)acontrastiveloss,or(iii)areconstructiveloss.
(i)Spheringtheoutput
TheproblemoftrivialsolutionsissolvedinSFA(wiskott2002slow)andrelatedmethods(escalante2013solve; schuler2018gradient)byconstrainingtheoverallcovarianceoftheoutputtobe.Includingthisconstrainttooursettingyieldstheoptimizationformulation:
(4.3) | ||||||
subjectto | ||||||
Weenforcethisconstraintinournetworkviaarchitecturedesign.Thelastlayerperformsdifferentiablesphering(schuler2018gradient; hlynsson2019learning)ofthesecond-to-lastlayer’soutputusingthewhiteningmatrix.Wegetusingpoweriterationofthefollowingiterativeformula:
(4.4) |
wherethesuperscripttrackstheiterationnumberandcanbeanarbitraryvector.Thepoweriterationalgorithmconvergestothelargesteigenvectorofamatrixinafewhundred,quickiterations.Theeigenvalueisdetermined,andwesubtracttheeigenvectorfromthematrix:
(4.5) |
theprocessisrepeateduntilthespheringmatrixisfound
(4.6) |
Thewholesystem,includingthespheringlayer,canbeseeninFig 4.2,withanabstractconvolutionalneuralnetworkastherepresentationandafully-connectedneuralnetworkasthepredictionfunction.
(ii)Contrastiveloss
Constantsolutionscanalsobedealtwithinthelossfunctioninsteadofviaarchitecturedesign.hadsell2006dimensionalityproposetosolvethiswithalossfunctionthatpullstogethertherepresentationofsimilarobjects(inourcase,statesthatarereachablefromeachotherwithasingleaction)butpushesaparttherepresentationofdissimilarones:
(4.7) |
whereisamarginandissome—usuallytheEuclidean—norm.Therepresentationofdissimilarobjectsispushedapartonlyiftheinequality
(4.8) |
isviolated.Duringeachtrainingstep,wecompareeachobservationtoasimilarandadissimilarobservationsimultaneously (schroff2015facenet)bypassingatripletof(positive,anchor,negative)observationsduringtrainingtothreecopiesof.Inourexperiments,thepositivecorrespondstothepredictedembeddingofgivenand,theanchor,isthetrueresultingembeddingafteranactionisperformedinstateand,thenegative,istherepresentationofanarbitrarilychosenobservationthatisnotreachablefromwithasingleaction.Forenvironmentswherethisisdeterminable,suchasinourexperiments,thiscanbeassessedfromtheenvironment’sfullstate.Whenthisinformationisn’tavailable,ensuringforandthatisagoodproxy,eventhoughthiscanresultinsomeincorrecttriplets.Forexample,whentheagentrunsinaself-intersectingpath.Wedefinetherepresentationoftheobservationattimestepasandthenext-steppredictionforreadabilityandour(positive,anchor,negative)tripletisthusandweminimizethetripletloss:
(4.9) |
Itwouldseemthatandareinterchangeable,sincethesecondtermisincludedonlytopreventtherepresentationfromcollapsingintoaconstant.However,ifthelossfunctionis
(4.10) |
thenthenetworkisrewardedduringtrainingformakingpooratpredictingthenextrepresentationinsteadofsimplypushingtherepresentationofandawayfromeachother.Therearetwomainwaystosetthemargin,oneisdynamicallydeterminingitperbatch(sun2014deep).Theother,whichwechoose,isconstrainingtherepresentationtobeonahypersphereusingnormalizationandsettingasmallconstantmarginsuchas(schroff2015facenet).ThearchitectureforthetrainingschemeusingthecontrastivelossregularizationisdepictedinFig. 4.3.
(iii)Reconstructiveloss
TrivialsolutionsareavoidedbyGoroshinetal.(goroshin2015learning)byintroducingadecodernetworktoasystemthatwouldotherwiseconvergetoaconstantrepresentation.Weincorporatethisintuitionintoourframeworkwiththelossfunction
(4.11) |
whereisapositive,realcoefficienttocontroltheregularizationstrength.Fig. 4.4showshowthemodelsandlossfunctionsarerelatedduringthetrainingoftherepresentationandpredictorusingbothapredictiveandareconstructivelossterm.
ThedesiredeffectoftheregularizationcanalsobeachievedbyreplacingthesecondterminEq. (4.3.3)with.Bydoingthiswewouldmaximizethereconstructivepropertyofthelatentcodeinandofitself,whichisnotinherentlyusefulforplanning.Weinsteadaddanadditionallevelofpredictivepowerin:inadditiontopredictingthenextrepresentation,itspredictionmustalsobeusefulinconjunctionwiththedecoderforreconstructingthenewtrueobservation.Thisapproachcanhavethelargestcomputationaloverheadofthethree,dependingonthesizeofthedecoder.Weconstructthedecodernetworksuchthatitcloselymirrorsthearchitectureof,withconvolutionsreplacedbytransposedconvolutionsandmax-poolingreplacedbyupsampling.
4.3.4 Trainingthepredictornetwork
WetraintherepresentationnetworkandpredictornetworkjointlybyminimizingEq. (2),Eq. (10)orEq. (12).Thepredictornetworkcanalsobetrainedonitsownforafixedrepresentationmap.Inthiscase,istaskedasbeforewithpredictingaftertheactionisperformedinthestatewithobservationbyminimizingthemean-squarederrorbetweenand.ThenetworksarebuiltwithKeras(chollet2015keras)andoptimizedwithrmsprop(tieleman2012lecture).
4.3.5 Planningintransition-learneddomainrepresentationspace
Weuseamodifiedbest-firstsearchalgorithmwiththetrainedrepresentationsforourexperiments(Algorithm2).
Fromagivenstate,theagentperformsasimulatedrollouttosearchforthegoalstate.Foreachaction,theinitialobservationispassedtothepredictorfunctionalongwiththeaction.Thisresultsinapredictednext-steprepresentation,whichisaddedtoaset.Theactionstakensofarandresultingineachpredictionarenotedalso.Therepresentationthatisclosesttothegoal(usingforexampletheEuclideandistance)isthentakenforconsiderationandremovedfromtheset.Thisprocessisrepeateduntilthemaximumnumberoftrialsisreached.Thealgorithmthenoutputsthesequenceofactionsresultinginthepredictedrepresentationthatistheclosesttothegoalrepresentation.Tomakethealgorithmfaster,weonlyconsiderpathsthatdonottakeustoastatethathasalreadybeenevaluated,eveniftheremightbeadifferenceinthepredictionsfromgoingthisroundaboutway.Thatis,ifapermutationoftheactionsinthenextpathtobeconsideredisalreadyinanevaluatedpath,itwillbeskipped.Thishasthesameeffectastranspositiontablesusedtospeedupsearchingametrees.Pathsmightbeproducedwithredundancies,whichcanbeamendedwithpath-simplifyingroutines(e.g.takeonestepforwardinsteadofonestepleft,oneforwardthenoneright).WedoModel-PredictiveControl(garcia1989model),thatis,afterapathisfound,oneactionisperformedandanewpathisrecalculated,startingfromthenewposition.Sincetheplanningispossiblyoveralongtimehorizon,wemighthaveacasewhereapreviousstateisrevisited.Toavoidloopsresultingfromthis,wekeeptrackofvisitedstate-actionpairsandavoidanalreadychosenactionforagivenstate.
4.3.6 NORBviewpoint-matchingexperiments
Forourexperiments,wecreateanOpenAIGymenvironmentbasedonthesmallNORBdataset(lecun2004learning).Thecodefortheenvironmentisavailableathttps://github.com/wiskott-lab/gym-norbandrequiresthepickledNORBdatasethostedathttps://s3.amazonaws.com/unsupervised-exercises/norb.p.Thedatasetcontains50toys,eachbelongingtooneoffivecategories:four-leggedanimals,humanfigures,airplanes,trucks,andcars.Eachobjecthasstereoscopicimagesundersixlightingconditions,9elevations,and18azimuths(in-scenerotation).Inalloftheexperiments,wetrainthemethodsonninecarclasstoys,testingontheothertoys.EachtrialinthecorrespondingRLenvironmentrevolvesaroundasingleobjectunderagivenlightingcondition.Theagentispresentedwithastartandagoalviewpointoftheobjectandtransitionsbetweenimagesuntilthecurrentviewpointmatchesthegoalwhereeachactionoperatesthecamera.Tobeconcrete,theactionscorrespondtoturningaturntablebackandforthby,movingthecameraupordownbyand,inoneexperiment,changingthelighting.Thetrialisasuccessiftheagentmanagestochangeviewpointsfromthestartpositionuntilthegoalviewpointismatchedinfewerthantwicetheminimumnumberofactionsnecessary.Wecomparetherepresentationslearnedusingthethreevariantsofourmethodtofiverepresentationsfromtheliterature,namely(i)LaplacianEigenmaps(belkin2003laplacian),(ii)thesecond-to-lastlayerofVGG16pretrainedonImageNet(deng2009imagenet),(iii)UMAPembeddings(mcinnes2018umap),(iv)convolutionalencoder(masci2011stacked)and(v)VAEcodes(kingma2013auto).Asfixedrepresentationsdonotchangethroughoutthetraining,theycanbesavedtodisk,speedingupthetraining.Asareference,weconsiderthreereinforcementlearningmethodsworkingdirectlyontheinputimages:(i)DeepQ-Networks(DQN) (mnih2013playing),(ii)ProximalPolicyOptimization(PPO) (schulman2017proximal)and(iii)WorldModels (ha2018world).Thedatasetisturnedintoagraphforsearchbysettingeachimageasanodeandeachviewpoint-changingactionasanedge.Thetaskoftheagentistotransitionbetweenneighboringviewinganglesuntilagoalviewpointisreached.Thetotalnumberoftrainingsamplesisfixedat25600.Forourmethod,asampleisasingle(,,)triplettobepredictedwhilefortheregularRLmethodsitisa(,,,)tuple.
4.3.7 Modelarchitectures
Input
ThenetworkencodesthefullNORBinput,apixelgrayscaleimage,tolower-dimensionalrepresentations.Thesystemasawholereceivestheimagefromthecurrentviewpoint,theimageofthegoalviewpointandaone-hotencodingofthetakenaction.Theimageinputsareconvertedfromintegersrangingfrom0to255tofloatingpointnumbersrangingfrom0to1.
Representationlearnerarchitecture
Weusethesamearchitectureforthenetworkinallofourexperimentsexceptforvaryingtheoutputdimension,Table 4.1.
Layer | Filters/Units | Kernelsize | Strides | Outputshape | Activation |
---|---|---|---|---|---|
Input | (96,96,1) | ||||
Convolutional | 64 | (45,45,64) | ReLU | ||
Max-pooling | (22,22,64) | ||||
Convolutional | 128 | (9,9,128) | ReLU | ||
Flatten | (10368) | ||||
Dense | 600 | (600) | ReLU | ||
Dense | #Features | (#Features) | Linear |
Regularizingdecoderarchitecture
ThedecodernetworkhasthearchitecturelistedinTable 4.2.Itisdesignedtoapproximatelyinverseeachoperationintheoriginalnetwork.
Layer | Filters/Units | KernelSize | Strides | Outputshape | Activation |
---|---|---|---|---|---|
Input | (#Features) | ||||
Dense | 512 | (512) | ReLU | ||
BN | (512) | ||||
Dense | 12800 | (12800) | ReLU | ||
BN | (12800) | ||||
Reshape | (10,10,128) | ||||
CT | 128 | (23,23,128) | ReLU | ||
Upsampling | (46,46,128) | ||||
BN | (46,46,128) | ||||
CT | 64 | (95,95,64) | ReLU | ||
BN | (95,95,64) | ||||
CT | 1 | (96,96,1) | Sigmoid |
Predictornetwork
Thepredictornetworkisatwo-streamdenseneuralnetwork.Eachstreamconsistsofadenselayerwitharectifiedlinearunit(ReLU)activation,followedbyabatchnormalization(BatchNorm)layer.Theoutputsofthesestreamsarethenconcatenatedandpassedthrough3denselayerswithReLUactivations,eachonefollowedbyaBatchNorm,andthenanoutputdenselayer,seeTable4.3.
Layer | Filters/Units | Outputshape | Activation |
---|---|---|---|
Stream:Input | (#Features) | ReLU | |
Stream:Dense | 256 | (256) | ReLU |
Stream:BatchNormalization | (256) | ||
Stream:Input | (#Actions) | ReLU | |
Stream:Dense | 128 | (128) | ReLU |
Stream:BatchNormalization | (128) | ||
Concatenateandstreams | (384) | ||
Dense | 256 | (256) | ReLU |
BatchNormalization | (256) | ||
Dense | 256 | (256) | ReLU |
BatchNormalization | (256) | ||
Dense | 128 | (128) | ReLU |
BatchNormalization | (128) | ||
Dense | #Features | (#Features) | Linear |
4.4 Results
Withourempiricalevaluation,weaimtoanswerthefollowingresearchquestions:
-
(Monotonicity)IstheEuclideandistancebetweenasuitablerepresentationandthegoalrepresentationdecreasingasthenumberofactionsthatseparatethemdecreases?
-
(Trainedpredictability)Istrainingarepresentationforpredictability,asproposed,feasible?
-
(Dimensionality)Whatisthebestdimensionalityofthelatentspaceforourplanningtasks?
-
(Solutionconstraints)Intermsofplanningperformance,whatarepromisingconstraintstoplaceontherepresentationtoavoidtrivialsolutions?
-
(Benchmarking)HowdoesplanningwithLARPcomparetoothermethodsfromtheRLliterature?
-
(Generalization)Howwelldoesourmethodgeneralizetounseenenvironments?
wewillrefertotheseresearchquestionsbynumberbelowastheygetaddressed.
4.4.1 Latentspacevisualization
Whentherepresentationandpredictornetworksaretrained,weapplyAlgorithm2totheviewpoint-matchingtask.Asdescribedabove,thegoalistofindasequenceofactionsthatconnectsthestartstatetothegoalstate,wherethetwostatesdifferintheirconfigurations.Tosupportthequalitativeanalysisofthelatentspace,weplotheatmapsofsimilaritybetweenthegoalrepresentationandthepredictedrepresentationofnodesduringsearch(Fig 4.6).Ofthe10cartoysintheNORBdataset,werandomlychose9forourtrainingsetandtestontheremainingone.
In-sampleembedding:LaplacianEigenmaps
First,weconsiderresearchquestion1(monotonicity).Inordertogetthebest-caserepresentation,weembedthetoyusingLaplacianEigenmaps.EmbeddingasingletoyinthreedimensionsusingLaplacianEigenmapsresultsinatube-likeembeddingthatencodesbothelevationandazimuthangles,seeFig4.5.Threedimensionsareneededsothatthecyclicazimuthcanbeembeddedcorrectlyasand.
Iftherepresentationisnowusedtotrainthepredictor,onewouldexpectthattherepresentationbecomesmonotonicallymoresimilartothegoalrepresentationasthestatemovestowardthegoal.InFig4.6weseethatthisisthecaseandthatthisbehaviorcanbeeffectivelyusedforagreedyheuristics.Whilethemonotonicityisnotalwaysexactduetoerrorsintheprediction,Fig4.6stillqualitativelyillustratesabest-casescenario.
Weconcludefromthisthat,forasuitablerepresentation,theEuclideandistancebetweenacurrentrepresentationandthegoalrepresentationismonotonicallyincreasingasafunctionofthenumberofactionsthatseparatethem.Thissupportstheuseofaprediction-basedlatentspacesearchforplanning.
Out-of-sampleembedding:pretrainedVGG16representation
Next,weconsiderthepretrainedrepresentationoftheVGG16networktogetarepresentationthatgeneralizestonewobjects.Wetrainthepredictornetworkandplottheheatmapofthepredictedsimilaritybetweeneachstateandthegoalstate,beginningfromthestartstate,inFig 4.7.
Theheatdistributioninthiscaseismorenoisy.Togetaviewoftheexpectedheatmapprofile,weaverageseveralfiguresofthistypetoshowbasinsofattractionduringthesearch.Eachheatmapisshiftedsuchthatthegoalpositionisatthebottom,middlerow(Fig 4.8,a).Here,itisobviousthatthegoalandtheflipped(azimuth)versionofthegoalareattractorstates.Thisisduetotherepresentationmapbeingsensitivetotheroughshapeoftheobject,butbeingunabletodistinguishfinerdetails.In(Fig 4.8,b)wedisplayanaggregateheatmapwhentheagentcanalsochangethelightingconditions.Ourvisualizationsshowagradienttowardthegoalstateinadditiontovisuallysimilarfar-away-states,sometimescausingthealgorithmtoproducesolutionsthatarethepolaroppositeofthegoalconcerningtheazimuth.Predictionerrorsalsopreventtheplanningalgorithmfromfindingtheexactgoalforeverytask,evenifitisnotdistractedbythepolar-opposite.
Toinvestigatetheaccuracyofthesearchwithrespecttoeachdimensionseparately,weplotthehistogramofdistancesbetweenthegoalstatesandthesolutionstatesinFig 4.9.Thegoalandstartstatesarechosenrandomly,withtherestrictionthattheazimuthdistanceandelevationdistancebetweenthemareeachuniformlysampled.Fortherestofthechapter,alltrialsfollowthissamplingprocedure.TheresultslooklessaccurateforelevationthanazimuthbecausetheelevationchangesaresmallerthantheazimuthchangesintheNORBdataset.ThedifferencebetweenthegoalandsolutionviewpointsinFig 4.7left,forexample,ishardlyvisible.Ifonewouldscalethehistogramsbyangleandnotbybins,thedrop-offwouldbesimilar.
4.4.2 Latentspacedimensionality
Withthenextexperiment,weaimtoanswerresearchquestion2(trainedpredictability).Whiletuningthedetailsofthedesign,wealsotackleresearchquestions3(dimensionality)and4(solutionconstraints).Wedoanablationstudyofthedimensionalityoftherepresentationforourmethod(Table 4.4).ThetestcarisanunseencartoyfromtheNORBdataset,andthetraincarcomesfromthetrainingset.
Representation | Dimensions | TrainingCar() | TestCar() |
---|---|---|---|
LARP(Contrastive) | 96 | 59.3 | 56.8 |
64 | 64.1 | 60.5 | |
32 | 72.3 | 59.4 | |
16 | 74.1 | 59.3 | |
8 | 82.7 | 58.0 | |
LARP(Sphering) | 96 | 41.1 | 37.8 |
64 | 93.9 | 53.7 | |
32 | 89.8 | 51.9 | |
16 | 85.2 | 42.6 | |
8 | 85.1 | 40.1 | |
LARP(Decoder) | 96 | 58.0 | 51.9 |
64 | 79.5 | 63.0 | |
32 | 77.8 | 61.7 | |
16 | 51.9 | 45.2 | |
8 | 51.1 | 42.4 | |
VGG16 | 902 | 62.4 | 55.1 |
RandomSteps | 3.5 | 3.5 |
Thereisnoclearwinner:thenetworkwiththespheringlayerdoesthebestononeofthecarsusedduringtraining,whilethereconstructive-lossnetworkdoesthebestontheheld-outtestcar.Thesharpdifferenceinperformancebetween64and94sphering-regularizedrepresentationcanbeexplainedbythenumericalinstabilityofthepoweriterationmethodfortoolargematrixdimensions.TheVGG16representationisnotthehighestperformeronanyofthecartoys.ManyofVGG16’srepresentationvaluesare0forallimagesintheNORBdataset,soweonlyusethosethatarenonzeroforanyoftheimages.Wesuggestthatthishighnumberofdeadunitsisduetotherepresentationbeingtoogeneralforthetaskofmanipulatingrelativelyhomogenousobjects.Anotherdrawbackofusingpretrainednetworksisthatinformationmightbeencodedthatisunimportantforthetask.Thishastheeffectthatoursearchmethodisnotguaranteedtooutputthecorrectsolutioninthelatentspace,astheremightbedistractingpocketsoflocalminima.Therandombaselinehasanaveragesuccessrateof3.5,whichisveryclearlyoutperformedbyourmethod.As64isthebestdimensionalityfortherepresentationonaverage,wecontinuewiththatnumberforourmethodinthetransferlearningexperiment.Weconcludethattheproposedmethodoftrainingarepresentationforpredictabilityisfeasible.Sofarwehaveevidencethat64isthebestdimensionalityoftherepresentation’slatentspaceforourplanningtasks.However,itisnotyetconclusivewhatthebestrestrictionistoplaceontherepresentationtoavoidthetrivialsolution,intermsofplanningperformance.
4.4.3 ComparisonwithotherRLmethods
Nowwedivertourattentiontoresearchquestion5(benchmarking),wherewecompareourmethodtotheliterature.ForthecomparisonwithstandardRLmethods,weusethedefaultconfigurationsofthemodel-freemethodsDQNandPPOasdefinedinOpenAIBaselines (baselines).Ourmodel-basedcomparisonischosentobeworldmodels(ha2018world)fromhttps://github.com/zacwellmer/WorldModels.WemakesurethatthecomparedRLmethodsaresimilartooursystemintermsofthenumberofparametersaswellasarchitecturelayoutandcomparethemwithourmethodonthecarviewpoint-matchingtask.ThetasksetupisthesameasbeforeandisconvertedtoanOpenAIgymenvironment:astartobservationandagoalobservationarepassedtotheagent.Iftheagentmanagestoreachitwithin2timestheminimumnumberofactionsrequired(theminimumnumberiscalculatedbytheenvironment),theagentreceivesarewardandthetaskisconsideredasuccess.Otherwise,norewardisgiven.TheresultsofthecomparisoncanbeviewedinFig. 4.10.Eachpointinthecurvecontainseachmethod’smeansuccessrate:theaverageofthecumulativerewardfrom100testepisodesfrom5differentinstantiationsoftheRLlearner,soitistheaveragerewardover500episodesintotal.Thetestepisodesaredoneonthesameenvironmentasisusedfortraining,exceptthatthepolicyismaximallyexploitingandminimallyexploring.
Inourexperiments,theDQNnetworksaremuchmoresampleinefficientthanPPO,whichinturnismoresampleinefficientthanourmethod.However,ourmethodismoretime-consumingduringtesttime.Werequireaforwardpassofthepredictornetworkforeachnodethatissearchedbeforewetakethenextstep,whichcangrowrapidlyifthetargetisfaraway.Incontrast,onlyasinglepassthroughthetraditionalRLnetworksisrequiredtocomputethenextaction.Ourmethodreachessuccessrateonthetraincar(Table 4.4)using25.6ksamples,butthebestPPOrunonlyreachesaftertrainingonthesamenumberofsamples.ThebestsinglePPOrunneeded41.3ksamplestogethigherthansuccessrate,andtheaverageperformanceishigherthanataround55ksamples.Afterthat,somePPOlearnersdeclinedagaininperformance.TheworldmodelspolicyquicklyreachesthesamelevelofperformanceasDQNgotafter50kstepsandPPOafterapproximately8ksteps,butitdoesn’timprovebeyondthat.
WeconcludethatourmethodcomparesfavorablytoothermethodsfromtheRLliteratureintermsofdata-efficiency.
4.4.4 Modifyingtheenvironment
Wenowmodifytheenvironmenttoanswerresearchquestion6(generalization).Toseehowthemethodscomparewhenobstaclesareintroducedtotheenvironment,werepeatthetrialononeofthecarobjectsexceptthattheagentcannolongerpassthroughstateswhoseelevationvaluesaredivisibleby10andazimuthvaluesaredivisibleby40(Fig. 4.11,(a)).Asbefore,thegoalandstartlocationscanhaveanyazimuth-elevationpair,buttheagentcannotmoveintostateswiththepropertiesindicatedbytheblackrectangle.Everyactionisavailabletotheagentatalllocationsasbefore,buttheagent’sstateisunchangedifitattemptstomovetoastatewithablackrectangle.WetrainedLARPusingthecontrastiveloss,PPO,andDQNagentsuntiltheyreachedaccuracyonourplanningtaskandthentestedthemwiththeaddedobstacles.Ourmethodlosesaboutperformance,butPPOloses.Nevertheless,wecancontinuetrainingPPOuntilitquicklyreachestopperformanceagain(Fig. 4.11,(b)).Ourmethodisnotre-trainedforthenewtask,andDQNdidnotreachagoodperformanceagaininthetimeweallottedforre-training.Thus,weseethatourmethodisquiteflexibleandgeneralizeswellwhenobstaclesareintroducedtotheenvironment.
4.4.5 Transfertodissimilarobjects
Wenowconsiderresearchquestion6(generalization)furtherbyinvestigatinghowwellourmethodtransfersknowledgefromonedomaintoanother.Selectingthebestdimensionalityfortherepresentationfromtheprevioussetofexperiments,weinvestigatefurthertheirperformanceinhardersituationsusingunseen,non-carobjects.Themodelsaretrainedonthesamecarobjectsasinthepreviousexperiment,buttheyaretestedonanarrayofdifferentplasticsoldiers:akneelingsoldierholdingabazooka,astandingsoldierwitharifle,aNativeAmericanwithabowandspearandacowboywitharifle(Fig. 4.12).
Qualitativeresults
Weofferavisualizationofthelearnedrepresentationofthekneelingsoldiertoyusingourmethod,aconvolutionalencoder,andVGG16inFig. 4.13.Eachembeddingwasreducedto2dimensionsusingt-SNE.Everymethodstructuresthedomainsimilarly.Inthebottomrow,weseethatthelargestclustersforallmethodsaretheoneswiththehighest(tealdots)illuminationsettings,whichisexplainedbytheeffectofthelightingonthepixelvalueintensities.Withintheseclusters,weseeclusteringbasedontheazimuth(middlerow).Finally,withintheseclusters,thereisagradientstructurebasedonelevation(toprow).Thisisduetotheelevationchanginginsmallerstep-sizes,with5degreedifferences,thanazimuthwith20degreedifferences.
Quantitativeresults
OurexperimentsincludethepretrainedVGG16networkbecausewebelievethataflexiblealgorithmshouldratherbebasedonagenericmulti-purpose-representationandnotonaspecificrepresentation.Tooursurprise,itisoutperformedbyourapproachoneveryobject,evenastheviewpoint-matchingtaskisdoneonobjectsthataredifferentfromtheoriginalcartoystheyaretrainedon.WealsotrytransferlearningaVAEandaconvolutionalencoderusingthesamenetworkasourrepresentationandaUMAPembedding,with64featureseach.Wedisplaytheresultsofthesameviewpoint-matchingtasksasbeforeusingthesedifferentrepresentationsontheunseenarmyfiguresinTable 4.5.Convolutionalencodershavethelowestperformanceoutofallthemethods,buttheVAErepresentationhasthemostsimilarperformancetoourownrepresentations.
Representation | Soldier | Soldier | NativeAmerican | Cowboy | |
---|---|---|---|---|---|
(Kneeling) | (Standing) | withBow | withRifle | Mean | |
LARP(Contrastive) | 22.7 | 11.1 | 12.1 | 20.7 | 16.7 |
LARP(Sphering) | 18.4 | 15.3 | 15.2 | 15.8 | 16.2 |
LARP(Decoder) | 14.3 | 16.6 | 14.7 | 15.6 | 15.3 |
VAE | 14.4 | 13.1 | 13.6 | 13.8 | 13.7 |
VGG16 | 12.3 | 12.9 | 11.8 | 13.6 | 12.7 |
UMAP | 8.6 | 10.8 | 9.7 | 11.1 | 10.1 |
Conv.Encoder | 4.9 | 14.3 | 11.5 | 6.4 | 9.3 |
Ourrepresentationisbetterthantheothersforthisexperiment.Butweseethattheperformanceofourmethoddropssignificantlyasweattemptgeneralizingtodifferentobjectswithchangedinputstatistics.Notethateventhoughtherepresentationclustering(Fig. 4.13)wasmoreclear-cutusingconvolutionalencodersandVGG,comparedtoours,theydidnotperformaswellinthetrials.Weattributethistothefactthatthemainlossterminourrepresentationwastheoneoffuturestatepredictability,whichmaynotgiveasclearofastructureintwodimensionsasarepresentationthatistrainedonlyforstaticimagereconstruction.
4.5 Conclusion
Inthischapter,wepresentLatentRepresentationPrediction(LARP)networkswithapplicationstovisualplanning.WejointlylearnamodeltopredicttransitionsinMarkovdecisionprocesseswitharepresentationtrainedtobemaximallypredictable.Thisallowsustoaccuratelysearchthelatentspacedefinedbytherepresentationusingaheuristicgraphtraversalalgorithm.Wevalidateourmethodonaviewpoint-matchingtaskderivedfromtheNORBdataset,andwefindthatarepresentationthatisoptimizedjointlywiththepredictornetworkperformsbestinourexperiments.Acommonissueofunsupervisedrepresentationlearningisoneoftrivialsolutions:aconstantrepresentationwhichoptimallysolvestheunsupervisedoptimizationproblembutbringsacrossnoinformation.Toavoidthetrivialsolution,weconstrainthetrainingbyintroducingaspheringlayeroralosstermthatiseithercontrastiveorreconstructive.Anyoftheseapproacheswilldothejobofpreventingtherepresentationsfromcollapsingtoconstants,andnoneofthemdisplaysstrongerperformancethantheothersinourexperiments.OurLARPrepresentationiscompetitivewithpretrainedrepresentationsforplanningandcomparesfavorablytootherreinforcementlearning(RL)methods.Ourapproachisasoundsolutionforlearningausefulrepresentationthatissuitableforplanningonlyfrominteractions.Furthermore,wefindthatourmethodhasbetterdata-efficiencyduringtrainingthanseveralreinforcementlearningmethodsfromtheliterature.However,adisadvantageofourapproachcomparedtostandardRLmethodsisthattheexecutiontimeofourmethodscalesworsewiththesizeofthestatespace,asaforwardpassiscalculatedforeachnodeduringthelatentspacesearch,potentiallyresultinginacombinatorialexplosion.Ourapproachisadaptabletochangesinthetasks.Forexample,oursearchwouldonlybeslightlyhinderedifsomeobstacleswereplacedintheenvironmentorsomestateswereforbiddentotraversethrough.Furthermore,ourmethodisindependentofspecificrewardsordiscountrates,whilestandardRLmethodsareusuallyrestrictedintheiroptimizationproblems.Often,thereisachoicebetweenoptimizingdiscountedorundiscountedexpectedreturns.Simulation/rollout-basedplanningmethodsarenotrestrictedinthatsense:Ifrewardtrajectoriescanbepredicted,onecanoptimizearbitraryfunctionsoftheseandregularizebehavior.Forexample,arisk-averseportfoliomanagercanprioritizesmoothrewardtrajectoriesovervolatileones.Futurelinesofworkshouldinvestigatefurthertheeffectofthedifferentconstraintsontheend-to-endlearningofrepresentationssuitedforapredictiveforwardmodel,aswellasconsideringnovelones.Thesearchalgorithmcanbeimprovedandmadefaster,especiallyforhigher-dimensionalactionspacesorcontinuousones.OurnetworkcouldalsoinprinciplebeusedtotrainanRLsystem,forinstance,byencouragingittoproducesimilaroutputsasoursandtherebycombiningdata-efficiencywithfastperformanceduringinferencetime.
Chapter 5 Rewardpredictionforrepresentationlearningandrewardshaping
ThepreviouschapterintroducedtheLARPnetworkandshoweditsusefulnessinview-pointmatchingexperiments,incomparisontoreinforcementlearning(RL)methods,particularlythemodel-basedones.Thenetworklearnsstatefeaturesthataretailor-madeforapredictionmodule,whichisthenusedforgraph-basedsearch.Thelearningofthestaterepresentationandpredictorisdoneinaself-supervisedmanner,independentlyofarewardsignal.Themethodisdata-efficientandisabletospeedtolearningfornewtasksafterpre-trainingonasimilarone.However,themethodrequiresaforwardpassofthenetworkforeverynodethatisconsideredduringthegoalsearch,whichscalespoorlywiththesizeofthestatespacecomparedtomethodsthatonlyneedtoprocessthecurrentstate.Dependingonthebranchingfactoroftheenvironmentandthedetailsofthelatentrepresentationgraphsearch,thiscanresultinacombinatorialexplosion.Inthischapter111Thischapterisadaptedfrom(hlynsson2021reward).,weapproachtheproblemoflearningstaterepresentationsforRLfromanotherangle:insteadofpredictingtheresultsofactions,welearnarepresentationthatispredictiveofthelocaldistancefromthegoalinsingle-goalenvironments.Therepresentationislearnedalongsidearewardpredictorthatlearnstoestimateeitheraraworasmoothedversionofthetruerewardsignal.Weaugmentthetrainingofout-of-the-boxRLagentsbyshapingtherewardusingourrewardpredictorduringpolicylearning.Usingourrepresentationforpreprocessinghigh-dimensionalobservations,aswellasusingthepredictorforrewardshaping,isshowntosignificantlyenhanceActorCriticusingKronecker-factoredTrustRegionandProximalPolicyOptimizationinsingle-goalenvironmentswithvisualinputs.Theremainderofthechapterhasthefollowingstructure:WestartwithachapterintroductioninSection5.1,followedbyanexplanationofrequiredbackgroundknowledgeforthischapterinSection5.2.WegointorelatedworkinSection5.3,afterwhichthedetailsofourapproachisoutlinedinSection5.4.ThemethodologyofourexperimentsisexplainedinSection5.5.WedisplayanddiscusstheresultsofourexperimentsinSection5.6.Finally,weclosethechapterwithconcludingremarksinSection5.7.
5.1 Introduction
EventhoughthedominanceofhumansisbeingtestedbyRLagentsonnumerousfronts,therearestillgreatdifficultiesforthefieldtoovercome.Forinstance,thedatathatisrequiredforalgorithmstoreachhumanperformanceisonafarlargerscalethanthatneededbyhumans.Furthermore,thegeneralintelligenceofhumansremainsunchallenged.EventhoughanRLagenthasreachedsuperhumanperformanceinonefield,itsperformanceisusuallypoorwhenitistestedinnewareas.Thestudyofmethodstoovercometheproblemofdata-efficiencyandtransferabilityofRLagentsinenvironmentswheretheagentmustreachasinglegoalisthefocalpointofthiswork.Weconsiderasimplewayoflearningastaterepresentationbypredictingeitheraraworasmoothedversionofasparserewardyieldedbyanenvironment.Thetwoobjectives,learningastaterepresentationandpredictingthereward,aredirectlyconnectedaswetrainadeepneuralnetworkfortheprediction,andthehiddenlayersofthisnetworklearnareward-predictivestaterepresentation.Therewardsignaliscreatedbycollectingdatafromarelativelylownumberofinitialepisodesusingacontrollerthatactsrandomly.Therepresentationisthenextractedfromanintermediatelayerofthepredictionmodelandre-usedasgeneralpreprocessingforRLagents,toreducethedimensionalityofvisualinputs.Theagentprocessesinputscorrespondingtoitscurrentstateaswellasthedesiredendstate,whichisanalogoustomentallyvisualizingagoalbeforeattemptingtoreachit.Thisgeneralapproachofrelyingonstaterepresentations,thatarelearnedtopredicttherewardratherthanmaximizingit,hasbeenmotivatedintheliterature(lehnert2020reward)andweshowthatourrepresentationiswell-suitedforsingle-goalenvironments.Ourworkaddstotherecentlygrowingbodyofknowledgerelatedtodeepunsupervised(hlynsson2019learning)orself-supervised(schuler2018gradient)representationlearning.WealsoinvestigatetheeffectivenessofaugmentingtherewardforRLagents,whentherewardissparse,withanovelproblem-agnosticrewardshapingtechnique.Therewardpredictor,whichisusedtotrainourrepresentation,isnotonlyusedasapartofanauxiliarylossfunctiontolearnarepresentation,butitisalsousedduringtrainingtheRLsystemtoencouragetheagenttomoveclosertoagoallocation.SimilartoadvantagefunctionsintheRLliterature(schulman2015high),giventhetrainedrewardpredictor,theagentreceivesanadditionalrewardsignalifitmovesfromstateswithalowpredictedrewardtostateswithahigherpredictedreward.Wefindthisrewardaugmentationtobebeneficialforourtestenvironmentwiththelargeststate-space.
5.2 Background
5.2.1 Rewardshaping
Sparserewardsinenvironmentsisacommonproblemforreinforcementlearningagents.Theagent’sobjectiveistolearnhowtoassociateitsinputswithactionsthatleadtohighrewards,whichcanbealengthyprocessiftheagentonlyrarelyexperiencespositiveornegativerewards.Rewardshaping(mataric1994reward; ng1999policy; brys2015policy)isapopularmethodofmodifyingtherewardfunctionofanMDPtospeeduplearning.Itisusefulforenvironmentswithsparserewardstoaugmentthetrainingoftheagent,butskillfulapplicationsofrewardshapingcaninprincipleaidtheoptimizationforanyenvironment–althoughtheefficacyoftherewardshapingishighlydependentonthedetailsoftheimplementation(clark2016faulty).Inthelastfewyears,rewardshapinghasbeenshowntobeusefulforcomplexvideogameenvironments,suchasreal-timestrategygames(efthymiadis2013using)andplatformers(brys2014multi)andithasalsobeencombinedwithdeepneuralnetworkstoimproveagentsinfirst-personshootergames(lample2017playing).Asanillustration,considerlearningapolicyforcarracing.Ifthegoalistotrainanagenttodriveoptimally,thensupplyingitwithapositiverewardforreachingthefinishlinefirstisintheorysufficient.However,ifitispunishedforactionsthatareneverbeneficial,forinstancecrashingintowalls,itprioritizeslearningtoavoidsuchsituations,allowingittoexploremorepromisingpartsofthestatespace.Furthermore,justreachingthegoalisinsufficientifthereiscompetition.Tomakesurethatwehaveawinningracer,asmallnegativerewardcanbeintroducedateverytimesteptourgetheagenttoreachthefinishlinequickly.Notethatthedetailsoftherewardshapinginthisexamplerequiresdomainknowledgefromadesignerwhoisfamiliarwiththeenvironment.Itwouldbemoregenerallyusefuliftherewardshapingwouldbeautonomouslylearned,justasthepolicyoftheagent,asweproposetodointhiswork.
5.2.2 Reward-predictivevs.reward-maximizingrepresentations
lehnert2020rewardmakethedistinctionbetweenreward-maximizingrepresentationsarereward-predictiverepresentations.Theyarguehowreward-maximizingrepresentationscantransferpoorlytonewenvironments,whilereward-predictiverepresentationsgeneralizesuccessfully.TakethesimplegridworldnavigationenvironmentsinFig.5.1,forexample.Theagentstartsatarandomtileinthegridandgetsarewardof+1byreachingtherightmostcolumninEnvironmentAorbyreachingthemiddlecolumninEnvironmentB.ThestatespaceinEnvironmentAcanbecompressedfromthegridtoavectoroflength3,ofreward-predictiverepresentations.Topredictthediscountedreward,itsufficestodescribetheagent’sstatewithifitisinthethrow.
Thereward-maximizingrepresentationforEnvironmentAismuchsimpler:thewholestatespacecanbecollapsedtoasingleelement,withtheoptimalpolicyofalwaysmovingtotheright.Iftheserepresentationsarekept,thenthereward-predictiverepresentationisinformativeenoughforaRLagenttolearnhowtosolveEnvironmentB.Thereward-maximizingrepresentationhasdiscardedtoomanydetailsoftheenvironmenttobeusefulforsolvingthisnewenvironment.
5.2.3 Successorfeatures
Thesuccessorrepresentationalgorithmlearnstwofunctions:theexpectedrewardreceivedaftertransitioningintoastate,aswellasthematrixofdiscountedexpectedfutureoccupancyofeachstate,assumingthattheagentstartsinagivenstateandfollowsaparticularpolicy.Knowingthequantitiesandallowsustorewritethevaluefunction:
(5.1) |
Themotivationforthisalgorithmisthatitcombinesthespeedofmodel-freemethods,byenablingfastcomputationsofthevaluefunction,withtheflexibilityofmodel-basedmethodsforenvironmentswithchangingrewardcontingencies.Thismethodismadeforsmall,discreteenvironments,butithasbeengeneralizedforcontinuousenvironmentswithso-calledfeature-basedsuccessorrepresentations,orsuccessorfeatures(SFs)(barreto2016successor).TheSFalgorithmsimilarlycalculatesthediscountedexpectedrepresentationoffuturestates,giventheagenttakestheactioninthestateandfollowsapolicy:
(5.2) |
whereissomestaterepresentation.BoththeSFandtherepresentationcanbedeepneuralnetworks.
5.3 Relatedwork
5.3.1 Reward-predictiverepresentations
lehnert2020rewardcomparesuccessorfeatures(SFs)toanonparametricBayesianpredictorthatistrainedtolearntransitionandrewardtablesfortheenvironment,eitherwithareward-maximizingorareward-predictivelossfunction.lehnert2020successorproveunderwhatconditionssuccessorfeatures(SFs)areeitherreward-predictiveorreward-maximizing(seedistinctioninSection5.2.2).TheyalsoshowthatSFsworksuccessfullyfortransferlearningbetweenenvironmentswithchangingrewardfunctionsandunchangedtransitionfunctions,buttheygeneralizepoorlybetweenenvironmentswherethetransitionfunctionchanges.Ourworkisdistinctfromthereward-predictivemethodsthattheycompare,asourrepresentationdoesnotneedtocalculateexpectedfuturestateoccupancy,asisthecaseforSFs.Ourmethodscalesbetterformorecomplicatedstate-spacesbecausewedonottabulatethestates,astheydowiththeirBayesianmodel,butlearnarbitrarycontinuousfeaturesofhigh-dimensionalinputdata.Inadditiontothat,learningourrewardpredictorisnotonlya"surrogate"objectivefunction,asweuseitforrewardshapingaswell.
5.3.2 Rewardshaping
Theadvantagesofrewardshapingarewellunderstoodintheliterature(mataric1994reward).ArecenttrendinRLresearchisthestudyofmethodsthatcanlearntherewardshapingfunctionautomatically,withouttheneedof(oftenfaulty)humanintervention.marashi2012automaticassumethattheenvironmentcanbeexpressedasagraphandthatthisgraphformulationisknown.Underthesestrongassumptions,theyperformgraphanalysistoextractarewardshapingfunction.Morerecently,zou2019rewardhaveproposedameta-learningalgorithmforpotential-basedautomaticrewardshaping.Ourapproachisdifferentfrompreviousworkasweassumenoknowledgeabouttheenvironmentandtrainasimplepredictortoapproximate(potentiallysmoothed)rewards,whichisthenusedtoconstructapotential-basedrewardshapingfunction.
5.3.3 Goal-conditionedreinforcementlearning
kaelbling1993learningstudiedenvironmentswithmultiplegoalsandsmallstate-spaces.Intheirproblemsetting,theagentmustreachaknownbutdynamicallychanginggoalinthefewestnumberofmoves.Theobservationspaceisofalowenoughdimensionfordynamicprogrammingtobesatisfactoryintheircase.schaul2015universalintroducetheUniversalValueFunctionApproximatorsandtackleenvironmentsoflargerdimensionsbylearningavaluefunctionneuralnetworkapproximatorthatacceptsboththecurrentstateandagoalstateastheinputs.Inasimilarvein,pathak2018zerolearnapolicythatisgivenacurrentstateandagoalstateandoutputsanactionthatbridgesthegapbetweenthem.hlynsson2020latentlearnapredictablerepresentationthatispairedwitharepresentationpredictorandcombineitwithgraphsearchtofindagivengoallocation.Incontrasttotheseapproaches,welearnareward-predictiverepresentationinaself-supervisedmanner,whichisusedtopreprocessrawinputsforRLpolicies.
5.4 Approach
Inthissection,weexplainourapproachmathematically.Intuitively,wetrainadeepneuralnetworktopredicteitheraraworasmoothedrewardsignalfromasingle-goalenvironment.Theoutputofanintermediatelayerinthenetworkisthenextractedastherepresentation–forexample,bysimplyremovingthetoplayersofthenetwork.Thefullrewardpredictornetworkisusedforrewardshapingbyrewardingtheagentformovingfromlowerpredictedvaluestowardhigherpredictedvaluesofthenetwork.
5.4.1 Learningtherepresentation
Supposethatisadifferentiablefunctionparameterizedbyandisapositiveinteger.WeusetoapproximatethediscountedreturninaPOMDPwithasparsereward:theagentreceivesarewardof0foreachtimestepexceptwhenitreachesagoallocation,atwhichpointitreceivesapositiverewardandtheepisodeterminates.Givenanexperiencebuffer,wecreateanewdataset.Thenewrewardsarecalculatedaccordingtotheequation
(5.3) |
whereisadiscountfactorandisthedifferencebetweenandthetimestepindexofthefinaltransitioninthatepisode,forsomemaximumtimehorizon.Throughoutourexperiments,wekeepthevalueofthediscountfactorequaltoandwetrainonor.Assumethatourdifferentiablerepresentationfunctionisparameterizedbyandmapsthe-dimensionalrawobservationofthePOMDPtothe-dimensionalfeaturevector.Wetraintherepresentationforthediscounted-rewardpredictionbyminimizingthelossfunction
(5.4) |
withrespecttotheparametersofandtheparametersofoverthewholedataset.SeeFig.5.2foraconceptualoverviewofourrepresentationlearning.
5.4.2 Rewardshaping
ng1999policydefinearewardshapingfunctionaspotential-basedifthereexistsafunctionsuchthatforallstatesthefollowingequationholds:
(5.5) |
andistheMDP’sdiscountfactor.Theyproveforsingle-goalenvironmentsthateveryoptimalpolicyfortheMDPisalsooptimalforitsrewardshapedcounterpart,andviceversa.Theyalsoshow,foragivenstatespaceandactionspace,thatifisnotpotential-based,thenthereexistatransitionfunctionandarewardfunctionsuchthatnooptimalpolicyinisoptimalin.Decidingthattherewardshapingfunctionshouldbepotential-basedisjustthefirststepinitsdesign.Nowassumethatwehaveanenvironmentwhereanagentistaskedwithreachingagoalstate.Thatis,foragivendistancefunctiontheagentreceivesarewardof1ifitiscloseenoughtothegoallocation,,forsomerewardthreshold.Otherwise,itreceivesarewardof0.Thedistancebetweentheagent’slocationnewandthegoallocationcanbeausefulvaluetocalculateinthedesignofarewardshapingfunction
(5.6) |
However,thisdependsontheenvironment,astheagentcouldgetstuckinlocaloptimabeforecomingclosetothegoal,i.e.ifitwouldhavetomovethrougharegionwithalargebeforeitcangloballyminimizeit.ThiscouldforexamplebethecaseinamazeenvironmentifistheEuclideandistancebetweenthecoordinatesoftheagent’slocationandthegoallocationandthereisawallbetweentheagentandthegoal.trott2019keepingproposetosolvethisbyincorporatingthepotentiallocaloptimaintherewardshapingfunctionasso-called"anti-goals"tobeavoided
(5.7) |
Thesestatescanbehand-pickedbydomainexperts.However,addinganti-goalslikethiscoulditerativelyintroduceevenmorelocaloptimaandasolutiontotheoriginalproblemisnotguaranteed.Itisgenerallynottruethatthedistancefunctionandallthevariablesneededtocalculateit,suchasthecoordinatesoftheagentandthegoalinamaze,areavailabletotheagent.Evenifwerecomputable,usingitnaivelycanbringaboutitsownproblems,aswasalludedtoabove.WearguethatinsteadofusinginEquation 5.6,itwouldbebettertomeasurethedistancebetweentheagentandthegoalintermsofhowmanyactionstheagenthastotakeuntilthegoalisreached.Thisfunctionisnotassumedtobegiven,butitcanbeestimatedastheagentisbeingtrainedontheenvironment,forinstancebyoptimizingEquation 5.4.Additionally,wewouldlikeourrewardshapingfunctiontobepotential-based(Equation5.5)toreapthetheoreticaladvantages.Thus,weproposeapotential-basedrewardshapingfunctionbasedonthediscountedrewardpredictor
(5.8) |
where,istherewardpredictorandisourrepresentationfromtheprevioussection.Notethatbothandareassumedtobefullytrainedbeforethepolicyoftheagentistrained,forexampleusingdatagatheredbyarandompolicy,buttheycaninprinciplealsobeupdatedasthepolicyisbeinglearned.Thefactorscalesdowntheintensityoftherewardshapingwhereisthenumberofepisodesthattheagenthasexperiencedandisthemaximumnumberofepisodeswheretheagentistrainedusingrewardshaping.Thestrengthoftherewardshapingisthehighestinthebeginningtocounteractpotentiallyadverseeffectsoferrorsintherewardpredictor.Itisalsomoreimportanttoincentivizemovingtowardthegeneraldirectionofthegoalintheearlystagesoflearning,afterwhichtheun-augmentedrewardsignaloftheenvironmentisallowedto"speakforitself"andguidethelearningoftheagenttowardthegoalprecisely.
5.5 Methodologyandimplementation
5.5.1 Environment
ThemethodistestedonthreedifferentgridworldenvironmentsbasedontheMinimalisticGridworldEnvironment(MiniGrid)(gym_minigrid).Tilescanbeempty,occupiedbyawalloroccupiedbylava.ThestructureoftheenvironmentsfitnaturallyintoourPOMDPtupletemplate(Eq.2.1):
-
Theconstituentstatesofaredeterminedbytheagent’slocationanddirection(facingnorth,west,southoreast).SeeFig.5.2(a)forthreedifferentworldstatesinoneofourenvironments.
-
Theactionspaceconsistsofthreeactions:(1)turnleft,(2)turnrightand(3)moveforward.
-
Thetransitionfunctionisdeterministic.Theagentrelocatestothetileitfacesifitmovesforwardandthetileisempty,andnothinghappensifthetileisoccupiedbyawall.Theepisodeterminatesifthetileisoccupiedbylavaorthegoal.Theagentrotatesinplaceifitturnsleftorright.
-
Reachingthegreentilegoalgivesarewardof,everyotheractiongives0points.Theenvironmentautomaticallytimesoutaftersteps.
-
Differsbetweenthethreeenvironments(seebelow).
-
Allsubsetoftiles,representedbyarrays,fromthepointofviewofanagentwhocannotseethroughwalls,seeFig.5.2(b).
-
thediscountfactoris.
Weconsiderthefollowingthreeenvironments:
Two-roomenvironment
Theworldisagridoftiles,splitintotworooms,wherewallsareplacedatdifferentlocationstofacilitatediscriminationbetweentheroomsfromtheagent’spointofview.(Fig.5.3).Theagentisplacedbetweenthetworooms,facingarandomdirection.Thegoalisatoneofthreepossiblelocations.Thisisamodifiedversionoftheclassicalfour-roomenvironmentlayout(sutton1999between).
Lavagapenvironment
Inthisenvironment,theagentisinaroomwithacolumnoflavaeitheroneortwospacesinfrontoftheagent(Fig.5.4)withagapinarandomrow.Theagentalwaysstartsintheupperleftcornerandthegoalisalwaysinthelowerrightcorner.
Four-roomenvironment
Anexpansiontothetwo-roomenvironmentwithtwoadditionalrooms(Fig.5.5).Inthissetup,boththeagentandthegoallocationareplacedatrandomlocationswithinthegridworld.
5.5.2 Baselines
WecombineourrepresentationswithtwoRLalgorithmsasimplementedinStableBaselines(stable-baselines)usingthedefaulthyperparameters:
-
(ACKTR)ActorCriticusingKronecker-FactoredTrustRegion(wu2017scalable),whichcombinesactor-criticmethods,trust-regionoptimization,anddistributedKroneckerfactorizationtoenhancedata-efficiency.
-
(PPO2)AversionoftheProximalPolicyOptimizationalgorithm(schulman2017proximal).Itmodifiestheoriginalalgorithmbyusingclippedvaluefunctionsandanormalizedadvantage.
Forbothalgorithms,sixvariationsarecompared:
-
(DeepRL)TheRLalgorithmlearnstherepresentationfromscratchonrawimages
-
(SF)Theinputispreprocessedusingsuccessorfeatures
-
(Ours1r)Theinputispreprocessedusingourrepresentation,trainedonrawrewardpredictions
-
(Ours1r+Shaping)Theinputispreprocessedusingourrepresentationandtherewardisshaped,trainedonrawrewardpredictions
-
(Ours64r)Theinputispreprocessedusingourrepresentation,trainedonsmoothedrewardpredictions
-
(Ours64r+Shaping)Theinputispreprocessedusingourrepresentationandtherewardisshaped,trainedonsmoothedrewardpredictions
Carehasbeentakentoensurethateachvariationhasthesamearchitectureandthesamenumberofparameters.
5.5.3 Modelarchitectures
EverymodelisrealizedasaneuralnetworkusingKeras(chollet2015keras).Below,therepresentationandpolicynetworksareusedforourmethodandtheSFcomparison,therewardpredictionnetworkisusedonlyforourmethodandthedeepRLnetworkisusedonlyforthedeepRLcomparison,wheretheRLalgorithmalsolearnstherepresentation.Therepresentationnetworksaretwoconvolutionalnetworks(Table5.1)withainput,takingeithertheagent’scurrentobservationorthegoalobservation.
Layer | Filters | Filtersize | Stride | Padding | Outputshape |
Inputtensor | - | - | - | - | |
Convolution | 8 | 3 | None | ||
ReLU | - | - | - | - | |
2Dmaxpooling | 8 | - | None | ||
Convolution | 16 | 2 | None | ||
ReLU | - | - | - | - | |
Flatten | - | - | - | - | 16 |
Dense | - | - | - | - | 16 |
Thefirstlayersubsamplestheinput,keepingonlyeveryothercolumnandrow.Thisisfollowedby8filtersofsizewithastrideof3.ThisisfollowedwithaReLUactivationandamaxpoolinglayerwithastridevalueof2.Thepoolinglayer’soutputispassedtoalayerwith16convolutionalfiltersofsizeandastrideof2andaReLUactivationfunction.Theoutputisthenflattenedandpassedtoadenselayerwith16unitsandalinearactivation,definingthedimensionoftherepresentation.Nozeropaddingisappliedintheconvolutionallayersorthepoolinglayer.Thepolicynetworksarethree-layerfully-connectednetworks(Table5.2)acceptingtheconcatenatedoutputoftherepresentationnetworkfortheagent’scurrentpointofviewandthegoalobservationasaninput.Thefirsttwolayershave64unitsandaReLUactivation,andthelastlayerhas3unitsandalinearactivationfunction.Thethreeunitsrepresentthethreeactionsleft,right,andforwardinaonehotencoding.Winnertakesallisusedtodecideontheaction.
Layer | Units | Outputshape |
Inputtensor | - | 32 |
Dense | 64 | 64 |
ReLU | - | 64 |
Dense | 64 | 3 |
ReLU | - | 3 |
Dense | 3 | 3 |
Ourrewardpredictionnetworkisathree-layerfully-connectednetwork(Table5.3)withthesameinputasthepolicynetwork:theconcatenatedrepresentationoftheagent’scurrentviewandthegoalobservation.Thefirsttwolayershave256unitsandaReLUactivation,butthelastlayerhas1unitandalogisticactivationfunction.
Layer | Units | Outputshape |
Inputtensor | - | 32 |
Dense | 256 | 64 |
ReLU | - | 64 |
Dense | 256 | 3 |
ReLU | - | 3 |
Dense | 1 | 3 |
Logistic | - | 3 |
ThedeepRLnetworkstackstherepresentationnetworkandthepolicynetworkontopofeachother.Therepresentationnetworkacceptstheinputandoutputsthelow-dimensionalrepresentationtothepolicynetworkthatoutputstheactionscores.
5.5.4 Trainingtherepresentationandpredictornetworks
Wecollectadatasetofthousandtransitionsbyfollowingarandompolicyinthetwo-roomenvironment.Forthisdatacollection,eachepisodehasachancetohavethegoallocationinthebottomroomorontheleftsideofthetoproom(seetheleftandmiddlepicturesinFig.5.2(a)).Therewardpredictorandtherepresentationaretrainedinthismannerforallexperiments,includingthelavagapandthefour-roomenvironment.Thus,weusearepresentationandrewardpredictorthathaveneverseenlava.Fortheexperimentswithsmoothedrewards,thesparserewardassociatedwiththeobservationsinthedatasetisaugmentedbyassociatinganewrewardtothestatesleadingtoobservationswithapositivereward,accordingtoEquation5.3,withadiscountfactorof.Additionally,aftertherewardhasbeen(potentially)smoothedinthisway,observationsassociatedwithapositiverewardareoversampledtimestobalancethedataset,regardlessofwhethertherewardhasbeenaugmentedornot.
5.6 Resultsanddiscussion
Intheexperiments,wecompareRLagentsthatlearntheirrepresentationsfromscratch(DeepRL)toagentsthatpreprocesstheirinputswithdifferentrepresentations.Wecompareourrepresentation,trainedonrawrewardpredictions–with(Ours1r)orwithoutrewardshaping(Ours1r+Shaping)–toourmethodtrainedonsmoothedrewardprediction,alsowith(Ours64r)orwithoutrewardshaping(Ours64r+Shaping).Weuse"64r"todenotethatourmethodwastrainedwithrewardshapingand"1r"todenotethatourmethodwastrainedwithoutaugmentation.Asabaseline,wecompareourrepresentationtoareward-predictiverepresentationfromtheliterature,SuccessorFeatures(SFs).
5.6.1 Two-roomenvironment
Westartbyvisualizingtheoutputsofourrewardpredictorintherooms,dependingonthegoallocation,inFig.5.6.Eachsquareindicatestheaveragepredictedrewardfortransitioningtothecorrespondingtileintheroom.Thepredictedrewardspikesinanarrowregionaroundthetwogoallocationsthatwereusedtotraintherawrewardpredictor(Fig.5.5(a)),buttheareaofstateswithhighpredictedrewardsiswideraroundthetestgoal.Thisdifferenceisduetooverfittingonthespecifictrainingpathsthatweremorefrequentlytakentowardtherespectivegoals,butthisdoesnotharmthegeneralizationcapabilitiesofthenetwork.Thepeakynessofthepredictionsdisappearswhenthepredictoristrainedonthesmoothedrewards(Fig.5.5(b)).However,higherpredictedrewardsinthecorneroftheotherroomappear.Bothscenarios,rawandsmoothedrewardprediction,showpromisefortheapplicationofrewardshapingunderourtrainingscheme,astheagentwouldbenefitfromfindingneighborhoodswithhighervaluesofpredictedrewarduntilitreachesthegoal,insteadofhavingtorelysolelyonasparserewardthatisonlygivenwhentheagentlandsexactlyonthegoalstate.
InFigure5.7,weillustratethevarianceofthemeanreward(leftside)andthevarianceoftheoptimalperformance(rightside)ofthedifferentmethods,asafunctionofthetimestepstakenfortraining.Weaverageover10runsandineachrunweperform10testrollouts,soeachpointistheaggregateof100episodesintotal.222Notethatthestandarddeviationofthemeanrewardofallepisodes,fromallrunsputtogether,isapproximatelyhigherthanthestandarddeviationofthemeanofthemeansorthemeanofthemins.Theerrorbandsindicatetwostandarddeviations.ThismethodologyofgeneratingtheplotsalsoappliestoFig.5.9,Fig.5.10andFig.5.12.ThelearningcurvesofbothACKTRandPPO2getclosetothehighestachievablemeanrewardofthefastestusingourrepresentations.TherenosignificantbenefitfromusingsmoothedrewardshapingforACKTR,andtherawrewardshapingisinfactharmfulinthiscase.ForPPO2,theagentusingourrepresentationthatistrainedonrawrewardpredictionslearnsthefastest.RegulardeepRL,wheretherepresentationsarelearnedfromscratch,isclearlyoutperformedbythevariantsthatusereward-predictiverepresentations.WebelievethatthisisbecauseRLagentscangenerallybenefitfromtheinputbeingpreprocessed,asthecomputationaloverheadforlearningthepolicyisreduced.Thiseffectisenhancedwhenthepreprocessingisgood,whichisthecaseforourreward-predictiverepresentation:itabstractsawayunnecessaryinformationasitistrainedtooutputfeaturesthatindicatethedistancebetweentheagentandthegoal,whenthegoalisinview.Thedifferenceinaggregatedmeanrewardsvs.aggregatedminimumepisodelengthscanbeexplainedduetosystematicallydifferentbehaviors.Forexample,anagentmighthaveaweaklong-termstrategyofcheckingthedifferentrooms,givingitpooraveragemeanrewards,butastrongshort-termtacticoftakingthedirectcoursetothegoalwhenitseesit,givingitagoodaverageminimumepisodelength.
5.6.2 Lavagapenvironment
Learningfromscratch
TheheatmapsofaveragepredictedrewardsarevisualizedinFig.5.8.Therewardpredictorwastrainedonthetwo-roomenvironment.Thetilesclosesttothegoalhavethehighestvalues,withaparticularlysmoothgradienttowardthegoalforthesmoothed-rewardpredictor,whichdemonstratesthatthereispotentialgainfromtransferringtheprediction-basedrewardshapingbetweensimilarenvironments.ThelearningperformanceofthedifferentmethodscanbeseeninFig.5.9.Thedecidedlyfastestlearningcanbeobservedwhentheactor-criticmethodiscombinedwithourrepresentation,trainedonrawrewardpredictionsandwithoutrewardshaping.RegulardeepRListhesecond-best,butwithaverylargevarianceontheperformance.OurrewardshapingvariationsandtheSFsareverycloseinperformance,albeitsignificantlyworsethantheothertwo.Thepoorperformanceofrewardshapingcanbeexplainedbythefactthatthereareveryfewstates,whichmakestherewardshapingunnecessaryinsuchasimpleenvironment.AllthemethodslookmoresimilarwhenPPO2optimizationisapplied,withrespecttothemeanrewards,butourvariantthatistrainedonsmoothrewardpredictionandusesrewardshapingreachesthehighestaverageperformanceinthelastiterations.
Transferlearning
Toinvestigatehowthemethodscompareforadaptingtonewenvironments,wetrainedthepoliciesfor8000stepsonthetwo-roomenvironmentbeforelearningtosolvethelavagapenvironment,seeFig.5.10.Ourmethod,withoutrewardshaping,facilitatesthefastestlearningforACKTRinthiscase.DeepRListhemostseverelyaffectedbythischange,whichisprobablyduetothemethodlearningareward-maximizingrepresentationinoneenvironmentthatdoesnottransferwelltoanotherenvironment.EveryPPO2variationlooksbadforthisscenario,butthesmooth-rewardpredictionrepresentationwithrewardshapinghasthehighestmeanrewardandourraw-rewardpredictionrepresentationhasthelowestaverageminimumepisodelength.
Wevisualizetrajectoriesofanagentthatistrainedonourrepresentation(Ours1r)asittraversesthelavagapsenvironment(Fig.5.11).Forinspectionofcaseswhereitfails,wechooseanagentthathasbeentrainedforthousandtimestepsonlyandhasaroundmeanreward.
5.6.3 Four-roomenvironment
Inourfinalcomparison,weaddtwoadditionalroomstothetwo-roomenvironmentandrandomizeboththegoallocationandthestartingpositionoftheagent,withtheresultsshowninFig.5.12.Lookingattheminimumepisodelengths,fortheACKTRlearner,ourraw-rewardpredictionrepresentationwithrewardshapingperformsbestandtheonewithoutrewardshapingcomesinsecond.ThereislittlediscernibledifferencebetweentheperformanceofSFsandDeepRL,buttheybothperformsignificantlyworsethanourmethods.Thescaleofthemeanrewardisagreatdeallowerthaninthepreviousexperiments,sincetheaveragedistancebetweenthestartingtileoftheagentandthegoalismuchlargerthanintheprevioustwoenvironments.Forthisscenario,allthemethodslooksimilarlybadforthePPO2policy,exceptforourraw-rewardrepresentations,withrewardshaping,whichhasthelowestminimumepisodelength.Thebigadvantageofrewardshapinginthisenvironmentcomparedtothetwo-roomenvironmentcanbeexplainedbytheincreasedcomplexity,makingtherewardshapingmorehelpfulinguidingtheagent’ssearch.Inthepreviousexperiments,theagentandgoallocationsstartatfixedlocations,allowingtheagentstosolveitbyrotememorization.Therewardshapingfunctioncalculatedbytheraw-rewardpredictorfaressignificantlybetterinthissituation.Wehypothesizethatthisisduetothesmoothed-rewardpredictordistractingtheagentbypushingittocorners,asthevisualizationinFig.5.5(b)wouldsuggest.Therewardshapinggivenbytheraw-rewardpredictorismorediscriminative,asweseeinFig.5.5(a).Theagentreceivesapositiverewardassoonasthegoalreachesitspointofview,whichisanylocationuptosixtilesinfrontofitandnofurtherthan3tilesawayfromittotheleftortotheright.Thisallowstherewardshapingfunctiontoguidetheagentdirectlytothegoal,assumingthattheyareinthesameroomandthatthereisnowallobstructingtheagent’sfieldofvision.
ThreesuccessfulandthreefailedtrajectoriesofanACKTRagentthathasbeentrained,usingourrepresentation(Ours1r)foramilliontimestepsarevisualizedinFig.5.13.Wecanseeundesirablebehaviorinboththesuccessfulandthefailedtrajectories,thattheagentwasteseffortre-visitingtilesithasalreadybeento.
5.7 Conclusion
Processinghigh-dimensionalinputsforreinforcementlearning(RL)agentsremainsadifficultproblem,especiallyiftheagentmustrelyonasparserewardsignaltoguideitsrepresentationlearning.Inthiswork,weputforwardamethodtohelpalleviatethisproblemwithamethodoflearningrepresentationsthatpreprocessesvisualinputsforRLmethods.Ourcontributionsare(i) areward-predictiverepresentationthatistrainedsimultaneouslywitharewardpredictorand(ii) arewardshapingtechniqueusingthistrainedpredictor.Thepredictorlearnstoapproximateeithertherawrewardsignalorasmoothedversionofit,anditisusedforrewardshapingbyencouragingtheagenttotransitiontostateswithhigherpredictedrewards.Weusedaviewofthegoalasasecondinputforthemethodsinourexperiments,butthisisinprinciplenotnecessary,asmovingtowardthegreentileasitbecomesvisibleissufficient.Removingthegoalinputmightencouragetheagentstolearnpoliciesthatscanalltheroomsfasteruntilthegoalreachesitsfieldofvision.Wehaveshowntheusefulnessofourrepresentationandourrewardshapingschemeinaseriesofgridworldexperiments,wheretheagentreceivesahigh-dimensionalobservationofitsgoalasaninputalongwithanobservationofitsimmediatesurroundings.Preprocessingtheinputusingthisrepresentationspeedsupthetrainingoftwoout-of-the-boxRLmethods,ActorCriticusingKronecker-FactoredTrustRegionandProximalPolicyOptimization,comparedtohavingthesemethodslearntherepresentationsfromscratch.Inourmostcomplicatedexperiment,combiningourrepresentationwithourrewardshapingtechniqueisshowntoperformsignificantlybetterthanthevanillaRLmethods,whichhintsatitspotentialforsuccess,especiallyinmorecomplexRLscenarios.
Chapter 6 Comparisonofourthreemethods
WecompareinthischapterthealgorithmsthatwedevelopedoverthecourseofthePhDwork:thegradient-basedICArepresentation(GrICA)fromChapter3,thelatentrepresentationpredictionnetwork(LARP)fromChapter4andthereward-predictionrepresentationfromChapter5(RewPred).Weusetherepresentationscalculatedbythesemethodstopreprocessinputsfordeepreinforcementlearning(DeepRL)agentsonfourdifferenttasks:acart-polebalancingenvironment,thetwo-roomandfour-roomgoal-findingenvironmentsfromthepreviouschapterandanobstacleavoidancetask.Theimplementationofourmethodsisthesameasdescribedintheexperimentalsectionoftheircorrespondingchapter,unlessspecifiedotherwise.InSection6.2weintroducethenewvisualcart-poleenvironmentanddiscussthenetworkarchitectures.InSection6.3wedescribeanddiscusstheresultsoftheexperiments.
6.1 Introduction
AseachmethodwasintroducedoverthecourseofthePhDwork,itwasonlycomparedtoothersimilarmethodsintheliterature.Wenowtaketheopportunitytocomparethemagainsteachotherindifferentenvironments.Thecart-poleenvironmentandtheobstacleavoidanceenvironmenthavethecommonalitythattheybothrequireonlyreactiveshort-termplanningtomaximizethereward,butthegoal-findingenvironmentsrequiremorelong-termplanningtofindandreachthegoal.EveryoneofourrepresentationsisusedforpreprocessingthevisualobservationsofeachenvironmentforRLalgorithms.TheRLagentsarethesameonesusedinthepreviouschapter:ACKTRandPPO2,withthedefaultparametersfromtheStableBaselinespackage.Aswewerenotabletosolvethecart-poleenvironmentusingACKTR,onlyPPO2isusedforthattask.WecomparethemethodfromChapter4,LARP,inthismanner,eventhoughitisnottrainedforpreprocessinginputsformodel-freeagentsbutisdevelopedtobepairedwithapredictorforpredictivestaterepresentationrollouts
6.2 Methods
6.2.1 Visualcart-pole
WestartbycomparingourmethodsasvisualpreprocessingforaPPO2policyonavariantoftheclassicalcart-polebalancingtask.Apoleisattachedtoacartthatmovesalongafrictionlesstrack.Thegoalistokeepthepoleuprightforaslongaspossible,buttheagentmustpushthecarttotheleftortotherightateverytimestep–choosingtoremaininplaceisnotanoption.Theoriginalenvironmentsuppliesfourscalarvariablestotheagent:thecartpositionandvelocityandthepoleangleandangularvelocity.Theagentreceivesapositiverewardofforeachtransition.Insteadofusingthesescalars,weadaptthistasktoourparadigmbyusingavisualizationoftheenvironment,seeFig.6.1.Tobeginwith,weextracttherenderingoftheenvironmentthatismeantfordebuggingandvisualizationpurposes.Aswecanapproximatethecartpositionandvelocityandthepoleangleandangularvelocityusingtwoadjacentframes,weusethecurrentandpreviousobservations.Tosimplifytheinputfortheagent,wetakethedifferenceofthetwoframes,croptheimagearoundthecartandbinarizeit.Theonlyinformationthatislostinthisprocess,comparedtotheoriginalenvironment,isthepositionofthecart,whichisnotsoimportantforsolvingthetask.Inadditiontogivingtheagentapositiverewardofforeverytimestepitkeepsthepolebalanced,wealsosupplyitwithanegativerewardofwhenthepolefallsdown.
Ourmethodsaretrainedon5000transitionsthatarecollectedfromarandompolicy.Eventhoughthisisnotasparse-rewardenvironment,wecalculatednewrewardsfortheRewPrednetworkaccordingtoEquation5.3withadiscountfactorofandamaximumhorizonof.Thishastheeffectthateverytimestepisassociatedwitharewardof,exceptforthesixtransitionsleadinguptotheendoftheepisode,whichhavetherewardsand.
6.2.2 Roomenvironments
WecomparetheperformanceofPPO2andACKTRpoliciesastheyuseourmethodsforpreprocessingonthetwo-roomandfour-roomgoaltasksfromthepreviouschapter,seeSection5.5.1.ThedatagatheringfortherepresentationsusedforpreprocessingfortheRLagentsisunchangedfromthepreviouschapter.
6.2.3 Obstacleavoidance
Inthisgridworldenvironment,theagentisplacedinaroomfilledwithcircularobjectsthatmoverandomlyineachstep(Fig.6.2).
Theagentreceivesvisualinputsofdimensionandmuststayasfarawayfromtheobjectsaspossible,asitreceivesarewardthatisequaltotheEuclideandistancebetweenitselfandtheclosestcircle.Theepisodeendswitharewardofaftertimestepsorwhentheagentcollideswithacircle,whichgivesarewardof.Thediscountfactorfortheenvironmentis0.9.
6.2.4 Architectures
Allrepresentationsandpolicieshavethesamearchitectureastheinthepreviouschapter,exceptthattheoutputoftherepresentationhasbeenloweredtoa16-dimensionalvectorforlowercomputationaltimes,seeTable5.1andTable5.2.TheRLmodelshavethesamearchitectureexceptthattherepresentationandpolicymodulesaretrainedend-to-end.ThepredictionmoduleoftheLARPnetworkisthesameasinChapter4,seeTable4.3.ThemutualinformationneuralestimatornetworkusedfortrainingGrICAisthesameasinChapter3,seeSection3.4.
6.3 Resultsanddiscussion
6.3.1 Visualcart-pole
Theresultsfromthevisualcart-poleexperimentcanbeseeninFig.6.3.ThedeepRLagentachievesthehighestmeanrewardbyasignificantmargin,probablybecausetheenvironmentrequirestheagenttotakequickactionsinresponsetothefast-changingenvironment.
ThelearningcurveistheworstwhentheagentistrainedontheLARPrepresentations,whichistobeexpectedasitislearnedtobepairedwitharepresentationpredictorandgraphsearch,whichitisnotusedforthisscenario.TheRewPredandGrICAlearningcurveslooksimilarlygood,withtheRewPredrepresentationachievingaslightlyhigheraveragereward.Bothrepresentationsshouldtheoreticallybeusefulhere:learningarepresentationthatispredictiveofhowclosethepoleisfromfallingdown(RewPred)oughttoguidetheagent’sactionsawayfromstateswherethepoleisabouttofalldown,andrecoveringthethreestatisticallyindependentlatentvariablesgeneratingtheenvironment(GrICA),whichwouldbeaperfectcompressionoftheenvironment.Inthisscenario,however,theydonotbeattherepresentationlearningoftheDeepRLalgorithm.
6.3.2 Roomgoal-finding
Theresultsfortrainingonthesmooth-rewardpredictionRewPred111Denoted"Ours64r"inthepreviouschapter.andDeepRLrepresentationsarekeptintheplotsfromthepreviouschapter,andwehaveaddedtheresultsfortheGrICAandtheLARPrepresentations.
Two-roomenvironment
Theresultsforthegoal-findingtask,whentheagentstartsfacingarandomdirectionbetweentworoomsandmustlocateastaticgoal,canbeseeninFig.6.4.Wedisplayheretheversionofthereward-predictiveRewPredrepresentationfromthepreviouschapterthatistrainedonsmoothedrewards.
FortheACKTRpolicy,boththeLARPandtheGrICAlearningcurves222SeethepreviouschapterfortheRewPredresultsareconsiderablylowerthanthoseoftheothertwo,withtheGrICArepresentationagainlookingslightlybetteroutofthetwo,bothwhenthemeanrewardandtheminimumepisodelengthsareconsidered.ThisisprobablyalsoduetothefactthattheadvantageoftheLARPrepresentationdisappearswhenthepredictorthatitistrainedwithisdiscardedandtherepresentationisbeingusedinamodel-freesetting.ForthePPO2policy,neitherthepoliciesthattrainontheLARPnortheGrICArepresentationsshowsignsoflearningtosolvetheenvironment.
Four-roomenvironment
Inthisexperiment,nocombinationofRLagentandrepresentationshowsprogresstowardsolvingtheenvironment,exceptforRewPredpairedwithACKTR.BoththeLARPandGrICAvariantsbarelydisplaylearninginthepreviousexperiment,butallsignsofprogressdisappearwhentheenvironmentissufficientlycomplex.ThetrainingresultscanbeseeninFig.6.5.
6.3.3 Obstacleavoidance
Thealgorithm’slearningcurvesfortheobstacleavoidanceexperimentcanbeseeninFig.6.6.ForbothPPO2andACKTR,weobservethatpreprocessingwithourmethodsisnotbeneficialastheenvironmentissolvedthefastestusingdeepRL,andbyasignificantmarginforACKTR.ThisshowsthatourmethodisoutperformedbyregulardeepRLforenvironmentswhereshort-term,reactivepoliciesareneeded–incontrasttothegoal-findingtasks,wheretheagentneedstoexecutealong-termplan.
6.4 Conclusion
Inthischapter,weinvestigatedhowthethreemethods,thatwehavedevelopedinthisPhDwork,comparewhentheyareusedforpreprocessingvisualinputsforRLagents.TheRLagentsaretestedinfourdifferentenvironments,twoofwhichrequireshort-termdecision-makingandtheothertworequirelong-termdecision-makingforsuccess.OurrepresentationthatwasdevelopedinChapter3,GrICA,doesnotfacilitatelearningforanyoftheenvironments.ThesameholdstruefortheChapter4representation,LARP,althoughthatonewasnotdevelopedforpreprocessinginputstoRLagentsbutrathertobeusedinconjunctionwithapredictionfunctionforplanninginalatentrepresentationspace.Ourreward-predictiverepresentationfromChapter5isshowntospeeduplearningforadeepRLagentonthelong-termplanningtasks.
Chapter 7 Summaryandconclusion
Justasthecomplexityofthetasksthatdeepreinforcementlearning(RL)isbeingappliedtoincreases,sodoestheapparentnumberofproblemsthefieldhas.ThegoalofthisresearchwastodiscoverusefulrepresentationsthatcanbeusedtoalleviatetwoofmoderndeepRL’sproblems,namely,thoseoftraininginstabilityanddatainefficiency.OverthecourseofthePhDwork,wedevelopedthreerepresentationlearningmethodsandinvestigatedtheirsuitabilityforfulfillingthisgoal:
-
Agradient-basedICAmethodforlearningstatisticallyindependentfeatures(Chapter3).Weestimatethemutualinformationbetweenoneoutputcomponentofanencoderandalltheothersusinganeuralnetwork.Inapush-pullfashion,themutualinformationestimatorandtheencoderaretrained,resultinginasystemthatoutputsstatisticallyindependentfeaturesoftheinput.TheoutputofGrICAiscomparetotheoutputofFastICA,anestablishedICAproblem-solvingmethodintheliterature,fornoisyblindsignalseparation.
-
Thissystem,whichwecall"LatentRepresentationPredictor"(Chapter4),learnsatransitionmodeloftheenvironment.Weevaluateoursystembycombiningitwithgraphsearchtomanipulatetoyobjectstomatchagivenviewpoint.Ouralgorithmlearnsastaterepresentationjointlywithaone-steplookaheadpredictor.Wediscussandcomparethreedifferentconstraintsthatcanbeplacedonthesystemtopreventthesolutionfromcollapsingtoaconstantfunction.OurapproachoutperformsdeepRLinalow-dataregimeontheviewpoint-matchingtask.
-
Areward-predictiverepresentation,thatislearnedalongwithajointlylearnedrewardpredictor(Chapter5).Therewardpredictorisemployedforrewardshaping:Theagentisrewardedformovingfromstatesoflowpredictedrewardstostatesofhigherpredictedrewards.Themethodistestedinseveralgridworldenvironmentswheretheagentmustreachagoal.ThelearningofdeepRLagentsisspedupwhentheirinputsarepreprocessedbytheRewPredrepresentationinourexperimentsandtherewardshapingishelpfulwhentheenvironmentissufficientlycomplex.
NoteveryunsupervisedlearningtechniqueyieldsastaterepresentationthatisusefulinthecontextofRL,asseenbytheresultsacrosstheboardusingourGrICAalgorithm.However,basedonourresults,wehavefoundthatthetrainingofdeepRLmethodscanbeaugmentedbylearningappropriaterepresentationsinaself-supervisedmannerinenvironmentswheretheagentmustcarryoutlong-termplanningtoreachasinglegoalstate.AnimportantlimitationofthisworkisthehighcostofcomputationalresourcesrequiredforcarryingoutRLresearch.Toillustrate,reproducingDeepMind’s2017Gopaper(silver2017mastering)isestimatedtocost35milliondollarsusingGoogle’scloudcomputingservice(dan2018how)–abudgetcurrentlyunavailabletoPhDstudents.Reproducingtheirresultsonlyinvolvesmaximizingtheperformanceofasinglemodel,whichdoesnottakeintoaccounttheworkbehindfindingthefinalmodel.RLmethodshavemanypotentialsettingstotune,andrunninghyperparametersearchesiscostly.Forthisreason,weconcentratedonlyonusingthedefaultparametersofRLmodelimplementationsasofferedbytheStableBaselinespackage.Tomakemattersworse,theperformanceofRLtechniquescanbehighlysensitivetothehyperparametersettings,andbadluckcancausetheresearchertodismissamethodifunfortunatevaluesareinitiallychosen.Alargercomputationalbudgetallowsresearcherstotrymorecombinationsofparameterswhennewmethodsaretested,lesseningtheriskofpromisingapproachesbeingprematurelydismissedduetobadluck.ThemethodswerealsotestedforpreprocessingthevisualinputsofdeepRLalgorithmsinenvironmentswherethereisasetofstatethattheagentmustavoid,namely,preventingapolefromfallingoffacartorpreventingtheagentfromcollidingwithrandomlymovingobjects.Preprocessingtheinputsusingourmethodsisnotbeneficialinthesecases,potentiallyindicatingthatallowingdeepRLalgorithmstolearnthestaterepresentationsispreferablefortasksthatrequireimmediatereactionstoquicklychangingenvironments,particularlywhereinactionleadstoanundesirableoutcome.Althoughourgradient-basedICAwasnotfoundadvantageousinourgoalofaugmentingdeepRLtraining,itissuccessfulinrecoveringindependent,noisysourcesjustaswellasFastICA.Apromisingavenueofresearchistotakeadvantageoftheflexibilityofferedbyourmethod.Ouralgorithmcanbepairedwithanydifferentiablefunction,suchasconvolutionalneuralnetworks(convnets),totacklemoredifficultproblems,forexamplenonlinearICA.EventhoughgeneralnonlinearICAproblemsareill-posed,regularizationcanbeappliedtomakethemwell-posed.Thiscaninprinciplebedoneusingourmethodviathedesignoftheneuralnetwork.Oneaspectofrepresentationlearningthathasbecomepopularinrecentyears,butwasnotinvestigatedinthisPhDwork,istheapplicationofmemorymodulesinneuralnetworks.Theroleofmemorycaneasilybeintegratedintoourmethodsbydesigningourfunctionapproximatorsasartificialrecurrentneuralnetworks,forexamplebycombininglongshort-termmemorynetworkswithconvnets.AlthoughtheproblemoflearningrepresentationsinthecontextofRLremainsunsolved,theimplicationofourworkisthattakingadvantageoftherichunsupervisedsourceofsupervisionthatishiddeninRLenvironmentscanleadtodata-efficient,stablealgorithmsthatareresilienttochangesintheenvironment.AstraightforwardwayofextendingourworkwouldbetotakeadvantageofthefulldatathatisgivenineverytransitioninanRLenvironment:thecurrentstate,theactiontakeninthestate,theresultingnextstateandtherewardgiventotheagentbytheenvironment.ItwouldbeaninterestingfuturelineofworktodothisbysimplycombiningthelossfunctionofLARP,whichtrainsontuples,andRewPred,whichtrainsontuples.Thisnewalgorithmwouldtraintherepresentationtobesimultaneouslypredictableforarepresentationpredictor,butalsotobeusefulforarewardpredictor.SeeFig.7.1foraVenndiagramthatsummarizesthesourcesofsupervisionthatourproposedextensionwoulduseinadditiontothesourcesofsupervisionthatareusedbyourmethods.
Thisdissertationaddstoarapidlygrowingbodyofknowledgethatsitsattheintersectionofdeeplearning,reinforcementlearningandrepresentationlearning.Wecontributethreenovelmethodsoflearningstaterepresentationstotheliteratureandexperimentallyevaluatetheireffectivenessinthecontextofreinforcementlearning.Alongroadtoartificialintelligencethatmatchesourgeneralandefficientproblem-solvingcapabilitystillliesaheadofus.\pdfbookmark[0]Bibliographybibliography
Bibliography
Appendix A
[1]Nomenclaturenomenclature\nomenclature[C]StateSpace\nomenclature[C]ActionSpace\nomenclature[C]RealNumbers\nomenclature[S]RLReinforcementLearning\nomenclature[S]MDPMarkovDecisionProcess\nomenclature[S]POMDPPartiallyObservableMDP\nomenclature[S]MLPMultilayerPerceptron\nomenclature[S]ANNArtificialNeuralNetwork\nomenclature[S]DLDeepLearning\nomenclature[S]PCAPrincipalComponentAnalysis\nomenclature[S]ICAIndependentComponentAnalysis\nomenclature[S]SFSuccessorFeatures\nomenclature[S]GrICAOurGradient-basedICAAlgorithm(fromChapter4)\nomenclature[S]LARPLatentRepresentationPrediction(fromchapter5)\nomenclature[S]RewPredOurReward-predictiveRepresentation(fromChapter6)\nomenclature[S]MINEMutualInformationNeuralEstimation\nomenclature[S]CAEVariationalAutoencoder\nomenclature[S]VAEConvolutionalAutoencoder\nomenclature[S]LEMLaplacianEigenmaps\nomenclature[S]Conv.Convolutional\nomenclature[S]ReLURectifiedLinearUnit\nomenclature[S]t-SNEt-distributedStochasticNeighborEmbedding\nomenclature[S]PPOProximalPolicyOptimization\nomenclature[S]DQNDeepQ-Network\nomenclature[S]ACKTRActorCriticusingKronecker-FactoredTrustRegion\nomenclature[S]MSEMeanSquaredError\nomenclature[S]MBRLModel-basedReinforcementLearning\nomenclature[S]MBRLModel-freeReinforcementLearning\nomenclature[V]Representation\nomenclature[V]Learningrateparameter\nomenclature[V]Policyfunction\nomenclature[V]OptimalPolicy\nomenclature[V]Theparametersofadifferentiablefunction\nomenclature[C]StateTransition\nomenclature[C]Probability\nomenclature[C]Rewardfunction\nomenclature[C]state-valuefunctionof\nomenclature[C]action-valuefunctionof\nomenclature[C]Dataset\nomenclature[V]Observationspace\nomenclature[C]Observationfunction\nomenclature[C]Timestepindex\nomenclature[C]Action\nomenclature[C]State\nomenclature[C]Observation\nomenclature[C]Reward\nomenclature[V]Discountfactor\nomenclature[C]Lossfunction\printnomenclature