
Visual processing in context of reinforcement learning

Hlynur Davíð Hlynsson

A thesis presented for the degree of Doctor of Engineering

Faculty of Electrical Engineering and Information Technology

Ruhr-University Bochum

Germany

July 2021

Visual processing in context of reinforcement learning

Dissertation for the degree of Doctor of Engineering of the Faculty of Electrical Engineering and Information Technology at the Ruhr-Universität Bochum

Name of the author:	Hlynur Davíð Hlynsson
Place of birth:	Reykjavík
First supervisor:	Prof. Dr. Laurenz Wiskott
	Ruhr-Universität Bochum, Germany
Second supervisor:	Prof. Dr. Tobias Glasmachers
	Ruhr-Universität Bochum, Germany
Year of thesis submission:	2021
Date of the oral examination:	March 17th 2022

Abstract

Although deep reinforcement learning (RL) has recently enjoyed many successes, its methods are still data inefficient, which makes solving numerous problems prohibitively expensive in terms of data. We aim to remedy this by taking advantage of the rich supervisory signal in unlabeled data for learning state representations. This thesis introduces three different representation learning algorithms that have access to different subsets of the data sources that traditional RL algorithms use: (i) GrICA is inspired by independent component analysis (ICA) and trains a deep neural network to output statistically independent features of the input. GrICA does so by minimizing the mutual information between each feature and the other features. Additionally, GrICA only requires an unsorted collection of environment states. (ii) Latent Representation Prediction (LARP) requires more context: in addition to requiring a state as an input, it also needs the previous state and an action that connects them. This method learns state representations by predicting the representation of the environment’s next state given a current state and action. The predictor is used with a graph search algorithm. (iii) RewPred learns a state representation by training a deep neural network to learn a smoothed version of the reward function. The representation is used for preprocessing inputs to deep RL, while the reward predictor is used for reward shaping. This method needs only state-reward pairs from the environment for learning the representation. We discover that every method has its strengths and weaknesses, and conclude from our experiments that including unsupervised representation learning in RL problem-solving pipelines can speed up learning.


Kurzfassung der Dissertation (Summary of the Dissertation)

Although deep reinforcement learning (RL) has achieved great successes in recent years, its methods are still data inefficient, which makes solving many problems prohibitively expensive. We investigate the possibility of remedying this by using the information-rich supervisory signal in unlabeled data for learning state representations. This thesis presents three different representation learning algorithms that have access to different subsets of the data sources that conventional RL algorithms use for learning: (i) GrICA is inspired by independent component analysis (ICA) and trains a deep neural network to output statistically independent components of the input. GrICA minimizes the mutual information of each individual feature with the respective other features. In addition, GrICA requires only an unsorted collection of environment states. (ii) Latent Representation Prediction (LARP) requires more context data: as input, it needs, in addition to a state, the corresponding previous state and an action that connects them. The method learns state representations by predicting the representation of the environment’s next state from a current state and a current action. The predictor is used together with a graph search algorithm. (iii) RewPred learns the state representation by training a deep neural network to learn a smoothed version of the reward function. The representation is used for preprocessing inputs in deep RL, while the reward predictor serves for reward shaping. This method needs only state-reward pairs from the environment to learn the representation. We find that every method has its strengths and weaknesses, and conclude from our experiments that including unsupervised representation learning in RL problem-solving pipelines can speed up learning.


Acknowledgements

I want to first thank my supervisor Prof. Laurenz Wiskott for giving me the opportunity to research the niches of machine learning that I find interesting. His insightful feedback and outstanding enthusiasm and intuition for the field proved invaluable to me and others in this fast-growing area of research. The advice of my second supervisor, Tobias Glasmachers, also proved extremely helpful, especially on the topic of reinforcement learning. I’m grateful for the endless love and support from my partner Lisa Schmitz, who made the time of my PhD studies the best in my life, so far. My special thanks go also to her parents, Rosemarie Schmitz and Georg Schmitz, for their support during this time. I’m thankful for the helpful and pleasant environment created by the other PhD students of the Institut für Neuroinformatik (INI): Merlin Schüler, Robin Schiewer, Zahra Fayyaz, Eddie Seabrook, Mortiz Lange, Frederick Baucks, Jan Bollenbacher and Jan Tekülve. I would also like to thank two alumni of the INI, Alberto Escalante and Fabian Schönfeld, for the support they have given me. I also want to thank the INI staff outside the group for their help over the years: Arno Berg, Angelika Wille and Kathleen Schmidt. Last but not least, I want to thank my loving family for creating the circumstances that gave me room to train the skills that I needed to pursue a PhD to begin with: Ólöf Ingibjörg Einarsdóttir, Hlynur Höskuldsson, Ólafur Hlynsson and Höskuldur Hlynsson.

1 Introduction
2 Background
3 Learning gradient-based ICA by neurally estimating mutual information
4 Latent representation prediction networks
5 Reward prediction for representation learning and reward shaping
6 Comparison of our three methods
7 Summary and conclusion
A

List of Figures

2.1 Illustration: A fully-connected neural network
2.2 Illustration: A convolution layer
(a) Input with (height $\times$ width $\times$ depth) dimensions of ($3 \times 3 \times 1$).
(b) Input with (height $\times$ width $\times$ depth) dimensions of ($3 \times 3 \times 2$).
2.3 Example: Dimensionality reduction by PCA and t-SNE
2.4 Illustration: An autoencoder
3.1 Example: The lava field environment
3.2 Illustration: Our independent feature learning system
3.3 Result: Noisy signal recovery
(a) The original sources.
(b) Linear mixture of sources.
(c) Sources recovered by our method.
(d) Sources recovered by FastICA.
3.4 Result: Reward during training on the lava field environment
3.5 Result: Trajectories in the lava field environment
3.6 Result: Trajectories with shifted lava fields
3.7 Result: Reconstruction by autoencoder
3.8 Result: Reward during training on the lava field environment (convolutional autoencoder)
4.1 Illustration: Latent representation prediction network
4.2 Illustration: Predictive representation learning with sphering regularization
4.3 Illustration: Predictive representation learning with contrastive loss regularization
4.4 Illustration: Predictive representation learning with decoder loss regularization
4.5 Result: Laplacian Eigenmap representation space of a NORB toy
4.6 Result: Heatmap of Laplacian eigenmap latent space similarity
4.7 Result: Heatmap of VGG16 latent space similarity
4.8 Result: Aggregate heatmaps of VGG16 representation similarities on test data
4.9 Result: Histograms of elevation-wise and azimuth-wise VGG16 errors
4.10 Result: LARP reinforcement learning comparison
4.11 Result: LARP re-training after placing obstacles in a checkerboard pattern
4.12 Example: Toys for transfer learning experiments
4.13 Result: LARP latent space visualization
(a) LARP elevation t-SNE
(b) CAE elevation t-SNE
(c) VGG16 elevation t-SNE
(d) LARP azimuth t-SNE
(e) CAE azimuth t-SNE
(f) VGG16 azimuth t-SNE
(g) LARP lighting t-SNE
(h) CAE lighting t-SNE
(i) VGG16 lighting t-SNE
5.1 Illustration: Reward-maximizing vs. reward-predictive representations
5.2 Illustration: Learning and using the representation
5.3 Example: The two-room environment
(a) Full world states.
(b) Agent’s point of view.
(c) Goal observations.
5.4 Example: The lava gap environment
5.5 Example: Four-room environment
5.6 Result: Predicted rewards, two-room environment
(a) Raw reward prediction.
(b) Smoothed reward prediction.
5.7 Result: Two-room environment
5.8 Result: Predicted rewards, lava gap
5.9 Result: Lava gap experiment
5.10 Result: Re-learning experiment
5.11 Result: Lava gap trajectories
(a) Six successful episodes.
(b) Six failed episodes.
5.12 Result: Full four-room environment
5.13 Result: Four-room trajectories
(a) Three successful trajectories
(b) Three failed trajectories
6.1 Example: Visual cart-pole
6.2 Example: Obstacle avoidance environment
6.3 Results: Visual cart-pole
6.4 Results: Gridworld comparison, single goal location
6.5 Results: Gridworld comparison, four-room goal-finding
6.6 Results: Obstacle avoidance
7.1 Illustration: Supervision source Venn diagram

List of Tables

1.1 Result: Input data type per method
3.1 Result: Our convolutional ICA network
3.2 Result: Convolutional autoencoder network architecture
4.1 Result: LARP representation network architecture
4.2 Result: LARP regularizing decoder architecture
4.3 Result: LARP representation predictor architecture
4.4 Result: Ablation study of the representation dimensionality
4.5 Result: LARP transfer learning performance
5.1 Result: Representation network
5.2 Result: Policy network
5.3 Result: Reward prediction network

Chapter 1 Introduction

Mankind has been interested in the concept of infusing inanimate objects with its intellect for thousands of years, as the long history of stories about artificially intelligent beings attests. This interest manifests itself, for example, in the story told by the ancient Greeks of the great automaton Talos, who was crafted out of bronze by the smithing god Hephaestus to protect the mythological queen Europa (rhodios2008argonautika). Automata have been built by craftsmen from different cultures throughout the ages, but they have been simple mechanical beings (mccorduck1979machines). The possibility of satisfying the human desire to craft intelligent beings arose only in the middle of the 20th century with the founding of artificial intelligence (AI) as an academic discipline. The vast scope of AI has given rise to different fields, each with its own application domains, methodologies and philosophies. One such example is machine learning (ML), an area of AI that is concerned with algorithms that leverage data for decision-making. This field has experienced great success in both academia and industry in the last decade with increased access to powerful computers and large databases in addition to a deluge of advanced computational techniques and flexible software solutions (clark20152015).

The field is not without its drawbacks, however. Machine learning algorithms need data corresponding to months or years of human experience to become competent at tasks that a person can master in minutes. People have the advantage that they come equipped with a stronger understanding of the world. In this dissertation, we aim to level the playing field for ML models that perform sequential decision-making by exploring different ways for them to "understand" the world through representations.

In Section 1.1, we discuss one of the most promising fields of artificial intelligence, deep reinforcement learning, and mention its victories. The current situation is evaluated in Section 1.2, where we describe the challenges and disadvantages of the field. The value of the research direction we put forward is underlined in Section 1.3, together with our research objective and hypothesis. The chapter concludes with Section 1.4, where we outline the order of content in the dissertation.

1.1 Deep reinforcement learning

A watershed moment for artificial intelligence happened when krizhevsky2012imagenet combined several techniques from the literature and constructed a deep neural network (a neural network that processes its input hierarchically using more than two computational layers) to outperform the competition by a significant margin in an image classification contest. This was the catalyst of the so-called deep learning revolution (sejnowski2018deep), which has impacted fields such as natural language processing (wolf2020transformers), bioinformatics (li2019deep), computer vision (khan2018guide), fraud detection and many others (alom2019state). The area of machine learning concerned with the training of deep neural networks is called deep learning. Deep learning methods have the advantage that they autonomously learn patterns in the data in a hierarchical manner. For example, edges are useful patterns for pictures of shapes such as squares, triangles and circles (patrick2010ai). The edges can be combined into corners, and the number of corners can be counted to distinguish between the different shapes.

Since deep learning encompasses a broad set of machine learning algorithms, it can be readily combined with other areas of machine learning. One such area is reinforcement learning, where general goal-directed decision-making problems are studied. Reinforcement learning (RL) methods train models that interact sequentially with their environments to maximize a reward signal. Deep learning is frequently combined with RL techniques, allowing the models to map inputs, such as high-dimensional image data, directly to actions. This combination has yielded promising results in different areas, ranging from recommender systems (zhang2019deep) and autonomous driving (kiran2021deep) to playing games (silver2017mastering).

1.2 Open problems

A known problem of deep neural networks is that they require a large amount of data for adequate performance. This data can be prohibitively expensive to obtain (either in terms of the time needed to create data and train the models for reinforcement learning methods, or the monetary cost of acquiring human-labeled training data for supervised learning methods), which has encouraged the development of methods that learn representations from streams of more readily available, unlabeled data. This problem increases in severity in deep reinforcement learning (DRL). In the uncompromisingly titled blog post Deep reinforcement learning doesn’t work yet, irpan2018deep identifies several fundamental problems of DRL. One of them is sample inefficiency: many highly publicized state-of-the-art results on video games require hundreds of millions of frames of experience to achieve performance that humans reach in a matter of minutes.

Another problem is instability. Deep neural networks are highly expressive and optimize large numbers of parameters. This makes the design of DRL models difficult, as the search for hyperparameters that solve the problem can be quite time-consuming. (A hyperparameter is, broadly speaking, any design choice made by the programmer before the learning of the "regular" parameters starts.) Even when a promising set of hyperparameters is found, the difference between the performance of different models learned from scratch can be significant, depending on the random seed. This increased variance comes from the new source of randomness that is introduced to RL models, compared to regular regression learning: the agent’s actions are stochastic, especially at the beginning of learning. (There is a tradeoff between exploring the environment and exploiting the expected reward signal. A common strategy for RL agents is to start the learning with a high chance of performing random actions to explore different states of the environment and then decrease this chance as the learning progresses.)

1.3 Research aim

In this dissertation, we propose methods for unsupervised and self-supervised learning of representations for goal-directed behavior. Self-supervised learning methods use a subset of the input to predict the rest of it, forgoing the need for annotations while taking advantage of the powerful machinery of supervised learning methods. Our intention with this thesis is to tackle the open problems of data inefficiency and instability outlined above in order to further the field. We do so by developing and investigating three different approaches: (i) unsupervised learning of a representation for RL agents, (ii) a method of jointly learning a predictor for planning and a representation that is good for the transition prediction, and (iii) learning a representation for RL agents as the byproduct of reward prediction. We relate the data needed to learn the representations for our methods to the available data in the context of RL in Table 1.1. Our hypothesis is that suitable state representations that reduce the complexity of high-dimensional inputs in RL settings can support more stable and data-efficient learning than having deep RL algorithms learn state representations from scratch.

Subset of $\{s, a, r, s^{'}\}$ required for learning	Method	Chapter
$\{s\}$	GrICA	3
$\{s, a, s^{'}\}$	LARP	4
$\{s, r, s^{'}\}$	RewPred	5

Table 1.1: Input data type per method. If an RL agent is in a state $s$ and performs the action $a$, it will receive a reward of $r$ and transition to the state $s^{'}$. The methods proposed in this thesis learn state representations by processing different subsets of the data tuples $\{s, a, r, s^{'}\}$.
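The relationship in Table 1.1 can be illustrated with a minimal sketch. It assumes transitions are stored as simple tuples; the names `Transition` and `project` and the string-valued states are hypothetical and only serve the illustration (in the thesis the states are images).

```python
from typing import NamedTuple


class Transition(NamedTuple):
    """One interaction step (s, a, r, s') collected by an RL agent."""
    s: str        # current state (a placeholder label here)
    a: int        # action taken
    r: float      # reward received
    s_next: str   # resulting state


def project(transitions, fields):
    """Keep only the named fields of every transition."""
    return [tuple(getattr(t, f) for f in fields) for t in transitions]


transitions = [
    Transition("s0", 1, 0.0, "s1"),
    Transition("s1", 0, 1.0, "s2"),
]

# Subsets listed in Table 1.1 for each method.
grica_data = project(transitions, ["s"])
larp_data = project(transitions, ["s", "a", "s_next"])
rewpred_data = project(transitions, ["s", "r", "s_next"])
```

Each method then trains only on its own projection of the same experience buffer.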

1.4 Thesis outline

Here we outline the structure of the thesis. Three of the chapters are adapted from work that was published over the course of the doctoral work.

Chapter 2: Background. In this chapter, we go into further detail on the main topics of this thesis and discuss their fundamentals. We introduce the formalism of Markov decision processes and explain the difference between model-based and model-free reinforcement learning algorithms. The machinery behind deep learning is then explained and the main building blocks of deep neural networks are illustrated. The chapter concludes with a discussion of the main representation learning methods.
Chapter 3: Learning gradient-based ICA by neurally estimating mutual information. This chapter discusses an adaptation of independent component analysis (ICA) for deep learning. We introduce a novel application of a neural method for mutual information estimation to learn a representation with statistically independent features. The chapter is an adapted version of
- hlynsson2019learning
Chapter 4: Latent representation prediction networks. This chapter discusses a method for manipulable environments for jointly learning a representation of observations and a model for predicting the next representation, given an action. We learn the representation in a self-supervised manner, without the need for a reward signal. We introduce a new environment that is akin to manipulating toy objects for a viewpoint matching task. The representation is combined with a graph-search algorithm to find the goal viewpoint. The chapter is an adapted version of
- hlynsson2020latent
Chapter 5: Reward prediction for representation learning and reward shaping. This chapter discusses a self-supervised learning method to map high-dimensional inputs to a lower-dimensional space for RL agents. We introduce a technique where a representation learned for a reward predictor is used to shape the reward for the agents. The chapter is an adapted version of
- hlynsson2021reward
Chapter 6: Comparison of our methods. In this chapter, we directly compare the three different methods to state-of-the-art deep RL methods on four different environments: a visual pole-balancing environment, two goal-finding environments and an obstacle avoidance environment.
Chapter 7: Summary and conclusion. This chapter closes the dissertation with a brief summary of the thesis, concluding remarks and possible future work.

The following work was also published over the course of the doctoral studies:
- hlynsson2019measuring
The paper compares supervised learning methods, but it is too dissimilar in topic from the rest of the work and is thus omitted from this dissertation.

Chapter 2 Background

This chapter lays out the fundamental concepts of machine learning that form the focal point of the rest of the dissertation. In Section 2.1, we lay out the main object of study in reinforcement learning (RL), partially observable Markov decision processes, and present a brief taxonomy of reinforcement learning algorithms. We explain the basics of artificial neural networks and deep learning in Section 2.2. The type of neural network most commonly used for processing visual data, the convolutional neural network, is then described, along with some of its main building blocks. In Section 2.3, we discuss representation learning (also known as feature learning) and motivate it in the context of reinforcement learning. For a more in-depth discussion of these topics, we refer the reader to the comprehensive textbook on RL by sutton2018reinforcement, the deep learning book by Goodfellow-et-al-2016 and the excellent survey by bengio2013representation on representation learning.

2.1 Reinforcement learning

In this section, we formalize RL for the rest of the thesis. RL is one of the main disciplines of machine learning, and it covers how agents can learn to behave optimally in an environment to maximize a cumulative reward.

2.1.1 Partially observable Markov decision processes

A partially-observable Markov decision process (POMDP) is a general framework for modeling sequential decision processes in environments that can be stochastic, complex and contain hidden information. Formally, it is a tuple

(S, A, P, R, P (s_{0}), Ω, O, γ)

(2.1)

which we also refer to as the environment. The tuple is made up of the following elements:

The state space $S$ defines the possible configurations of the environment
The action space $A$ describes how the agent is able to interact with the environment
The transition function $P : S \times A \to P (S)$ dictates the effects of different actions in different states
The reward function $R : S \times A \times S \to \mathbb{R}$ determines the immediate reward given to the agent for transitioning between any two states with any action
The initial state distribution $P (s_{0})$
The observation space $Ω$ defines the aspects of the environment that the agent can perceive
The observation function $O : S \times A \to P (Ω)$ defines what (potentially transformed) subset of the environment the agent receives after acting in a given state
The reward discount factor $γ$

The environment starts in a state drawn from $P (s_{0})$, from which the agent interacts sequentially with the environment by choosing an action $a_{t}$ from the action space $A$ at each time step $t$. The agent receives an observation $o_{t}$ and a reward $r_{t}$ after each action. The objective of an RL agent is to learn a policy $π$ that determines the behavior of the agent in the environment by mapping states to a probability distribution over $A$, written $π (a, s) = P (a_{t} = a | s_{t} = s)$. A discount factor $γ \in (0, 1)$ is usually included in the definition of POMDPs, and it comes into play in the optimization objective of the agent. Namely, the policy should maximize the expected discounted future sum of rewards, or the expected return, where the return is defined as

R = \sum_{t = 0}^{\infty} γ^{t} r_{t}

(2.2)

The value function is defined as the expectation of the return (Eq. 2.2), given a policy $π$ and an initial state $s_{0} = s$:

v_{π} (s) = E [R | s_{0} = s, π] = E [\sum_{t = 0}^{\infty} γ^{t} r_{t} | s_{0} = s, π]

(2.3)

There is at least one optimal policy $π^{*}$ that is better than or equal to all others: $v_{π^{*}} (s) \geq v_{π^{'}} (s)$ for all states $s$ and all other policies $π^{'}$.
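As a small worked illustration of the return in Eq. 2.2, the following sketch computes the discounted sum of rewards for a finite episode. The function name `discounted_return` and the example rewards are assumptions made for illustration; the infinite sum is truncated at the end of the episode.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for a finite list of rewards."""
    ret = 0.0
    # Accumulating from the last reward backwards needs one
    # multiply-add per step: ret_t = r_t + gamma * ret_{t+1}.
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret


# A reward of 1 that arrives two steps in the future is worth gamma**2 now.
episode_rewards = [0.0, 0.0, 1.0]
episode_return = discounted_return(episode_rewards, gamma=0.9)
```

Averaging such returns over many episodes that start in the same state $s$ and follow the same policy gives a Monte Carlo estimate of the value function in Eq. 2.3.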

2.1.2 Model-free algorithms

Model-free reinforcement learning learns the policy or a value function directly from experience without attempting to approximate the dynamics of the environment. Two popular classes of model-free methods are value-based methods and policy-based methods. Value-based methods approximate either the value function (sutton1988learning) or another useful function that is similar to the value function, the action-value function $q$. This function is defined as the expected return of following the policy $π$ after taking an action $a$ in a state $s$:

q_{π} (s, a) = E_{π} [R_{t} | s_{t} = s, a_{t} = a] = E_{π} [\sum_{k = 0}^{\infty} γ^{k} r_{t + k + 1} | s_{t} = s, a_{t} = a]

(2.4)

Estimating the action-value function is a pivotal step for algorithms such as Q-learning (watkins1992q). A simple one-step Q-learning updating rule is (sutton2018reinforcement):

q (s_{t}, a_{t}) \leftarrow q (s_{t}, a_{t}) + η [r_{t + 1} + γ \max_{a} q (s_{t + 1}, a) - q (s_{t}, a_{t})]

(2.5)

where $η$ is a positive learning rate parameter and the initial values of $q (s_{t}, a_{t})$ are chosen arbitrarily. Q-learning is guaranteed to converge to the optimal policy’s action-value function $q_{π^{*}}$ under certain conditions (these depend on a good learning rate schedule and exploration techniques, which are difficult to determine in practice), which in turn yields the optimal policy: $π^{*} = \arg\max_{a} q_{π^{*}} (s, a)$. This method tabulates the values and thus works with discrete action and state spaces. Q-learning has been combined with deep neural networks to work for action and state spaces of higher dimensions (mnih2015human).

Policy-based methods do not learn a value function, but rather learn the policy directly by optimizing an objective function with respect to $π$. We describe two such methods that we employ in this work: (1) proximal policy optimization (PPO) (schulman2017proximal) and (2) actor critic using Kronecker-factored trust region (ACKTR) (wu2017scalable). PPO optimizes the objective function

L^{CLIP} (θ) = \hat{E}_{t} [\min (\text{ratio}_{t} (θ) \hat{A}_{t}, \text{clip} (\text{ratio}_{t} (θ), 1 - ϵ, 1 + ϵ) \hat{A}_{t})]

(2.6)

where $ϵ$ is a hyperparameter, $\text{ratio}_{t} (θ) = \frac{π_{θ} (a_{t} | s_{t})}{π_{θ_{old}} (a_{t} | s_{t})}$ and $\hat{A}_{t}$ is an estimator of the advantage function $A = q_{π} (s, a) - v_{π} (s)$. The clip term returns $\text{ratio}_{t} (θ)$ if $1 - ϵ < \text{ratio}_{t} (θ) < 1 + ϵ$; otherwise the value is clipped to the closer boundary value. The full PPO algorithm is shown in Algorithm 1.

1: for iteration $= 1, 2, \dots$
2:     for actor $= 1, 2, \dots, n$
3:         Run policy $π_{θ_{old}}$ in environment for $t$ time steps
4:         Compute advantage estimates $\hat{A}_{1}, \dots, \hat{A}_{t}$
5:     end for
6:     Optimize $L^{CLIP}$ with respect to $θ$, with $k$ epochs and minibatch size $m \leq n t$
7:     $θ_{old} \leftarrow θ$
8: end for

Algorithm 1 Proximal policy optimization
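To make the clipped objective of Eq. 2.6 concrete, the sketch below evaluates it for given probability ratios and advantage estimates. It is a minimal illustration rather than a full PPO implementation: the function names are hypothetical, the inputs are plain Python lists, and the default $ϵ = 0.2$ is an assumed value, not one taken from this thesis.

```python
def clip(x, low, high):
    """Clamp x to the interval [low, high]."""
    return max(low, min(x, high))


def ppo_clip_objective(ratios, advantages, epsilon=0.2):
    """Empirical mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    terms = []
    for ratio, adv in zip(ratios, advantages):
        unclipped = ratio * adv
        clipped = clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * adv
        # Taking the minimum makes the objective a pessimistic bound:
        # moving the ratio far from 1 cannot increase the objective.
        terms.append(min(unclipped, clipped))
    return sum(terms) / len(terms)


# With a positive advantage, a ratio of 1.5 is capped at 1 + epsilon.
capped = ppo_clip_objective([1.5], [1.0])
# With a negative advantage, a ratio of 0.5 is raised to 1 - epsilon.
floored = ppo_clip_objective([0.5], [-1.0])
```

In step 6 of Algorithm 1, this quantity would be maximized with respect to $θ$ by a gradient-based optimizer, with the ratios recomputed from the current policy for each minibatch.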

ACKTR applies the policy gradient updates

θ \leftarrow θ - η \hat{F}^{- 1} \nabla_{θ} L

(2.7)

where $\hat{F} \approx E [\nabla_{θ} \log π (a_{t} | s_{t}) (\nabla_{θ} \log π (a_{t} | s_{t}))^{T}]$, $L$ is the log-likelihood of the output distribution of the policy and the learning rate $η = \min (η_{max}, \sqrt{\frac{2 δ}{Δ θ^{T} \hat{F} Δ θ}})$ is controlled dynamically with the trust region parameter $δ$ to prevent the policy from converging prematurely to a poor policy.
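The dynamic learning rate above can be sketched in a few lines. This is only an illustration of the formula $η = \min (η_{max}, \sqrt{2 δ / (Δ θ^{T} \hat{F} Δ θ)})$: the function names are hypothetical, $Δ θ$ is assumed to be the proposed update direction, and the matrix is passed as nested lists rather than in the Kronecker-factored form that ACKTR uses.

```python
import math


def quadratic_form(v, matrix):
    """Compute v^T M v for a vector and a square matrix (nested lists)."""
    n = len(v)
    return sum(v[i] * matrix[i][j] * v[j] for i in range(n) for j in range(n))


def trust_region_step_size(delta_theta, fisher, delta, eta_max):
    """eta = min(eta_max, sqrt(2 * delta / (dtheta^T F dtheta)))."""
    curvature = quadratic_form(delta_theta, fisher)
    # The curvature is assumed to be strictly positive here.
    return min(eta_max, math.sqrt(2.0 * delta / curvature))


fisher = [[2.0, 0.0], [0.0, 2.0]]   # toy stand-in for the matrix F
direction = [1.0, 0.0]              # toy update direction
eta = trust_region_step_size(direction, fisher, delta=0.01, eta_max=0.25)
```

A direction with large curvature under $\hat{F}$ yields a small step, which keeps the change in the policy distribution within the region set by $δ$.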

2.1.3 Model-based algorithms

Model-based reinforcement learning (MBRL) algorithms learn the optimal policy $π^{*}$ by first estimating the transition function $\tilde{P} \approx P$ and the reward function $\tilde{R} \approx R$. These functions are usually called the environment dynamics or world model and are learned in a supervised fashion from a data set of observed transitions, $D = \{(s_{t}, a_{t}, r_{t}, s_{t + 1})_{i}\}$. The world models can be used in multiple different ways, depending on the algorithm, to derive the optimal policy. For example, sampling-based planning algorithms use $\tilde{P}$ and $\tilde{R}$ to sample action sequences and calculate their expected values:

(A_{t}, \dots, A_{t + τ}) = \arg\max_{A_{t : t + τ}} E [\sum_{k = t}^{t + τ} γ^{k} \tilde{R} (s_{k}, a_{k}) | s_{t} = s, s_{t + 1}, \dots, s_{t + τ} \sim \tilde{P}]

(2.8)

The agent follows the action sequence associated with the highest expected reward in Equation 2.8. This is often combined with model-predictive control (MPC), where a new action sequence is calculated after taking the first action in the last sequence. There are different ways of choosing candidate action sequences, the simplest being the random shooting algorithm (richards2005robust), which draws the actions from a uniform distribution.
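Random shooting combined with MPC can be sketched as follows. The sketch makes several simplifying assumptions that are not from the source: the learned model is deterministic (so the expectation in Eq. 2.8 reduces to a single rollout), the action set is discrete, and `toy_model` and `toy_reward` are hypothetical stand-ins for the learned $\tilde{P}$ and $\tilde{R}$.

```python
import random


def random_shooting(state, model, reward_fn, actions, horizon,
                    n_candidates, gamma=0.99, rng=None):
    """Return the sampled action sequence with the highest predicted return."""
    rng = rng or random.Random(0)
    best_value, best_sequence = float("-inf"), None
    for _ in range(n_candidates):
        # Candidate actions are drawn uniformly at random.
        sequence = [rng.choice(actions) for _ in range(horizon)]
        s, value = state, 0.0
        for k, a in enumerate(sequence):
            value += (gamma ** k) * reward_fn(s, a)
            s = model(s, a)  # roll the model forward one step
        if value > best_value:
            best_value, best_sequence = value, sequence
    return best_sequence


def toy_model(s, a):
    """Hypothetical dynamics: the state is an integer position on a line."""
    return s + a


def toy_reward(s, a):
    """Hypothetical reward: negative distance to position 5 after the move."""
    return -abs((s + a) - 5)


# Model-predictive control: execute only the first action, then replan.
rng = random.Random(0)
state = 0
for _ in range(5):
    plan = random_shooting(state, toy_model, toy_reward, actions=[-1, 1],
                           horizon=3, n_candidates=50, rng=rng)
    state = toy_model(state, plan[0])
```

Replanning at every step lets the agent correct for model errors that accumulate over the planning horizon, at the cost of running the planner once per action.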

2.2 Deep learning

For the last few years, deep learning has been at the center stage of machine learning research. We make extensive use of deep learning in this work because of its parallelizability, its efficient scaling with large data sets and its capability to approximate complex functions. The most basic type of deep learning method is the feedforward deep network, which comprises layers of artificial neurons. The theoretical capabilities of artificial neural networks were established by cybenko1989approximation: his Universal Approximation Theorem implies that any continuous function on a compact subset of a Euclidean space can be approximated to arbitrary accuracy by a neural network with one hidden layer. Unfortunately, it is a pure existence theorem, leaving the task of constructing the network to the engineer.

2.2.1 The artificial neuron

An artificial neuron (rosenblatt1958perceptron) is a mathematical function that multiplies each input by a weight, adds a bias to the resulting linear combination and then applies a non-linearity to the outcome:

y = φ (\sum_{i = 1}^{m} w_{i} x_{i} + b)

(2.9)

The non-linearity $φ$ is known as the activation function, the coefficients $w_{i}$ are known as the weights and the term $b$ is known as the bias. We now briefly discuss some commonly used activation functions that are employed in this thesis. For a more comprehensive overview of recent trends in the usage of activation functions, we encourage the reader to look at a comparison by nwankpa2018activation. The logistic function can be used for binary classification:

φ_{logistic} (x) = \frac{1}{1 + e^{- x}}

(2.10)

This function "squashes" the inputs to lie between $0$ and $1$, giving the output a probabilistic interpretation. The softmax function is an extension of the logistic function to several classes:

φ_{softmax} (x)_{i} = \frac{e^{x_{i}}}{\sum_{j = 1}^{n} e^{x_{j}}}

(2.11)

The output of the softmax function is a vector of the same dimensionality as the input vector, and its entries sum to $1$. The hyperbolic tangent function (tanh) squashes the input to lie between $-1$ and $1$:

φ_{tanh} (x) = \frac{e^{x} - e^{- x}}{e^{x} + e^{- x}}

(2.12)

This has the computational advantage over the logistic function that biases in the gradients are avoided and 0-centered data gives rise to larger derivatives during optimization of the networks (lecun2012efficient), making it a more frequent choice as an activation function in hidden layers. The most popular nonlinearity for deep neural networks is the rectifier function

φ_{ReLU} (x) = \max (0, x)

(2.13)

The rectifier function offers the same advantages as the tanh function but at a lower cost, as evaluating exponentials and performing division is avoided. The rectifier function is also called a rectified linear unit, and it is commonly abbreviated as "ReLU".
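The four activation functions of Eqs. 2.10 to 2.13 can be written down directly. The following is a plain-Python sketch for scalar inputs (and a list input for softmax); subtracting the maximum inside `softmax` is a standard numerical-stability measure added here and is not part of Eq. 2.11.

```python
import math


def logistic(x):
    """Eq. 2.10: squashes x into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    """Eq. 2.11: maps a vector to positive entries that sum to 1."""
    # Shifting by the maximum does not change the result
    # but avoids overflow for large inputs.
    shift = max(xs)
    exps = [math.exp(x - shift) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def tanh(x):
    """Eq. 2.12: squashes x into the interval (-1, 1)."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


def relu(x):
    """Eq. 2.13: passes positive inputs through and zeroes the rest."""
    return max(0.0, x)


probabilities = softmax([1.0, 2.0, 3.0])
```

Note that only the rectifier avoids exponentials and division altogether, which is the cost advantage mentioned above.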

2.2.2 Feedforward neural networks

Computational units implementing the function in Eq. 2.9 can be arranged hierarchically, with the input of a neuron consisting of the output of other neurons. An example feedforward neural network or multilayer perceptron (MLP) is depicted in Figure 2.1. (Feedforward neural networks are sometimes loosely referred to as multi-layer perceptrons, named after an early artificial neuron model called the perceptron. However, perceptrons use a hard threshold activation function while modern MLPs can use any differentiable activation, so they are often not perceptrons in the strict meaning of the word.)

Fig 2.1: A fully-connected neural network. This network has three inputs and two layers: one hidden layer with four units and an output layer with two units.

The figure shows a network with one hidden layer, but it can in principle have any number of hidden layers. The same is true for the number of inputs and outputs.
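A forward pass through a network with the shape described in the caption of Figure 2.1 (three inputs, four hidden units, two outputs) can be sketched as below. The random weights, the zero biases, and the choice of tanh for the hidden layer and a linear output layer are assumptions made for illustration; the figure itself does not specify them.

```python
import math
import random


def dense(inputs, weights, biases, activation):
    """One fully-connected layer: every unit applies Eq. 2.9 to all inputs."""
    return [
        activation(sum(w * x for w, x in zip(unit_weights, inputs)) + b)
        for unit_weights, b in zip(weights, biases)
    ]


rng = random.Random(0)


def random_matrix(rows, cols):
    """Hypothetical initializer: uniform weights in [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Hidden layer: 4 units with 3 inputs each. Output layer: 2 units with 4 inputs.
w_hidden, b_hidden = random_matrix(4, 3), [0.0] * 4
w_out, b_out = random_matrix(2, 4), [0.0] * 2

x = [0.5, -1.0, 2.0]
hidden = dense(x, w_hidden, b_hidden, math.tanh)
output = dense(hidden, w_out, b_out, lambda z: z)  # linear output layer
```

Adding further hidden layers amounts to chaining more `dense` calls, each taking the previous layer's output as its input.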

2.2.3 Optimizing neural networks

Training a deep neural network involves training data and a loss function. An appropriate loss function has to be found to match both the task at hand and the final layer’s activation function. The loss function measures the difference between the output of the network, when the data is passed through it, and the desired outcome. The parameters $θ$ of the network are then adjusted toward the optimal $θ^{*}$ that minimize the loss function over the data

θ^{*} = \arg\min_{θ} \sum_{i = 1}^{n} L (f_{θ} (x_{i}), y_{i})

(2.14)

where $n$ is the number of data points and $f_{θ} (x_{i})$ is the prediction of a neural network with parameters $θ$ for sample $x_{i}$ with the true value $y_{i}$. The quantity $\sum_{i = 1}^{n} L (f_{θ} (x_{i}), y_{i})$ is also known as the empirical risk. One example of a loss function is the mean-squared error (MSE) loss

L_{MSE} (f_{θ} (x_{i}), y_{i}) = (y_{i} - f_{θ} (x_{i}))^{2}

(2.15)

Maximizing the likelihood of Gaussian data with respect to the parameters of the assumed model that generated the data is equivalent to minimizing the MSE, making it a popular choice for regression tasks (i.e. when the output layer activation is linear or ReLU). For classification networks with a logistic or softmax output activation, a suitable loss function is the cross-entropy loss function

L_{CE} (f_{θ} (x_{i}), y_{i}) = - \sum_{j = 1}^{C} y_{i j} \cdot \log (f_{θ} (x_{i})_{j})

(2.16)

where $C$ is the number of classes (if $C = 3$, then the label vector could, for example, take the form $y_{1} = (0, 1, 0)$). Similarly to MSE, this loss function is also motivated by the fact that minimizing the cross-entropy loss is equivalent to maximizing the likelihood of categorically distributed i.i.d. data (yao2019negative).

So far, the loss functions we have seen require a label $y_{i}$ as a part of the input. This makes them supervised learning losses. Many commonly used loss functions exist that do not require labels; those are called unsupervised learning losses.

Once we have decided on a loss function to minimize, the next step is to choose the optimization algorithm. The most popular ones are implemented in software libraries such as Keras (chollet2015keras), MXNet (chen2015mxnet), TensorFlow (abadi2016tensorflow), PyTorch (NEURIPS2019_9015) and several others. The most common way of training deep neural networks is by employing a variation of the gradient descent (curry1944method) algorithm. Gradient descent methods take advantage of the fact that a function decreases the fastest in the negative direction of its gradient, and they repeatedly step in that direction until they converge at a local minimum. An algorithm called backpropagation (linnainmaa1970representation) computes the gradient of the loss function with respect to each parameter (e.g. weights and biases) via the chain rule from calculus. These gradients are then used for an update step for each parameter:

θ^{[i + 1]} \leftarrow θ^{[i]} - η \frac{\partial L}{\partial θ^{[i]}}

(2.17)

where $i$ keeps track of the index of the iteration. The parameter $η$ is known as the learning rate of the optimization algorithm. The classical gradient descent method in Equation 2.17 calculates the average loss over the entire data set. This can be made faster, without losing convergence guarantees, by performing a weight update using the gradient from only a subset of the training data (a training batch) in each iteration. This stochastic approximation of gradient descent is called stochastic gradient descent.

A good learning rate is important for the practical convergence of stochastic gradient descent: if it is too small, then the time it takes to converge can be too long. However, if it is too large, then there is a risk of overshooting the local minima. It is generally good to start off with a larger learning rate and then make it smaller with time. Determining exactly when to decrease the size of the learning rate, and by how much, can be laborious in practice. For this reason, several gradient descent methods have been proposed that automatically find this learning rate schedule with adaptive learning rates, for example RMSprop (tieleman2012lecture) and Adam (kingma2014adam).
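The update rule of Eq. 2.17 applied to mini-batches can be illustrated with a model small enough to differentiate by hand. The sketch below fits a line $f_{θ} (x) = w x + b$ with the MSE loss of Eq. 2.15. The function name, the learning rate, the batch size and the synthetic data are all assumptions chosen for illustration; real networks would obtain the gradients through backpropagation in one of the libraries mentioned above.

```python
import random


def sgd_linear_regression(xs, ys, learning_rate=0.05, epochs=200,
                          batch_size=4, seed=0):
    """Fit y = w * x + b by stochastic gradient descent on the MSE loss."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # a new random split into batches every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradients of the batch-averaged squared error w.r.t. w and b.
            grad_w = sum(-2.0 * (ys[i] - (w * xs[i] + b)) * xs[i]
                         for i in batch) / len(batch)
            grad_b = sum(-2.0 * (ys[i] - (w * xs[i] + b))
                         for i in batch) / len(batch)
            # Eq. 2.17: step against the gradient, scaled by the learning rate.
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


# Noise-free synthetic data generated from y = 3x + 1.
xs = [i / 10 for i in range(20)]
ys = [3.0 * x + 1.0 for x in xs]
w, b = sgd_linear_regression(xs, ys)
```

Each update uses only four of the twenty samples, so an epoch costs the same as one full-batch gradient step but performs five parameter updates.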

2.2.4 Convolutional neural networks

The network in Fig. 2.1 is a fully-connected or dense neural network, because every unit is connected to every unit in the preceding layer. There are other, more specialized, neural networks that are not fully-connected, one of the most important classes being convolutional neural networks. For input data with a spatial structure, for example images, convolutional neural networks are very efficient. In contrast to dense neural networks, each unit in convolutional neural networks only receives as input a subset of the outputs from the previous layer. More specifically, each unit only receives inputs from units that are in spatial proximity of one another. Waldo Tobler's First Law of Geography captures succinctly the motivation behind convolutional networks (tobler1970computer): "everything is related to everything else, but near things are more related than distant things". Another key property of convolutional neural networks is that of shared weights: each computational unit in the same layer has the same set of weights, even though they receive different inputs. These properties have the practical consequence that the number of parameters is cut down substantially: each neuron processing a $64 \times 64$ grayscale image would require $64 \cdot 64 + 1 = 4097$ weights, which is then multiplied again by the number of neurons in the layer for the total parameter count. On the other hand, a convolutional layer where each neuron processes a $7 \times 7$ window (a relatively large window size) around a pixel would require $7 \cdot 7 + 1 = 50$ parameters for the whole layer, due to the shared weights. Thus, processing the image input in this example with a convolutional layer instead of a dense layer with a single neuron reduces the number of parameters by a factor of about $80$. In addition to the activation function, we specify the value of four hyperparameters for convolutional layers when we describe specific network architectures in this thesis: the number of filters to slide along the height and width of the input; the size, or receptive field, of the filters; the stride; and how much zero padding to use. The receptive field dictates the sizes of the spatial dimensions (height, width) of the input that the neuron takes. In Figure 2.1(b), for instance, we would say that the filter size is ($2 \times 2$), despite the dimension of the weights being ($2 \times 2 \times 2$); this is due to the size of the input depth. Note that even though these height and width values are constrained, the neuron always processes the full depth of the input. The stride controls how many steps the filters take as the input is processed along its spatial dimensions. Zero padding amounts to adding rows and columns around the input, composed entirely of zeros, also along the depth. Padding is often done to make the output valid, for instance, if the stride or filter size would potentially cause a neuron to process inputs that are "out of bounds". This is often called valid padding. Another purpose of padding is to pad the input with zeros to keep the original spatial dimensions unchanged; this is called same padding. For example, zeros can be added around an image of size $12 \times 12$ before it is processed by a network that requires an input of $16 \times 16$. The relationship between the input and output of a convolutional filter is illustrated in Figure 2.1(a). In Figure 2.1(b) we add a depth dimension. In our example, the stride is 1. If we wanted to increase the stride in this example, we would have to introduce zero padding.
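The parameter counts quoted above can be checked with a short calculation. The helper names `dense_params` and `conv_params` are hypothetical and introduced only for this sketch; each unit or filter is assumed to carry one bias term.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit in a fully-connected layer."""
    return (n_inputs + 1) * n_units

def conv_params(kernel_h, kernel_w, in_depth, n_filters):
    """Weights plus one bias per filter; the count is independent of the image size."""
    return (kernel_h * kernel_w * in_depth + 1) * n_filters

dense = dense_params(64 * 64, 1)   # one dense neuron on a 64x64 grayscale image
conv = conv_params(7, 7, 1, 1)     # one 7x7 filter shared across all positions
ratio = dense / conv               # roughly 82, i.e. a factor of about 80
```

The ratio grows with the image size, since the dense count scales with the number of pixels while the convolutional count depends only on the filter size and depth.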

(a) Input with (height $\times$ width $\times$ depth) dimensions of ($3 \times 3 \times 1$).

Generally, any deep neural network is called a convolutional neural network if it has one or more convolutional layers. This holds true even if not all the layers are convolutional layers. Some other popular layer types include:

Subsampling layers that keep only every $n$-th row and column to reduce the computational complexity. This is usually only done as a first preprocessing step for very high-dimensional inputs, where throwing away the information is not as harmful as in the intermediate layers of the network.
Max pooling layers slide along the width and height of each depth slice in the input and return the largest single value in their window. They reduce the computational complexity by reducing the input dimension, and they make the representation approximately invariant to small translations.
Flattening layers are technical layers that re-shape tensor or array inputs to vectors.
Normalization layers re-center and re-scale the inputs to layers. They help speed up and stabilize the learning process.

Deep neural networks often have a number of parameters ranging from the thousands to the billions. This makes the interpretation of the calculations difficult, especially due to the number of layers. The first few convolutional layers of trained networks have been visualized (zeiler2014visualizing), with the result that the first layer's filters usually capture edges, corners, and color combinations. The second layer then combines these features into more complicated patterns. Higher layers then combine these features further into textures, object parts or even whole objects.

2.3 Representation learning

Incomputerscienceingeneral,andmachinelearninginparticular,thechoiceoftherepresentationofthedatathatisbeingprocessediscrucial.Thiscouldmeanchoosingtherightdatastructureforthetask,suchasdesigningadatabaseforfastsearching.Thiscouldalsomeanchoosingtherightindependentvariablesforastatisticalmodel.Theextractionofusefulinformationaboutthedataisthusanimportanttask.Thisisespeciallytrueiftheinputisfromahigh-dimensionalspace,withthetermcurseofdimensionality(bellman1957dynamic)beingusedsincethelate1950sfordescribingproblemsofthisnature:theamountofdataneededtomakestatisticallysignificantclaimsgrowsexponentiallywiththedimensionalityofthespacethatthedataresidesin.Thismakesthediscoveryofmethodsforreducingthedimensionalityoftheinput,withoutdiscardingimportantinformation,anattractiveprospect.

2.3.1 Supervised representation learning

Representationsariseinartificialneuralnetworks(ANNs)whentheyaretrainedforaregressionorclassificationobjective.OneviewofANNsisthatthehiddenlayersperformfeatureextractionontheinput,transformingittoamoresuitableformfortheoutputlayerthatperformsthefinalcalculationsforthetask.sharif2014cnnmadeuseofthisinsightbypre-processinginputsforsupervisedlearningmodelswiththeintermediatelayeroutputsofaconvolutionalnetworkthatwaspre-trainedonanobjectclassificationtask.Theyachievedimpressiveresultsonadiverserangeoftasks,suchasimageretrievalandscenerecognition.ThehierarchicalstructureofANNsalsohasthetheoreticalandpracticaladvantagethatthefeaturesateachlevelarere-usedforthedifferentfeaturesatthehigherlevel.

2.3.2 Unsupervised representation learning

For hierarchical methods like ANNs, representations are generated at the same time as the whole system is trained to minimize error on human-tagged data. Most other representation learning methods are unsupervised and are able to learn useful features on unlabeled data.

PCA

Principal component analysis (PCA) is an unsupervised learning method invented by pearson1901liii. The method finds an orthogonal linear transformation $f (X) = W^{T} X$ for zero-mean, $d$-dimensional data $X$. The columns $w_{i}$ of $W$ are the principal components of the data $X$, which point in the directions of greatest variance in the data. The first principal component is the solution to the equation

w_{1} = \operatorname*{arg\,max}_{| | w | | = 1} | | X w | |^{2}

(2.18)

The second component is found by applying Equation 2.18 again to the transformed data $\hat{X}$ that is given by removing the contribution of the first component from $X$, $\hat{X} = X - ((w_{1})^{T} X) w_{1}$, and so on. All the components can also be found simultaneously by computing the eigendecomposition of the data's covariance matrix, as it has been shown that the principal components are equal to the resulting eigenvectors (shlens2014tutorial). Dimensionality reduction can be done by creating a matrix $W_{L}$, consisting only of the first $L$ principal components, and applying it to the data. This yields the lower-dimensional, transformed data matrix $T_{L} = W_{L}^{T} X$. By doing this, the data is projected onto the subspace with the maximum variance. This matrix $W_{L}$ of principal components has the property of minimizing the reconstruction error $| | X - W_{L} T_{L} | |^{2}$. In Figure 2.3, we show a visualization of the UCI ML digits data set, which consists of $8 \times 8$ grayscale images of hand-written digits. The figure shows the result after the data is projected on its first two principal components, showing clear clustering of the digits. We also show the clustering found by t-distributed stochastic neighbor embedding (discussed below), which separates the clusters more cleanly for this data set.
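The eigendecomposition route to PCA can be sketched as follows. This is a minimal illustration, assuming that the data matrix stores one sample per column so that the projection reads $T_{L} = W_{L}^{T} X$; the function name `pca` and the synthetic data are hypothetical.

```python
import numpy as np

def pca(X, n_components):
    """PCA via eigendecomposition of the covariance matrix.

    X has shape (d, n): one column per sample, matching f(X) = W^T X.
    """
    Xc = X - X.mean(axis=1, keepdims=True)            # enforce zero mean
    cov = Xc @ Xc.T / (Xc.shape[1] - 1)               # (d, d) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)            # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]  # largest variance first
    W_L = eigvecs[:, order]                           # first L principal components
    T_L = W_L.T @ Xc                                  # projected data
    return W_L, T_L, eigvals[order]

# Five-dimensional observations generated from two latent directions plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(2, 500)) * np.array([[3.0], [0.5]])
mixing = rng.normal(size=(5, 2))
X = mixing @ latent + 0.01 * rng.normal(size=(5, 500))

W_L, T_L, variances = pca(X, n_components=2)
Xc = X - X.mean(axis=1, keepdims=True)
rel_error = np.linalg.norm(Xc - W_L @ T_L) ** 2 / np.linalg.norm(Xc) ** 2
```

Because the data is essentially two-dimensional, two components reconstruct it with a very small relative error, and the variance of each projected coordinate equals the corresponding eigenvalue.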

t-SNE

Over the last few years, t-distributed stochastic neighbor embedding (t-SNE) (maaten2008visualizing) has become one of the most popular dimensionality reduction techniques for visualization (arora2018analysis). The assumption behind the algorithm is that the high-dimensional input data lies on a locally connected manifold. First, an auxiliary asymmetric measure between each pair of data points is calculated according to the equation

p_{j | i} = \frac{\exp (- | | x_{i} - x_{j} | |^{2} / 2 σ_{i}^{2})}{\sum_{k \neq i} \exp (- | | x_{i} - x_{k} | |^{2} / 2 σ_{i}^{2})}

(2.19)

where $p_{i | i} = 0$ as it is of primary interest to model pairwise similarities. The parameter $σ_{i}^{2}$ is the variance of a Gaussian that is centered around $x_{i}$ and controls how influential nearby data points are in contrast to far away data points. Then the pairwise similarities are computed using the formula

p_{i j} = \frac{p_{j | i} + p_{i | j}}{2 N}

(2.20)

where $N$ is the number of data points. These similarities are defined to be symmetrized conditional probabilities to ensure that each data point makes a significant contribution to the cost function. Next, a lower-dimensional vector of data points $Y$ is created and initialized randomly. Each element in $Y$ corresponds to an element in $X$; similarities $q_{i j}$ between data points $y_{i}$ and $y_{j}$ are calculated according to the formula

q_{i j} = \frac{(1 + | | y_{i} - y_{j} | |^{2})^{- 1}}{\sum_{k} \sum_{l \neq k} (1 + | | y_{k} - y_{l} | |^{2})^{- 1}}

(2.21)

The low-dimensional data points $y_{i}$ are then moved around to minimize the KL divergence, a measure of the difference between two probability distributions, between $p$ and $q$ (note that $q$ and $p$ can be interpreted as probabilities since $\sum_{i, j} p_{i j} = \sum_{i, j} q_{i j} = 1$ and $p_{i j} > 0$ and $q_{i j} > 0$ for all $i \neq j$):

KL (P ∥ Q) = \sum_{i \neq j} p_{i j} \log \frac{p_{i j}}{q_{i j}}

(2.22)

Equation 2.22 is minimized using gradient descent, ensuring that points that are similar in the high-dimensional space are also similar in the new, low-dimensional space.
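The quantities in Equations 2.19 to 2.22 can be computed directly for a small data set. The sketch below is an illustration only: it assumes a fixed bandwidth $σ_{i}$ for every point (in practice the bandwidths are usually tuned per point), it divides the symmetrized similarities by $2N$ as in the original t-SNE formulation so that they sum to one, and all function names are hypothetical. It evaluates the cost once and does not perform the gradient descent itself.

```python
import numpy as np

def squared_distances(Z):
    """Pairwise squared Euclidean distances between the rows of Z."""
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def joint_p(X, sigma):
    """Symmetrized high-dimensional similarities p_ij."""
    n = X.shape[0]
    logits = -squared_distances(X) / (2.0 * sigma[:, None] ** 2)
    np.fill_diagonal(logits, -np.inf)             # enforces p_{i|i} = 0
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    cond = np.exp(logits)
    cond /= cond.sum(axis=1, keepdims=True)       # row i holds p_{j|i}
    return (cond + cond.T) / (2.0 * n)

def joint_q(Y):
    """Low-dimensional similarities q_ij with a Student-t kernel."""
    inv = 1.0 / (1.0 + squared_distances(Y))
    np.fill_diagonal(inv, 0.0)                    # exclude the terms with k = l
    return inv / inv.sum()

def kl_divergence(P, Q):
    """KL(P || Q), summed over the entries where p_ij > 0."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))    # 30 points in 10 dimensions
Y = rng.normal(size=(30, 2))     # random 2D initialization
P = joint_p(X, sigma=np.ones(30))
Q = joint_q(Y)
cost = kl_divergence(P, Q)
```

A full implementation would repeatedly update `Y` along the negative gradient of `cost`.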

Autoencoders

The autoencoder is a type of neural network that consists of an encoder part, which maps the input $x$ to an encoding (usually of a smaller size than the input), and a decoder part, which outputs a reconstruction $y$ of the input (Figure 2.4). Autoencoders can consist of any types of neural network layers. For example, an autoencoder made up of convolutional layers is called a convolutional autoencoder.

Fig 2.4: An autoencoder. This is an example of a convolutional autoencoder that is trained for reconstructing seal photographs. The encoding vector $z$ is usually much smaller than the total number of pixels in the input.

The objective function is the squared error between the input and the output

L (x, y) = | | x - y | |^{2}

(2.23)

Useful representations arise in this process if the correct constraints are placed on the system. Without constraints, the system could end up learning the identity function, which trivially satisfies the objective function: $| | x - y | | = | | x - x | | = 0$. This problem can be overcome by including a hidden layer in the autoencoder of a lower dimensionality than the input space. In this case, the autoencoder is said to be undercomplete. Undercomplete, single-layer autoencoders with linear activation functions are almost equivalent to PCA. The $p$-dimensional hidden layer spans the same subspace as the first $p$ principal components, or the principal subspace, of the data (baldi1989neural). Unlike PCA, however, the weights of the hidden layer are not guaranteed to be orthonormal nor ordered. If the smallest dimensionality of a hidden layer is larger than the size of the input, the autoencoder is said to be overcomplete. Overcomplete autoencoders can be prevented from learning the identity function if the objective function (Equation 2.23) is combined with a regularization term. For example, instead of reconstructing the original input as it is, vincent2008extracting propose that the goal could be to recover the input after it has been corrupted with noise (e.g. Gaussian noise or salt-and-pepper noise). The idea behind this is that the autoencoder has to learn representations that are stable and robust under the corruption of the input, and that the denoising task extracts a useful structure of the input distribution (vincent2010stacked).

2.3.3 Self-supervised learning

Approaches that learn representations by way of solving auxiliary tasks, in the sense that the representation that arises from the optimization is more important than achieving a good performance on the task itself, are sometimes called self-supervised learning in the literature (gogna2016semi). For example, ha2018world train an autoencoder to reconstruct the observations in an RL environment, but they do not take advantage of the reconstructive capabilities of the network when they train their RL policies, and they use only the encoder part of the network for visual pre-processing. Another example is the denoising autoencoder from the previous section. One direction of self-supervised learning that has been followed in the literature is to invent a task that requires an understanding of the domain to solve correctly, such as the reconstructive network used by ha2018world. Another example is the work by gidaris2018unsupervised, who learn representations by applying random rotations to natural images and training a network to predict which rotation was applied. This encourages the network to learn high-level concepts, such as beaks, wings and talons, and their relative positions. More commonly, self-supervised learning methods consist of obscuring some part of the input and training a model to predict that part given some other subset of the input, as we do in Chapter 4 and Chapter 5. Some variations of this idea include, for example, colorization (zhang2016colorful), where colorful images are converted to grayscale with the goal of predicting the original colors.

Chapter 3 Learning gradient-based ICA by neurally estimating mutual information

In this chapter, we introduce a novel method of training neural networks in an unsupervised manner to output statistically independent components, a method we call GrICA. This chapter is adapted from (hlynsson2019learning), which was published in the Joint German/Austrian Conference on Artificial Intelligence (Künstliche Intelligenz). We use a mutual information neural estimation (MINE) network (belghazi2018mine) to guide the learning of an encoder to produce statistically independent outputs. This is a recent method of estimating the mutual information of random variables in a deep learning setting, and we apply it to get a qualitatively equal solution to FastICA on blind source separation of noisy sources. We investigate the usefulness of our method in contrast to a representation learned by a convolutional autoencoder for preprocessing visual inputs for an RL agent, but the comparison is unfavorable for our approach. The rest of this chapter has the following organization: Section 3.1 motivates the design of representations with independent components. Section 3.2 explains the ICA problem formulation. Section 3.3 briefly discusses related work. Section 3.4 introduces our method of using a mutual information neural estimator to teach a neural network to output independent components. Section 3.5 shows the experimental evaluation of our method. Finally, we conclude with a discussion in Section 3.6.

3.1 Introduction

The general objective of training an encoder to learn statistically independent, factorial codes of the data has been called the "holy grail" of unsupervised learning (schmidhuberunsupervised). We suggest that learning to recover few, statistically independent, latent variables of an RL environment can speed up the training of RL agents. For environments where high-dimensional observations are created from a small set of statistically independent latent variables, this technique could reduce the dimensionality of the observations without discarding necessary information. Another theoretical advantage of using this kind of approach in RL settings, compared to the other methods we develop in this PhD dissertation, is that it requires only out-of-context observation data from the environment. Learning the GrICA representation does not require full transition tuples $(s, a, r, s^{'})$ (we use $s$ to denote the state, $a$ to denote the action, $r$ to denote the reward and $s^{'}$ to denote the next state) and can thus be used when the transition or reward dynamics of the environment change. Learning representations that output statistically independent features can be done in any number of ways, for example, by trying to make each output as unpredictable as possible given the other output units (schmidhuber1992learning). We take the approach of minimizing the mutual information, as estimated by a MINE network, between the output units of a differentiable encoder network. This is done by simple alternate optimization of the two networks.

3.2 Background

Independentcomponentanalysis(ICA)aimsatestimatingunknownsourcesthathavebeenmixedtogetherintoanobservation.TheusualassumptionsarethatthesourcesarestatisticallyindependentandnomorethanoneisGaussian(jutten2003advances).Thenow-cementedmetaphorisoneofacocktailpartyproblem:severalpeople(sources)arespeakingsimultaneously,andtheirspeechhasbeenmixedtogetherinarecording(observation).Thetaskistounmixtherecordingsuchthatalldialoguescanbelistenedtoclearly.InlinearICA,wehaveadatamatrix $S$ whoserowsaredrawnfromstatisticallyindependentdistributions,amixingmatrix $A$ ,andanobservationmatrix $X$ :

X = A S

andwewanttofindanunmixingmatrix $U$ of $A$ thatrecoversthesourcesuptoapermutationandscaling:

Y = U X

The general non-linear ICA problem is ill-posed (hyvarinen1999nonlinear; darmois1953analyse) as there is an infinite number of solutions if the space of mixing functions is unconstrained. However, post-nonlinear (PNL) ICA (taleb1999source) is solvable. This is a particular case of non-linear ICA where the observations take the form

X = f (A S)

where $f$ operates componentwise, i.e. $X_{i, t} = f_{i} (\sum_{m = 1}^{n} A_{i, m} S_{m, t})$. The problem is solved efficiently if $f$ is at least approximately invertible (ziehe2003blind) and there are approaches to optimize the problem for non-invertible $f$ as well (ilin2004post). For signals with time structure, however, the problem is not ill-posed even though it is for i.i.d. samples (blaschke2007independent; sprekeler2014extension). To frame ICA as an optimization problem, we must find a way to measure the statistical independence of the output components and optimize this quantity. There are two main ways to approach this: either minimize the mutual information between the sources (amari1996new; bell1995non; cardoso1997infomax), or maximize the sources' non-Gaussianity (hyvarinen2000independent; blaschke2004cubica).

3.3 Related work

TherehasbeenaninterestincombiningneuralnetworkswiththeprinciplesofICAforseveraldecades.InPredictabilityMaximization(schmidhuber1992learning),agameisplayedwhereoneagenttriestopredictthevalueofoneoutputcomponentgiventheothers,andtheothertriestomaximizetheunpredictability.Morerecently,DeepInfoMax(DIM)(hjelm2018learning),GraphDeepInfoMax(velivckovic2018deep)andGenerativeadversarialnetworks(goodfellow2014generative),utilizetheworkofBrakeletal.(brakel2017learning)todeeplylearnICA.Ourworkdiffersfromtheseadversarialtrainingmethodsintherulesoftheminimaxgamebeingplayedtoachievethis:oneagentdirectlyminimizesthelower-boundofthemutualinformation,asderivedfromtheDonsker-VaradhancharacterizationoftheKL-Divergence,astheothertriestomaximizeit.

3.4 Method

3.4.1 Reinforcement learning environment

Ourrepresentationistestedona2Denvironmentwheretheagentissupposedtoavoidafieldoflavaandreachagoalontheothersideoftheroom.Thefullstateoftheenvironmentisthewholeroom(Fig3.1,left)andtheobservationisanisometricviewoftheagentanditspointofview(Fig3.1,right).Theobservationsare $56 \times 56$ RGBimagesandtheagentcantakeastepforward,turnleftorturnright.

Fig 3.1: Thelavafieldenvironment.Theworldhas $5 \times 7$ tilesandissurroundedbyimpassablewalls.Theagent(redarrow)istaskedwithreachingthegoal,representedbythegreentile,whichyieldsarewardandterminatestheepisode.Theepisodeendswithoutrewardiftheagenttouchesanorangelavatile.Thefullworldstatecanbeseenontheright,withaslightlylighterboxcontainingtheagent.Thisboxhighlightsthesubsetoftheworldthatisperceivedbytheagent,seenontheleft.

The episode terminates if the agent steps onto lava or the goal. The agent receives a positive reward if it reaches the goal, but the episode terminates with zero reward if it steps into the lava. There is no change if the agent is facing the wall and takes a step forward.

3.4.2 Learning the independent components

Wetrainanencoder $E$ togenerateanoutput $(z_{1}, z_{2}, \dots, z_{k})$ suchthatanyoneoftheoutputcomponentsisstatisticallyindependentoftheunionoftheothers,i.e. $P (z_{i}, z_{- i}) = P (z_{i}) P (z_{- i})$ ,where

z_{- i} := (z_{1}, \dots, z_{i - 1}, z_{i + 1}, \dots, z_{k})

The statistical independence of $z_{i}$ and $z_{- i}$ can be maximized by minimizing their mutual information

I (Z_{i}; Z_{- i}) = \int_{z_{i}} \int_{z_{- i}} P (z_{i}, z_{- i}) \log \left( \frac{P (z_{i}, z_{- i})}{P (z_{i}) P (z_{- i})} \right) d z_{i} d z_{- i}

(3.1)

This quantity is hard to estimate, particularly for high-dimensional data. Note that Equation 3.1 can be written more succinctly as the KL divergence between the joint distribution of $Z_{i}$ and $Z_{- i}$ and the product of their marginals:

I (Z_{i}; Z_{- i}) = D_{K L} (P (Z_{i}, Z_{- i}) | | P (Z_{i}) P (Z_{- i}))

(3.2)

donsker1975asymptotic famously proved that the KL divergence admits the representation

D_{K L} (X | | Y) = \sup_{T : Ω \to \mathbb{R}} E_{X} [T] - \log (E_{Y} [e^{T}])

(3.3)

where the domain $Ω$ is a closed and bounded subset of $\mathbb{R}^{d}$. belghazi2018mine introduce a method of using the Donsker-Varadhan representation to estimate mutual information with neural networks, with an architecture they call mutual information neural estimation (MINE) networks. To learn representations with independent components, we therefore estimate the lower bound of Eq. (3.1) using a MINE network $M$:

I (Z_{i}; Z_{- i}) \geq L_{i} = E_{J} [M (z_{i}, z_{- i})] - \log (E_{M} [e^{M (z_{i}, z_{- i})}])

(3.4)

where $J$ indicates that the expected value is taken over the joint distribution and, similarly, the subscript $M$ indicates the product of marginals. The networks $E$ and $M$ are parameterized by $θ_{E}$ and $θ_{M}$. The encoder takes the observations as input and the MINE network takes the output of the encoder as an input. The $E$ network minimizes $L := \sum_{i} L_{i}$ in order for the outputs to have low mutual information and therefore be statistically independent. In order to get a faithful estimation of the lower bound of the mutual information, the $M$ network maximizes $L$. Thus, in a push-pull fashion, the system as a whole converges to independent output components of the encoder network $E$. In practice, rather than training the $E$ and $M$ networks simultaneously, it proved useful to train $M$ from scratch for a few iterations after each iteration of training $E$, since the loss functions of $E$ and $M$ are at odds with each other. When the encoder is trained, the MINE network's parameters are frozen and vice versa.

Fig 3.2: Ourindependentfeaturelearningsystem.Thesystemlearnsstatisticallyindependentoutputsbyalternateoptimizationofanencoder $E$ andaMINEnetwork $M$ parameterizedby $θ_{E}$ and $θ_{M}$ .TheMINEobjective(Eq. 3.4)isminimizedwithrespectto $θ_{E}$ forweightupdatesoftheencoder,butitismaximizedwithrespectto $θ_{M}$ forweightupdatesoftheMINEnetwork.
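The lower bound in Eq. 3.4 can be illustrated without training any network. In the sketch below, a fixed function `critic` stands in for the MINE network $M$, and samples from the product of marginals are obtained by shuffling one variable within the batch. The bivariate Gaussian example, the closed-form optimal critic, and all names are assumptions made for this illustration; they are not part of the method described above.

```python
import numpy as np

def dv_lower_bound(critic, x, y, rng):
    """Donsker-Varadhan bound: E_J[T(x, y)] - log E_M[exp(T(x, y))]."""
    joint_term = critic(x, y).mean()
    y_shuffled = rng.permutation(y)   # breaks the pairing: product of marginals
    marginal_term = np.log(np.exp(critic(x, y_shuffled)).mean())
    return joint_term - marginal_term

rho = 0.5                             # correlation of the two variables
rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(size=n)
y = rho * x + np.sqrt(1 - rho ** 2) * rng.normal(size=n)

def optimal_critic(x, y):
    """log p(x, y) / (p(x) p(y)) for a standard bivariate Gaussian."""
    quad = rho ** 2 * x ** 2 - 2 * rho * x * y + rho ** 2 * y ** 2
    return -0.5 * np.log(1 - rho ** 2) - quad / (2 * (1 - rho ** 2))

def zero_critic(x, y):
    """An uninformative critic, for which the bound is exactly zero."""
    return np.zeros_like(x)

true_mi = -0.5 * np.log(1 - rho ** 2)                # analytic mutual information
tight = dv_lower_bound(optimal_critic, x, y, rng)    # close to true_mi
loose = dv_lower_bound(zero_critic, x, y, rng)       # equals 0
```

The bound is tight only for a well-chosen critic, which is why the MINE network maximizes the objective while the encoder minimizes it.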

3.5 Results

Wetryourmethodontwoscenarios:(1)wecompareittocanonicalimplementationsofICAonatextbookexampleofestimatingsourcesfromnoisydataand(2)weuseourmethodwithamorecomplexfunctionapproximatorforpreprocessingobservationsinanRLsetting.

3.5.1 Recovering noisy signals

We validate the method for a linear noisy ICA example (sklearn). Full code for the noisy signal recovery experiment is available at github.com/wiskott-lab/gradient-based-ica/blob/master/bss3.ipynb. Three independent, noisy sources, namely a sine wave, a square wave and a sawtooth signal (Fig. 3.2(a)), are mixed linearly (Fig. 3.2(b)):

Y = \begin{bmatrix} 1 & 1 & 1 \\ 0.5 & 2 & 1 \\ 1.5 & 1 & 2 \end{bmatrix} S

The encoder is a single-layer neural network with linear activation, with a differentiable whitening layer (schuler2018gradient) before the output. The whitening layer is a key component for performing successful blind source separation for our method. Statistically independent random variables are necessarily uncorrelated, so whitening the output by construction beforehand simplifies the optimization problem significantly. The MINE network $M$ is a seven-layer neural network. Each layer but the last one has 64 units with a rectified linear activation function. Each training epoch of the encoder is followed by seven training epochs of $M$. Estimating the exact mutual information is not essential, so few iterations suffice for a good gradient direction. Since the MINE network is applied to each component individually to estimate mutual information (Eq. 3.4), we need to pass each sample through the MINE network $n$ times, once for each component. Equivalently, one could conceptualize this as having $n$ copies of the MINE network and feeding the samples to them in parallel, with different components singled out. Thus, for sample $(z_{1}, z_{2}, \dots, z_{n})$ we feed in $(z_{i}; z_{- i})$, for each $i$. Both networks are optimized using Adam with Nesterov momentum (dozat2016incorporating) with a learning rate of $0.005$. For this simple example, our method (Fig. 3.2(c)) is as good at unmixing the signals as FastICA as implemented in the scikit-learn package (scikit-learn) (Fig. 3.2(d)). Note that, in general, the sources can only be recovered up to permutation and scaling.
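The effect of whitening on the mixed signals can be sketched in NumPy. This is a plain batch (ZCA) whitening and not the differentiable whitening layer of schuler2018gradient; the signal frequencies, the noise level and the function name `whiten` are assumptions chosen for illustration, while the mixing matrix is the one given above.

```python
import numpy as np

t = np.linspace(0, 8, 2000)
rng = np.random.default_rng(0)

# Three sources: sine, square and sawtooth (hypothetical frequencies).
S = np.stack([
    np.sin(2 * t),
    np.sign(np.sin(3 * t)),
    2 * ((t / 2) % 1) - 1,
])
S += 0.2 * rng.normal(size=S.shape)   # additive noise on every source

A = np.array([[1.0, 1.0, 1.0],
              [0.5, 2.0, 1.0],
              [1.5, 1.0, 2.0]])       # mixing matrix from the text
Y = A @ S                             # observed mixtures, shape (3, 2000)

def whiten(Y):
    """Center the rows of Y and decorrelate them to unit variance (ZCA)."""
    Yc = Y - Y.mean(axis=1, keepdims=True)
    cov = Yc @ Yc.T / Yc.shape[1]
    eigvals, eigvecs = np.linalg.eigh(cov)
    W = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T
    return W @ Yc

Z = whiten(Y)
cov_Z = Z @ Z.T / Z.shape[1]          # identity matrix up to rounding
```

After whitening, the remaining unmixing transformation is a rotation, which is the part that the mutual information objective has to find.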

3.5.2 Lava field environment

For these experiments, we learn our ICA features using a convolutional neural network. We roll out 100 episodes with a fully random policy to gather data for learning the independent features: an agent is placed in the upper right corner of the environment and turns left, turns right or takes a step forward with equal probabilities. The resulting observations are gathered until the episode terminates. This gives us 4130 observations to train the representation on for this experiment. The representation we learn is 32-dimensional. The MINE network is the same as above, but the encoder network is a five-layer convolutional neural network. The first two layers are convolutional layers, each with 32 filters, a rectified linear unit activation and no padding. The first layer has a stride of 4 and the second one a stride of 3. The output of layer 2 is then passed to a flattening layer, reshaping the tensor output to a vector input for a linear dense layer with 32 units. The output of the dense layer is then finally passed to a sphering layer, giving us the encoding. The network description is summarized in Table 3.1.

Layer	Filters	Kernel	Stride	Padding	Output shape	Learnable parameters
Input	-	-	-	-	$56 \times 56 \times 3$	0
Conv.	32	3x3	4	None	$14 \times 14 \times 32$	896
ReLU	-	-	-	-	$14 \times 14 \times 32$	0
Conv.	32	3x3	3	None	$4 \times 4 \times 32$	9248
ReLU	-	-	-	-	$4 \times 4 \times 32$	0
Flatten	-	-	-	-	512	0
Dense	-	-	-	-	32	16416
ReLU	-	-	-	-	32	0
Sphering	-	-	-	-	32	0

Table 3.1: OurconvolutionalICAnetwork.ThelayoutisinspiredbythetopologyofMnih’sdeepQnetworks(mnih2013playing)exceptforthelastlayer,wherewehaveadifferentiablespheringoperationinsteadofadenselayer.
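The output shapes and parameter counts of the encoder can be re-derived with a few lines of arithmetic. The sketch assumes RGB observations with depth 3, as stated in Section 3.4.1, unpadded convolutions, and one bias per filter or unit; the helper names are hypothetical.

```python
def conv_output_size(size, kernel, stride):
    """Spatial output size of an unpadded convolution."""
    return (size - kernel) // stride + 1

def conv_params(kernel, in_depth, n_filters):
    """Weights plus one bias per filter."""
    return (kernel * kernel * in_depth + 1) * n_filters

size0, depth0 = 56, 3                                   # 56x56 RGB observation

size1 = conv_output_size(size0, kernel=3, stride=4)     # first conv: 14
params1 = conv_params(kernel=3, in_depth=depth0, n_filters=32)

size2 = conv_output_size(size1, kernel=3, stride=3)     # second conv: 4
params2 = conv_params(kernel=3, in_depth=32, n_filters=32)

flat = size2 * size2 * 32                               # flattened vector length
params_dense = (flat + 1) * 32                          # dense layer with 32 units

total = params1 + params2 + params_dense
```

The dense layer accounts for the majority of the learnable parameters, even though the flattened input has only 512 entries.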

The training of our ICA representation follows the same scheme as before: we train the estimator $M$ for seven epochs after each training epoch of the encoder. We trained the encoder for 100 epochs and the estimator for 700 epochs for this experiment. Our trained representation is used to preprocess the visual input for an RL agent. We choose Actor Critic using Kronecker-Factored Trust Region (ACKTR) as implemented by Stable Baselines with default parameters and model. The ACKTR default model is a fully-connected neural network with two layers of 64 units each and a tanh activation function. We trained an ACKTR model from scratch twenty times on our ICA representation, and show the results in Fig 3.4. This indicates that the agent is able to learn the environment when our method is used to preprocess its input.

Fig 3.4: GrICArewardduringtrainingonthelavafieldenvironment.Thelineindicatestheaveragereward(every1500trainingsteps)over20differentagentstrainedfromscratch.Theerrorbandsindicateonestandarddeviationfromthemean.

To visualize the behavior of our agent (Figure 3.5), we choose three successful episodes from a fully-trained model after 100 thousand time steps of training and three unsuccessful ones from a model with 80 thousand time steps of training. It is noteworthy that the agent prefers a wide margin between itself and the lava field as it passes it, even though a more optimal strategy would have the agent walk to the right with the lava immediately on its left-hand side. The agent also sometimes doubles back before continuing toward the goal again.

Fig 3.5: Trajectoriesinthelavafieldenvironment.Thefirststepinthetrajectoryisindicatedbyblue,thenthecolorwarmsupwitheachstepuntilitbecomesamoresaturatedredcolorinthefinalstep.

Wealsotriedtoseewhetherourmethodgeneralizestoavariantoftheenvironmentwherethelowerrowoflavaismovedtothebottom,punishingourstrategy.Therewereonly3successesinathousandtestiterations,showninFigure 3.6,alongwiththreeoftheunsuccessfulepisodes.

Fig 3.6: Trajectories with shifted lava fields. This variant punishes strategies, such as the one learned using our representation, where the agent seeks to go all the way down for safety as it goes across the room.

For comparison, we also trained a convolutional autoencoder (CAE) for reconstruction on the same data set we used to train our ICA representation. We ran the experiment again with the resulting encoding after the CAE was trained. The encoder is the same as the network used for our representation, except that it does not have the sphering layer. The decoding portion of the network consists of a 392-unit dense layer whose output is reshaped to a $7 \times 7 \times 8$ tensor and passed to a convolutional layer with 32 filters of size $3 \times 3$. The output is then up-sampled to quadruple the width and height of the tensor. This is then followed by another convolutional layer, of the same kind as the previous one, and another up-sampling layer that doubles the width and height of the tensor. The output then finally goes through a convolutional layer with 3 filters of size $3 \times 3$. Each layer in the decoder has a ReLU activation, except for the last, which has a logistic activation to reconstruct pixel values that have been normalized to lie in the range $[0, 1]$. Each convolutional layer does zero-padding to preserve the width and the height of the input tensor. See Table 3.2 for an overview of the architecture.

Part	Layer	Filters	Kernel	Stride	Padding	Output shape	Learnable parameters
Encoder	Input	-	-	-	-	$56 \times 56 \times 3$	0
Encoder	Conv.	32	3x3	4	None	$14 \times 14 \times 32$	896
Encoder	ReLU	-	-	-	-	$14 \times 14 \times 32$	0
Encoder	Conv.	32	3x3	3	None	$4 \times 4 \times 32$	9248
Encoder	ReLU	-	-	-	-	$4 \times 4 \times 32$	0
Encoder	Flatten	-	-	-	-	512	0
Encoder	Dense	-	-	-	-	32	16416
Encoder	ReLU	-	-	-	-	32	0
Decoder	Dense	-	-	-	-	392	12936
Decoder	ReLU	-	-	-	-	392	0
Decoder	Reshape	-	-	-	-	$7 \times 7 \times 8$	0
Decoder	Conv.	32	3x3	1	Same	$7 \times 7 \times 32$	2336
Decoder	ReLU	-	-	-	-	$7 \times 7 \times 32$	0
Decoder	Upsampling	-	4x4	-	-	$28 \times 28 \times 32$	0
Decoder	Conv.	32	3x3	1	Same	$28 \times 28 \times 32$	9248
Decoder	ReLU	-	-	-	-	$28 \times 28 \times 32$	0
Decoder	Upsampling	-	2x2	-	-	$56 \times 56 \times 32$	0
Decoder	Conv.	3	3x3	1	Same	$56 \times 56 \times 3$	867
Decoder	Tanh	-	-	-	-	$56 \times 56 \times 3$	0

Table 3.2: Convolutional autoencoder network architecture. The encoder portion spans the input through the 32-unit dense layer and its ReLU activation, and the decoder portion spans the remaining layers. The encoder portion was used to preprocess the input to the RL learner for the experiment.

We train the CAE for 50 epochs. Even though this is a low number of epochs compared to the training for our method, its reconstructive properties are already quite good (Figure 3.7).

Fig 3.7: Reconstructionbyautoencoder.The32-dimensionallatentspacecapturesenoughinformationtoreconstructtheoriginalobservationsquiteaccurately,evenafteramodestnumberoftrainingepochs.

We repeat the experiment as before, but now with the CAE features instead of our ICA representation. The training curve is shown in Figure 3.8. This straightforward baseline algorithm learns to solve the environment twice as fast as our method.

Fig 3.8: CAE reward during training. The line indicates the average reward (every 1500 training steps) over 20 different agents trained from scratch. The error bands indicate one standard deviation from the mean.

3.6 Conclusion

We have introduced a novel technique for training a differentiable function to perform ICA. The method consists of alternating the optimization of an encoder and a mutual information neural estimation (MINE) network. The mutual information estimate between each encoder output and the union of the others is minimized with respect to the encoder's parameters. The solution learned by our approach agrees with the one learned by the canonical ICA algorithm, FastICA. An advantage of our method, however, is that it is trivially extended to overcomplete or undercomplete ICA by changing the number of output units of the neural network. We apply our algorithm to high-dimensional data to test the representation learned by our method for dimensionality reduction of visual inputs for an RL agent. The agent is able to use our representation to learn how to solve a simple navigation task, but the preprocessing offered by our approach is outperformed by a convolutional autoencoder. Our method works in principle, as can be seen in the noisy signal recovery experiment, but its effectiveness for learning representations for RL agents remains unproven. Even though the observations of the lava field environment are fully determined by three latent variables that are statistically independent, namely the agent's x and y positions and its direction, our representation was still not useful enough to beat the relatively simple baseline.

Chapter 4 Latent representation prediction networks

In this chapter, we introduce a representation learning technique for RL settings that we name Latent Representation Prediction (LARP). This chapter is adapted from (hlynsson2020latent). This novel system takes advantage of more information given by the environment than our GrICA method from the previous chapter, which only learned from static observations without taking advantage of the knowledge that the system will be used in a dynamic setting. That is to say, we will now utilize the $(s, a, s^{'})$ = (state, action, next state) triplets for training. Our algorithm learns a state representation, along with a function that predicts how the representation changes when the agent takes given actions in the environment. Instead of using our system to preprocess inputs for a model-free reinforcement learner, as we did in the previous chapter, we now take advantage of a prediction function, which is used as a forward model for search on a graph in a viewpoint-matching task. Using a representation that is learned to be maximally predictable for the predictor is found to outperform pretrained representations. The data-efficiency and overall performance of our approach are shown to rival standard reinforcement learning methods, and our learned representation transfers successfully to novel environments. The rest of the chapter is organized as follows: in Section 4.1, we motivate the usefulness of representations that are predictable in the scope of visual planning, and we mention how we intend to overcome a common pitfall in their design. We move on to discussing the main classes of related work in Section 4.2, and summarize the most relevant articles from among them. In Section 4.3, we discuss in concrete detail the design of our representation, how we use it for planning, and we introduce an experimental environment of our own design. The results of our experiments are presented in Section 4.4, and we conclude with a discussion of the proposed methodology and prospects for future work in Section 4.5.

4.1 Introduction

Deeply-learned planning methods are often based on learning representations that are optimized for unrelated tasks. For example, they might be trained to reconstruct observations of the environment, such as the convolutional autoencoder from the previous chapter. These representations are then combined with predictor functions for simulating rollouts to navigate the environment. We propose to rather learn representations such that they are directly optimized for the task at hand: to be maximally predictable for the predictor function. This results in representations that are well-suited, by design, for the downstream task of planning, where the learned predictor function is used as a forward model.

While modern reinforcement learning algorithms reach super-human performance on tasks such as game playing, they remain woefully sample inefficient compared to humans. An algorithm that is data-efficient (hlynsson2019measuring) requires only a few samples for good performance, and the study of data-efficient control is currently an active research area (corneil2018efficient; buckman2018sample; du2019good; saphal2020seerl). Dimensionality reduction is a powerful tool for increasing the data-efficiency of machine learning methods. There has been much recent work on methods that take advantage of compact, low-dimensional representations of states for search and exploration (kurutach2018learning; corneil2018efficient; xu2019regression). One of the advantages of this approach is that a good representation aids in faster and more accurate planning. This holds in particular when the latent space is of much lower dimensionality than the state space (hamilton2014efficient). For high-dimensional inputs, such as image data, a representation function is frequently learned to reduce the complexity for a controller. In deep reinforcement learning, the representation and the controller are learned simultaneously. Similarly, a representation can in principle be learned along with a forward model for classical planning in high-dimensional space. We do this with our LARP network, which is a neural network-based method for learning a state representation and a transition function for planning within the learned latent space (Fig. 4.1).

Fig 4.1: Conceptual overview of our method. The important components are the representation network along with the predictor network. Together, they comprise a LARP network, which is utilized by a planning algorithm.

During training, the representation and the predictor are learned simultaneously from transitions in a self-supervised manner. We train the predictor to predict the most likely future representation, given a current representation and an action. The predictor is then used for planning by navigating the latent space defined by the representation to reach a goal state. Optimizing control in this manner, after learning an environment model, has the advantage of allowing for learning new reward functions in a fast and data-efficient manner. After the representation is learned, we find said goal state by conventional path planning. Disentangling the reward from the transition function in such a way is helpful when learning for multiple or changing reward functions, and aids with learning when there is no reward available at all. Thus, it is also good for a sparse or a delayed-reward setting.

A problem that can arise in representation learning is the one of trivial features. This can happen when the method is optimizing an objective function that has a straightforward, but useless, solution. For example, Slow Feature Analysis (SFA) (wiskott2002slow) has the objective of extracting the features of time series data that vary the least with time. This is easily fulfilled by constant functions, so SFA requires that the representations have a variance of $1$, which constant functions cannot fulfill. Constant features would similarly be maximally predictable representations for our system. Therefore, we study three different approaches to prevent this trivial representation from being learned. We either (i) design the architecture such that the output is sphered, (ii) regularize it with a contrastive loss term, or (iii) include a reconstruction loss term along with an additional decoder module.

We compare these approaches and validate our method experimentally on a visual environment: a viewpoint-matching task using the NORB dataset (lecun2004learning), where the agent is presented with a starting viewpoint of an object and the task is to produce a sequence of actions such that the agent ends up with the goal viewpoint. As the NORB dataset is embeddable on a cylinder (hadsell2006dimensionality; schuler2018gradient) or a sphere (wang2018toybox), we can visualize the actions as traversing the embedded manifold. Our approach compares favorably to state-of-the-art methods on our test bed with respect to data-efficiency, but our asymptotic performance is still outclassed by other approaches.

4.2 Related work

Most of the related work falls into the categories of reinforcement learning, visual planning, or representation learning. The primary difference between our method and other model-based methods is that, in the latter, the representation is learned by optimizing auxiliary objectives which are not directly useful for solving the main task.

4.2.1 Reinforcement learning

There are many works in the literature that also approximate the transition function of environments, for instance by performing explicit latent-space planning computations (tamar2016value; gal2016improving; henaff2017model; srinivas2018universal; chua2018deep; hafner2019learning) as part of learning and executing policies. gelada2019deepmdp train an RL agent to simultaneously predict rewards as well as future latent states. Our work is distinct from these, as we are not assuming a reward signal during training. ha2018world combine vision, memory, and controller for learning a model of the world before learning a decision model. A predictive model is trained in an unsupervised manner, permitting the agent to learn policies completely within its learned latent space representation of the environment. The main difference is that they first approximate the state distribution using a variational autoencoder, producing the encoded latent space. In contrast, our representation is learned such that it is maximally predictable for the predictor network. Similar to our training setup, oh2015action predict future frames in Atari environments conditioned on actions. The predicted frames are used for learning the transition function of the environment, e.g. for improving exploration by informing agents of which actions are more likely to result in unseen states. Our work differs as we are acting within a learned latent space and not the full input space, and our representations are used in a classical planning paradigm with start and goal states instead of a reinforcement learning one.

4.2.2 Visual planning

We define visual planning as the problem of synthesizing an action sequence to generate a target state from an initial state, where all the states are observed as images. Variational State Tabulations (corneil2018efficient) learn a state representation in addition to a transfer function over the latent space. However, their observation space is discretized into a table using a variational approach, as opposed to our continuous representation. A continuous representation circumvents the problem of having to determine the size of such a table in advance or during training. Similarly, cuccu2018playing discretize visual input using unsupervised vector quantization and use that representation for learning controllers for Atari games. Inspired by classic symbolic planning, Regression Planning Networks (xu2019regression) create a plan backward from a symbolic goal. We do not have access to high-level symbolic goal information for our method, and we assume that only high-dimensional visual cues are received from the environment.

Topological memories of the environment are built in Semi-parametric Topological Memories (savinov2018semi) after being provided with observation sequences from humans exploring the environment. Nodes are connected if a predictor estimates that they are close. The method has problems with generalization, which are reduced in Hallucinative Topological Memories (liu2020hallucinative), where the method also admits a description of the environment, such as a map or a layout vector, which the agent can use during planning. Our visual planning method does not receive any additional information on unseen environments and does not depend on manual exploration during training.

Causal InfoGAN (kurutach2018learning) and related methods (wang2019learning) are based on generative adversarial networks (GANs) (goodfellow2014generative), inspired by InfoGAN in particular (chen2016infogan), for learning a plannable representation. A GAN is trained for encoding start and goal states, and they plan a trajectory in the representation space as well as reconstructing intermediate observations in the plan. Our method is different as it does not need to reconstruct the observations and the forward model is directly optimized for prediction.

4.2.3 Prediction-based representation learning

In Predictable Feature Analysis (richthofer2015predictable), representations are learned that are predictable by autoregression processes. Our method is more flexible and scales better to higher dimensions as the predictor can be any differentiable function. Using the output of other networks as prediction targets instead of the original pixels is not new. The case where the output of a larger model is the target for a smaller model is known as knowledge distillation (bucilua2006model; hinton2015distilling). This is used for compressing a model ensemble into a single model. vondrick2016anticipating learn to make high-level semantic predictions of future frames in video data. Given a current frame, a neural network predicts the representation of a future frame. Our approach is not constrained to pretrained representations: we learn our representation together with the prediction network. Moreover, we extend this general idea by also admitting an action as the input to our predictor network.

4.3 Materials and methods

In this work, we study different representations for learning the transition function of a partially observable MDP (POMDP) and propose a network that jointly learns a representation with a prediction model and apply it for latent space planning. We summarize here the different ingredients of the LARP network, our proposed solution. More detailed descriptions will follow in later sections.

Training the predictor network: We use a two-stream fully connected neural network to predict the representation of the future state given the current state's representation and the action bridging those two states. The predictor module is trained with a simple mean-squared error term.

Handling constant solutions: The representation could be transferred from other domains or learned from scratch on the task. If the representation is learned simultaneously with an estimate of a Markov decision process's (MDP) transition function, precautions must be taken such that the prediction loss is not trivially minimized by a representation that is constant over all states. We consider three approaches for tackling the problem: sphering the output, regularizing with a contrastive loss term, and regularizing with a reconstructive loss term.

Searching in the latent space: Combining the representation with the predictor network, we can search in the latent space until a node is found that has the largest similarity to the representation of the goal viewpoint using a modified best-first search algorithm.

NORB environment: We use the NORB dataset (lecun2004learning) for our experiments. This dataset consists of images of objects from different viewpoints, and we create viewpoint-matching tasks from the dataset.

4.3.1 On good representations

We rely on heuristics to provide sufficient evidence for a good (albeit not necessarily optimal) decision at every time step to reach the goal. Here, we use the Euclidean distance in representation space: a sequence of actions is preferred if its end location is closest to the goal. The usefulness of this heuristic depends on how well and how coherently the Euclidean distance encodes the actual distance to the goal state in terms of the number of actions. A learned predictor network approximates the transition function of the environment for planning in the latent space defined by some representation. This raises the question: what is the ideal representation for latent space planning?

Our experiments show that an openly available, general-purpose representation, such as a pretrained VGG16 (simonyan2014very), can already provide sufficient guidance to apply such heuristics effectively. Better still are representation models that are trained on the data at hand, for example, uniform manifold approximation and projection (UMAP) (mcinnes2018umap) or variational auto-encoders (VAEs) (kingma2013auto). One might, however, ask what a particularly suited representation might look like when attainability is ignored. It would need to take the topological structure of the underlying data manifold into account, so that the Euclidean distance becomes a good proxy for the geodesic distance. One class of methods that satisfy this are spectral embeddings, such as Laplacian Eigenmaps (LEMs) (belkin2003laplacian). Their representations are smooth and discriminative, which is ideal for our purpose. However, they do not easily produce out-of-sample embeddings, so they will only be applied in an in-sample fashion to serve as a control experiment, yielding optimal performance.

4.3.2 Predictor network

As the representation is used by the predictor network, we want it to be predictable. Thus, we optimize the representation learner simultaneously with the predictor network, in an end-to-end manner. Suppose we have a representation map $\phi$ and a training set of $N$ labeled data tuples $(X_{t} = [o_{t}, a_{t}], Y_{t} = o_{t+1})$, where $o_{t}$ is the observation at time step $t$ and $a_{t}$ is an action resulting in a state with observation $o_{t+1}$. We train the predictor $f$, parameterized by $\theta$, by minimizing the mean-squared error loss over $f$'s parameters:

$$\underset{\theta}{\arg\min}\; L_{prediction}(D, \theta) = \underset{\theta}{\arg\min}\; \frac{1}{N} \sum_{t=1}^{N} \left\| \phi(o_{t+1}) - f_{\theta}(\phi(o_{t}), a_{t}) \right\|^{2} \tag{4.1}$$

where $D = \{(X_{t}, Y_{t})\}_{t=1}^{N}$ is our set of training data. We construct $f$ as a two-stream, fully connected, neural network. Using this predictor we can carry out planning in the latent space defined by $\phi$. By planning, we mean that there is a start state with observation $o_{start}$ and a goal state with observation $o_{goal}$, and we want to find a sequence of actions connecting them. The network outputs the expected representation after acting. Using this, we can formulate planning as a classical pathfinding or graph traversal problem.
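As a minimal sketch of the prediction loss in Eq. (4.1), the following computes the mean squared prediction error over a small set of transitions. The functions `phi` and `f` here are hypothetical toy stand-ins (plain Python functions on lists), not the neural networks used in this chapter, and the data are made up for illustration.

```python
# Toy stand-ins (assumptions for illustration): the thesis uses neural
# networks for both the representation map phi and the predictor f.
def phi(o):
    """Toy representation map: the observation is already low-dimensional."""
    return [float(x) for x in o]


def f(rho, a):
    """Toy predictor: the scalar action shifts the first coordinate."""
    return [rho[0] + a] + rho[1:]


def prediction_loss(data, phi, f):
    """Mean over the data of ||phi(o_next) - f(phi(o), a)||^2, cf. Eq. (4.1)."""
    total = 0.0
    for o, a, o_next in data:
        target = phi(o_next)
        pred = f(phi(o), a)
        total += sum((t - p) ** 2 for t, p in zip(target, pred))
    return total / len(data)


# Three (o_t, a_t, o_{t+1}) transitions; the last one is mispredicted by f.
data = [([0, 0], 1, [1, 0]), ([1, 0], 1, [2, 0]), ([2, 0], -1, [1, 1])]
loss = prediction_loss(data, phi, f)
```

Only the third transition contributes to the loss, since the toy predictor does not model the change in the second coordinate.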

4.3.3 Avoiding trivial solutions

In the case where $\phi$ is trainable and parameterized by $\eta$, the loss for the whole system that only cares about maximizing predictability is

$$\underset{\theta, \eta}{\arg\min}\; L_{prediction}(D, \theta, \eta) = \underset{\theta, \eta}{\arg\min}\; \frac{1}{N} \sum_{t=1}^{N} \left\| \phi_{\eta}(o_{t+1}) - f_{\theta}(\phi_{\eta}(o_{t}), a_{t}) \right\|^{2} \tag{4.2}$$

for a given data set $D$. With no constraints on the family of functions that $\phi$ can belong to, we run the risk that the representation collapses to a constant. Constant functions $\phi = c$ trivially yield zero loss for any set $D$ if $f_{\theta}$ outputs the input state again for any $a$, i.e. $f(\phi(\cdot), a) = \phi(\cdot)$. Constant representations are optimal with respect to predictability, but they are unfortunately useless for planning, as we need to discriminate different states. This objective is not present in the proposed loss function in Eq. (4.2) and we must thus add a constraint or another loss term to facilitate differentiating the different states. There are several ways to limit the function space such that constant functions are not included, for example with decoder (goroshin2015learning) or adversarial (denton2017unsupervised) loss terms. In this work, we do this with (i) a sphering layer, (ii) a contrastive loss, or (iii) a reconstructive loss.
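The collapse described above can be illustrated with a short sketch: a constant representation combined with a predictor that returns its input attains zero prediction loss on any data, yet maps all observations to the same code. The names `constant_phi` and `identity_f` and the data are hypothetical and chosen only for this illustration.

```python
def prediction_loss(data, phi, f):
    """Mean squared error between phi(o_next) and f(phi(o), a), cf. Eq. (4.2)."""
    total = 0.0
    for o, a, o_next in data:
        total += sum((t - p) ** 2 for t, p in zip(phi(o_next), f(phi(o), a)))
    return total / len(data)


def constant_phi(o):
    """Collapsed representation: ignores the observation entirely."""
    return [0.5, -1.0]


def identity_f(rho, a):
    """Predictor that outputs its input representation for any action."""
    return list(rho)


data = [([0, 0], 1, [1, 0]), ([5, 3], -1, [4, 3])]
collapsed_loss = prediction_loss(data, constant_phi, identity_f)

# Two clearly different observations receive identical codes, so a planner
# has nothing to discriminate between them.
states_distinguishable = constant_phi([0, 0]) != constant_phi([5, 3])
```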

(i) Sphering the output

The problem of trivial solutions is solved in SFA (wiskott2002slow) and related methods (escalante2013solve; schuler2018gradient) by constraining the overall covariance of the output to be $I$. Including this constraint in our setting yields the optimization formulation:

$$\begin{aligned} \underset{\eta, \theta}{\text{minimize}} \quad & L_{prediction}(D, \theta, \eta) \\ \text{subject to} \quad & E_{D}[\phi_{\eta}] = 0 \quad \text{(zero mean)} \\ & E_{D}[\phi_{\eta} \phi_{\eta}^{T}] = I \quad \text{(unit covariance)} \end{aligned} \tag{4.3}$$

We enforce this constraint in our network via architecture design. The last layer performs differentiable sphering (schuler2018gradient; hlynsson2019learning) of the second-to-last layer's output using the whitening matrix $W$. We get $W$ using power iteration of the following iterative formula:

$$u^{[i+1]} = \frac{T u^{[i]}}{\| T u^{[i]} \|} \tag{4.4}$$

where the superscript $i$ tracks the iteration number and $u^{[0]}$ can be an arbitrary vector. The power iteration algorithm converges to the eigenvector $u$ of a matrix $T$ that belongs to the largest eigenvalue in a few hundred quick iterations. The eigenvalue $\lambda$ is determined, and we subtract the contribution of the eigenvector from the matrix:

$$T \leftarrow T - \lambda u u^{T} \tag{4.5}$$

The process is repeated until the sphering matrix is found:

$$W = \sum_{j = 0} \frac{1}{\sqrt{\lambda_{j}}} u_{j} u_{j}^{T} \tag{4.6}$$
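The procedure in Eqs. (4.4) to (4.6) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the differentiable layer used in our network: we assume that $T$ is the symmetric, full-rank covariance matrix of the layer's output, we estimate each eigenvalue with the Rayleigh quotient $u^{T} T u$, and the helper names (`top_eigenpair`, `whitening_matrix`) are hypothetical.

```python
import math


def matvec(T, u):
    return [sum(t * x for t, x in zip(row, u)) for row in T]


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def top_eigenpair(T, iters=500):
    """Power iteration, Eq. (4.4): returns the dominant eigenvalue and eigenvector."""
    u = [1.0 / (i + 1) for i in range(len(T))]  # arbitrary start vector u^[0]
    for _ in range(iters):
        v = matvec(T, u)
        norm = math.sqrt(sum(x * x for x in v))
        u = [x / norm for x in v]
    lam = sum(x * y for x, y in zip(u, matvec(T, u)))  # Rayleigh quotient
    return lam, u


def whitening_matrix(T):
    """Accumulate W = sum_j u_j u_j^T / sqrt(lambda_j), deflating T after each pair."""
    n = len(T)
    T = [row[:] for row in T]  # work on a copy
    W = [[0.0] * n for _ in range(n)]
    for _ in range(n):  # assumes T has full rank
        lam, u = top_eigenpair(T)
        for i in range(n):
            for j in range(n):
                W[i][j] += u[i] * u[j] / math.sqrt(lam)  # Eq. (4.6)
                T[i][j] -= lam * u[i] * u[j]             # Eq. (4.5)
    return W


cov = [[2.0, 1.0], [1.0, 2.0]]  # toy covariance with eigenvalues 3 and 1
W = whitening_matrix(cov)
sphered_cov = matmul(matmul(W, cov), W)  # should be close to the identity
```

Since $W$ is the inverse square root of the covariance, transforming the data with $W$ gives unit covariance, which is what the constraint in Eq. (4.3) asks for.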

The whole system, including the sphering layer, can be seen in Fig 4.2, with an abstract convolutional neural network as the representation $\phi$ and a fully-connected neural network as the prediction function $f$.

Fig 4.2: Predictive representation learning with sphering regularization. The observation $o_{t}$, and the resulting observation $o_{t+1}$ after the action $a$ has been performed in $o_{t}$, are passed through the representation map $\phi$, whose outputs are passed to a differentiable sphering layer before being passed to $f$. The predictive network $f$ minimizes the loss function $L$, which is the mean-squared error between its prediction from $\rho_{t} = \phi(o_{t})$ and the target $\rho_{t+1} = \phi(o_{t+1})$.

(ii) Contrastive loss

Constant solutions can also be dealt with in the loss function instead of via architecture design. hadsell2006dimensionality propose to solve this with a loss function that pulls together the representation of similar objects (in our case, states that are reachable from each other with a single action) but pushes apart the representation of dissimilar ones:

$$L_{contrastive}(o, o') = \begin{cases} \| \phi(o) - \phi(o') \| & \text{if } o, o' \text{ are similar} \\ \max(0, m - \| \phi(o) - \phi(o') \|) & \text{otherwise} \end{cases} \tag{4.7}$$

where $m$ is a margin and $\| \cdot \|$ is some norm, usually the Euclidean one. The representation of dissimilar objects is pushed apart only if the inequality

$$\| \phi(o) - \phi(o') \| < m \tag{4.8}$$

holds. During each training step, we compare each observation to a similar and a dissimilar observation simultaneously (schroff2015facenet) by passing a triplet of (positive, anchor, negative) observations during training to three copies of $\phi$. In our experiments, the positive corresponds to the predicted embedding of $o_{t+1}$ given $o_{t}$ and $a_{t}$; the anchor is the true resulting embedding after the action $a_{t}$ is performed in the state with observation $o_{t}$; and $\phi(o_{n})$, the negative, is the representation of an arbitrarily chosen observation that is not reachable from $o_{t}$ with a single action. For environments where this is determinable, such as in our experiments, this can be assessed from the environment's full state. When this information is not available, ensuring for $\phi(o_{n})$ and $\phi(o_{t})$ that $|n - t| > 2$ is a good proxy, even though this can result in some incorrect triplets, for example when the agent runs in a self-intersecting path. We define the representation of the observation at time step $t$ as $\rho_{t} = \phi(o_{t})$ and the next-step prediction $\tilde{\rho}_{t+1} := f(\phi(o_{t}), a_{t})$ for readability. Our (positive, anchor, negative) triplet is thus $(\tilde{\rho}_{t+1}, \phi(o_{t+1}), \phi(o_{n}))$ and we minimize the triplet loss:

$$L_{contrastive}(o_{t}, o_{t+1}, o_{n}, a_{t}) = \| \rho_{t+1} - \tilde{\rho}_{t+1} \| + \max(0, m - \| \rho_{t+1} - \rho_{n} \|) \tag{4.9}$$

It would seem that $\rho_{t+1}$ and $\tilde{\rho}_{t+1}$ are interchangeable, since the second term is included only to prevent the representation from collapsing into a constant. However, if the loss function is

$$L_{contrastive}(o_{t}, o_{t+1}, o_{n}, a_{t}) = \| \rho_{t+1} - \tilde{\rho}_{t+1} \| + \max(0, m - \| \tilde{\rho}_{t+1} - \rho_{n} \|) \tag{4.10}$$

then the network is rewarded during training for making $f$ poor at predicting the next representation instead of simply pushing the representations of $o_{t+1}$ and $o_{n}$ away from each other. There are two main ways to set the margin $m$. One is dynamically determining it per batch (sun2014deep). The other, which we choose, is constraining the representation to be on a hypersphere using $L_{2}$ normalization and setting a small constant margin such as $m = 0.2$ (schroff2015facenet). The architecture for the training scheme using the contrastive loss regularization is depicted in Fig. 4.3.
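A minimal sketch of the triplet loss in Eq. (4.9) with $L_{2}$-normalized representations and the constant margin $m = 0.2$ is given below. The vectors are made-up two-dimensional examples and the function names are hypothetical; in the actual system the representations come from $\phi$ and the prediction from $f$.

```python
import math


def l2_normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(rho_next, rho_pred, rho_neg, m=0.2):
    """Eq. (4.9): pull the prediction toward the true next representation and
    push the true next representation away from the negative, up to margin m."""
    pull = distance(rho_next, rho_pred)
    push = max(0.0, m - distance(rho_next, rho_neg))
    return pull + push


rho_next = l2_normalize([1.0, 0.0])   # true next-step representation
rho_pred = l2_normalize([1.0, 0.0])   # a perfect prediction
near_negative = l2_normalize([1.0, 0.1])
far_negative = l2_normalize([0.0, 1.0])

loss_near = contrastive_loss(rho_next, rho_pred, near_negative)
loss_far = contrastive_loss(rho_next, rho_pred, far_negative)
```

With a perfect prediction, only a negative that lies within the margin contributes to the loss; a negative that is already farther away than $m$ is ignored.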

Fig 4.3: Predictive representation learning with contrastive loss regularization. We minimize the contrastive loss function $L_{contrastive}$ (Eq. 4.9). The predicted future representation $\tilde{\rho}_{t+1}$ is pulled toward the next step's representation $\rho_{t+1}$. However, $\rho_{t+1}$ is pushed away from the negative state's representation $\rho_{n}$ if the distance between them is less than $m$. The observation $o_{n}$ is randomly selected from those that are not reachable from $o_{t}$ with a single action.

(iii) Reconstructive loss

Trivial solutions are avoided by Goroshin et al. (goroshin2015learning) by introducing a decoder network $D$ to a system that would otherwise converge to a constant representation. We incorporate this intuition into our framework with the loss function

$$\begin{aligned} L_{decoder}(o_{t}, o_{t+1}, a_{t}) &= L_{prediction}(o_{t}, o_{t+1}) + L_{reconstruction}(o_{t}, o_{t+1}) \\ &= \| \rho_{t+1} - \tilde{\rho}_{t+1} \|^{2} + \alpha \| o_{t+1} - D(\tilde{\rho}_{t+1}) \|^{2} \end{aligned} \tag{4.11}$$

where $\alpha$ is a positive, real coefficient to control the regularization strength. Fig. 4.4 shows how the models and loss functions are related during the training of the representation and predictor using both a predictive and a reconstructive loss term.

Fig 4.4: Predictive representation learning with decoder loss regularization. At the time step $t$, the observation $o_{t}$ is passed to the representation $\phi$. This produces $\rho_{t}$, which is passed, along with the action $a_{t}$ at time step $t$, to the predictor network $f$. This produces the predicted $\tilde{\rho}_{t+1}$, which is compared to $\rho_{t+1} = \phi(o_{t+1})$ in the mean-squared error term $L_{prediction}$. The prediction $\tilde{\rho}_{t+1}$ is also passed to the decoder network $D$. We then compare $\tilde{o}_{t+1} = D(\tilde{\rho}_{t+1})$ with $o_{t+1}$ in $L_{decoder}$, another mean-squared loss term. The final loss is the sum of these two loss terms, $L_{total} = L_{prediction} + L_{decoder}$.

The desired effect of the regularization can also be achieved by replacing the second term in Eq. (4.11) with $\alpha \| o_{t+1} - D(\rho_{t+1}) \|^{2}$. By doing this we would maximize the reconstructive property of the latent code in and of itself, which is not inherently useful for planning. We instead add an additional level of predictive power in $f$: in addition to predicting the next representation, its prediction must also be useful in conjunction with the decoder $D$ for reconstructing the new true observation. This approach can have the largest computational overhead of the three, depending on the size of the decoder. We construct the decoder network $D$ such that it closely mirrors the architecture of $\phi$, with convolutions replaced by transposed convolutions and max-pooling replaced by upsampling.
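The decoder-regularized loss can be sketched for a single transition as follows. The decoder here is a hypothetical toy function that duplicates each latent coordinate, standing in for the transposed-convolution network $D$; the value of `alpha` and the example vectors are likewise made up for illustration.

```python
def squared_error(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def decoder_regularized_loss(o_next, rho_next, rho_pred, decoder, alpha=1.0):
    """Prediction error plus alpha times the error of reconstructing o_next
    from the *predicted* representation (not from the true one)."""
    prediction = squared_error(rho_next, rho_pred)
    reconstruction = squared_error(o_next, decoder(rho_pred))
    return prediction + alpha * reconstruction


def toy_decoder(rho):
    """Hypothetical decoder: each latent value is repeated twice."""
    return [x for x in rho for _ in range(2)]


o_next = [1.0, 1.0, 0.0, 0.0]   # toy next observation
rho_next = [1.0, 0.0]           # its true representation

loss_perfect = decoder_regularized_loss(o_next, rho_next, [1.0, 0.0], toy_decoder, alpha=0.5)
loss_collapsed = decoder_regularized_loss(o_next, rho_next, [0.0, 0.0], toy_decoder, alpha=0.5)
```

Because the reconstruction term is evaluated on the prediction, a predictor that outputs an uninformative code is penalized twice: once for missing the target representation and once for failing to reconstruct the observation.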

4.3.4 Training the predictor network

We train the representation network and predictor network jointly by minimizing Eq. (4.2) (with the sphering layer), Eq. (4.9) or Eq. (4.11). The predictor network can also be trained on its own for a fixed representation map $\phi$. In this case, $f$ is tasked as before with predicting $\phi(o_{t+1})$ after the action $a_{t}$ is performed in the state with observation $o_{t}$ by minimizing the mean-squared error between $f(\phi(o_{t}), a_{t})$ and $\phi(o_{t+1})$. The networks are built with Keras (chollet2015keras) and optimized with RMSprop (tieleman2012lecture).

4.3.5 Planning in transition-learned domain representation space

We use a modified best-first search algorithm with the trained representations for our experiments (Algorithm 2).

Input: $o_{start}$, $o_{goal}$, max trials $m$, action set $A$, representation map $\phi$ and predictor function $f$
Output: A sequence of actions $(a_{0}, \dots, a_{n})$ connecting the start state to the goal state
1: Initialize the set $Q$ of unchecked representations with the representation of the start state $\phi(o_{start})$
2: Initialize the dictionary $P$ of representation-path pairs with the initial representation mapped to an empty sequence: $P[\phi(o_{start})] \leftarrow (\emptyset)$
3: Initialize the empty set of checked representations $C \leftarrow \emptyset$
4: for $k \leftarrow 0$ to $m$ do
5:     Choose $\rho' \leftarrow \arg\min_{\rho \in Q} \| \rho - \phi(o_{goal}) \|$
6:     Remove $\rho'$ from $Q$ and add it to $C$
7:     for all actions $a \in A$ do
8:         Get a new estimated representation $\rho^{*} \leftarrow f(\rho', a)$ and add it to the set $Q$
9:         Concatenate $a$ to the end of $P[\rho']$ and associate the resulting sequence with $\rho^{*}$ in the dictionary: $P[\rho^{*}] \leftarrow P[\rho']^{\frown}(a)$
10:     end for
11: end for
12: Find the most similar representation to the goal: $\rho_{result} \leftarrow \arg\min_{\rho \in Q \cup C} \| \rho - \phi(o_{goal}) \|$
13: return the sequence $P[\rho_{result}]$

Algorithm 2: Perform a simulated rollout to find a state that is maximally similar to a goal state. Output a sequence to reach the found state from the start state.
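Algorithm 2 can be sketched in Python as below. This is an illustration on a toy problem and not the code used in our experiments: the representations are exact integer grid coordinates, `toy_predictor` is a hypothetical stand-in for the learned network $f$, and newly predicted representations that were already reached are skipped by exact match (with real-valued predictions a tolerance would be needed).

```python
import math


def best_first_search(rho_start, rho_goal, max_trials, actions, f):
    """Greedy best-first search in representation space, following Algorithm 2."""
    def dist_to_goal(rho):
        return math.dist(rho, rho_goal)

    Q = {rho_start}       # unchecked representations
    P = {rho_start: ()}   # representation -> action sequence that reaches it
    C = set()             # checked representations
    for _ in range(max_trials):
        if not Q:
            break
        rho = min(Q, key=dist_to_goal)
        Q.remove(rho)
        C.add(rho)
        for a in actions:
            rho_new = f(rho, a)
            if rho_new in P:   # already reached by an earlier path: skip
                continue
            Q.add(rho_new)
            P[rho_new] = P[rho] + (a,)
    rho_result = min(Q | C, key=dist_to_goal)
    return P[rho_result]


# Hypothetical toy predictor: each action shifts a 2D grid position.
MOVES = {"left": (-1, 0), "right": (1, 0), "up": (0, 1), "down": (0, -1)}


def toy_predictor(rho, a):
    dx, dy = MOVES[a]
    return (rho[0] + dx, rho[1] + dy)


plan = best_first_search((0, 0), (3, 2), max_trials=20,
                         actions=list(MOVES), f=toy_predictor)
```

In this toy setting the predictor is exact, so the returned plan leads to the goal; with a learned predictor the plan only leads to the predicted representation that is closest to the goal representation.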

From a given state, the agent performs a simulated rollout to search for the goal state. For each action, the representation of the initial observation is passed to the predictor function along with the action. This results in a predicted next-step representation, which is added to a set. The actions taken so far that result in each prediction are also recorded. The representation that is closest to the goal (using for example the Euclidean distance) is then taken for consideration and removed from the set. This process is repeated until the maximum number of trials is reached. The algorithm then outputs the sequence of actions resulting in the predicted representation that is the closest to the goal representation.

To make the algorithm faster, we only consider paths that do not take us to a state that has already been evaluated, even if there might be a difference in the predictions from going this roundabout way. That is, if a permutation of the actions in the next path to be considered is already in an evaluated path, it will be skipped. This has the same effect as transposition tables used to speed up search in game trees. Paths might be produced with redundancies, which can be amended with path-simplifying routines (e.g. take one step forward instead of one step left, one forward then one right).

We do Model-Predictive Control (garcia1989model), that is, after a path is found, one action is performed and a new path is recalculated, starting from the new position. Since the planning is possibly over a long time horizon, we might have a case where a previous state is revisited. To avoid loops resulting from this, we keep track of visited state-action pairs and avoid an already chosen action for a given state.
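The replanning loop with loop avoidance can be sketched as follows. To keep the example short, the planner is a hypothetical one-step greedy lookahead rather than Algorithm 2, and the toy environment `step` doubles as a perfect predictor; both are assumptions made only for this illustration.

```python
import math

MOVES = {"left": (-1, 0), "right": (1, 0), "up": (0, 1), "down": (0, -1)}


def step(state, a):
    """Toy environment transition (also used as a perfect predictor here)."""
    dx, dy = MOVES[a]
    return (state[0] + dx, state[1] + dy)


def plan(state, goal, banned):
    """Stand-in planner: returns a one-action path that greedily reduces the
    distance to the goal, skipping banned (state, action) pairs."""
    candidates = [a for a in MOVES if (state, a) not in banned]
    candidates.sort(key=lambda a: math.dist(step(state, a), goal))
    return candidates[:1]


def model_predictive_control(start, goal, max_steps=50):
    """Plan, execute only the first action, then replan from the new state."""
    state, visited, taken = start, set(), []
    while state != goal and len(taken) < max_steps:
        path = plan(state, goal, visited)
        if not path:
            break                      # every action was already tried here
        a = path[0]
        visited.add((state, a))        # never repeat this action in this state
        state = step(state, a)
        taken.append(a)
    return state, taken


final_state, actions_taken = model_predictive_control((0, 0), (2, 1))
```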

4.3.6 NORB viewpoint-matching experiments

For our experiments, we create an OpenAI Gym environment based on the small NORB dataset (lecun2004learning). The code for the environment is available at https://github.com/wiskott-lab/gym-norb and requires the pickled NORB dataset hosted at https://s3.amazonaws.com/unsupervised-exercises/norb.p. The dataset contains 50 toys, each belonging to one of five categories: four-legged animals, human figures, airplanes, trucks, and cars. Each object has stereoscopic images under six lighting conditions, 9 elevations, and 18 azimuths (in-scene rotation). In all of the experiments, we train the methods on nine car class toys, testing on the other toys.

Each trial in the corresponding RL environment revolves around a single object under a given lighting condition. The agent is presented with a start and a goal viewpoint of the object and transitions between images until the current viewpoint matches the goal, where each action operates the camera. To be concrete, the actions correspond to turning a turntable back and forth by $20^{\circ}$, moving the camera up or down by $5^{\circ}$ and, in one experiment, changing the lighting. The trial is a success if the agent manages to change viewpoints from the start position until the goal viewpoint is matched in fewer than twice the minimum number of actions necessary.

We compare the representations learned using the three variants of our method to five representations from the literature, namely (i) Laplacian Eigenmaps (belkin2003laplacian), (ii) the second-to-last layer of VGG16 pretrained on ImageNet (deng2009imagenet), (iii) UMAP embeddings (mcinnes2018umap), (iv) a convolutional encoder (masci2011stacked) and (v) VAE codes (kingma2013auto). As fixed representations do not change throughout the training, they can be saved to disk, speeding up the training. As a reference, we consider three reinforcement learning methods working directly on the input images: (i) Deep Q-Networks (DQN) (mnih2013playing), (ii) Proximal Policy Optimization (PPO) (schulman2017proximal) and (iii) World Models (ha2018world).

The dataset is turned into a graph for search by setting each image as a node and each viewpoint-changing action as an edge. The task of the agent is to transition between neighboring viewing angles until a goal viewpoint is reached. The total number of training samples is fixed at 25600. For our method, a sample is a single ($o_{t}$, $o_{t+1}$, $a_{t}$) triplet to be predicted, while for the regular RL methods it is a ($o_{t}$, $o_{t+1}$, $a_{t}$, $r_{t}$) tuple that also contains the reward $r_{t}$.
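The viewpoint graph and the success criterion can be sketched as follows. This is an illustration and not the code of the gym-norb environment: the action names are hypothetical, lighting changes are left out, the azimuth is assumed to wrap around (18 steps of $20^{\circ}$ cover a full turn), and the elevation is assumed to stop at its lowest and highest setting.

```python
N_AZIMUTHS, N_ELEVATIONS = 18, 9


def step(view, action):
    """Move between neighboring viewpoints; view = (azimuth index, elevation index)."""
    az, el = view
    if action == "turn_left":
        return ((az - 1) % N_AZIMUTHS, el)
    if action == "turn_right":
        return ((az + 1) % N_AZIMUTHS, el)
    if action == "up":
        return (az, min(el + 1, N_ELEVATIONS - 1))
    if action == "down":
        return (az, max(el - 1, 0))
    raise ValueError(f"unknown action: {action}")


def min_actions(start, goal):
    """Shortest path length: cyclic azimuth distance plus elevation distance."""
    d_az = abs(start[0] - goal[0])
    d_az = min(d_az, N_AZIMUTHS - d_az)
    return d_az + abs(start[1] - goal[1])


def is_success(start, goal, n_taken, reached_goal):
    """Success: goal matched in fewer than twice the minimum number of actions."""
    return reached_goal and n_taken < 2 * min_actions(start, goal)


shortest = min_actions((0, 0), (17, 3))   # one turn left plus three steps up
```

For this start and goal pair the shortest path has four actions, so a trial that reaches the goal counts as a success with up to seven actions.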

4.3.7 Model architectures

Input

The network $\phi$ encodes the full NORB input, a $96 \times 96$ pixel grayscale image, to lower-dimensional representations. The system as a whole receives the image from the current viewpoint, the image of the goal viewpoint and a one-hot encoding of the taken action. The image inputs are converted from integers ranging from 0 to 255 to floating point numbers ranging from 0 to 1.
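The input conversion can be sketched in a few lines. The helper names, the tiny $2 \times 2$ image, and the choice of four actions are hypothetical and only serve the illustration.

```python
def scale_image(pixels):
    """Convert integer pixel values in [0, 255] to floats in [0, 1]."""
    return [[p / 255.0 for p in row] for row in pixels]


def one_hot(index, n_actions):
    """One-hot encoding of the action with the given index."""
    return [1.0 if i == index else 0.0 for i in range(n_actions)]


image = scale_image([[0, 255], [51, 102]])   # stand-in for a 96 x 96 image
action = one_hot(2, n_actions=4)
```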

Representation learner $\phi$ architecture

We use the same architecture for the $\phi$ network in all of our experiments except for varying the output dimension (Table 4.1).

Layer	Filters/Units	Kernelsize	Strides	Outputshape	Activation
Input				(96,96,1)
Convolutional	64	$5 \times 5$	$2 \times 2$	(45,45,64)	ReLU
Max-pooling			$2 \times 2$	(22,22,64)
Convolutional	128	$5 \times 5$	$2 \times 2$	(9,9,128)	ReLU
Flatten				(10368)
Dense	600			(600)	ReLU
Dense	#Features			(#Features)	Linear

Table 4.1: Representation network architecture.

Regularizing decoder architecture $D$

The decoder network $D$ has the architecture listed in Table 4.2. It is designed to approximately invert each operation in the original $\phi$ network.

Layer	Filters/Units	KernelSize	Strides	Outputshape	Activation
Input				(#Features)
Dense	512			(512)	ReLU
BN				(512)
Dense	12800			(12800)	ReLU
BN				(12800)
Reshape				(10,10,128)
CT	128	$5 \times 5$	$2 \times 2$	(23,23,128)	ReLU
Upsampling			$2 \times 2$	(46,46,128)
BN				(46,46,128)
CT	64	$5 \times 5$	$2 \times 2$	(95,95,64)	ReLU
BN				(95,95,64)
CT	1	$2 \times 2$	$1 \times 1$	(96,96,1)	Sigmoid

Table 4.2: Regularizing decoder architecture. The upsampling layer uses linear interpolation, BN stands for Batch Normalization and CT stands for Convolutional Transpose.

Predictor network $f$

The predictor network $f$ is a two-stream dense neural network. Each stream consists of a dense layer with a rectified linear unit (ReLU) activation, followed by a batch normalization (BatchNorm) layer. The outputs of these streams are then concatenated and passed through 3 dense layers with ReLU activations, each one followed by a BatchNorm, and then an output dense layer, see Table 4.3.

Layer	Filters/Units	Outputshape	Activation
$ϕ$ Stream:Input		(#Features)	ReLU
$ϕ$ Stream:Dense	256	(256)	ReLU
$ϕ$ Stream:BatchNormalization		(256)
$A$ Stream:Input		(#Actions)	ReLU
$A$ Stream:Dense	128	(128)	ReLU
$A$ Stream:BatchNormalization		(128)
Concatenate $ϕ$ and $A$ streams		(384)
Dense	256	(256)	ReLU
BatchNormalization		(256)
Dense	256	(256)	ReLU
BatchNormalization		(256)
Dense	128	(128)	ReLU
BatchNormalization		(128)
Dense	#Features	(#Features)	Linear

Table 4.3: Representation predictor architecture. The $\phi$ stream receives the representation as input and the $A$ stream receives the one-hot action as input. Both streams are processed in parallel and then concatenated, with each operation applied from top to bottom sequentially. The number of hidden units in the last layer depends on the chosen dimensionality of the representation.

4.4 Results

With our empirical evaluation, we aim to answer the following research questions:

1. (Monotonicity) Is the Euclidean distance between a suitable representation and the goal representation decreasing as the number of actions that separate them decreases?
2. (Trained predictability) Is training a representation for predictability, as proposed, feasible?
3. (Dimensionality) What is the best dimensionality of the latent space for our planning tasks?
4. (Solution constraints) In terms of planning performance, what are promising constraints to place on the representation to avoid trivial solutions?
5. (Benchmarking) How does planning with LARP compare to other methods from the RL literature?
6. (Generalization) How well does our method generalize to unseen environments?

We will refer to these research questions by number below as they get addressed.

4.4.1 Latent space visualization

Once the representation and predictor networks are trained, we apply Algorithm 2 to the viewpoint-matching task. As described above, the goal is to find a sequence of actions that connects the start state to the goal state, where the two states differ in their configurations. To support the qualitative analysis of the latent space, we plot heatmaps of similarity between the goal representation and the predicted representation of nodes during search (Fig 4.6). Of the 10 car toys in the NORB dataset, we randomly chose 9 for our training set and test on the remaining one.

In-sample embedding: Laplacian Eigenmaps

First, we consider research question 1 (monotonicity). To obtain a best-case representation, we embed the toy using Laplacian Eigenmaps. Embedding a single toy in three dimensions using Laplacian Eigenmaps results in a tube-like embedding that encodes both elevation and azimuth angles, see Fig 4.5. Three dimensions are needed so that the cyclic azimuth can be embedded correctly as $\sin(θ)$ and $\cos(θ)$.
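To illustrate why the cyclic azimuth needs two coordinates, the sketch below builds a hand-made three-dimensional embedding of a viewpoint. It is not Laplacian Eigenmaps, only a stand-in with the same tube-like geometry; the function names and the elevation scaling are assumptions made for the example.

```python
import math


def embed_view(elevation_deg, azimuth_deg):
    """Hand-made tube embedding: azimuth on a circle, elevation along the axis."""
    theta = math.radians(azimuth_deg)
    # Dividing the elevation by 90 is an arbitrary scaling chosen for illustration.
    return (math.sin(theta), math.cos(theta), elevation_deg / 90.0)


def euclidean(u, v):
    """Euclidean distance between two equally long tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Azimuths 340 and 0 degrees are neighbours on the circle, 180 and 0 are opposite.
near = euclidean(embed_view(30, 340), embed_view(30, 0))
far = euclidean(embed_view(30, 180), embed_view(30, 0))
print(near, far)
```

With a single linear azimuth coordinate, 340° and 0° would appear far apart; on the circle they are closer to each other than any pair of views separated by more than one azimuth step.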

If this representation is now used to train the predictor, one would expect the predicted representation to become monotonically more similar to the goal representation as the state moves toward the goal. In Fig 4.6 we see that this is the case and that this behavior can be used effectively as a greedy heuristic. While the monotonicity is not always exact due to errors in the prediction, Fig 4.6 still qualitatively illustrates a best-case scenario.

Fig 4.6: Heatmap of Laplacian Eigenmap latent space similarity. Each pixel displays the difference between the predicted representation and the goal representation. Only the start and goal observations are given. The blue dot shows the start state, green the goal, and purple the solution state found by the algorithm. The search algorithm can rely on an almost monotonically decreasing Euclidean distance between each state’s predicted representation and the goal’s representation to guide its search.

We conclude from this that, for a suitable representation, the Euclidean distance between a current representation and the goal representation is monotonically increasing as a function of the number of actions that separate them. This supports the use of a prediction-based latent space search for planning.
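The role of monotonicity can be sketched with a toy greedy search. The code below is an illustrative stand-in, not Algorithm 2: the grid sizes, action names, and hand-made representation are assumptions, and the "predictor" is exact because it simply applies the known toy transition, whereas the learned predictor operates on representations and makes errors.

```python
import math

N_ELEVATION = 9   # assumed number of elevation settings
N_AZIMUTH = 18    # assumed number of azimuth settings (cyclic)
ACTIONS = {"az+": (0, 1), "az-": (0, -1), "el+": (1, 0), "el-": (-1, 0)}


def transition(state, action):
    """Toy dynamics: elevation is clipped, azimuth wraps around."""
    el, az = state
    d_el, d_az = ACTIONS[action]
    el = min(max(el + d_el, 0), N_ELEVATION - 1)
    az = (az + d_az) % N_AZIMUTH
    return (el, az)


def represent(state):
    """Hand-made tube-like representation of a viewpoint."""
    el, az = state
    theta = 2 * math.pi * az / N_AZIMUTH
    return (math.sin(theta), math.cos(theta), el / N_ELEVATION)


def predict(state, action):
    """Stand-in for the learned predictor; exact in this toy example."""
    return represent(transition(state, action))


def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def greedy_search(start, goal, max_steps=50):
    """Repeatedly take the action whose predicted representation is closest to the goal."""
    goal_rep = represent(goal)
    state, path = start, []
    for _ in range(max_steps):
        if state == goal:
            break
        action = min(ACTIONS, key=lambda a: distance(predict(state, a), goal_rep))
        state = transition(state, action)
        path.append(action)
    return state, path


final_state, path = greedy_search(start=(1, 16), goal=(6, 5))
print(final_state, len(path))
```

Because the distance to the goal shrinks with every useful action in this idealized setting, the greedy rule never needs to backtrack; with a noisy learned predictor this guarantee no longer holds, which is why the monotonicity is described as approximate.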

Out-of-sample embedding: pretrained VGG16 representation

Next, we consider the pretrained representation of the VGG16 network to get a representation that generalizes to new objects. We train the predictor network and plot the heatmap of the predicted similarity between each state and the goal state, beginning from the start state, in Fig 4.7.

Fig 4.7: Heatmap of VGG16 latent space similarity. The predictor network estimates the VGG16 representation of the resulting states as the object is manipulated. (a) The goal lies on a hill containing a maximum of representational similarity. (b) The accumulated errors of iterated estimations cause the algorithm to plan a path to a wrong state with a similar shape.

The heat distribution in this case is noisier. To get a view of the expected heatmap profile, we average several figures of this type to show basins of attraction during the search. Each heatmap is shifted such that the goal position is at the bottom, middle row (Fig 4.8, a). Here, it is apparent that the goal and the $180^{\circ}$ flipped (azimuth) version of the goal are attractor states. This is due to the representation map being sensitive to the rough shape of the object, but being unable to distinguish finer details. In Fig 4.8 (b) we display an aggregate heatmap when the agent can also change the lighting conditions. Our visualizations show a gradient toward the goal state in addition to visually similar far-away states, sometimes causing the algorithm to produce solutions that are the polar opposite of the goal with respect to the azimuth. Prediction errors also prevent the planning algorithm from finding the exact goal for every task, even if it is not distracted by the polar opposite.

Fig 4.8: Aggregate heatmaps of VGG16 representation similarities on test data. The data is collected as the state space is searched for a matching viewpoint. The pixels are arranged according to their elevation and azimuth difference from the goal state at $(0^{\circ}, 0^{\circ})$ on the left and $(0^{\circ}, 0^{\circ}, 0^{\circ})$ on the right. (a) We see clear gradients toward the two basins of attraction. There is less change along the elevation due to less change at each step. (b) The agent can also change the lighting of the scene, with qualitatively similar results. In this graphic we only measure the absolute value of the distance.

To investigate the accuracy of the search with respect to each dimension separately, we plot the histogram of distances between the goal states and the solution states in Fig 4.9. The goal and start states are chosen randomly, with the restriction that the azimuth distance and elevation distance between them are each uniformly sampled. For the rest of the chapter, all trials follow this sampling procedure. The results look less accurate for elevation than for azimuth because the elevation changes are smaller than the azimuth changes in the NORB dataset. The difference between the goal and solution viewpoints in Fig 4.7, left, for example, is hardly visible. If one were to scale the histograms by angle and not by bins, the drop-off would be similar.

Fig 4.9: Histograms of elevation-wise and azimuth-wise VGG16 errors. The histograms display the counts of the distance between goal and solution states along elevation (left) and azimuth (right) on test data. The distance between the start and goal viewpoints is equally distributed across all the trials, along both dimensions. The goal and the $180^{\circ}$ flipped (azimuth) version of the goal are attractor states.

4.4.2 Latent space dimensionality

With the next experiment, we aim to answer research question 2 (trained predictability). While tuning the details of the design, we also tackle research questions 3 (dimensionality) and 4 (solution constraints). We do an ablation study of the dimensionality of the representation for our method (Table 4.4). The test car is an unseen car toy from the NORB dataset, and the train car comes from the training set.

Representation	Dimensions	Training car (%)	Test car (%)
LARP (Contrastive)	96	59.3	56.8
	64	64.1	60.5
	32	72.3	59.4
	16	74.1	59.3
	8	82.7	58.0
LARP (Sphering)	96	41.1	37.8
	64	93.9	53.7
	32	89.8	51.9
	16	85.2	42.6
	8	85.1	40.1
LARP (Decoder)	96	58.0	51.9
	64	79.5	63.0
	32	77.8	61.7
	16	51.9	45.2
	8	51.1	42.4
VGG16	902	62.4	55.1
Random steps		3.5	3.5

Table 4.4: Ablation study of the representation dimensionality. We change the output dimension of the representation learner subnetwork and compare it to the VGG16 representation trained on ImageNet. The performance (mean success rate) is averaged over ten separate instantiations of our systems, where each instance is evaluated on a hundred trials of the viewpoint-matching task. A trial is a success if the goal is reached by taking less than twice the minimum number of actions needed to reach it. The standard deviations range between 0.1 and 0.3 for each table entry.
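The evaluation protocol in the caption can be written down compactly. The sketch below assumes a hypothetical trial record of the form (reached_goal, actions_taken, min_actions) and uses the strict "less than twice" threshold stated in the caption; the toy numbers are made up purely to exercise the functions.

```python
def is_success(reached_goal, actions_taken, min_actions):
    """A trial succeeds if the goal is reached in fewer than twice the minimum number of actions."""
    return reached_goal and actions_taken < 2 * min_actions


def mean_success_rate(instances):
    """Average the per-instance success rates (in percent) over all trained instances.

    `instances` is a list with one entry per trained system; each entry is a
    list of (reached_goal, actions_taken, min_actions) trial records.
    """
    rates = []
    for trials in instances:
        successes = sum(is_success(*trial) for trial in trials)
        rates.append(100.0 * successes / len(trials))
    return sum(rates) / len(rates)


# Made-up records for two instances, for illustration only.
toy_instances = [
    [(True, 5, 4), (True, 8, 4), (False, 3, 4), (True, 7, 4)],  # 2 of 4 succeed
    [(True, 3, 3), (True, 4, 3)],                               # 2 of 2 succeed
]
toy_rate = mean_success_rate(toy_instances)
print(toy_rate)
```

In the setting of Table 4.4, `instances` would hold ten lists of one hundred trial records each.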

There is no clear winner: the network with the sphering layer does best on one of the cars used during training, while the reconstructive-loss network does best on the held-out test car. The sharp difference in performance between the 64- and 96-dimensional sphering-regularized representations can be explained by the numerical instability of the power iteration method for matrix dimensions that are too large. The VGG16 representation is not the highest performer on any of the car toys. Many of VGG16’s representation values are 0 for all images in the NORB dataset, so we only use those that are nonzero for at least one image. We suggest that this high number of dead units is due to the representation being too general for the task of manipulating relatively homogeneous objects. Another drawback of using pretrained networks is that information might be encoded that is unimportant for the task. As a consequence, our search method is not guaranteed to output the correct solution in the latent space, as there might be distracting pockets of local minima. The random baseline has an average success rate of 3.5%, which is very clearly outperformed by our method.

As 64 is the best dimensionality for the representation on average, we continue with that number for our method in the transfer learning experiment. We conclude that the proposed method of training a representation for predictability is feasible. So far we have evidence that 64 is the best dimensionality of the representation’s latent space for our planning tasks. However, it is not yet conclusive which restriction on the representation is best for avoiding the trivial solution, in terms of planning performance.

4.4.3 Comparison with other RL methods

We now turn our attention to research question 5 (benchmarking), where we compare our method to the literature. For the comparison with standard RL methods, we use the default configurations of the model-free methods DQN and PPO as defined in OpenAI Baselines (baselines). Our model-based comparison is chosen to be world models (ha2018world), using the implementation from https://github.com/zacwellmer/WorldModels. We make sure that the compared RL methods are similar to our system in terms of the number of parameters as well as architecture layout and compare them with our method on the car viewpoint-matching task. The task setup is the same as before and is converted to an OpenAI gym environment: a start observation and a goal observation are passed to the agent. If the agent manages to reach the goal within 2 times the minimum number of actions required (the minimum number is calculated by the environment), the agent receives a reward and the task is considered a success. Otherwise, no reward is given. The results of the comparison can be viewed in Fig. 4.10. Each point on a curve shows the method’s mean success rate: the average of the cumulative reward from 100 test episodes from 5 different instantiations of the RL learner, so it is the average reward over 500 episodes in total. The test episodes are run on the same environment that is used for training, except that the policy is maximally exploiting and minimally exploring.

Fig 4.10: Reinforcement learning comparison. The vertical dashed lines indicate when the compared algorithm has processed the same number of transitions as our method and the horizontal dotted line indicates the test performance of our method. Each data point is the mean success rate of 100 test episodes after a varying number of training steps, averaged over 5 different seeds of each learner. The model-free methods in (a) and (b) train the representation and the controller simultaneously by acting in the environment and collecting new experiences. The representation in (c) is trained on 25.6k transitions, which is the same number we use. The plot shows the optimization curve for the controller, using a Covariance-Matrix Adaptation Evolution Strategy, which hardly improves after 500 or so training steps. The horizontal line starts at 0 for world models because the representation has finished training on the observations before the controller is optimized.

In our experiments, the DQN networks are much more sample inefficient than PPO, which in turn is more sample inefficient than our method. However, our method is more time-consuming at test time. We require a forward pass of the predictor network for each node that is searched before we take the next step, and the number of such nodes can grow rapidly if the target is far away. In contrast, only a single pass through the traditional RL networks is required to compute the next action. Our method reaches a 93.9% success rate on the train car (Table 4.4) using 25.6k samples, but the best PPO run only reaches 70.5% after training on the same number of samples. The best single PPO run needed 41.3k samples to get higher than a 93.9% success rate, and the average performance is higher than 93.9% at around 55k samples. After that, some PPO learners declined again in performance. The world models policy quickly reaches the same level of performance that DQN reached after 50k steps and PPO after approximately 8k steps, but it does not improve beyond that.

Fig 4.11: Re-training after placing obstacles in a checkerboard pattern. (a) The task is the same as before, but nothing happens if the agent attempts to move to a state containing a black rectangle. (b) After training the agents, we re-tested them after we introduced the checkerboard pattern of obstacles. Our method does not allow for re-training in the new environment.

We conclude that our method compares favorably to other methods from the RL literature in terms of data-efficiency.

4.4.4 Modifying the environment

We now modify the environment to answer research question 6 (generalization). To see how the methods compare when obstacles are introduced to the environment, we repeat the trial on one of the car objects, except that the agent can no longer pass through states whose elevation values are divisible by 10 and whose azimuth values are divisible by 40 (Fig. 4.11, (a)). As before, the goal and start locations can have any azimuth-elevation pair, but the agent cannot move into states with the properties indicated by the black rectangle. Every action is available to the agent at all locations as before, but the agent’s state is unchanged if it attempts to move to a state with a black rectangle. We trained LARP using the contrastive loss, PPO, and DQN agents until they reached 80% accuracy on our planning task and then tested them with the added obstacles. Our method loses about 10% performance, but PPO loses 50%. Nevertheless, we can continue training PPO until it quickly reaches top performance again (Fig. 4.11, (b)). Our method is not re-trained for the new task, and DQN did not reach a good performance again in the time we allotted for re-training. Thus, we see that our method is quite flexible and generalizes well when obstacles are introduced to the environment.
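The obstacle rule can be made concrete with a small transition function. This is a sketch under assumptions: the elevation range of 30° to 70° in 5° steps, the azimuth range of 0° to 340° in 20° steps, and the four action names are illustrative choices rather than the exact environment used in the experiments.

```python
# Assumed viewpoint grid: elevation 30..70 in steps of 5, azimuth 0..340 in steps of 20.
MIN_ELEVATION, MAX_ELEVATION = 30, 70
MOVES = {"up": (5, 0), "down": (-5, 0), "left": (0, -20), "right": (0, 20)}


def is_blocked(elevation, azimuth):
    """Obstacle rule from the text: elevation divisible by 10 and azimuth divisible by 40."""
    return elevation % 10 == 0 and azimuth % 40 == 0


def step(state, action):
    """Apply an action; the state is unchanged if the target state is an obstacle."""
    elevation, azimuth = state
    d_el, d_az = MOVES[action]
    new_elevation = min(max(elevation + d_el, MIN_ELEVATION), MAX_ELEVATION)
    new_azimuth = (azimuth + d_az) % 360  # azimuth is cyclic
    if is_blocked(new_elevation, new_azimuth):
        return state
    return (new_elevation, new_azimuth)


blocked_move = step((35, 40), "up")     # target (40, 40) is an obstacle
free_move = step((35, 20), "up")        # target (40, 20) is free
wrapped_move = step((45, 340), "right") # azimuth wraps to 0, elevation 45 is free
print(blocked_move, free_move, wrapped_move)
```

A search-based planner only needs this rule reflected in what it observes after acting, which is one way to read why the planning method degrades less than a policy trained without obstacles.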

4.4.5 Transfer to dissimilar objects

We now consider research question 6 (generalization) further by investigating how well our method transfers knowledge from one domain to another. Selecting the best dimensionality for the representation from the previous set of experiments, we further investigate the performance in harder situations using unseen, non-car objects. The models are trained on the same car objects as in the previous experiment, but they are tested on an array of different plastic soldiers: a kneeling soldier holding a bazooka, a standing soldier with a rifle, a Native American with a bow and spear, and a cowboy with a rifle (Fig. 4.12).

Fig 4.12: Toys for transfer learning experiments. From left to right: Soldier (Kneeling), Soldier (Standing), Native American with Bow, and Cowboy with Rifle.

Qualitative results

We offer a visualization of the learned representation of the kneeling soldier toy using our method, a convolutional encoder, and VGG16 in Fig. 4.13. Each embedding was reduced to 2 dimensions using t-SNE. Every method structures the domain similarly. In the bottom row, we see that the largest clusters for all methods are the ones with the highest (teal dots) illumination settings, which is explained by the effect of the lighting on the pixel value intensities. Within these clusters, we see clustering based on the azimuth (middle row). Finally, within the azimuth clusters, there is a gradient structure based on elevation (top row). This is due to the elevation changing in smaller steps, with 5-degree differences, than the azimuth, with 20-degree differences.

Quantitative results

Our experiments include the pretrained VGG16 network because we believe that a flexible algorithm should be based on a generic multi-purpose representation rather than on a specific representation. To our surprise, it is outperformed by our approach on every object, even though the viewpoint-matching task is done on objects that are different from the original car toys the networks are trained on. We also try transfer learning with a VAE and a convolutional encoder, using the same network as our representation, and with a UMAP embedding, with 64 features each. We display the results of the same viewpoint-matching tasks as before using these different representations on the unseen army figures in Table 4.5. Convolutional encoders have the lowest performance out of all the methods, while the VAE representation comes closest to the performance of our own representations.

Representation	Soldier (Kneeling)	Soldier (Standing)	Native American with Bow	Cowboy with Rifle	Mean
LARP (Contrastive)	22.7	11.1	12.1	20.7	16.7
LARP (Sphering)	18.4	15.3	15.2	15.8	16.2
LARP (Decoder)	14.3	16.6	14.7	15.6	15.3
VAE	14.4	13.1	13.6	13.8	13.7
VGG16	12.3	12.9	11.8	13.6	12.7
UMAP	8.6	10.8	9.7	11.1	10.1
Conv. Encoder	4.9	14.3	11.5	6.4	9.3

Table 4.5: Transfer learning performance. The methods are trained on the dataset of different car images, and then their performance on dissimilar toys is tested. Each number is the mean success rate of viewpoint matching out of 1000 trials.
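The Mean column of Table 4.5 can be re-derived from the four per-object success rates. The short check below copies the table values and compares each row average with the reported mean; the dictionary layout and helper name are only for illustration.

```python
# Per-object success rates (%) and reported means, copied from Table 4.5.
rows = {
    "LARP (Contrastive)": ([22.7, 11.1, 12.1, 20.7], 16.7),
    "LARP (Sphering)": ([18.4, 15.3, 15.2, 15.8], 16.2),
    "LARP (Decoder)": ([14.3, 16.6, 14.7, 15.6], 15.3),
    "VAE": ([14.4, 13.1, 13.6, 13.8], 13.7),
    "VGG16": ([12.3, 12.9, 11.8, 13.6], 12.7),
    "UMAP": ([8.6, 10.8, 9.7, 11.1], 10.1),
    "Conv. Encoder": ([4.9, 14.3, 11.5, 6.4], 9.3),
}


def row_mean(values):
    """Arithmetic mean of the per-object success rates."""
    return sum(values) / len(values)


# Largest gap between a recomputed mean and the reported (rounded) mean.
max_gap = max(abs(row_mean(values) - reported) for values, reported in rows.values())
best_on_average = max(rows, key=lambda name: rows[name][1])
print(max_gap, best_on_average)
```

All reported means agree with the recomputed averages up to rounding to one decimal place.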

Our representation is better than the others for this experiment. However, we see that the performance of our method drops significantly as we attempt to generalize to different objects with changed input statistics. Note that even though the representation clustering (Fig. 4.13) was more clear-cut using convolutional encoders and VGG16, compared to ours, they did not perform as well in the trials. We attribute this to the fact that the main loss term in our representation was the one of future state predictability, which may not give as clear a structure in two dimensions as a representation that is trained only for static image reconstruction.

4.5 Conclusion

In this chapter, we present Latent Representation Prediction (LARP) networks with applications to visual planning. We jointly learn a model to predict transitions in Markov decision processes with a representation trained to be maximally predictable. This allows us to accurately search the latent space defined by the representation using a heuristic graph traversal algorithm. We validate our method on a viewpoint-matching task derived from the NORB dataset, and we find that a representation that is optimized jointly with the predictor network performs best in our experiments.

A common issue of unsupervised representation learning is that of trivial solutions: a constant representation optimally solves the unsupervised optimization problem but conveys no information. To avoid the trivial solution, we constrain the training by introducing a sphering layer or a loss term that is either contrastive or reconstructive. Any of these approaches prevents the representations from collapsing to constants, and none of them displays stronger performance than the others in our experiments.

Our LARP representation is competitive with pretrained representations for planning and compares favorably to other reinforcement learning (RL) methods. Our approach is a sound solution for learning, from interactions alone, a useful representation that is suitable for planning. Furthermore, we find that our method has better data-efficiency during training than several reinforcement learning methods from the literature. However, a disadvantage of our approach compared to standard RL methods is that the execution time of our method scales worse with the size of the state space, as a forward pass is calculated for each node during the latent space search, potentially resulting in a combinatorial explosion.

Our approach is adaptable to changes in the tasks. For example, our search would only be slightly hindered if some obstacles were placed in the environment or if some states were forbidden to traverse. Furthermore, our method is independent of specific rewards or discount rates, while standard RL methods are usually restricted in their optimization problems. Often, there is a choice between optimizing discounted or undiscounted expected returns. Simulation/rollout-based planning methods are not restricted in that sense: if reward trajectories can be predicted, one can optimize arbitrary functions of these and regularize behavior. For example, a risk-averse portfolio manager can prioritize smooth reward trajectories over volatile ones.

Future lines of work should further investigate the effect of the different constraints on the end-to-end learning of representations suited for a predictive forward model, as well as consider novel ones. The search algorithm can be improved and made faster, especially for higher-dimensional or continuous action spaces. Our network could also in principle be used to train an RL system, for instance, by encouraging it to produce outputs similar to ours, thereby combining data-efficiency with fast performance at inference time.

Chapter 5 Reward prediction for representation learning and reward shaping

The previous chapter introduced the LARP network and showed its usefulness in viewpoint-matching experiments, in comparison to reinforcement learning (RL) methods, particularly the model-based ones. The network learns state features that are tailor-made for a prediction module, which is then used for graph-based search. The learning of the state representation and predictor is done in a self-supervised manner, independently of a reward signal. The method is data-efficient and is able to speed up learning for new tasks after pre-training on a similar one. However, the method requires a forward pass of the network for every node that is considered during the goal search, which scales poorly with the size of the state space compared to methods that only need to process the current state. Depending on the branching factor of the environment and the details of the latent representation graph search, this can result in a combinatorial explosion.

In this chapter, which is adapted from (hlynsson2021reward), we approach the problem of learning state representations for RL from another angle: instead of predicting the results of actions, we learn a representation that is predictive of the local distance from the goal in single-goal environments. The representation is learned alongside a reward predictor that learns to estimate either a raw or a smoothed version of the true reward signal. We augment the training of out-of-the-box RL agents by shaping the reward using our reward predictor during policy learning. Using our representation for preprocessing high-dimensional observations, as well as using the predictor for reward shaping, is shown to significantly enhance Actor Critic using Kronecker-factored Trust Region and Proximal Policy Optimization in single-goal environments with visual inputs.

The remainder of the chapter has the following structure: We start with a chapter introduction in Section 5.1, followed by an explanation of the required background knowledge in Section 5.2. We go into related work in Section 5.3, after which the details of our approach are outlined in Section 5.4. The methodology of our experiments is explained in Section 5.5. We display and discuss the results of our experiments in Section 5.6. Finally, we close the chapter with concluding remarks in Section 5.7.

5.1 Introduction

Even though the dominance of humans is being tested by RL agents on numerous fronts, there are still great difficulties for the field to overcome. For instance, the amount of data that algorithms require to reach human performance is far larger than what humans need. Furthermore, the general intelligence of humans remains unchallenged. Even when an RL agent has reached superhuman performance in one field, its performance is usually poor when it is tested in new areas. The study of methods to overcome the problems of data-efficiency and transferability of RL agents in environments where the agent must reach a single goal is the focal point of this work.

We consider a simple way of learning a state representation by predicting either a raw or a smoothed version of a sparse reward yielded by an environment. The two objectives, learning a state representation and predicting the reward, are directly connected, as we train a deep neural network for the prediction and the hidden layers of this network learn a reward-predictive state representation. The reward signal is created by collecting data from a relatively small number of initial episodes using a controller that acts randomly. The representation is then extracted from an intermediate layer of the prediction model and re-used as general preprocessing for RL agents, to reduce the dimensionality of visual inputs. The agent processes inputs corresponding to its current state as well as the desired end state, which is analogous to mentally visualizing a goal before attempting to reach it.

This general approach of relying on state representations that are learned to predict the reward, rather than to maximize it, has been motivated in the literature (lehnert2020reward), and we show that our representation is well-suited for single-goal environments. Our work adds to the recently growing body of knowledge related to deep unsupervised (hlynsson2019learning) or self-supervised (schuler2018gradient) representation learning.

We also investigate the effectiveness of augmenting the reward for RL agents, when the reward is sparse, with a novel problem-agnostic reward shaping technique. The reward predictor, which is used to train our representation, is not only used as a part of an auxiliary loss function to learn a representation, but it is also used during the training of the RL system to encourage the agent to move closer to a goal location. Similar to advantage functions in the RL literature (schulman2015high), given the trained reward predictor, the agent receives an additional reward signal if it moves from states with a low predicted reward to states with a higher predicted reward. We find this reward augmentation to be beneficial for our test environment with the largest state-space.

5.2 Background

5.2.1 Reward shaping

Sparse rewards are a common problem for reinforcement learning agents. The agent’s objective is to learn how to associate its inputs with actions that lead to high rewards, which can be a lengthy process if the agent only rarely experiences positive or negative rewards. Reward shaping (mataric1994reward; ng1999policy; brys2015policy) is a popular method of modifying the reward function of an MDP to speed up learning. It is useful for augmenting the training of the agent in environments with sparse rewards, but skillful applications of reward shaping can in principle aid the optimization for any environment, although the efficacy of the reward shaping is highly dependent on the details of the implementation (clark2016faulty). In the last few years, reward shaping has been shown to be useful for complex video game environments, such as real-time strategy games (efthymiadis2013using) and platformers (brys2014multi), and it has also been combined with deep neural networks to improve agents in first-person shooter games (lample2017playing).

As an illustration, consider learning a policy for car racing. If the goal is to train an agent to drive optimally, then supplying it with a positive reward for reaching the finish line first is in theory sufficient. However, if it is punished for actions that are never beneficial, for instance crashing into walls, it prioritizes learning to avoid such situations, allowing it to explore more promising parts of the state space. Furthermore, just reaching the goal is insufficient if there is competition. To make sure that we have a winning racer, a small negative reward can be introduced at every time step to urge the agent to reach the finish line quickly. Note that the details of the reward shaping in this example require domain knowledge from a designer who is familiar with the environment. It would be more generally useful if the reward shaping were learned autonomously, just like the policy of the agent, as we propose to do in this work.

5.2.2 Reward-predictive vs. reward-maximizing representations

lehnert2020reward make the distinction between reward-maximizing representations and reward-predictive representations. They argue that reward-maximizing representations can transfer poorly to new environments, while reward-predictive representations generalize successfully. Take the simple grid world navigation environments in Fig. 5.1, for example. The agent starts at a random tile in the grid and gets a reward of +1 by reaching the rightmost column in Environment A or by reaching the middle column in Environment B. The state space in Environment A can be compressed from the $3 \times 3$ grid to a vector of length 3, $[ϕ_{1}^{p}, ϕ_{2}^{p}, ϕ_{3}^{p}]$, of reward-predictive representations. To predict the discounted reward, it suffices to describe the agent’s state with $ϕ_{j}^{p}$ if it is in the $j$-th column.

Fig 5.1: Reward-maximizing vs. reward-predictive representations. In this grid world example, the agent starts the episode at a random location and can move up, down, left, or right. The episode terminates with a reward of 1 when the agent reaches the rightmost column. Both the reward-predictive representation $ϕ^{p}$ and the reward-maximizing representation $ϕ^{m}$ are useful for learning the optimal policy in Environment A. The reward-predictive representation $ϕ^{p}$ collapses each column into a single state to predict the discounted future reward. The reward-maximizing representation $ϕ^{m}$ makes no such distinction, as moving right is the optimal action in any state. It is a different story if the representations are transferred to Environment B, where reaching the middle column is now the goal. The representation $ϕ^{p}$ can be reused, and the optimal policy is found if the agent now takes a step left in $ϕ_{3}^{p}$. However, the representation $ϕ^{m}$ is unable to discriminate between the different states and is useless for determining the optimal policy.

The reward-maximizing representation for Environment A is much simpler: the whole state space can be collapsed to a single element $ϕ^{m}$, with the optimal policy of always moving to the right. If these representations are kept, then the reward-predictive representation $ϕ^{p}$ is informative enough for an RL agent to learn how to solve Environment B. The reward-maximizing representation $ϕ^{m}$ has discarded too many details of the environment to be useful for solving this new environment.
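The grid world argument can be checked mechanically. In the sketch below, a representation is considered sufficient for an environment if all states that it maps to the same abstract state share the same optimal horizontal action; the function names and this sufficiency test are illustrative assumptions, not part of the cited work.

```python
GRID = [(row, col) for row in range(3) for col in range(3)]


def phi_predictive(state):
    """Reward-predictive abstraction: collapse each column into one abstract state."""
    _, col = state
    return col


def phi_maximizing(state):
    """Reward-maximizing abstraction for Environment A: a single abstract state."""
    return 0


def optimal_action(state, goal_col):
    """Move horizontally toward the goal column (non-goal states only)."""
    _, col = state
    return "right" if col < goal_col else "left"


def representation_suffices(phi, goal_col):
    """True if states sharing an abstract state also share an optimal action."""
    action_of = {}
    for state in GRID:
        if state[1] == goal_col:
            continue  # goal states are terminal, no action is needed
        action = optimal_action(state, goal_col)
        if action_of.setdefault(phi(state), action) != action:
            return False
    return True


GOAL_A, GOAL_B = 2, 1  # rightmost column in Environment A, middle column in Environment B
results = {
    ("predictive", "A"): representation_suffices(phi_predictive, GOAL_A),
    ("maximizing", "A"): representation_suffices(phi_maximizing, GOAL_A),
    ("predictive", "B"): representation_suffices(phi_predictive, GOAL_B),
    ("maximizing", "B"): representation_suffices(phi_maximizing, GOAL_B),
}
print(results)
```

Both abstractions pass the check in Environment A, but only the column-wise one still separates "move right" from "move left" states once the goal moves to the middle column.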

5.2.3 Successor features

The successor representation algorithm learns two functions: the expected reward $R_{π}^{SF}$ received after transitioning into a state $s$, as well as the matrix $M_{π}^{SF}$ of discounted expected future occupancy of each state, assuming that the agent starts in a given state and follows a particular policy $π$. Knowing the quantities $R_{π}^{SF}$ and $M_{π}^{SF}$ allows us to rewrite the value function:

$$V_{π}(s) = \mathbb{E}_{s'}\left[R_{π}^{SF}(s')\, M_{π}^{SF}(s, s')\right] \tag{5.1}$$

The motivation for this algorithm is that it combines the speed of model-free methods, by enabling fast computations of the value function, with the flexibility of model-based methods for environments with changing reward contingencies. This method is designed for small, discrete environments, but it has been generalized for continuous environments with so-called feature-based successor representations, or successor features (SFs) (barreto2016successor). The SF algorithm similarly calculates the discounted expected representation of future states, given that the agent takes the action $a$ in the state $s$ and follows a policy $π$:

$$ψ_{π}(s, a) = \mathbb{E}_{π}\left[\sum_{t=0}^{\infty} γ^{t-1} ϕ_{t+1} \,\middle|\, s_{t} = s, a_{t} = a\right] \tag{5.2}$$

where $ϕ$ is some state representation. Both the SF $ψ$ and the representation $ϕ$ can be deep neural networks.
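For intuition, the tabular successor representation can be computed in a few lines for a tiny chain of states. The sketch uses one common convention, in which the occupancy matrix counts the current state and the reward is attached to the occupied state, so indexing may differ slightly from the formulation above; the chain, rewards, and function names are made up for illustration.

```python
def successor_matrix(P, gamma, n_iter=500):
    """Fixed-point iteration M <- I + gamma * P @ M for a policy-induced transition matrix P."""
    n = len(P)
    M = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(n_iter):
        M = [
            [
                (1.0 if i == j else 0.0)
                + gamma * sum(P[i][k] * M[k][j] for k in range(n))
                for j in range(n)
            ]
            for i in range(n)
        ]
    return M


def values(M, R):
    """Value of each state as the occupancy-weighted sum of state rewards."""
    n = len(R)
    return [sum(M[i][j] * R[j] for j in range(n)) for i in range(n)]


# Toy three-state chain 0 -> 1 -> 2, where state 2 is absorbing and rewarding.
P = [[0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0],
     [0.0, 0.0, 1.0]]
R = [0.0, 0.0, 1.0]
gamma = 0.9

M = successor_matrix(P, gamma)
V = values(M, R)
print(V)
```

If the reward vector changes, only the final weighted sum has to be recomputed while the occupancy matrix is reused, which is the flexibility referred to above.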

5.3 Related work

5.3.1 Reward-predictive representations

lehnert2020reward compare successor features (SFs) to a nonparametric Bayesian predictor that is trained to learn transition and reward tables for the environment, either with a reward-maximizing or a reward-predictive loss function. lehnert2020successor prove under what conditions SFs are either reward-predictive or reward-maximizing (see the distinction in Section 5.2.2). They also show that SFs work successfully for transfer learning between environments with changing reward functions and unchanged transition functions, but that they generalize poorly between environments where the transition function changes. Our work is distinct from the reward-predictive methods that they compare, as our representation does not need to calculate the expected future state occupancy, as is the case for SFs. Our method scales better to more complicated state-spaces because we do not tabulate the states, as they do with their Bayesian model, but learn arbitrary continuous features of high-dimensional input data. In addition, learning our reward predictor is not only a "surrogate" objective function, as we use it for reward shaping as well.

5.3.2 Reward shaping

The advantages of reward shaping are well understood in the literature (mataric1994reward). A recent trend in RL research is the study of methods that can learn the reward shaping function automatically, without the need for (often faulty) human intervention. marashi2012automatic assume that the environment can be expressed as a graph and that this graph formulation is known. Under these strong assumptions, they perform graph analysis to extract a reward shaping function. More recently, zou2019reward have proposed a meta-learning algorithm for potential-based automatic reward shaping. Our approach is different from previous work as we assume no knowledge about the environment and train a simple predictor to approximate (potentially smoothed) rewards, which is then used to construct a potential-based reward shaping function.

5.3.3 Goal-conditioned reinforcement learning

kaelbling1993learning studied environments with multiple goals and small state-spaces. In their problem setting, the agent must reach a known but dynamically changing goal in the fewest number of moves. The observation space is of a low enough dimension for dynamic programming to be satisfactory in their case. schaul2015universal introduce the Universal Value Function Approximators and tackle environments of larger dimensions by learning a value function neural network approximator that accepts both the current state and a goal state as inputs. In a similar vein, pathak2018zero learn a policy that is given a current state and a goal state and outputs an action that bridges the gap between them. hlynsson2020latent learn a predictable representation that is paired with a representation predictor and combine it with graph search to find a given goal location. In contrast to these approaches, we learn a reward-predictive representation in a self-supervised manner, which is used to preprocess raw inputs for RL policies.

5.4 Approach

In this section, we explain our approach mathematically. Intuitively, we train a deep neural network to predict either a raw or a smoothed reward signal from a single-goal environment. The output of an intermediate layer in the network is then extracted as the representation, for example, by simply removing the top layers of the network. The full reward predictor network is used for reward shaping by rewarding the agent for moving from lower predicted values toward higher predicted values of the network.

5.4.1 Learning the representation

Suppose that $f_{θ} : \mathbb{R}^{c} \to [0, 1]$ is a differentiable function parameterized by $θ$ and $c$ is a positive integer. We use $f_{θ}$ to approximate the discounted return in a POMDP with a sparse reward: the agent receives a reward of 0 for each time step except when it reaches a goal location, at which point it receives a positive reward and the episode terminates. Given an experience buffer $D = \{(s_{t}, a_{t}, r_{t}, s_{t+1})_{i}\}$, we create a new dataset $D^{*} = \{(s_{t}, a_{t}, r_{t}^{*}, s_{t+1})_{i}\}$. The new rewards are calculated according to the equation

$$r_{t}^{*} = γ^{m} r_{t+m} \tag{5.3}$$

where $γ \in [0, 1]$ is a discount factor and $M > m > 0$ is the difference between $t$ and the time step index of the final transition in that episode, for some maximum time horizon $M$. Throughout our experiments, we keep the value of the discount factor equal to $0.99$ and we train on $D$ or $D^{*}$. Assume that our differentiable representation function $ϕ : \mathbb{R}^{d} \to \mathbb{R}^{c}$ is parameterized by $θ'$ and maps the $d$-dimensional raw observation of the POMDP to the $c$-dimensional feature vector. We train the representation for the discounted-reward prediction by minimizing the loss function

$$L\left(f_{θ}[ϕ_{θ'}(s_{t+1})], r_{t}^{*}\right) = \left(r_{t}^{*} - f_{θ}[ϕ_{θ'}(s_{t+1})]\right)^{2} \tag{5.4}$$

with respect to the parameters $θ$ of $f$ and the parameters $θ'$ of $ϕ$ over the whole dataset $D^{*}$. See Fig. 5.2 for a conceptual overview of our representation learning.
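The relabeling in Eq. 5.3 and the squared error in Eq. 5.4 can be sketched for a single episode as follows. The handling of the horizon is an assumption: transitions that are $M$ or more steps away from the final transition simply keep a reward of 0 here, and the final transition itself ($m = 0$) keeps its raw reward. Function and parameter names are hypothetical.

```python
def smooth_rewards(rewards, gamma=0.99, max_horizon=None):
    """Relabel one episode so that r*_t = gamma**m * r_{t+m}.

    m is the number of steps between transition t and the final transition of
    the episode. Assumption: if m >= max_horizon, no reward is propagated.
    """
    last = len(rewards) - 1
    relabeled = []
    for t in range(len(rewards)):
        m = last - t
        if max_horizon is not None and m >= max_horizon:
            relabeled.append(0.0)
        else:
            relabeled.append(gamma ** m * rewards[last])
    return relabeled


def squared_error(prediction, target):
    """Per-sample loss between a predicted and a relabeled reward."""
    return (target - prediction) ** 2


# A sparse-reward episode: the goal is reached on the fourth transition.
episode_rewards = [0.0, 0.0, 0.0, 1.0]
smoothed = smooth_rewards(episode_rewards, gamma=0.5)
truncated = smooth_rewards(episode_rewards, gamma=0.5, max_horizon=2)
print(smoothed, truncated)
```

A discount of 0.5 is used in the example only to make the decay easy to read; the experiments described above use 0.99.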

Fig 5.2: Learning and using the representation. Our representation and reward predictor are trained with the elements highlighted in blue. The trained representation is then used for dimensionality reduction for an RL agent that interacts with the environment, as indicated by the elements highlighted in red.

5.4.2 Reward shaping

ng1999policy define a reward shaping function $F$ as potential-based if there exists a function $f : S \to \mathbb{R}$ such that for all states $s, s' \in S$ the following equation holds:

$$F(s, a, s') = \gamma f(s') - f(s) \tag{5.5}$$

where $\gamma$ is the MDP’s discount factor. They prove for single-goal environments that every optimal policy for the MDP $M = (S, A, P, R, P(s_{0}), \gamma)$ is also optimal for its reward-shaped counterpart $M' = (S, A, P, R + F, P(s_{0}), \gamma)$, and vice versa. They also show, for a given state space $S$ and action space $A$, that if $F$ is not potential-based, then there exist a transition function $P$ and a reward function $R$ such that no optimal policy in $M'$ is optimal in $M$. Deciding that the reward shaping function should be potential-based is just the first step in its design. Now assume that we have an environment where an agent is tasked with reaching a goal state $g$. That is, for a given distance function $d : S \times S \to \mathbb{R}^{+}$ the agent receives a reward of 1 if it is close enough to the goal location, $d(s, g) \leq \delta$, for some reward threshold $\delta \in \mathbb{R}^{+}$. Otherwise, it receives a reward of 0. The distance $d$ between the agent’s new location $s'$ and the goal location $g$ can be a useful value to calculate in the design of a reward shaping function

$$F(s, a, s') = \begin{cases} 1 & \text{if } d(s', g) \leq \delta \\ -d(s', g) & \text{otherwise} \end{cases} \tag{5.6}$$

However, this depends on the environment, as the agent could get stuck in local optima before coming close to the goal, i.e. if it would have to move through a region with a large $d(s, g)$ before it can globally minimize it. This could for example be the case in a maze environment if $d$ is the Euclidean distance between the $(x, y)$ coordinates of the agent’s location and the goal location and there is a wall between the agent and the goal. trott2019keeping propose to solve this by incorporating the potential local optima in the reward shaping function as so-called "anti-goals" $\bar{g}$ to be avoided

$$F(s, a, s') = \begin{cases} 1 & \text{if } d(s', g) \leq \delta \\ \min\left[0, -d(s', g) + d(s, \bar{g})\right] & \text{otherwise} \end{cases} \tag{5.7}$$
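As a small illustration of Equations 5.6 and 5.7, the two distance-based shaping functions can be written out directly. The sketch assumes that states are $(x, y)$ coordinates and that $d$ is the Euclidean distance; the function names and the default threshold are hypothetical and chosen only for this example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two (x, y) positions."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def distance_shaping(next_state, goal, delta=0.5):
    """Eq. 5.6: reward 1 near the goal, negative distance otherwise."""
    d = euclidean(next_state, goal)
    return 1.0 if d <= delta else -d


def anti_goal_shaping(state, next_state, goal, anti_goal, delta=0.5):
    """Eq. 5.7: as Eq. 5.6, but offset by the distance to an anti-goal."""
    if euclidean(next_state, goal) <= delta:
        return 1.0
    return min(0.0, -euclidean(next_state, goal) + euclidean(state, anti_goal))


# The agent moves from (0, 0) to (3, 4); the goal is at the origin and a
# hand-picked anti-goal sits at (0, 3).
plain = distance_shaping((3, 4), (0, 0))
with_anti_goal = anti_goal_shaping((0, 0), (3, 4), (0, 0), (0, 3))
```

In this toy case the plain penalty is $-5$, while the anti-goal term reduces it to $-2$ because the agent started three units away from the anti-goal.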

These states can be hand-picked by domain experts. However, adding anti-goals like this could iteratively introduce even more local optima and a solution to the original problem is not guaranteed. It is generally not true that the distance function $d$ and all the variables needed to calculate it, such as the coordinates of the agent and the goal in a maze, are available to the agent. Even if $d$ were computable, using it naively can bring about its own problems, as was alluded to above. We argue that instead of using $d$ in Equation 5.6, it would be better to measure the distance between the agent and the goal in terms of how many actions the agent has to take until the goal is reached. This function is not assumed to be given, but it can be estimated as the agent is being trained on the environment, for instance by optimizing Equation 5.4. Additionally, we would like our reward shaping function to be potential-based (Equation 5.5) to reap the theoretical advantages. Thus, we propose a potential-based reward shaping function based on the discounted reward predictor

$$\begin{aligned} F(s, a, s') &= \left(\gamma f_{\theta}(\phi_{\theta'}[s']) - f_{\theta}(\phi_{\theta'}[s])\right)(H - I)/H \\ &= \gamma \left(f_{\theta}(\phi_{\theta'}[s'])(H - I)/H\right) - f_{\theta}(\phi_{\theta'}[s])(H - I)/H \\ &= \gamma f^{*}(s') - f^{*}(s) \end{aligned} \tag{5.8}$$

where $f^{*}(s) = f_{\theta}(\phi_{\theta'}[s])(H - I)/H$, $f_{\theta}$ is the reward predictor and $\phi_{\theta'}$ is our representation from the previous section. Note that both $f_{\theta}$ and $\phi_{\theta'}$ are assumed to be fully trained before the policy of the agent is trained, for example using data gathered by a random policy, but they can in principle also be updated as the policy is being learned. The factor $(H - I)/H$ scales down the intensity of the reward shaping, where $I \in \mathbb{N}^{+}$ is the number of episodes that the agent has experienced and $H \in \mathbb{N}^{+}$ is the maximum number of episodes where the agent is trained using reward shaping. The strength of the reward shaping is the highest in the beginning and decays over time, which counteracts potentially adverse effects of errors in the reward predictor. It is also more important to incentivize moving toward the general direction of the goal in the early stages of learning, after which the un-augmented reward signal of the environment is allowed to "speak for itself" and guide the learning of the agent toward the goal precisely.
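A minimal sketch of the shaping term in Equation 5.8 follows. The argument `potential` stands in for the trained predictor $f_{\theta}(\phi_{\theta'}[\cdot])$; the toy potential on a one-dimensional corridor, the function names, and the clipping of the scale at zero once $I$ exceeds $H$ are assumptions made for illustration only.

```python
def shaping_scale(episode_index, max_shaping_episodes):
    """The factor (H - I) / H, clipped at zero (clipping is an assumption)."""
    return max(0.0, (max_shaping_episodes - episode_index) / max_shaping_episodes)


def shaping_bonus(potential, state, next_state, gamma,
                  episode_index, max_shaping_episodes):
    """Eq. 5.8: scaled potential difference gamma * f(s') - f(s)."""
    scale = shaping_scale(episode_index, max_shaping_episodes)
    return scale * (gamma * potential(next_state) - potential(state))


# Toy stand-in for the reward predictor: larger values closer to tile 5.
def toy_potential(tile):
    return 1.0 / (1.0 + abs(5 - tile))


toward = shaping_bonus(toy_potential, 2, 3, gamma=0.99,
                       episode_index=0, max_shaping_episodes=10)
away = shaping_bonus(toy_potential, 3, 2, gamma=0.99,
                     episode_index=0, max_shaping_episodes=10)
late = shaping_bonus(toy_potential, 2, 3, gamma=0.99,
                     episode_index=10, max_shaping_episodes=10)
```

With this toy potential, stepping toward the goal yields a positive bonus, stepping away a negative one, and the bonus vanishes once $I = H$. The bonus would be added to the environment reward before it is passed to the RL algorithm.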

5.5 Methodology and implementation

5.5.1 Environment

The method is tested on three different gridworld environments based on the Minimalistic Gridworld Environment (MiniGrid) (gym_minigrid). Tiles can be empty, occupied by a wall or occupied by lava. The structure of the environments fits naturally into our POMDP tuple template (Eq. 2.1):

The constituent states of $S$ are determined by the agent’s location and direction (facing north, west, south or east). See Fig. 5.2(a) for three different world states in one of our environments.
The action space $A$ consists of three actions: (1) turn left, (2) turn right and (3) move forward.
The transition function is deterministic. The agent relocates to the tile it faces if it moves forward and the tile is empty, and nothing happens if the tile is occupied by a wall. The episode terminates if the tile is occupied by lava or the goal. The agent rotates in place if it turns left or right.
Reaching the green goal tile gives a reward of $1 - 0.9 \cdot \frac{\#\,\text{steps taken}}{\#\,\text{max steps}}$; every other action gives 0 points. The environment automatically times out after $\#\,\text{max steps} = 100$ steps.
The initial state distribution differs between the three environments (see below).
The observation space consists of all $7 \times 7$ subsets of tiles, represented by $28 \times 28 \times 3$ arrays, from the point of view of an agent who cannot see through walls, see Fig. 5.2(b).
The observation function returns the point of view of the agent from its current viewpoint (Fig. 5.2(b)) and a goal observation (Fig. 5.2(c)).
The discount factor is $0.99$.
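The goal reward from the list above can be written as a one-line function, which makes the effect of the step count explicit. The function name `goal_reward` is hypothetical; only the formula and the timeout of 100 steps come from the text.

```python
def goal_reward(steps_taken, max_steps=100):
    """Reward for reaching the goal: 1 - 0.9 * (steps taken / max steps)."""
    return 1.0 - 0.9 * steps_taken / max_steps


fast = goal_reward(10)    # goal reached after 10 steps
slow = goal_reward(100)   # goal reached on the very last allowed step
```

A quick agent therefore collects a reward close to 1, while an agent that reaches the goal just before the timeout still receives roughly 0.1.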

We consider the following three environments:

Two-room environment

The world is an $8 \times 17$ grid of tiles, split into two rooms, where walls are placed at different locations to facilitate discrimination between the rooms from the agent’s point of view (Fig. 5.3). The agent is placed between the two rooms, facing a random direction. The goal is at one of three possible locations. This is a modified version of the classical four-room environment layout (sutton1999between).

Lava gap environment

In this environment, the agent is in a $4 \times 4$ room with a column of lava either one or two spaces in front of the agent (Fig. 5.4) with a gap in a random row. The agent always starts in the upper left corner and the goal is always in the lower right corner.

Four-room environment

An expansion of the two-room environment with two additional rooms (Fig. 5.5). In this setup, both the agent and the goal location are placed at random locations within the $17 \times 17$ gridworld.

5.5.2 Baselines

We combine our representations with two RL algorithms as implemented in Stable Baselines (stable-baselines) using the default hyperparameters:

(ACKTR) Actor Critic using Kronecker-Factored Trust Region (wu2017scalable), which combines actor-critic methods, trust-region optimization, and distributed Kronecker factorization to enhance data efficiency.
(PPO2) A version of the Proximal Policy Optimization algorithm (schulman2017proximal). It modifies the original algorithm by using clipped value functions and a normalized advantage.

For both algorithms, six variations are compared:

(Deep RL) The RL algorithm learns the representation from scratch on raw images
(SF) The input is preprocessed using successor features
(Ours 1r) The input is preprocessed using our representation, trained on raw reward predictions
(Ours 1r+Shaping) The input is preprocessed using our representation and the reward is shaped, trained on raw reward predictions
(Ours 64r) The input is preprocessed using our representation, trained on smoothed reward predictions
(Ours 64r+Shaping) The input is preprocessed using our representation and the reward is shaped, trained on smoothed reward predictions

Care has been taken to ensure that each variation has the same architecture and the same number of parameters.

5.5.3 Model architectures

Every model is realized as a neural network using Keras (chollet2015keras). Below, the representation and policy networks are used for our method and the SF comparison, the reward prediction network is used only for our method and the deep RL network is used only for the deep RL comparison, where the RL algorithm also learns the representation. The representation networks are two convolutional networks (Table 5.1) with a $28 \times 28 \times 3$ input, taking either the agent’s current observation or the goal observation.

Layer	Filters	Filter size	Stride	Padding	Output shape
Input tensor	-	-	-	-	$28 \times 28 \times 3$
Convolution	8	$3 \times 3$	3	None	$9 \times 9 \times 8$
ReLU	-	-	-	-	$9 \times 9 \times 8$
2D max pooling	8	$2 \times 2$	2	None	$4 \times 4 \times 8$
Convolution	16	$3 \times 3$	2	None	$1 \times 1 \times 16$
ReLU	-	-	-	-	$1 \times 1 \times 16$
Flatten	-	-	-	-	16
Dense	-	-	-	-	16

Table 5.1: Representation network.
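The spatial sizes in Table 5.1 follow from the standard output-size formula for convolution and pooling without zero padding, $\lfloor (n - k) / s \rfloor + 1$ for input size $n$, window size $k$ and stride $s$. The short check below applies it to the three windowed layers; the helper name `valid_output_size` is hypothetical.

```python
def valid_output_size(input_size, window, stride):
    """Output size of a convolution or pooling layer without padding."""
    return (input_size - window) // stride + 1


size = 28
after_conv1 = valid_output_size(size, window=3, stride=3)         # 3x3 conv, stride 3
after_pool = valid_output_size(after_conv1, window=2, stride=2)   # 2x2 max pooling
after_conv2 = valid_output_size(after_pool, window=3, stride=2)   # 3x3 conv, stride 2
```

Starting from a $28 \times 28$ input, the sizes are 9, 4 and 1, so the final feature map of $1 \times 1 \times 16$ flattens to the 16 values that are passed to the dense layer.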

The first layer subsamples the input, keeping only every other column and row. This is followed by 8 filters of size $3 \times 3$ with a stride of 3, a ReLU activation and a $2 \times 2$ max pooling layer with a stride of 2. The pooling layer’s output is passed to a layer with 16 convolutional filters of size $3 \times 3$ and a stride of 2 and a ReLU activation function. The output is then flattened and passed to a dense layer with 16 units and a linear activation, defining the dimension of the representation. No zero padding is applied in the convolutional layers or the pooling layer. The policy networks are three-layer fully-connected networks (Table 5.2) accepting the concatenated output of the representation network for the agent’s current point of view and the goal observation as an input. The first two layers have 64 units and a ReLU activation, and the last layer has 3 units and a linear activation function. The three units represent the three actions left, right, and forward in a one-hot encoding. Winner takes all is used to decide on the action.

Layer	Units	Output shape
Input tensor	-	32
Dense	64	64
ReLU	-	64
Dense	64	64
ReLU	-	64
Dense	3	3

Table 5.2: Policy network.

Our reward prediction network is a three-layer fully-connected network (Table 5.3) with the same input as the policy network: the concatenated representation of the agent’s current view and the goal observation. The first two layers have 256 units and a ReLU activation, but the last layer has 1 unit and a logistic activation function.

Layer	Units	Output shape
Input tensor	-	32
Dense	256	256
ReLU	-	256
Dense	256	256
ReLU	-	256
Dense	1	1
Logistic	-	1

Table 5.3: Reward prediction network.

The deep RL network stacks the representation network and the policy network on top of each other. The representation network accepts the input and outputs the low-dimensional representation to the policy network that outputs the action scores.
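To make the size of the described networks concrete, the trainable parameters can be counted from the layer descriptions above. This is a back-of-the-envelope sketch that assumes every convolutional and dense layer has a bias term; the helper names are hypothetical, and the counts are derived here rather than reported in the text.

```python
def conv_params(kernel, in_channels, filters):
    """Weights plus biases of a square-kernel convolutional layer."""
    return kernel * kernel * in_channels * filters + filters


def dense_params(inputs, units):
    """Weights plus biases of a fully-connected layer."""
    return inputs * units + units


# One representation network: conv(8), conv(16), dense(16).
representation = (conv_params(3, 3, 8)
                  + conv_params(3, 8, 16)
                  + dense_params(16, 16))

# Policy network on the 32-dimensional concatenated representation.
policy = dense_params(32, 64) + dense_params(64, 64) + dense_params(64, 3)

# Reward prediction network on the same 32-dimensional input.
reward_predictor = (dense_params(32, 256)
                    + dense_params(256, 256)
                    + dense_params(256, 1))
```

Under these assumptions a single representation network has 1664 parameters, the policy network 6467 and the reward predictor 74497, so the reward predictor is by far the largest of the three.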

5.5.4 Training the representation and predictor networks

We collect a dataset of $10$ thousand transitions by following a random policy in the two-room environment. For this data collection, the goal of each episode is placed with equal probability ($50\%$ each) either in the bottom room or on the left side of the top room (see the left and middle pictures in Fig. 5.2(a)). The reward predictor and the representation are trained in this manner for all experiments, including the lava gap and the four-room environment. Thus, we use a representation and reward predictor that have never seen lava. For the experiments with smoothed rewards, the sparse reward associated with the observations in the dataset is augmented by associating a new reward to the $64$ states leading to observations with a positive reward, according to Equation 5.3, with a discount factor of $0.99$. Additionally, after the reward has been (potentially) smoothed in this way, observations associated with a positive reward are oversampled $10$ times to balance the dataset, regardless of whether the reward has been augmented or not.
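The oversampling step can be sketched as follows. The sketch assumes that "oversampled 10 times" means that each sample with a positive reward appears ten times in the balanced dataset; the function name and the `(observation, reward)` tuple layout are hypothetical.

```python
def oversample_positive(samples, factor=10):
    """Repeat every sample with a positive reward `factor` times in total."""
    balanced = []
    for observation, reward in samples:
        copies = factor if reward > 0 else 1
        balanced.extend([(observation, reward)] * copies)
    return balanced


# Two uneventful transitions and one that reaches the goal.
dataset = [("obs_a", 0.0), ("obs_b", 0.0), ("obs_c", 1.0)]
balanced = oversample_positive(dataset, factor=10)
```

In this toy dataset the single rewarded observation ends up accounting for ten of the twelve training samples.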

5.6 Results and discussion

In the experiments, we compare RL agents that learn their representations from scratch (Deep RL) to agents that preprocess their inputs with different representations. We compare our representation trained on raw reward predictions, without (Ours 1r) or with reward shaping (Ours 1r+Shaping), to our method trained on smoothed reward predictions, also without (Ours 64r) or with reward shaping (Ours 64r+Shaping). We use "64r" to denote that our method was trained on rewards smoothed over the $64$ states leading up to a positive reward and "1r" to denote that our method was trained without this augmentation. As a baseline, we compare our representation to a reward-predictive representation from the literature, Successor Features (SFs).

5.6.1 Two-room environment

We start by visualizing the outputs of our reward predictor in the rooms, depending on the goal location, in Fig. 5.6. Each square indicates the average predicted reward for transitioning to the corresponding tile in the room. The predicted reward spikes in a narrow region around the two goal locations that were used to train the raw reward predictor (Fig. 5.5(a)), but the area of states with high predicted rewards is wider around the test goal. This difference is due to overfitting on the specific training paths that were more frequently taken toward the respective goals, but this does not harm the generalization capabilities of the network. The peakiness of the predictions disappears when the predictor is trained on the smoothed rewards (Fig. 5.5(b)). However, higher predicted rewards in the corner of the other room appear. Both scenarios, raw and smoothed reward prediction, show promise for the application of reward shaping under our training scheme, as the agent would benefit from finding neighborhoods with higher values of predicted reward until it reaches the goal, instead of having to rely solely on a sparse reward that is only given when the agent lands exactly on the goal state.

In Figure 5.7, we illustrate the variance of the mean reward (left side) and the variance of the optimal performance (right side) of the different methods, as a function of the time steps taken for training. We average over 10 runs and in each run we perform 10 test rollouts, so each point is the aggregate of 100 episodes in total. (Note that the standard deviation of the mean reward of all episodes, from all runs put together, is approximately $26\%$ higher than the standard deviation of the mean of the means or the mean of the mins.) The error bands indicate two standard deviations. This methodology of generating the plots also applies to Fig. 5.9, Fig. 5.10 and Fig. 5.12. The learning curves of both ACKTR and PPO2 get close to the highest achievable mean reward of $1$ the fastest using our representations. There is no significant benefit from using smoothed reward shaping for ACKTR, and the raw reward shaping is in fact harmful in this case. For PPO2, the agent using our representation that is trained on raw reward predictions learns the fastest. Regular deep RL, where the representations are learned from scratch, is clearly outperformed by the variants that use reward-predictive representations. We believe that this is because RL agents can generally benefit from the input being preprocessed, as the computational overhead for learning the policy is reduced. This effect is enhanced when the preprocessing is good, which is the case for our reward-predictive representation: it abstracts away unnecessary information as it is trained to output features that indicate the distance between the agent and the goal, when the goal is in view. The difference in aggregated mean rewards vs. aggregated minimum episode lengths can be explained by systematically different behaviors. For example, an agent might have a weak long-term strategy of checking the different rooms, giving it poor average mean rewards, but a strong short-term tactic of taking the direct course to the goal when it sees it, giving it a good average minimum episode length.
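The aggregation behind each point of the learning curves can be sketched with the standard library. Two assumptions are made here that the text does not settle: the error band is taken as the center plus and minus two sample standard deviations, and the spread is computed over the per-run summaries. The function name `aggregate` is hypothetical.

```python
from statistics import mean, stdev


def aggregate(per_run_values, reducer):
    """Summarize each run with `reducer`, then average across runs.

    Returns the center of the curve and a band of two standard
    deviations on either side (assumed interpretation of the band).
    """
    per_run = [reducer(values) for values in per_run_values]
    center = mean(per_run)
    spread = stdev(per_run)
    return center, center - 2 * spread, center + 2 * spread


# Toy data: two runs with two test rollouts each.
rewards = [[1, 3], [3, 5]]
episode_lengths = [[10, 20], [30, 40]]

mean_of_means = aggregate(rewards, mean)         # left-hand plots
mean_of_mins = aggregate(episode_lengths, min)   # right-hand plots
```

In the experiments described above, each call would receive 10 runs with 10 rollouts each instead of the toy values used here.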

Fig 5.7: Two-room environment. In these experiments, there are only two rooms and the agent must reach a goal that is always at the same location. The agent can traverse between the rooms and starts each episode between them, facing a random direction. The left side shows the mean of every agent’s mean reward and the right side shows the mean of every agent’s minimum episode length.

5.6.2 Lava gap environment

Learning from scratch

The heat maps of average predicted rewards are visualized in Fig. 5.8. The reward predictor was trained on the two-room environment. The tiles closest to the goal have the highest values, with a particularly smooth gradient toward the goal for the smoothed-reward predictor, which demonstrates that there is potential gain from transferring the prediction-based reward shaping between similar environments. The learning performance of the different methods can be seen in Fig. 5.9. By far the fastest learning can be observed when the actor-critic method is combined with our representation, trained on raw reward predictions and without reward shaping. Regular deep RL is the second-best, but with a very large variance in the performance. Our reward shaping variations and the SFs are very close in performance, albeit significantly worse than the other two. The poor performance of reward shaping can be explained by the fact that there are very few states, which makes the reward shaping unnecessary in such a simple environment. All the methods look more similar when PPO2 optimization is applied, with respect to the mean rewards, but our variant that is trained on smoothed reward prediction and uses reward shaping reaches the highest average performance in the last iterations.

Fig 5.8: Predicted rewards, lava gap. Average predicted reward per state in the lava gap environment.

Fig 5.9: Lava gap experiment. All policies are randomly initialized and learn to solve the lava gap environment from scratch. The representations in all methods except for Deep RL are learned on the training goals in the two-room environment (see Fig. 5.6). The left side shows the mean of every agent’s mean reward, and the right side shows the mean of every agent’s minimum episode length.

Transfer learning

To investigate how the methods compare for adapting to new environments, we trained the policies for 8000 steps on the two-room environment before learning to solve the lava gap environment, see Fig. 5.10. Our method, without reward shaping, facilitates the fastest learning for ACKTR in this case. Deep RL is the most severely affected by this change, which is probably due to the method learning a reward-maximizing representation in one environment that does not transfer well to another environment. Every PPO2 variation looks bad for this scenario, but the smooth-reward prediction representation with reward shaping has the highest mean reward and our raw-reward prediction representation has the lowest average minimum episode length.

Fig 5.10: Re-learning experiment. The different methods are trained for eight thousand training steps on the two-room environment before being trained on the lava gap environment. The curves show the mean reward on the lava gap environment. The left side shows the mean of every agent’s mean reward, and the right side shows the mean of every agent’s minimum episode length.

We visualize trajectories of an agent that is trained on our representation (Ours 1r) as it traverses the lava gap environment (Fig. 5.11). For inspection of cases where it fails, we choose an agent that has been trained for only $50$ thousand time steps and has around $0.75$ mean reward.

5.6.3 Four-room environment

In our final comparison, we add two additional rooms to the two-room environment and randomize both the goal location and the starting position of the agent, with the results shown in Fig. 5.12. Looking at the minimum episode lengths, for the ACKTR learner, our raw-reward prediction representation with reward shaping performs best and the one without reward shaping comes in second. There is little discernible difference between the performance of SFs and Deep RL, but they both perform significantly worse than our methods. The scale of the mean reward is a great deal lower than in the previous experiments, since the average distance between the starting tile of the agent and the goal is much larger than in the previous two environments. For this scenario, all the methods look similarly bad for the PPO2 policy, except for our raw-reward representation with reward shaping, which has the lowest minimum episode length. The big advantage of reward shaping in this environment compared to the two-room environment can be explained by the increased complexity, making the reward shaping more helpful in guiding the agent’s search. In the previous experiments, the agent and the goal start at fixed locations, allowing the agents to solve the task by rote memorization. The reward shaping function calculated by the raw-reward predictor fares significantly better in this situation. We hypothesize that this is due to the smoothed-reward predictor distracting the agent by pushing it to corners, as the visualization in Fig. 5.5(b) would suggest. The reward shaping given by the raw-reward predictor is more discriminative, as we see in Fig. 5.5(a). The agent receives a positive reward as soon as the goal reaches its point of view, which is any location up to six tiles in front of it and no further than 3 tiles away from it to the left or to the right. This allows the reward shaping function to guide the agent directly to the goal, assuming that they are in the same room and that there is no wall obstructing the agent’s field of vision.

Fig 5.12: Full four-room environment. The agent and goal are placed at random locations at the start of each episode. The left side shows the mean of every agent’s mean reward, and the right side shows the mean of every agent’s minimum episode length.

Three successful and three failed trajectories of an ACKTR agent that has been trained for a million time steps using our representation (Ours 1r) are visualized in Fig. 5.13. In both the successful and the failed trajectories, we can see the undesirable behavior that the agent wastes effort revisiting tiles it has already been to.

5.7 Conclusion

Processing high-dimensional inputs for reinforcement learning (RL) agents remains a difficult problem, especially if the agent must rely on a sparse reward signal to guide its representation learning. In this work, we put forward a method of learning representations that preprocess visual inputs for RL methods to help alleviate this problem. Our contributions are (i) a reward-predictive representation that is trained simultaneously with a reward predictor and (ii) a reward shaping technique using this trained predictor. The predictor learns to approximate either the raw reward signal or a smoothed version of it, and it is used for reward shaping by encouraging the agent to transition to states with higher predicted rewards. We used a view of the goal as a second input for the methods in our experiments, but this is in principle not necessary, as moving toward the green tile as it becomes visible is sufficient. Removing the goal input might encourage the agents to learn policies that scan all the rooms faster until the goal reaches its field of vision. We have shown the usefulness of our representation and our reward shaping scheme in a series of gridworld experiments, where the agent receives a high-dimensional observation of its goal as an input along with an observation of its immediate surroundings. Preprocessing the input using this representation speeds up the training of two out-of-the-box RL methods, Actor Critic using Kronecker-Factored Trust Region and Proximal Policy Optimization, compared to having these methods learn the representations from scratch. In our most complicated experiment, combining our representation with our reward shaping technique is shown to perform significantly better than the vanilla RL methods, which hints at its potential for success, especially in more complex RL scenarios.

Chapter 6 Comparison of our three methods

We compare in this chapter the algorithms that we developed over the course of the PhD work: the gradient-based ICA representation (GrICA) from Chapter 3, the latent representation prediction network (LARP) from Chapter 4 and the reward-prediction representation from Chapter 5 (RewPred). We use the representations calculated by these methods to preprocess inputs for deep reinforcement learning (Deep RL) agents on four different tasks: a cart-pole balancing environment, the two-room and four-room goal-finding environments from the previous chapter and an obstacle avoidance task. The implementation of our methods is the same as described in the experimental section of their corresponding chapter, unless specified otherwise. In Section 6.2 we introduce the new visual cart-pole environment and discuss the network architectures. In Section 6.3 we describe and discuss the results of the experiments.

6.1 Introduction

As each method was introduced over the course of the PhD work, it was only compared to other similar methods in the literature. We now take the opportunity to compare them against each other in different environments. The cart-pole environment and the obstacle avoidance environment have the commonality that they both require only reactive short-term planning to maximize the reward, but the goal-finding environments require more long-term planning to find and reach the goal. Every one of our representations is used for preprocessing the visual observations of each environment for RL algorithms. The RL agents are the same ones used in the previous chapter: ACKTR and PPO2, with the default parameters from the Stable Baselines package. As we were not able to solve the cart-pole environment using ACKTR, only PPO2 is used for that task. We compare the method from Chapter 4, LARP, in this manner, even though it is not trained for preprocessing inputs for model-free agents but is developed to be paired with a predictor for predictive state representation rollouts.

6.2 Methods

6.2.1 Visual cart-pole

We start by comparing our methods as visual preprocessing for a PPO2 policy on a variant of the classical cart-pole balancing task. A pole is attached to a cart that moves along a frictionless track. The goal is to keep the pole upright for as long as possible, but the agent must push the cart to the left or to the right at every time step; choosing to remain in place is not an option. The original environment supplies four scalar variables to the agent: the cart position and velocity and the pole angle and angular velocity. The agent receives a positive reward of $+1$ for each transition. Instead of using these scalars, we adapt this task to our paradigm by using a visualization of the environment, see Fig. 6.1. To begin with, we extract the rendering of the environment that is meant for debugging and visualization purposes. As we can approximate the cart position and velocity and the pole angle and angular velocity using two adjacent frames, we use the current and previous observations. To simplify the input for the agent, we take the difference of the two frames, crop the image around the cart and binarize it. The only information that is lost in this process, compared to the original environment, is the position of the cart, which is not so important for solving the task. In addition to giving the agent a positive reward of $+1$ for every time step it keeps the pole balanced, we also supply it with a negative reward of $-1$ when the pole falls down.

Fig 6.1: Cart-pole processing pipeline. We replace the original 4-dimensional state space of the OpenAI gym cart-pole environment with the output of its rendering function. The previous frame is subtracted from the current frame, the non-zero pixel values are centered, cropped and the final image is converted to binary.
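The processing pipeline of Fig. 6.1 can be sketched on small grayscale frames stored as nested lists. This is an illustration only: it crops horizontally around the center of the changed pixels, and the function name `preprocess`, the window half-width and the frame layout are all assumptions, not details taken from the text.

```python
def preprocess(previous, current, half_width=2):
    """Difference two frames, crop around the change, and binarize.

    Frames are lists of rows of pixel intensities. The crop is a window
    of 2 * half_width + 1 columns centered on the changed pixels
    (assumption: horizontal cropping only).
    """
    height, width = len(current), len(current[0])
    diff = [[current[r][c] - previous[r][c] for c in range(width)]
            for r in range(height)]
    # Columns where anything moved between the two frames.
    changed = [c for r in range(height) for c in range(width) if diff[r][c] != 0]
    center = (min(changed) + max(changed)) // 2 if changed else width // 2
    window = 2 * half_width + 1
    left = max(0, min(center - half_width, width - window))
    # Keep only the window and map every non-zero difference to 1.
    return [[1 if value != 0 else 0 for value in row[left:left + window]]
            for row in diff]


previous_frame = [[0] * 8 for _ in range(3)]
current_frame = [[0] * 8 for _ in range(3)]
current_frame[1][5] = 255   # a single pixel changed between the frames

binary = preprocess(previous_frame, current_frame)
```

The result is a $3 \times 5$ binary image with the changed pixel in the middle column, regardless of where in the frame the change occurred.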

Our methods are trained on 5000 transitions that are collected from a random policy. Even though this is not a sparse-reward environment, we calculated new rewards for the RewPred network according to Equation 5.3 with a discount factor of $\gamma = 0.9$ and a maximum horizon of $M = 6$. This has the effect that every time step is associated with a reward of $+1$, except for the six transitions leading up to the end of the episode, which have the rewards $-(0.9^{5}), -(0.9^{4}), -(0.9^{3}), -(0.9^{2}), -(0.9)$ and $-(1.0)$.
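The reward targets described above can be reproduced with a short sketch. It assumes that an episode ends with a reward of $-1$ and that the last six transitions correspond to $m = 0, \dots, 5$ in Equation 5.3; the function name `cartpole_targets` is hypothetical.

```python
def cartpole_targets(episode_length, gamma=0.9, horizon=6):
    """Targets of +1 everywhere except the last `horizon` transitions."""
    targets = [1.0] * episode_length
    for m in range(min(horizon, episode_length)):
        # m steps before the terminal transition, whose reward is -1.
        targets[episode_length - 1 - m] = -(gamma ** m)
    return targets


targets = cartpole_targets(episode_length=8)
```

For an episode of eight transitions, the first two targets stay at $+1$ and the remaining six follow the sequence $-(0.9^{5}), \dots, -(0.9), -(1.0)$.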

6.2.2 Room environments

We compare the performance of PPO2 and ACKTR policies as they use our methods for preprocessing on the two-room and four-room goal tasks from the previous chapter, see Section 5.5.1. The data gathering for the representations used for preprocessing for the RL agents is unchanged from the previous chapter.

6.2.3 Obstacle avoidance

In this gridworld environment, the agent is placed in a room filled with circular objects that move randomly in each step (Fig. 6.2).

Fig 6.2: Obstacle avoidance environment. The agent is rewarded for maximizing the distance between itself and the closest circle. The left figure displays the full world state and the right figure displays the corresponding observation that the agent receives.

The agent receives visual inputs of dimension $56 \times 56 \times 3$ and must stay as far away from the objects as possible, as it receives a reward that is equal to the Euclidean distance between itself and the closest circle. The episode ends with a reward of $0$ after $100$ time steps or when the agent collides with a circle, which gives a reward of $-1$. The discount factor for the environment is 0.9.
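The reward rule of the obstacle avoidance task can be sketched as follows. The sketch assumes that positions are $(x, y)$ coordinates and that distances are measured between centers; the function name `step_reward` and its flags are hypothetical.

```python
import math


def step_reward(agent, circles, collided=False, timed_out=False):
    """Reward for one step of the obstacle avoidance task."""
    if collided:
        return -1.0   # the episode ends with a penalty
    if timed_out:
        return 0.0    # the episode ends after 100 time steps
    # Otherwise: Euclidean distance to the closest circle.
    return min(math.dist(agent, circle) for circle in circles)


reward = step_reward((0, 0), [(3, 4), (6, 8)])
```

With circles at distances 5 and 10, the agent receives a reward of 5, so moving away from the nearest obstacle is what increases the return.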

6.2.4 Architectures

All representations and policies have the same architecture as in the previous chapter, except that the output of the representation has been lowered to a 16-dimensional vector for lower computational times, see Table 5.1 and Table 5.2. The RL models have the same architecture except that the representation and policy modules are trained end-to-end. The prediction module of the LARP network is the same as in Chapter 4, see Table 4.3. The mutual information neural estimator network used for training GrICA is the same as in Chapter 3, see Section 3.4.

6.3 Results and discussion

6.3.1 Visual cart-pole

The results from the visual cart-pole experiment can be seen in Fig. 6.3. The deep RL agent achieves the highest mean reward by a significant margin, probably because the environment requires the agent to take quick actions in response to the fast-changing environment.

Fig 6.3: Cart-pole results. Each point is the aggregate of 5 different policies that are trained from scratch and tested for 20 episodes each.

The learning curve is the worst when the agent is trained on the LARP representations, which is to be expected as the representation is learned to be paired with a representation predictor and graph search, neither of which is used in this scenario. The RewPred and GrICA learning curves look similarly good, with the RewPred representation achieving a slightly higher average reward. Both representations should theoretically be useful here: learning a representation that is predictive of how close the pole is to falling down (RewPred) ought to guide the agent’s actions away from states where the pole is about to fall down, and recovering the three statistically independent latent variables generating the environment (GrICA) would be a perfect compression of the environment. In this scenario, however, they do not beat the representation learning of the Deep RL algorithm.

6.3.2 Room goal-finding

The results for training on the smooth-reward prediction RewPred (denoted "Ours 64r" in the previous chapter) and Deep RL representations are kept from the plots in the previous chapter, and we have added the results for the GrICA and the LARP representations.

Two-room environment

The results for the goal-finding task, when the agent starts facing a random direction between two rooms and must locate a static goal, can be seen in Fig. 6.4. We display here the version of the reward-predictive RewPred representation from the previous chapter that is trained on smoothed rewards.

Fig 6.4: Gridworld comparison: two-room goal-finding results. The agent starts in between two rooms and gets a reward for reaching a static goal location that is in one of the two rooms. Each point in the learning curve is aggregated from 10 initializations of policies that run 10 test episodes each.

For the ACKTR policy, both the LARP and the GrICA learning curves are considerably lower than those of the other two (see the previous chapter for the RewPred results), with the GrICA representation again looking slightly better out of the two, both when the mean reward and the minimum episode lengths are considered. This is probably also due to the fact that the advantage of the LARP representation disappears when the predictor that it is trained with is discarded and the representation is being used in a model-free setting. For the PPO2 policy, neither the policies that train on the LARP nor those that train on the GrICA representations show signs of learning to solve the environment.

Four-room environment

In this experiment, no combination of RL agent and representation shows progress toward solving the environment, except for RewPred paired with ACKTR. Both the LARP and GrICA variants barely display learning in the previous experiment, but all signs of progress disappear when the environment is sufficiently complex. The training results can be seen in Fig. 6.5.

Fig 6.5: Four-room goal-finding results. The agent starts at a random location and gets a reward for reaching a dynamic goal location that is in one of four rooms. Each point in the learning curve is aggregated from 10 initializations of policies that run 10 test episodes each.

6.3.3 Obstacle avoidance

The algorithms’ learning curves for the obstacle avoidance experiment can be seen in Fig. 6.6. For both PPO2 and ACKTR, we observe that preprocessing with our methods is not beneficial as the environment is solved the fastest using deep RL, and by a significant margin for ACKTR. This shows that our methods are outperformed by regular deep RL for environments where short-term, reactive policies are needed, in contrast to the goal-finding tasks, where the agent needs to execute a long-term plan.

Fig 6.6: Obstacle avoidance results. The agent gets a reward for keeping its distance from objects that move randomly. Each point in the learning curve is aggregated from 10 initializations of policies that run 20 test episodes each.

6.4 Conclusion

In this chapter, we investigated how the three methods that we have developed in this PhD work compare when they are used for preprocessing visual inputs for RL agents. The RL agents are tested in four different environments, two of which require short-term decision-making and the other two require long-term decision-making for success. Our representation that was developed in Chapter 3, GrICA, does not facilitate learning for any of the environments. The same holds true for the Chapter 4 representation, LARP, although that one was not developed for preprocessing inputs to RL agents but rather to be used in conjunction with a prediction function for planning in a latent representation space. Our reward-predictive representation from Chapter 5 is shown to speed up learning for a deep RL agent on the long-term planning tasks.

Chapter 7 Summary and conclusion

Just as the complexity of the tasks that deep reinforcement learning (RL) is being applied to increases, so does the apparent number of problems the field has. The goal of this research was to discover useful representations that can be used to alleviate two of modern deep RL’s problems, namely, those of training instability and data inefficiency. Over the course of the PhD work, we developed three representation learning methods and investigated their suitability for fulfilling this goal:

A gradient-based ICA method, GrICA, for learning statistically independent features (Chapter 3). We estimate the mutual information between one output component of an encoder and all the others using a neural network. The mutual information estimator and the encoder are trained in a push-pull fashion, resulting in a system that outputs statistically independent features of the input. The output of GrICA is compared to the output of FastICA, an established ICA method in the literature, for noisy blind signal separation.
A system that we call "Latent Representation Predictor" (LARP, Chapter 4), which learns a transition model of the environment. Our algorithm learns a state representation jointly with a one-step lookahead predictor. We discuss and compare three different constraints that can be placed on the system to prevent the solution from collapsing to a constant function. We evaluate our system by combining it with graph search to manipulate toy objects to match a given viewpoint. Our approach outperforms deep RL in a low-data regime on the viewpoint-matching task.
A reward-predictive representation, RewPred, that is learned jointly with a reward predictor (Chapter 5). The reward predictor is employed for reward shaping: the agent is rewarded for moving from states of low predicted reward to states of higher predicted reward. The method is tested in several gridworld environments where the agent must reach a goal. In our experiments, the learning of deep RL agents is sped up when their inputs are preprocessed by the RewPred representation, and the reward shaping is helpful when the environment is sufficiently complex.
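The reward-shaping idea in the last item can be illustrated with a minimal sketch. It assumes a potential-based shaping term in which the learned reward predictor plays the role of the potential; the exact shaping term used in Chapter 5 is not restated here, so this form, the function names, and the `gamma` and `scale` parameters are all illustrative assumptions.

```python
from typing import Callable, TypeVar

State = TypeVar("State")


def shaped_reward(
    env_reward: float,
    state: State,
    next_state: State,
    predict_reward: Callable[[State], float],
    gamma: float = 1.0,
    scale: float = 1.0,
) -> float:
    """Environment reward plus a bonus for moving toward higher predicted reward."""
    # Assumed potential-based form: gamma * Phi(s') - Phi(s), where the
    # reward predictor stands in for the potential Phi.
    bonus = gamma * predict_reward(next_state) - predict_reward(state)
    return env_reward + scale * bonus


# Hypothetical stand-in for a trained reward predictor on a 5-cell corridor
# whose goal is cell 4: the predicted reward grows linearly toward the goal.
def toy_predictor(cell: int) -> float:
    return cell / 4


toward_goal = shaped_reward(0.0, state=2, next_state=3, predict_reward=toy_predictor)
away_from_goal = shaped_reward(0.0, state=2, next_state=1, predict_reward=toy_predictor)
print(toward_goal, away_from_goal)
```

In this toy corridor, a step toward the goal earns a positive bonus and a step away earns a negative one, even though the environment reward itself is zero for both steps.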

Not every unsupervised learning technique yields a state representation that is useful in the context of RL, as seen by the results across the board using our GrICA algorithm. However, based on our results, we have found that the training of deep RL methods can be augmented by learning appropriate representations in a self-supervised manner in environments where the agent must carry out long-term planning to reach a single goal state.

An important limitation of this work is the high cost of computational resources required for carrying out RL research. To illustrate, reproducing DeepMind’s 2017 Go paper (silver2017mastering) is estimated to cost 35 million dollars using Google’s cloud computing service (dan2018how), a budget currently unavailable to PhD students. Reproducing their results only involves maximizing the performance of a single model, which does not take into account the work behind finding the final model. RL methods have many potential settings to tune, and running hyperparameter searches is costly. For this reason, we concentrated only on using the default parameters of RL model implementations as offered by the Stable Baselines package. To make matters worse, the performance of RL techniques can be highly sensitive to the hyperparameter settings, and bad luck can cause the researcher to dismiss a method if unfortunate values are initially chosen. A larger computational budget allows researchers to try more combinations of parameters when new methods are tested, lessening the risk of promising approaches being prematurely dismissed due to bad luck.

The methods were also tested for preprocessing the visual inputs of deep RL algorithms in environments where there is a set of states that the agent must avoid: in one task the agent must prevent a pole from falling off a cart, and in the other it must avoid colliding with randomly moving objects. Preprocessing the inputs using our methods is not beneficial in these cases, potentially indicating that allowing deep RL algorithms to learn the state representations is preferable for tasks that require immediate reactions to quickly changing environments, particularly where inaction leads to an undesirable outcome.

Although our gradient-based ICA was not found advantageous in our goal of augmenting deep RL training, it is successful in recovering independent, noisy sources just as well as FastICA. A promising avenue of research is to take advantage of the flexibility offered by our method. Our algorithm can be paired with any differentiable function, such as convolutional neural networks (convnets), to tackle more difficult problems, for example nonlinear ICA. Even though general nonlinear ICA problems are ill-posed, regularization can be applied to make them well-posed. This can in principle be done using our method via the design of the neural network.

One aspect of representation learning that has become popular in recent years, but was not investigated in this PhD work, is the application of memory modules in neural networks. Memory can be integrated into our methods by designing our function approximators as recurrent neural networks, for example by combining long short-term memory networks with convnets.

Although the problem of learning representations in the context of RL remains unsolved, the implication of our work is that taking advantage of the rich supervisory signal hidden in the unlabeled data of RL environments can lead to data-efficient, stable algorithms that are resilient to changes in the environment. A straightforward way of extending our work would be to take advantage of the full data that is given in every transition in an RL environment: the current state $s$, the action $a$ taken in the state $s$, the resulting next state $s^{'}$ and the reward $r$ given to the agent by the environment. It would be an interesting future line of work to do this by simply combining the loss functions of LARP, which trains on $(s, a, s^{'})$ tuples, and RewPred, which trains on $(r, s^{'})$ tuples. This new algorithm would train the representation to be simultaneously predictable for a representation predictor and useful for a reward predictor. See Fig. 7.1 for a Venn diagram that summarizes the sources of supervision that our proposed extension would use in addition to the sources of supervision that are used by our methods.
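As a sketch of what such a combined objective could look like, one option is a weighted sum of a latent prediction error and a reward prediction error. The squared-error form of both terms, the reward head $g$, and the trade-off weight $\lambda$ are assumptions made for illustration and are not taken from Chapters 4 and 5; $\phi$ denotes the representation and $f$ the predictor network.

$$
L_{\text{combined}}(\theta) = \left\| f\big(\phi(s), a\big) - \phi(s^{'}) \right\|^{2} + \lambda \, \big( g(\phi(s^{'})) - r \big)^{2}
$$

The first term uses the $(s, a, s^{'})$ part of the transition and the second the $(r, s^{'})$ part. As with LARP on its own, the first term would presumably still need one of the constraints that prevent the representation from collapsing to a constant function, although the reward term may already discourage a constant solution when rewards vary across states.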

Fig 7.1: Supervision source Venn diagram. An illustration of the components of the RL transition tuples $(s, a, r, s^{'}) = (\text{state}, \text{action}, \text{reward}, \text{next state})$ that are used by our methods: GrICA (Chapter 3), LARP (Chapter 4), RewPred (Chapter 5). These tuples consist of all the information involved in a single transition in an RL environment. The intersection of every component corresponds to the theoretical extension of our work, combining elements of LARP and RewPred.
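The subsets of the transition tuple that each method consumes can also be written down as a small sketch. The `Transition` type and the four helper functions are hypothetical names introduced only for illustration; the choice to feed GrICA the current state is likewise an assumption, since the method only needs an unsorted collection of states.

```python
from typing import NamedTuple, Tuple

Observation = Tuple[float, ...]


class Transition(NamedTuple):
    """One RL transition (s, a, r, s')."""
    state: Observation
    action: int
    reward: float
    next_state: Observation


def grica_inputs(t: Transition) -> Tuple[Observation]:
    # GrICA only needs individual states, with no ordering or actions.
    return (t.state,)


def larp_inputs(t: Transition) -> Tuple[Observation, int, Observation]:
    # LARP trains on (s, a, s') tuples.
    return (t.state, t.action, t.next_state)


def rewpred_inputs(t: Transition) -> Tuple[float, Observation]:
    # RewPred trains on (r, s') tuples.
    return (t.reward, t.next_state)


def combined_inputs(t: Transition) -> Transition:
    # The proposed extension would use the full tuple (s, a, r, s').
    return t


example = Transition(state=(0.0, 1.0), action=2, reward=0.5, next_state=(1.0, 1.0))
print(larp_inputs(example))
print(rewpred_inputs(example))
```

Written this way, the proposed extension is simply the case in which no component of the transition is discarded.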

This dissertation adds to a rapidly growing body of knowledge that sits at the intersection of deep learning, reinforcement learning and representation learning. We contribute three novel methods of learning state representations to the literature and experimentally evaluate their effectiveness in the context of reinforcement learning. A long road to artificial intelligence that matches our general and efficient problem-solving capability still lies ahead of us.

Bibliography

Appendix A

Nomenclature

$S$: State space
$A$: Action space
$R$: Real numbers
RL: Reinforcement Learning
MDP: Markov Decision Process
POMDP: Partially Observable MDP
MLP: Multilayer Perceptron
ANN: Artificial Neural Network
DL: Deep Learning
PCA: Principal Component Analysis
ICA: Independent Component Analysis
SF: Successor Features
GrICA: Our Gradient-based ICA Algorithm (from Chapter 3)
LARP: Latent Representation Prediction (from Chapter 4)
RewPred: Our Reward-predictive Representation (from Chapter 5)
MINE: Mutual Information Neural Estimation
CAE: Convolutional Autoencoder
VAE: Variational Autoencoder
LEM: Laplacian Eigenmaps
Conv.: Convolutional
ReLU: Rectified Linear Unit
t-SNE: t-distributed Stochastic Neighbor Embedding
PPO: Proximal Policy Optimization
DQN: Deep Q-Network
ACKTR: Actor Critic using Kronecker-Factored Trust Region
MSE: Mean Squared Error
MBRL: Model-based Reinforcement Learning
MFRL: Model-free Reinforcement Learning
$ϕ$: Representation
$η$: Learning rate parameter
$π$: Policy function
$π^{*}$: Optimal policy
$θ$: The parameters of a differentiable function
$P$: State transition
$P$: Probability
$R$: Reward function
$V_{π}$: State-value function of $π$
$q_{π}$: Action-value function of $π$
$D$: Dataset
$Ω$: Observation space
$O$: Observation function
$t$: Time step index
$a$: Action
$s$: State
$o$: Observation
$r$: Reward
$γ$: Discount factor
$L$: Loss function
