\makenomenclature\pdfbookmark

[0]Abstractabstract

Visualprocessingincontextof

reinforcementlearningHlynurDavíðHlynssonAthesispresentedforthedegreeof

DoctorofEngineeringFacultyofElectricalEngineeringandInformationTechnology

Ruhr-UniversityBochum

Germany

July2021

Visualprocessingincontextofreinforcementlearning

DissertationforthedegreeofDoctorofEngineeringoftheFacultyofElectricalEngineeringandInformationTechnologyattheRuhr-UniversitätBochum

Nameoftheauthor: HlynurDavíðHlynsson
Placeofbirth: Reykjavík
Firstsupervisor: Prof.Dr.LaurenzWiskott
Ruhr-UniversitätBochum,Germany
Secondsupervisor: Prof.Dr.TobiasGlasmachers
Ruhr-UniversitätBochum,Germany
Yearofthesissubmission: 2021
Dateoftheoralexamination: March17th2022

Abstract

Althoughdeepreinforcementlearning(RL)hasrecentlyenjoyedmanysuccesses,itsmethodsarestilldatainefficient,whichmakessolvingnumerousproblemsprohibitivelyexpensiveintermsofdata.Weaimtoremedythisbytakingadvantageoftherichsupervisorysignalinunlabeleddataforlearningstaterepresentations.ThisthesisintroducesthreedifferentrepresentationlearningalgorithmsthathaveaccesstodifferentsubsetsofthedatasourcesthattraditionalRLalgorithmsuse:(i)GRICAisinspiredbyindependentcomponentanalysis(ICA)andtrainsadeepneuralnetworktooutputstatisticallyindependentfeaturesoftheinput.GrICAdoessobyminimizingthemutualinformationbetweeneachfeatureandtheotherfeatures.Additionally,GrICAonlyrequiresanunsortedcollectionofenvironmentstates.(ii)LatentRepresentationPrediction(LARP)requiresmorecontext:inadditiontorequiringastateasaninput,italsoneedsthepreviousstateandanactionthatconnectsthem.Thismethodlearnsstaterepresentationsbypredictingtherepresentationoftheenvironment’snextstategivenacurrentstateandaction.Thepredictorisusedwithagraphsearchalgorithm.(iii)RewPredlearnsastaterepresentationbytrainingadeepneuralnetworktolearnasmoothedversionoftherewardfunction.TherepresentationisusedforpreprocessinginputstodeepRL,whiletherewardpredictorisusedforrewardshaping.Thismethodneedsonlystate-rewardpairsfromtheenvironmentforlearningtherepresentation.Wediscoverthateverymethodhastheirstrengthsandweaknesses,andconcludefromourexperimentsthatincludingunsupervisedrepresentationlearninginRLproblem-solvingpipelinescanspeeduplearning.

\pdfbookmark

[0]KurzfassungderDissertationkurzfassung

KurzfassungderDissertation

ObwohltiefesVerstärkungslernen(VL)indenletztenJahrengroßeErfolgeerzielthat,sinddessenMethodenimmernochdatenineffizient,wasdieLösungvielerProblemeunerschwinglichmacht.WiruntersuchendieMöglichkeit,dieszubeheben,indemwirdasinformationsreicheÜberwachungssignalinnichtgekennzeichneteDatenfürdieDarstellungvonLernzuständennutzen.IndieserArbeitwerdendreiverschiedeneRepräsentationslernalgorithmenvorgestellt,dieZugriffaufverschiedeneTeilmengenderDatenquellenhaben,dieherkömmlicheVL-AlgorithmenzumLernenverwenden:(i)GrICAistvonderunabhängigenKomponentenanalyse(ICA)inspiriertundtrainierteintiefesneuronalesNetzwerk,umstatistischunabhängigeKomponentenderEingabeauszugeben.GrICAminimiertdiegemeinsamenInformationenvoneinzelnenMerkmalenmitdenjeweilsanderenMerkmalen.ZusätzlicherfordertGrICAlediglicheineunsortierteSammlungvonUmgebungszuständen.(ii)LatentRepresentationPrediction(LARP)erfordertmehrKontextdaten:AlsEingabebenötigtsiezusätzlichzueinemZustandauchdenentsprechendenvorherigenZustandundeineHandlung,welchedieseverbindet.DieMethodelerntZustandsdarstellungen,indemsiedieDarstellungdesnächstenZustandsderUmgebungmithilfeeinesaktuellenZustandsundeineraktuellenAktionvorhersagt.DerPrädiktorwirdzusammenmiteinemGraphensuchalgorithmusverwendet.(iii)RewPredlerntdieZustandsdarstellung,indemeintiefesneuronalesNetzwerktrainiertwirdeinegeglätteteVersionderBelohnungsfunktionzulernen.DieDarstellungwirdzurVorverarbeitungvonEingabenimtiefenVLverwendet,währendderBelohnungsprädiktoralsBelohnungsformungdient.DieseMethodebenötigteinzigStatus-Belohnungs-PaareausderUmgebung,umdieDarstellungzulernen.Wirstellenfest,dassjedeMethodeihreStärkenundSchwächenhat,undschließenausunserenExperimenten,dassdasEinbeziehenvonunbeaufsichtigtemRepräsentationslerneninVL-ProblemlösungspipelinesdasLernenbeschleunigenkann.\pdfbookmark[0]Dedicationdedication

\pdfbookmark

[0]Acknowledgementsacknowledgements

Acknowledgements

IwanttofirstthankmysupervisorProf.LaurenzWiskottforgivingmetheopportunitytoresearchthenichesofmachinelearningthatIfindinteresting.Hisinsightfulfeedbackandoutstandingenthusiasmandintuitionforthefieldprovedinvaluabletomeandothersinthisfastgrowingareaofresearch.Theadviceofmysecondsupervisor,TobiasGlasmachers,alsoprovedextremelyhelpful,especiallyonthetopicofreinforcementlearning.I’mgratefulfortheendlessloveandsupportfrommypartnerLisaSchmitz,whomadethetimeofmyPhDstudiesthebestinmylife–sofar.Myspecialthanksgoalsotoherparents,RosemarieSchmitzandGeorgSchmitz,fortheirsupportduringthistime.I’mthankfulforthehelpfulandpleasantenvironmentcreatedbytheotherPhDstudentsoftheInstitutfürNeuroinformatik(INI):MerlinSchüler,RobinSchiewer,ZahraFayyaz,EddieSeabrook,MortizLange,FrederickBaucks,JanBollenbacherandJanTekülve.IwouldalsoliketothanktwoalumnioftheINI,AlbertoEscalanteandFabianSchönfeld,forthesupporttheyhavegivenme.IalsowanttothanktheINIstaffoutsidethegroupfortheirhelpovertheyears:ArnoBerg,AngelikaWilleandKathleenSchmidt.Lastbutnotleast,IwanttothankmylovingfamilyforcreatingthecircumstancesthatgavemeroomtotraintheskillsthatIneededtopursueaPhDtobeginwith:ÓlöfIngibjörgEinarsdóttir,HlynurHöskuldsson,ÓlafurHlynssonandHöskuldurHlynsson.

Contents
List of Figures

Chapter 1 Introduction

Mankindhasbeeninterestedintheconceptofinfusinginanimateobjectswithitsintellectforthousandsofyears,withstoriesofartificiallyintelligentbeingsreachingbackthousandsofyears.ThisinterestmanifestsitselfforexampleinthestorytoldbyancientGreeksofthegreatautomataTalos,whowascraftedoutofbronzebythesmithinggodHephaestustoprotectthemythologicalqueenEuropa(rhodios2008argonautika).Automatahavebeenbuiltbycraftsmenfromdifferentculturesthroughouttheages,buttheyhavebeensimplemechanicalbeings(mccorduck1979machines).Thepossibilityofsatisfyingthehumandesiretocraftintelligentbeingshasonlyariseninthemiddleofthe19thcenturywiththefoundingofartificialintelligence(AI)asanacademicdiscipline.ThevastscopeofAIhasgivenrisetodifferentfields,eachwithitsownapplicationdomains,methodologiesandphilosophies.Onesuchexampleismachinelearning(ML),anareaofAIthatisconcernedwithalgorithmsthatleveragedatafordecision-making.Thisfieldhasexperiencedgreatsuccessinbothacademiaandindustryinthelastdecadewithincreasedaccesstopowerfulcomputersandlargedatabasesinadditiontoadelugeofadvancedcomputationaltechniquesandflexiblesoftwaresolutions(clark20152015).Thefieldisnotwithoutitsdrawbacks,however.Machinelearningalgorithmsneeddatacorrespondingtomonthsoryearsofhumanexperiencetogetcompetenceintasksthatapersoncanmasterinminutes.Peoplehavetheadvantagethattheycomeequippedwithastrongerunderstandingoftheworld.Inthisdissertation,weaimtoleveltheplayingfieldforMLmodelsthatperformsequentialdecision-makingbyexploringdifferentwaysforthemto"understand"theworldthroughrepresentations.InSection1.1,wediscussoneofthemostpromisingfieldsofartificialintelligence,deepreinforcementlearning,andmentionitsvictories.ThecurrentsituationisevaluatedinSection1.2wherewedescribethechallengesanddisadvantagesofthefield.ThevalueoftheresearchdirectionweputforwardisunderlinedinSection1.3aswellasourresearchobjectiveandhypothesis.ThechapterconcludeswithSection1.4whereweoutlinetheorderofcontentinthedissertation.

1.1 Deepreinforcementlearning

Awatershedmomentforartificialintelligencehappenedwhenkrizhevsky2012imagenetcombinedseveraltechniquesfromtheliteratureandconstructedadeep111Neuralnetworksthatprocesstheinputhierarchicallyusingatleastmorethantwolayersofcomputationallayersarecalleddeepneuralnetworktooutperformthecompetitionbyasignificantmargininanimageclassificationcontest.Thiswasthecatalystoftheso-calleddeeplearningrevolution(sejnowski2018deep)whichhasimpactedfieldssuchasnaturallanguageprocessing(wolf2020transformers),bioinformatics(li2019deep),computervision(khan2018guide),frauddetectionandmanyothers(alom2019state).Theareaofmachinelearningconcernedwiththetrainingofdeepneuralnetworksiscalleddeeplearning.Deeplearningmethodshavetheadvantagethattheyautonomouslylearnpatternsinthedatainahierarchicalmanner.Forexample,edgesareusefulpatternsforpicturesofshapessuchassquares,trianglesandcircles(patrick2010ai).Theedgescanbecombinedtocornersandthenumberofcornerscanbecountedfordistinguishingbetweenthedifferentshapes.Sincedeeplearningencompassesabroadsetofmachinelearningalgorithms,itcanbereadilycombinedwithotherareasofmachinelearning.Onesuchareaisreinforcementlearning,wheregeneralgoal-directeddecision-makingproblemsarestudied.Reinforcementlearning(RL)methodstrainmodelsthatareinteractingsequentiallywiththeirenvironmentstomaximizearewardsignal.DeeplearningisfrequentlycombinedwithRLtechniques,allowingthemodelstomaptheinputs,suchashigh-dimensionalimagedata,directlytoactions.Thiscombinationhasyieldedpromisingresultsindifferentareas,rangingfromrecommendersystems(zhang2019deep)overautonomousdriving(kiran2021deep)toplayinggames(silver2017mastering).

1.2 Openproblems

Aknownproblemofdeepneuralnetworksisthattheyrequirealargeamountofdataforadequateperformance.Thisdatacanbeprohibitivelyexpensivetoobtain–eitherintermsoftimeneededtocreatedataandtrainthemodelsforreinforcementlearningmethodsormonetarycostofacquiringhuman-labeledtrainingdataforsupervisedlearningmethods–whichhasencouragedthedevelopmentofmethodsthatlearnrepresentationsfromstreamsofmorereadilyavailable,unlabeleddata.Thisproblemincreasesinseverityindeepreinforcementlearning(DRL).Intheuncompromisinglytitledblogpost,Deepreinforcementlearningdoesn’tworkyet,irpan2018deepidentifiesseveralfundamentalproblemsofDRL.Oneofthemistheproblemofsampleinefficiency,wheremanyhighlypublicizedstate-of-the-artresultsonvideogamesrequirehundredsofmillionsofframesofexperiencetoachieveperformancethathumansreachinamatterofminutes.Anotherproblemistheoneofinstability.Deepneuralnetworksarehighlyexpressiveandoptimizelargenumbersofparameters.ThismakesthedesignofDRLmodelsdifficult,asthesearchofhyperparameters222Ahyperparameterisbroadlyspeakinganydesignchoicemadebytheprogrammerbeforethelearningofthe"regular"parametersstarts.thatsolvetheproblemcanbequitetime-consuming.Evenwhenapromisingsetofhyperparametersisfound,thedifferencebetweentheperformanceofdifferentmodelslearnedfromscratchcanbesignificant,dependingontherandomseed.ThisincreasedvariancecomesfromthenewsourceofrandomnessthatisintroducedtoRLmodels,comparedtoregularregressionlearning:theagentsactionsarestochastic,increasinglysointhebeginningoflearning333Thereisatradeoffbetweenexploringtheenvironmentandexploitingtheexpectedrewardsignal.AcommonstrategyforRLagentsistostartthelearningwithahighchanceofperformingrandomactionstoexploredifferentstatesoftheenvironmentandthendecreasethischanceasthelearningprogresses..

1.3 Researchaim

Inthisdissertation,weproposemethodsforunsupervisedandself-supervisedlearningofrepresentationsforgoal-directedbehavior.Self-supervisedlearningmethodsuseasubsetoftheinputtopredicttherestofit,foregoingtheneedofannotationswhiletakingadvantageofthepowerfulmachineryofsupervisedlearningmethods.Tacklingtheopenproblemsofdatainefficiencyandinstabilityoutlinedaboveinordertofurtherthefieldisourintentionwiththisthesis.Wedosobydevelopingandinvestigatingthreedifferentapproaches:(i)unsupervisedlearningofarepresentationforRLagents,(ii)amethodofjointlylearningapredictorforplanningarepresentationthatisgoodforthetransitionprediction,and(iii)learningarepresentationforRLagentsasthebyproductofrewardprediction.WerelatethedataneededtolearntherepresentationsforourmethodstotheavailabledatainthecontextofRLinTable1.1.Ourhypothesisisthatsuitablestaterepresentationsthatreducethecomplexityofhigh-dimensionalinputsinRLsettingscansupportamorestableanddataefficientlearningthanhavingdeepRLalgorithmslearnstaterepresentationsfromscratch.

Subsetofrequiredforlearning Method Chapter
GrICA 3
LARP 4
RewPred 5
Table 1.1: Inputdatatypepermethod.IfanRLagentisinastateandperformstheaction,itwillreceivearewardofandtransitiontothestate.Themethodsproposedinthisthesislearnstaterepresentationsbyprocessingdifferentsubsetsofthedatatuples.

1.4 Thesisoutline

Hereweoutlinethestructureofthethesis.Threeofthechaptersareadaptedfromtheworkthatwerepublishedoverthecourseofthedoctoralwork.\nobibliography*

  • Chapter2:Background.Inthischapter,wegoinfurtherdetailsonthemaintopicsinthisthesisanddiscusstheirfundamentals.WeintroducetheformalismofMarkovdecisionprocessesandexplainthedifferencebetweenmodel-basedandmodel-freereinforcementlearningalgorithms.Themachinerybehinddeeplearningisthenexplainedandthemainbuildingblocksofdeepneuralnetworksareillustrated.Thechapterconcludeswithadiscussionofthemainrepresentationlearningmethods.

  • Chapter3:Learninggradient-basedICAbyneurallyestimatingmutualinformation.Thischapterdiscussesanadaptionofindependentcomponentanalysis(ICA)forDL.Weintroduceanovelapplicationofaneuralmethodformutualinformationestimationtolearnarepresentationwithstatisticallyindependentfeatures.Thechapterisanadaptedversionof

    • \bibentry

      hlynsson2019learning(hlynsson2019learning)

  • Chapter4:Latentrepresentationpredictionnetworks.Thischapterdiscussesamethodformanipulableenvironmentsforjointlylearningarepresentationofobservationsandamodelforpredictingthenextrepresentation,givenanaction.Welearntherepresentationinaself-supervisedmanner,withouttheneedofarewardsignal.Weintroduceanewenvironmentthatisakintomanipulatingtoyobjectsforaviewpointmatchingtask.Therepresentationiscombinedwithagraph-searchalgorithmtofindthegoalviewpoint.Thechapterisanadaptedversionof

    • \bibentry

      hlynsson2020latent(hlynsson2020latent)

  • Chapter5:Rewardpredictionforrepresentationlearningandrewardshaping.Thischapterdiscussesaself-supervisedlearningmethodtomaphigh-dimensionalinputstoalowerdimensionalspaceforRLagents.Weintroduceatechniquewherearepresentationlearnedforarewardpredictorisusedtoshapetherewardfortheagents.Thechapterisanadaptedversionof

    • \bibentry

      hlynsson2021reward(hlynsson2021reward)

  • Chapter6:Comparisonofourmethods.Inthischapter,wedirectlycomparethethreedifferentmethodstostate-of-the-artdeepRLmethodsonfourdifferentenvironments:avisualpole-balancingenvironment,twogoal-findingenvironmentandanobstacleavoidanceenvironment

  • Chapter7:Summaryandconclusion.Thischapterclosesthedissertationwithabriefsummaryofthethesis,concludingremarksandpossiblefuturework.Thefollowingworkwasalsopublishedoverthecourseofthedoctoralstudies:

    • \bibentry

      hlynsson2019measuring(hlynsson2019measuring)

    Thepapercomparessupervisedlearningmethods,butitistoodissimilarintopicfromtherestoftheworkandisthuschosentobeomittedfromthisdissertation.

Chapter 2 Background

Thischapterlaysoutthefundamentalconceptsofmachinelearningthatformsthefocalpointoftherestofthedissertation.InSection2.1,welayoutthemainobjectofstudyinreinforcementlearning(RL),partiallyobservableMarkovdecisionprocesses,andpresentabrieftaxonomyofreinforcementlearningalgorithms.WeexplainthebasicsofartificialneuralnetworksanddeeplearninginSection2.2.Themostcommonlyusedtypeofneuralnetworkusedforprocessingvisualdata,theconvolutionalneuralnetwork,isthendescribed,alongwithsomeofitsmainbuildingblocks.InSection2.3,wediscussrepresentationlearning(alsoknownasfeaturelearning)andmotivateitinthecontextofreinforcementlearning.Foramorein-depthdiscussionofthesetopics,wereferthereadertothecomprehensivetextbookonRLbysutton2018reinforcement,thedeeplearningbookbyGoodfellow-et-al-2016andtheexcellentsurveybybengio2013representationonrepresentationlearning.

2.1 Reinforcementlearning

Inthissection,weformalizeRLfortherestofthethesis.RLisoneofthemaindisciplinesofmachinelearning,anditcovershowagentscanlearntobehaveoptimallyinanenvironmenttomaximizeacumulativereward.

2.1.1 PartiallyobservableMarkovdecisionprocesses

Apartially-observableMarkovdecisionprocess(POMDP)isageneralframeworkformodelingsequentialdecisionprocessesinenvironmentsthatcanbestochastic,complexandcontainhiddeninformation.Formally,itisatuple

(2.1)

whichwealsorefertoastheenvironment.Thetupleismadeupofthefollowingelements:

  • Thestatespacedefinesthepossibleconfigurationsoftheenvironment

  • Theactionspacedescribeshowtheagentisabletointeractwiththeenvironment

  • Thetransitionfunctiondictatestheeffectsofdifferentactionsindifferentstates

  • Therewardfunctiondeterminestheimmediaterewardgiventotheagentfortransitioningbetweenanytwostateswithanyaction

  • Theinitialstatedistribution

  • Theobservationspacedefinestheaspectsoftheenvironmentthattheagentcanperceive

  • Theobservationfunctiondefineswhat(potentiallytransformed)subsetoftheenvironmenttheagentreceivesafteractinginagivenstate

  • Therewarddiscountfactor

Theenvironmentstartsinastatedrawnfrom,fromwhichtheagentinteractssequentiallywiththeenvironmentbychoosingactionfromactionspaceattimesteps.Theagentreceivesanobservationandarewardaftereachaction.TheobjectiveofanRLagentistolearnapolicythatdeterminesthebehavioroftheagentintheenvironmentbymappingstatestoaprobabilitydistributionover,written.AdiscountfactorisusuallyincludedinthedefinitionofPOMDPs,anditcomesintoplayintheoptimizationfunctionoftheagent.Namely,thepolicyshouldmaximizetheexpecteddiscountedfuturesumofrewards,ortheexpectedreturn,wherethereturnisdefinedas

(2.2)

Thevaluefunctionisdefinedastheexpectationofthereturn(Eq.2.2),givenapolicyandaninitialstate

(2.3)

Thereisatleastoneoptimalpolicythatisbetterthanorequaltoothers:forallstatesandallotherpolicies.

2.1.2 Model-freealgorithms

Model-freereinforcementlearninglearnsthepolicyoravaluefunctiondirectlyfromexperiencewithoutattemptingtoapproximatethedynamicsoftheenvironment.Twopopularclassesofmodel-freemethodsarevalue-basedmethodsandpolicy-basedmethods.Value-basedmethodsapproximateeitherthevaluefunction(sutton1988learning)oranotherusefulfunctionthatissimilartothevaluefunction,theaction-valuefunction.Thisfunctionisdefinedastheexpectedreturnoffollowingthepolicyaftertakinganactioninastate:

(2.4)

Estimatingtheaction-valuefunctionisapivotalstepforalgorithmssuchasQ-learning(watkins1992q).Asimpleone-stepQ-learningupdatingruleis(sutton2018reinforcement):

(2.5)

whereisapositivelearningrateparameterandtheinitialvaluesofarechosenarbitrarily.Q-learningisguaranteedtoconvergetotheoptimalpolicy’saction-valuefunction,undercertainconditions111Thisdependsonagoodlearningratescheduleandexplorationtechniques,whicharedifficulttodetermineinpractice,whichinturnyieldstheoptimalpolicy:.Thismethodtabulatesthevaluesandthusworkswithdiscreteactionsandstatespaces.Q-learninghasbeencombinedwithdeepneuralnetworkstoworkforactionsandstatespacesofhigherdimensions(mnih2015human).Policy-basedmethodsdonotlearnavaluefunction,butratherlearnthepolicydirectlybyoptimizinganobjectivefunctionwithrespectto.Wedescribetwoofthosemethodsthatweemployinthiswork:(1)proximalpolicyoptimization(PPO)(schulman2017proximal)and(2)actorcriticusingKronecker-factoredtrustregion(ACKTR)(wu2017scalable).PPOoptimizestheobjectivefunction

(2.6)

whereisahyperparameter,andisanestimatoroftheadvantagefunction.Thecliptermreturnsif,otherwisethevalueisclippedtothecloserboundaryvalue.ThefullPPOalgorithmisshowninAlgorithm1.

1:  for iteration do
2:     for actor do
3:        Runpolicyinenvironmentforttimesteps
4:        Computeadvantageestimates
5:     end for
6:     Optimizewithrespectto,withkepochsandminibatchsize
7:     
8:  end for
Algorithm 1 Proximalpolicyoptimization

ACKTRappliesthepolicygradientupdates

(2.7)

where,isthelog-likelihoodoftheoutputdistributionofthepolicyandthelearningrateiscontrolleddynamicallywiththetrustregionparametertopreventthepolicyfromconvergingprematurelytoapoorpolicy.

2.1.3 Model-basedalgorithms

Model-basedreinforcementlearning(MBRL)algorithmslearntheoptimalpolicybyfirstestimatingthetransitionfunctionandtherewardfunction.Thesefunctionsareusuallycalledtheenvironmentdynamicsorworldmodelandarelearnedinasupervisedfashionfromadatasetofobservedtransitions,.Theworldmodelscanbeusedinmultipledifferentways,dependingonthealgorithm,toderivetheoptimalpolicy.Forexample,sampling-basedplanningalgorithmsuseandtosampleactionsequencesandcalculatetheirexpectedvalues:

(2.8)

TheagentfollowstheactionsequenceassociatedwiththehighestexpectedrewardinEquation2.8.Thisisoftencombinedwithmodel-predictivecontrol(MPC),whereanewactionsequenceiscalculatedaftertakingthefirstactioninthelastsequence.Therearedifferentwaysofchoosingcandidateactionsequences,withthesimplestbeingtherandomshootingalgorithm(richards2005robust),thatdrawstheactionsfromauniformdistribution.

2.2 Deeplearning

Forthelastfewyears,deeplearninghasbeenonthecenterstageofmachinelearningresearch.Wemakeextensiveuseofdeeplearninginthisworkbecauseofitsparallelizability,itsefficientscalingwithlargedatasetsanditscapabilitytoapproximatecomplexfunctions.Themostbasictypeofdeeplearningmethodisthefeedforwarddeepnetwork,whichcompriseslayersofartificialneurons.Thetheoreticalcapabilitiesofartificialneuralnetworkswereguaranteedbycybenko1989approximation:hisUniversalApproximationTheoremhastheimplicationthatanycontinuousfunctionofrealnumberswithvaluesinaEuclideanspacecanbeapproximatedbyaneuralnetworkwithonehiddenlayer.Unfortunately,itisapureexistencetheorem,leavingthetaskofconstructingthenetworktotheengineer.

2.2.1 Theartificialneuron

Anartificialneuron(rosenblatt1958perceptron)isamathematicalfunctionthatmultiplieseachinputwithaconstant,addsabiastothelinearcombinationandthenappliesanon-linearitytotheoutcome:

(2.9)

Thenon-linearityisknownastheactivationfunction,thecoefficientsareknownastheweightsandthetermisknownasthebias.Wenowbrieflydiscusssomecommonlyusedactivationfunctionsthatareemployedinthisthesis.Foramorecomprehensiveoverviewofrecenttrendsintheusageofactivationfunctions,weencouragethereadertolookatacomparisonbynwankpa2018activation.Thelogisticfunctioncanbeusedforbinaryclassification.

(2.10)

Thisfunction"squashes"theinputstoliebetweenand,givingtheoutputaprobabilisticinterpretation.Thesoftmaxfunctionisanextensionofthelogisticfunctionforseveralclasses

(2.11)

Theoutputofthesoftmaxfunctionisavectorofthesamedimensionalityastheinputvectorandsumsto.Thehyperbolictangentfunction(tanh)squashestheinputtoliebetween-1and1

(2.12)

Thishasthecomputationaladvantageoverthelogisticfunctionthatbiasesinthegradientsareavoidedand0-centereddatagivesrisetolargerderivativesduringoptimizationofthenetworks(lecun2012efficient),makingthemamorefrequentchoiceasanactivationfunctioninhiddenlayers.Themostpopularnonlinearityfordeepneuralnetworksistherectifierfunction

(2.13)

Therectifierfunctionoffersthesameadvantagesasthetanhfunctionbutatalowercost,asevaluatingexponentialsandperformingdivisionisavoided.Therectifierfunctionisalsocalledarectifiedlinearunit,anditiscommonlyabbreviatedas"ReLU".

2.2.2 Feedforwardneuralnetworks

ComputationalunitsimplementingthefunctioninEq.2.9canbearrangedhierarchically,withtheinputofaneuronconsistingoftheoutputofotherneurons.Anexamplefeedforwardneuralnetworkormultilayerperceptron(MLP)222Feedforwardneuralnetworksaresometimeslooselyreferredtoasmulti-layerperceptrons(MLPs),namedafteranearlyartificialneuronmodelcalledtheperceptron.However,perceptronsuseahardthresholdactivationfunctionwhilemodernMLPscanuseanydifferentiableactivation,sotheyareoftennotperceptrons,inthestrictmeaningoftheword.isdepictedinFigure2.1.

Fig 2.1: AFully-connectedneuralnetwork.Thisnetworkhasthreeinputsandtwolayers:onehiddenlayerwithfourunitsandanoutputlayerwithtwounits.

Thefigureshowsanetworkwithonehiddenlayer,butitcaninprinciplehaveanynumberofhiddenlayers.Thesameistrueforthenumberofinputsandoutputs.

2.2.3 Optimizingneuralnetworks

Trainingadeepneuralnetworkinvolvestrainingdataandalossfunction.Fortraininganartificialneuralnetwork,anappropriatelossfunctionhastobefoundtomatchboththetaskathandalongwiththefinallayer’sactivationfunction.Thelossfunctionmeasuresthedifferencebetweentheoutputofthenetwork,whenthedataispassedthroughit,andthedesiredoutcome.Theparametersofthenetworkarethenadjustedtowardtheoptimalthatminimizethelossfunctionoverthedata

(2.14)

whereisthenumberofdatapointsandisthepredictionofaneuralnetworkwithparameters,forsamplewiththetruevalue.Thequantityisalsoknownastheempiricalrisk.Onesuchexampleisthemean-squarederror(MSE)loss

(2.15)

MaximizingthelikelihoodofGaussiandatawithrespecttotheparametersoftheassumedmodelthatgeneratedthedataisequivalenttominimizingtheMSE,makingitapopularchoiceforregressiontasks(i.e.whentheoutputlayeractivationislinearorReLU).Forclassificationnetworkswithalogisticorsigmoidactivationoutput,asuitablelossfunctionisthecross-entropylossfunction

(2.16)

whereisthenumberofclasses333If,thenthelabelvectorcould,forexample,taketheform..SimilarlytoMSE,thislossfunctionisalsomotivatedbythefactthatminimizingthecross-entropylossisequivalenttomaximizingthelikelihoodofuniformlydistributedi.i.d.data(yao2019negative).Sofar,thelossfunctionswehaveseenrequirealabelasapartoftheinput.Thismakesthemsupervisedlearninglosses.Manycommonlyusedlossfunctionsexistthatdonotrequirelabels,thosearecalledunsupervisedlearninglosses.Oncewehavedecidedonalossfunctiontominimize,thenextstepistochoosetheoptimizationalgorithm.ThemostpopularonesareimplementedinsoftwarelibrariessuchasKeras(chollet2015keras),MXNet(chen2015mxnet),Tensorflow(abadi2016tensorflow),Pytorch(NEURIPS2019_9015)andseveralothers.Themostcommonwayoftrainingdeepneuralnetworksisbyemployingavariationofthegradientdescent(curry1944method)algorithm.Gradientdescentmethodstakeadvantageofthefactthatafunctiondecreasesthefastestinthenegativedirectionofitsgradient,convergingatalocalminimum.Analgorithmcalledbackpropagation(linnainmaa1970representation)computesthegradientofthelossfunctionwithrespecttoeachparameter(e.g.weightsandbiases)viathechainrulefromcalculus.Thesegradientsarethenusedforanupdatestepforeachparameter:

(2.17)

wherekeepstrackoftheindexoftheiteration.Theparameterisknownasthelearningrateoftheoptimizationalgorithm.TheclassicalgradientdescentmethodinEquation2.17calculatestheaveragelossovertheentiredataset.Thiscanbemadefaster,withoutlosingconvergenceguarantees,byperformingaweightupdateusingthegradientfromonlyasubsetofthetrainingdata–oratrainingbatch–ineachiteration.Thisstochasticapproximationofgradientdescentiscalledstochasticgradientdescent.Agoodlearningrateisimportantforthepracticalconvergenceofstochasticgradientdescent:ifitisverysmall,thenthetimeittakestoconvergecanbetoolong.However,ifitistoolarge,thenthereisariskofovershootingthelocalminima.Itisgenerallygoodtostartoffwithalargerlearningrateandthenmakeitsmallerwithtime.Determiningexactlywhentodecreasethesizeofthelearningrate,andbyhowmuch,canbelaboriousinpractice.Forthisreason,therehavebeenproposedseveralgradientdescentmethodsthatautomaticallyfindthislearningrateschedulewithadaptivelearningrates,forexample,rmsprop(tieleman2012lecture)andAdam(kingma2014adam).

2.2.4 Convolutionalneuralnetworks

ThenetworkinFig.2.1isafully-connectedordenseneuralnetwork,becauseeveryunitisconnectedtoeveryunitintheprecedinglayer.Thereareother,morespecialized,neuralnetworksthatarenotfully-connected,oneofthemostimportantclassbeingconvolutionalneuralnetworks.Forinputdatawithaspatialstructure,forexampleimages,convolutionalneuralnetworksareveryefficient.Incontrasttodenseneuralnetworks,eachunitinconvolutionalneuralnetworksonlyreceivesasinputasubsetoftheoutputsfromthepreviouslayer.Morespecifically,eachunitonlyreceivesinputsfromunitsthatareinspatialproximityofoneanother.WaldoTobler’sFirstLawofGeographycapturessuccinctlythemotivationbehindconvolutionalnetworks(tobler1970computer):"everythingisrelatedtoeverythingelse,butnearthingsaremorerelatedthandistantthings".Anotherkeypropertyofconvolutionalneuralnetworksistheoneofsharedweights–eachcomputationalunitinthesamelayerhasthesamesetofweights,eventhoughtheyreceivedifferentinputs.Thesepropertieshavethepracticalconsequencethatthenumberofparametersiscutdownsubstantially:eachneuronprocessingagrayscaleimagewouldrequireweights,whichisthenmultipliedagainbythenumberofneuronsinthelayerforthetotalparametercount.Ontheotherhand,aconvolutionallayerwhereeachneuronprocessesawindow(arelativelylargewindowsize)aroundapixelwouldrequireparameters–forthewholelayer,duetothesharedweights.Thus,processingtheimageinputinthisexamplewithaconvolutionallayerinsteadofadenselayerwithasingleneuronreducesthenumberofparametersbyafactorof.Inadditiontotheactivationfunction,wespecifythevalueoffourhyperparametersforconvolutionallayerswhenwedescribespecificnetworkarchitecturesinthisthesis:thenumberoffilterstoslidealongtheheightandwidthoftheinput,thesize,orthereceptivefield,ofthefilters,thestrideandhowmuchzeropaddingtouse.Thereceptivefielddictatesthesizesofthespatialdimensions(height,width)oftheinputthattheneurontakes.InFigure2.1(b),forinstance,wewouldsaythatthefiltersizeis(),despitethedimensionoftheweightsbeing()–thisisduetothesizeoftheinputdepth.Notethateventhoughtheseheightandwidthvaluesareconstrained,theneuronalwaysprocessesthefulldepthoftheinput.Thestridecontrolshowmanystepsthefilterstakeastheinputisprocessedalongitsspatialdimensions.Zeropaddingamountstoaddingrowsandcolumnsaroundtheinput,composedentirelyofzeros,alsoalongthedepth.Paddingisoftendonetomaketheoutputvalid,forinstance,ifthestrideorfiltersizewouldpotentiallycauseaneurontoprocessinputsthatare"outofbounds".Thisisoftencalledvalidpadding.Anotherpurposeofpaddingistopadtheinputwithzerostokeeptheoriginalspatialdimensionsunchanged,thisiscalledsamepadding.Forexample,zeroscanbeaddedaroundanimageofsizebeforeitisprocessedbyanetworkthatrequiresaninputof.TherelationshipbetweentheinputandoutputofaconvolutionalfilterisillustratedinFigure 2.1(a).InFigure 2.1(b)weaddadepthdimension.Inourexample,thestrideis1.However,intheexample,ifwewouldwanttoincreasethestridevaluethenwewouldhavetointroducezeropadding.

(a) Inputwith(heightwidthdepth)dimensionsof().
(b) Inputwith(heightwidthdepth)dimensionsof().
Fig 2.2: Aconvolutionallayer.Theinputisprocessedbyasingleconvolutionalfilterwithareceptivefieldof,astrideofone,nozeropaddingandtheidentityactivationfunction.Thesubsetoftheinputthatisincludedinthecalculationoftheupperrightelementoftheoutputishighlightedbythedottedboxes.

Generally,anydeepneuralnetworkiscalledaconvolutionalneuralnetworkifithasoneormoreconvolutionallayers.Thisholdstrueevenifnotallthelayersareconvolutionallayers.Somepopularlayertypesinclude:

  • Subsamplinglayersthatkeeponlyeverythrowandcolumntoreducethecomputationalcomplexity.Thisisusuallyonlydoneasafirstpreprocessingstepforveryhighdimensionalinputs,wherethrowingawaytheinformationisnotasharmfulasintheintermediatelayersofthenetwork.

  • Maxpoolinglayersslidealongthewidthandheightofeachdepthsliceintheinputandreturnthelargestsinglevalueintheirwindow.Theyreducethecomputationalcomplexitybyreducingtheinputdimension,andtheymaketherepresentationapproximatelyinvarianttosmalltranslations.

  • Flatteninglayersaretechnicallayersthatre-shapetensororarrayinputstovectors.

  • Normalizationlayersforre-centeringandre-scalinginputstolayers.Theyhelpspeedingupandstabilizingthelearningprocess.

Deepneuralnetworksoftenhaveanumberofparametersinthethousandsorbillions.Thismakestheinterpretationofthecalculationsdifficult,especiallyduetothenumberoflayers.Visualizationsofthefirstfewconvolutionallayersintrainednetworkshasbeendone(zeiler2014visualizing),withtheresultthatthefirstlayer’sfiltersusuallycaptureedges,corners,andcolorcombinations.Thesecondlayerthencombinesthesefeaturesintomorecomplicatedpatterns.Higherfiltersthencombinethesefeaturesfurtherintotextures,objectpartsorevenwholeobjects.

2.3 Representationlearning

Incomputerscienceingeneral,andmachinelearninginparticular,thechoiceoftherepresentationofthedatathatisbeingprocessediscrucial.Thiscouldmeanchoosingtherightdatastructureforthetask,suchasdesigningadatabaseforfastsearching.Thiscouldalsomeanchoosingtherightindependentvariablesforastatisticalmodel.Theextractionofusefulinformationaboutthedataisthusanimportanttask.Thisisespeciallytrueiftheinputisfromahigh-dimensionalspace,withthetermcurseofdimensionality(bellman1957dynamic)beingusedsincethelate1950sfordescribingproblemsofthisnature:theamountofdataneededtomakestatisticallysignificantclaimsgrowsexponentiallywiththedimensionalityofthespacethatthedataresidesin.Thismakesthediscoveryofmethodsforreducingthedimensionalityoftheinput,withoutdiscardingimportantinformation,anattractiveprospect.

2.3.1 Supervisedrepresentationlearning

Representationsariseinartificialneuralnetworks(ANNs)whentheyaretrainedforaregressionorclassificationobjective.OneviewofANNsisthatthehiddenlayersperformfeatureextractionontheinput,transformingittoamoresuitableformfortheoutputlayerthatperformsthefinalcalculationsforthetask.sharif2014cnnmadeuseofthisinsightbypre-processinginputsforsupervisedlearningmodelswiththeintermediatelayeroutputsofaconvolutionalnetworkthatwaspre-trainedonanobjectclassificationtask.Theyachievedimpressiveresultsonadiverserangeoftasks,suchasimageretrievalandscenerecognition.ThehierarchicalstructureofANNsalsohasthetheoreticalandpracticaladvantagethatthefeaturesateachlevelarere-usedforthedifferentfeaturesatthehigherlevel.

2.3.2 Unsupervisedrepresentationlearning

ForhierarchicalmethodslikeANNs,representationsaregeneratedatthesametimeasthewholesystemistrainedtominimizeerroronhumantaggeddata.Mostotherrepresentationlearningmethodsareunsupervisedandareabletolearnusefulfeaturesonunlabeleddata.

Pca

Principalcomponentanalysis(PCA)isanunsupervisedlearningmethodinventedbypearson1901liii.Themethodfindsanorthogonallineartransformationforzero-mean,-dimensionaldata.Thecolumnsofaretheprincipalcomponentsofthedata,whichpointtothedirectionofthegreatestvarianceinthedata.Thefirstprincipalcomponentisthesolutiontotheequation

(2.18)

ThesecondcomponentisfoundbyapplyingEquation2.18againtothetransformeddatathatisgivenbyremovingthecontributionofthefirstcomponentfrom,,andsoon.Allthecomponentscanalsobefoundsimultaneouslybycomputingtheeigendecompositionofthedata’scovariancematrix,asithasbeenshownthattheprincipalcomponentsareequaltotheresultingeigenvectors(shlens2014tutorial).Dimensionalityreductioncanbedonebycreatingamatrix,consistingonlyofthefirstprincipalcomponents,andapplyingittothedata.Thisyieldsthelower-dimensional,transformeddatamatrix.Bydoingthis,thedataisprojectedontothesubspacewiththemaximumvariance.Thismatrixofprincipalcomponentshasthepropertyofminimizingthereconstructionerror.InFigure2.3,weshowavisualizationoftheUCIMLdigitsdataset,whichconsistsofgrayscaleimagesofhand-writtendigits.Thefigureshowstheresultafterthedataisprojectedonitsfirsttwoprincipalcomponents,showingclearclusteringofthedigits.Wealsoshowtheclusteringfoundbyt-distributedstochasticneighborembedding(discussedbelow),whichseparatestheclustersmorecleanlyforthisdataset.

{adjustwidth}

0in0in\subfloat[Samplesfromthedigitsdatabase.]\subfloat[PCAvisualizationofdigits.]  \subfloat[t-SNEvisualizationofdigits.]

Fig 2.3: DimensionalityReductionbyPCAandt-SNE(a)Thedigitsdatabaseconsistsofpixelimagesofhand-drawndigits.(b)Weplotthefirsttwoprincipalcomponentsofthedigitsdatasetandcolorbydigitlabel.(c)Wevisualizethetwo-dimensionalembeddingofthedigitdatasetaslearnedbyt-distributedstochasticneighborembedding.
t-SNE

Overthelastfewyears,t-distributedstochasticneighborembedding(t-SNE)(maaten2008visualizing)hasbecomeoneofthemostpopulardimensionalityreductiontechniquesforvisualization(arora2018analysis).Theassumptionbehindthealgorithmisthatthehigh-dimensionalinputdataliesonalocallyconnectedmanifold.First,anauxiliaryasymmetricmeasurebetweeneachpairofdatapointsiscalculatedaccordingtotheequation

(2.19)

whereasitisofprimaryinteresttomodelpairwisesimilarities.TheconstantisthevarianceofaGaussianthatiscenteredaroundandcontrolshowinfluentialnearbydatapointsareincontrasttofarawaydatapoints.Thenthepairwisesimilaritiesarecomputedusingtheformula

(2.20)

thesesimilaritiesaredefinedtobesymmetrizedconditionalprobabilitiestoensurethateachdatapointmakesasignificantcontributiontothecostfunction.Next,alower-dimensionalvectorofdatapointsYiscreatedandinitializedrandomly.EachelementinYcorrespondstoanelementinX:similaritiesbetweendatapointsanarecalculatedaccordingtotheformula

(2.21)

Thelow-dimensionaldatapointsarethenmovedaroundtominimizetheKLdivergence--ameasureofthedifferencebetweentwoprobabilitydistributions444Notethatandcanbeinterpretedasprobabilitiessinceandandforalland.–betweenand

(2.22)

Equation2.22isminimizedusinggradientdescent,ensuringthatpointsthataresimilarinthehigh-dimensionalspacearealsosimilarinthenew,low-dimensionalspace.

Autoencoders

Theautoencoderisatypeofneuralnetwork555Autoencoderscanconsistofanytypesofneuralnetworklayers.Forexample,anautoencodermadeupofconvolutionallayersiscalledaconvolutionalautoencoders.thatconsistsofanencoderpart,whichmapstheinputtoanencoding(usuallyofasmallersizethantheinput),andadecoderpart,thatoutputsareconstructionoftheinput(Figure2.4).

Fig 2.4: Anautoencoder.Thisisanexampleofaconvolutionalautoencoderthatistrainedforreconstructingsealphotographs.Theencodingvectorisusuallymuchsmallerthanthetotalnumberofpixelsintheinput.

Theobjectivefunctionisthesquarederrorbetweentheinputandtheoutput

(2.23)

Usefulrepresentationsariseinthisprocessifthecorrectconstraintsareplacedonthesystem.Withoutconstraints,thesystemcouldenduplearningtheidentityfunction,thattriviallysatisfiestheobjectivefunction:.Thisproblemcanbeovercomebyincludingahiddenlayerintheautoencoderofalowerdimensionalitythantheinputspace.Inthiscase,theautoencoderissaidtobeundercomplete.Undercomplete,single-layerautoencoderswithlinearactivationfunctionsarealmostequivalenttoPCA.The-dimensionalhiddenlayerspansthesamesubspaceasthefirstprincipalcomponents,ortheprincipalsubspace,ofthedata(baldi1989neural).UnlikePCA,however,theweightsofthehiddenlayerarenotguaranteedtobeorthonormalnorordered.Ifthesmallestdimensionalityofahiddenlayerislargerthanthesizeoftheinput,theautoencoderissaidtobeovercomplete.Overcompleteautoencoderscanbepreventedfromlearningtheidentityfunctioniftheobjectivefunction(Equation2.23)iscombinedwitharegularizationterm.Forexample,insteadofreconstructingtheoriginalinputasitis,vincent2008extractingproposethatthegoalcouldbetorecovertheinputafterithasbeencorruptedwithnoise(e.g.Gaussiannoiseorsalt-and-peppernoise).Theideabehindthisisthattheautoencoderhastolearnrepresentationsthatarestableandrobustunderthecorruptionoftheinput,andthatthedenoisingtaskextractsausefulstructureoftheinputdistribution(vincent2010stacked).

2.3.3 Self-supervisedlearning

Approachesthatlearnrepresentationsbywayofsolvingauxiliarytasks,inthesensethattherepresentationthatarisesfromtheoptimizationismoreimportantthanachievingagoodperformanceonthetaskitself,issometimescalledself-supervisedlearningintheliterature(gogna2016semi).Forexample,ha2018worldtrainanautoencodertoreconstructtheobservationsinanRLenvironment,buttheydonottakeadvantageofthereconstructivecapabilitiesofthenetworkwhentheytraintheirRLpolicies,andtheyuseonlytheencoderpartofthenetworkforvisualpre-processing.Anotherexampleisthedenoisingautoencoderfromthepreviouschapter.Onedirectionofself-supervisedlearningthathasbeenfollowedintheliteratureistoinventataskthatrequiresanunderstandingofthedomaintosolvecorrectly,suchasthereconstructivenetworkusedbyha2018world.Anotherexampleistheworkbygidaris2018unsupervised,wholearnrepresentationsbyapplyingrandomrotationstonaturalimagesandtrainanetworktopredictwhichrotationwasapplied.Thisencouragesthenetworktolearnhigh-levelconcepts,suchasbeaks,wingsandtalons,andtheirrelativepositions.Morecommonly,self-supervisedlearningmethodsconsistofobscuringsomepartoftheinputandtrainamodeltopredictthatpartgivensomeothersubsetoftheinput,aswedoinChapter4andChapter5.Somevariationsofthisideainclude,forexample,colorization(zhang2016colorful),wherecolorfulimagesareconvertedtograyscalewiththegoalofpredictingtheoriginalcolors.

Chapter 3 Learninggradient-basedICAbyneurallyestimatingmutualinformation

Inthischapter111Thischapterisadaptedfrom(hlynsson2019learning),whichwaspublishedintheJointGerman/AustrianConferenceonArtificialIntelligence(KünstlicheIntelligenz).,weintroduceanovelmethodoftrainingneuralnetworksinanunsupervisedmannertooutputstatisticallyindependentcomponents,amethodwecallGrICA.Weuseamutualinformationneuralestimation(MINE)network(belghazi2018mine)toguidethelearningofanencodertoproducestatisticallyindependentoutputs.Thisisarecentmethodofestimatingthemutualinformationofrandomvariablesinadeeplearningsetting,andweapplyittogetaqualitativelyequalsolutiontoFastICAonblind-source-separationofnoisysources.WeinvestigatetheusefulnessofourmethodincontrasttoarepresentationlearnedbyaconvolutionalautoencoderforpreprocessingvisualinputsforanRLagent,butthecomparisonisunfavorableforourapproach.Therestofthischapterhasthefollowingorganization:Section 3.1motivatesthedesignofrepresentationswithindependentcomponents.Section 3.2explainstheICAproblemformulation.Section 3.3brieflydiscussesrelatedwork.Section3.4introducesourmethodofusingamutualinformationneuralestimatortoteachaneuralnetworktooutputindependentcomponents.Section3.5showstheexperimentalevaluationofourmethod.Finally,weconcludewithadiscussioninSection 3.6.

3.1 Introduction

Thegeneralobjectiveoftraininganencodertolearnstatisticallyindependent,factorialcodesofthedatahasbeencalledthe"holygrail"ofunsupervisedlearning(schmidhuberunsupervised).Wesuggestthatlearningtorecoverfew,statisticallyindependent,latentvariablesofanRLenvironmentcanspeedupthetrainingofRLagents.Forenvironmentswherehigh-dimensionalobservationsarecreatedfromasmallsetofstatisticallyindependentlatentvariables,thistechniquecouldreducethedimensionalityoftheobservationswithoutdiscardingunnecessaryinformation.AnothertheoreticaladvantageofusingthiskindofapproachinRLsettings,comparedtotheothermethodswedevelopinthisPhDdissertation,isthatitrequiresonlyout-of-contextobservationdatafromtheenvironment.LearningtheGrICArepresentationdoesnotrequirefulltransitionstuples222Weusetodenotethestate,todenotetheaction,todenotetherewardandtodenotethenextstate.andcanthusbeusedwhenthetransitionorrewarddynamicsoftheenvironmentschange.Learningrepresentationsthatoutputstatisticallyindependentfeaturescanbedoneinanynumberofways,forexample,bytryingtomakeeachoutputasunpredictableaspossiblegiventheotheroutputunits(schmidhuber1992learning).Wetaketheapproachofminimizingthemutualinformation,asestimatedbyaMINEnetwork,betweentheoutputunitsofadifferentiableencodernetwork.Thisisdonebysimplealternateoptimizationofthetwonetworks.

3.2 Background

Independentcomponentanalysis(ICA)aimsatestimatingunknownsourcesthathavebeenmixedtogetherintoanobservation.TheusualassumptionsarethatthesourcesarestatisticallyindependentandnomorethanoneisGaussian(jutten2003advances).Thenow-cementedmetaphorisoneofacocktailpartyproblem:severalpeople(sources)arespeakingsimultaneously,andtheirspeechhasbeenmixedtogetherinarecording(observation).Thetaskistounmixtherecordingsuchthatalldialoguescanbelistenedtoclearly.InlinearICA,wehaveadatamatrixwhoserowsaredrawnfromstatisticallyindependentdistributions,amixingmatrix,andanobservationmatrix:

andwewanttofindanunmixingmatrixofthatrecoversthesourcesuptoapermutationandscaling:

Thegeneralnon-linearICAproblemisill-posed(hyvarinen1999nonlinear; darmois1953analyse)asthereisaninfinitenumberofsolutionsifthespaceofmixingfunctionsisunconstrained.However,post-linear(taleb1999source)(PNL)ICAissolvable.Thisisaparticularcaseofnon-linearICAwheretheobservationstaketheform

whereoperatescomponentwise,i.e..Theproblemissolvedefficientlyifisatleastapproximatelyinvertible(ziehe2003blind)andthereareapproachestooptimizetheproblemfornon-invertibleaswell(ilin2004post).Forsignalswithtime-structure,however,theproblemisnotill-posedeventhoughitisfori.i.d.samples(blaschke2007independent; sprekeler2014extension).ToframeICAasanoptimizationproblem,wemustfindawaytomeasurethestatisticalindependenceoftheoutputcomponentsandminimizethisquantity.Therearetwomainwaystoapproachthis:eitherminimizethemutualinformationbetweenthesources(amari1996new; bell1995non; cardoso1997infomax),ormaximizethesources’non-Gaussianity(hyvarinen2000independent; blaschke2004cubica).

3.3 Relatedwork

TherehasbeenaninterestincombiningneuralnetworkswiththeprinciplesofICAforseveraldecades.InPredictabilityMaximization(schmidhuber1992learning),agameisplayedwhereoneagenttriestopredictthevalueofoneoutputcomponentgiventheothers,andtheothertriestomaximizetheunpredictability.Morerecently,DeepInfoMax(DIM)(hjelm2018learning),GraphDeepInfoMax(velivckovic2018deep)andGenerativeadversarialnetworks(goodfellow2014generative),utilizetheworkofBrakeletal.(brakel2017learning)todeeplylearnICA.Ourworkdiffersfromtheseadversarialtrainingmethodsintherulesoftheminimaxgamebeingplayedtoachievethis:oneagentdirectlyminimizesthelower-boundofthemutualinformation,asderivedfromtheDonsker-VaradhancharacterizationoftheKL-Divergence,astheothertriestomaximizeit.

3.4 Method

3.4.1 Reinforcementlearningenvironment

Ourrepresentationistestedona2Denvironmentwheretheagentissupposedtoavoidafieldoflavaandreachagoalontheothersideoftheroom.Thefullstateoftheenvironmentisthewholeroom(Fig3.1,left)andtheobservationisanisometricviewoftheagentanditspointofview(Fig3.1,right).TheobservationsareRGBimagesandtheagentcantakeastepforward,turnleftorturnright.

Fig 3.1: Thelavafieldenvironment.Theworldhastilesandissurroundedbyimpassablewalls.Theagent(redarrow)istaskedwithreachingthegoal,representedbythegreentile,whichyieldsarewardandterminatestheepisode.Theepisodeendswithoutrewardiftheagenttouchesanorangelavatile.Thefullworldstatecanbeseenontheright,withaslightlylighterboxcontainingtheagent.Thisboxhighlightsthesubsetoftheworldthatisperceivedbytheagent,seenontheleft.

Theepisodeterminatesiftheagentstepstowardlavaorthegoal.Theagentreceivesapositiverewardifitreachesthegoal,buttheepisodeterminateswithzerorewardifitstepsintothelava.Thereisnochangeiftheagentisfacedtowardthewallandtakesastepforward.

3.4.2 Learningtheindependentcomponents

Wetrainanencodertogenerateanoutputsuchthatanyoneoftheoutputcomponentsisstatisticallyindependentoftheunionoftheothers,i.e.,where

Thestatisticalindependenceofandcanbemaximizedbyminimizingtheirmutualinformation

(3.1)

Thisquantityishardtoestimate,particularlyforhigh-dimensionaldata.NotethatEquation 3.1canbemoresuccinctlyastheKLdivergencebetweenand:

(3.2)

donsker1975asymptoticfamouslyprovedthattheKLDivergenceadmitstherepresentation

(3.3)

wherethedomainisaclosedandboundedsubsetof.belghazi2018mineintroduceamethodofusingtheDonsker-Varadhanrepresentationtoestimatemutualinformationwithneuralnetworks,withanarchitecturetheycallmutualinformationneuralestimation(MINE)networks.Tolearnrepresentationswithindependentcomponents,wethereforeestimatethelowerboundofEq. (3.1)usingaMINEnetwork:

(3.4)

whereindicatesthattheexpectedvalueistakenoverthejointandsimilarlyfortheproductofmarginals.Thenetworksandareparameterizedbyand.TheencodertakestheobservationsasinputandtheMINEnetworktakestheoutputoftheencoderasaninput.Thenetworkminimizesinorderfortheoutputstohavelowmutualinformationandthereforebestatisticallyindependent.Inordertogetafaithfulestimationofthelowerboundofthemutualinformation,thenetworkmaximizes.Thus,inapush-pullfashion,thesystemasawholeconvergestoindependentoutputcomponentsoftheencodernetwork.Inpractice,ratherthantrainingtheandnetworkssimultaneouslyitprovedusefultotrainfromscratchforafewiterationsaftereachiterationoftraining,sincethelossfunctionsofandareatoddswitheachother.Whentheencoderistrained,theMINEnetwork’sparametersarefrozenandviceversa.

Fig 3.2: Ourindependentfeaturelearningsystem.ThesystemlearnsstatisticallyindependentoutputsbyalternateoptimizationofanencoderandaMINEnetworkparameterizedbyand.TheMINEobjective(Eq. 3.4)isminimizedwithrespecttoforweightupdatesoftheencoder,butitismaximizedwithrespecttoforweightupdatesoftheMINEnetwork.

3.5 Results

Wetryourmethodontwoscenarios:(1)wecompareittocanonicalimplementationsofICAonatextbookexampleofestimatingsourcesfromnoisydataand(2)weuseourmethodwithamorecomplexfunctionapproximatorforpreprocessingobservationsinanRLsetting.

3.5.1 Recoveringnoisysignals

Wevalidatethemethod333Fullcodeforthenoisysignalrecoveryexperimentisavailableatgithub.com/wiskott-lab/gradient-based-ica/blob/master/bss3.ipynbaforlinearnoisyICAexample(sklearn).Threeindependent,noisysources—sinewave,squarewaveandsawtoothsignal(Fig. 3.2(a))—aremixedlinearly(Fig. 3.2(b)):


Theencoderisasingle-layerneuralnetworkwithlinearactivation,withadifferentiablewhiteninglayer(schuler2018gradient)beforetheoutput.Thewhiteninglayerisakeycomponentforperformingsuccessfulblindsourceseparationforourmethod.Statisticallyindependentrandomvariablesarenecessarilyuncorrelated,sowhiteningtheoutputbyconstructionbeforehandsimplifiestheoptimizationproblemsignificantly.TheMINEnetworkisaseven-layerneuralnetwork.Eachlayerbutthelastonehas64unitswitharectifiedlinearactivationfunction.Eachtrainingepochoftheencoderisfollowedbyseventrainingepochsof.Estimatingtheexactmutualinformationisnotessential,sofewiterationssufficeforagoodgradientdirection.SincetheMINEnetworkisappliedtoeachcomponentindividually,toestimatemutualinformation(Eq. 3.4),weneedtopasseachsamplethroughtheMINEnetworktimes—onceforeachcomponent.Equivalently,onecouldconceptualizethisashavingcopiesoftheMINEnetworkandfeedingthesamplestoitinparallel,withdifferentcomponentssingledout.Thus,forsamplewefeedin,foreach.BothnetworksareoptimizedusingNesterovmomentumADAM(dozat2016incorporating)withalearningrateof.Forthissimpleexample,ourmethod(Fig. 3.2(c))isequivalentlygoodatunmixingthesignalsasFastICAasimplementedinthescikit-learnpackage(scikit-learn)(Fig. 3.2(d)).Notethat,ingeneral,thesourcescanonlyberecovereduptopermutationandscaling.

(a) Theoriginalsources.
(b) Linearmixtureofsources.
(c) Sourcesrecoveredbyourmethod.
(d) SourcesrecoveredbyFastICA.
Fig 3.3: Noisysignalrecovery.Threeindependent,noisysources(a)aremixedlinearly(b).Ourmethodrecoversthem(c)tothesameextentasFastICA(d).

3.5.2 Lavafieldenvironment

Fortheseexperiments,welearnourICAfeaturesusingaconvolutionalneuralnetwork.Werollout100episodeswithafullyrandompolicytogatherdataforlearningtheindependentfeatures:anagentisplacedintheupperrightcorneroftheenvironmentandturnsleft,rightortakesastepforwardwithequalprobabilities.Theresultobservationsarethengathereduntiltheepisodeterminates.Thisgivesus4130observationstotraintherepresentationonforthisexperiment.Therepresentationwelearnis32-dimensional.TheMINEnetworkisthesameasabove,buttheencodernetworkisafive-layerconvolutionalneuralnetwork.Thefirsttwolayersareconvolutionallayerseachwith32filters,arectifiedlinearunitactivationandnopadding.Thefirstlayerhasastrideof4andthesecondoneastrideof3.Theoutputoflayer2isthenpassedtoaflatteninglayer,reshapingthetensoroutputtoavectorinputforalineardenselayerwith32units.Theoutputofthedenselayeristhenfinallypassedtoaspheringlayer,givingustheencoding.ThenetworkdescriptionissummarizedinTable 3.1.

Layer Filters Kernel Stride Padding Output Learnable
Shape Parameters
Input - - - - 0
Conv. 32 3x3 4 None 896
ReLU - - - - 0
Conv. 32 3x3 3 None 9248
ReLU - - - - 0
Flatten - - - - 512 0
Dense - - - - 32 16416
ReLU - - - - 0
Sphering - - - - 32 0
Table 3.1: OurconvolutionalICAnetwork.ThelayoutisinspiredbythetopologyofMnih’sdeepQnetworks(mnih2013playing)exceptforthelastlayer,wherewehaveadifferentiablespheringoperationinsteadofadenselayer.

ThetrainingofourICArepresentationfollowsthesameschemeasbefore:wetraintheestimatorforsevenepochsaftereachtrainingepochoftheencoder.Wetrainedtheencoderfor100epochsandtheestimatorfor700epochsforthisexperiment.OurtrainedrepresentationisusedtopreprocessthevisualinputforaRLagent.WechooseActorCriticusingKronecker-FactoredTrustRegion(ACKTR)asimplementedbyStableBaselineswithdefaultparametersandmodel.TheACKTRdefaultmodelisafully-connectedneuralnetworkwithtwolayersof64unitseachandatanhactivationfunction.WetrainedanACKTRmodelfromscratchtwentytimesonourICArepresentation,andshowtheresultsinFig 3.4.Thisindicatesthatweareabletolearntheenvironmentusingourmethodtopreprocesstheinputforareinforcementlearningmethod.

Fig 3.4: GrICArewardduringtrainingonthelavafieldenvironment.Thelineindicatestheaveragereward(every1500trainingsteps)over20differentagentstrainedfromscratch.Theerrorbandsindicateonestandarddeviationfromthemean.

Tovisualizethebehaviorofouragent(Figure 3.5),wechoosethreesuccessfulepisodesfromafully-trainedmodelafter100thousandtimestepsoftrainingandthreeunsuccessfulonesfromamodelwith80thousandtimestepsoftraining.Itisnoteworthythattheagentprefersawidemarginbetweenitselfandthelavafieldasitpassesit,eventhoughamoreoptimalstrategywouldhavetheagentwalktotherightwiththelavaleftimmediatelyonitsleft-handside.Theagentalsosometimesdoublesbackbeforecontinuingtowardthegoalagain.

Fig 3.5: Trajectoriesinthelavafieldenvironment.Thefirststepinthetrajectoryisindicatedbyblue,thenthecolorwarmsupwitheachstepuntilitbecomesamoresaturatedredcolorinthefinalstep.

Wealsotriedtoseewhetherourmethodgeneralizestoavariantoftheenvironmentwherethelowerrowoflavaismovedtothebottom,punishingourstrategy.Therewereonly3successesinathousandtestiterations,showninFigure 3.6,alongwiththreeoftheunsuccessfulepisodes.

Fig 3.6: Trajectorieswithshiftedlavafields.Thisvariantpunishesstrategies–suchastheonelearnedusingourrepresentation–wheretheagentseekstogoallthedownforsafetyasitgoesacrosstheroom.

Forcomparison,wealsotrainedaconvolutionalautoencoder(CAE)forreconstructiononthesamedatasetweusedtotrainourICArepresentation.WerantheexperimentagainwiththeresultingencodingaftertheCAEwastrained.Theencoderisthesameasthenetworkusedforourrepresentation,exceptthatitdoesnothavethespheringlayer.Thedecodingportionofthenetworkconsistsofa392-unitdenselayerwhoseoutputisreshapedtoatensorandpassedtoaconvolutionallayerwith32filtersofsize.Theoutputisthenup-sampledtoquadruplethewidthandheightofthetensor.Thisisthenfollowedbyanotherconvolutionallayer,ofthesamekindasthepreviousone,andanotherup-samplinglayerthatdoublesthewidthandheightofthetensor.Theoutputthenfinallygoesthroughaconvolutionallayerwith3filtersofsize.EachlayerinthedecoderhasaReLUactivation,exceptforthelastwhichhasalogisticactivationtoreconstructpixelvaluesthathavebeennormalizedlieintherange.Eachconvolutionallayerdoeszero-paddingtopreservethewidthandtheheightoftheinputtensor.SeeTable3.2foranoverviewofthearchitecture.

Layer Filters Kernel Stride Padding Output Learnable
Shape Parameters
\rowcolorlightblueInput - - - - 0
\rowcolorlightblueConv. 32 3x3 4 None 896
\rowcolorlightblueReLU - - - - 0
\rowcolorlightblueConv. 32 3x3 3 None 9248
\rowcolorlightblueReLU - - - - 0
\rowcolorlightblueFlatten - - - - 512 0
\rowcolorlightblueDense - - - - 32 16416
\rowcolorlightblueReLU - - - - 32 0
\rowcolorlightredDense - - - - 392 12936
\rowcolorlightredReLU - - - - 0
\rowcolorlightredReshape - - - - 0
\rowcolorlightredConv. 32 3x3 1 Same 2336
\rowcolorlightredReLU - - - - 0
\rowcolorlightredUpsampling - 4x4 - - 0
\rowcolorlightredConv. 32 3x3 1 Same 9248
\rowcolorlightredReLU - - - - 0
\rowcolorlightredUpsampling - 4x4 - - 0
\rowcolorlightredConv. 3 3x3 1 Same 9248
\rowcolorlightredTanh - - - - 867
Table 3.2: Convolutionalautoencodernetworkarchitecture.Theblueparthighlightstheencoderportionandthegreenparthighlightsthedecoderportion.TheencoderportionwasusedtopreprocesstheinputtotheRLlearnerfortheexperiment.

WetraintheCAEfor50epochs.Eventhoughthisisalownumberofepochscomparedtothetrainingforourmethod,itsreconstructivepropertiesarealreadyquitegood(Figure3.7).

Fig 3.7: Reconstructionbyautoencoder.The32-dimensionallatentspacecapturesenoughinformationtoreconstructtheoriginalobservationsquiteaccurately,evenafteramodestnumberoftrainingepochs.

Werepeattheexperimentasbefore,butnowwiththeCAEfeaturesinsteadofourICArepresentation.ThetrainingcurveisshowninFigure3.8.Thisstraightforwardbaselinealgorithmlearnstosolvetheenvironmenttwiceasfastasourmethod.

Fig 3.8: CAErewardduringtrainingThelineindicatestheaveragereward(every1500trainingsteps)over20differentagentstrainedfromscratch.Theerrorbandsindicateonestandarddeviationfromthemean.

3.6 Conclusion

WehaveintroducedanoveltechniquefortrainingadifferentiablefunctiontoperformICA.Themethodconsistsofalternatingtheoptimizationofanencoderandaneuralmutualinformationneuralestimation(MINE)network.Themutualinformationestimatebetweeneachencoderoutputandtheunionoftheothersisminimizedwithrespecttotheencoder’sparameters.ThesolutionlearnedbyourapproachagreeswiththeonelearnedbythecanonicalICAalgorithm,FastICA.Anadvantageofourmethod,however,isthatitistriviallyextendedforovercompleteorundercompleteICAbychangingthenumberofoutputunitsoftheneuralnetwork.Weapplyouralgorithmonhigh-dimensionaldatatotesttherepresentationlearnedbyourmethodfordimensionalityreductionofvisualinputsforanRLagent.Theagentisabletouseourrepresentationtolearnhowtosolveasimplenavigationtask,butthepreprocessingofferedbyourapproachisoutperformedbyaconvolutionalautoencoder.Ourmethodworksinprinciple,ascanbeseenbythenoisysignalrecoveryexperiment,butitseffectivenessforlearningrepresentationsforRLagentsremainsunproven.Eventhoughtheobservationsofthelavafieldenvironmentarefullydeterminedbythreelatentvariablesthatarestatisticallyindependent,theagent’sxandypositions,alongwithitsdirection,ourrepresentationwasstillnotusefulenoughtobeattherelativelysimplebaselines.

Chapter 4 Latentrepresentationpredictionnetworks

Inthischapter111Thischapterisadaptedfrom(hlynsson2020latent).,weintroducearepresentationlearningtechniqueforRLsettingsthatwenameLatentRepresentationPrediction(LARP).ThisnovelsystemtakesadvantageofmoreinformationgivenbytheenvironmentthanourGrICAmethodfromthepreviouschapter,thatonlylearnedfromstaticobservationswithouttakingadvantageoftheknowledgethatthesystemwillbeusedinadynamicsetting.Thatistosay,wewillnowutilizethetripletsfortraining.Ouralgorithmlearnsastaterepresentation,alongwithafunctionthatpredictshowtherepresentationchangeswhentheagenttakesgivenactionsintheenvironment.Insteadofusingoursystemtopreprocessinputsforamodel-freereinforcementlearner,aswedidinthepreviouschapter,nowwetakeadvantageofapredictionfunction,whichisusedasaforwardmodelforsearchonagraphinaviewpoint-matchingtask.Usingarepresentationthatislearnedtobemaximallypredictableforthepredictorisfoundtooutperformpretrainedrepresentations.Thedata-efficiencyandoverallperformanceofourapproachisshowntorivalstandardreinforcementlearningmethods,andourlearnedrepresentationtransferssuccessfullytonovelenvironments.Therestofthechapterisorganizedasfollows:inSection 4.1,wemotivatetheusefulnessofrepresentationsthatarepredictableinthescopeofvisualplanning,andwementionhowweintendtoovercomeacommonpitfallintheirdesign.WemoveonwithdiscussingthemainclassesofrelatedworkinSection 4.2,andsummarizethemostrelevantarticlesfromamongthem.InSection 4.3,wediscussinconcretedetailthedesignofourrepresentation,howweuseitforplanning,andweintroduceanexperimentalenvironmentofourowndesign.TheresultsofourexperimentsarepresentedinSection 4.4,andweconcludewithadiscussionoftheproposedmethodologyandprospectsforfutureworkinSection 4.5.

4.1 Introduction

Deeply-learnedplanningmethodsareoftenbasedonlearningrepresentationsthatareoptimizedforunrelatedtasks.Forexample,theymightbetrainedtoreconstructobservationsoftheenvironment,suchastheconvolutionalautoencoderfromthepreviouschapter.Theserepresentationsarethencombinedwithpredictorfunctionsforsimulatingrolloutstonavigatetheenvironment.Weproposetoratherlearnrepresentationssuchthattheyaredirectlyoptimizedforthetaskathand:tobemaximallypredictableforthepredictorfunction.Thisresultsinrepresentationsthatarewell-suited,bydesign,forthedownstreamtaskofplanning,wherethelearnedpredictorfunctionisusedasaforwardmodel.Whilemodernreinforcementlearningalgorithmsreachsuper-humanperformanceontaskssuchasgameplaying,theyremainwoefullysampleinefficientcomparedtohumans.Analgorithmthatisdata-efficient (hlynsson2019measuring)requiresonlyfewsamplesforgoodperformanceandthestudyofdata-efficientcontroliscurrentlyanactiveresearcharea(corneil2018efficient; buckman2018sample; du2019good; saphal2020seerl).Dimensionalityreductionisapowerfultoolforincreasingthedata-efficiencyofmachinelearningmethods.Therehasbeenmuchrecentworkonmethodsthattakeadvantageofcompact,low-dimensionalrepresentationsofstatesforsearchandexploration (kurutach2018learning; corneil2018efficient; xu2019regression).Oneoftheadvantagesofthisapproachisthatagoodrepresentationaidsinfasterandmoreaccurateplanning.Thisholdsinparticularwhenthelatentspaceisofmuchlowerdimensionalitythanthestatespace(hamilton2014efficient).Forhigh-dimensionalinputs,suchasimagedata,arepresentationfunctionisfrequentlylearnedtoreducethecomplexityforacontroller.Indeepreinforcementlearning,therepresentationandthecontrollerarelearnedsimultaneously.Similarly,arepresentationcaninprinciplebelearnedalongwithaforwardmodelforclassicalplanninginhigh-dimensionalspace.WedothiswithourLARPnetwork,whichisaneuralnetwork-basedmethodforlearningastaterepresentationandatransitionfunctionforplanningwithinthelearnedlatentspace(Fig. 4.1).

Fig 4.1: Conceptualoverviewofourmethod.Theimportantcomponentsaretherepresentationnetworkalongwiththepredictornetwork.Together,theycompriseaLARPnetwork,whichisutilizedbyaplanningalgorithm.

Duringtraining,therepresentationandthepredictorarelearnedsimultaneouslyfromtransitionsinaself-supervisedmanner.Wetrainthepredictortopredictthemostlikelyfuturerepresentation,givenacurrentrepresentationandanaction.Thepredictoristhenusedforplanningbynavigatingthelatentspacedefinedbytherepresentationtoreachagoalstate.Optimizingcontrolinthismanner,afterlearninganenvironmentmodel,hastheadvantageofallowingforlearningnewrewardfunctionsinafastanddata-efficientmanner.Aftertherepresentationislearned,wefindsaidgoalstatebyconventionalpathplanning.Disentanglingtherewardfromthetransitionfunctioninsuchawayishelpfulwhenlearningformultipleorchangingrewardfunctions,andaidswithlearningwhenthereisnorewardavailableatall.Thus,itisalsogoodforasparseoradelayed-rewardsetting.Aproblemthatcanariseinrepresentationlearningistheoneoftrivialfeatures.Thiscanhappenwhenthemethodisoptimizinganobjectivefunctionthathasastraightforward,butuseless,solution.Forexample,SlowFeatureAnalysis(SFA)(wiskott2002slow)hastheobjectiveofextractingthefeaturesoftimeseriesdatathatvarytheleastwithtime.Thisiseasilyfulfilledbyconstantfunctions,soSFArequiresthattherepresentationshaveavarianceof–whichconstantfunctionscannotfulfill.Constantfeatureswouldsimilarlybemaximallypredictablerepresentationsforoursystem.Therefore,westudythreedifferentapproachestopreventthistrivialrepresentationfrombeinglearned,weeither:(i)designthearchitecturesuchthattheoutputissphered,(ii)regularizeitwithacontrastivelossterm,or(iii)includeareconstructionlosstermalongwithanadditionaldecodermodule.Wecomparetheseapproachesandvalidateourmethodexperimentallyonavisualenvironment:aviewpoint-matchingtaskusingtheNORBdataset(lecun2004learning),wheretheagentispresentedwithastartingviewpointofanobjectandthetaskistoproduceasequenceofactionssuchthattheagentendsupwiththegoalviewpoint.AstheNORBdatasetisembeddableonacylinder(hadsell2006dimensionality; schuler2018gradient)orasphere(wang2018toybox),wecanvisualizetheactionsastraversingtheembeddedmanifold.Ourapproachcomparesfavorablytostate-of-the-artmethodsonourtestbedwithrespecttodata-efficiency,butourasymptoticperformanceisstilloutclassedbyotherapproaches.

4.2 Relatedwork

Mostoftherelatedworkfallsintothecategoriesofreinforcementlearning,visualplanning,orrepresentationlearning.Theprimarydifferencebetweenoursandothermodel-basedmethodsisthattherepresentationislearnedbyoptimizingauxiliaryobjectiveswhicharenotdirectlyusefulforsolvingthemaintask.

4.2.1 Reinforcementlearning

Therearemanyworksintheliteraturethatalsoapproximatethetransitionfunctionofenvironments,forinstancebyperformingexplicitlatent-spaceplanningcomputations(tamar2016value; gal2016improving; henaff2017model; srinivas2018universal; chua2018deep; hafner2019learning)aspartoflearningandexecutingpolicies.gelada2019deepmdptrainanRLagenttosimultaneouslypredictrewardsaswellasfuturelatentstates.Ourworkisdistinctfromthese,aswearenotassumingarewardsignalduringtraining.ha2018worldcombinevision,memory,andcontrollerforlearningamodeloftheworldbeforelearningadecisionmodel.Apredictivemodelistrainedinanunsupervisedmanner,permittingtheagenttolearnpoliciescompletelywithinitslearnedlatentspacerepresentationoftheenvironment.Themaindifferenceisthattheyfirstapproximatethestatedistributionusingavariationalautoencoder,producingtheencodedlatentspace.Incontrast,ourrepresentationislearnedsuchthatitismaximallypredictableforthepredictornetwork.Similartoourtrainingsetup,oh2015actionpredictfutureframesinATARIenvironmentsconditionedonactions.Thepredictedframesareusedforlearningthetransitionfunctionoftheenvironment,e.g.forimprovingexplorationbyinformingagentsofwhichactionsaremorelikelytoresultinunseenstates.Ourworkdiffersasweareactingwithinalearnedlatentspaceandnotthefullinputspace,andourrepresentationsareusedinaclassicalplanningparadigmwithstartandgoalstatesinsteadofareinforcementlearningone.

4.2.2 Visualplanning

Wedefinevisualplanningastheproblemofsynthesizinganactionsequencetogenerateatargetstatefromaninitialstate,andallthestatesareobservedasimages.VariationalStateTabulations (corneil2018efficient)learnastaterepresentationinadditiontoatransferfunctionoverthelatentspace.However,theirobservationspaceisdiscretizedintoatableusingavariationalapproach,asopposedtoourcontinuousrepresentation.Acontinuousrepresentationcircumventstheproblemofhavingtodeterminethesizeofsuchatableinadvanceorduringtraining.Similarly,cuccu2018playingdiscretizevisualinputusingunsupervisedvectorquantizationandusethatrepresentationforlearningcontrollersforAtarigames.Inspiredbyclassicsymbolicplanning,RegressionPlanningNetworks(xu2019regression)createaplanbackwardfromasymbolicgoal.Wedonothaveaccesstohigh-levelsymbolicgoalinformationforourmethod,andweassumethatonlyhigh-dimensionalvisualcuesarereceivedfromtheenvironment.TopologicalmemoriesoftheenvironmentarebuiltinSemi-parametricTopologicalMemories(savinov2018semi)afterbeingprovidedwithobservationsequencesfromhumansexploringtheenvironment.Nodesareconnectedifapredictorestimatesthattheyareclose.Themethodhasproblemswithgeneralization,whicharereducedinHallucinativeTopologicalMemories(liu2020hallucinative),wherethemethodalsoadmitsadescriptionoftheenvironment,suchasamaporalayoutvector,whichtheagentcanuseduringplanning.Ourvisualplanningmethoddoesnotreceiveanyadditionalinformationonunseenenvironmentsanddoesnotdependonmanualexplorationduringtraining.CausalInfoGAN(kurutach2018learning)andrelatedmethods(wang2019learning)arebasedongenerativeadversarialnetworks(GANs)(goodfellow2014generative),inspiredbyInfoGANinparticular(chen2016infogan),forlearningaplannablerepresentation.AGANistrainedforencodingstartandgoalstates,andtheyplanatrajectoryintherepresentationspaceaswellasreconstructingintermediateobservationsintheplan.Ourmethodisdifferentasitdoesnotneedtoreconstructtheobservationsandtheforwardmodelisdirectlyoptimizedforprediction.

4.2.3 Prediction-basedrepresentationlearning

InPredictableFeatureAnalysis(richthofer2015predictable),representationsarelearnedthatarepredictablebyautoregressionprocesses.Ourmethodismoreflexibleandscalesbettertohigherdimensionsasthepredictorcanbeanydifferentiablefunction.Usingtheoutputofothernetworksaspredictiontargetsinsteadoftheoriginalpixelsisnotnew.Thecasewheretheoutputofalargermodelisthetargetforasmallermodelisknownasknowledgedistillation(bucilua2006model; hinton2015distilling).Thisisusedforcompressingamodelensembleintoasinglemodel.vondrick2016anticipatinglearntomakehigh-levelsemanticpredictionsoffutureframesinvideodata.Givenacurrentframe,aneuralnetworkpredictstherepresentationofafutureframe.Ourapproachisnotconstrainedonlytopretrainedrepresentations,welearnourrepresentationtogetherwiththepredictionnetwork.Moreover,weextendthisgeneralideabyalsoadmittinganactionastheinputtoourpredictornetwork.

4.3 Materialsandmethods

Inthiswork,westudydifferentrepresentationsforlearningthetransitionfunctionofapartiallyobservableMDP(POMDP)andproposeanetworkthatjointlylearnsarepresentationwithapredictionmodelandapplyitforlatentspaceplanning.WesummarizeherethedifferentingredientsoftheLARPnetwork–ourproposedsolution.Moredetaileddescriptionswillfollowinlatersections.Trainingthepredictornetwork:Weuseatwo-streamfullyconnectedneuralnetworktopredicttherepresentationofthefuturestategiventhecurrentstate’srepresentationandtheactionbridgingthosetwostates.Thepredictormoduleistrainedwithasimplemean-squarederrorterm.Handlingconstantsolutions:Therepresentationcouldbetransferredfromotherdomainsorlearnedfromscratchonthetask.IftherepresentationislearnedsimultaneouslywithanestimateofaMarkovdecisionprocess’s(MDP)transitionfunction,precautionsmustbetakensuchthatthepredictionlossisnottriviallyminimizedbyarepresentationthatisconstantoverallstates.Weconsiderthreeapproachesfortacklingtheproblem:spheringtheoutput,regularizingwithacontrastivelossterm,andregularizingwithareconstructivelossterm.Searchinginthelatentspace:Combiningtherepresentationwiththepredictornetwork,wecansearchinthelatentspaceuntilanodeisfoundthathasthelargestsimilaritytotherepresentationofthegoalviewpointusingamodifiedbest-firstsearchalgorithm.NORBenvironment:WeusetheNORBdataset (lecun2004learning)forourexperiments.Thisdatasetconsistsofimagesofobjectsfromdifferentviewpoints,andwecreateviewpoint-matchingtasksfromthedataset.

4.3.1 Ongoodrepresentations

Werelyonheuristicstoprovidesufficientevidenceforagood—albeitnotnecessarilyoptimal—decisionateverytimesteptoreachthegoal.Here,weusetheEuclideandistanceinrepresentationspace:asequenceofactionsispreferrediftheirendlocationisclosesttothegoal.TheusefulnessofthisheuristicsdependsonhowwellandhowcoherentlytheEuclideandistanceencodestheactualdistancetothegoalstateintermsofthenumberofactions.Alearnedpredictornetworkapproximatesthetransitionfunctionoftheenvironmentforplanninginthelatentspacedefinedbysomerepresentation.Thisraisesthequestion:whatistheidealrepresentationforlatentspaceplanning?Ourexperimentsshowthatanopenlyavailable,general-purposerepresentation,suchasapretrainedVGG16(simonyan2014very),canalreadyprovidesufficientguidancetoapplysuchheuristicseffectively.Betterstillarerepresentationmodelsthataretrainedonthedataathand,forexample,uniformmanifoldapproximationandprojection(UMAP)(mcinnes2018umap)orvariationalauto-encoders(VAEs)(kingma2013auto).Onemight,however,askwhataparticularlysuitedrepresentationmightlooklikewhenattainabilityisignored.Itwouldneedtotakethetopologicalstructureoftheunderlyingdatamanifoldintoaccount,sothattheEuclideandistancebecomesagoodproxyforthegeodesicdistance.Oneclassofmethodsthatsatisfythisarespectralembeddings,suchasLaplacianEigenmaps(LEMs)(belkin2003laplacian).Theirrepresentationsaresmoothanddiscriminativewhichisidealforourpurpose.However,theydonoteasilyproduceout-of-sampleembeddings,sotheywillonlybeappliedinanin-samplefashiontoserveasacontrolexperiment,yieldingoptimalperformance.

4.3.2 Predictornetwork

Astherepresentationisusedbythepredictornetwork,wewantittobepredictable.Thus,weoptimizetherepresentationlearnersimultaneouslywiththepredictornetwork,inanend-to-endmanner.Supposewehavearepresentationmapandatrainingsetoflabeleddatatuples,whereistheobservationattimestepandisanactionresultinginastatewithobservation.Wetrainthepredictor,parameterizedby,byminimizingthemean-squarederrorlossover’sparameters:

(4.1)

whereisoursetoftrainingdata.Weconstructasatwo-stream,fullyconnected,neuralnetwork.Usingthispredictorwecancarryoutplanninginthelatentspacedefinedby.Byplanning,wemeanthatthereisastartstatewithobservationandagoalstatewithandwewanttofindasequenceofactionsconnectingthem.Thenetworkoutputstheexpectedrepresentationafteracting.Usingthis,wecanformulateplanningasaclassicalpathfindingorgraphtraversalproblem.

4.3.3 Avoidingtrivialsolutions

Inthecasewhereistrainableandparameterizedby,thelossforthewholesystemthatonlycaresaboutmaximizingpredictabilityis

(4.2)

foragivendataset.Withnoconstraintsonthefamilyoffunctionsthatcanbelongto,weruntheriskthattherepresentationcollapsestoaconstant.Constantfunctionstriviallyyieldzerolossforanysetifoutputstheinputstateagainforany,i.e:Constantrepresentationsareoptimalwithrespecttopredictability,buttheyareunfortunatelyuselessforplanning,asweneedtodiscriminatedifferentstates.ThisobjectiveisnotpresentintheproposedlossfunctioninEq. (4.2)andwemustthusaddaconstraintoranotherlosstermtofacilitatedifferentiatingthedifferentstates.Thereareseveralwaystolimitthefunctionspacesuchthatconstantfunctionsarenotincluded,forexamplewithdecoder(goroshin2015learning)oradversarial(denton2017unsupervised)lossterms.Inthiswork,wedothiswith(i)aspheringlayer,(ii)acontrastiveloss,or(iii)areconstructiveloss.

(i)Spheringtheoutput

TheproblemoftrivialsolutionsissolvedinSFA(wiskott2002slow)andrelatedmethods(escalante2013solve; schuler2018gradient)byconstrainingtheoverallcovarianceoftheoutputtobe.Includingthisconstrainttooursettingyieldstheoptimizationformulation:

(4.3)
subjectto

Weenforcethisconstraintinournetworkviaarchitecturedesign.Thelastlayerperformsdifferentiablesphering(schuler2018gradient; hlynsson2019learning)ofthesecond-to-lastlayer’soutputusingthewhiteningmatrix.Wegetusingpoweriterationofthefollowingiterativeformula:

(4.4)

wherethesuperscripttrackstheiterationnumberandcanbeanarbitraryvector.Thepoweriterationalgorithmconvergestothelargesteigenvectorofamatrixinafewhundred,quickiterations.Theeigenvalueisdetermined,andwesubtracttheeigenvectorfromthematrix:

(4.5)

theprocessisrepeateduntilthespheringmatrixisfound

(4.6)

Thewholesystem,includingthespheringlayer,canbeseeninFig 4.2,withanabstractconvolutionalneuralnetworkastherepresentationandafully-connectedneuralnetworkasthepredictionfunction.

Fig 4.2: Predictiverepresentationlearningwithspheringregularization.Theobservations,andtheresultingobservationaftertheactionhasbeenperformedin,arepassedthroughtherepresentationmap,whoseoutputsarepassedtoadifferentiablespheringlayerbeforebeingpassedto.Thepredictivenetworkminimizesthelossfunction,whichisthemean-squarederrorbetweenand.
(ii)Contrastiveloss

Constantsolutionscanalsobedealtwithinthelossfunctioninsteadofviaarchitecturedesign.hadsell2006dimensionalityproposetosolvethiswithalossfunctionthatpullstogethertherepresentationofsimilarobjects(inourcase,statesthatarereachablefromeachotherwithasingleaction)butpushesaparttherepresentationofdissimilarones:

(4.7)

whereisamarginandissome—usuallytheEuclidean—norm.Therepresentationofdissimilarobjectsispushedapartonlyiftheinequality

(4.8)

isviolated.Duringeachtrainingstep,wecompareeachobservationtoasimilarandadissimilarobservationsimultaneously (schroff2015facenet)bypassingatripletof(positive,anchor,negative)observationsduringtrainingtothreecopiesof.Inourexperiments,thepositivecorrespondstothepredictedembeddingofgivenand,theanchor,isthetrueresultingembeddingafteranactionisperformedinstateand,thenegative,istherepresentationofanarbitrarilychosenobservationthatisnotreachablefromwithasingleaction.Forenvironmentswherethisisdeterminable,suchasinourexperiments,thiscanbeassessedfromtheenvironment’sfullstate.Whenthisinformationisn’tavailable,ensuringforandthatisagoodproxy,eventhoughthiscanresultinsomeincorrecttriplets.Forexample,whentheagentrunsinaself-intersectingpath.Wedefinetherepresentationoftheobservationattimestepasandthenext-steppredictionforreadabilityandour(positive,anchor,negative)tripletisthusandweminimizethetripletloss:

(4.9)

Itwouldseemthatandareinterchangeable,sincethesecondtermisincludedonlytopreventtherepresentationfromcollapsingintoaconstant.However,ifthelossfunctionis

(4.10)

thenthenetworkisrewardedduringtrainingformakingpooratpredictingthenextrepresentationinsteadofsimplypushingtherepresentationofandawayfromeachother.Therearetwomainwaystosetthemargin,oneisdynamicallydeterminingitperbatch(sun2014deep).Theother,whichwechoose,isconstrainingtherepresentationtobeonahypersphereusingnormalizationandsettingasmallconstantmarginsuchas(schroff2015facenet).ThearchitectureforthetrainingschemeusingthecontrastivelossregularizationisdepictedinFig. 4.3.

Fig 4.3: Predictiverepresentationlearningwithcontrastivelossregularization.Weminimizethecontrastivelossfunction(Eq.4.9).Thepredictedfuturerepresentationispulledtowardthenextstep’srepresentation.However,ispushedawayfromthenegativestate’srepresentationifthedistancebetweenthemislessthan.Theobservationisrandomlyselectedfromthosethatarenotreachablefromwithasingleaction.
(iii)Reconstructiveloss

TrivialsolutionsareavoidedbyGoroshinetal.(goroshin2015learning)byintroducingadecodernetworktoasystemthatwouldotherwiseconvergetoaconstantrepresentation.Weincorporatethisintuitionintoourframeworkwiththelossfunction

(4.11)

whereisapositive,realcoefficienttocontroltheregularizationstrength.Fig. 4.4showshowthemodelsandlossfunctionsarerelatedduringthetrainingoftherepresentationandpredictorusingbothapredictiveandareconstructivelossterm.

Fig 4.4: Predictiverepresentationlearningwithdecoderlossregularization.Atthetimestep,theobservationispassedtotherepresentation.Thisproduceswhichispassed,alongwiththeactionattimestep,tothepredictornetwork.Thisproducesthepredictedwhichiscomparedtointhemean-squarederrorterm.Thepredictionisalsopassedtothedecodernetwork.Wethencomparewithin,anothermean-squaredlossterm.Thefinallossisthesumofthesetwolossterms.

ThedesiredeffectoftheregularizationcanalsobeachievedbyreplacingthesecondterminEq. (4.3.3)with.Bydoingthiswewouldmaximizethereconstructivepropertyofthelatentcodeinandofitself,whichisnotinherentlyusefulforplanning.Weinsteadaddanadditionallevelofpredictivepowerin:inadditiontopredictingthenextrepresentation,itspredictionmustalsobeusefulinconjunctionwiththedecoderforreconstructingthenewtrueobservation.Thisapproachcanhavethelargestcomputationaloverheadofthethree,dependingonthesizeofthedecoder.Weconstructthedecodernetworksuchthatitcloselymirrorsthearchitectureof,withconvolutionsreplacedbytransposedconvolutionsandmax-poolingreplacedbyupsampling.

4.3.4 Trainingthepredictornetwork

WetraintherepresentationnetworkandpredictornetworkjointlybyminimizingEq. (2),Eq. (10)orEq. (12).Thepredictornetworkcanalsobetrainedonitsownforafixedrepresentationmap.Inthiscase,istaskedasbeforewithpredictingaftertheactionisperformedinthestatewithobservationbyminimizingthemean-squarederrorbetweenand.ThenetworksarebuiltwithKeras(chollet2015keras)andoptimizedwithrmsprop(tieleman2012lecture).

4.3.5 Planningintransition-learneddomainrepresentationspace

Weuseamodifiedbest-firstsearchalgorithmwiththetrainedrepresentationsforourexperiments(Algorithm2).

0:  ,,maxtrials,actionset,representationmapandpredictorfunction
0:  Asequenceofactionsconnectingthestartstatetothegoalstate
1:  Initializethesetofuncheckedrepresentationswiththerepresentationofthestartstate
2:  Initializethedictionaryofrepresentation-pathpairswiththeinitialrepresentationmappedtoanemptysequence:)
3:  Initializetheemptysetofcheckedrepresentations
4:  for  do
5:     Choose
6:     Removefromandadditto
7:     for all actions do
8:        GetanewestimatedrepresentationandaddittothesetQ
9:        Concatenatetotheendofandassociatetheresultingsequencewithinthedictionary:
10:        
11:     end for
12:  end for
13:  Findthemostsimilarrepresentationtothegoal:
14:  return  thesequence
Algorithm 2 Performasimulatedrollouttofindastatethatismaximallysimilartoagoalstate.Outputasequencetoreachthefoundstatefromthestartstate.

Fromagivenstate,theagentperformsasimulatedrollouttosearchforthegoalstate.Foreachaction,theinitialobservationispassedtothepredictorfunctionalongwiththeaction.Thisresultsinapredictednext-steprepresentation,whichisaddedtoaset.Theactionstakensofarandresultingineachpredictionarenotedalso.Therepresentationthatisclosesttothegoal(usingforexampletheEuclideandistance)isthentakenforconsiderationandremovedfromtheset.Thisprocessisrepeateduntilthemaximumnumberoftrialsisreached.Thealgorithmthenoutputsthesequenceofactionsresultinginthepredictedrepresentationthatistheclosesttothegoalrepresentation.Tomakethealgorithmfaster,weonlyconsiderpathsthatdonottakeustoastatethathasalreadybeenevaluated,eveniftheremightbeadifferenceinthepredictionsfromgoingthisroundaboutway.Thatis,ifapermutationoftheactionsinthenextpathtobeconsideredisalreadyinanevaluatedpath,itwillbeskipped.Thishasthesameeffectastranspositiontablesusedtospeedupsearchingametrees.Pathsmightbeproducedwithredundancies,whichcanbeamendedwithpath-simplifyingroutines(e.g.takeonestepforwardinsteadofonestepleft,oneforwardthenoneright).WedoModel-PredictiveControl(garcia1989model),thatis,afterapathisfound,oneactionisperformedandanewpathisrecalculated,startingfromthenewposition.Sincetheplanningispossiblyoveralongtimehorizon,wemighthaveacasewhereapreviousstateisrevisited.Toavoidloopsresultingfromthis,wekeeptrackofvisitedstate-actionpairsandavoidanalreadychosenactionforagivenstate.

4.3.6 NORBviewpoint-matchingexperiments

Forourexperiments,wecreateanOpenAIGymenvironmentbasedonthesmallNORBdataset(lecun2004learning).Thecodefortheenvironmentisavailableathttps://github.com/wiskott-lab/gym-norbandrequiresthepickledNORBdatasethostedathttps://s3.amazonaws.com/unsupervised-exercises/norb.p.Thedatasetcontains50toys,eachbelongingtooneoffivecategories:four-leggedanimals,humanfigures,airplanes,trucks,andcars.Eachobjecthasstereoscopicimagesundersixlightingconditions,9elevations,and18azimuths(in-scenerotation).Inalloftheexperiments,wetrainthemethodsonninecarclasstoys,testingontheothertoys.EachtrialinthecorrespondingRLenvironmentrevolvesaroundasingleobjectunderagivenlightingcondition.Theagentispresentedwithastartandagoalviewpointoftheobjectandtransitionsbetweenimagesuntilthecurrentviewpointmatchesthegoalwhereeachactionoperatesthecamera.Tobeconcrete,theactionscorrespondtoturningaturntablebackandforthby,movingthecameraupordownbyand,inoneexperiment,changingthelighting.Thetrialisasuccessiftheagentmanagestochangeviewpointsfromthestartpositionuntilthegoalviewpointismatchedinfewerthantwicetheminimumnumberofactionsnecessary.Wecomparetherepresentationslearnedusingthethreevariantsofourmethodtofiverepresentationsfromtheliterature,namely(i)LaplacianEigenmaps(belkin2003laplacian),(ii)thesecond-to-lastlayerofVGG16pretrainedonImageNet(deng2009imagenet),(iii)UMAPembeddings(mcinnes2018umap),(iv)convolutionalencoder(masci2011stacked)and(v)VAEcodes(kingma2013auto).Asfixedrepresentationsdonotchangethroughoutthetraining,theycanbesavedtodisk,speedingupthetraining.Asareference,weconsiderthreereinforcementlearningmethodsworkingdirectlyontheinputimages:(i)DeepQ-Networks(DQN) (mnih2013playing),(ii)ProximalPolicyOptimization(PPO) (schulman2017proximal)and(iii)WorldModels (ha2018world).Thedatasetisturnedintoagraphforsearchbysettingeachimageasanodeandeachviewpoint-changingactionasanedge.Thetaskoftheagentistotransitionbetweenneighboringviewinganglesuntilagoalviewpointisreached.Thetotalnumberoftrainingsamplesisfixedat25600.Forourmethod,asampleisasingle(,,)triplettobepredictedwhilefortheregularRLmethodsitisa(,,,)tuple.

4.3.7 Modelarchitectures

Input

ThenetworkencodesthefullNORBinput,apixelgrayscaleimage,tolower-dimensionalrepresentations.Thesystemasawholereceivestheimagefromthecurrentviewpoint,theimageofthegoalviewpointandaone-hotencodingofthetakenaction.Theimageinputsareconvertedfromintegersrangingfrom0to255tofloatingpointnumbersrangingfrom0to1.

Representationlearnerarchitecture

Weusethesamearchitectureforthenetworkinallofourexperimentsexceptforvaryingtheoutputdimension,Table 4.1.

Layer Filters/Units Kernelsize Strides Outputshape Activation
Input (96,96,1)
Convolutional 64 (45,45,64) ReLU
Max-pooling (22,22,64)
Convolutional 128 (9,9,128) ReLU
Flatten (10368)
Dense 600 (600) ReLU
Dense #Features (#Features) Linear
Table 4.1: Representationnetworkarchitecture.
Regularizingdecoderarchitecture

ThedecodernetworkhasthearchitecturelistedinTable 4.2.Itisdesignedtoapproximatelyinverseeachoperationintheoriginalnetwork.

Layer Filters/Units KernelSize Strides Outputshape Activation
Input (#Features)
Dense 512 (512) ReLU
BN (512)
Dense 12800 (12800) ReLU
BN (12800)
Reshape (10,10,128)
CT 128 (23,23,128) ReLU
Upsampling (46,46,128)
BN (46,46,128)
CT 64 (95,95,64) ReLU
BN (95,95,64)
CT 1 (96,96,1) Sigmoid
Table 4.2: Regularizingdecoderarchitecture.Theupsamplinglayeruseslinearinterpolation,BNstandsforBatchNormalizationandCTstandsforConvolutionalTranspose.
Predictornetwork

Thepredictornetworkisatwo-streamdenseneuralnetwork.Eachstreamconsistsofadenselayerwitharectifiedlinearunit(ReLU)activation,followedbyabatchnormalization(BatchNorm)layer.Theoutputsofthesestreamsarethenconcatenatedandpassedthrough3denselayerswithReLUactivations,eachonefollowedbyaBatchNorm,andthenanoutputdenselayer,seeTable4.3.

Layer Filters/Units Outputshape Activation
    Stream:Input (#Features) ReLU
    Stream:Dense 256 (256) ReLU
    Stream:BatchNormalization (256)
  Stream:Input (#Actions) ReLU
  Stream:Dense 128 (128) ReLU
  Stream:BatchNormalization (128)
Concatenateandstreams (384)
Dense 256 (256) ReLU
BatchNormalization (256)
Dense 256 (256) ReLU
BatchNormalization (256)
Dense 128 (128) ReLU
BatchNormalization (128)
Dense #Features (#Features) Linear
Table 4.3: Representionpredictorarchitecture.Thestreamreceivestherepresentationasinputandthestreamreceivestheone-hotactionasinput.Bothstreamsareprocessedinparallelandthenconcatenated,witheachoperationappliedfromtoptobottomsequentially.Thenumberofhiddenunitsinthelastlayerdependsonthechosendimensionalityoftherepresentation.

4.4 Results

Withourempiricalevaluation,weaimtoanswerthefollowingresearchquestions:

  1. (Monotonicity)IstheEuclideandistancebetweenasuitablerepresentationandthegoalrepresentationdecreasingasthenumberofactionsthatseparatethemdecreases?

  2. (Trainedpredictability)Istrainingarepresentationforpredictability,asproposed,feasible?

  3. (Dimensionality)Whatisthebestdimensionalityofthelatentspaceforourplanningtasks?

  4. (Solutionconstraints)Intermsofplanningperformance,whatarepromisingconstraintstoplaceontherepresentationtoavoidtrivialsolutions?

  5. (Benchmarking)HowdoesplanningwithLARPcomparetoothermethodsfromtheRLliterature?

  6. (Generalization)Howwelldoesourmethodgeneralizetounseenenvironments?

wewillrefertotheseresearchquestionsbynumberbelowastheygetaddressed.

4.4.1 Latentspacevisualization

Whentherepresentationandpredictornetworksaretrained,weapplyAlgorithm2totheviewpoint-matchingtask.Asdescribedabove,thegoalistofindasequenceofactionsthatconnectsthestartstatetothegoalstate,wherethetwostatesdifferintheirconfigurations.Tosupportthequalitativeanalysisofthelatentspace,weplotheatmapsofsimilaritybetweenthegoalrepresentationandthepredictedrepresentationofnodesduringsearch(Fig 4.6).Ofthe10cartoysintheNORBdataset,werandomlychose9forourtrainingsetandtestontheremainingone.

In-sampleembedding:LaplacianEigenmaps

First,weconsiderresearchquestion1(monotonicity).Inordertogetthebest-caserepresentation,weembedthetoyusingLaplacianEigenmaps.EmbeddingasingletoyinthreedimensionsusingLaplacianEigenmapsresultsinatube-likeembeddingthatencodesbothelevationandazimuthangles,seeFig4.5.Threedimensionsareneededsothatthecyclicazimuthcanbeembeddedcorrectlyasand.

(a) *
(b) *
(c) *
(d) *
(e) *
(f) *
Fig 4.5: LaplacianEigenmaprepresentationspaceofaNORBtoy.Thethree-dimensionalLaplacianEigenmapsofatoycarwhereelementswiththesameazimuthvalueshavethesamecolor(top)andwhereelementswiththesameelevationvalueshavethesamecolor(bottom).Euclideandistanceisagoodproxyforgeodesicdistanceinthiscase.

Iftherepresentationisnowusedtotrainthepredictor,onewouldexpectthattherepresentationbecomesmonotonicallymoresimilartothegoalrepresentationasthestatemovestowardthegoal.InFig4.6weseethatthisisthecaseandthatthisbehaviorcanbeeffectivelyusedforagreedyheuristics.Whilethemonotonicityisnotalwaysexactduetoerrorsintheprediction,Fig4.6stillqualitativelyillustratesabest-casescenario.

Fig 4.6: HeatmapofLaplacianEigenmaplatentspacesimilarity.Eachpixeldisplaysthedifferencebetweenthepredictedrepresentationandthegoalrepresentation.Onlythestartandgoalobservationsaregiven.Thebluedotshowsthestartstate,greenthegoal,andpurplethesolutionstatefoundbythealgorithm.ThesearchalgorithmcanrelyonanalmostmonotonicallydecreasingEuclideandistancebetweeneachstate’spredictedrepresentationandthegoal’srepresentationtoguideitssearch.

Weconcludefromthisthat,forasuitablerepresentation,theEuclideandistancebetweenacurrentrepresentationandthegoalrepresentationismonotonicallyincreasingasafunctionofthenumberofactionsthatseparatethem.Thissupportstheuseofaprediction-basedlatentspacesearchforplanning.

Out-of-sampleembedding:pretrainedVGG16representation

Next,weconsiderthepretrainedrepresentationoftheVGG16networktogetarepresentationthatgeneralizestonewobjects.Wetrainthepredictornetworkandplottheheatmapofthepredictedsimilaritybetweeneachstateandthegoalstate,beginningfromthestartstate,inFig 4.7.

Fig 4.7: HeatmapofVGG16latentspacesimilarity.ThepredictornetworkestimatestheVGG16representationoftheresultingstatesastheobjectismanipulated.(a)Thegoalliesonahillcontainingamaximumofrepresentationalsimilarity.(b)Theaccumulatederrorsofiteratedestimationscausethealgorithmtoplanapathtoawrongstatewithasimilarshape.

Theheatdistributioninthiscaseismorenoisy.Togetaviewoftheexpectedheatmapprofile,weaverageseveralfiguresofthistypetoshowbasinsofattractionduringthesearch.Eachheatmapisshiftedsuchthatthegoalpositionisatthebottom,middlerow(Fig 4.8,a).Here,itisobviousthatthegoalandtheflipped(azimuth)versionofthegoalareattractorstates.Thisisduetotherepresentationmapbeingsensitivetotheroughshapeoftheobject,butbeingunabletodistinguishfinerdetails.In(Fig 4.8,b)wedisplayanaggregateheatmapwhentheagentcanalsochangethelightingconditions.Ourvisualizationsshowagradienttowardthegoalstateinadditiontovisuallysimilarfar-away-states,sometimescausingthealgorithmtoproducesolutionsthatarethepolaroppositeofthegoalconcerningtheazimuth.Predictionerrorsalsopreventtheplanningalgorithmfromfindingtheexactgoalforeverytask,evenifitisnotdistractedbythepolar-opposite.

Fig 4.8: AggregateheatmapsofVGG16representationsimilaritiesontestdata.Thedataiscollectedasthestatespaceissearchedforamatchingviewpoint.Thepixelsarearrangedaccordingtotheirelevationandazimuthdifferencefromthegoalstateatontheleftandontheright.(a)Weseecleargradientstowardthetwobasinsofattraction.Thereislesschangealongtheelevationduetolesschangeateachstep.(b)Theagentcanalsochangethelightingofthescene,withqualitativelysimilarresults.Inthisgraphicweonlymeasuretheabsolutevalueofthedistance.

Toinvestigatetheaccuracyofthesearchwithrespecttoeachdimensionseparately,weplotthehistogramofdistancesbetweenthegoalstatesandthesolutionstatesinFig 4.9.Thegoalandstartstatesarechosenrandomly,withtherestrictionthattheazimuthdistanceandelevationdistancebetweenthemareeachuniformlysampled.Fortherestofthechapter,alltrialsfollowthissamplingprocedure.TheresultslooklessaccurateforelevationthanazimuthbecausetheelevationchangesaresmallerthantheazimuthchangesintheNORBdataset.ThedifferencebetweenthegoalandsolutionviewpointsinFig 4.7left,forexample,ishardlyvisible.Ifonewouldscalethehistogramsbyangleandnotbybins,thedrop-offwouldbesimilar.

Fig 4.9: Histogramsofelevation-wiseandazimuth-wiseVGG16errors.Thehistogramsdisplaythecountsofthedistancebetweengoalandsolutionstatesalongelevation(left)andazimuth(right)ontestdata.Thedistancebetweenthestartandgoalviewpointsisequallydistributedacrossallthetrials,alongbothdimensions.Thegoalandtheflipped(azimuth)versionofthegoalareattractorstates.

4.4.2 Latentspacedimensionality

Withthenextexperiment,weaimtoanswerresearchquestion2(trainedpredictability).Whiletuningthedetailsofthedesign,wealsotackleresearchquestions3(dimensionality)and4(solutionconstraints).Wedoanablationstudyofthedimensionalityoftherepresentationforourmethod(Table 4.4).ThetestcarisanunseencartoyfromtheNORBdataset,andthetraincarcomesfromthetrainingset.

Representation Dimensions TrainingCar() TestCar()
LARP(Contrastive) 96 59.3 56.8
64 64.1 60.5
32 72.3 59.4
16 74.1 59.3
8 82.7 58.0
LARP(Sphering) 96 41.1 37.8
64 93.9 53.7
32 89.8 51.9
16 85.2 42.6
8 85.1 40.1
LARP(Decoder) 96 58.0 51.9
64 79.5 63.0
32 77.8 61.7
16 51.9 45.2
8 51.1 42.4
VGG16 902 62.4 55.1
RandomSteps 3.5 3.5
Table 4.4: Ablationstudyoftherepresentationdimensionality.WechangetheoutputdimensionoftherepresentationlearnersubnetworkandcompareittotheVGG16representationtrainedonImageNet.Theperformance(meansuccessrate)isaveragedovertenseparateinstantiationsofoursystems,whereeachinstanceisevaluatedonahundredtrialsoftheviewpoint-matchingtask.Atrialisasuccessifthegoalisreachedbytakinglessthantwicetheminimumnumberofactionsneededtoreachit.Thestandarddeviationsrangebetween0.1and0.3foreachtableentry.

Thereisnoclearwinner:thenetworkwiththespheringlayerdoesthebestononeofthecarsusedduringtraining,whilethereconstructive-lossnetworkdoesthebestontheheld-outtestcar.Thesharpdifferenceinperformancebetween64and94sphering-regularizedrepresentationcanbeexplainedbythenumericalinstabilityofthepoweriterationmethodfortoolargematrixdimensions.TheVGG16representationisnotthehighestperformeronanyofthecartoys.ManyofVGG16’srepresentationvaluesare0forallimagesintheNORBdataset,soweonlyusethosethatarenonzeroforanyoftheimages.Wesuggestthatthishighnumberofdeadunitsisduetotherepresentationbeingtoogeneralforthetaskofmanipulatingrelativelyhomogenousobjects.Anotherdrawbackofusingpretrainednetworksisthatinformationmightbeencodedthatisunimportantforthetask.Thishastheeffectthatoursearchmethodisnotguaranteedtooutputthecorrectsolutioninthelatentspace,astheremightbedistractingpocketsoflocalminima.Therandombaselinehasanaveragesuccessrateof3.5,whichisveryclearlyoutperformedbyourmethod.As64isthebestdimensionalityfortherepresentationonaverage,wecontinuewiththatnumberforourmethodinthetransferlearningexperiment.Weconcludethattheproposedmethodoftrainingarepresentationforpredictabilityisfeasible.Sofarwehaveevidencethat64isthebestdimensionalityoftherepresentation’slatentspaceforourplanningtasks.However,itisnotyetconclusivewhatthebestrestrictionistoplaceontherepresentationtoavoidthetrivialsolution,intermsofplanningperformance.

4.4.3 ComparisonwithotherRLmethods

Nowwedivertourattentiontoresearchquestion5(benchmarking),wherewecompareourmethodtotheliterature.ForthecomparisonwithstandardRLmethods,weusethedefaultconfigurationsofthemodel-freemethodsDQNandPPOasdefinedinOpenAIBaselines (baselines).Ourmodel-basedcomparisonischosentobeworldmodels(ha2018world)fromhttps://github.com/zacwellmer/WorldModels.WemakesurethatthecomparedRLmethodsaresimilartooursystemintermsofthenumberofparametersaswellasarchitecturelayoutandcomparethemwithourmethodonthecarviewpoint-matchingtask.ThetasksetupisthesameasbeforeandisconvertedtoanOpenAIgymenvironment:astartobservationandagoalobservationarepassedtotheagent.Iftheagentmanagestoreachitwithin2timestheminimumnumberofactionsrequired(theminimumnumberiscalculatedbytheenvironment),theagentreceivesarewardandthetaskisconsideredasuccess.Otherwise,norewardisgiven.TheresultsofthecomparisoncanbeviewedinFig. 4.10.Eachpointinthecurvecontainseachmethod’smeansuccessrate:theaverageofthecumulativerewardfrom100testepisodesfrom5differentinstantiationsoftheRLlearner,soitistheaveragerewardover500episodesintotal.Thetestepisodesaredoneonthesameenvironmentasisusedfortraining,exceptthatthepolicyismaximallyexploitingandminimallyexploring.

\subfloat

[DQNperformance.]  \subfloat[PPOperformance.]  \subfloat[Worldmodelsperformance.]

Fig 4.10: Reinforcementlearningcomparison.Theverticaldashedlinesindicatewhenthecomparedalgorithmhasprocessedthesamenumberoftransitionsasourmethodandthehorizontaldottedlineindicatesthetestperformanceofourmethod.Eachdatapointisthemeansuccessrateof100testepisodesafteravaryingnumberoftrainingsteps,averagedover5differentseedsofeachlearner.Themodel-freemethodsin(a)and(b)traintherepresentationandthecontrollersimultaneouslybyactingintheenvironmentandcollectingnewexperiences.Therepresentationin(c)istrainedon25.6ktransitions,whichisthesamenumberweuse.Theplotshowstheoptimizationcurveforthecontroller,usingaCovariance-MatrixAdaptationEvolutionStrategy,whichhardlyimprovesafter500orsotrainingsteps.Thehorizontallinestartsat0forworldmodelsbecausetherepresentationhasfinishedtrainingontheobservationsbeforethecontrollerisoptimized.

Inourexperiments,theDQNnetworksaremuchmoresampleinefficientthanPPO,whichinturnismoresampleinefficientthanourmethod.However,ourmethodismoretime-consumingduringtesttime.Werequireaforwardpassofthepredictornetworkforeachnodethatissearchedbeforewetakethenextstep,whichcangrowrapidlyifthetargetisfaraway.Incontrast,onlyasinglepassthroughthetraditionalRLnetworksisrequiredtocomputethenextaction.Ourmethodreachessuccessrateonthetraincar(Table 4.4)using25.6ksamples,butthebestPPOrunonlyreachesaftertrainingonthesamenumberofsamples.ThebestsinglePPOrunneeded41.3ksamplestogethigherthansuccessrate,andtheaverageperformanceishigherthanataround55ksamples.Afterthat,somePPOlearnersdeclinedagaininperformance.TheworldmodelspolicyquicklyreachesthesamelevelofperformanceasDQNgotafter50kstepsandPPOafterapproximately8ksteps,butitdoesn’timprovebeyondthat.

\subfloat

[]  \subfloat[]

Fig 4.11: Re-trainingafterplacingobstaclesinacheckerboardpattern.(a)Thetaskisthesameasbefore,butnothinghappensiftheagentattemptstomovetoastatecontainingablackrectangle.(b)Aftertrainingtheagents,were-testedthemafterweintroducedthecheckerboardpatternofobstacles.Ourmethoddoesnotallowforre-traininginthenewenvironment.

WeconcludethatourmethodcomparesfavorablytoothermethodsfromtheRLliteratureintermsofdata-efficiency.

4.4.4 Modifyingtheenvironment

Wenowmodifytheenvironmenttoanswerresearchquestion6(generalization).Toseehowthemethodscomparewhenobstaclesareintroducedtotheenvironment,werepeatthetrialononeofthecarobjectsexceptthattheagentcannolongerpassthroughstateswhoseelevationvaluesaredivisibleby10andazimuthvaluesaredivisibleby40(Fig. 4.11,(a)).Asbefore,thegoalandstartlocationscanhaveanyazimuth-elevationpair,buttheagentcannotmoveintostateswiththepropertiesindicatedbytheblackrectangle.Everyactionisavailabletotheagentatalllocationsasbefore,buttheagent’sstateisunchangedifitattemptstomovetoastatewithablackrectangle.WetrainedLARPusingthecontrastiveloss,PPO,andDQNagentsuntiltheyreachedaccuracyonourplanningtaskandthentestedthemwiththeaddedobstacles.Ourmethodlosesaboutperformance,butPPOloses.Nevertheless,wecancontinuetrainingPPOuntilitquicklyreachestopperformanceagain(Fig. 4.11,(b)).Ourmethodisnotre-trainedforthenewtask,andDQNdidnotreachagoodperformanceagaininthetimeweallottedforre-training.Thus,weseethatourmethodisquiteflexibleandgeneralizeswellwhenobstaclesareintroducedtotheenvironment.

4.4.5 Transfertodissimilarobjects

Wenowconsiderresearchquestion6(generalization)furtherbyinvestigatinghowwellourmethodtransfersknowledgefromonedomaintoanother.Selectingthebestdimensionalityfortherepresentationfromtheprevioussetofexperiments,weinvestigatefurthertheirperformanceinhardersituationsusingunseen,non-carobjects.Themodelsaretrainedonthesamecarobjectsasinthepreviousexperiment,buttheyaretestedonanarrayofdifferentplasticsoldiers:akneelingsoldierholdingabazooka,astandingsoldierwitharifle,aNativeAmericanwithabowandspearandacowboywitharifle(Fig. 4.12).

Fig 4.12: Toysfortransferlearningexperiments.Fromlefttoright:Soldier(Kneeling),Soldier(Standing),NativeAmericanwithBowandCowboywithRifle.
Qualitativeresults

Weofferavisualizationofthelearnedrepresentationofthekneelingsoldiertoyusingourmethod,aconvolutionalencoder,andVGG16inFig. 4.13.Eachembeddingwasreducedto2dimensionsusingt-SNE.Everymethodstructuresthedomainsimilarly.Inthebottomrow,weseethatthelargestclustersforallmethodsaretheoneswiththehighest(tealdots)illuminationsettings,whichisexplainedbytheeffectofthelightingonthepixelvalueintensities.Withintheseclusters,weseeclusteringbasedontheazimuth(middlerow).Finally,withintheseclusters,thereisagradientstructurebasedonelevation(toprow).Thisisduetotheelevationchanginginsmallerstep-sizes,with5degreedifferences,thanazimuthwith20degreedifferences.

(a) LARPelevationt-SNE
(b) CAEelevationt-SNE
(c) VGG16elevationt-SNE
(d) LARPazimutht-SNE
(e) CAEazimutht-SNE
(f) VGG16azimutht-SNE
(g) LARPlightingt-SNE
(h) CAElightingt-SNE
(i) VGG16lightingt-SNE
Fig 4.13: Latentspacevisualization.Theclusteringstructureofatoysoldierembeddingafterdimensionalityreductionwitht-SNE.Thetoprowcolorsthesamplesbyelevation,themiddlerowbyazimuthandthebottomonebylighting.Theleftcolumnistheembeddinggivenbyourmethodusingthecontrastivelossterm,themiddlebyaconvolutionalautoencoder(CAE),andtherightbyapretrainedVGG16network.
Quantitativeresults

OurexperimentsincludethepretrainedVGG16networkbecausewebelievethataflexiblealgorithmshouldratherbebasedonagenericmulti-purpose-representationandnotonaspecificrepresentation.Tooursurprise,itisoutperformedbyourapproachoneveryobject,evenastheviewpoint-matchingtaskisdoneonobjectsthataredifferentfromtheoriginalcartoystheyaretrainedon.WealsotrytransferlearningaVAEandaconvolutionalencoderusingthesamenetworkasourrepresentationandaUMAPembedding,with64featureseach.Wedisplaytheresultsofthesameviewpoint-matchingtasksasbeforeusingthesedifferentrepresentationsontheunseenarmyfiguresinTable 4.5.Convolutionalencodershavethelowestperformanceoutofallthemethods,buttheVAErepresentationhasthemostsimilarperformancetoourownrepresentations.

Representation Soldier Soldier NativeAmerican Cowboy
(Kneeling) (Standing) withBow withRifle Mean
LARP(Contrastive) 22.7 11.1 12.1 20.7 16.7
LARP(Sphering) 18.4 15.3 15.2 15.8 16.2
LARP(Decoder) 14.3 16.6 14.7 15.6 15.3
VAE 14.4 13.1 13.6 13.8 13.7
VGG16 12.3 12.9 11.8 13.6 12.7
UMAP 8.6 10.8 9.7 11.1 10.1
Conv.Encoder 4.9 14.3 11.5 6.4 9.3
Table 4.5: Transferlearningperformance.Themethodsaretrainedonthedatasetofdifferentcarimages,andthentheirperformanceondissimilartoysistested.Eachnumberisthemeansuccessrateofviewpointmatchingoutof1000trials.

Ourrepresentationisbetterthantheothersforthisexperiment.Butweseethattheperformanceofourmethoddropssignificantlyasweattemptgeneralizingtodifferentobjectswithchangedinputstatistics.Notethateventhoughtherepresentationclustering(Fig. 4.13)wasmoreclear-cutusingconvolutionalencodersandVGG,comparedtoours,theydidnotperformaswellinthetrials.Weattributethistothefactthatthemainlossterminourrepresentationwastheoneoffuturestatepredictability,whichmaynotgiveasclearofastructureintwodimensionsasarepresentationthatistrainedonlyforstaticimagereconstruction.

4.5 Conclusion

Inthischapter,wepresentLatentRepresentationPrediction(LARP)networkswithapplicationstovisualplanning.WejointlylearnamodeltopredicttransitionsinMarkovdecisionprocesseswitharepresentationtrainedtobemaximallypredictable.Thisallowsustoaccuratelysearchthelatentspacedefinedbytherepresentationusingaheuristicgraphtraversalalgorithm.Wevalidateourmethodonaviewpoint-matchingtaskderivedfromtheNORBdataset,andwefindthatarepresentationthatisoptimizedjointlywiththepredictornetworkperformsbestinourexperiments.Acommonissueofunsupervisedrepresentationlearningisoneoftrivialsolutions:aconstantrepresentationwhichoptimallysolvestheunsupervisedoptimizationproblembutbringsacrossnoinformation.Toavoidthetrivialsolution,weconstrainthetrainingbyintroducingaspheringlayeroralosstermthatiseithercontrastiveorreconstructive.Anyoftheseapproacheswilldothejobofpreventingtherepresentationsfromcollapsingtoconstants,andnoneofthemdisplaysstrongerperformancethantheothersinourexperiments.OurLARPrepresentationiscompetitivewithpretrainedrepresentationsforplanningandcomparesfavorablytootherreinforcementlearning(RL)methods.Ourapproachisasoundsolutionforlearningausefulrepresentationthatissuitableforplanningonlyfrominteractions.Furthermore,wefindthatourmethodhasbetterdata-efficiencyduringtrainingthanseveralreinforcementlearningmethodsfromtheliterature.However,adisadvantageofourapproachcomparedtostandardRLmethodsisthattheexecutiontimeofourmethodscalesworsewiththesizeofthestatespace,asaforwardpassiscalculatedforeachnodeduringthelatentspacesearch,potentiallyresultinginacombinatorialexplosion.Ourapproachisadaptabletochangesinthetasks.Forexample,oursearchwouldonlybeslightlyhinderedifsomeobstacleswereplacedintheenvironmentorsomestateswereforbiddentotraversethrough.Furthermore,ourmethodisindependentofspecificrewardsordiscountrates,whilestandardRLmethodsareusuallyrestrictedintheiroptimizationproblems.Often,thereisachoicebetweenoptimizingdiscountedorundiscountedexpectedreturns.Simulation/rollout-basedplanningmethodsarenotrestrictedinthatsense:Ifrewardtrajectoriescanbepredicted,onecanoptimizearbitraryfunctionsoftheseandregularizebehavior.Forexample,arisk-averseportfoliomanagercanprioritizesmoothrewardtrajectoriesovervolatileones.Futurelinesofworkshouldinvestigatefurthertheeffectofthedifferentconstraintsontheend-to-endlearningofrepresentationssuitedforapredictiveforwardmodel,aswellasconsideringnovelones.Thesearchalgorithmcanbeimprovedandmadefaster,especiallyforhigher-dimensionalactionspacesorcontinuousones.OurnetworkcouldalsoinprinciplebeusedtotrainanRLsystem,forinstance,byencouragingittoproducesimilaroutputsasoursandtherebycombiningdata-efficiencywithfastperformanceduringinferencetime.

Chapter 5 Rewardpredictionforrepresentationlearningandrewardshaping

ThepreviouschapterintroducedtheLARPnetworkandshoweditsusefulnessinview-pointmatchingexperiments,incomparisontoreinforcementlearning(RL)methods,particularlythemodel-basedones.Thenetworklearnsstatefeaturesthataretailor-madeforapredictionmodule,whichisthenusedforgraph-basedsearch.Thelearningofthestaterepresentationandpredictorisdoneinaself-supervisedmanner,independentlyofarewardsignal.Themethodisdata-efficientandisabletospeedtolearningfornewtasksafterpre-trainingonasimilarone.However,themethodrequiresaforwardpassofthenetworkforeverynodethatisconsideredduringthegoalsearch,whichscalespoorlywiththesizeofthestatespacecomparedtomethodsthatonlyneedtoprocessthecurrentstate.Dependingonthebranchingfactoroftheenvironmentandthedetailsofthelatentrepresentationgraphsearch,thiscanresultinacombinatorialexplosion.Inthischapter111Thischapterisadaptedfrom(hlynsson2021reward).,weapproachtheproblemoflearningstaterepresentationsforRLfromanotherangle:insteadofpredictingtheresultsofactions,welearnarepresentationthatispredictiveofthelocaldistancefromthegoalinsingle-goalenvironments.Therepresentationislearnedalongsidearewardpredictorthatlearnstoestimateeitheraraworasmoothedversionofthetruerewardsignal.Weaugmentthetrainingofout-of-the-boxRLagentsbyshapingtherewardusingourrewardpredictorduringpolicylearning.Usingourrepresentationforpreprocessinghigh-dimensionalobservations,aswellasusingthepredictorforrewardshaping,isshowntosignificantlyenhanceActorCriticusingKronecker-factoredTrustRegionandProximalPolicyOptimizationinsingle-goalenvironmentswithvisualinputs.Theremainderofthechapterhasthefollowingstructure:WestartwithachapterintroductioninSection5.1,followedbyanexplanationofrequiredbackgroundknowledgeforthischapterinSection5.2.WegointorelatedworkinSection5.3,afterwhichthedetailsofourapproachisoutlinedinSection5.4.ThemethodologyofourexperimentsisexplainedinSection5.5.WedisplayanddiscusstheresultsofourexperimentsinSection5.6.Finally,weclosethechapterwithconcludingremarksinSection5.7.

5.1 Introduction

EventhoughthedominanceofhumansisbeingtestedbyRLagentsonnumerousfronts,therearestillgreatdifficultiesforthefieldtoovercome.Forinstance,thedatathatisrequiredforalgorithmstoreachhumanperformanceisonafarlargerscalethanthatneededbyhumans.Furthermore,thegeneralintelligenceofhumansremainsunchallenged.EventhoughanRLagenthasreachedsuperhumanperformanceinonefield,itsperformanceisusuallypoorwhenitistestedinnewareas.Thestudyofmethodstoovercometheproblemofdata-efficiencyandtransferabilityofRLagentsinenvironmentswheretheagentmustreachasinglegoalisthefocalpointofthiswork.Weconsiderasimplewayoflearningastaterepresentationbypredictingeitheraraworasmoothedversionofasparserewardyieldedbyanenvironment.Thetwoobjectives,learningastaterepresentationandpredictingthereward,aredirectlyconnectedaswetrainadeepneuralnetworkfortheprediction,andthehiddenlayersofthisnetworklearnareward-predictivestaterepresentation.Therewardsignaliscreatedbycollectingdatafromarelativelylownumberofinitialepisodesusingacontrollerthatactsrandomly.Therepresentationisthenextractedfromanintermediatelayerofthepredictionmodelandre-usedasgeneralpreprocessingforRLagents,toreducethedimensionalityofvisualinputs.Theagentprocessesinputscorrespondingtoitscurrentstateaswellasthedesiredendstate,whichisanalogoustomentallyvisualizingagoalbeforeattemptingtoreachit.Thisgeneralapproachofrelyingonstaterepresentations,thatarelearnedtopredicttherewardratherthanmaximizingit,hasbeenmotivatedintheliterature(lehnert2020reward)andweshowthatourrepresentationiswell-suitedforsingle-goalenvironments.Ourworkaddstotherecentlygrowingbodyofknowledgerelatedtodeepunsupervised(hlynsson2019learning)orself-supervised(schuler2018gradient)representationlearning.WealsoinvestigatetheeffectivenessofaugmentingtherewardforRLagents,whentherewardissparse,withanovelproblem-agnosticrewardshapingtechnique.Therewardpredictor,whichisusedtotrainourrepresentation,isnotonlyusedasapartofanauxiliarylossfunctiontolearnarepresentation,butitisalsousedduringtrainingtheRLsystemtoencouragetheagenttomoveclosertoagoallocation.SimilartoadvantagefunctionsintheRLliterature(schulman2015high),giventhetrainedrewardpredictor,theagentreceivesanadditionalrewardsignalifitmovesfromstateswithalowpredictedrewardtostateswithahigherpredictedreward.Wefindthisrewardaugmentationtobebeneficialforourtestenvironmentwiththelargeststate-space.

5.2 Background

5.2.1 Rewardshaping

Sparserewardsinenvironmentsisacommonproblemforreinforcementlearningagents.Theagent’sobjectiveistolearnhowtoassociateitsinputswithactionsthatleadtohighrewards,whichcanbealengthyprocessiftheagentonlyrarelyexperiencespositiveornegativerewards.Rewardshaping(mataric1994reward; ng1999policy; brys2015policy)isapopularmethodofmodifyingtherewardfunctionofanMDPtospeeduplearning.Itisusefulforenvironmentswithsparserewardstoaugmentthetrainingoftheagent,butskillfulapplicationsofrewardshapingcaninprincipleaidtheoptimizationforanyenvironment–althoughtheefficacyoftherewardshapingishighlydependentonthedetailsoftheimplementation(clark2016faulty).Inthelastfewyears,rewardshapinghasbeenshowntobeusefulforcomplexvideogameenvironments,suchasreal-timestrategygames(efthymiadis2013using)andplatformers(brys2014multi)andithasalsobeencombinedwithdeepneuralnetworkstoimproveagentsinfirst-personshootergames(lample2017playing).Asanillustration,considerlearningapolicyforcarracing.Ifthegoalistotrainanagenttodriveoptimally,thensupplyingitwithapositiverewardforreachingthefinishlinefirstisintheorysufficient.However,ifitispunishedforactionsthatareneverbeneficial,forinstancecrashingintowalls,itprioritizeslearningtoavoidsuchsituations,allowingittoexploremorepromisingpartsofthestatespace.Furthermore,justreachingthegoalisinsufficientifthereiscompetition.Tomakesurethatwehaveawinningracer,asmallnegativerewardcanbeintroducedateverytimesteptourgetheagenttoreachthefinishlinequickly.Notethatthedetailsoftherewardshapinginthisexamplerequiresdomainknowledgefromadesignerwhoisfamiliarwiththeenvironment.Itwouldbemoregenerallyusefuliftherewardshapingwouldbeautonomouslylearned,justasthepolicyoftheagent,asweproposetodointhiswork.

5.2.2 Reward-predictivevs.reward-maximizingrepresentations

lehnert2020rewardmakethedistinctionbetweenreward-maximizingrepresentationsarereward-predictiverepresentations.Theyarguehowreward-maximizingrepresentationscantransferpoorlytonewenvironments,whilereward-predictiverepresentationsgeneralizesuccessfully.TakethesimplegridworldnavigationenvironmentsinFig.5.1,forexample.Theagentstartsatarandomtileinthegridandgetsarewardof+1byreachingtherightmostcolumninEnvironmentAorbyreachingthemiddlecolumninEnvironmentB.ThestatespaceinEnvironmentAcanbecompressedfromthegridtoavectoroflength3,ofreward-predictiverepresentations.Topredictthediscountedreward,itsufficestodescribetheagent’sstatewithifitisinthethrow.

Fig 5.1: Reward-maximizingvs.reward-predictiverepresentations.Inthisgridworldexample,theagentstartstheepisodeatarandomlocationandcanmoveup,down,left,orright.Theepisodeendswitharewardof1andterminateswhentheagentreachestherightmostcolumn.Boththereward-predictiverepresentationandreward-maximizingrepresentationand,respectively,areusefulforlearningtheoptimalpolicyinEnvironmentA.Thereward-predictiverepresentationcollapseseachcolumnintoasinglestatetopredictthediscountedfuturereward.Thereward-maximizingrepresentationmakesnosuchdistinction,asmovingrightistheoptimalactioninanystate.ItisadifferentstoryiftherepresentationsaretransferredtoEnvironmentB,wherereachingthemiddlecolumnisnowthegoal.Therepresentationcanbereused,andtheoptimalpolicyisfoundiftheagentnowtakesastepleftin.However,therepresentationisunabletodiscriminatebetweenthedifferentstatesandisuselessfordeterminingtheoptimalpolicy.

Thereward-maximizingrepresentationforEnvironmentAismuchsimpler:thewholestatespacecanbecollapsedtoasingleelement,withtheoptimalpolicyofalwaysmovingtotheright.Iftheserepresentationsarekept,thenthereward-predictiverepresentationisinformativeenoughforaRLagenttolearnhowtosolveEnvironmentB.Thereward-maximizingrepresentationhasdiscardedtoomanydetailsoftheenvironmenttobeusefulforsolvingthisnewenvironment.

5.2.3 Successorfeatures

Thesuccessorrepresentationalgorithmlearnstwofunctions:theexpectedrewardreceivedaftertransitioningintoastate,aswellasthematrixofdiscountedexpectedfutureoccupancyofeachstate,assumingthattheagentstartsinagivenstateandfollowsaparticularpolicy.Knowingthequantitiesandallowsustorewritethevaluefunction:

(5.1)

Themotivationforthisalgorithmisthatitcombinesthespeedofmodel-freemethods,byenablingfastcomputationsofthevaluefunction,withtheflexibilityofmodel-basedmethodsforenvironmentswithchangingrewardcontingencies.Thismethodismadeforsmall,discreteenvironments,butithasbeengeneralizedforcontinuousenvironmentswithso-calledfeature-basedsuccessorrepresentations,orsuccessorfeatures(SFs)(barreto2016successor).TheSFalgorithmsimilarlycalculatesthediscountedexpectedrepresentationoffuturestates,giventheagenttakestheactioninthestateandfollowsapolicy:

(5.2)

whereissomestaterepresentation.BoththeSFandtherepresentationcanbedeepneuralnetworks.

5.3 Relatedwork

5.3.1 Reward-predictiverepresentations

lehnert2020rewardcomparesuccessorfeatures(SFs)toanonparametricBayesianpredictorthatistrainedtolearntransitionandrewardtablesfortheenvironment,eitherwithareward-maximizingorareward-predictivelossfunction.lehnert2020successorproveunderwhatconditionssuccessorfeatures(SFs)areeitherreward-predictiveorreward-maximizing(seedistinctioninSection5.2.2).TheyalsoshowthatSFsworksuccessfullyfortransferlearningbetweenenvironmentswithchangingrewardfunctionsandunchangedtransitionfunctions,buttheygeneralizepoorlybetweenenvironmentswherethetransitionfunctionchanges.Ourworkisdistinctfromthereward-predictivemethodsthattheycompare,asourrepresentationdoesnotneedtocalculateexpectedfuturestateoccupancy,asisthecaseforSFs.Ourmethodscalesbetterformorecomplicatedstate-spacesbecausewedonottabulatethestates,astheydowiththeirBayesianmodel,butlearnarbitrarycontinuousfeaturesofhigh-dimensionalinputdata.Inadditiontothat,learningourrewardpredictorisnotonlya"surrogate"objectivefunction,asweuseitforrewardshapingaswell.

5.3.2 Rewardshaping

Theadvantagesofrewardshapingarewellunderstoodintheliterature(mataric1994reward).ArecenttrendinRLresearchisthestudyofmethodsthatcanlearntherewardshapingfunctionautomatically,withouttheneedof(oftenfaulty)humanintervention.marashi2012automaticassumethattheenvironmentcanbeexpressedasagraphandthatthisgraphformulationisknown.Underthesestrongassumptions,theyperformgraphanalysistoextractarewardshapingfunction.Morerecently,zou2019rewardhaveproposedameta-learningalgorithmforpotential-basedautomaticrewardshaping.Ourapproachisdifferentfrompreviousworkasweassumenoknowledgeabouttheenvironmentandtrainasimplepredictortoapproximate(potentiallysmoothed)rewards,whichisthenusedtoconstructapotential-basedrewardshapingfunction.

5.3.3 Goal-conditionedreinforcementlearning

kaelbling1993learningstudiedenvironmentswithmultiplegoalsandsmallstate-spaces.Intheirproblemsetting,theagentmustreachaknownbutdynamicallychanginggoalinthefewestnumberofmoves.Theobservationspaceisofalowenoughdimensionfordynamicprogrammingtobesatisfactoryintheircase.schaul2015universalintroducetheUniversalValueFunctionApproximatorsandtackleenvironmentsoflargerdimensionsbylearningavaluefunctionneuralnetworkapproximatorthatacceptsboththecurrentstateandagoalstateastheinputs.Inasimilarvein,pathak2018zerolearnapolicythatisgivenacurrentstateandagoalstateandoutputsanactionthatbridgesthegapbetweenthem.hlynsson2020latentlearnapredictablerepresentationthatispairedwitharepresentationpredictorandcombineitwithgraphsearchtofindagivengoallocation.Incontrasttotheseapproaches,welearnareward-predictiverepresentationinaself-supervisedmanner,whichisusedtopreprocessrawinputsforRLpolicies.

5.4 Approach

Inthissection,weexplainourapproachmathematically.Intuitively,wetrainadeepneuralnetworktopredicteitheraraworasmoothedrewardsignalfromasingle-goalenvironment.Theoutputofanintermediatelayerinthenetworkisthenextractedastherepresentation–forexample,bysimplyremovingthetoplayersofthenetwork.Thefullrewardpredictornetworkisusedforrewardshapingbyrewardingtheagentformovingfromlowerpredictedvaluestowardhigherpredictedvaluesofthenetwork.

5.4.1 Learningtherepresentation

Supposethatisadifferentiablefunctionparameterizedbyandisapositiveinteger.WeusetoapproximatethediscountedreturninaPOMDPwithasparsereward:theagentreceivesarewardof0foreachtimestepexceptwhenitreachesagoallocation,atwhichpointitreceivesapositiverewardandtheepisodeterminates.Givenanexperiencebuffer,wecreateanewdataset.Thenewrewardsarecalculatedaccordingtotheequation

(5.3)

whereisadiscountfactorandisthedifferencebetweenandthetimestepindexofthefinaltransitioninthatepisode,forsomemaximumtimehorizon.Throughoutourexperiments,wekeepthevalueofthediscountfactorequaltoandwetrainonor.Assumethatourdifferentiablerepresentationfunctionisparameterizedbyandmapsthe-dimensionalrawobservationofthePOMDPtothe-dimensionalfeaturevector.Wetraintherepresentationforthediscounted-rewardpredictionbyminimizingthelossfunction

(5.4)

withrespecttotheparametersofandtheparametersofoverthewholedataset.SeeFig.5.2foraconceptualoverviewofourrepresentationlearning.

Fig 5.2: Learningandusingtherepresentation.Ourrepresentationandrewardpredictoristrainedwiththeelementshighlightedinblue.ThetrainedrepresentationisthenusedfordimensionalityreductionforanRLagent,thatinteractswiththeenvironment,asindicatedbytheelementshighlightedinred.

5.4.2 Rewardshaping

ng1999policydefinearewardshapingfunctionaspotential-basedifthereexistsafunctionsuchthatforallstatesthefollowingequationholds:

(5.5)

andistheMDP’sdiscountfactor.Theyproveforsingle-goalenvironmentsthateveryoptimalpolicyfortheMDPisalsooptimalforitsrewardshapedcounterpart,andviceversa.Theyalsoshow,foragivenstatespaceandactionspace,thatifisnotpotential-based,thenthereexistatransitionfunctionandarewardfunctionsuchthatnooptimalpolicyinisoptimalin.Decidingthattherewardshapingfunctionshouldbepotential-basedisjustthefirststepinitsdesign.Nowassumethatwehaveanenvironmentwhereanagentistaskedwithreachingagoalstate.Thatis,foragivendistancefunctiontheagentreceivesarewardof1ifitiscloseenoughtothegoallocation,,forsomerewardthreshold.Otherwise,itreceivesarewardof0.Thedistancebetweentheagent’slocationnewandthegoallocationcanbeausefulvaluetocalculateinthedesignofarewardshapingfunction

(5.6)

However,thisdependsontheenvironment,astheagentcouldgetstuckinlocaloptimabeforecomingclosetothegoal,i.e.ifitwouldhavetomovethrougharegionwithalargebeforeitcangloballyminimizeit.ThiscouldforexamplebethecaseinamazeenvironmentifistheEuclideandistancebetweenthecoordinatesoftheagent’slocationandthegoallocationandthereisawallbetweentheagentandthegoal.trott2019keepingproposetosolvethisbyincorporatingthepotentiallocaloptimaintherewardshapingfunctionasso-called"anti-goals"tobeavoided

(5.7)

Thesestatescanbehand-pickedbydomainexperts.However,addinganti-goalslikethiscoulditerativelyintroduceevenmorelocaloptimaandasolutiontotheoriginalproblemisnotguaranteed.Itisgenerallynottruethatthedistancefunctionandallthevariablesneededtocalculateit,suchasthecoordinatesoftheagentandthegoalinamaze,areavailabletotheagent.Evenifwerecomputable,usingitnaivelycanbringaboutitsownproblems,aswasalludedtoabove.WearguethatinsteadofusinginEquation 5.6,itwouldbebettertomeasurethedistancebetweentheagentandthegoalintermsofhowmanyactionstheagenthastotakeuntilthegoalisreached.Thisfunctionisnotassumedtobegiven,butitcanbeestimatedastheagentisbeingtrainedontheenvironment,forinstancebyoptimizingEquation 5.4.Additionally,wewouldlikeourrewardshapingfunctiontobepotential-based(Equation5.5)toreapthetheoreticaladvantages.Thus,weproposeapotential-basedrewardshapingfunctionbasedonthediscountedrewardpredictor

(5.8)

where,istherewardpredictorandisourrepresentationfromtheprevioussection.Notethatbothandareassumedtobefullytrainedbeforethepolicyoftheagentistrained,forexampleusingdatagatheredbyarandompolicy,buttheycaninprinciplealsobeupdatedasthepolicyisbeinglearned.Thefactorscalesdowntheintensityoftherewardshapingwhereisthenumberofepisodesthattheagenthasexperiencedandisthemaximumnumberofepisodeswheretheagentistrainedusingrewardshaping.Thestrengthoftherewardshapingisthehighestinthebeginningtocounteractpotentiallyadverseeffectsoferrorsintherewardpredictor.Itisalsomoreimportanttoincentivizemovingtowardthegeneraldirectionofthegoalintheearlystagesoflearning,afterwhichtheun-augmentedrewardsignaloftheenvironmentisallowedto"speakforitself"andguidethelearningoftheagenttowardthegoalprecisely.

5.5 Methodologyandimplementation

5.5.1 Environment

ThemethodistestedonthreedifferentgridworldenvironmentsbasedontheMinimalisticGridworldEnvironment(MiniGrid)(gym_minigrid).Tilescanbeempty,occupiedbyawalloroccupiedbylava.ThestructureoftheenvironmentsfitnaturallyintoourPOMDPtupletemplate(Eq.2.1):

  • Theconstituentstatesofaredeterminedbytheagent’slocationanddirection(facingnorth,west,southoreast).SeeFig.5.2(a)forthreedifferentworldstatesinoneofourenvironments.

  • Theactionspaceconsistsofthreeactions:(1)turnleft,(2)turnrightand(3)moveforward.

  • Thetransitionfunctionisdeterministic.Theagentrelocatestothetileitfacesifitmovesforwardandthetileisempty,andnothinghappensifthetileisoccupiedbyawall.Theepisodeterminatesifthetileisoccupiedbylavaorthegoal.Theagentrotatesinplaceifitturnsleftorright.

  • Reachingthegreentilegoalgivesarewardof,everyotheractiongives0points.Theenvironmentautomaticallytimesoutaftersteps.

  • Differsbetweenthethreeenvironments(seebelow).

  • Allsubsetoftiles,representedbyarrays,fromthepointofviewofanagentwhocannotseethroughwalls,seeFig.5.2(b).

  • Thepointofviewoftheagentfromitscurrentviewpoint(Fig.5.2(b))andagoalobservation(Fig.5.2(c)).

  • thediscountfactoris.

(a) Fullworldstates.
(b) Agent’spointofview.
(c) Goalobservations.
Fig 5.3: Two-roomenvironment.Theredagentmustreachthegreengoalinasfewstepsaspossible.Theagentstartseachepisodebetweenthetworooms,facingarandomdirection(up,down,leftorright).Eachcolumncorrespondstoasnapshotofoneepisode.Thelighttilescorrespondtowhattheagentsees,whilethedarktilesareunseenbytheagent.(a)Examplesofthefullstate(b) Theobservationfromtheagent’scurrentstate(c)Agoalobservation.Thisistheagent’spointofviewfromastatethatseparatestheagentfromthegoalbyoneaction.

Weconsiderthefollowingthreeenvironments:

Two-roomenvironment

Theworldisagridoftiles,splitintotworooms,wherewallsareplacedatdifferentlocationstofacilitatediscriminationbetweentheroomsfromtheagent’spointofview.(Fig.5.3).Theagentisplacedbetweenthetworooms,facingarandomdirection.Thegoalisatoneofthreepossiblelocations.Thisisamodifiedversionoftheclassicalfour-roomenvironmentlayout(sutton1999between).

Lavagapenvironment

Inthisenvironment,theagentisinaroomwithacolumnoflavaeitheroneortwospacesinfrontoftheagent(Fig.5.4)withagapinarandomrow.Theagentalwaysstartsintheupperleftcornerandthegoalisalwaysinthelowerrightcorner.

Fig 5.4: Thelavagapenvironment.Theredagentmustreachthegreengoalinasfewstepsaspossiblewhileavoidingtheorangelavatiles.Eachepisodeisrandomlyselectedtobeoneoftheeightpicturedconfigurations.Iftheagenttriestogothroughtheorangelavatile,itexperiencesanimmediateepisodeterminationwithnoreward.Notethatthewalltilesthatarelighterthantheothersarepresentlyintheagent’sfieldofvision.
Four-roomenvironment

Anexpansiontothetwo-roomenvironmentwithtwoadditionalrooms(Fig.5.5).Inthissetup,boththeagentandthegoallocationareplacedatrandomlocationswithinthegridworld.

Fig 5.5: Four-roomenvironment.Theagent(redtriangle)mustreachthegoal(greensquare)inasfewstepsaspossible,botharerandomlyplacedineachepisode.Thegridofhighlightedtilesinfrontoftheagentindicatesitsobservation.

5.5.2 Baselines

WecombineourrepresentationswithtwoRLalgorithmsasimplementedinStableBaselines(stable-baselines)usingthedefaulthyperparameters:

  • (ACKTR)ActorCriticusingKronecker-FactoredTrustRegion(wu2017scalable),whichcombinesactor-criticmethods,trust-regionoptimization,anddistributedKroneckerfactorizationtoenhancedata-efficiency.

  • (PPO2)AversionoftheProximalPolicyOptimizationalgorithm(schulman2017proximal).Itmodifiestheoriginalalgorithmbyusingclippedvaluefunctionsandanormalizedadvantage.

Forbothalgorithms,sixvariationsarecompared:

  • (DeepRL)TheRLalgorithmlearnstherepresentationfromscratchonrawimages

  • (SF)Theinputispreprocessedusingsuccessorfeatures

  • (Ours1r)Theinputispreprocessedusingourrepresentation,trainedonrawrewardpredictions

  • (Ours1r+Shaping)Theinputispreprocessedusingourrepresentationandtherewardisshaped,trainedonrawrewardpredictions

  • (Ours64r)Theinputispreprocessedusingourrepresentation,trainedonsmoothedrewardpredictions

  • (Ours64r+Shaping)Theinputispreprocessedusingourrepresentationandtherewardisshaped,trainedonsmoothedrewardpredictions

Carehasbeentakentoensurethateachvariationhasthesamearchitectureandthesamenumberofparameters.

5.5.3 Modelarchitectures

EverymodelisrealizedasaneuralnetworkusingKeras(chollet2015keras).Below,therepresentationandpolicynetworksareusedforourmethodandtheSFcomparison,therewardpredictionnetworkisusedonlyforourmethodandthedeepRLnetworkisusedonlyforthedeepRLcomparison,wheretheRLalgorithmalsolearnstherepresentation.Therepresentationnetworksaretwoconvolutionalnetworks(Table5.1)withainput,takingeithertheagent’scurrentobservationorthegoalobservation.

Layer Filters Filtersize Stride Padding Outputshape
Inputtensor - - - -
Convolution 8 3 None
ReLU - - - -
2Dmaxpooling 8 - None
Convolution 16 2 None
ReLU - - - -
Flatten - - - - 16
Dense - - - - 16
Table 5.1: Representationnetwork.

Thefirstlayersubsamplestheinput,keepingonlyeveryothercolumnandrow.Thisisfollowedby8filtersofsizewithastrideof3.ThisisfollowedwithaReLUactivationandamaxpoolinglayerwithastridevalueof2.Thepoolinglayer’soutputispassedtoalayerwith16convolutionalfiltersofsizeandastrideof2andaReLUactivationfunction.Theoutputisthenflattenedandpassedtoadenselayerwith16unitsandalinearactivation,definingthedimensionoftherepresentation.Nozeropaddingisappliedintheconvolutionallayersorthepoolinglayer.Thepolicynetworksarethree-layerfully-connectednetworks(Table5.2)acceptingtheconcatenatedoutputoftherepresentationnetworkfortheagent’scurrentpointofviewandthegoalobservationasaninput.Thefirsttwolayershave64unitsandaReLUactivation,andthelastlayerhas3unitsandalinearactivationfunction.Thethreeunitsrepresentthethreeactionsleft,right,andforwardinaonehotencoding.Winnertakesallisusedtodecideontheaction.

Layer Units Outputshape
Inputtensor - 32
Dense 64 64
ReLU - 64
Dense 64 3
ReLU - 3
Dense 3 3
Table 5.2: Policynetwork.

Ourrewardpredictionnetworkisathree-layerfully-connectednetwork(Table5.3)withthesameinputasthepolicynetwork:theconcatenatedrepresentationoftheagent’scurrentviewandthegoalobservation.Thefirsttwolayershave256unitsandaReLUactivation,butthelastlayerhas1unitandalogisticactivationfunction.

Layer Units Outputshape
Inputtensor - 32
Dense 256 64
ReLU - 64
Dense 256 3
ReLU - 3
Dense 1 3
Logistic - 3
Table 5.3: Rewardpredictionnetwork.

ThedeepRLnetworkstackstherepresentationnetworkandthepolicynetworkontopofeachother.Therepresentationnetworkacceptstheinputandoutputsthelow-dimensionalrepresentationtothepolicynetworkthatoutputstheactionscores.

5.5.4 Trainingtherepresentationandpredictornetworks

Wecollectadatasetofthousandtransitionsbyfollowingarandompolicyinthetwo-roomenvironment.Forthisdatacollection,eachepisodehasachancetohavethegoallocationinthebottomroomorontheleftsideofthetoproom(seetheleftandmiddlepicturesinFig.5.2(a)).Therewardpredictorandtherepresentationaretrainedinthismannerforallexperiments,includingthelavagapandthefour-roomenvironment.Thus,weusearepresentationandrewardpredictorthathaveneverseenlava.Fortheexperimentswithsmoothedrewards,thesparserewardassociatedwiththeobservationsinthedatasetisaugmentedbyassociatinganewrewardtothestatesleadingtoobservationswithapositivereward,accordingtoEquation5.3,withadiscountfactorof.Additionally,aftertherewardhasbeen(potentially)smoothedinthisway,observationsassociatedwithapositiverewardareoversampledtimestobalancethedataset,regardlessofwhethertherewardhasbeenaugmentedornot.

5.6 Resultsanddiscussion

Intheexperiments,wecompareRLagentsthatlearntheirrepresentationsfromscratch(DeepRL)toagentsthatpreprocesstheirinputswithdifferentrepresentations.Wecompareourrepresentation,trainedonrawrewardpredictions–with(Ours1r)orwithoutrewardshaping(Ours1r+Shaping)–toourmethodtrainedonsmoothedrewardprediction,alsowith(Ours64r)orwithoutrewardshaping(Ours64r+Shaping).Weuse"64r"todenotethatourmethodwastrainedwithrewardshapingand"1r"todenotethatourmethodwastrainedwithoutaugmentation.Asabaseline,wecompareourrepresentationtoareward-predictiverepresentationfromtheliterature,SuccessorFeatures(SFs).

5.6.1 Two-roomenvironment

Westartbyvisualizingtheoutputsofourrewardpredictorintherooms,dependingonthegoallocation,inFig.5.6.Eachsquareindicatestheaveragepredictedrewardfortransitioningtothecorrespondingtileintheroom.Thepredictedrewardspikesinanarrowregionaroundthetwogoallocationsthatwereusedtotraintherawrewardpredictor(Fig.5.5(a)),buttheareaofstateswithhighpredictedrewardsiswideraroundthetestgoal.Thisdifferenceisduetooverfittingonthespecifictrainingpathsthatweremorefrequentlytakentowardtherespectivegoals,butthisdoesnotharmthegeneralizationcapabilitiesofthenetwork.Thepeakynessofthepredictionsdisappearswhenthepredictoristrainedonthesmoothedrewards(Fig.5.5(b)).However,higherpredictedrewardsinthecorneroftheotherroomappear.Bothscenarios,rawandsmoothedrewardprediction,showpromisefortheapplicationofrewardshapingunderourtrainingscheme,astheagentwouldbenefitfromfindingneighborhoodswithhighervaluesofpredictedrewarduntilitreachesthegoal,insteadofhavingtorelysolelyonasparserewardthatisonlygivenwhentheagentlandsexactlyonthegoalstate.

(a) Rawrewardprediction.
(b) Smoothedrewardprediction.
Fig 5.6: Predictedrewards,two-roomenvironment.Thepredictoristrainedonthesetupsshownontheleftandinthemiddle,andtestedforthesetupontheright.Thecolorbecomeswarmerforstateswheretherewardispredictedtobehigher.

InFigure5.7,weillustratethevarianceofthemeanreward(leftside)andthevarianceoftheoptimalperformance(rightside)ofthedifferentmethods,asafunctionofthetimestepstakenfortraining.Weaverageover10runsandineachrunweperform10testrollouts,soeachpointistheaggregateof100episodesintotal.222Notethatthestandarddeviationofthemeanrewardofallepisodes,fromallrunsputtogether,isapproximatelyhigherthanthestandarddeviationofthemeanofthemeansorthemeanofthemins.Theerrorbandsindicatetwostandarddeviations.ThismethodologyofgeneratingtheplotsalsoappliestoFig.5.9,Fig.5.10andFig.5.12.ThelearningcurvesofbothACKTRandPPO2getclosetothehighestachievablemeanrewardofthefastestusingourrepresentations.TherenosignificantbenefitfromusingsmoothedrewardshapingforACKTR,andtherawrewardshapingisinfactharmfulinthiscase.ForPPO2,theagentusingourrepresentationthatistrainedonrawrewardpredictionslearnsthefastest.RegulardeepRL,wheretherepresentationsarelearnedfromscratch,isclearlyoutperformedbythevariantsthatusereward-predictiverepresentations.WebelievethatthisisbecauseRLagentscangenerallybenefitfromtheinputbeingpreprocessed,asthecomputationaloverheadforlearningthepolicyisreduced.Thiseffectisenhancedwhenthepreprocessingisgood,whichisthecaseforourreward-predictiverepresentation:itabstractsawayunnecessaryinformationasitistrainedtooutputfeaturesthatindicatethedistancebetweentheagentandthegoal,whenthegoalisinview.Thedifferenceinaggregatedmeanrewardsvs.aggregatedminimumepisodelengthscanbeexplainedduetosystematicallydifferentbehaviors.Forexample,anagentmighthaveaweaklong-termstrategyofcheckingthedifferentrooms,givingitpooraveragemeanrewards,butastrongshort-termtacticoftakingthedirectcoursetothegoalwhenitseesit,givingitagoodaverageminimumepisodelength.

Fig 5.7: Two-roomenvironment.Intheseexperiments,thereareonlytworoomsandtheagentmustreachagoalthatisalwaysatthesamelocation.Theagentcantraversebetweentheroomsandstartseachepisodebetweenthem,facingarandomdirection.Theleftsideshowsthemeanofeveryagent’smeanrewardandtherightsideshowsthemeanofeveryagent’sminimumepisodelength.

5.6.2 Lavagapenvironment

Learningfromscratch

TheheatmapsofaveragepredictedrewardsarevisualizedinFig.5.8.Therewardpredictorwastrainedonthetwo-roomenvironment.Thetilesclosesttothegoalhavethehighestvalues,withaparticularlysmoothgradienttowardthegoalforthesmoothed-rewardpredictor,whichdemonstratesthatthereispotentialgainfromtransferringtheprediction-basedrewardshapingbetweensimilarenvironments.ThelearningperformanceofthedifferentmethodscanbeseeninFig.5.9.Thedecidedlyfastestlearningcanbeobservedwhentheactor-criticmethodiscombinedwithourrepresentation,trainedonrawrewardpredictionsandwithoutrewardshaping.RegulardeepRListhesecond-best,butwithaverylargevarianceontheperformance.OurrewardshapingvariationsandtheSFsareverycloseinperformance,albeitsignificantlyworsethantheothertwo.Thepoorperformanceofrewardshapingcanbeexplainedbythefactthatthereareveryfewstates,whichmakestherewardshapingunnecessaryinsuchasimpleenvironment.AllthemethodslookmoresimilarwhenPPO2optimizationisapplied,withrespecttothemeanrewards,butourvariantthatistrainedonsmoothrewardpredictionandusesrewardshapingreachesthehighestaverageperformanceinthelastiterations.

Fig 5.8: Predictedrewards,lavagap.Averagepredictedrewardperstateinthelavagapenvironment.
Fig 5.9: Lavagapexperiment.Allpoliciesarerandomlyinitializedandlearntosolvethelavagapenvironmentfromscratch.TherepresentationsinallmethodsexceptforDeepRLarelearnedonthetraininggoalsinthetwo-roomenvironment(seeFig.5.6).Theleftsideshowsthemeanofeveryagent’smeanreward,andtherightsideshowsthemeanofeveryagent’sminimumepisodelength.
Transferlearning

Toinvestigatehowthemethodscompareforadaptingtonewenvironments,wetrainedthepoliciesfor8000stepsonthetwo-roomenvironmentbeforelearningtosolvethelavagapenvironment,seeFig.5.10.Ourmethod,withoutrewardshaping,facilitatesthefastestlearningforACKTRinthiscase.DeepRListhemostseverelyaffectedbythischange,whichisprobablyduetothemethodlearningareward-maximizingrepresentationinoneenvironmentthatdoesnottransferwelltoanotherenvironment.EveryPPO2variationlooksbadforthisscenario,butthesmooth-rewardpredictionrepresentationwithrewardshapinghasthehighestmeanrewardandourraw-rewardpredictionrepresentationhasthelowestaverageminimumepisodelength.

Fig 5.10: Re-learningexperimentThedifferentmethodsaretrainedforeightthousandtrainingstepsonthetwo-roomenvironmentbeforebeingtrainedonthelavagapenvironment.Thecurvesshowthemeanrewardonthelavagapenvironment.Theleftsideshowsthemeanofeveryagent’smeanreward,andtherightsideshowsthemeanofeveryagent’sminimumepisodelength.

Wevisualizetrajectoriesofanagentthatistrainedonourrepresentation(Ours1r)asittraversesthelavagapsenvironment(Fig.5.11).Forinspectionofcaseswhereitfails,wechooseanagentthathasbeentrainedforthousandtimestepsonlyandhasaroundmeanreward.

(a) Sixsuccessfulepisodes.
(b) Sixfailedepisodes.
Fig 5.11: Lavagaptrajectories.Avisualizationofsixsuccessfulandsixfailedtrajectoriesbyanagentthatwastraineduptoapproximatelysuccessrateusingourrepresentation.Thecolorgradientgoesfromthefirstactionstakenintheepisodeinbluetothelastactionsinred.Notethatrotationsin-placearenotvisualized,butonlytransitionsbetweentiles.

5.6.3 Four-roomenvironment

Inourfinalcomparison,weaddtwoadditionalroomstothetwo-roomenvironmentandrandomizeboththegoallocationandthestartingpositionoftheagent,withtheresultsshowninFig.5.12.Lookingattheminimumepisodelengths,fortheACKTRlearner,ourraw-rewardpredictionrepresentationwithrewardshapingperformsbestandtheonewithoutrewardshapingcomesinsecond.ThereislittlediscernibledifferencebetweentheperformanceofSFsandDeepRL,buttheybothperformsignificantlyworsethanourmethods.Thescaleofthemeanrewardisagreatdeallowerthaninthepreviousexperiments,sincetheaveragedistancebetweenthestartingtileoftheagentandthegoalismuchlargerthanintheprevioustwoenvironments.Forthisscenario,allthemethodslooksimilarlybadforthePPO2policy,exceptforourraw-rewardrepresentations,withrewardshaping,whichhasthelowestminimumepisodelength.Thebigadvantageofrewardshapinginthisenvironmentcomparedtothetwo-roomenvironmentcanbeexplainedbytheincreasedcomplexity,makingtherewardshapingmorehelpfulinguidingtheagent’ssearch.Inthepreviousexperiments,theagentandgoallocationsstartatfixedlocations,allowingtheagentstosolveitbyrotememorization.Therewardshapingfunctioncalculatedbytheraw-rewardpredictorfaressignificantlybetterinthissituation.Wehypothesizethatthisisduetothesmoothed-rewardpredictordistractingtheagentbypushingittocorners,asthevisualizationinFig.5.5(b)wouldsuggest.Therewardshapinggivenbytheraw-rewardpredictorismorediscriminative,asweseeinFig.5.5(a).Theagentreceivesapositiverewardassoonasthegoalreachesitspointofview,whichisanylocationuptosixtilesinfrontofitandnofurtherthan3tilesawayfromittotheleftortotheright.Thisallowstherewardshapingfunctiontoguidetheagentdirectlytothegoal,assumingthattheyareinthesameroomandthatthereisnowallobstructingtheagent’sfieldofvision.

Fig 5.12: Fullfour-roomenvironment.Theagentandgoalareplacedatrandomlocationsatthestartofeachepisode.Theleftsideshowsthemeanofeveryagent’smeanreward,andtherightsideshowsthemeanofeveryagent’sminimumepisodelength.

ThreesuccessfulandthreefailedtrajectoriesofanACKTRagentthathasbeentrained,usingourrepresentation(Ours1r)foramilliontimestepsarevisualizedinFig.5.13.Wecanseeundesirablebehaviorinboththesuccessfulandthefailedtrajectories,thattheagentwasteseffortre-visitingtilesithasalreadybeento.

(a) Threesuccessfultrajectories
(b) Threefailedtrajectories
Fig 5.13: Four-roomtrajectories.TheagentwastrainedonourrepresentationswithActorCriticusingKronecker-FactoredTrustRegionforamilliontimesteps.

5.7 Conclusion

Processinghigh-dimensionalinputsforreinforcementlearning(RL)agentsremainsadifficultproblem,especiallyiftheagentmustrelyonasparserewardsignaltoguideitsrepresentationlearning.Inthiswork,weputforwardamethodtohelpalleviatethisproblemwithamethodoflearningrepresentationsthatpreprocessesvisualinputsforRLmethods.Ourcontributionsare(i) areward-predictiverepresentationthatistrainedsimultaneouslywitharewardpredictorand(ii) arewardshapingtechniqueusingthistrainedpredictor.Thepredictorlearnstoapproximateeithertherawrewardsignalorasmoothedversionofit,anditisusedforrewardshapingbyencouragingtheagenttotransitiontostateswithhigherpredictedrewards.Weusedaviewofthegoalasasecondinputforthemethodsinourexperiments,butthisisinprinciplenotnecessary,asmovingtowardthegreentileasitbecomesvisibleissufficient.Removingthegoalinputmightencouragetheagentstolearnpoliciesthatscanalltheroomsfasteruntilthegoalreachesitsfieldofvision.Wehaveshowntheusefulnessofourrepresentationandourrewardshapingschemeinaseriesofgridworldexperiments,wheretheagentreceivesahigh-dimensionalobservationofitsgoalasaninputalongwithanobservationofitsimmediatesurroundings.Preprocessingtheinputusingthisrepresentationspeedsupthetrainingoftwoout-of-the-boxRLmethods,ActorCriticusingKronecker-FactoredTrustRegionandProximalPolicyOptimization,comparedtohavingthesemethodslearntherepresentationsfromscratch.Inourmostcomplicatedexperiment,combiningourrepresentationwithourrewardshapingtechniqueisshowntoperformsignificantlybetterthanthevanillaRLmethods,whichhintsatitspotentialforsuccess,especiallyinmorecomplexRLscenarios.

Chapter 6 Comparisonofourthreemethods

WecompareinthischapterthealgorithmsthatwedevelopedoverthecourseofthePhDwork:thegradient-basedICArepresentation(GrICA)fromChapter3,thelatentrepresentationpredictionnetwork(LARP)fromChapter4andthereward-predictionrepresentationfromChapter5(RewPred).Weusetherepresentationscalculatedbythesemethodstopreprocessinputsfordeepreinforcementlearning(DeepRL)agentsonfourdifferenttasks:acart-polebalancingenvironment,thetwo-roomandfour-roomgoal-findingenvironmentsfromthepreviouschapterandanobstacleavoidancetask.Theimplementationofourmethodsisthesameasdescribedintheexperimentalsectionoftheircorrespondingchapter,unlessspecifiedotherwise.InSection6.2weintroducethenewvisualcart-poleenvironmentanddiscussthenetworkarchitectures.InSection6.3wedescribeanddiscusstheresultsoftheexperiments.

6.1 Introduction

AseachmethodwasintroducedoverthecourseofthePhDwork,itwasonlycomparedtoothersimilarmethodsintheliterature.Wenowtaketheopportunitytocomparethemagainsteachotherindifferentenvironments.Thecart-poleenvironmentandtheobstacleavoidanceenvironmenthavethecommonalitythattheybothrequireonlyreactiveshort-termplanningtomaximizethereward,butthegoal-findingenvironmentsrequiremorelong-termplanningtofindandreachthegoal.EveryoneofourrepresentationsisusedforpreprocessingthevisualobservationsofeachenvironmentforRLalgorithms.TheRLagentsarethesameonesusedinthepreviouschapter:ACKTRandPPO2,withthedefaultparametersfromtheStableBaselinespackage.Aswewerenotabletosolvethecart-poleenvironmentusingACKTR,onlyPPO2isusedforthattask.WecomparethemethodfromChapter4,LARP,inthismanner,eventhoughitisnottrainedforpreprocessinginputsformodel-freeagentsbutisdevelopedtobepairedwithapredictorforpredictivestaterepresentationrollouts

6.2 Methods

6.2.1 Visualcart-pole

WestartbycomparingourmethodsasvisualpreprocessingforaPPO2policyonavariantoftheclassicalcart-polebalancingtask.Apoleisattachedtoacartthatmovesalongafrictionlesstrack.Thegoalistokeepthepoleuprightforaslongaspossible,buttheagentmustpushthecarttotheleftortotherightateverytimestep–choosingtoremaininplaceisnotanoption.Theoriginalenvironmentsuppliesfourscalarvariablestotheagent:thecartpositionandvelocityandthepoleangleandangularvelocity.Theagentreceivesapositiverewardofforeachtransition.Insteadofusingthesescalars,weadaptthistasktoourparadigmbyusingavisualizationoftheenvironment,seeFig.6.1.Tobeginwith,weextracttherenderingoftheenvironmentthatismeantfordebuggingandvisualizationpurposes.Aswecanapproximatethecartpositionandvelocityandthepoleangleandangularvelocityusingtwoadjacentframes,weusethecurrentandpreviousobservations.Tosimplifytheinputfortheagent,wetakethedifferenceofthetwoframes,croptheimagearoundthecartandbinarizeit.Theonlyinformationthatislostinthisprocess,comparedtotheoriginalenvironment,isthepositionofthecart,whichisnotsoimportantforsolvingthetask.Inadditiontogivingtheagentapositiverewardofforeverytimestepitkeepsthepolebalanced,wealsosupplyitwithanegativerewardofwhenthepolefallsdown.

Fig 6.1: cart-poleprocessingpipeline.Wereplacetheoriginal4-dimensionalstatespaceoftheOpenAIgymcart-poleenvironmentwiththeoutputofitsrenderingfunction.Thepreviousframeissubtractedfromthecurrentframe,thenon-zeropixelvaluesarecentered,croppedandthefinalimageisconvertedtobinary.

Ourmethodsaretrainedon5000transitionsthatarecollectedfromarandompolicy.Eventhoughthisisnotasparse-rewardenvironment,wecalculatednewrewardsfortheRewPrednetworkaccordingtoEquation5.3withadiscountfactorofandamaximumhorizonof.Thishastheeffectthateverytimestepisassociatedwitharewardof,exceptforthesixtransitionsleadinguptotheendoftheepisode,whichhavetherewardsand.

6.2.2 Roomenvironments

WecomparetheperformanceofPPO2andACKTRpoliciesastheyuseourmethodsforpreprocessingonthetwo-roomandfour-roomgoaltasksfromthepreviouschapter,seeSection5.5.1.ThedatagatheringfortherepresentationsusedforpreprocessingfortheRLagentsisunchangedfromthepreviouschapter.

6.2.3 Obstacleavoidance

Inthisgridworldenvironment,theagentisplacedinaroomfilledwithcircularobjectsthatmoverandomlyineachstep(Fig.6.2).

Fig 6.2: Obstacleavoidanceenvironment.Theagentisrewardedformaximizingthedistancebetweenitselfandtheclosestcircle.Theleftfiguredisplaysthefullworldstateandtherightfiguredisplaysthecorrespondingobservationthattheagentreceives.

Theagentreceivesvisualinputsofdimensionandmuststayasfarawayfromtheobjectsaspossible,asitreceivesarewardthatisequaltotheEuclideandistancebetweenitselfandtheclosestcircle.Theepisodeendswitharewardofaftertimestepsorwhentheagentcollideswithacircle,whichgivesarewardof.Thediscountfactorfortheenvironmentis0.9.

6.2.4 Architectures

Allrepresentationsandpolicieshavethesamearchitectureastheinthepreviouschapter,exceptthattheoutputoftherepresentationhasbeenloweredtoa16-dimensionalvectorforlowercomputationaltimes,seeTable5.1andTable5.2.TheRLmodelshavethesamearchitectureexceptthattherepresentationandpolicymodulesaretrainedend-to-end.ThepredictionmoduleoftheLARPnetworkisthesameasinChapter4,seeTable4.3.ThemutualinformationneuralestimatornetworkusedfortrainingGrICAisthesameasinChapter3,seeSection3.4.

6.3 Resultsanddiscussion

6.3.1 Visualcart-pole

Theresultsfromthevisualcart-poleexperimentcanbeseeninFig.6.3.ThedeepRLagentachievesthehighestmeanrewardbyasignificantmargin,probablybecausetheenvironmentrequirestheagenttotakequickactionsinresponsetothefast-changingenvironment.

Fig 6.3: cart-poleresults.Eachpointistheaggregateof5differentpoliciesthataretrainedfromscratchandtestedfor20episodes,each.

ThelearningcurveistheworstwhentheagentistrainedontheLARPrepresentations,whichistobeexpectedasitislearnedtobepairedwitharepresentationpredictorandgraphsearch,whichitisnotusedforthisscenario.TheRewPredandGrICAlearningcurveslooksimilarlygood,withtheRewPredrepresentationachievingaslightlyhigheraveragereward.Bothrepresentationsshouldtheoreticallybeusefulhere:learningarepresentationthatispredictiveofhowclosethepoleisfromfallingdown(RewPred)oughttoguidetheagent’sactionsawayfromstateswherethepoleisabouttofalldown,andrecoveringthethreestatisticallyindependentlatentvariablesgeneratingtheenvironment(GrICA),whichwouldbeaperfectcompressionoftheenvironment.Inthisscenario,however,theydonotbeattherepresentationlearningoftheDeepRLalgorithm.

6.3.2 Roomgoal-finding

Theresultsfortrainingonthesmooth-rewardpredictionRewPred111Denoted"Ours64r"inthepreviouschapter.andDeepRLrepresentationsarekeptintheplotsfromthepreviouschapter,andwehaveaddedtheresultsfortheGrICAandtheLARPrepresentations.

Two-roomenvironment

Theresultsforthegoal-findingtask,whentheagentstartsfacingarandomdirectionbetweentworoomsandmustlocateastaticgoal,canbeseeninFig.6.4.Wedisplayheretheversionofthereward-predictiveRewPredrepresentationfromthepreviouschapterthatistrainedonsmoothedrewards.

Fig 6.4: Gridworldcomparison:two-roomgoal-findingresults.Theagentstartsinbetweentworoomsandgetarewardfromreachingastaticgoallocationthatisoneoftworooms.Eachpointinthelearningcurveisaggregatedfrom10initializationsofpoliciesthatrun10testepisodeseach.

FortheACKTRpolicy,boththeLARPandtheGrICAlearningcurves222SeethepreviouschapterfortheRewPredresultsareconsiderablylowerthanthoseoftheothertwo,withtheGrICArepresentationagainlookingslightlybetteroutofthetwo,bothwhenthemeanrewardandtheminimumepisodelengthsareconsidered.ThisisprobablyalsoduetothefactthattheadvantageoftheLARPrepresentationdisappearswhenthepredictorthatitistrainedwithisdiscardedandtherepresentationisbeingusedinamodel-freesetting.ForthePPO2policy,neitherthepoliciesthattrainontheLARPnortheGrICArepresentationsshowsignsoflearningtosolvetheenvironment.

Four-roomenvironment

Inthisexperiment,nocombinationofRLagentandrepresentationshowsprogresstowardsolvingtheenvironment,exceptforRewPredpairedwithACKTR.BoththeLARPandGrICAvariantsbarelydisplaylearninginthepreviousexperiment,butallsignsofprogressdisappearwhentheenvironmentissufficientlycomplex.ThetrainingresultscanbeseeninFig.6.5.

Fig 6.5: Four-roomgoal-findingresults.Theagentstartsatarandomlocationgetarewardfromreachingadynamicgoallocationthatisinoneoffourrooms.Eachpointinthelearningcurveisaggregatedfrom10initializationsofpoliciesthatrun10testepisodeseach.

6.3.3 Obstacleavoidance

Thealgorithm’slearningcurvesfortheobstacleavoidanceexperimentcanbeseeninFig.6.6.ForbothPPO2andACKTR,weobservethatpreprocessingwithourmethodsisnotbeneficialastheenvironmentissolvedthefastestusingdeepRL,andbyasignificantmarginforACKTR.ThisshowsthatourmethodisoutperformedbyregulardeepRLforenvironmentswhereshort-term,reactivepoliciesareneeded–incontrasttothegoal-findingtasks,wheretheagentneedstoexecutealong-termplan.

Fig 6.6: Obstacleavoidanceresults.Theagentgetsarewardforkeepingitsdistancefromobjectsthatmoverandomly.Eachpointinthelearningcurveisaggregatedfrom10initializationsofpoliciesthatrun20testepisodeseach.

6.4 Conclusion

Inthischapter,weinvestigatedhowthethreemethods,thatwehavedevelopedinthisPhDwork,comparewhentheyareusedforpreprocessingvisualinputsforRLagents.TheRLagentsaretestedinfourdifferentenvironments,twoofwhichrequireshort-termdecision-makingandtheothertworequirelong-termdecision-makingforsuccess.OurrepresentationthatwasdevelopedinChapter3,GrICA,doesnotfacilitatelearningforanyoftheenvironments.ThesameholdstruefortheChapter4representation,LARP,althoughthatonewasnotdevelopedforpreprocessinginputstoRLagentsbutrathertobeusedinconjunctionwithapredictionfunctionforplanninginalatentrepresentationspace.Ourreward-predictiverepresentationfromChapter5isshowntospeeduplearningforadeepRLagentonthelong-termplanningtasks.

Chapter 7 Summaryandconclusion

Justasthecomplexityofthetasksthatdeepreinforcementlearning(RL)isbeingappliedtoincreases,sodoestheapparentnumberofproblemsthefieldhas.ThegoalofthisresearchwastodiscoverusefulrepresentationsthatcanbeusedtoalleviatetwoofmoderndeepRL’sproblems,namely,thoseoftraininginstabilityanddatainefficiency.OverthecourseofthePhDwork,wedevelopedthreerepresentationlearningmethodsandinvestigatedtheirsuitabilityforfulfillingthisgoal:

  • Agradient-basedICAmethodforlearningstatisticallyindependentfeatures(Chapter3).Weestimatethemutualinformationbetweenoneoutputcomponentofanencoderandalltheothersusinganeuralnetwork.Inapush-pullfashion,themutualinformationestimatorandtheencoderaretrained,resultinginasystemthatoutputsstatisticallyindependentfeaturesoftheinput.TheoutputofGrICAiscomparetotheoutputofFastICA,anestablishedICAproblem-solvingmethodintheliterature,fornoisyblindsignalseparation.

  • Thissystem,whichwecall"LatentRepresentationPredictor"(Chapter4),learnsatransitionmodeloftheenvironment.Weevaluateoursystembycombiningitwithgraphsearchtomanipulatetoyobjectstomatchagivenviewpoint.Ouralgorithmlearnsastaterepresentationjointlywithaone-steplookaheadpredictor.Wediscussandcomparethreedifferentconstraintsthatcanbeplacedonthesystemtopreventthesolutionfromcollapsingtoaconstantfunction.OurapproachoutperformsdeepRLinalow-dataregimeontheviewpoint-matchingtask.

  • Areward-predictiverepresentation,thatislearnedalongwithajointlylearnedrewardpredictor(Chapter5).Therewardpredictorisemployedforrewardshaping:Theagentisrewardedformovingfromstatesoflowpredictedrewardstostatesofhigherpredictedrewards.Themethodistestedinseveralgridworldenvironmentswheretheagentmustreachagoal.ThelearningofdeepRLagentsisspedupwhentheirinputsarepreprocessedbytheRewPredrepresentationinourexperimentsandtherewardshapingishelpfulwhentheenvironmentissufficientlycomplex.

NoteveryunsupervisedlearningtechniqueyieldsastaterepresentationthatisusefulinthecontextofRL,asseenbytheresultsacrosstheboardusingourGrICAalgorithm.However,basedonourresults,wehavefoundthatthetrainingofdeepRLmethodscanbeaugmentedbylearningappropriaterepresentationsinaself-supervisedmannerinenvironmentswheretheagentmustcarryoutlong-termplanningtoreachasinglegoalstate.AnimportantlimitationofthisworkisthehighcostofcomputationalresourcesrequiredforcarryingoutRLresearch.Toillustrate,reproducingDeepMind’s2017Gopaper(silver2017mastering)isestimatedtocost35milliondollarsusingGoogle’scloudcomputingservice(dan2018how)–abudgetcurrentlyunavailabletoPhDstudents.Reproducingtheirresultsonlyinvolvesmaximizingtheperformanceofasinglemodel,whichdoesnottakeintoaccounttheworkbehindfindingthefinalmodel.RLmethodshavemanypotentialsettingstotune,andrunninghyperparametersearchesiscostly.Forthisreason,weconcentratedonlyonusingthedefaultparametersofRLmodelimplementationsasofferedbytheStableBaselinespackage.Tomakemattersworse,theperformanceofRLtechniquescanbehighlysensitivetothehyperparametersettings,andbadluckcancausetheresearchertodismissamethodifunfortunatevaluesareinitiallychosen.Alargercomputationalbudgetallowsresearcherstotrymorecombinationsofparameterswhennewmethodsaretested,lesseningtheriskofpromisingapproachesbeingprematurelydismissedduetobadluck.ThemethodswerealsotestedforpreprocessingthevisualinputsofdeepRLalgorithmsinenvironmentswherethereisasetofstatethattheagentmustavoid,namely,preventingapolefromfallingoffacartorpreventingtheagentfromcollidingwithrandomlymovingobjects.Preprocessingtheinputsusingourmethodsisnotbeneficialinthesecases,potentiallyindicatingthatallowingdeepRLalgorithmstolearnthestaterepresentationsispreferablefortasksthatrequireimmediatereactionstoquicklychangingenvironments,particularlywhereinactionleadstoanundesirableoutcome.Althoughourgradient-basedICAwasnotfoundadvantageousinourgoalofaugmentingdeepRLtraining,itissuccessfulinrecoveringindependent,noisysourcesjustaswellasFastICA.Apromisingavenueofresearchistotakeadvantageoftheflexibilityofferedbyourmethod.Ouralgorithmcanbepairedwithanydifferentiablefunction,suchasconvolutionalneuralnetworks(convnets),totacklemoredifficultproblems,forexamplenonlinearICA.EventhoughgeneralnonlinearICAproblemsareill-posed,regularizationcanbeappliedtomakethemwell-posed.Thiscaninprinciplebedoneusingourmethodviathedesignoftheneuralnetwork.Oneaspectofrepresentationlearningthathasbecomepopularinrecentyears,butwasnotinvestigatedinthisPhDwork,istheapplicationofmemorymodulesinneuralnetworks.Theroleofmemorycaneasilybeintegratedintoourmethodsbydesigningourfunctionapproximatorsasartificialrecurrentneuralnetworks,forexamplebycombininglongshort-termmemorynetworkswithconvnets.AlthoughtheproblemoflearningrepresentationsinthecontextofRLremainsunsolved,theimplicationofourworkisthattakingadvantageoftherichunsupervisedsourceofsupervisionthatishiddeninRLenvironmentscanleadtodata-efficient,stablealgorithmsthatareresilienttochangesintheenvironment.AstraightforwardwayofextendingourworkwouldbetotakeadvantageofthefulldatathatisgivenineverytransitioninanRLenvironment:thecurrentstate,theactiontakeninthestate,theresultingnextstateandtherewardgiventotheagentbytheenvironment.ItwouldbeaninterestingfuturelineofworktodothisbysimplycombiningthelossfunctionofLARP,whichtrainsontuples,andRewPred,whichtrainsontuples.Thisnewalgorithmwouldtraintherepresentationtobesimultaneouslypredictableforarepresentationpredictor,butalsotobeusefulforarewardpredictor.SeeFig.7.1foraVenndiagramthatsummarizesthesourcesofsupervisionthatourproposedextensionwoulduseinadditiontothesourcesofsupervisionthatareusedbyourmethods.

Fig 7.1: SupervisionsourceVenndiagram.AnillustrationofthecomponentsoftheRLtransitiontuplesthatareusedbyourmethods:GrICA(Chapter3),LARP(Chapter4),RewPred(Chapter5).ThesetuplesconsistofalltheinformationinvolvedinasingletransitioninanRLenvironment.Theintersectionofeverycomponentcorrespondstothetheoreticalextensionofourwork,combiningelementsofLARPandRewPred.

Thisdissertationaddstoarapidlygrowingbodyofknowledgethatsitsattheintersectionofdeeplearning,reinforcementlearningandrepresentationlearning.Wecontributethreenovelmethodsoflearningstaterepresentationstotheliteratureandexperimentallyevaluatetheireffectivenessinthecontextofreinforcementlearning.Alongroadtoartificialintelligencethatmatchesourgeneralandefficientproblem-solvingcapabilitystillliesaheadofus.\pdfbookmark[0]Bibliographybibliography

Bibliography

Appendix A

\pdfbookmark

[1]Nomenclaturenomenclature\nomenclature[C]StateSpace\nomenclature[C]ActionSpace\nomenclature[C]RealNumbers\nomenclature[S]RLReinforcementLearning\nomenclature[S]MDPMarkovDecisionProcess\nomenclature[S]POMDPPartiallyObservableMDP\nomenclature[S]MLPMultilayerPerceptron\nomenclature[S]ANNArtificialNeuralNetwork\nomenclature[S]DLDeepLearning\nomenclature[S]PCAPrincipalComponentAnalysis\nomenclature[S]ICAIndependentComponentAnalysis\nomenclature[S]SFSuccessorFeatures\nomenclature[S]GrICAOurGradient-basedICAAlgorithm(fromChapter4)\nomenclature[S]LARPLatentRepresentationPrediction(fromchapter5)\nomenclature[S]RewPredOurReward-predictiveRepresentation(fromChapter6)\nomenclature[S]MINEMutualInformationNeuralEstimation\nomenclature[S]CAEVariationalAutoencoder\nomenclature[S]VAEConvolutionalAutoencoder\nomenclature[S]LEMLaplacianEigenmaps\nomenclature[S]Conv.Convolutional\nomenclature[S]ReLURectifiedLinearUnit\nomenclature[S]t-SNEt-distributedStochasticNeighborEmbedding\nomenclature[S]PPOProximalPolicyOptimization\nomenclature[S]DQNDeepQ-Network\nomenclature[S]ACKTRActorCriticusingKronecker-FactoredTrustRegion\nomenclature[S]MSEMeanSquaredError\nomenclature[S]MBRLModel-basedReinforcementLearning\nomenclature[S]MBRLModel-freeReinforcementLearning\nomenclature[V]Representation\nomenclature[V]Learningrateparameter\nomenclature[V]Policyfunction\nomenclature[V]OptimalPolicy\nomenclature[V]Theparametersofadifferentiablefunction\nomenclature[C]StateTransition\nomenclature[C]Probability\nomenclature[C]Rewardfunction\nomenclature[C]state-valuefunctionof\nomenclature[C]action-valuefunctionof\nomenclature[C]Dataset\nomenclature[V]Observationspace\nomenclature[C]Observationfunction\nomenclature[C]Timestepindex\nomenclature[C]Action\nomenclature[C]State\nomenclature[C]Observation\nomenclature[C]Reward\nomenclature[V]Discountfactor\nomenclature[C]Lossfunction\printnomenclature