Co-Imitation: Learning Design and Behaviour by Imitation

Chang Rajani¹,², Karol Arndt², David Blanco-Mulero², Kevin Sebastian Luck²,³, Ville Kyrki²

Abstract

The co-adaptation of robots has been a long-standing research endeavour with the goal of adapting both body and behaviour of a system for a given task, inspired by the natural evolution of animals. Co-adaptation has the potential to eliminate costly manual hardware engineering as well as improve the performance of systems. The standard approach to co-adaptation is to use a reward function for optimizing behaviour and morphology. However, defining and constructing such reward functions is notoriously difficult and often a significant engineering effort. This paper introduces a new viewpoint on the co-adaptation problem, which we call co-imitation: finding a morphology and a policy that allow an imitator to closely match the behaviour of a demonstrator. To this end we propose a co-imitation methodology for adapting behaviour and morphology by matching state distributions of the demonstrator (additional material can be found at https://sites.google.com/view/co-imitation). Specifically, we focus on the challenging scenario with mismatched state- and action-spaces between both agents. We find that co-imitation increases behaviour similarity across a variety of tasks and settings, and demonstrate co-imitation by transferring human walking, jogging and kicking skills onto a simulated humanoid.

Affiliations

¹ Department of Computer Science, University of Helsinki, Helsinki, Finland
² Department of Electrical Engineering and Automation, Aalto University, Espoo, Finland
³ Finnish Center for Artificial Intelligence (FCAI), Espoo, Finland
chang.rajani@helsinki.fi, {karol.arndt, david.blancomulero, kevin.s.luck, ville.kyrki} @aalto.fi

1 Introduction

Animals undergo two primary adaptation processes: behavioural and morphological adaptation. An animal species adapts, over generations, its morphology to thrive in its environment. On the other hand, animals continuously adapt their behaviour during their lifetime due to environmental changes, predators or when learning a new behaviour is advantageous. While the processes operate on different time scales, they are closely interconnected and crucial elements leading to the development of well-performing and highly adapted organisms on earth.

While research in robot learning has largely been focused on the aspects of behavioural learning processes, a growing number of works have sought to combine behaviour learning and morphology adaptation for robotics applications via co-adaptation luck2020data; liao2019data; schaff2019jointly; ha2019reinforcement. Earlier works focused primarily on the use of evolutionary optimization techniques sims1994evolving; pollack2000evolutionary, but with the advent of deep learning, new opportunities arose for the efficient combination of deep reinforcement learning and evolutionary adaptation schaff2019jointly; luck2020data. In contrast to fixed behaviour primitives or simple parameterized controllers used earlier, deep neural networks allow a much greater range of behaviours given a morphology.

Figure 1: The proposed *co-imitation* algorithm (centre) is able to faithfully match the gait of human motion capture demonstrations (left) by optimizing both the morphology and behaviour of a simulated humanoid. This is opposed to a pure behavioural imitation learner (right) that fails to mimic the human motion accurately.

Existing works in co-adaptation, however, focus on a setting where a reward function is assumed to be known, even though engineering a reward function is a notoriously difficult and error-prone task singh2019end. Reward functions tend to be task-specific, and even minor changes to the learner dynamics can cause the agent to perform undesired behaviour. For example, in the case of robotics, changing the mass of a robot may affect the value of an action penalty. This means that the reward needs to be re-engineered every time these properties change.

To overcome these challenges, we propose to reformulate co-adaptation by combining morphology adaptation and imitation learning into a common framework, which we name co-imitation. This approach eliminates the need for engineering reward functions by leveraging imitation learning for co-adaptation, hence, allowing the matching of both the behaviour and the morphology of a demonstrator.

Imitation learning uses demonstration data to learn a policy that behaves like the demonstrator. However, in the case where the two agents’ morphologies are different, we face the following challenges: (1) state spaces of demonstrating and imitating agents may differ, even having mismatched dimensionalities; (2) actions of the demonstrator may be unobservable; (3) transition functions and dynamics are inherently disparate due to mismatching morphologies.

To address these issues we propose a co-imitation method which combines deep imitation learning through state distribution matching with morphology optimization. Summarized, the contributions of this paper are:

Formalization of the problem of co-imitation: matching both the behaviour and morphology of a demonstrator.
The introduction of Co-Imitation Learning (CoIL), a new co-imitation method, which adapts the behaviour and morphology of an agent by state distribution matching considering incompatible state spaces.
A comparison of morphology optimization using learned non-stationary reward functions with our proposed approach of using a state distribution matching objective.
A demonstration of CoIL by learning behaviour and morphology of a simulated humanoid given real-world demonstrations recorded from human subjects in tasks ranging from walking, jogging to kicking (see Fig. 1).

Figure 2: Top: A demonstrator of jogging from the CMU MoCap Dataset cmu. Middle: The co-imitation Humanoid produces a more natural looking jogging motion whereas the pure imitation learner (bottom) learns to run with a poor gait.

2 Related work

Deep Co-Adaptation of Behaviour and Design

While co-adaptation as a field has seen interest since at least as early as the 90s park1993concurrent; sims1994evolving, in this section we look at previous work in the field especially in the context of deep reinforcement learning. Recent work by gupta2021embodied proposes a mixed evolutionary- and deep reinforcement learning-based approach (DERL) for co-optimizing agents’ behaviour and morphology. Through mass parallelization, DERL maintains a population of 576 agents, which simultaneously optimize their behaviour using Proximal Policy Optimization (PPO) schulman2019ppo. Based on their final task performance (i.e. episodic return), DERL optimizes the morphological structure of agents using an evolutionary tournament-style optimization process.

schaff2019jointly use deep reinforcement learning (RL) for the joint optimization of morphology and behaviour by learning a single policy with PPO. Again, the final episodic return of a design is used to optimize the parameters of a design distribution with gradient descent, from which the subsequent designs are sampled. Similarly, ha2019reinforcement proposes to use REINFORCE to optimize policy parameters and design parameters of a population of agents in a joint manner. The co-adaptation method presented by luck2020data improves data-efficiency compared to return-based algorithms by utilizing the critic learned by Soft Actor Critic (SAC) haarnoja2018soft to query for the expected episodic return of unseen designs during design optimization. While the method we present is closest to the former approach, all discussed co-adaptation methods require access to a reward function, and are thus not capable of co-adapting the behaviour and design of an agent without requiring an engineer to formulate a reward function.

Imitation Learning with Morphology Mismatch

Imitation learning approaches learn a policy for a given task from demonstrator data. In many cases this data can only be produced by an agent (or human) that has different dynamics from the imitator. We will give a brief overview on previous work where a policy is learned in presence of such transfer.

The work by desai2020imitation discusses the imitation transfer problem between different domains and presents an action transformation method for the state-only imitation setting. hudson2022skeletal on the other hand learn an affine transform to compensate for differences in the skeletons of the demonstrator and the imitator. These methods are based on transforming either actions or states to a comparable representation.

To perform state-only imitation learning without learning a reward function, dadashi2021primal introduced Primal Wasserstein Imitation Learning (PWIL), where a reward function is computed based directly on the primal Wasserstein formulation. While PWIL does not consider the case where the state space and the morphology are different between the demonstrator and the imitator, it was extended into the mismatched setting by fickinger2021cross. They replace the Wasserstein distance with the Gromov-Wasserstein distance, which allows the state distribution distance to be computed in mismatched state spaces. In contrast, our method addresses the state space mismatch by transforming the state spaces to a common feature representation, allowing for more control over how the demonstrator’s behaviour is imitated. Additionally, in contrast to these works, we optimize the morphology of the imitator to allow for more faithful behaviour replication.

peng2020learning propose an imitation learning pipeline allowing a quadrupedal robot to imitate the movement behaviour of a dog. Similarly, xu2021gan use an adversarial approach to learn movements from human motion capture. Similar to us, these papers match markers between motion capture representations and robots. However, in the first, a highly engineered pipeline relies on a) the ability to compute the inverse kinematics of the target platform, and b) a hand-engineered reward function. In the latter, imitation learning is used for learning behaviour, but neither method optimizes for morphology.

3 Preliminaries

Imitation Learning as distribution-matching

For a given expert state-action trajectory $τ^{E} = (s_{0}, a_{0}, s_{1}, a_{1}, \dots, s_{n}, a_{n})$ , the imitation learning task is to learn a policy $π^{I} (a | s)$ such that the resulting behaviour best matches the demonstrated behaviour. This problem setting can be understood as minimizing a divergence, or alternative measures, $D (q (τ^{E}), p (τ^{I} | π^{I}))$ between the demonstrator trajectory distribution $q (τ^{E})$ and the trajectory distribution of the imitator $p (τ^{I} | π^{I})$ induced by its policy $π^{I}$ (see e.g. osa2018algorithmic for further discussion).

While there are multiple paradigms of imitation learning, a recently popular method is adversarial imitation learning, where a discriminator is trained to distinguish between policy states (or state-action pairs) and demonstrator states ho2016generative; orsini2021matters. The discriminator is then used for providing rewards to an RL algorithm which maximizes them via interaction. In the remainder of the paper we will be focusing on two adversarial methods with a divergence-minimization interpretation which we will now discuss in more detail.

Generative Adversarial Imitation Learning (GAIL)

GAIL trains a standard classifier using a logistic loss which outputs the probability that a given state comes from the demonstration trajectories ho2016generative. The reward function is chosen to be a function of the classifier output. Many options are given in literature for the choice of reward, evaluated extensively by orsini2021matters. Different choices of rewards correspond to different distance measures in terms of the optimization problem. Here, we consider the reward introduced by fu2017learning:

$$r(s_{t}, s_{t+1}) = \log(ψ(s_{t})) - \log(1 - ψ(s_{t})), \tag{1}$$

where $ψ$ is a classifier trained to distinguish expert data from the imitator. Maximizing this reward corresponds to minimizing the Kullback-Leibler divergence between the demonstrator and policy state-action marginals ghasemipour2020divergence.
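As a minimal sketch of Eq. (1), assume the classifier output $ψ(s_{t})$ is already available as a probability in $(0, 1)$. The function name `gail_reward` and the clipping constant `eps` are illustrative assumptions and not part of the method described above.

```python
import math

def gail_reward(psi: float, eps: float = 1e-8) -> float:
    """Reward of Eq. (1): log(psi) - log(1 - psi), i.e. the logit of psi.

    psi is the classifier's probability that a state comes from the
    demonstrator. eps is an assumed clipping constant that keeps both
    logarithms finite when psi is exactly 0 or 1.
    """
    psi = min(max(psi, eps), 1.0 - eps)
    return math.log(psi) - math.log(1.0 - psi)

# States judged "expert-like" (psi > 0.5) receive a positive reward.
rewards = [gail_reward(p) for p in (0.1, 0.5, 0.9)]
```

The reward is zero when the classifier is undecided and grows without bound as the classifier becomes confident, which is one reason implementations often clip the probability.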

State-Alignment Imitation Learning (SAIL)

In contrast to GAIL, SAIL liu2019state uses a Wasserstein-GAN-style arjovsky2017wasserstein critic instead of the standard logistic regression-style discriminator. Maximizing the SAIL reward corresponds to minimizing the Wasserstein distance villani2009optimal between demonstrator and policy state-marginals (see liu2019state for details).

4 A General Framework for Co-Imitation

We formalize the problem of co-imitation as follows. Consider an expert MDP described by $(S^{E}, A^{E}, p^{E}, p_{0}^{E})$ , with state space $S^{E}$ , action space $A^{E}$ , initial state distribution $p_{0}^{E} (s_{0}^{E})$ , and the transition probability $p^{E} (s_{t + 1}^{E} | s_{t}^{E}, a_{t}^{E})$ . Furthermore, assume that the generally unknown expert policy is defined as $π^{E} (a_{t}^{E} | s_{t}^{E})$ . In addition, an imitator MDP is defined by $(S^{I}, A^{I}, p^{I}, p_{0}^{I}, π^{I}, ξ)$ , where the initial state distribution $p^{I} (s_{0}^{I} | ξ)$ and transition probability $p^{I} (s_{t + 1}^{I} | s_{t}^{I}, a_{t}^{I}, ξ)$ are parameterized by a morphology-parameter $ξ$ . The trajectory distribution of the expert is given by

$$q(τ^{E}) = p^{E}(s_{0}^{E}) \prod_{t=0}^{T-1} p^{E}(s_{t+1}^{E} \mid s_{t}^{E}, a_{t}^{E}) \, π^{E}(a_{t}^{E} \mid s_{t}^{E}), \tag{2}$$

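while the imitator trajectory distribution depends on the imitator policy $π^{I} (a | s, ξ)$ and the chosen morphology $ξ$ :

$$p(τ^{I} \mid π^{I}, ξ) = p^{I}(s_{0}^{I} \mid ξ) \prod_{t=0}^{T-1} p^{I}(s_{t+1}^{I} \mid s_{t}^{I}, a_{t}^{I}, ξ) \, π^{I}(a_{t}^{I} \mid s_{t}^{I}, ξ). \tag{3}$$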

It follows that the objective of the co-imitation problem is to find an imitator policy $π^{I *}$ and an imitator morphology $ξ^{*}$ such that a chosen probability distance or divergence $D (\cdot, \cdot)$ is minimized, i.e.

$$ξ^{*}, π^{I*} = \arg\min_{ξ, π^{I}} D(q(τ^{E}), p(τ^{I} \mid π^{I}, ξ)). \tag{4}$$

For an overview of potential candidate distance measures and divergences see e.g. ghasemipour2020divergence. For the special case that state-spaces of expert and imitator do not match, a simple extension of this framework is to assume two transformation functions $ϕ (\cdot) : S^{E} \to S^{S}$ , and $ϕ_{ξ} (\cdot) : S^{I} \to S^{S}$ where $S^{S}$ is a shared feature space. For simplicity we overload the notation and use $ϕ (\cdot)$ for both the demonstrator and imitator state-space mapping.

5 Co-Imitation by State Distribution Matching

We consider in this paper the special case of co-imitation by state distribution matching and present two imitation learning methods adapted for the learning of behaviour and design. The co-imitation objective from Eq. (4) is then reformulated as

$$D(q(τ^{E}), p(τ^{I} \mid π^{I}, ξ)) \overset{\text{def}}{=} D(q(ϕ(s^{E})), p(ϕ(s^{I}) \mid π^{I}, ξ)). \tag{5}$$

Similar to lee2019efficient we define the marginal feature-space state distribution of the imitator as

$$p(ϕ(s^{I}) \mid π^{I}, ξ) \overset{\text{def}}{=} \mathbb{E}_{\substack{s_{0}^{I} \sim p^{I}(s_{0}^{I} \mid ξ) \\ a_{t}^{I} \sim π^{I}(a_{t}^{I} \mid s_{t}^{I}, ξ) \\ s_{t+1}^{I} \sim p^{I}(s_{t+1}^{I} \mid s_{t}^{I}, a_{t}^{I}, ξ)}} \left[ \frac{1}{T} \sum_{t=0}^{T} \mathbb{1}(ϕ(s_{t}^{I}) = ϕ(s^{I})) \right], \tag{6}$$

while the feature-space state distribution of the demonstrator is defined by

$$q(ϕ(s^{E})) \overset{\text{def}}{=} \mathbb{E}_{\substack{s_{0}^{E} \sim p^{E}(s_{0}^{E}) \\ a_{t}^{E} \sim π^{E}(a_{t}^{E} \mid s_{t}^{E}) \\ s_{t+1}^{E} \sim p^{E}(s_{t+1}^{E} \mid s_{t}^{E}, a_{t}^{E})}} \left[ \frac{1}{T} \sum_{t=0}^{T} \mathbb{1}(ϕ(s_{t}^{E}) = ϕ(s^{E})) \right]. \tag{7}$$

Intuitively, this formulation corresponds to matching the visitation frequency of each state in the expert samples in the shared feature space. In principle any transformation that maps to a shared space can be used. For details of our specific choice see Section 6.1. Importantly, this formulation allows us to frame the problem using any state marginal matching imitation learning algorithm for policy learning. See ni2021f for a review of different algorithms.
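To make the visitation-frequency reading of Eqs. (6) and (7) concrete, the following sketch estimates a feature-space state marginal from sampled trajectories by counting. The function name, the rounding of features to a grid (needed so that the indicator can fire on continuous data), and the toy identity feature map are all assumptions made for illustration.

```python
from collections import Counter

def empirical_state_marginal(trajectories, phi, decimals=1):
    """Monte-Carlo estimate of the feature-space visitation frequency.

    trajectories: iterable of state sequences (equal lengths assumed, so
    pooling all time steps equals averaging the per-trajectory frequencies).
    phi: hypothetical feature map returning a tuple of floats.
    decimals: assumed grid resolution used to discretise the features.
    """
    counts = Counter()
    total = 0
    for traj in trajectories:
        for state in traj:
            key = tuple(round(f, decimals) for f in phi(state))
            counts[key] += 1
            total += 1
    return {key: c / total for key, c in counts.items()}

# Toy usage: 1-D states and an identity feature map.
demo = [[0.0, 0.1, 0.1], [0.1, 0.2, 0.2]]
q = empirical_state_marginal(demo, phi=lambda s: (s,))
```

In practice the adversarial methods discussed in Section 3 never build this histogram explicitly; the discriminator or critic compares samples from the two marginals directly.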

An overview of CoIL is provided in Algorithm 1. We consider a set of given demonstrator trajectories $T^{E}$ , and initialize the imitator policy as well as an initial morphology $ξ_{0}$ . Each iteration begins by training the imitator policy on the current morphology $ξ$ for $N_{ξ}$ episodes, as discussed in Section 5.1. The set of collected imitator trajectories $T_{ξ}^{I}$ and the morphology are then added to the dataset $Ξ$ . Next, the morphology is optimized by computing the distribution distance measure following Algorithm 2. This procedure is repeated until convergence, yielding the morphology and policy that best imitate the demonstrator. We follow an alternating approach between behaviour optimization and morphology optimization as proposed by prior work such as luck2020data.

Algorithm 1 Co-Imitation Learning (CoIL)

1: Input: Set of demonstration trajectories $T^{E} = \{τ_{0}^{E}, \dots\}$
2: Initialize $π^{I}$ , $ξ = ξ_{0}$ , $T^{I} = \emptyset$ , $Ξ = \emptyset$ , and RL replay $R_{RL}$
3: while not converged do
4:     Initialize agent with morphology $ξ$
5:     for $n = 1, \dots, N_{ξ}$ episodes do
6:         With current policy $π^{I}$ sample state-action trajectory $(s_{0}^{I}, a_{0}^{I}, \dots, s_{t}^{I}, a_{t}^{I}, s_{t+1}^{I}, \dots)$ in environment
7:         Add tuples $(s_{t}^{I}, a_{t}^{I}, s_{t+1}^{I}, ξ)$ to replay $R_{RL}$
8:         Add state-trajectory $τ_{n,ξ}^{I} = (s_{0}^{I}, s_{1}^{I}, \dots)$ to $T^{I}$
9:         Compute rewards $r(ϕ(s_{t}^{I}), ϕ(s_{t+1}^{I}))$ using IL strategy
10:        Add rewards $r(ϕ(s_{t}^{I}), ϕ(s_{t+1}^{I}))$ to $R_{RL}$
11:        Update policy $π^{I}(a_{t}^{I} | s_{t}^{I}, ξ)$ using RL and $R_{RL}$
12:    end for
13:    Add $(ξ, T_{ξ}^{I})$ to $Ξ$ with $T_{ξ}^{I} = \{τ_{0,ξ}^{I}, \dots, τ_{N_{ξ},ξ}^{I}\}$
14:    $ξ =$ Morpho-Opt $(T^{E}, Ξ)$ ▹ Adapt Morphology (Alg. 2)
15: end while
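The alternating structure of Algorithm 1 can be sketched as a plain Python loop. This is only a skeleton under stated assumptions: `rollout`, `il_update` and `morpho_opt` are hypothetical callables standing in for environment interaction, the imitation-learning and RL update, and Algorithm 2 respectively; the demonstrations are assumed to be captured inside those callables; and a fixed number of outer iterations replaces the convergence check.

```python
from typing import Callable, List, Sequence, Tuple

State = Sequence[float]
Morphology = Tuple[float, ...]

def coil_loop(
    xi0: Morphology,
    rollout: Callable[[Morphology], List[State]],
    il_update: Callable[[List[State], Morphology], None],
    morpho_opt: Callable[[list], Morphology],
    episodes_per_morphology: int = 20,
    outer_iterations: int = 5,
):
    """Alternate behaviour adaptation and morphology adaptation."""
    xi = xi0
    dataset = []  # plays the role of Xi: (morphology, trajectories) pairs
    for _ in range(outer_iterations):      # stands in for "while not converged"
        trajectories = []
        for _ in range(episodes_per_morphology):
            traj = rollout(xi)             # stands in for lines 6-8
            il_update(traj, xi)            # stands in for lines 9-11
            trajectories.append(traj)
        dataset.append((xi, trajectories))  # line 13
        xi = morpho_opt(dataset)            # line 14
    return xi, dataset

# Toy stubs so the skeleton runs end to end: each "optimization" halves xi.
xi, data = coil_loop(
    xi0=(1.0,),
    rollout=lambda xi: [[xi[0]], [xi[0]]],
    il_update=lambda traj, xi: None,
    morpho_opt=lambda ds: (ds[-1][0][0] * 0.5,),
    episodes_per_morphology=2,
    outer_iterations=3,
)
```

Passing the three stages in as callables keeps the loop independent of the particular imitation learner (GAIL or SAIL) and of the morphology optimizer.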

Algorithm 2 Bayesian Morphology Optimization

1: Output: $ξ_{next}$ , next candidate morphology
2: procedure Morpho-Opt( $T^{E}$ , $Ξ$ )
3:     Define observations $X = \{ξ_{n}\}, \forall ξ_{n} \in Ξ$
4:     Compute $Y = \{y_{n}\}, \forall (ξ_{n}, T_{n}^{I}) \in Ξ$ ▹ Using Eq. (12)
5:     Fit GP $g(ξ)$ using $X$ and $Y$
6:     $μ_{g}(\tilde{ξ}), σ_{g}(\tilde{ξ}) = p(g(\tilde{ξ}) | X, Y)$ ▹ Compute posterior
7:     $α(\tilde{ξ}) = μ_{g}(\tilde{ξ}) + β σ_{g}(\tilde{ξ})$ ▹ Compute UCB
8:     $ξ_{next} = \arg\min_{\tilde{ξ}} α(\tilde{ξ})$ ▹ Provide next candidate
9: end procedure

5.1 Behaviour Adaptation

Given the current morphology $ξ$ of an imitating agent, the first task is to optimize the imitator policy $π^{I}$ with

$$π_{next}^{I} = \arg\min_{π^{I}} D(q(ϕ(s^{E})), p(ϕ(s^{I}) \mid π^{I}, ξ)). \tag{8}$$

The goal is to find an improved imitator policy $π_{next}^{I}$ which exhibits behaviour similar to the given set of demonstration trajectories $T^{E}$ . This policy improvement step is performed in lines 4–11 in Algorithm 1. We experiment with two algorithms: GAIL and SAIL, which learn discriminators as reward functions $r (s_{t}, s_{t + 1})$ . Following orsini2021matters we use SAC, a sample-efficient off-policy model-free algorithm, as the reinforcement learning backbone for both imitation learning algorithms (line 11 in Alg. 1). To ensure that the policy transfers well to new morphologies, we train a single policy $π^{I}$ conditioned both on $s_{t}^{I}$ and on $ξ$ . Data from previous morphologies is retained in the SAC replay buffer. Further details about the changes made to these algorithms for behaviour adaptation in the co-imitation setting are stated in the Appendix.
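One simple way to condition a single policy on both the state and the morphology is to concatenate the two vectors before feeding them to the network. The sketch below illustrates only this input construction and the morphology-tagged replay tuples; concatenation is an assumption (the text does not state how the conditioning is implemented), and `policy_input` is a hypothetical helper.

```python
from typing import List, Sequence

def policy_input(state: Sequence[float], xi: Sequence[float]) -> List[float]:
    """Concatenate the imitator state s_t with the morphology parameters xi."""
    return list(state) + list(xi)

# Replay entries store xi alongside the transition, so data gathered with
# earlier morphologies can still be used when training the shared policy.
replay = []
replay.append(((0.1, -0.3), (0.5,), (0.2, -0.1), (1.0, 0.8)))  # (s_t, a_t, s_t+1, xi)

s_t, a_t, s_next, xi = replay[0]
x = policy_input(s_t, xi)
```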

5.2 Morphology Adaptation

Adapting the morphology of an agent requires a certain exploration-exploitation trade-off: new morphologies need to be considered, but changing the morphology too radically or too often will hinder learning. In general, co-imitation is challenging because a given morphology can perform poorly either because it is inherently poor for the task, or because the policy has not yet converged to a good behaviour.

Previous approaches have focused on using either returns averaged over multiple episodes (e.g. ha2019reinforcement) or the Q-function of a learned policy luck2020data to evaluate the fitness of given morphology parameters. They then perform general-purpose black-box optimization along with exploration heuristics to find the next suitable candidate to evaluate. Since both approaches rely on rewards, in the imitation learning setting they correspond to maximizing the critic’s approximation of the distribution distance. This is because the rewards are outputs of a neural network that is continuously trained and, hence, inherently non-stationary. Instead, we propose to minimize in the co-imitation setting the true quantity of interest, i.e. the distribution distance for the given trajectories.

Given the current imitator policy $π^{I} (a_{t}^{I} | s_{t}^{I}, ξ)$ our aim is to find a candidate morphology minimizing the objective

$$ξ_{next} = \arg\min_{ξ} D(q(ϕ(s^{E})), p(ϕ(s^{I}) \mid π^{I}, ξ)). \tag{9}$$

Bayesian Morphology Optimization

In order to find the optimal morphology parameters we perform Bayesian Optimization (BO), which is a sample-efficient optimization method that learns a probabilistic surrogate model frazier2018botutorial. Here, we use a Gaussian Process (GP) rasmussen_2006_gpbook as surrogate to learn the relationship between the parameters $ξ$ and the distance $D (q (ϕ (s^{E})), p (ϕ (s^{I}) | π^{I}, ξ))$ . This relationship is modeled by the GP prior

$$g(ξ) = \mathcal{GP}(μ(ξ), k(ξ, ξ')), \tag{10}$$

where $μ (\cdot)$ defines the mean function, and $k (\cdot, \cdot)$ the kernel (or covariance) function. We show that adapting the morphology in CoIL via this approach increases performance over the co-adaptation and imitation baselines in Section 6.

Modelling the relationship between the parameters $ξ$ and the distance $D (\cdot, \cdot)$ is surprisingly challenging because the policy evolves over time. This means that morphologies evaluated early in training are by default worse than those evaluated later, and thus should be trusted less. The BO algorithm alleviates this problem by re-fitting the GP at each iteration using only the most recent episodes. By learning the surrogate GP model $g (ξ)$ we can explore the space of morphologies and estimate their performance without gathering new data. The optimization problem can be defined as

$$ξ_{next} = \arg\min_{ξ} g(ξ), \tag{11}$$

where $ξ_{next}$ is the next proposed morphology to evaluate. The GP model is trained using as observations the set of morphologies used in behaviour adaptation $X = {ξ_{n}}, \forall ξ_{n} \in Ξ$ , and as targets $Y = {y_{0}, \dots, y_{N}}$ the mean distribution distance for each morphology, that is

$$y_{n} = \frac{1}{N_{ξ}} \sum_{k=0}^{N_{ξ}} D(q(τ^{E}), p(τ_{k,ξ}^{I} \mid π^{I}, ξ)). \tag{12}$$

The predictive posterior distribution is given by $p (g (\tilde{ξ}) | X, Y) = \mathcal{N} (\tilde{ξ} | μ_{g} (\tilde{ξ}), σ_{g} (\tilde{ξ}))$ , where $\tilde{ξ}$ is the set of test morphologies and $μ_{g} (\tilde{ξ})$ and $σ_{g} (\tilde{ξ})$ are the predicted mean and variance. In order to trade off exploration and exploitation we use the Upper Confidence Bound (UCB) as acquisition function $α (\tilde{ξ}) = μ_{g} (\tilde{ξ}) + β σ_{g} (\tilde{ξ})$ , where $β$ is a parameter that controls the exploration. The morphology optimization procedure is depicted in Algorithm 2. The GP is optimized by minimizing the negative marginal log-likelihood (MLL). Then, the posterior distribution is computed for the set of test morphologies $\tilde{ξ}$ . The values of $\tilde{ξ}$ for each task are described in the Appendix. Finally, the acquisition function is computed and used to obtain the next proposed morphology. We provide in the Appendix a study of the morphology optimization routine comparing the proposed BO approach to Random Search (RS) bergstra2012randomsearch, and covariance matrix adaptation evolutionary strategy (CMA-ES) hansen2001cmaes.
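The morphology optimization step can be illustrated with a small NumPy sketch: fit a GP to (morphology, mean distance) pairs, compute the posterior over a grid of candidate morphologies, and pick the minimizer of $α = μ_{g} + β σ_{g}$ . Several simplifications are assumptions of this sketch and not the paper's setup: the kernel is a fixed squared-exponential with hand-picked `length_scale` and `noise` (the hyperparameters are not fitted by marginal likelihood here), the posterior standard deviation is used for $σ_{g}$ , the value and sign of `beta` are arbitrary, and all data are made up.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.3):
    """Squared-exponential kernel between two sets of morphology vectors."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dist / length_scale**2)

def gp_posterior(X, y, X_test, noise=1e-4):
    """Posterior mean and std of a GP with constant mean (the mean of y)."""
    K = rbf_kernel(X, X) + noise * np.eye(len(X))
    K_s = rbf_kernel(X, X_test)
    y_bar = y.mean()
    mu = y_bar + K_s.T @ np.linalg.solve(K, y - y_bar)
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s), axis=0)
    return mu, np.sqrt(np.clip(var, 0.0, None))

def morpho_opt(X, y, candidates, beta=1.0):
    """Return the candidate minimising alpha = mu + beta * sigma."""
    mu, sigma = gp_posterior(X, y, candidates)
    alpha = mu + beta * sigma
    return candidates[np.argmin(alpha)]

# Toy data: one morphology parameter, three evaluated designs, and
# made-up per-episode distances that are averaged as in Eq. (12).
X = np.array([[0.2], [0.5], [0.8]])
episode_distances = [[0.85, 0.75], [0.35, 0.25], [0.8, 0.8]]
y = np.array([np.mean(d) for d in episode_distances])
candidates = np.linspace(0.0, 1.0, 11)[:, None]
xi_next = morpho_opt(X, y, candidates)
```

With a positive `beta` and a minimization objective, the acquisition penalizes uncertain candidates; a negative `beta` would instead favour them, so the sign convention determines how exploratory the search is.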

Figure 3: Left: Markers used for matching the MuJoCo Humanoid to motion capture data. Right: Markers used for the Cheetah tasks. Green markers are used as data, while blue markers serve as reference points for green markers (more details in Section 6.1).

6 Experiments

Our experimental evaluation aims at answering the following research questions:
(RQ1) Does imitation learning benefit from co-adapting the imitator’s morphology?
(RQ2) How does the choice of the imitation learning algorithm used with CoIL impact the imitator’s morphology?
(RQ3) Is morphology adaptation with CoIL able to compensate for major morphology differences, such as a missing joint or the transfer from a real to a simulated agent?
To answer these questions, we devised a set of experiments across a range of setups and imitation learning methods.

6.1 Experimental Setup

In all our experiments, we use the MuJoCo physics engine Todorov2012MuJoCoAP for simulating the dynamics of agents. As discussed in Algorithm 1, the policies are trained using the same morphology for $N_{ξ} = 20$ episodes. The BO algorithm details as well as more detailed technical information can be found in the Appendix.

Joint feature space

As discussed in Section 4 our method assumes that demonstrator and imitator states are in different state-spaces. To address this mismatch, the proposed method maps the raw state observations from the demonstrator and the imitator to a common feature space. The selection of the feature space can be used to influence which parts of the behaviour are to be imitated. In our setups, we manually selected the relevant features by placing markers along each of the limbs in both experimental setups, as shown in Figure 3. The feature space is then composed of velocities and positions of these points relative to the base of their corresponding limb (marked in blue in the figure).
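A feature map of this kind can be sketched as follows. The helper `marker_features` is hypothetical, handles a single limb, and assumes that both positions and velocities are expressed relative to the limb's base marker; the exact layout of the feature vector used in the experiments is not specified here.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]

def marker_features(
    markers: Sequence[Vec3],
    marker_vels: Sequence[Vec3],
    base: Vec3,
    base_vel: Vec3,
) -> List[float]:
    """Positions and velocities of limb markers relative to the limb base."""
    feats: List[float] = []
    for pos, vel in zip(markers, marker_vels):
        feats.extend(p - b for p, b in zip(pos, base))       # relative position
        feats.extend(v - bv for v, bv in zip(vel, base_vel))  # relative velocity
    return feats

# One marker on a limb whose base sits at (0.5, 0.0, 1.0) and moves forward
# at the same speed as the marker.
phi = marker_features(
    markers=[(1.0, 0.0, 0.5)],
    marker_vels=[(0.2, 0.0, 0.0)],
    base=(0.5, 0.0, 1.0),
    base_vel=(0.2, 0.0, 0.0),
)
```

Because the same function can be applied to motion-capture markers and to simulated body sites, demonstrator and imitator end up in the same feature space even when their raw state vectors have different dimensionalities.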

Evaluation Metric

Evaluating the accuracy of imitation in a quantitative manner is not straightforward because, in general, there is no explicit reward function on which performance can be compared. While most imitation learning works use task-specific rewards to evaluate imitation performance, such rewards are a poor proxy for e.g. learning similar gaits. Recent work in state-marginal matching has used forward and reverse KL divergence as a performance metric ni2021f. However, rather than evaluating the KL divergence, we opted for using the Wasserstein distance villani2009optimal as the evaluation metric. The main motivation behind this choice was that this metric corresponds to the objective optimized by SAIL and PWIL, two state-of-the-art imitation learning algorithms. Additionally, it constitutes a more intuitive quantity for comparing 3D positions of markers than KL divergence: the Wasserstein distance between the expert and imitator feature distributions corresponds to the average distance by which markers of the imitator need to be moved in order for the two distributions to be aligned. Therefore, for both morphology optimization and evaluation we use the exact Wasserstein distance between marker position samples from the demonstrator $q (ϕ (s^{E}))$ and imitator $p (ϕ (s^{I}) | π^{I}, ξ)$ state marginal distributions. This also allows us to avoid an additional scaling hyperparameter when optimizing for morphologies, since velocities and positions have different scales. The Wasserstein distances are computed using the POT package flamary2021pot. For all runs we show the mean over 3 seeds, with the standard deviation represented as the shaded area.
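The "average distance markers need to be moved" interpretation is easiest to see in one dimension, where the exact Wasserstein-1 distance between two equally sized, uniformly weighted samples is obtained by matching the sorted samples. The sketch below covers only this 1-D special case with made-up numbers; the multi-dimensional marker distributions in the experiments require a general optimal-transport solver such as the one provided by POT.

```python
def wasserstein_1d(xs, ys):
    """Exact 1-Wasserstein distance between two equal-size 1-D samples.

    With uniform weights and equal sample counts, the optimal transport
    plan in one dimension pairs the sorted samples in order.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample counts")
    paired = zip(sorted(xs), sorted(ys))
    return sum(abs(x - y) for x, y in paired) / len(xs)

# Made-up marker heights: every imitator sample is 0.5 away from its match.
demo_heights = [0.0, 1.0, 2.0]
imitator_heights = [2.5, 0.5, 1.5]
d = wasserstein_1d(demo_heights, imitator_heights)
```

Here `d` is the mean displacement needed to align the two sets of samples, which is the quantity the evaluation metric reports in the higher-dimensional marker space.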

6.2 Co-Imitation from Simulated Agents

Figure 4: Wasserstein distance for three seeds between demonstrator and imitator trajectories on the *3to2* Cheetah task on co-imitation (CoIL) and pure imitation learning algorithms (SAIL, GAIL).

(a) Imitation of a 2-joint Cheetah using a 3-joint Cheetah.

We adapt the HalfCheetah setup from OpenAI Gym openaigym by creating a version with two leg-segments instead of three (see Fig. 3). We then collect the demonstration datasets by generating expert trajectories from a policy trained by SAC using the standard running reward for both variants of the environment. We refer to these tasks as 3to2 and 2to3 corresponding to imitating a 3-segment demonstrator using a 2-segment imitator and vice versa. For both experiments we used 10 episodes of 1000 timesteps as demonstration data. Further details can be found in the Appendix.

First, we answer RQ1 by investigating whether co-adapting the imitator’s morphology is at all beneficial for their ability to replicate the demonstrator’s behaviour, and—if so—how different state marginal matching imitation learning algorithms perform at this task (RQ2). To this end, we analyze the performance of two imitation learning algorithms, GAIL and SAIL on the HalfCheetah setup, with and without co-adaptation. We use BO as the morphology optimizer, as it consistently produced good results in preliminary experiments (see Appendix). The performance for both imitation algorithms on the 3to2 task is shown in Figure 4. We observe that SAIL outperforms GAIL both with and without morphology adaptation. Our results indicate that this task does not benefit from morphology optimization as SAIL and CoIL achieve similar performance. However, it is encouraging to note that CoIL does not decrease performance even when the task does not benefit from co-adaptation. Based on these results we select SAIL as the main imitation learning algorithm due to its higher performance over GAIL.
Figure 5 shows the results in the two HalfCheetah morphology transfer scenarios. To address RQ3, we compare CoIL to two baselines: the cheetah without morphology adaptation, and the Q-function method adapted from luck2020data. Since this method is designed for the standard reinforcement learning setting, we adapt it to the imitation learning scenario by using SAIL to imitate the expert trajectories, and iteratively optimizing the morphology using the Q-function. See the Appendix for further details of this baseline. In the 3to2 domain transfer scenario (Figure 5(b)), where the gait of a more complex agent is to be reproduced on a simpler setup, the results are even across the board. All methods are able to imitate the demonstrator well, which indicates that this task is rather easy, and that co-adaptation does not provide much of a benefit. On the other hand, in the 2to3 scenario shown in Figure 5(a), after co-adaptation with CoIL, the more complex Cheetah robot is able to reproduce the gait of the simpler, two-segment robot very closely. A closer look at the results reveals that the morphology adaptation algorithm achieves this by setting the length of the missing link in each leg to a very small, nearly zero value (see Appendix). Thus, at the end of training, CoIL can recover the true morphology of the demonstrator.
While the Q-function optimization procedure from luck2020data also optimizes for the Wasserstein distance metric via the reward signal, the final performance is somewhat worse. We hypothesize that with more interaction time the Q-function version would reach the performance of CoIL on this simple task.

6.3 Co-Imitation from Human Behaviour

Next, we address RQ3 by evaluating CoIL in a more challenging, high-dimensional setup, where the goal is to co-imitate demonstration data collected from a real-world human using a simplified simulated agent. Here, we use a Humanoid robot adapted from OpenAI Gym openaigym together with the CMU motion capture data cmu as our demonstrations. This setup uses a marker layout similar to HalfCheetah’s, with markers placed at each joint of each limb and an additional marker on the head (see Figure 3 for a visualization). We follow the same relative position matching as in the Cheetah setup. We also include the absolute velocity of the torso in the feature space to allow modelling forward motion.

The performance of the Humanoid agent on imitating three tasks from the CMU motion capture dataset (walking, jogging, and soccer kick) is shown in Figure 6. We observe that, in all three tasks, CoIL reproduces the demonstrator behaviour most faithfully. A comparison of the morphology and behaviour learned by CoIL vs standard imitation learning (here SAIL) in the jogging task is shown in Figure 2. In the soccer kick task, CoIL’s performance matches the distance between individual demonstrations, while for the two locomotion tasks (jogging and walking) there is still a noticeable performance gap between CoIL and the individual expert demonstrations (with $p = 0.0076$ , Wilcoxon signed rank test).

We also observe that, in all three setups, not performing co-adaptation at all (and using the default link length values for the OpenAI Gym Humanoid instead) outperforms co-adaptation with the Q-function objective. We hypothesize that this counter-intuitive result might stem from the increased complexity of the task: learning a sensible Q-function in the higher-dimensional morphology- and state feature-space of Humanoid is likely to require a much larger amount of data, and thus a longer interaction time. In contrast, optimizing the morphologies using the Wasserstein distance directly makes the optimization procedure easier, since it does not rely on the Q-function "catching up" with changes both to the policy and to the adversarial reward models used in GAIL and SAIL.

7 Conclusion

In this paper we presented Co-Imitation Learning (CoIL): a methodology for co-adapting both the behaviour of a robot and its morphology to best reproduce the behaviour of a demonstrator. This is, to the best of our knowledge, the first deep learning method to co-imitate both morphology and behaviour using only demonstration data with no pre-defined reward function. We discussed and presented a version of CoIL using state distribution matching at its core for co-imitating a demonstrator in the special case of mismatching state and action spaces. The capability of CoIL to better co-imitate behaviour and morphology was demonstrated in a difficult task where a simulated humanoid agent has to imitate real-world motion capture data of a human.

Although we were able to show that CoIL outperforms non-morphology-adapting imitation learning techniques in the presented experiment using real-world data, we did not consider or further investigate the inherent mismatch between physical parameters (such as friction, contact-forces, elasticity, etc.) of simulation and real world or the use of automatic feature-extraction mechanisms. We think that these challenges present interesting avenues for future research and that the presented co-imitation methodology opens up a new exciting research space in the area of co-adaptation of agents.

8 Acknowledgments

This work was supported by the Academy of Finland Flagship programme: Finnish Center for Artificial Intelligence FCAI and by Academy of Finland through grant number 328399. We acknowledge the computational resources provided by the Aalto Science-IT project. The data used in this project was obtained from mocap.cs.cmu.edu and was created with funding from NSF EIA-0196217.

References

Appendix

Appendix A Imitation Learning algorithms

Here we describe the core RL and imitation algorithms we used in this work. For consistency and following the large ablation study conducted by orsini2021matters, we used SAC haarnoja2018soft as the RL algorithm for all methods. SAC is an actor-critic algorithm which optimizes a soft Q-function with the loss

$$L(Q) = E_{s_{t}, a_{t}, s_{t+1} \sim τ_{π}} \left[ \frac{1}{2} \left( Q(s_{t}, a_{t}) - \hat{Q} \right)^{2} \right], \tag{13}$$

with the target

$$\hat{Q} = r(s_{t}, s_{t+1}) + γ E_{a_{t} \sim π} Q(s_{t+1}, a_{t}), \tag{14}$$

and the policy loss

$$L(π) = E_{s_{t} \in τ_{π}} \left[ α_{SAC} \log π(a_{t} | s_{t}) - Q(s_{t}, a_{t}) \right],$$

including the automatic entropy tuning of $α_{SAC}$ introduced in haarnoja2018softb.
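As a small numerical illustration of Eqs. 13 and 14, the following sketch computes the target and the critic loss for hand-picked scalar values. It is not the implementation used in the experiments: the function names are hypothetical, the expectation over next actions is approximated by a plain average over sampled Q-values, and, as in Eq. 14, no entropy term is included in the target.

```python
def soft_q_target(reward, next_q_samples, gamma=0.97):
    """Target of Eq. 14: r + gamma * E_{a ~ pi}[Q(s', a)].

    The expectation is approximated by the mean over Q-values of
    actions sampled from the policy (an assumption of this sketch).
    """
    expected_next_q = sum(next_q_samples) / len(next_q_samples)
    return reward + gamma * expected_next_q


def critic_loss(q_values, targets):
    """Loss of Eq. 13: batch mean of 0.5 * (Q - target)^2."""
    squared = [0.5 * (q - y) ** 2 for q, y in zip(q_values, targets)]
    return sum(squared) / len(squared)


# Hypothetical numbers: one transition, two sampled next actions.
target = soft_q_target(reward=1.0, next_q_samples=[2.0, 4.0], gamma=0.5)
loss = critic_loss([2.0], [target])
```

The default `gamma=0.97` matches the discount factor listed in Table 1; the example call overrides it only to keep the arithmetic easy to follow.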

Generative Adversarial Imitation Learning

The authors of GAIL used TRPO to maximize the adversarial reward generated by the GAIL discriminator. Adapting it to SAC is straightforward, as only the reward depends on the discriminator output.

State-Alignment Imitation Learning

In addition to the adversarial-style rewards, SAIL uses a modified policy objective which adjusts the policy towards a policy prior

$$π_{p}(a_{t} | s_{t}) \propto \exp\left( - \left\lVert \frac{g_{inv}(s, f(s))}{σ} \right\rVert^{2} \right) \tag{15}$$

where $g_{inv}$ is an inverse dynamics model trained using transitions sampled from the policy, $f (s)$ is a $β$-Variational Autoencoder (VAE) trained using demonstration data, and $σ$ is a constant (see liu2019state for details). We use 50K timesteps' worth of data collected with random actions to pretrain the inverse dynamics model and the VAE.

The authors used on-policy PPO as the base RL algorithm. In order to use SAC, we make the following adjustments to the SAC policy objective:


	$+ (π^{I} (a_{t} | s_{t}) - π_{p} (a_{t} | s_{t}))^{2}],$		(16)

where $α_{SAC}$ is the entropy scalar tuned as in haarnoja2018softb. In this work, we add the gradient penalty term introduced in gulrajani2017improved to the discriminator loss.

Appendix B Further Discussions of the Co-Imitation Framework

In this section we give further intuition for matching the proposed trajectory distributions of expert and imitator by using two variations of the Kullback-Leibler (KL) divergence as an example divergence to minimize. Our aim is to shed light on the unique properties of the co-imitation problems presented in this paper. We start the discussion with a look at the standard imitation learning problem (i.e. behaviour cloning) in the context of co-adaptation. Thereafter, we further discuss the co-imitation setting selected for CoIL, namely matching the state distributions of imitator and expert. As a reminder, the trajectory distribution of the expert is given by

$$q(τ) = q(s_{0}) \prod_{t=0}^{T-1} q(s_{t+1} | s_{t}, a_{t}) \, π^{E}(a_{t} | s_{t}), \tag{17}$$

while the imitator trajectory distribution is dependent on the imitator policy $π^{I} (a | s, ξ)$ and chosen morphology $ξ$

$$p(τ | π^{I}, ξ) = p(s_{0} | ξ) \prod_{t=0}^{T-1} p_{I}(s_{t+1} | s_{t}, a_{t}, ξ) \, π^{I}(a_{t} | s_{t}, ξ). \tag{18}$$

To improve readability, for the purpose of this discussion we assume a shared state space $S$ and action space $A$ for both imitator and expert.

b.1 Behavioural Cloning in the Co-Imitation Setting

While classic imitation learning is concerned with the problem of minimizing

$$\min_{π^{I}} D(q(τ), p(τ | π^{I})) \tag{19}$$

for a measure or divergence $D (\cdot, \cdot)$ with matching transition probabilities, we consider in this work the extension where we aim to optimize

$$\min_{π^{I}, ξ} D(q(τ), p(τ | π^{I}, ξ)), \tag{20}$$

assuming a parameterization of the imitator transition probability of $p (s_{t + 1} | s, a, ξ)$ . In our setting, we assume the variable $ξ$ to be morphological parameters such as lengths, sizes, weights or other shape-inducing variables. These parameters are observable and not latent, as we assume them to be needed for the instantiation of the simulation (i.e. they are needed for constructing the URDF/XML files used in simulation, or for producing the hardware components via 3D-printing, for example).

While many potential choices for $D (\cdot, \cdot)$ exist (see e.g. ghasemipour2020divergence for some options) we will provide some further intuition behind the proposed co-imitation learning problem by utilizing the KL-divergence. Applying the KL-divergence to the problem in Eq. 20 results in

$$\begin{aligned}
D_{KL}(q(τ) \,\|\, p(τ | π^{I}, ξ)) &= \int_{τ} q(τ) \ln \frac{q(τ)}{p(τ | π^{I}, ξ)} \\
&= \underbrace{E_{q(τ)}\left[\ln\left(\frac{q(s_{0})}{p(s_{0} | ξ)}\right)\right]}_{\text{match initial state distribution}} + \underbrace{E_{q(τ)}\left[\ln\left[\frac{\prod q(s^{'} | s, a)}{\prod p(s^{'} | s, a, ξ)}\right]\right]}_{\text{match transition distribution}} \\
&\quad + \underbrace{E_{q(τ)}\left[\ln\left[\frac{\prod π^{E}(a | s)}{\prod π^{I}(a | s)}\right]\right]}_{\text{match expert policy}}, \qquad (21)
\end{aligned}$$

where we can see that the equation can be rearranged into three problems: (1) matching the initial state-distribution, (2) matching the transition distributions, and (3) matching the policies of imitator and expert. For better readability the notation of $s$ for the current state and $s^{'}$ for the following state is used. The expectation is with respect to the trajectory distribution $q (τ)$ of the expert, which in practice is replaced by a set of sampled trajectories. One can re-formulate Eq. 21 by using the distribution of state, action and next state induced by the trajectory distribution of the expert, leading to a simplified form with

$$\begin{aligned}
&\int_{s, a, s^{'}} q(s, a, s^{'}) \left[ \ln\left(\frac{q(s^{'} | s, a)}{p(s^{'} | s, a, ξ)}\right) + \ln\left(\frac{π^{E}(a | s)}{π^{I}(a | s)}\right) \right] \\
&\quad \approx E_{q(s, a, s^{'})}\left[ - \ln p(s^{'} | s, a, ξ) - \ln π^{I}(a | s) \right] + C, \qquad (22)
\end{aligned}$$

where we show the resulting morphology-behaviour optimization objective plus the constant term $C$, which collects the expert terms that do not depend on $π^{I}$ or $ξ$ . It is worth noting that thus far we have been using $π^{I} (a | s)$ for the policy distribution, i.e. a policy $π^{I}$ which does not depend on $ξ$ . However, in practice (as in the proposed co-imitation learning method utilizing state-distribution matching) one may want to utilize a policy $π^{I} (a | s, ξ)$ which is capable of predicting optimal (imitation) actions given both the current state and morphology of the imitator, since different actions may be optimal depending on the current morphology of the agent. This leads to the alternative objective function

$$E_{q(s, a, s^{'})}\left[ - \ln p(s^{'} | s, a, ξ) - \ln π^{I}(a | s, ξ) \right]. \tag{23}$$
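The three-way split in Eq. 21 can be checked numerically on a toy problem. The sketch below uses one-step trajectories $(s_0, a_0, s_1)$ over two states and two actions; all probability tables are made-up values chosen only for illustration, and the dependence of the imitator on $ξ$ is folded into fixed tables.

```python
import itertools
import math

STATES, ACTIONS = (0, 1), (0, 1)

# Hypothetical expert: initial distribution, policy pi^E(a|s), transitions q(s'|s,a).
q_init = {0: 0.7, 1: 0.3}
pi_e = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
q_trans = {(s, a): ({0: 0.6, 1: 0.4} if a == 0 else {0: 0.1, 1: 0.9})
           for s in STATES for a in ACTIONS}

# Hypothetical imitator for one fixed morphology xi.
p_init = {0: 0.5, 1: 0.5}
pi_i = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.5, 1: 0.5}}
p_trans = {(s, a): ({0: 0.5, 1: 0.5} if a == 0 else {0: 0.3, 1: 0.7})
           for s in STATES for a in ACTIONS}


def traj_prob(init, policy, trans, s0, a0, s1):
    """Probability of the one-step trajectory (s0, a0, s1)."""
    return init[s0] * policy[s0][a0] * trans[(s0, a0)][s1]


kl_total = init_term = trans_term = policy_term = 0.0
for s0, a0, s1 in itertools.product(STATES, ACTIONS, STATES):
    q_tau = traj_prob(q_init, pi_e, q_trans, s0, a0, s1)
    p_tau = traj_prob(p_init, pi_i, p_trans, s0, a0, s1)
    kl_total += q_tau * math.log(q_tau / p_tau)
    # The three summands of Eq. 21, each weighted by the expert trajectory probability.
    init_term += q_tau * math.log(q_init[s0] / p_init[s0])
    trans_term += q_tau * math.log(q_trans[(s0, a0)][s1] / p_trans[(s0, a0)][s1])
    policy_term += q_tau * math.log(pi_e[s0][a0] / pi_i[s0][a0])
```

Up to floating-point error, `kl_total` equals `init_term + trans_term + policy_term`, which is the separation into initial-state, transition and policy matching that the text describes.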

b.2 Co-Imitation Learning by State Distribution Matching

While the equation above depends on the knowledge of the imitator transition distribution, an alternative to the chosen KL-divergence for trajectories as objective is to consider state-distribution matching divergences. Here, we want to compute a measure for

D (q (s | π^{E}), p (s | π^{I})) .

(24)

It is straightforward to show that knowledge of the trajectory distributions $q (τ)$ and $p (τ | π^{I}, ξ)$ allows us to derive the state distributions given the respective policy with

$$\begin{aligned}
p(s | π) &= \int_{τ} p(s, τ | π) = \int_{τ} p(s | τ, π) \, p(τ | π) \\
&= E_{p(τ | π)}\left[ p(s | τ) \right] = E_{p(τ | π)}\left[ \frac{1}{T} \sum_{t=0}^{T} \mathbf{1}(s_{t} = s) \right]. \qquad (25)
\end{aligned}$$

Intuitively, the state distribution $p (s | π)$ can be computed by sampling trajectories with the policy $π$ and computing the expected occurrence of a state $s$ in a trajectory $τ$ . This insight allows us to develop the presented co-imitation learning method using state-distribution matching without requiring access to the true transition probability $p (s^{'} | s, a, ξ)$ of the imitator, as in Eq. 23.
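The sampling view of Eq. 25 can be illustrated on a small discrete example. This is a sketch under stated assumptions, not the paper's code: the five-state chain, the random-walk policy and all function names are hypothetical, and each trajectory is normalised by its number of visited states so that the estimate sums to one.

```python
import random
from collections import Counter


def rollout(policy, step, initial_state, horizon, rng):
    """Sample one trajectory of states s_0, ..., s_horizon."""
    state = initial_state
    states = [state]
    for _ in range(horizon):
        action = policy(state, rng)
        state = step(state, action, rng)
        states.append(state)
    return states


def empirical_state_distribution(trajectories):
    """Average, over trajectories, of the per-trajectory visit frequency of each state."""
    counts = Counter()
    for states in trajectories:
        for s in states:
            counts[s] += 1.0 / len(states)
    return {s: c / len(trajectories) for s, c in counts.items()}


# Hypothetical toy problem: random walk on the chain 0..4, started in the middle.
def random_policy(state, rng):
    return rng.choice((-1, 1))


def chain_step(state, action, rng):
    return min(4, max(0, state + action))


rng = random.Random(0)
trajectories = [rollout(random_policy, chain_step, 2, 10, rng) for _ in range(200)]
p_s = empirical_state_distribution(trajectories)
```

Note that only sampled trajectories are needed; the transition function is called as a black box and its probabilities are never evaluated, which is the property the text relies on.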

While the main paper proposes to use a Wasserstein distance, we will present here some analysis using the KL-divergence as before. Using our state distributions defined above we arrive at the following objective with

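$$\begin{aligned}
D(q(s | π^{E}), p(s | π^{I})) &\overset{\text{def}}{=} KL(q(s | π^{E}) \,\|\, p(s | π^{I})) \\
&= KL\left( q(s | π^{E}) \,\Big\|\, E_{p(τ | π^{I}, ξ)}\left[ \frac{1}{T} \sum_{t=0}^{T} \mathbf{1}(s_{t} = s) \right] \right) \qquad (26) \\
&= - E_{q(s | π^{E})}\left[ \ln E_{p(τ | π^{I}, ξ)}\left[ \frac{1}{T} \sum_{t=0}^{T} \mathbf{1}(s_{t} = s) \right] \right] + C, \qquad (27)
\end{aligned}$$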

where we move terms which do not depend on $π^{I}$ or $ξ$ into a constant $C$ . We can now rearrange this objective with

$$\begin{aligned}
- E_{q(s | π^{E})}\Big[ \ln\Big( \frac{1}{T} \sum_{t=0}^{T} & E_{s_{0} \sim p(s_{0} | ξ)}\big[ E_{a_{0} \sim π^{I}(a | s_{0}, ξ)}\big[ E_{s_{1} \sim p(s | s_{0}, a_{0}, ξ)}\big[ \cdots \\
& E_{a_{t-1} \sim π^{I}(a | s_{t-1}, ξ)}\big[ E_{s_{t} \sim p(s | s_{t-1}, a_{t-1}, ξ)}\big[ \mathbf{1}(s_{t} = s) \big]\big] \cdots \big]\big]\big] \Big) \Big], \qquad (28)
\end{aligned}$$

which computes the expected probability of being in state $s$ at the end of a sub-trajectory unrolled for $t$ timesteps, where $s$ is sampled from the expert state distribution. To simplify this objective and make it tractable for optimization we can apply Jensen’s inequality and arrive at a lower bound with

$$\begin{aligned}
- T \cdot \text{Eq. (28)} \geq \sum_{t=0}^{T} E_{q(s | π^{E})}\Big[ \ln\Big( & E_{s_{0} \sim p(s_{0} | ξ)}\big[ E_{a_{0} \sim π^{I}(a | s_{0}, ξ)}\big[ E_{s_{1} \sim p(s | s_{0}, a_{0}, ξ)}\big[ \cdots \\
& E_{a_{t-1} \sim π^{I}(a | s_{t-1}, ξ)}\big[ E_{s_{t} \sim p(s | s_{t-1}, a_{t-1}, ξ)}\big[ \mathbf{1}(s_{t} = s) \big]\big] \cdots \big]\big]\big] \Big) \Big], \qquad (29)
\end{aligned}$$

where we removed the negative sign and consider this as a maximization problem with respect to the imitator policy $π^{I}$ and imitator morphology $ξ$ . The main insight we get from this exercise is that, unlike in the simpler behavioural cloning case discussed above in Eq. 21 and the preceding equations, the behavioural policy $π^{I}$ and morphology $ξ$ are here inherently entangled and have to be optimized concurrently, i.e. a separation is not possible without further assumptions or simplifications. While in Eq. 21 we were able to separate the optimization problem into clearly defined components for matching transition probabilities (depending on $ξ$ ) and matching the expert and imitator policies, we find that this is not the case for state-distribution matching as derived in Eq. 29 (here for the case of the KL divergence).
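For readers who want the intermediate step between Eq. 28 and Eq. 29 spelled out, the following sketch writes it with a shorthand. The symbol $P_t(s)$ is introduced here only for this illustration and abbreviates the nested expectation in Eq. 28, i.e. the probability of reaching state $s$ after $t$ steps under $π^{I}$ and $ξ$; the step uses the concavity of the logarithm and treats the time-averaging weights as normalised.

```latex
% P_t(s): shorthand (introduced for this sketch) for the nested expectation in Eq. 28.
\begin{align*}
-\,\text{Eq. (28)}
  &= E_{q(s | \pi^{E})}\Big[ \ln\Big( \tfrac{1}{T} \sum_{t=0}^{T} P_t(s) \Big) \Big] \\
  &\geq E_{q(s | \pi^{E})}\Big[ \tfrac{1}{T} \sum_{t=0}^{T} \ln P_t(s) \Big]
     && \text{(Jensen, $\ln$ is concave)} \\
  &= \tfrac{1}{T} \sum_{t=0}^{T} E_{q(s | \pi^{E})}\big[ \ln P_t(s) \big].
\end{align*}
% Multiplying both sides by T gives the bound stated in Eq. 29.
```

Since every $P_t(s)$ depends jointly on $π^{I}$ and $ξ$ through the whole chain of nested expectations, no term in the final sum isolates the policy from the morphology, which is the entanglement discussed above.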

Appendix C Experimental set-up

c.1 Q-function baseline

Here we describe how we adapt the morphology optimization procedure described in luck2020data to the imitation learning setting to serve as a baseline. To do this, we use the Q-function learned by SAC on the SAIL reward as a surrogate for computing returns from entire episodes, and optimize the morphology using those returns. We use a linearly decreasing $ϵ$ -greedy exploration schedule and Particle Swarm Optimization eberhart1995pso (as proposed by the original authors) to find the best morphology according to the Q-function in the case of exploitation. For exploration episodes, we sample morphology parameters uniformly from within the bounds given in Table 6. The $ϵ$ is linearly reduced over 1 million timesteps.
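The exploration schedule described above can be sketched as follows. Several details are assumptions of this illustration and are not stated in the text: the start and end values of $ϵ$ (1.0 and 0.0 here), the function names, and the `exploit_fn` callback, which stands in for the Particle Swarm Optimization over the Q-function.

```python
import random

EPSILON_DECAY_STEPS = 1_000_000  # "linearly reduced over 1 million timesteps"


def epsilon(timestep, start=1.0, end=0.0):
    """Linearly interpolate epsilon from start to end (start/end are assumed values)."""
    frac = min(max(timestep / EPSILON_DECAY_STEPS, 0.0), 1.0)
    return start + frac * (end - start)


def sample_uniform(bounds, rng):
    """Draw each morphology parameter uniformly from its (low, high) bound."""
    return [rng.uniform(lo, hi) for lo, hi in bounds]


def propose_morphology(timestep, bounds, exploit_fn, rng):
    """Epsilon-greedy choice between a random morphology and the optimizer's proposal."""
    if rng.random() < epsilon(timestep):
        return sample_uniform(bounds, rng)   # exploration episode
    return exploit_fn(bounds)                # exploitation, e.g. PSO on the Q-function


# Usage with the Humanoid scaling-factor bounds (0.5x to 2x, three parameters).
humanoid_bounds = [(0.5, 2.0)] * 3
placeholder_optimizer = lambda bounds: [1.0, 1.0, 1.0]  # hypothetical stand-in for PSO
rng = random.Random(0)
early = propose_morphology(0, humanoid_bounds, placeholder_optimizer, rng)
late = propose_morphology(EPSILON_DECAY_STEPS, humanoid_bounds, placeholder_optimizer, rng)
```

With these assumed endpoints, the first call always explores and the last call always exploits.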

c.2 Tasks

We extract marker positions and velocities from the CMU motion capture data cmu and apply preprocessing, such as resampling from 120 Hz to the MuJoCo speed of 66.67 Hz. For all tasks we include the preprocessed demonstrator data in the code supplement, as well as the preprocessing code.
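The resampling step could look like the sketch below. The text does not say which interpolation scheme was used, so linear interpolation is an assumption here, and `resample_linear` is a hypothetical helper operating on a single coordinate of a single marker; it would be applied per coordinate.

```python
def resample_linear(samples, src_hz, dst_hz):
    """Linearly interpolate a 1-D sequence recorded at src_hz onto a dst_hz time grid."""
    if len(samples) < 2:
        return list(samples)
    duration = (len(samples) - 1) / src_hz
    # Number of output samples that fit into the recording (small tolerance for rounding).
    n_out = int(duration * dst_hz + 1e-9) + 1
    out = []
    for k in range(n_out):
        pos = k / dst_hz * src_hz                # fractional index into the source
        i = min(int(pos), len(samples) - 2)      # left neighbour, clamped at the end
        w = pos - i                              # interpolation weight of the right neighbour
        out.append((1.0 - w) * samples[i] + w * samples[i + 1])
    return out


# One second of a signal whose value equals its timestamp, recorded at 120 Hz.
ramp = [i / 120.0 for i in range(121)]
resampled = resample_linear(ramp, src_hz=120.0, dst_hz=66.67)
```

Because the test signal is linear in time, each resampled value should equal its new timestamp `k / 66.67`, which makes the behaviour easy to verify.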

c.3 HalfCheetah tasks

In the HalfCheetah tasks we directly optimize the lengths of each limb. The lower bound for optimization is $10^{-6}$ and the upper bound is twice the original value. When the imitator has three leg segments, there are a total of six parameters to optimize, while for the other case there are four. Figure 8 gives the learned parameters for the 2to3 task, as well as the meaning of each parameter.

Humanoid tasks

For the Humanoid tasks we optimize a scaling factor for the torso, legs and arms. The standard MuJoCo Humanoid corresponds to scaling factors $[1, 1, 1]$ . Table 6 gives the bounds for this optimization. The specific motions and subject IDs are detailed in the code supplement. Due to the variable amount of data available, the three Humanoid tasks have different numbers of demonstrator trajectories and all episode lengths are different. For "Soccer kick" the data includes multiple subjects.

Hyperparameter	Value
Batch size	$1024$
$γ$	$0.97$
Use transitions	False
disc decay	0.00001
All networks	MLP
Network layers	3
Hidden nodes	200
Activation	ReLU
normalize obs	False
Updates per step	1
Q weight decay	$10^{-5}$
Optimizer	Adam
Entropy tuning	True
SAC $τ$	$0.005$
learning rate	0.0003

Table 1: Hyper-parameters values shared throughout all experiments.

Hyperparameter	Value
Cheetah episode length	1000
Humanoid max episode length	300
Cheetah early termination	False
Humanoid early termination	True
Max steps	1.5M

Table 2: Task-specific hyper-parameter values

Hyperparameter	Value
Morphology Optimizer	PSO
PSO particles	250
PSO iters	250
$ϵ$ decay over	1M steps

Table 3: Q-function baseline-specific hyper-parameters values.

Hyperparameter	Value
VAE scaler	$1$
$β$ of VAE	0.2

Table 4: SAIL-specific hyper-parameters values.

Hyperparameter	Value
Reward style	AIRL
Use $log r (\cdot)$	True

Table 5: GAIL-specific hyper-parameters values.

c.4 Co-Imitation Learning Hyper-parameters

Table 1 gives the common hyper-parameters used for all experiments, while Table 4 gives SAIL-specific hyper-parameters and Table 5 gives GAIL-specific hyper-parameters. Table 3 gives hyper-parameters for the Q-function baseline adapted from luck2020data. Batch size corresponds to the size of the mini-batch used for discriminator and policy updates. $γ$ is the discount factor. VAE scaler is the balancing term between the policy prior and the SAC policy loss (see Eq. 16). Disc decay is the weight decay multiplier for the discriminator weights. Updates per step is the number of policy and discriminator updates we take for each environment timestep. SAC $τ$ is the soft target network update constant.
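To make the role of SAC $τ$ concrete, the sketch below shows the usual soft (Polyak) target update on plain lists of numbers. The convention that the target moves a fraction $τ$ towards the online parameters is the standard one for SAC and is assumed here; `soft_update` is a hypothetical helper, not a function from the paper's code.

```python
def soft_update(target_params, online_params, tau=0.005):
    """Return target parameters moved a fraction tau towards the online parameters."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]


# With tau = 0.005 (Table 1) the target network changes slowly.
target = [0.0, 10.0]
online = [1.0, 0.0]
target = soft_update(target, online)
```

After one update the first target parameter has moved from 0 to 0.005, i.e. 0.5% of the way to its online counterpart.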

c.5 Morphology Optimization details

As described in Section 5.2, we use Bayesian Optimization for proposing the next candidate morphology. As the surrogate model we use a basic Gaussian Process regression model with the Matern52 kernel and a constant mean function. The kernel and mean function hyper-parameters are trained by minimizing the negative marginal log-likelihood (MLL), where the kernel function uses automatic relevance determination (ARD) rasmussen_2006_gpbook. The optimization method selected is the L-BFGS algorithm.
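For reference, the Matérn 5/2 kernel with ARD can be written in a few lines. This is a plain-Python sketch of the standard kernel formula, not the GPy implementation used in the experiments; the function name and the example length-scales are illustrative.

```python
import math


def matern52_ard(x, y, lengthscales, variance=1.0):
    """Matern 5/2 kernel with one length-scale per input dimension (ARD)."""
    # Scaled Euclidean distance: each dimension is divided by its own length-scale.
    r = math.sqrt(sum(((a - b) / l) ** 2 for a, b, l in zip(x, y, lengthscales)))
    sqrt5_r = math.sqrt(5.0) * r
    return variance * (1.0 + sqrt5_r + 5.0 * r * r / 3.0) * math.exp(-sqrt5_r)


# A large length-scale on the second dimension makes the kernel almost ignore it.
lengthscales = [1.0, 100.0]
k_change_dim2 = matern52_ard([0.0, 0.0], [0.0, 1.0], lengthscales)
k_change_dim1 = matern52_ard([0.0, 0.0], [1.0, 0.0], lengthscales)
```

Here `k_change_dim2` stays close to 1 while `k_change_dim1` drops noticeably, which illustrates how ARD lets the surrogate model learn that some morphology parameters matter more than others.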

In order to compute the next morphology candidate $ξ_{next}$ we compute the predictive posterior distribution for the test morphologies $\tilde{ξ} \in \mathbb{R}^{M \times E_{ξ}}$ , where $M$ is the number of test candidates to evaluate and $E_{ξ}$ is the dimension of morphology attributes. The range of values for the test morphologies is shown in Table 6, where the bounds column indicates the range for each of the morphology parameters in the respective environment.

The basic Gaussian Process regression model complexity is $O (N^{3})$ , where $N$ is the number of samples used for training rasmussen_2006_gpbook. Thus, the GP scales poorly, and using the entire data-set of collected imitation trajectories and morphologies $Ξ$ would substantially slow down training. In addition, as discussed in Section 5.2, policies that were evaluated early in training perform worse than recent ones, and hence we have lower confidence in the measured performance of those morphologies. For this reason, we limit the number of previous morphologies considered to $N = 200$ , so that only the most recent policies and morphologies are taken into account for modeling the relationship between the distance distribution and the morphology parameters.
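One simple way to realise such a window of the $N = 200$ most recent evaluations is a bounded double-ended queue. The class below is a hypothetical sketch, not the paper's data structure; it only shows how older entries drop out automatically.

```python
from collections import deque

MAX_GP_POINTS = 200  # N in the text


class MorphologyHistory:
    """Keeps only the most recent (morphology, distance) pairs for fitting the GP."""

    def __init__(self, maxlen=MAX_GP_POINTS):
        self._buffer = deque(maxlen=maxlen)  # oldest entries are discarded when full

    def add(self, morphology, distance):
        self._buffer.append((tuple(morphology), float(distance)))

    def training_data(self):
        """Return inputs and targets for the surrogate model, oldest first."""
        xs = [m for m, _ in self._buffer]
        ys = [d for _, d in self._buffer]
        return xs, ys


history = MorphologyHistory()
for i in range(250):
    history.add([float(i), 1.0, 1.0], distance=1.0 / (i + 1))
xs, ys = history.training_data()
```

After 250 insertions only entries 50 to 249 remain, so the cubic cost of fitting the GP stays bounded regardless of how long training runs.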

Environment	$E_{ξ}$	Bounds	Description
Cheetah	$6$	$0 - 2 x$	Two L, three S
2-seg Cheetah	$4$	$0 - 2 x$	Two L, two S
Humanoid	$3$	$0.5 x - 2 x$	T, H and L scale

Table 6: Number of morphology parameters $E_{ξ}$ and their bounds in each environment. Here "$2 x$" means the optimization is bounded to twice the default size. We use (T) for torso, (L) for legs, (S) for segments and (H) for hands.

Appendix D Hardware and software

d.1 Hardware and runtime

We ran our experiments on a Dell PowerEdge C4140 with 2x8 core Intel Xeon Gold 6134 3.2GHz and a V100 32GB GPU. The Humanoid experiments took around 24h to finish on this setup, while the HalfCheetah took 22h.

d.2 Software versions used

The experiments were run on a cluster running CentOS 7. The exact software packages used are reported in Table 7.

Software	Version used
Python	3.8
PyTorch	1.12
NumPy	1.23
GPy	1.10.0
Gym	0.24
MuJoCo	2.1
mujoco-py	2.1.2
GPyOpt	1.2.6

Table 7: Software packages used

Appendix E Experimental results

e.1 Choice of the morphology optimizer

In addition to the proposed Algorithm 2 based on BO, we evaluate two other algorithms to optimise the morphology parameters: Random Search (RS) bergstra2012randomsearch, and covariance matrix adaptation evolutionary strategy (CMA-ES) hansen2001cmaes. As described in Algorithm 1, the policies are trained using the same morphology for $N_{ξ} = 20$ episodes, as changing it too often would hinder learning. For both CMA-ES and RS we follow a similar procedure where the optimization algorithms take as input the observations $X = \{ξ_{n}\}, \forall ξ_{n} \in Ξ$ and use as targets $Y = \{y_{n}\}, \forall (ξ_{n}, T_{n}^{I}) \in Ξ$ , where $y$ is computed following Eq. (12). As opposed to BO, we use the entire data-set $Ξ$ as data points in CMA-ES, since its performance does not suffer from the number of samples. By contrast, our implementation of RS keeps the best previous morphology and randomly proposes a new morphology.
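The Random Search variant described above can be sketched as follows. The text does not specify the proposal distribution, so uniform sampling within the bounds is an assumption, as is the convention that a lower target $y$ is better; the function names are hypothetical.

```python
import random

EPISODES_PER_MORPHOLOGY = 20  # N_xi in the text


def should_change_morphology(episode):
    """A new morphology is proposed only every N_xi episodes."""
    return episode % EPISODES_PER_MORPHOLOGY == 0


def random_search_step(history, bounds, rng):
    """Return the best morphology seen so far and a new random candidate.

    history: list of (morphology, y) pairs; lower y is assumed to be better.
    bounds:  list of (low, high) pairs, one per morphology parameter.
    """
    best = min(history, key=lambda item: item[1])[0] if history else None
    candidate = [rng.uniform(lo, hi) for lo, hi in bounds]  # assumed uniform proposal
    return best, candidate


# Usage with made-up evaluations of two Humanoid morphologies.
rng = random.Random(0)
bounds = [(0.5, 2.0)] * 3
history = [([1.0, 1.0, 1.0], 0.8), ([0.9, 1.2, 1.1], 0.3)]
best, candidate = random_search_step(history, bounds, rng)
```

Because each candidate is drawn independently of the history, the results of such a scheme can vary considerably between seeds.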

Figure 7: Wasserstein distance for the jogging experiment using CMA-ES (blue), Random Search (orange) and Bayesian Optimization (green). Both RS and BO perform comparably on average, but BO has much lower variance.

Figure 8: Morphology parameters learned by CoIL for three seeds in the 2to3 task. Note how all seeds set either the shin or the foot to close to zero which allows very close matching of the real demonstrator.

The results for each optimization method in the Humanoid jogging experiment are shown in Figure 7. Since the number of different morphologies we can evaluate is relatively low, the BO approach benefits from the low data regime, achieving a low mean distance and low variance. Similarly, the RS results present a low mean but higher variance due to the randomness inherent in the algorithm. By contrast, CMA-ES performs worse than both BO and RS, as it suffers from the low number of morphologies evaluated and gets stuck in a local optimum. These results show that exploring the morphologies using the BO algorithm is beneficial for the task of co-imitation, as we can find an optimal solution while evaluating a low number of morphologies and keeping a relatively low variance.

e.2 Analysis of morphology parameters

Here we show the evolution of the morphology as a function of time for each seed of the main CoIL experiment. Figure 8 gives the plots for each parameter, where the parameter value is the argmax of the GP mean, i.e. the best morphology so far according to the GP.

In this task we imitate a simpler 2-leg-segment cheetah using a 3-leg-segment cheetah. It is possible for the imitator to adapt its morphology in such a way that it matches the demonstrator exactly, by setting either both shins, both feet or one of each to zero. We can see in Figure 8 that all seeds set either the shin or the foot to close to zero, meaning they are able to closely replicate the demonstrator morphology.
