TarGF: Learning Target Gradient Field for Object Rearrangement

Mingdong Wu^{* 1, 3}, Fangwei Zhong^{* 2, 3}, Yulong Xia¹, Hao Dong^{1, 4}

^{1}

Center on Frontiers of Computing Studies, School of Computer Science, Peking University

^{2}

School of Artificial Intelligence, Peking University

^{3}

Beijing Institute for General Artificial Intelligence (BIGAI)

^{4}

Peng Cheng Laboratory
{wmingd, zfw, hao.dong}@pku.edu.cn

Abstract

Object Rearrangement is to move objects from an initial state to a goal state. Here, we focus on a more practical setting in object rearrangement, i.e., rearranging objects from shuffled layouts to a normative target distribution without explicit goal specification. However, it remains challenging for AI agents, as it is hard to describe the target distribution (goal specification) for reward engineering or collect expert trajectories as demonstrations. Hence, it is infeasible to directly employ reinforcement learning or imitation learning algorithms to address the task. This paper aims to search for a policy only with a set of examples from a target distribution instead of a handcrafted reward function. We employ the score-matching objective to train a Target Gradient Field (TarGF), indicating a direction on each object to increase the likelihood of the target distribution. For object rearrangement, the TarGF can be used in two ways: 1) For model-based planning, we can cast the target gradient into a reference control and output actions with a distributed path planner; 2) For model-free reinforcement learning, the TarGF is not only used for estimating the likelihood-change as a reward but also provides suggested actions in residual policy learning. Experimental results in ball rearrangement and room rearrangement demonstrate that our method significantly outperforms the state-of-the-art methods in the quality of the terminal state, the efficiency of the control process, and scalability. The code and demo videos are on sites.google.com/view/targf. {NoHyper} ^†^†* indicates equal contribution

\UseRawInputEncoding\UseRawInputEncoding

1 Introduction

Figure 1: We study the object rearrangement task without explicit goal specification.

As shown in Fig. 1, we consider object rearrangement without explicit goal specification Jiang et al. (2012); Abdo et al. (2015), where the agent is required to manipulate a set of objects from an initial layout to a normative distribution. This task is taken for granted by humans Ricciuti (1965) and widely exists in our daily life, such as tidying up a table Kapelyukh and Johns (2021), placing furniture Wang et al. (2020), and sorting parcels Kim et al. (2020).

However, compared with conventional robot tasks Yu et al. (2020); Zhu et al. (2017); Luo et al. (2019); Kiran et al. (2021), there are two critical challenges in rearrangement tasks: 1) Hard to clearly define the target state (goal), as the target distribution is diverse and the patterns are difficult to describe in the program, e.g., how to evaluate the cleanness quantitatively? As a result, it is difficult to manually design a reward/objective function, which is essential to modern control methods, e.g., deep reinforcement learning (DRL) Mnih et al. (2015). 2) In the physical world, the agents also need to find an executable path to reach the target state safely and efficiently, i.e., finding a short yet collision-free path to reach the target distribution. To this end, the agents are required to jointly learn to evaluate the quality of the state (reward) and take actions to move objects. In other words, the agents should answer the two questions: What kind of layouts does the user prefer, and How to manipulate objects to meet the target distribution?

Thus, it is necessary to find a way to build an agent without the dependence on the explicit goal state and shaped reward. Imitation learning (IL) can implicitly learn the reward function from expert trajectories Ho and Ermon (2016); Reddy et al. (2019); Liu et al. (2020). However, collecting a large number of expert trajectories is expensive. Recently, example-based reinforcement learning (RL) Fu et al. (2018); Eysenbach et al. (2021) has tried to reduce the reliance on expert trajectories, i.e., learning a policy only with examples of successful outcomes. However, it is still difficult to learn a classifier to provide accurate reward signals for learning in the case of high-dimensional state and action space, as there are many objects to control in object rearrangements.

In this work, we separate the task into two stages: 1) learning a score function to estimate the target-likelihood-change between adjacent states, and 2) learning/building a policy to rearrange objects with the score function. Inspired by the recent advances in score-based generative models Song and Ermon (2019); Song et al. (2020); Shi et al. (2021); Luo and Hu (2021); Meng et al. (2021), we employ the score-matching objective to learn a target gradient field (TarGF) by perturbing the well-arranged examples. The target gradient field can be viewed as object-wise guidance that provides a pseudo velocity on each object, pointing to regions with a higher density of the target distribution, e.g., cleaner configurations. To manipulate objects in the physical world, we further demonstrate two ways to leverage the target gradient field in control. 1) We integrate the guidance with a model-based path planner to rearrange objects without collision. 2) We further derive the target gradient to a delta log-likelihood as a reward function, key ingredients to reinforcement learning. With the TarGF-based reward and a TarGF-guided residual policy network Silver et al. (2018), we can train an effective RL-based policy from interactions.

In the experiments, we evaluate the effectiveness of our methods on two typical rearrangement tasks: Ball Rearrangement and Room Rearrangement. First, we introduce three metrics to analyse the quality of the terminal state and the control process. In Ball Rearrangement, we demonstrate that both of our control methods can efficiently reach a target distribution with high diversity and realism. We also observe that our methods can find a path with fewer collisions and fewer steps to converge compared with learning-based baselines Reddy et al. (2019); Eysenbach et al. (2021) and planning-based baselines Berg et al. (2011). In Room Rearrangement, we further evaluate the feasibility of our methods in complex scenes where there are multiple heterogeneous objects.

Our contributions are summarised as follows:

We introduce a score-based method to learn a target gradient field (TarGF) by perturbing the state of well-arranged target examples.
We point out two usages of the target gradient field in object rearrangement: 1) providing guidance in moving direction; 2) deriving a reward for learning.
We conduct experiments to demonstrate that both traditional planning and RL methods can benefit from the target gradient field and significantly outperform the state-of-the-art methods in sample diversity, realism and safety.

2 Related Works

2.1 Object and Scene (Re-)Arrangements

The object arrangement has been previously studied by researchers from the robotics community Schuster et al. (2012); Abdo et al. (2015, 2016) and the graphics community Fisher et al. (2012); Yu et al. (2011); Yeh et al. (2012); Handa et al. (2016); Qi et al. (2018). However, most of them focus on manually designing rules/ energy functions to find a goal Schuster et al. (2012); Abdo et al. (2015, 2016) or synthesising a scene configuration Yu et al. (2011); Yeh et al. (2012); Handa et al. (2016); Qi et al. (2018) that satisfies the preference of the user. The most recent work Kapelyukh and Johns (2021) tries to learn a GNN to output an arrangement tailored to human preferences. However, they neglect the physical constraints during the process. Hence, the accessibility and transition cost from the initial and target states are not guaranteed. Considering the physical interaction, the recent works on scene rearrangement Batra et al. (2020); Weihs et al. (2021) mitigate the goal specification problem by providing a goal scene to the agent, making the reward shaping feasible. Concurrent works Kant et al. (2022); Sarch et al. (2022) also notice the necessity of automatic goal inference for tying rooms and exploit the commonsense knowledge from Large Language Model (LLM) or memex graph to infer rearrangements goals when the goal is unspecified. Another concurrent work Netanyahu et al. (2022) aims at discovering generalisable spatial goal representations via graph-based active reward learning for 2D object rearrangement. In this paper, we focus on a more fundamental problem in rearrangement: how to estimate the similarity (i.e., the target likelihood) between the current state and the example sets and manipulate objects to maximise it. Here, only a set of positive target examples are required to learn the gradient field, rather than the prior knowledge about specific scenarios.

2.2 Learning without Reward Engineering

Without reward, previous works explore algorithms to learn a control policy from expert demonstrations via imitation learning Pomerleau (1991); Ross and Bagnell (2010); Ross et al. (2011); Ho and Ermon (2016) or inverse reinforcement learning Fu et al. (2017); Liu et al. (2020); Kostrikov et al. (2018); Ziebart et al. (2008). Imitation learning (IL) Pomerleau (1991); Ross and Bagnell (2010); Ross et al. (2011) aims to directly optimise the policy by cloning the behaviour from the expert trajectories. The inverse reinforcement learning (IRL) Fu et al. (2017); Liu et al. (2020); Kostrikov et al. (2018); Ziebart et al. (2008) tries to learn a reward function from data for reinforcement learning. Considering the difficulty in collecting expert trajectories, there are some example-based methods proposed to learn policy only with a set of successful examples Fu et al. (2018, 2017); Reddy et al. (2019); Eysenbach et al. (2021). The most recent work is RCE Eysenbach et al. (2021) which directly learns a recursive classifier from successful outcomes and interactions for policy optimisation. We also try to find an example-based control method for rearrangement with a set of examples. Our method can learn the target gradient fields from the examples, without any additional efforts in reward engineering. The target gradient fields can provide meaningful reward signals and action guidance for reinforcement learning and traditional planners. To the best of our knowledge, such usage of gradients fields is not explored in previous works.

3 Problem Statement

We aim to learn a policy that moves objects to reach states close to a target distribution without relying on reward engineering. Similar to the example-based control Eysenbach et al. (2021), the grounded target distribution $p_{t a r} (s)$ is unknown to the agent, and the agent is only given a set of target examples $S^{*} = {s^{*}}$ where $s^{*} \sim p_{t a r} (s)$ . In practice, these examples can be provided by the user without accessing the dynamics of the robot. The agent starts from an initial state $s_{0} \sim p_{0} (s_{0})$ , where $N$ objects are randomly placed in the environment. At time step $t$ , the agent takes action $π (a_{t} | s_{t})$ imposed on the objects with dynamics $p (s_{t + 1} | s_{t}, a_{t})$ . Here, the goal of the agent is to search for a policy $π^{*} (a_{t} | s_{t})$ that maximises the discounted log-target-likelihood of the future states:

π^{*} = arg max π E_{ρ (s_{0}), τ \sim π} [\sum s_{t} \in τ γ^{t} log p_{t a r} (s_{t})]

(1)

where $γ \in (0, 1]$ denotes the discount factor. Notably, we assume that $p_{t a r} (s) > 0$ everywhere, since we can perturb the original data with a tiny noise (e.g., $N (0, 0.00001)$ ) to ensure the perturbed density is always positive. Song and Ermon (2019) also used this trick to tackle manifold hypothesis issue.

This objective reveals three challenges: 1) Inaccessibility problem: The grounded target distribution $p_{t a r}$ in Eq. 1 is inaccessible. Thus, we need to learn a function to approximate the log-target-likelihood for policy search. 2) Sparsity problem: The log-target-likelihood is sparse in the state space, as mentioned in Song and Ermon (2019). Hence, even if we have access to the target distribution $p_{t a r}$ , it is still difficult to explore the high-density region due to the Manifold Hypothesis mentioned in Song and Ermon (2019). 3) Adaptation problem: The dynamics $p (s_{t + 1} | s_{t}, a_{t})$ is also inaccessible. To maximise the cumulative sum in Eq. 1, the agent is required to adapt to the dynamics to efficiently increase the target likelihood.

To address these problems, we partition the task into 1) learning to estimate the log-target-likelihood (reward) of the state and 2) learning/building a policy to rearrange objects to adapt to the dynamics.

4 Method

Figure 2: An overview of our method. (a) We train the target score network via the score-matching objective. The target examples from the target distribution are first disturbed by a Gaussian kernel. The target score network is forced to match the denoising direction of the perturbation. (b) Our framework is based on the trained target score network for rearrangement tasks. In the model-free setting, the TarGF provides exploration guidance and reward estimation for RL. In the model-based setting, the TarGF provides reference control for a model-based planner based on ORCA. The planner then outputs a collision-free action based on the reference control.

4.1 Motivation

We argue that estimating the gradients of the log-target-likelihood $\nabla_{s} log p_{t a r} (s)$ can tackle the first two difficulties mentioned in Sec. 3: For the inaccessibility problem, we can approximate the reward increment between two adjacent states by the first-order Taylor expansion using the gradient field $log p_{t a r} (s_{t + 1}) - log p_{t a r} (s_{t}) \approx ⟨ \nabla_{s} log p_{t a r} (s_{t}), s_{t + 1} - s_{t} ⟩$ . As shown in Sec. 4.4, this helps to derive a surrogate objective of Eq. 1.

For the sparsity problem, we can leverage the gradient field to help with exploration since the gradient $\nabla_{s} log p_{t a r} (s)$ indicates the direction that points to the higher density region of the target distribution $p_{t a r}$ . To estimate the gradients of the log-target-likelihood $\nabla_{s} log p_{t a r} (s)$ , noted as the target gradient field (TarGF) $Φ_{t a r}$ , we adopt the deep score-matching technique Song et al. (2020) that has achieved impressive results in many research areas recentlyShi et al. (2021); Meng et al. (2021); Luo and Hu (2021).

To address the adaptation problem, we further demonstrate two ways to incorporate the trained target gradient field $Φ_{t a r}$ with control algorithms for object rearrangement: 1) For the model-based setting, our framework casts the target gradient into a reference control and outputs actions with a distributed path planner. 2) For the model-free setting (reinforcement learning), we leverage the target gradient to estimate reward and provide suggested actions in residual policy learning Silver et al. (2018).

In the followings, we will first introduce how to train the target gradient field $Φ_{t a r}$ via score-matching in Sec. 4.2 and then introduce our rearrangement framework under the model-based and model-free settings in Sec. 4.3 and Sec. 4.4 respectively.

4.2 Learning the Target Gradient Field from Examples

The target score network $Φ_{t a r} : S \to S$ is the core module of our framework, which aims at estimating the score function(i.e.the gradient of log-density) of the target distribution $\nabla_{s} log p_{t a r} (s)$ .

To train the target score network, we adopt the Denoising Score-Matching (DSM) objective proposed by Vincent (2011), which can guarantee a reasonable estimation of the $\nabla_{s} log p_{t a r} (s)$ . In training, we pre-specify a noise distribution $q_{σ} (˜ s | s)$ on the collected target state $s$ . Then, we match the output of the target gradient field with the score of the perturbed target distribution $q_{σ} (˜ s) = \int q_{σ} (˜ s | s) p_{t a r} (s) d s$ :

\frac{1}{2} E_{q_{σ} (˜ s | s), p_{t a r} (s)} [| | Φ_{t a r} (˜ s) - \nabla_{˜ s} log q_{σ} (˜ s | s) | |_{2}^{2}] = \frac{1}{2} E_{q_{σ} (˜ s | s), p_{t a r} (s)} [| | Φ_{t a r} (˜ s) - \frac{s - ˜ s}{σ^{2}} | |_{2}^{2}]

(2)

The DSM objective guarantees the optimal score network satisfies $Φ_{t a r}^{*} (s) = \nabla_{s} q_{σ} (s)$ almost surely. When $σ$ is small enough, we have $\nabla_{s} q_{σ} (s) \approx \nabla_{s} log p_{t a r} (s)$ , so that $Φ_{t a r}^{*} (s) \approx \nabla_{s} log p_{t a r} (s)$ .

In practice, we adopt a variant of DSM proposed by Song et al. (2020), which conducts DSM under different noise scales simultaneously. We implement the target score network as a graph-based network $Φ_{t a r} (s, t)$ that is conditioned on a noise level $t \in (0, 1]$ . The input state $s = [s^{1}, s^{2}, . . ., s^{N}]$ is constructed as a fully-connected graph that takes each object’s state(e.g.position, orientation) $s^{i} \in R^{f_{s}}$ and static properties (e.g.category, bounding-box) $p^{i} \in R^{f_{p}}$ as the node feature $[s^{i}, p^{i}] \in R^{f_{s} + f_{p}}$ . The input graph is pre-processed by linear initial feature extraction layers and then passed through two Edge-Convolution layers. After message-passing, the output on each node serves as the component of the target gradient on the i-th object’s state $Φ_{t a r}^{i} (s) \in R^{f_{s}}$ .

4.3 Model-based Planning with the Target Gradient Field

The target gradient $g = Φ_{t a r} (s)$ can be translated into a gradient-based action $a^{g} = G (g)$ , where $G$ denotes the gradient-to-action translation. A gradient over the state space can be viewed as velocities imposed on each object. For instance, if the state of each object is a 2-dimensional position $s_{i} = [x, y]$ , then the target gradient component on the i-th object can be viewed as linear velocities on two axes $Φ_{t a r} (s)^{i} = [v_{x}, v_{y}]$ . When the action space is also velocities, we can construct $a^{g}$ by simply projecting the target gradient $a^{g} = G (g) = \frac{v_{m a x}}{| | g | |_{2}} \cdot g$ into the action space, where the $v_{m a x}$ denotes the speed limit. Following the gradient-based action, objects will move towards directions that may increase the likelihood of the next state, which meets the need of the rearrangement task.

However, the gradient-based action is imperfect(e.g.leading to collisions between the objects), as the target gradient has no access to the information of environment dynamics. This may be inefficient and unsafe due to the severe collisions between objects.

To adapt the gradient-based action to the environment dynamics, we incorporate the target gradient field with a model-based off-the-shelf collision-avoidance algorithm ORCA. The ORCA planner $ϕ_{θ}$ takes the gradient-based action $a^{g}$ as reference velocities and then outputs the collision-free velocities $a = ϕ_{θ} (a^{g})$ with the least modification to the reference velocities. In this way, the adapted action can lead objects to move towards a higher likelihood region efficiently and safely. For more details of the ORCA planner, we defer to Appendix B.2.

4.4 Learning Policy with the Target Gradient Field

Qualitative results of each method on ball rearrangement tasks.
We selected the two most representative results for each method. — Figure 3: Qualitative results of each method on ball rearrangement tasks. We selected the two most representative results for each method.

We further propose a framework based on the model-free reinforcement learning (RL) algorithm to accomplish the rearrangement tasks in the model-free scenario, where the agent needs to adapt the dynamics via interacting with the environment.

To tackle the inaccessibility problem mentioned in Sec. 3, we first derive an equivalent $J^{*} (π)$ of the original objective in Eq. 1 and further obtain a surrogate objective $^J (π)$ to approximate this equivalent:

	$J (π) ⟺$	$E_{ρ (s_{0}), τ \sim π} [\sum s_{t} \in τ γ^{t} log p_{t a r} (s_{t})] - \frac{E_{ρ (s_{0})} [log p_{t a r} (s_{0})]}{1 - γ}      c o n s t a n t C$		(3)
	$=$	$E_{ρ (s_{0}), τ \sim π} [\sum 1 \leq t \leq T γ^{t} \sum 1 \leq k \leq t [log p_{t a r} (s_{k}) - log p_{t a r} (s_{k - 1})]] d e f = J^{*} (π)$		(3)

Further, we derive a surrogate objective $^J (π)$ to approximate the equivalent objective $J^{*} (π)$ . We notice that in most physical simulated tasks, the distance between two adjacent states $| | s_{t} - s_{t - 1} | |_{2}$ is quite small, which inspires us to approximate the delta log-likelihood using the Taylor expansion:

J^{*} (π) \approx E_{ρ (s_{0}), τ \sim π} [\sum 1 \leq t \leq T γ^{t} \sum 1 \leq k \leq t ⟨ Φ_{t a r} (s_{k - 1}), s_{k} - s_{k - 1} ⟩      r_{t}] d e f =^J (π)

(4)

In this way, we can optimise the surrogate objective $^J (π)$ by assigning $\sum 1 \leq k \leq t ⟨ Φ_{t a r} (s_{t}), s_{k} - s_{k - 1} ⟩$ as the immediate reward. In practice, we simply assign the last term of the cumulative summation as the immediate reward $r_{t} = ⟨ Φ_{t a r} (s_{t}), s_{t} - s_{t - 1} ⟩$ , and this version is shown to be the most effective.

To tackle the sparsity problem mentioned in Sec. 3, our idea is to build our policy $π_{θ}$ upon the gradient-based action $a_{t}^{g}$ mentioned in Sec. 4.3, as $a_{t}^{g}$ is the best direction to increase the likelihood:

a_{t} = π (s_{t}, g_{t}) + a_{t}^{g}      o u t p u t a c t i o n, a_{t}^{g} = G (g) = \frac{v_{m a x}}{| | g_{t} | |_{2}} \cdot g_{t}      g r a d i e n t - b a s e d a c t i o n, g_{t} = Φ_{t a r} (s_{t})      t a r g e t g r a d i e n t

(5)

This approach is similar to the residual policy learning proposed by Silver et al. (2018). The $a_{t}^{g}$ serves as an initial policy, and the agent’s policy network $π_{θ}$ takes the target gradient $g_{t}$ and the current state $s_{t}$ as input and outputs $π (s_{t}, g_{t})$ to ‘polish’ the $a_{t}^{g}$ . The agent finally outputs ‘polished’ action $a_{t} = π (s_{t}, g_{t}) + a_{t}^{g}$ . In this way, the agent can not only benefit from the efficient exploration but also have less training burden as it ‘stands on the giant’s shoulder’ (i.e.the gradient-based action $a_{t}^{g}$ ).

We implement the multi-agent Soft-Actor-Critic(SAC) as the reinforcement learning algorithm backbone. Each object is regarded as an agent, and each agent can communicate with all the other agents. At each time step $t$ , the reward for the i-th agent is $r_{t} = r_{t}^{t a r} + λ * r_{t}^{i}$ , where $r_{t}^{t a r} = \sum 1 \leq k \leq t ⟨ Φ_{t a r} (s_{t}), s_{k} - s_{k - 1} ⟩$ is a centralised target likelihood reward, $λ > 0$ is a hyperparameter and $r_{t}^{i} = \sum_{j \neq i} c_{i, j}$ is a decentralised collision-aware reward where $c_{i, j}$ is a collision penalty. Specifically, when a collision is detected between object $i$ and $j$ , $c_{i, j} = - 1$ , otherwise $c_{i, j} = 0$ .

5 Experiment Setups

5.1 Environment

We design two object rearrangement tasks without explicit goal specification for evaluation: Ball Rearrangement and Room Rearrangement, where the former uses controlled environments for better numerical analysis and the latter is built on real data with implicit target distribution.

Ball Rearrangement includes three environments with increasing difficulty of rearrangement: Circling, Clustering, and the hybrid of the first two, Circling + Clustering. There are 21 balls with the same geometry bounded in a square in each environment. In Clustering and Circling+Clustering, the balls are divided equally into three colours. The goal of each environment is as follows: Circling requires all balls to form into a circle. The centre of the circle can be anywhere in the environment; Clustering requires all balls to form into three clusters by colour. The joint centre of each cluster has two choices; Circling + Clustering requires all balls to form into a circle where balls of the same type are adjacent. To slightly increase the complexity of the physical dynamics, we enlarge the lateral friction coefficient of each ball and make each ball with different physical properties, e.g., friction coefficient and mass. The target examples of each task are sampled from an explicit process. Thus, we can define a ‘pseudo likelihood’ on a given state according to the sampling program to measure the similarity between the state and the target distribution.

Room Rearrangement is built on a more realistic dataset, 3D-FRONT Fu et al. (2021). After cleaning the data, we get a dataset with 839 rooms, 756 for training(e.g.target examples), and 83 for testing. In each episode, we sample a room from the test dataset and shuffle the room with a Brownian Process at the beginning to get an initial state. The agent can assign velocities for each object in the room at each time step. The agent has to learn implicit knowledge learned from the target examples to tidy up the room. More details are in the Appendix A.

Qualitative results on planar room rearrangement tasks.
We visualised the results of each method closest to the ground truth state (the nearest target example to the initial state) in the eight rearrangement results of a given room. — Figure 4: Qualitative results on planar room rearrangement tasks. We visualised the results of each method closest to the ground truth state (the nearest target example to the initial state) in the eight rearrangement results of a given room.

5.2 Evaluation

For both baseline comparison and ablation studies, we collect a fixed number of trajectories $T = {τ_{i}}$ starting from the same initial states for each of 5 different random seeds. For each random seed, we collect 100 and $83 \times 8$ trajectories for ball and room rearrangement respectively. Then we calculate the following metrics on the trajectories, where the PL is only reported in ball rearrangement:

Pseudo-Likelihood (PL) measures the efficiency of the rearrangement process and the sample quality of the results. We specify a pseudo-likelihood function $F_{p r o x y} : S \to R^{+}$ for each environment that measures the similarity between a given state and the target examples. For each time step $t$ , we report the mean PL over trajectories $E_{τ \sim T} [F_{p r o x y} (s_{t})]$ and the confidence interval over 5 random seeds. We do not report PL on room rearrangement as it is hard to program the human preferences.

Coverage Score (CS): measures the diversity and fidelity of the rearrangement results (i.e.the terminal states $S_{T} = {s_{T}}$ of the trajectories) by calculating the Minimal-Matching-Distance(MMD) Achlioptas et al. (2018) between $S_{T}$ and a fixed set of examples $S_{g t}$ from $p_{t a r}$ : $\sum s_{g t} \in S_{g t} min s_{T} \in S_{T} | | s_{g t} - s_{T} | |$ .

Averaged Collision Number (ACN) reflects the efficiency of the rearrangement process since object collision will lead to object blocking and deceleration. ACN is calculated as $\sum_{τ \in T} \sum_{s_{t} \in τ} c_{t}$ , where $c_{t}$ denotes the total collision number at time step $t$ . More details are in the Appendix D.

Comparative results of all the methods and ablated versions of our method in ball rearrangement.
From top to down, we report pseudo-likelihood (PL), coverage score (CS), and averaged collision number (ACN).
The PL curves of each task are normalised by the oracle’s performance of this task.
For CS and ACN, we report the mean and confidence intervals for five random seeds. — Figure 5: Comparative results of all the methods and ablated versions of our method in ball rearrangement. From top to down, we report pseudo-likelihood (PL), coverage score (CS), and averaged collision number (ACN). The PL curves of each task are normalised by the oracle’s performance of this task. For CS and ACN, we report the mean and confidence intervals for five random seeds.

5.3 Baselines

We compare our framework (Ours(SAC) and Ours(ORCA)) with planning-based baselines under the model-based setting and with learning-based baselines under the model-free setting. Besides, we design heuristic baselines serving as the upper bounds for ball rearrangement tasks.

Leaning-based Baselines: GAIL: A classical inverse RL method that trains a discriminator as the reward function. RCE: The SOTA example-based reinforcement learning method. SQIL: A competitive example-based imitation learning method. Goal-RL: An ‘open-loop’ method that first generates a goal at the beginning of the episode via a VAE trained from the target examples and then trained to reach the goal using goal-conditioned RL.

Planning-based Baselines: Goal(ORCA): This method first generates a goal via the same VAE as in Goal(SAC). Then the agent assigns an action pointing to the goal as reference control and adjusts the preference velocity using ORCA, similar to Sec. 4.3.

Heuristic-based Baseline: Oracle: We slightly perturb the samples from the target distribution and take these samples as the rearrangement results.

6 Result Analysis

6.1 Baseline Comparison

Ball Rearrangement: We present the quantitative results on ball rearrangement tasks in the first three columns of Fig. 5. As seen from the top row of Fig. 5, the PL curves of our model-free framework(Ours(SAC)) and model-based framework (Ours(ORCA)) are significantly higher and steeper than the baselines in all three tasks, which shows the effectiveness of our method in arranging objects towards the target distribution efficiently. In the middle row of Fig. 5, Ours(ORCA) achieves the best performance across all three tasks, while Goal(SAC) outperforms Ours(SAC) on Clustering and Circling+Clustering since the goal is proposed by a generative model that can naturally generate diverse and realistic goals. From the qualitative results shown in Fig. 3, our methods also significantly outperform the baselines in realism. Our methods also achieve the lowest ACN in most cases, except that Goal(SAC) is slightly better than Ours(SAC) in Circling.

The Goal(SAC) is the strongest baseline yet loses to our framework in efficiency since the generated goal may be far away from the initial state. As shown in Fig. 3, the Goal(SAC) and Goal(ORCA) can rearrange objects in the right trend, but the objects may block each other’s way. Besides, in the Appendix E.4, we show the VAE used in Goal(SAC) is not guaranteed to generate a legal state(e.g.the balls overlap in goal). Therefore, the balls may have unexpected behaviour during the process.

The classifier-based reward learning method, GAIL, RCE, and SQIL, basically fails in most tasks, as shown in Fig. 3 and Fig. 5. This is due to the over-exploitation problem that commonly exists in classifier-based reward learning methods. We defer the experiments to Appendix E.2.

Room Rearrangement: The dynamics of the room rearrangement task are too complex to apply explicit planning-based methods, so we only compare our model-free framework with learning-based baselines. As shown in Fig. 4 and the last column of Fig. 5, our method significantly outperforms other baselines in sample realism of rearrangement results and the safety of the rearrangement process, which proves our method can also work on a more complex task.

6.2 Ablation Studies and Analysis

We conduct ablation studies on ball rearrangement tasks to investigate: 1) The effectiveness of the exploration guidance provided from the target gradient field and 2) The necessity of combining the target gradient field with control algorithms to adapt to the environment dynamics. To this end, we construct an ablated version named Ours w/o Residual that drops the residual policy learning and another one named Ours w/o RL that only takes gradient-based action $a^{g}$ as policy.

Ours w/o Residual: The CS-bar in Fig. 5 shows that Ours(SAC) significantly outperforms the Ours w/o Residual in CS, which means the agent is easy to over-exploit few modes with high density. We also show qualitative results in Appendix E.1 that without the residual policy learning, the objects are usually arranged into a single pattern.

Ours w/o RL: The PL in Fig. 5 shows that without the control algorithm serving as the backbone, the rearrangement efficiency of Ours w/o RL is significantly lower than Ours(SAC) and Ours(ORCA). Meanwhile, the ACN of Ours w/o RL is significantly larger than Ours(SAC) and Ours(ORCA), which indicates the gradient-based action may cause collisions which hurts the efficiency.

Figure 6: Generalisation Anaslysis: Each policy is trained with $3 \times 7$ balls and is zero-shot transferred to environments with different ball numbers.

Generalisation: We further evaluate the generalisation of our methods to different numbers of objects. To be specific, we train all the policies in the ball environment with $3 \times 7 = 21$ balls. During the testing phase, we directly transfer each policy to environments with increased ball number, i.e., $3 \times 8$ , $3 \times 9$ , and $3 \times 10$ , without any fine-tuning. As shown in Fig. 6, we report the averaged PL increment from initial to target states of each policy in each configuration. The results indicate that our framework generalises well to different numbers of balls with the least performance drop compared with baselines in all three ball rearrangement tasks.

7 Conclusion

In this study, we first analyse the challenges of object rearrangement without explicit goal specification and then incorporate a target gradient field with either model-based planning or model-free learning approaches to achieve this task. Experiments demonstrate our effectiveness by comparisons with learning-based and planning-based baselines.

Limitation and Future Work. This work only considers the 2-D state-based rearrangement with simplified dynamics where the agent can directly control the velocities of objects. Future works may extend our framework to more realistic scenarios: 1) Visual Observation: We may explore how to conduct efficient and accurate score-based reward estimation from image-based examples and leverage the recent progress on visual state representation learning Linsley et al. (2021); Kipf et al. (2019); Lin et al. (2020) to develop the visual gradient-based action. 2) Complex Dynamics: When the agent can only move one object at a time, we may explore hierarchical methods where high-level policy chooses which object to move and low-level policy leverage the gradient-based action for motion planning. When we can only apply force to objects instead of velocity, we may explore a hierarchical policy where a high-level policy outputs expected velocities and a low-level employ a velocity controller, such as PID controller, to minimises the velocity errors.

Ethics Statement. Our method has the potential to build home-assistant robots. We evaluate our method in simulated environments, which may introduce data bias. However, similar studies also have such general concerns. We do not see any possible major harm of our study.

Acknowledgment

This work was supported by National Natural Science Foundation of China —Youth Science Fund (No.62006006): Learning Visual Prediction of Interactive Physical Scenes using Unlabelled Videos. Fangwei Zhong was supported by China National Post- doctoral Program for Innovative Talents (Grant No. BX2021008). We also thank Tianhao Wu and Yali Du for insightful discussions.

References

N. Abdo, C. Stachniss, L. Spinello, and W. Burgard (2015) Robot, organize my shelves! tidying up objects by predicting user preferences. In 2015 IEEE international conference on robotics and automation (ICRA), pp. 1557–1564. Cited by: §1, §2.1.
N. Abdo, C. Stachniss, L. Spinello, and W. Burgard (2016) Organizing objects by predicting user preferences through collaborative filtering. The International Journal of Robotics Research 35 (13), pp. 1587–1608. Cited by: §2.1.
P. Achlioptas, O. Diamanti, I. Mitliagkas, and L. Guibas (2018) Learning representations and generative models for 3d point clouds. In International conference on machine learning, pp. 40–49. Cited by: §5.2.
D. Batra, A. X. Chang, S. Chernova, A. J. Davison, J. Deng, V. Koltun, S. Levine, J. Malik, I. Mordatch, R. Mottaghi, et al. (2020) Rearrangement: a challenge for embodied ai. arXiv preprint arXiv:2011.01975. Cited by: §2.1.
J. v. d. Berg, S. J. Guy, M. Lin, and D. Manocha (2011) Reciprocal n-body collision avoidance. In Robotics research, pp. 3–19. Cited by: §1.
E. Coumans and Y. Bai (2016) PyBullet, a python module for physics simulation for games, robotics and machine learning. Note: http://pybullet.org Cited by: §A.2.
B. Eysenbach, S. Levine, and R. Salakhutdinov (2021) Replacing rewards with examples: example-based policy search via recursive classification. arXiv preprint arXiv:2103.12656. Cited by: §C.2, §1, §1, §2.2, §3.
M. Fisher, D. Ritchie, M. Savva, T. Funkhouser, and P. Hanrahan (2012) Example-based synthesis of 3d object arrangements. ACM Transactions on Graphics (TOG) 31 (6), pp. 1–11. Cited by: §2.1.
H. Fu, B. Cai, L. Gao, L. Zhang, J. Wang, C. Li, Q. Zeng, C. Sun, R. Jia, B. Zhao, et al. (2021) 3d-front: 3d furnished rooms with layouts and semantics. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10933–10942. Cited by: §A.2, §5.1.
J. Fu, K. Luo, and S. Levine (2017) Learning robust rewards with adversarial inverse reinforcement learning. arXiv preprint arXiv:1710.11248. Cited by: §2.2.
J. Fu, A. Singh, D. Ghosh, L. Yang, and S. Levine (2018) Variational inverse control with events: a general framework for data-driven reward definition. arXiv preprint arXiv:1805.11686. Cited by: §1, §2.2.
A. Handa, V. Pătrăucean, S. Stent, and R. Cipolla (2016) Scenenet: an annotated model generator for indoor scene understanding. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 5737–5743. Cited by: §2.1.
J. Ho and S. Ermon (2016) Generative adversarial imitation learning. Advances in neural information processing systems 29, pp. 4565–4573. Cited by: §1, §2.2.
Y. Jiang, M. Lim, and A. Saxena (2012) Learning object arrangements in 3d scenes using human context. arXiv preprint arXiv:1206.6462. Cited by: §1.
Y. Kant, A. Ramachandran, S. Yenamandra, I. Gilitschenski, D. Batra, A. Szot, and H. Agrawal (2022) Housekeep: tidying virtual households using commonsense reasoning. arXiv preprint arXiv:2205.10712. Cited by: §2.1.
I. Kapelyukh and E. Johns (2021) My house, my rules: learning tidying preferences with graph neural networks. In 5th Annual Conference on Robot Learning, Cited by: §1, §2.1.
J. Kim, H. Choi, G. Hwang, K. Kim, Y. Hong, and Y. Han (2020) Sortation control using multi-agent deep reinforcement learning in n-grid sortation system. Sensors 20 (12), pp. 3401. Cited by: §1.
T. Kipf, E. van der Pol, and M. Welling (2019) Contrastive learning of structured world models. arXiv preprint arXiv:1911.12247. Cited by: §7.
B. R. Kiran, I. Sobh, V. Talpaert, P. Mannion, A. A. Al Sallab, S. Yogamani, and P. Pérez (2021) Deep reinforcement learning for autonomous driving: a survey. IEEE Transactions on Intelligent Transportation Systems. Cited by: §1.
I. Kostrikov, K. K. Agrawal, D. Dwibedi, S. Levine, and J. Tompson (2018) Discriminator-actor-critic: addressing sample inefficiency and reward bias in adversarial imitation learning. arXiv preprint arXiv:1809.02925. Cited by: §2.2.
Z. Lin, Y. Wu, S. Peri, B. Fu, J. Jiang, and S. Ahn (2020) Improving generative imagination in object-centric world models. In International Conference on Machine Learning, pp. 6140–6149. Cited by: §7.
D. Linsley, G. Malik, J. Kim, L. N. Govindarajan, E. Mingolla, and T. Serre (2021) Tracking without re-recognition in humans and machines. Advances in Neural Information Processing Systems 34, pp. 19473–19486. Cited by: §7.
M. Liu, T. He, M. Xu, and W. Zhang (2020) Energy-based imitation learning. arXiv preprint arXiv:2004.09395. Cited by: §1, §2.2.
S. Luo and W. Hu (2021) Score-based point cloud denoising. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4583–4592. Cited by: §1, §4.1.
W. Luo, P. Sun, F. Zhong, W. Liu, T. Zhang, and Y. Wang (2019) End-to-end active object tracking and its real-world deployment via reinforcement learning. IEEE transactions on pattern analysis and machine intelligence 42 (6), pp. 1317–1332. Cited by: §1.
C. Meng, Y. He, Y. Song, J. Song, J. Wu, J. Zhu, and S. Ermon (2021) SDEdit: guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, Cited by: §1, §4.1.
V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. nature 518 (7540), pp. 529–533. Cited by: §1.
A. Netanyahu, T. Shu, J. Tenenbaum, and P. Agrawal (2022) Discovering generalizable spatial goal representations via graph-based active reward learning. In International Conference on Machine Learning, pp. 16480–16495. Cited by: §2.1.
D. A. Pomerleau (1991) Efficient training of artificial neural networks for autonomous navigation. Neural computation 3 (1), pp. 88–97. Cited by: §2.2.
S. Qi, Y. Zhu, S. Huang, C. Jiang, and S. Zhu (2018) Human-centric indoor scene synthesis using stochastic grammar. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5899–5908. Cited by: §2.1.
S. Reddy, A. D. Dragan, and S. Levine (2019) Sqil: imitation learning via reinforcement learning with sparse rewards. arXiv preprint arXiv:1905.11108. Cited by: §1, §1, §2.2.
H. N. Ricciuti (1965) Object grouping and selective ordering behavior in infants 12 to 24 months old. Merrill-Palmer Quarterly of Behavior and Development 11 (2), pp. 129–148. Cited by: §1.
S. Ross and D. Bagnell (2010) Efficient reductions for imitation learning. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 661–668. Cited by: §2.2.
S. Ross, G. Gordon, and D. Bagnell (2011) A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 627–635. Cited by: §2.2.
G. Sarch, Z. Fang, A. W. Harley, P. Schydlo, M. J. Tarr, S. Gupta, and K. Fragkiadaki (2022) TIDEE: tidying up novel rooms using visuo-semantic commonsense priors. arXiv preprint arXiv:2207.10761. Cited by: §2.1.
M. J. Schuster, D. Jain, M. Tenorth, and M. Beetz (2012) Learning organizational principles in human environments. In 2012 IEEE International Conference on Robotics and Automation, pp. 3867–3874. Cited by: §2.1.
B. Shen, F. Xia, C. Li, R. Martín-Martín, L. Fan, G. Wang, C. Pérez-D’Arpino, S. Buch, S. Srivastava, L. P. Tchapmi, M. E. Tchapmi, K. Vainio, J. Wong, L. Fei-Fei, and S. Savarese (2021) IGibson 1.0: a simulation environment for interactive tasks in large realistic scenes. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. accepted. Cited by: §A.2.
C. Shi, S. Luo, M. Xu, and J. Tang (2021) Learning gradient fields for molecular conformation generation. arXiv preprint arXiv:2105.03902. Cited by: §1, §4.1.
T. Silver, K. Allen, J. Tenenbaum, and L. Kaelbling (2018) Residual policy learning. arXiv preprint arXiv:1812.06298. Cited by: §1, §4.1, §4.4.
Y. Song and S. Ermon (2019) Generative modeling by estimating gradients of the data distribution. arXiv preprint arXiv:1907.05600. Cited by: §1, §3, §3.
Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole (2020) Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456. Cited by: §B.1, §1, §4.1, §4.2.
P. Vincent (2011) A connection between score matching and denoising autoencoders. Neural computation 23 (7), pp. 1661–1674. Cited by: §4.2.
H. Wang, W. Liang, and L. Yu (2020) Scene mover: automatic move planning for scene arrangement by deep reinforcement learning. ACM Transactions on Graphics (TOG) 39 (6), pp. 1–15. Cited by: §1.
L. Weihs, M. Deitke, A. Kembhavi, and R. Mottaghi (2021) Visual room rearrangement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5922–5931. Cited by: §2.1.
Y. Yeh, L. Yang, M. Watson, N. D. Goodman, and P. Hanrahan (2012) Synthesizing open worlds with constraints using locally annealed reversible jump mcmc. ACM Transactions on Graphics (TOG) 31 (4), pp. 1–11. Cited by: §2.1.
L. F. Yu, S. K. Yeung, C. K. Tang, D. Terzopoulos, T. F. Chan, and S. J. Osher (2011) Make it home: automatic optimization of furniture arrangement. ACM Transactions on Graphics (TOG)-Proceedings of ACM SIGGRAPH 2011, v. 30,(4), July 2011, article no. 86 30 (4). Cited by: §2.1.
T. Yu, D. Quillen, Z. He, R. Julian, K. Hausman, C. Finn, and S. Levine (2020) Meta-world: a benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on Robot Learning, pp. 1094–1100. Cited by: §1.
Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi (2017) Target-driven visual navigation in indoor scenes using deep reinforcement learning. In 2017 IEEE international conference on robotics and automation (ICRA), pp. 3357–3364. Cited by: §1.
B. D. Ziebart, A. L. Maas, J. A. Bagnell, A. K. Dey, et al. (2008) Maximum entropy inverse reinforcement learning.. In Aaai, Vol. 8, pp. 1433–1438. Cited by: §2.2.

\UseRawInputEncoding

Appendix A Details of Rearrangement Tasks

a.1 Ball Rearrangement

The three tasks in ball rearrangement only differ in state representation, pseudo likelihood function and the target distribution:

Circling: The state is represented as a full-connected graph where each node contains the two-dimensional position of each ball $s^{i} = [x, y]$ . The target examples are drawn from an explicit process with an intractable density function: We first uniformly sample a legal centre $o = [x_{o}, y_{o}]$ that allows all balls to place in a circle around this centre without overlap. Then we uniformly sample a radius $r_{o}$ from the legal radius range based on the sampled centre. After sampling the centre and radius, we uniformly place all balls in a circle centred in $o$ with radius $r_{o}$ . The pseudo likelihood function is defined as $F_{p r o x y} (s) = {exp}^{- (σ_{θ} + σ_{r})}$ , where $σ_{θ}$ and $σ_{r}$ denote the standard deviation of the angle between two adjacent balls and the distances from each ball to the centre of gravity of all balls, respectively. Intuitively, if a set of balls are arranged into a circle, then the $σ_{r}$ and $σ_{θ}$ should be close to zero, achieving higher pseudo likelihood.

Clustering: The state is represented as a full-connected graph where each node contains the two-dimensional position of each ball $s^{i} = [x, y]$ and the one-dimensional category feature of each ball $c^{i} \in {0, 1, 2}$ . We generate the target examples in two stages: First, we sample the positions of each ball from a Gaussian Mixture Model(GMM) $p_{G M M} : R^{K * 2} \to R^{+}$ with two modes:

$p_{G M M} (s) = \frac{1}{2}$	$\prod 1 \leq i \leq \frac{K}{3} N ((0.18 sin (\frac{0}{3} π), 0.18 cos (\frac{0}{3} π)), 0.05 I) (s^{i})$	(6)
	$\prod \frac{K}{3} \leq i \leq \frac{2 K}{3} N ((0.18 sin (\frac{1}{3} π), 0.18 cos (\frac{1}{3} π)), 0.05 I) (s^{i})$
	$\prod \frac{2 K}{3} \leq i \leq K N ((0.18 sin (\frac{2}{3} π), 0.18 cos (\frac{2}{3} π)), 0.05 I) (s^{i})$
$+ \frac{1}{2}$	$\prod 1 \leq i \leq \frac{K}{3} N ((0.18 sin (\frac{0}{3} π), 0.18 cos (\frac{0}{3} π)), 0.05 I) (s^{i})$
	$\prod \frac{K}{3} \leq i \leq \frac{2 K}{3} N ((0.18 sin (\frac{2}{3} π), 0.18 cos (\frac{2}{3} π)), 0.05 I) (s^{i})$
	$\prod \frac{2 K}{3} \leq i \leq K N ((0.18 sin (\frac{1}{3} π), 0.18 cos (\frac{1}{3} π)), 0.05 I) (s^{i})$

where $K = 21$ denotes the total number of balls. In the second stage, we slightly adjust the positions sampled from the GMM to eliminate overlaps between these positions by stepping the physical simulation. Since the results mainly depend on the GMM, we take the density function of GMM as the pseudo likelihood function $F_{p r o x y} (s) = p_{G M M} (s)$

Circling+Clustering: The state representation is the same as that of Clustering. To generate the target examples, we first generate a circle using the same process in Circling. Then, starting with any ball in the circle, we colour the ball three colours in turn, $\frac{K}{3}$ for each. The order is randomly selected from red-yellow-blue and red-blue-yellow. This sampling is also explicit yet with an intractable density similar to Circling. The pseudo likelihood function is defined as $F_{p r o x y} (s) = {exp}^{- (σ_{θ} + σ_{r})} \cdot {exp}^{- (σ_{R} + σ_{G} + σ_{B}) - σ_{C}}$ where $σ_{R}$ denotes the standard deviation of the angle between two adjacent red balls, and $σ_{G}$ and $σ_{B}$ and $σ_{C}$ denotes the standard deviation of the positions of red, green and blue centres. Intuitively, the first term ${exp}^{- (σ_{θ} + σ_{r})}$ measures the pseudo likelihood of balls forming a circle and the next term ${exp}^{- (σ_{R} + σ_{G} + σ_{B}) + σ_{C}}$ measures the pseudo likelihood of balls being clustered into three piles.

The common settings shared by each task are summarised as follows:

Horizon: Each training episode contains 100 steps.

Initial Distribution: We first uniformly sample rough locations for each ball, then we eliminate overlaps between these positions by stepping the physical simulation.

Dynamics: The floor and wall are all absolutely smooth planes. All the balls are bounded in an 0.3m x 0.3m area, with the radius being 0.025m. We set the friction coefficients of all balls to 100.0 since we observe that setting a small(e.g.not larger than 1.0) friction coefficient does not significantly affect the dynamics. Besides, to increase the complexity of the dynamics, we set masses and restitution coefficients of all green and blue balls to 0.1 and 0.99, respectively. The masses and restitution coefficients of all red balls are set to be 10 and 0.1, respectively. We observe that under these dynamics, the collision may significantly harm the efficiency of the rearrangement process. Hence, the agent have to adapt to the dynamics for more efficient object rearrangement.

Target Examples: We collect 100,000 examples for each task as target examples.

a.2 Room Rearrangement

Dataset and Simulator: We clean the 3D-Front dataset [9] to obtain bedrooms that consist of four walls and three to eight objects. We augment each room by flipping two times and rotating four times to get eight augmentations. We then import these rooms into iGibson [37] to run the physical simulation. The 756x8 rooms are used for target examples that are used for training of the target score network, the classifier-based baselines and the VAE in goal-conditioned baselines. The other 83x8 rooms are used to initialise the room in the test phase, so the room rearrangement task is performed on the test dataset. The rooms in the training dataset are only used for learning the prior knowledge to arrange the room.

State and Action Spaces: The state consists of a aspect ratio $r_{a} \in R^{+}$ and an object state $s_{o} \in R^{K \times 6}$ where $K$ denotes the number of objects. The aspect ratio $r_{a} = tanh (\frac{b_{x}}{b_{y}})$ where $b_{x}$ and $b_{y}$ denotes the horizontal and vertical wall bounds. The object state is the concatenation of sub-states of all the objects $s_{o} = [s^{1}, s^{2}, . . . s^{i}, . . ., s^{K}]$ where the sub-state of the i-th object $s^{i} \in R^{6}$ is consists of 2-D position, 1-D orientation, 2-D bounding bound and a 1-D category label. The action is also a concatenation of sub-actions of all the objects $a_{o} = [a^{1}, a^{2}, . . . a^{i}, . . ., a^{K}]$ . For the i-th object, the action $a^{i} \in R^{3}$ consists of a 2-D linear and an 1-D angular velocity. The whole action space is normalised into a $3 \times K$ dimensional unit-box $[- 1, 1]^{3 \times K}$ by the velocity bounds.

Horizon: Each training episode contains 250 steps.

Initial Distribution: To guarantee the initial state is accessible to the high-density region of the target distribution, we sample an initial state in two stages: First, we sample a room from the 83x8 rooms in the test dataset. Then we perturb this room by 1000 Brownian steps.

Dynamics: For more efficient environment reset and physical simulation, we build a ‘proxy simulator’ based on PyBullet [6] to replace the iGibson simulator. We use iGibson to load and save the metadata of each room. Then we reload these rooms in the proxy simulator, where each object is replaced by a box with the same geometry. We set the friction coefficient of all the objects in the room to zero, as the dynamics of the room are complex enough.

Appendix B Details of Our Method

b.1 Training the Target gradient Field

The complete training objective is the SDE-based score-matching objective proposed by [41]:

E_{t \sim U (0, 1)} E_{s (0) \sim p_{t a r} (s)} E_{s (t) \sim p_{0 t} (s (t) ∣ s (0))} [Φ_{t a r} (s (t), t) - \nabla_{s (t)} log p_{0 t} (s (t) ∣ s (0)) ∥_{2}^{2}] .

(7)

where $p_{0 t} (s (t) ∣ s (0)) = N (s (t); s (0), \frac{1}{2 log σ} (σ^{2 t} - 1) I)$ and $σ = 25$ is a hyper-parameter.

Using this objective, we can obtain the estimated score w.r.t. different levels of the noise-perturbed target distribution $p_{t a r}^{t} (s (t)) = \int p_{0 t} (s (t) ∣ s (0)) p_{t a r} (s (0)) d s (0)$ simultaneously. In this way, we can assign different noise-level $t$ for different tasks conveniently. The choice of $t$ will be described in sub-section. B.4

b.2 Details of ORCA and Planning-based Framework

We set $τ = 0.1$ and the simulation duration of each timestep $Δ t = 0.02$ . For each agent(object), ORCA only considers the 2-nearest agents as neighbours, since in our experiments ORCA often has no solution when the number of neighbours is larger than 2.

In all ball rearrangement tasks, we choose $t = 0.1$ as the initial noise scale for the target score network. The noise level decays to 0 within an episode.

b.3 Complete Derivation of Surrogate Objective

$J (π) ⟺$	$E_{ρ (s_{0}), τ \sim π} [\sum s_{t} \in τ γ^{t} log p_{t a r} (s_{t})] - \frac{E_{ρ (s_{0})} [log p_{t a r} (s_{0})]}{1 - γ}      c o n s t a n t$	(8)
$=$	$E_{ρ (s_{0}), τ \sim π} [\sum s_{t} \in τ γ^{t} log p_{t a r} (s_{t})] - E_{ρ (s_{0}), τ \sim π} [\sum s_{t} \in τ γ^{t} log p_{t a r} (s_{0})]$
$=$	$E_{ρ (s_{0}), τ \sim π} [\sum s_{t} \in τ γ^{t} [log p_{t a r} (s_{t}) - log p_{t a r} (s_{0})]]$
$=$	$E_{ρ (s_{0}), τ \sim π} [\sum 1 \leq t \leq T γ^{t} \sum 1 \leq k \leq t [log p_{t a r} (s_{k}) - log p_{t a r} (s_{k - 1})]] d e f = J^{*} (π)$
$J^{*} (π) \approx$	$E_{ρ (s_{0}), τ \sim π} [\sum 1 \leq t \leq T γ^{t} \sum 1 \leq k \leq t ⟨ \nabla_{s} log p (s_{k - 1}), s_{k} - s_{k - 1} ⟩]$
$\approx$	$E_{ρ (s_{0}), τ \sim π} [\sum 1 \leq t \leq T γ^{t} \sum 1 \leq k \leq t ⟨ Φ_{t a r} (s_{k - 1}), s_{k} - s_{k - 1} ⟩      r_{t - 1}] d e f =^J (π)$

b.4 Details of Learning-based Framework

For all the ball rearrangement tasks, we choose $t = 0.1$ for the target score network when outputting the gradient-based action $a_{t}^{g} = G (Φ_{t a r} (s_{t}, 0.1))$ and $t = 0.01$ when estimating the immediate reward . At each time step $t$ , each object also receives a collision penalty $c_{t}^{i}$ . So the total reward for the i-th object at timestep $t$ is $r_{t}^{i} = ⟨ Φ_{t a r} (s_{t}, 0.01), s_{t + 1} - s_{t} ⟩ + λ * c_{t}^{i}$ where $λ$ denotes a hyper-parameter to balance the immediate reward and the collision penalty. We choose $λ = 5$ for Clustering and Circling+Clustering, $λ = 3$ for Circling and $λ = 0.2$ for room rearrangement. We conduct reward normalisation for both immediate reward and the collision penalty, which maintains a running mean $μ_{t}$ and a standard deviation $σ_{t}$ of a given reward sequence and returns a z-score $z_{t} = \frac{r_{t} - μ_{t}}{σ_{t}}$ as the normalised reward.

For room rearrangement, we choose $t = 0.01$ for both outputting the gradient-based action and estimating the immediate reward.

Appendix C Details of Baselines

c.1 Goal-based Baselines

These baselines refer to the Goal-SAC and Goal-ORCA in experiments.

Goal Proposal: This type of baselines first train a VAE on the target examples and then leverage the trained VAE for the goal proposal. The VAE is implemented as a GNN, and the model capacity is similar to our target score network for a fair comparison. We choose $λ_{k l} = 0.01$ for Circling and Clustering and $λ_{k l} = 0.02$ for Circling+Clustering.

Execution: At the beginning of each episode, the agent first proposes a goal of this episode using the VAE and then reaches the goal via a control algorithm. In Goal-ORCA, the agent reaches the goal by a planning-based method: The agent first assigns velocities for each ball that points to the corresponding goal. Then these velocities are updated by the ORCA planner $ϕ_{θ}$ to be collision-free. In Goal-SAC, the agent trains a multi-agent goal-conditioned policy via goal-conditioned RL to reach the goal: Similar to Ours-SAC, the reward of the i-th object at timestep $t$ is $r_{t}^{i} = | | s_{t} - s_{g o a l} | |_{1} + λ * c_{t}^{i}$ where $s_{g o a l}$ denotes the goal proposal, $λ$ is a hyper-parameter and $c_{t}^{i} = \sum_{j \neq i} c o l_{i, j}$ is the total number of collisions between the i-th object and the others. Here $c o l_{i, j} = 1$ when We choose $λ = 3$ for ball rearrangement and $λ = 0.2$ for room rearrangement.

c.2 Classifier-based Baselines

These baselines refer to the RCE, SQIL and GAIL in experiments.

RCE and SQIL are implemented based on the codes [7] released by RCE’s authors. We only modify $γ = 0.95$ , the training steps decrease to 0.5 million (i.e., the same number of training steps as other methods) and the model architecture. The architecture of actor and critic networks is implemented the same as ours(i.e., the same feature extraction layers and edge-convolutional layers).

GAIL is actually a modification of our learning-based framework: Keeping the RL agent the same(i.e., multi-agent SAC), GAIL’s reward is given by a discriminator. The discriminator takes only one state instead of two adjacent states as input, since the ‘ground-truth’ reward(i.e., likelihood) is defined on the current state. At each training step, we update the discriminator by distinguishing between the agents’ and the expert’s states(for one step) and then update the RL policy under the reward given by the discriminator(for one step). The agent also receives a collision reward during training similar to Ours-SAC and Goal-SAC: $r_{t}^{i} = D (s_{t + 1}) + λ * c_{t}^{i}$ where $D$ denotes the classifier trained by GAIL. We do not conduct reward normalisation for GAIL as the learned reward is unstable. We choose $λ = 10$ for room rearrangement, Circling and Circling+Clustering, $λ = 100$ for Clustering.

Appendix D Details of Evaluations

d.1 Ball rearrangement

We collect 100 trajectories for each task starting from the same set of initial states. To calculate the coverage score, we sample fixed sets examples from the target distribution serving as $S_{g t}$ for Circling, Clustering, and Circling+Clustering, respectively. We sample 20 examples for for Circling and Circling+Clustering and 50 examples for Clustering. Since the balls in the same category can actually be viewed as a two-dimensional point cloud, we measure the distance between two states by summing the CDs between each pile of balls by category.

d.2 Room rearrangement

For each room in 83 test rooms, we collect eight initial states. Then we collect 83x8 trajectories starting from these initial states for each method. The coverage score is calculated by averaging the coverage score in each room condition, since the state dimension differs in different rooms. For each room in 83 test rooms, we calculate the coverage score between the eight ground truth states and eight rearrangement results and then the averaged coverage score over the 83 rooms is taken as the final coverage score for a method. We measure the distance between two states by calculating the average L2-distance between the positions of the corresponding objects.

Appendix E Additional Results

e.1 Single-mode Problem of Ours w/o Residual

In Figure. 7 we show qualitative results of Ours w/o Residual. Apparently, the balls are arranged into a single pattern in all three ball rearrangement tasks, while the examples from the target distribution are diverse. In the most difficult task Circling + Clustering, the agent cannot even reach a terminal state with a high likelihood. This result indicates that Ours w/o Residual failed to explore the high-density region of target distribution without residual learning.

Figure 7: We visualise rearrangement results of Ours w/o Residual to demonstrate the ‘single pattern’ phenomenon.

e.2 Effectiveness of Reward Learning

In each ball rearrangement task, we collect 100 trajectories, each of which is run by a hybrid policy. The hybrid policy takes the first 50 steps using the Ours(ORCA) and the next 50 steps using random actions. To evaluate the effectiveness of our method on reward learning, we compare the estimated reward curve of Ours(SAC) and GAIL with the pseudo likelihood(PL) curve of the trajectories.

As shown in 8, reward curve of Ours(SAC) best fits the trend of pseudo likelihood curve, which shows the effectiveness of our reward estimation method.

Figure 8: Given a set of identical trajectories, we compare the estimated reward(ER) of different methods and pseudo likelihood(PL). All curves are normalised to the range of [- 1,1].

e.3 Efficiency Problem of Goal-conditioned Baselines

In figure. 1, we report the comparative results of our framework and goal-based baselines(e.g., Goal(SAC), Goal(ORCA)) on a new metric named absolute state change(ASC). The ASC measures the sum of the absolute paths of all small balls in the rearrangement process.

ASC = \sum 1 \leq t \leq T \sum 1 \leq k \leq K | | s_{t}^{k} - s_{t - 1}^{k} | |_{1}

(9)

As shown in figure. 1, Ours(SAC) and Ours(ORCA) are significantly better than Goal(SAC) and Goal(ORCA), respectively. This result explains why our method’s likelihood curves are better than the goal-based baselines’: The proposed goal is far away from the initial state, which harms the efficiency of goal-based approaches.

\UseRawInputEncoding

Method	Circling		Clustering		Circling+Clustering
Method	21 balls	30 balls	21 balls	30 balls	21 balls	30 balls
Ours-ORCA	23.16 +- 2.42	29.08 +- 2.43	13.72 +- 1.40	16.60 +- 1.68	19.54 +- 2.19	23.29 +- 2.18
Goal-ORCA	28.05 +- 2.82	31.98 +- 2.90	27.57 +- 2.90	33.37 +- 4.06	27.77 +- 2.86	32.47 +- 2.99
Ours-SAC	56.76 +- 4.83	78.18 +- 6.75	65.35 +- 4.64	110.06 +- 4.25	48.93 +- 4.68	80.01 +- 6.19
Goal-SAC	108.25 +- 7.53	139.26 +- 8.24	118.15 +- 5.84	142.05 +- 7.56	122.72 +- 5.93	161.01 +- 6.84

Table 1: Quantitative comparison results of our framework(Ours(ORCA), Ours(SAC)) and goal-based baselines(Goal(ORCA), Goal(SAC)) in rearrangement efficiency. For each method, we report the mean and standard deviation of absolute state change over 100 episodes on each ball rearrangement task.

e.4 Visualisations of Goals Proposed by the VAE

We demonstrate the visualisations of goal proposals by the VAE used in goal-based baselines. Typically, the proposed goals indeed form a reasonable shape that is similar to the target examples. However, it is hard to generate a fully legal goal since the VAE is not accessible to the dynamics of the environment or has enough data to infer the physical constraints of the environment. As shown in figure. 9, there exists many overlaps between balls in generated goals, which causes the balls to have conflicting goals and thus harms the efficiency of goal-based baselines.

Figure 9: We visualise the goals proposed by the VAE used in Goal(ORCA) and Goal(SAC).

e.5 Six-modes Clustering

Task Settings: This task is an six-modes extension of Clustering, where the centres of clusters are located in the following six pattern.

\UseRawInputEncoding

Locations	Mode1	Mode2	Mode3	Mode4	Mode5	Mode6
$(0.18 cos (\frac{2 π}{3}), 0.18 sin (\frac{2 π}{3}))$	R	R	B	B	G	G
$(0.18 cos (\frac{4 π}{3}), 0.18 sin (\frac{4 π}{3}))$	G	B	G	R	B	R
$(0.18 cos (2 π), 0.18 sin (2 π))$	B	G	R	G	R	B

Defining the joint centres’ positions as a latent variable $C = (C_{r}, C_{g}, C_{b})$ where $C_{r}$ , $C_{g}$ and $C_{b}$ denotes centres of red, green and blue balls respectively and the above six modes as ${c_{i}}_{1 \leq i \leq 6}$ , the $C$ obeys a categorical distribution $p (C = c_{i}) = \frac{1}{6}$ .

The target distribution of six-modes clustering is a Gaussian Mixture Model:

p_{G M M} (s) = \sum 1 \leq k \leq 6 p (C = c_{k}) \prod 1 \leq i \leq \frac{K}{3} N (C_{r}, 0.05 I) (s^{i}) \prod \frac{K}{3} \leq i \leq \frac{2 K}{3} N (C_{g}, 0.05 I) (s^{i}) \prod \frac{2 K}{3} \leq i \leq K N (C_{b}, 0.05 I) (s^{i})

(10)

Notably, the ‘mean mode’ of above six modes is the origin, i.e., $\frac{1}{6} \sum_{i = 1}^{6} c_{i} = ((0, 0), (0, 0), (0, 0))$ . If the policy arranges the balls into this mean pattern, then the balls should be centred around $(0, 0)$ (i.e., all positions of balls obey $N (0, 0.05 I)$ ). As shown in shown in Figure 10 (a), we illustrate an example for each mode.

Implementation Details: We evaluate Ours(SAC) and Ours(ORCA) on this task. The performances are normalised by the Oracle’s.

Results: We report the pseudo likelihood curves of our methods. As shown in Figure 10 (b), the rearrangement results is close to ground truth examples from the target distribution.

Figure 10: Pseudo likelihood curves and qualitative results of six-modes clustering.

We further compute the latent distributions for rearrangement results of Ours(SAC) using Bayesian Theorem:

(11)

$p (C = c_{i} | s)$ indicates which mode a state belongs to. To demonstrate our rearrangement results is close to one of the six modes instead of concentrating on a ‘mean’ pattern, we evaluate the average entropy of the latent distributions $E [H (p (C | s))]$ . As shown in Figure E.5, the average entropy of our methods are even lower than ground truths’. This indicates each state in our rearrangement results distinctly belonging to one of categories.

We also report the average latent distribution $E [p (C | s)]$ . As shown in Figure E.5, the averaged latent distribution (overall rearrangement results) of our methods achieve comparable orders of magnitude in different modes. This shows the rearrangement of our methods can cover all the mode centres.

\UseRawInputEncoding

Methods	Average Entropy $↓$	Mode1	Mode2	Mode3	Mode4	Mode5	Mode6
GT	0.0066	0.17	0.17	0.17	0.17	0.17	0.17
Ours(SAC)	0.0037	0.15	0.18	0.31	0.14	0.10	0.13
Ours(ORCA)	0.0056	0.17	0.17	0.16	0.11	0.19	0.19

e.6 Move One-ball at a Time

Task Settings: This task is an extension of Clustering where the policy can only move one ball at a time. We increase the horizon of each episode from 100 to 300. At each time step, the agent can choose one ball to take a velocity-based action for 0.1 second. Implementation Details: We design a bi-level approach based on our method as shown in Figure 11 (b): Every 20 time steps, the high-level planner outputs an object index $i_{t}$ with the largest target-gradient’s component:

i_{t} = argmax i | | g_{t}^{i} | |_{2}

(12)

where $g_{t}^{i} \in R^{2}$ denotes the component of the target gradient on the i-th object. In the following 20 steps, the ORCA planner computes the target velocity according to $g_{t}^{i}$ and masks all the other object’s velocity to zero.

We compare our method with a goal-based baseline where the agent first generates goals for each object via the VAE used in Goal(ORCA). Then the high-level planner choose the object with the farthest distance to goal as $i_{t}$ . The low-level planner of this baseline is same as ours.

Results: As shown in Figure 6 (a), (c), our method achieves more appealing results, better efficiency and performance compared with goal-based baseline.

Figure 11: (a): Quantitative results. (b): Illustration of our bi-level policy. (c): Qualitative results.

e.7 Force-based Dynamics

Task Settings: This task is an extension of Circling + Clustering where the dynamics is based on force. At each time step, the agent can assign a two dimensional force $f_{i} \in R^{2}$ on each object.

Implementation Details: Similar to ‘one at a time’ experiment, we design a bi-level approach to tackle this task, as Figure 7.(b) illustrates. The high-level policy outputs a target velocity every 8 steps. In the following 8 steps after the target velocities are outputted, the low-level PID-controller receives the target velocity and outputs the force-based action to minimise the velocity error. We set $K_{P} = 10.0, K_{I} = 0.0, K_{D} = 0.0$ for PID-controller.

This policy is compared with Ours(SAC) in the main paper.

Results: As shown in Figure 12. (a) and (c), this force-based policy achieves comparable performance with Ours(SAC) yet suffers from slight efficiency drop due to the control error of PID.

e.8 Image-based Reward Learning

Figure 13: Quantitative results of Ours(Image), Ours(State) and Goal(State) on image-based reward learning experiment.

Task Settings: In this task, the target score network is trained on image-based target examples. We simply render the state-based target example set used in Ours(SAC) to 64x64x3 images and

Implementation Details: The target score network of Ours(Image) is trained on the image-based examples set. The policy network and gradient-based action is state-based, since we focus on the image-based reward learning instead of visual-based policy learning.

This approach is compared with Ous(SAC) and Goal(SAC) in the main paper.

Results: Results in Figure 13 show that Ours(Image) achieves slightly better performance than Goal(SAC) yet is lower than Ours(SAC) due to the increment of the dimension. This indicates that our reward learning method is still effective in image-based setting.