The Magni Human Motion Dataset: Accurate, Complex,
Multi-Modal, Natural, Semantically-Rich and Contextualized

Tim Schreiter, Tiago Rodrigues de Almeida, Yufei Zhu, Eduardo Gutierrez Maestro,
Lucas Morillo-Mendez, Andrey Rudenko, Tomasz P. Kucner, Oscar Martinez Mozos,
Martin Magnusson, Luigi Palmieri, Kai O. Arras, and Achim J. Lilienthal
Örebro University, Sweden {tim.schreiter, tiago.almeida, eduardo.gutierrez-maestro, yufei.zhu, lucas.morillo, oscar.mozos, martin.magnusson, achim.lilienthal}@oru.se
Robert Bosch GmbH, Corporate Research, Stuttgart, Germany {andrey.rudenko, luigi.palmieri, kaioliver.arras}@de.bosch.com
Mobile Robotics Group, Department of Electrical Engineering and Automation, Aalto University, Finland tomasz.kucner@aalto.fi
This work was supported by the European Union’s Horizon 2020 research and innovation program under grant agreement No. 101017274 (DARKO) and the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation.

Abstract

Rapid development of social robots stimulates active research in human motion modeling, interpretation and prediction, proactive collision avoidance, human-robot interaction and co-habitation in shared spaces. Modern approaches to this end require high-quality datasets of human motion trajectories for training and evaluation. However, the majority of available datasets suffer from either inaccurate tracking data or unnatural, scripted behavior of the tracked people. This paper attempts to fill this gap by providing high-quality tracking information from motion capture, eye-gaze trackers and on-board robot sensors in a semantically-rich environment. To elicit natural behavior from the recorded participants, we use loosely scripted task assignment, which leads the participants to navigate through the dynamic laboratory environment in a natural and purposeful way towards randomly assigned targets. The motion dataset presented in this paper sets a high quality standard: the realistic and accurate data is enhanced with semantic information, enabling the development of new algorithms that rely not only on tracking information but also on contextual cues of the moving agents and of the static and dynamic environment.

I Introduction

In recent years, research on human motion prediction and human-robot interaction has been growing rapidly, driven by human-aware robotics research and industry interest. Most approaches require plentiful motion data, recorded in diverse environments and settings, for both training and evaluation [12]. Among the growing number of human trajectory datasets, most focus on capturing interactions between moving agents in indoor [3], outdoor [10] and automated driving [2] settings. These datasets are designed to study the geometric and velocity aspects of human motion.

Human motion is influenced by many contextual cues, which include semantic attributes of the static and dynamic environment, the space topology and its activity patterns, and the social roles, relations and preferences of the target agents. Studies of these contextual aspects of human motion are gaining traction, creating the need for new datasets containing relevant cues.

Fig. 1: Laboratory room layout including the floor markings (1) in Scenario 1B. The environment contains various static obstacles, including a narrow corridor (2) on the right with entry limited by a no-entry sign. The table displays the motion capture helmets (3).

In this work, we follow on and further develop the THÖR protocol for human motion data collection introduced in [11]. There we proposed a weakly-scripted indoor scenario for generating diverse, natural, and goal-driven human motion in crowded social spaces with static obstacles and a moving robot. The THÖR dataset (http://thor.oru.se/), recorded according to the proposed procedure, includes 9 participants, moving alone and in groups, whose positions and head orientations are tracked with a motion capture system (https://www.qualisys.com/). The THÖR dataset also includes first-person gaze information for a subset of participants. To diversify the recorded motion patterns, participants in THÖR move between fixed goal positions in the environment, receiving at each goal a random card with the next target. The recording features over 60 minutes of motion and over 600 individual and group trajectories. THÖR is gaining attention in the scientific community, for instance in robotics [16, 15] and predictive motion modeling [17], and serves as a building block for the Atlas motion prediction benchmark [14].

In this paper, we extend THÖR in many aspects. The new recording, which we call Magni, includes 160 minutes of motion on 4 acquisition days with a total of 30 unique participants. In addition to the static obstacles in the room, we augment the environment with semantic context, such as one-way passages and yellow tape markings for areas of caution.

The introduction of semantic context further enriches the recorded data. Moreover, capturing semantic features enables explainability of motion flow models [8] and supports downstream tasks that require semantics [13]. To further diversify the recorded motion patterns, we complement the cards indicating the participants’ next motion goal with remote instructions given via voice command (using Discord [5]). Beyond the gaze directions in the 2D eye-tracker image plane, we also provide 3D gaze vectors in the environment reference frame. Alongside the motion capture and eye-gaze data, we record on-board robot sensors (LiDAR, RGB fish-eye, and RGB-D cameras). Lastly, we propose two variations in the teleoperated robot motion, namely the “differential drive” and “omnidirectional” motion, which enable the study of human-robot collision avoidance under varying conditions.

This paper presents the data collection procedure, describes sensors, scenarios, and the participants’ priming (Sec. II), as well as highlights a portion of the recorded data (Sec. III). We will make the full dataset available in the near future. Once the post-processing is complete, we will systematically describe the recorded data and analyze its application in HRI research.

II Data Collection

II-A Room Setup

The room for data collection is the robot lab at Örebro University, the same as in the THÖR dataset [11]. This creates continuity between the recordings while allowing the study of human motion in the presence of varying contextual factors and obstacle layouts. Fig. 2 depicts the room layout. Seven goal positions are placed to drive purposeful navigation through the room, generating frequent interactions between groups in the center. Several static obstacles (robotic manipulators and tables) are placed in the room to prevent participants from walking between goals in a straight line.

Apart from static obstacles, two robots are placed in the room. One is a static robotic arm placed near the podium, as shown on the right in Fig. 2. The other one is on the left in Fig. 2: an omnidirectional mobile robot with a robotic arm on top (DARKO Robot). In some scenarios, as described in Section II-B, the mobile robot is also used for data collection. The robot base is RB-Kairos+ and the arm is the Collaborative Robot Panda from Franka Emika. The robot base dimensions are 760 × 665 × 690 mm. The maximum reach height of the robot arm is . The robot has one Ouster OS0-128 LiDAR, two Azure Kinect RGB-D cameras (one used in these recordings), two Basler fish-eye RGB cameras, and two Sick MicroScan 2D safety LiDARs. The Azure Kinect camera has a 75-degree horizontal field of view and a tracking range of up to .

In Scenario 1B (see Section II-B), we add floor markings to indicate areas of caution and stop signs to indicate one-way passages. The floor markings, made of black and yellow warning tape, are placed around the mobile robot and the robot arm. Two stop signs are placed near the right permanent obstacle, indicating that the passage from right to left is blocked.

II-B Scenarios Description

We designed three scenarios for diverse data collection, which differ in the room layout, motion mode of the robot and the tasks executed by the participants. In all scenarios, we randomly divided the participants into individuals or groups of two or three people who share the navigation goal. Every group navigates towards their goal point, where it takes a random card, indicating the next goal. Each group takes one card at a time.
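The random grouping and card-drawing step can be sketched in a few lines of Python. This is only an illustration of the protocol described above, not the procedure used during the recordings: the participant IDs, the function names `assign_groups` and `draw_next_goal`, and the rule that a card never names the goal the group is currently at are all assumptions made for the example.

```python
import random


def assign_groups(participants, rng, max_size=3):
    """Randomly partition participants into groups of 1..max_size people."""
    pool = list(participants)
    rng.shuffle(pool)
    groups = []
    while pool:
        size = rng.randint(1, min(max_size, len(pool)))
        groups.append(pool[:size])
        pool = pool[size:]
    return groups


def draw_next_goal(current_goal, goals, rng):
    """Draw a random card with the next goal.

    Assumption: the card never points at the goal the group is already at.
    """
    return rng.choice([g for g in goals if g != current_goal])


rng = random.Random(0)                          # seeded for reproducibility
participants = [f"P{i}" for i in range(1, 8)]   # hypothetical IDs for 7 people
goals = list(range(1, 8))                       # the seven goal positions

groups = assign_groups(participants, rng)
start_goals = {tuple(g): rng.choice(goals) for g in groups}
next_goals = {g: draw_next_goal(cur, goals, rng) for g, cur in start_goals.items()}

for g in start_goals:
    print(g, "starts at goal", start_goals[g], "then walks to goal", next_goals[g])
```

Every participant ends up in exactly one group, and each group shares a single navigation goal, which mirrors the rule that each group takes one card at a time.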

Scenario 1 is designed as a baseline to capture “regular” social behavior of walking people in a static environment. It has two variations: 1A, which only includes static obstacles, and 1B, which additionally includes floor markings and stop signs in a one-way corridor. Scenario 1B is designed with the focus on Maps of Dynamics (MoDs) [8]. MoDs are maps that encode dynamics as a feature of the environment, containing information about the motion patterns observed in it. MoDs can provide information for planning and navigation in populated environments. Scenario 1B provides motion data affected by invisible obstacles (areas of caution) and flow-controlling signs (one-way passages).
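To make the idea of a map of dynamics concrete, the following sketch bins observed velocities into grid cells and stores a mean direction and mean speed per cell. It is a deliberately simplified stand-in and not the CLiFF-map representation of [8]; the input format (lists of `(t, x, y)` samples), the 1 m cell size and the function name `flow_map` are assumptions made for the example.

```python
import math
from collections import defaultdict


def flow_map(trajectories, cell_size=1.0):
    """Build a per-cell flow summary from 2D trajectories.

    trajectories: list of trajectories, each a list of (t, x, y) samples.
    Returns {(ix, iy): (mean_direction_rad, mean_speed, n_observations)}.
    """
    # Per cell: sum of cos(heading), sum of sin(heading), sum of speed, count.
    acc = defaultdict(lambda: [0.0, 0.0, 0.0, 0])
    for traj in trajectories:
        for (t0, x0, y0), (t1, x1, y1) in zip(traj, traj[1:]):
            dt = t1 - t0
            if dt <= 0:
                continue  # skip duplicated or out-of-order samples
            vx, vy = (x1 - x0) / dt, (y1 - y0) / dt
            speed = math.hypot(vx, vy)
            if speed == 0.0:
                continue  # a standing person has no heading
            cell = (math.floor(x0 / cell_size), math.floor(y0 / cell_size))
            a = acc[cell]
            a[0] += vx / speed
            a[1] += vy / speed
            a[2] += speed
            a[3] += 1
    # Circular mean of the headings, arithmetic mean of the speeds.
    return {
        cell: (math.atan2(s, c), total_speed / n, n)
        for cell, (c, s, total_speed, n) in acc.items()
    }


# Two toy trajectories walking in the +x direction at about 1 m/s.
demo = [
    [(0.0, 0.5, 0.5), (1.0, 1.5, 0.5), (2.0, 2.5, 0.5)],
    [(0.0, 0.2, 0.8), (1.0, 1.2, 0.8)],
]
fm = flow_map(demo)
for cell, (direction, speed, n) in sorted(fm.items()):
    print(cell, round(direction, 3), round(speed, 3), n)
```

Averaging headings through their sine and cosine avoids the wrap-around problem at ±π. A single mean per cell cannot capture multimodal flow, for example two opposing walking directions in a corridor, which is one reason richer models exist.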

Scenario 2 features the same room layout as Scenario 1A (i.e., without semantics). In addition to the basic goal-driven navigation, this scenario introduces people performing different tasks. These tasks aim to emulate regular activities performed in industrial contexts, such as transporting stacks of different objects between various goal locations. Therefore, in each recording session we assign one participant to carry small objects (i.e., a bucket), and another one to carry medium objects (i.e., a box) between two different goal points. Finally, a group of two people moves a large object (i.e., a poster stand), following instructions given over Discord [5].

Fig. 2: Room layout for Scenario 1B with the focus on Maps of Dynamics. For the other scenarios, we remove the black-yellow striped lane markings. Also, in Scenario 3, the mobile robot on the left becomes a moving obstacle.

In Scenario 3, the robot (which remained stationary in the previous scenarios, see its position on the left in Fig. 2) navigates in the room. Scenario 3 has two variations: 3A, in which the teleoperated robot moves as a regular differential drive robot, and 3B, where the robot moves in an omnidirectional way. In both cases, an operator drives the mobile robot using a remote controller.

Scenario | Description        | Mobile robot     | Duration
1A       | Baseline motion    | Static obstacle  | 8 minutes
1B       | Semantic features  | Static obstacle  | 8 minutes
2        | People with tasks  | Static obstacle  | 8 minutes
3A       | People with tasks  | Directional      | 8 minutes
3B       | People with tasks  | Omni-directional | 8 minutes
TABLE I: Short description of the conducted scenarios, the motion mode of the robot, and the duration of recordings in one day

II-C Recording Procedure and Participants’ Priming

Fig. 3: Recorded trajectories for one run in Scenario 1A (left) and Scenario 1B (right), which includes the environment semantics. In both cases, the room contains various static obstacles, including a narrow corridor in the top right area. Trajectories show that most people would instinctively avoid the “areas of caution” around the robots, marked with yellow tape (see the layout in Fig. 2).
Fig. 4: Maps of dynamics created from Acquisition I - IV (40 minutes) in Scenario 1A (top) and Scenario 1B (bottom). CLiFF-map [8] is used to represent statistical information about flow patterns.

At the beginning of each session on an acquisition day, participants filled out a demographic questionnaire. For each scenario variation, we recorded two runs of 4 minutes each on every acquisition day. A summary of the scenarios and durations is given in Tab. I. We always started with Scenario 1B so that participants did not observe the lane markings and stop signs being set up, which could have biased their motion. After the two runs of this scenario, we continued with Scenario 1A and Scenario 2. Finally, we recorded the two variations of Scenario 3, in no particular order across the recording days.

After each run, participants fill out the Raw version of the NASA-Task Load Index (RTLX) [7, 6]. It consists of a set of 21-point sub-scales [1=Low; 21=High], which assess the mental demand, physical demand, temporal demand, and effort imposed by the task as reported by the participants, as well as their self-perceived performance and frustration. At the end of the session, after the last run of Scenario 3, participants fill out two extra questionnaires regarding the mobile robot. The first is the Godspeed Questionnaire Series [1], a set of 5-point semantic differential subscales that measure the participants’ perception of the robot in terms of anthropomorphism, animacy, likeability, perceived intelligence, and perceived safety. The second is a 5-point Likert scale [1=Strongly disagree; 5=Strongly agree] that evaluates trust towards the robot in industrial human-robot collaborations [4]. The participants filled out all the questionnaires on paper.
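For readers unfamiliar with the raw variant of the NASA-TLX, the overall workload score is commonly computed as the unweighted mean of the six sub-scale ratings, without the pairwise weighting step of the full procedure. The sketch below illustrates this under stated assumptions: the dictionary keys, the function name `raw_tlx` and the example ratings are invented for illustration, and all ratings are assumed to be already oriented so that higher means more load.

```python
SUBSCALES = (
    "mental_demand",
    "physical_demand",
    "temporal_demand",
    "performance",
    "effort",
    "frustration",
)


def raw_tlx(ratings, low=1, high=21):
    """Return the unweighted mean of the six sub-scale ratings (Raw TLX)."""
    missing = [s for s in SUBSCALES if s not in ratings]
    if missing:
        raise ValueError(f"missing sub-scales: {missing}")
    for name in SUBSCALES:
        if not low <= ratings[name] <= high:
            raise ValueError(f"{name} must lie in [{low}, {high}]")
    return sum(ratings[name] for name in SUBSCALES) / len(SUBSCALES)


# Invented ratings for one participant after one run (not recorded data).
example = {
    "mental_demand": 6,
    "physical_demand": 4,
    "temporal_demand": 8,
    "performance": 5,
    "effort": 7,
    "frustration": 3,
}
score = raw_tlx(example)
print(score)  # 33 / 6 = 5.5
```

Some studies report the sum of the ratings or rescale the result to 0-100 instead; the choice only changes the scale, not the ranking of runs.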

Before recording each run, an instructor calibrates the three eye-trackers (Tobii Glasses 2 and 3) and adjusts the gaze for the Pupil Invisible Glasses. The instructor then returns to the stage and sets a 4-minute alarm. We check with the participants that everyone is ready to begin the measurements. If so, we start the recordings of the motion capture system and the eye trackers at the same time, while the instructor counts down from three to signal the start of the run to the participants. Additionally, we record rosbag files including sensor data from the robot platform, such as the image feed of its onboard RGB and RGB-D cameras and the point cloud recorded by the LiDAR, as well as topics related to people tracking. After 4 minutes, we stop all recordings simultaneously, and the ringing of the alarm signals the end of the run to the participants.

Between runs, while the participants fill out the questionnaires, we prepare the next run: we remove the floor markings (after the last run of 1B), set up a phone for the Discord voice chat (before Scenarios 2 and 3), check the batteries of the eye trackers and replace them if needed, and finally prepare the robots for Scenario 3. Once the participants have finished the questionnaires, we shuffle the roles in Scenarios 2 and 3 and assign new groups of one to three participants for the next run, always following the rule that a group can contain at most one participant with an eye tracker. We assign each group a new goal point to start from in the next run. For the scenarios with roles (2 and 3), we also give a short recap of the task connected with each role if the participant has not been assigned this role before.

III Recorded Data

We recorded data on 4 acquisition days for a total of 30 unique participants (9 on Day I, 7 on each of Days II-IV). As described in Sec. II-B, each acquisition day consists of three different scenarios, two of which have two variations. Furthermore, we recorded two 4-minute runs per scenario variation. Therefore, each acquisition day comprises ten runs covering all scenarios and yields 40 minutes of multi-modal data: 3D motion patterns, eye-gaze data from 3 eye trackers, and robot sensor data.
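The totals quoted in this paper follow directly from the recording schedule. The short check below reproduces them; the constant names are chosen for this example only.

```python
RUN_MINUTES = 4
RUNS_PER_VARIATION = 2
VARIATIONS = ("1A", "1B", "2", "3A", "3B")
PARTICIPANTS_PER_DAY = {"I": 9, "II": 7, "III": 7, "IV": 7}

runs_per_day = len(VARIATIONS) * RUNS_PER_VARIATION            # 5 * 2 = 10
minutes_per_variation_per_day = RUNS_PER_VARIATION * RUN_MINUTES  # 2 * 4 = 8
minutes_per_day = runs_per_day * RUN_MINUTES                   # 10 * 4 = 40
total_minutes = minutes_per_day * len(PARTICIPANTS_PER_DAY)    # 40 * 4 = 160
total_participants = sum(PARTICIPANTS_PER_DAY.values())        # 9 + 3 * 7 = 30

print(runs_per_day, minutes_per_day, total_minutes, total_participants)
```

The 8 minutes per variation and day match the Duration column of Tab. I, and the 160 minutes match the overall recording length stated in the introduction.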

Fig. 3 shows 2D motion trajectories collected during one 4-minute run in Scenario 1A (left) and Scenario 1B (right). It shows the difference between the two scenarios in the areas delimited by the lane markings (see Fig. 2 for the layout reference). Specifically, participants in Scenario 1B tended to navigate farther from the delimited static objects than in Scenario 1A. In addition, Fig. 4 shows the maps of dynamics [8] generated from the trajectories collected in all runs of Scenario 1A and 1B. It shows that in Scenario 1B the flow is less intense near the “areas of caution” around the robots. The one-way passage flow pattern in the top right corner is also clearly visible in Scenario 1B.

Fig. 5 shows example eye-gaze data recorded at the same time instant for two participants wearing the tracking glasses. The 2D gaze direction is provided in the first-person video frame. Furthermore, we calculate the 3D gaze coordinates in the global map frame. Finally, Fig. 6 provides an example of the data recorded with the on-board robot sensors (LiDAR, RGB-D and fish-eye cameras), displayed in RViz.
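The paper does not detail how the 3D gaze vectors are obtained. As a generic illustration of what expressing a gaze direction in the global map frame involves, the sketch below rotates a gaze direction given in the head frame by the head orientation from motion capture. The unit-quaternion convention `(w, x, y, z)`, the assumption that the eye-tracker frame coincides with the tracked head frame, and the function names are all hypothetical.

```python
import math


def quat_rotate(q, v):
    """Rotate 3D vector v by the unit quaternion q = (w, x, y, z)."""
    w, x, y, z = q
    vx, vy, vz = v
    # t = 2 * (q_vec x v)
    tx = 2.0 * (y * vz - z * vy)
    ty = 2.0 * (z * vx - x * vz)
    tz = 2.0 * (x * vy - y * vx)
    # v' = v + w * t + q_vec x t
    return (
        vx + w * tx + (y * tz - z * ty),
        vy + w * ty + (z * tx - x * tz),
        vz + w * tz + (x * ty - y * tx),
    )


def gaze_in_world(head_position, head_orientation, gaze_dir_head, length=1.0):
    """Return the unit gaze direction in the world frame and the ray end point.

    head_position:    (x, y, z) of the head in the world frame.
    head_orientation: unit quaternion (w, x, y, z) mapping head to world frame.
    gaze_dir_head:    gaze direction expressed in the head frame.
    """
    d = quat_rotate(head_orientation, gaze_dir_head)
    norm = math.sqrt(sum(c * c for c in d))
    d = tuple(c / norm for c in d)
    end = tuple(p + length * c for p, c in zip(head_position, d))
    return d, end


# Head at 1.7 m height, turned 90 degrees about the vertical (z) axis,
# gazing straight ahead along the head's own x-axis.
yaw = math.pi / 2
head_q = (math.cos(yaw / 2), 0.0, 0.0, math.sin(yaw / 2))
direction, end_point = gaze_in_world((2.0, 3.0, 1.7), head_q, (1.0, 0.0, 0.0))
print(direction, end_point)
```

In this example the gaze, which points along the head's x-axis, ends up pointing along the world y-axis, as expected for a 90-degree turn about the vertical axis. A real pipeline would additionally need the calibrated offset between the glasses and the tracked helmet, and time synchronization between the two systems.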

Fig. 5: Eye-gaze vectors, recorded for the participants wearing eye-tracking glasses. Top: gaze-vectors mapped into the 3D global map frame for participants 2 (dark blue arrow) and 9 (light blue arrow). Bottom: corresponding first-person views for participants 2 and 9. Red line displays the gaze history in the past 2 seconds, and the red circle shows the gaze point in the current frame.
Fig. 6: Data collected from the moving robot. Top: 3D visualization of sensor data in RViz: Ouster point cloud shown as red/yellow points, an occupancy grid map for the laboratory and the output of the people tracking module (yellow bounding box) [9]. Bottom left: Fish-eye RGB camera image. Bottom Right: Azure Kinect RGB-D camera image.

IV Conclusion and Future Work

In this paper we present a new contextually-rich recording of human-robot co-navigation in an indoor environment. The multi-modal data on human motion, collected from the motion capture system, eye-gaze trackers and the on-board sensors of a moving robot, aims to supply the research on human motion prediction, obstacle avoidance, maps of dynamics and human-robot interaction.

In future work we plan to extend the co-navigation scenarios with explicit forms of human-robot communication, for instance by signalling the robot’s intentions, and collaboration, for instance in loading, transporting and unloading the boxes.

References

  • [1] C. Bartneck, D. Kulić, E. Croft, and S. Zoghbi (2009) Measurement instruments for the anthropomorphism, animacy, likeability, perceived intelligence, and perceived safety of robots. Int. J. Soc. Robot. 1 (1), pp. 71–81.
  • [2] J. Bock, R. Krajewski, T. Moers, S. Runde, L. Vater, and L. Eckstein (2020) The inD dataset: a drone dataset of naturalistic road user trajectories at German intersections. In Proc. of the IEEE Intelligent Vehicles Symposium (IV), pp. 1929–1934.
  • [3] D. Brščić, T. Kanda, T. Ikeda, and T. Miyashita (2013) Person tracking in large public spaces using 3-D range sensors. IEEE Trans. on Human-Machine Systems 43 (6), pp. 522–534.
  • [4] G. Charalambous, S. Fletcher, and P. Webb (2016) The development of a scale to evaluate trust in industrial human-robot collaboration. International Journal of Social Robotics 8, pp. 193–209.
  • [5] Discord.
  • [6] S. G. Hart (2006) NASA-Task Load Index (NASA-TLX); 20 years later. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting, pp. 904–908.
  • [7] S. G. Hart and L. E. Staveland (1988) Development of NASA-TLX (Task Load Index): results of empirical and theoretical research. Adv. Psychol. 52, pp. 139–183.
  • [8] T. P. Kucner, A. J. Lilienthal, M. Magnusson, L. Palmieri, and C. S. Swaminathan (2020) Probabilistic mapping of spatial motion patterns for mobile robots. Springer.
  • [9] T. Linder, K. Y. Pfeiffer, N. Vaskevicius, R. Schirmer, and K. O. Arras (2020) Accurate detection and 3D localization of humans using a novel YOLO-based RGB-D fusion approach and synthetic training data. In Proc. of the IEEE Int. Conf. on Robotics and Automation (ICRA).
  • [10] A. Robicquet, A. Sadeghian, A. Alahi, and S. Savarese (2016) Learning social etiquette: human trajectory understanding in crowded scenes. In Proc. of the Europ. Conf. on Comp. Vision (ECCV), pp. 549–565.
  • [11] A. Rudenko, T. P. Kucner, C. S. Swaminathan, R. T. Chadalavada, K. O. Arras, and A. J. Lilienthal (2020) THÖR: human-robot navigation data collection and accurate motion trajectories dataset. IEEE Robotics and Automation Letters 5 (2), pp. 676–682.
  • [12] A. Rudenko, L. Palmieri, M. Herman, K. M. Kitani, D. M. Gavrila, and K. O. Arras (2020) Human motion trajectory prediction: a survey. Int. J. of Robotics Research 39 (8), pp. 895–935.
  • [13] A. Rudenko, L. Palmieri, J. Doellinger, A. J. Lilienthal, and K. O. Arras (2021) Learning occupancy priors of human motion from semantic maps of urban environments. IEEE Robotics and Automation Letters 6 (2), pp. 3248–3255.
  • [14] A. Rudenko, L. Palmieri, W. Huang, A. J. Lilienthal, and K. O. Arras (2022) The Atlas benchmark: an automated evaluation framework for human motion prediction. In Proc. of the IEEE Int. Symp. on Robot and Human Interactive Comm. (RO-MAN).
  • [15] H. Yoon and S. Sankaranarayanan (2021) Predictive runtime monitoring for mobile robots using logic-based Bayesian intent inference. In Proc. of the IEEE Int. Conf. on Robotics and Automation (ICRA).
  • [16] W. Zhi, T. Lai, L. Ott, and F. Ramos (2021) Anticipatory navigation in crowds by probabilistic prediction of pedestrian future movements. In Proc. of the IEEE Int. Conf. on Robotics and Automation (ICRA).
  • [17] W. Zhi, L. Ott, and F. Ramos (2021) Probabilistic trajectory prediction with structural constraints. In Proc. of the IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), pp. 9849–9856.