Modelbased reinforcement learning with neural network dynamics
By Anusha Nagabandi and Gregory Kahn
Enabling robots to act autonomously in the realworld is difficult. Really, really difficult. Even with expensive robots and teams of worldclass researchers, robots still have difficulty autonomously navigating and interacting in complex, unstructured environments.
Fig 1. A learned neural network dynamics model enables a hexapod robot to learn to run and follow desired trajectories, using just 17 minutes of realworld experience.
Why are autonomous robots not out in the world among us? Engineering systems that can cope with all the complexities of our world is hard. From nonlinear dynamics and partial observability to unpredictable terrain and sensor malfunctions, robots are particularly susceptible to Murphy’s law: everything that can go wrong, will go wrong. Instead of fighting Murphy’s law by coding each possible scenario that our robots may encounter, we could instead choose to embrace this possibility for failure, and enable our robots to learn from it. Learning control strategies from experience is advantageous because, unlike handengineered controllers, learned controllers can adapt and improve with more data. Therefore, when presented with a scenario in which everything does go wrong, although the robot will still fail, the learned controller will hopefully correct its mistake the next time it is presented with a similar scenario. In order to deal with complexities of tasks in the real world, current learningbased methods often use deep neural networks, which are powerful but not data efficient: These trialanderror based learners will most often still fail a second time, and a third time, and often thousands to millions of times. The sample inefficiency of modern deep reinforcement learning methods is one of the main bottlenecks to leveraging learningbased methods in the realworld.
We have been investigating sampleefficient learningbased approaches with neural networks for robot control. For complex and contactrich simulated robots, as well as realworld robots (Fig. 1), our approach is able to learn locomotion skills of trajectoryfollowing using only minutes of data collected from the robot randomly acting in the environment. In this blog post, we’ll provide an overview of our approach and results. More details can be found in our research papers listed at the bottom of this post, including this paper with code here.
Sample efficiency: modelfree versus modelbased
Learning robotic skills from experience typically falls under the umbrella of reinforcement learning. Reinforcement learning algorithms can generally be divided into categories: modelfree, which learn a policy or value function, and modelbased, which learn a dynamics model. While modelfree deep reinforcement learning algorithms are capable of learning a wide range of robotic skills, they typically suffer from very high sample complexity, often requiring millions of samples to achieve good performance, and can typically only learn a single task at a time. Although some prior work has deployed these modelfree algorithms for realworld manipulation tasks, the high sample complexity and inflexibility of these algorithms has hindered them from being widely used to learn locomotion skills in the real world.
Modelbased reinforcement learning algorithms are generally regarded as being more sample efficient. However, to achieve good sample efficiency, these modelbased algorithms have conventionally used either relatively simple function approximators, which fail to generalize well to complex tasks, or probabilistic dynamics models such as Gaussian processes, which generalize well but have difficulty with complex and highdimensional domains, such as systems with frictional contacts that induce discontinuous dynamics. Instead, we use mediumsized neural networks to serve as function approximators that can achieve excellent sample efficiency, while still being expressive enough for generalization and application to various complex and highdimensional locomotion tasks.
Neural Network Dynamics for ModelBased Deep Reinforcement Learning
In our work, we aim to extend the successes that deep neural network models have seen in other domains into modelbased reinforcement learning. Prior efforts to combine neural networks with modelbased RL in recent years have not achieved the kinds of results that are competitive with simpler models, such as Gaussian processes. For example, Gu et. al. observed that even linear models achieved better performance for synthetic experience generation, while Heess et. al. saw relatively modest gains from including neural network models into a modelfree learning system. Our approach relies on a few crucial decisions. First, we use the learned neural network model within a model predictive control framework, in which the system can iteratively replan and correct its mistakes. Second, we use a relatively short horizon lookahead so that we do not have to rely on the model to make very accurate predictions far into the future. These two relatively simple design decisions enable our method to perform a wide variety of locomotion tasks that have not previously been demonstrated with generalpurpose modelbased reinforcement learning methods that operate directly on raw state observations.
A diagram of our modelbased reinforcement learning approach is shown in Fig. 2. We maintain a dataset of trajectories that we iteratively add to, and we use this dataset to train our dynamics model. The dataset is initialized with random trajectories. We then perform reinforcement learning by alternating between training a neural network dynamics model using the dataset, and using a model predictive controller (MPC) with our learned dynamics model to gather additional trajectories to aggregate onto the dataset. We discuss these two components below.
Fig 2. Overview of our modelbased reinforcement learning algorithm.
Dynamics Model
We parameterize our learned dynamics function as a deep neural network, parameterized by some weights that need to be learned. Our dynamics function takes as input the current state $s_t$ and action $a_t$, and outputs the predicted state difference $s_{t+1}s_t$. The dynamics model itself can be trained in a supervised learning setting, where collected training data comes in pairs of inputs $(s_t,a_t)$ and corresponding output labels $(s_{t+1},s_t)$.
Note that the “state” that we refer to above can vary with the agent, and it can include elements such as center of mass position, center of mass velocity, joint positions, and other measurable quantities that we choose to include.
Controller
In order to use the learned dynamics model to accomplish a task, we need to define a reward function that encodes the task. For example, a standard “x_vel” reward could encode a task of moving forward. For the task of trajectory following, we formulate a reward function that incentivizes staying close to the trajectory as well as making forward progress along the trajectory.
Using the learned dynamics model and task reward function, we formulate a modelbased controller. At each time step, the agent plans $H$ steps into the future by randomly generating $K$ candidate action sequences, using the learned dynamics model to predict the outcome of those action sequences, and selecting the sequence corresponding to the highest cumulative reward (Fig. 3). We then execute only the first action from the action sequence, and then repeat the planning process at the next time step. This replanning makes the approach robust to inaccuracies in the learned dynamics model.
Fig 3. Illustration of the process of simulating multiple candidate action sequences using the learned dynamics model, predicting their outcome, and selecting the best one according to the reward function.
Results
We first evaluated our approach on a variety of MuJoCo agents, including the swimmer, halfcheetah, and ant. Fig. 4 shows that using our learned dynamics model and MPC controller, the agents were able to follow paths defined by a set of sparse waypoints. Furthermore, our approach used only minutes of random data to train the learned dynamics model, showing its sample efficiency.
Note that with this method, we trained the model only once, but simply by changing the reward function, we were able to apply the model at runtime to a variety of different desired trajectories, without a need for separate taskspecific training.
Fig 4: Trajectory following results with ant, swimmer, and halfcheetah. The dynamics model used by each agent in order to perform these various trajectories was trained just once, using only randomly collected training data.
What aspects of our approach were important to achieve good performance? We first looked at the effect of varying the MPC planning horizon H. Fig. 5 shows that performance suffers if the horizon is too short, possibly due to unrecoverable greedy behavior. For halfcheetah, performance also suffers if the horizon is too long, due to inaccuracies in the learned dynamics model. Fig. 6 illustrates our learned dynamics model for a single 100step prediction, showing that openloop predictions for certain state elements eventually diverge from the ground truth. Therefore, an intermediate planning horizon is best to avoid greedy behavior while minimizing the detrimental effects of an inaccurate model.
Fig 5: Plot of task performance achieved by controllers using different horizon values for planning. Too low of a horizon is not good, and neither is too high of a horizon.
Fig 6: A 100step forward simulation (openloop) of the dynamics model, showing that openloop predictions for certain state elements eventually diverge from the ground truth.
We also varied the number of initial random trajectories used to train the dynamics model. Fig. 7 shows that although a higher amount of initial training data leads to higher initial performance, data aggregation allows even lowdata initialization experiment runs to reach a high final performance level. This highlights how onpolicy data from reinforcement learning can improve sample efficiency.
Fig 7: Plot of task performance achieved by dynamics models that were trained using differing amounts of initial random data.
It is worth noting that the final performance of the modelbased controller is still substantially lower than that of a very good modelfree learner (when the modelfree learner is trained with thousands of times more experience). This suboptimal performance is sometimes referred to as “model bias,” and is a known issue in modelbased RL. To address this issue, we also proposed a hybrid approach that combines modelbased and modelfree learning to eliminate the asymptotic bias at convergence, though at the cost of additional experience. This hybrid approach, as well as additional analyses, are available in our paper.
Learning to run in the real world
Fig 8: The VelociRoACH is 10 cm in length, approximately 30 grams in weight, can move up to 27 bodylengths per second, and uses two motors to control all six legs.
Since our modelbased reinforcement learning algorithm can learn locomotion gaits using orders of magnitude less experience than modelfree algorithms, it is possible to evaluate it directly on a realworld robotic platform. In other work, we studied how this method can learn entirely from realworld experience, acquiring locomotion gaits for a millirobot (Fig. 8) completely from scratch.
Millirobots are a promising robotic platform for many applications due to their small size and low manufacturing costs. However, controlling these millirobots is difficult due to their underactuation, power constraints, and size. While handengineered controllers can sometimes control these millirobots, they often have difficulties with dynamic maneuvers and complex terrains. We therefore leveraged our modelbased learning technique from above to enable the VelociRoACH millirobot to do trajectory following. Fig. 9 shows that our modelbased controller can accurately follow trajectories at high speeds, after having been trained using only 17 minutes of random data.
Fig 9: The VelociRoACH following various desired trajectories, using our modelbased learning approach.
To analyze the model’s generalization capabilities, we gathered data on both carpet and styrofoam terrain, and we evaluated our approach as shown in Table 1. As expected, the modelbased controller performs best when executed on the same terrain that it was trained on, indicating that the model incorporates knowledge of the terrain. However, performance diminishes when the model is trained on data gathered from both terrains, which likely indicates that more work is needed to develop algorithms for learning models that are effective across various task settings. Promisingly, Table 2 shows that performance increases as more data is used to train the dynamics model, which is an encouraging indication that our approach will continue to improve over time (unlike handengineered solutions).
Table 1: Trajectory following costs incurred for models trained with different types of data and for trajectories executed on different surfaces.
Table 2: Trajectory following costs incurred during the use of dynamics models trained with differing amounts of data. legs.
We hope that these results show the promise of modelbased approaches for sampleefficient robot learning and encourage future research in this area.
This article was initially published on the BAIR blog, and appears here with the authors’ permission.
We would like to thank Sergey Levine and Ronald Fearing for their feedback.
This post is based on the following papers:

Neural Network Dynamics Models for Control of Underactuated Legged Millirobots
A Nagabandi, G Yang, T Asmar, G Kahn, S Levine, R Fearing
Paper 
Neural Network Dynamics for ModelBased Deep Reinforcement Learning with ModelFree FineTuning
A Nagabandi, G Kahn, R Fearing, S Levine
Paper, Website, Code
July 12, 2021
Need help spreading the word?
Join the Robohub crowdfunding page and increase the visibility of your campaign