about

contribute

republish

AIhub resources

AIhub events

☰

Robohub.org

Unsupervised meta-learning: learning to learn without supervision

by BAIR Blog

06 May 2020

Meta-Learning Optimizes Knobs of the Learning Procedure

While machine learning practitioners often carefully tune these knobs by hand, if we are going to solve many tasks, it may be useful to automatic this process. The process of setting the knobs of learning procedures via optimization is called meta-learning [Thrun 1998]. Algorithms that perform this optimization problem automatically are known as meta-learning algorithms. Explicitly tuning the knobs of learning procedures is an active area of research, with various researchers looking at tuning the update rules [Andrychowicz 2016, Duan 2016, Wang 2016], weight initialization [Finn 2017], network weights [Ha 2016], network architectures [Gaier 2019], and other facets of learning procedures.

To evaluate a setting of knobs, meta-learning algorithms consider not one task but a distribution over many tasks. For example, a distribution over supervised learning tasks may include learning a dog detector, learning a cat detector, and learning a bird detector. In reinforcement learning, a task distribution could be defined as driving a car in a smooth, safe, and efficient manner, where tasks differ by the weights they place on smoothness, safety, and efficiency. Ideally, the task distribution is designed to mirror the distribution over tasks that we are likely to encounter in the real world. Since the tasks in a task distribution are typically related, information from one task may be useful in solving other tasks more efficiently. As you might expect, a knob setting that works best on one distribution of tasks may not be the best for another task distribution; the optimal knob setting depends on the task distribution.

An illustration of meta-learning, where tasks correspond to arranging blocks into different types of towers. The human has a particular block tower in mind and rewards the robot when it builds the correct tower. The robot’s aim is to build the correct tower as quickly as possible.

In many settings we want to do well on a task distribution to which we have only limited access. For example, in a self-driving car, tasks may correspond to finding the optimal balance of smoothness, safety, and efficiency for each rider, but querying riders to get rewards is expensive. A researcher can attempt to manually construct a task distribution that mimics the true task distribution, but this can be quite challenging and time consuming. Can we avoid having to manually design such task distributions?

To answer this question, we must understand where the benefits of meta-learning come from. When we define task distributions for meta-learning, we do so with some prior knowledge in mind. Without this prior information, tuning the knobs of a learning procedure is often a zero-sum game: setting the knobs to any configuration will accelerate learning on some tasks while slowing learning on other tasks. Does this suggest there is no way to see the benefit of meta-learning without the manual construction of task distributions? Perhaps not! The next section presents an alternative.

Optimizing the Learning Procedure with Self-Proposed Tasks

If designing task distributions is the bottleneck in applying meta-learning algorithms, why not have meta-learning algorithms propose their own tasks? At first glance this seems like a terrible idea, because the No Free Lunch Theorem suggests that this is impossible, without additional knowledge. However, many real-world settings do provide a bit of additional information, albeit disguised as unlabeled data. For example, in regression, we might have access to an unlabeled dataset and know that the downstream tasks will be labeled versions of this same image dataset. In a RL setting, a robot can interact with its environment without receiving any reward, knowing that downstream tasks will be constructed by defining reward functions for this very environment (i.e. the real world). Seen from this perspective, the recipe for unsupervised meta-learning (doing meta-learning without manually constructed tasks) becomes clear: given unlabeled data, construct task distributions from this unlabeled data or environment, and then meta-learn to quickly solve these self-proposed tasks.

In unsupervised meta-learning, the agent proposes its own tasks, rather than relying on tasks proposed by a human.

How can we use this unlabeled data to construct task distributions which will facilitate learning downstream tasks? In the case of regression, prior work on unsupervised meta-learning [Hsu 2018, Khodadadeh 2019] clusters an unlabeled dataset of images and then randomly chooses subsets of the clusters to define a distribution of classification tasks. Other work [Jabri 2019] look at an RL setting: after exploring an environment without a reward function to collect a set of behaviors that are feasible in this environment, these behaviors are clustered and used to define a distribution of reward functions. In both cases, even though the tasks constructed can be random, the resulting task distribution is not random, because all tasks share the underlying unlabeled data — the image dataset for regression and the environment dynamics for reinforcement learning. The underlying unlabeled data are the inductive bias with which we pay for our free lunch.

Let us take a deeper look into the RL case. Without knowing the downstream tasks or reward functions, what is the “best” task distribution for “practicing” to solve tasks quickly? Can we measure how effective a task distribution is for solving unknown, downstream tasks? Is there any sense in which one unsupervised task proposal mechanism is better than another? Understanding the answers to these questions may guide the principled development of meta-learning algorithms with little dependence on human supervision. Our work [Gupta 2018], takes a first step towards answering these questions. In particular, we examine the worst-case performance of learning procedures, and derive an optimal unsupervised meta-reinforcement learning procedure.

Optimal Unsupervised Meta-Learning

To answer the questions posed above, our first step is to define an optimal meta-learner for the case where the distribution of tasks is known. We define an optimal meta-learner as the learning procedure that achieves the largest expected reward, averaged across the distribution of tasks. More precisely, we will compare the expected reward for a learning procedure $f$ to that of best learning procedure $f^*$, defining the regret of $f$ on a task distribution $p$ as follows:

Extending this definition to the case of unsupervised meta-learning, an optimal unsupervised meta-learner can be defined as a meta-learner that achieves the minimum worst-case regret across all possible task distributions that may be encountered in the environment. In the absence of any knowledge about the actual downstream task, we resort to a worst case formulation. An unsupervised meta-learning algorithm will find a single learning procedure $f$ that has the lowest regret against an adversarially chosen task distribution $p$:

$\min_f \max_p \;\; {\rm Regret}(f,p).$

Our work analyzes how exactly we might obtain such an optimal unsupervised meta-learner, and provides bounds on the regret that it might incur in the worst case. Specifically, under some restrictions on the family of tasks that might be encountered at test-time, the optimal distribution for an unsupervised meta-learner to propose is uniform over all possible tasks.

The intuition for this is straightforward: if the test time task distribution can be chosen adversarially, the algorithm must make sure it is uniformly good over all possible tasks that might be encountered. As a didactic example, if test-time reward functions were restricted to the class of goal-reaching tasks, the regret for reaching a goal at test-time is inverse related to the probability of sampling that goal during training-time. If any one of the goals $g$ has lower density than the others, an adversary can propose a task distribution solely consisting of reaching that goal $g$ causing the learning procedure to incur a higher regret. This example suggests that we can find an optimal unsupervised meta-learner using a uniform distribution over goals. Our paper formalizes this idea and extends it to broader classes task distributions.

Now, actually sampling from a uniform distribution over all possible tasks is quite challenging. Several recent papers have proposed RL exploration methods based on maximizing mutual information [Achiam 2018, Eysenbach 2018, Gregor 2016, Lee 2019, Sharma 2019]. In this work, we show that these methods provide a tractable approximation to the uniform distribution over task distributions. To understand why this is, we can look at the form of a mutual information considered by [Eysenbach 2018], between states $s$ and latent variables $z$:

$\mathcal{I}(s,z) = \mathcal{H}(s) - \mathcal{H}(s|z).$

In this objective, the first marginal entropy term is maximized when there is a uniform distribution over all possible tasks. The second conditional entropy term ensures consistency, by making sure that for each $z$, the resulting distribution of $s$ is narrow. This suggests constructing unsupervised task-distributions in an environment by optimizing mutual information gives us a provably optimal task distribution, according to our notion of min-max optimality.

While the analysis makes some limiting assumptions about the forms of tasks encountered, we show how this analysis can be extended to provide a bound on the performance in the most general case of reinforcement learning. It also provides empirical gains on several simulated environments as compared to methods which train from scratch, as shown in the Figure below.

Summary & Discussion

In summary:

Learning procedures are recipes for converting datasets into function approximators. Learning procedures have many knobs, which can be tuned by optimizing the learning procedures to solve a distribution of tasks.
Manually designing these task distributions is challenging, so a recent line of work suggests that the learning procedure can use unlabeled data to propose its own tasks for optimizing its knobs.
These unsupervised meta-learning algorithms allow for learning in regimes previously impractical, and further expand that capability of machine learning methods.
This work closely relates to other works on unsupervised skill discovery, exploration and representation learning, but explicitly optimizes for transferability of the representations and skills to downstream tasks.

A number of open questions remain about unsupervised meta-learning:

Unsupervised learning is closely connected to unsupervised meta-learning: the former uses unlabeled data to learn features, while the second uses unlabeled data to tune the learning procedure. Might there be some unifying treatment of both approaches?
Our analysis only proves that task proposal based on mutual information is optimal for memoryless meta-learning algorithms. Meta-learning algorithms with memory, which we expect will perform better, may perform best with different task proposal mechanisms.
Scaling unsupervised meta learning to leverage large-scale datasets and complex tasks holds the promise of acquiring learning procedures for solving real-world problems more efficiently than our current learning procedures.

Check out our paper for more experiments and proofs: https://arxiv.org/abs/1806.04640

Acknowledgments

Thanks to Jake Tyo, Conor Igoe, Sergey Levine, Chelsea Finn, Misha Khodak, Daniel Seita, and Stefani Karp for their feedback.This article was initially published on the BAIR blog, and appears here with the authors’ permission.

These knobs are often known as hyperparameters, but we will stick with the colloquial “knob” to avoid having to draw a line between parameters and hyperparameters. ↩

BAIR Blog is the official blog of the Berkeley Artificial Intelligence Research (BAIR) Lab.

AUAI is supported by:

Wristband enables wearers to control a robotic hand with their own movements

MIT News 13 Jul 2026

By moving their hands and fingers, users can direct a robot to play the piano, shoot a basketball, or manipulate objects in a virtual environment.

#RoboCup2026 social media round-up

Lucy Smith 08 Jul 2026

Find out what the teams got up to at this year's RoboCup extravaganza in Incheon.

#RoboCup2026 – humanoid league knockout stages

Lucy Smith 06 Jul 2026

Find out who won the small, middle and large divisions in Incheon.

#RoboCup2026 – humanoid league day 2

Lucy Smith 03 Jul 2026

Find out the latest from day two of the competition.

Reflections from ICRA 2026

Ella Scallan 02 Jul 2026

From dancing robots to moral machines: our Assistant Editor reflects on ICRA 2026.

#RoboCup2026 – humanoid league day 1

Lucy Smith 02 Jul 2026

In the first of our round-ups from the humanoid league we introduce the competition, and report some preliminary results.

What’s coming up at #RoboCup2026?

Lucy Smith 29 Jun 2026

Find out what's in store at this year's international competition.

Robot Talk Episode 162 – The robot doctor will see you now

Robot Talk 26 Jun 2026

In this special live recording at the Great Exhibition Road Festival in London, Claire chatted to George Mylonas (Imperial College London), Antonia Tzemanaki (University of Bristol) and Tom Vercauteren (King’s College London) about robotics and AI in medicine and healthcare.