By Rohin Shah and Dmitrii Krasheninnikov
It would be great if we could all have household robots do our chores for us. Chores are tasks that we want done to make our houses cater more to our preferences; they are a way in which we want our house to be different from the way it currently is. However, most “different” states are not very desirable:
Surely our robot wouldn’t be so dumb as to go around breaking stuff when we ask it to clean our house? Unfortunately, AI systems trained with reinforcement learning only optimize features specified in the reward function and are indifferent to anything we might’ve inadvertently left out. Generally, it is easy to get the reward wrong by forgetting to include preferences for things that should stay the same, since we are so used to having these preferences satisfied, and there are so many of them. Consider the room below, and imagine that we want a robot waiter that serves people at the dining table efficiently. We might implement this using a reward function that provides 1 reward whenever the robot serves a dish, and use discounting so that the robot is incentivized to be efficient. What could go wrong with such a reward function? How would we need to modify the reward function to take this into account? Take a minute to think about it.