Design good carrots and sticks

British anthropologist Marilyn Strathern famously wrote, “When a measure becomes a target, it ceases to be a good measure.” This observation, known as Goodhart’s Law after the British economist Charles Goodhart who first formulated it, captures our tendency to focus on the explicit objective while ignoring other important (but unstated) goals and values.

For example, Soviet planners supposedly rewarded factories for the number of nails they produced, so factory managers started churning out millions of tiny, useless nails. When the planners figured out what was happening, they switched to a weight criterion, and the factories started producing giant, equally useless nails.

It turns out that designing good ‘reward functions’ is a non-trivial matter. In the field of AI, this is called the Reward Shaping problem: an AI can learn ingenious ways to achieve the goal given to it, often undermining the intended objective or generating unnecessary negative side effects. For example, a robot rewarded every time it removes dirt from your house will be tempted to bring in trash from the street, only to clean it up in anticipation of additional reward. The trash moves between your house and the street in an endless cycle. This type of behavior is called Reward Hacking, and it is considered one of the most urgent problems in AI safety today.
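To make the failure mode concrete, here is a minimal toy sketch (the scenario, variable names, and numbers are illustrative assumptions, not taken from any of the references below). The robot’s proxy reward counts pieces of trash removed, while the intended objective is how clean the house actually is.

```python
# Toy illustration of reward hacking: the proxy reward (trash removed)
# keeps growing while the intended objective (a clean house) never improves.

def run_episode(steps: int = 10) -> None:
    trash_in_house = 3        # trash the house starts with
    proxy_reward = 0          # what the robot is actually optimized for
    for t in range(steps):
        if trash_in_house == 0:
            # Reward hack: importing trash creates new cleaning opportunities.
            trash_in_house += 1
        trash_in_house -= 1    # remove one piece of trash
        proxy_reward += 1      # +1 per piece removed -- the proxy keeps growing
        intended_objective = -trash_in_house   # what we actually wanted
        print(f"step {t}: proxy reward = {proxy_reward}, "
              f"house cleanliness = {intended_objective}")

if __name__ == "__main__":
    run_episode()
```

Run it and the proxy reward increases without bound while the house never gets any cleaner: the agent optimizes the measure, not the goal.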

Many ideas have been proposed to prevent AIs from hacking their reward functions, or from generating undesirable side effects while achieving their main goal. One of the most interesting, championed by UC Berkeley computer scientist Stuart Russell and his team, is reward uncertainty: the AI should be programmed to maximize human goals while never being completely certain about what those goals actually are. For example, say you ask an AI to make as many paperclips as possible. Before it starts a global war to control all means of production for the purpose of paperclip making, it would pause to wonder: what if maintaining peace is also an important goal for the humans? Let me check with them first.
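As a rough illustration of the idea (a toy sketch, not Russell’s actual formalism; the hypotheses, weights, and threshold below are made-up assumptions), an agent can hold several weighted hypotheses about what humans value and defer to a human whenever some plausible hypothesis rates an action as catastrophic.

```python
# Toy sketch of acting under uncertainty about the human's true objective.
from dataclasses import dataclass

@dataclass
class Hypothesis:
    name: str
    weight: float               # agent's credence in this hypothesis
    rewards: dict               # reward each action would earn if this is true

hypotheses = [
    Hypothesis("paperclips only",     0.6, {"build_factory": 10, "start_war": 50}),
    Hypothesis("paperclips + peace",  0.4, {"build_factory": 8,  "start_war": -1000}),
]

def choose_action(actions):
    best, best_value = None, float("-inf")
    for action in actions:
        expected = sum(h.weight * h.rewards[action] for h in hypotheses)
        worst = min(h.rewards[action] for h in hypotheses)
        if worst < -100:
            # Some plausible human objective rates this action as catastrophic:
            # pause and ask the humans instead of optimizing blindly.
            print(f"{action}: deferring to the human (worst case {worst})")
            continue
        if expected > best_value:
            best, best_value = action, expected
    return best

print("chosen:", choose_action(["build_factory", "start_war"]))
```

Because one hypothesis says starting a war would be disastrous, the agent refuses to take that action on its own and builds the factory instead, which is the behavior reward uncertainty is meant to encourage.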

References

  • Strathern, M. ‘Improving ratings’: audit in the British University system. Eur. Rev. 5, 305–321 (1997).

  • Goodhart, C. A. E. Problems of Monetary Management: The UK Experience. in Monetary Theory and Practice: The UK Experience (ed. Goodhart, C. A. E.) 91–121 (Macmillan Education UK, 1984).

  • Coy, P. Goodhart’s Law Rules the Modern World. Here Are Nine Examples. Bloomberg Businessweek https://www.bloomberg.com/news/articles/2021-03-26/goodhart-s-law-rules-the-modern-world-here-are-nine-examples (2021).

  • Amodei, D. et al. Concrete Problems in AI Safety. arXiv:1606.06565 (2016).

  • Christian, B. The Alignment Problem: Machine Learning and Human Values. (W. W. Norton & Company, 2020).

  • Russell, S. Human Compatible: Artificial Intelligence and the Problem of Control. (Penguin Publishing Group, 2019).

  • Bostrom, N. Superintelligence: Paths, Dangers, Strategies. (Oxford University Press, 2014).
