# Power-Seeking in RL: Quick Summary

The following is an adaptation of part of my (unsuccessful) application to the Winter 2022 cohort of SERI MATS, written in response to Victoria Krakovna’s prompt: “Explain the power-seeking theorems in the reinforcement learning setting in your own words.”

I realise that high-level summaries of this kind provide little value nowadays, as summarising bots have reached a very acceptable level. I am publishing this anyway to get into the habit of posting more.

So, **Epistemic Status:** Rendered obsolete by ChatGPT et al.

An action or behaviour is *instrumental* to an objective when it helps achieve
that objective. When an action or behaviour is instrumental to a wide range of
objectives, it is said to be *convergently instrumental.*

The power-seeking theorems in [1] formalise the notion that *power-seeking*
is convergently instrumental. That is, they show mathematically that
power-seeking helps achieve a wide range of objectives under certain
conditions, by considering the power-seeking tendencies of optimal policies in
finite MDPs.

Here, *power* is formalised as the ability to achieve a wide variety of goals,
where a goal is formalised as the maximisation of a reward function. An action
“seeks” power when it leads the agent to a state with greater power.
Mathematically, the authors express power as a modified version of average
optimal value, that is, the optimal *value* averaged over a range of goals.
*Value* is simply the expected return: the expected (discounted) sum of rewards
when acting under a given policy. The optimal value is then the maximum
achievable value, which is obtained when acting under an optimal policy.
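To make "average optimal value" concrete, here is a minimal sketch that estimates it by sampling. Everything in it is my own illustrative assumption rather than the authors' code: the five-state toy MDP, the uniform reward distribution, and the helper names `optimal_values` and `estimate_power`. The normalisation (subtract the current state's reward, then rescale by (1 - gamma) / gamma) is my reading of the "modified" part of the definition in [1], so treat it as an assumption too.

```python
import random

# Hypothetical toy MDP (not taken from the paper): moves are deterministic and
# every state carries one reward. HUB can move to any of three absorbing
# states, whereas DEAD_END can only stay where it is.
HUB, DEAD_END = "hub", "dead_end"
TRANSITIONS = {
    HUB: ["a", "b", "c"],
    "a": ["a"],
    "b": ["b"],
    "c": ["c"],
    DEAD_END: [DEAD_END],
}


def optimal_values(transitions, reward, gamma, sweeps=200):
    """Approximate the optimal value of every state by value iteration."""
    values = {s: 0.0 for s in transitions}
    for _ in range(sweeps):
        values = {
            s: reward[s] + gamma * max(values[t] for t in successors)
            for s, successors in transitions.items()
        }
    return values


def estimate_power(transitions, gamma=0.9, samples=1000, seed=0):
    """Monte Carlo average of normalised optimal value over sampled goals."""
    rng = random.Random(seed)
    totals = {s: 0.0 for s in transitions}
    for _ in range(samples):
        # A goal is a reward function; as an assumption, each state's reward
        # is drawn independently and uniformly from [0, 1).
        reward = {s: rng.random() for s in transitions}
        values = optimal_values(transitions, reward, gamma)
        for s in transitions:
            # Assumed normalisation: ignore the reward of the current state
            # and rescale so that power is comparable across discount rates.
            totals[s] += (1 - gamma) / gamma * (values[s] - reward[s])
    return {s: total / samples for s, total in totals.items()}


power = estimate_power(TRANSITIONS)
print(f"hub: {power[HUB]:.2f}, dead end: {power[DEAD_END]:.2f}")
```

In this toy setup the hub's estimate should land near 0.75 (the expected maximum of three uniform rewards) and the dead end's near 0.5 (the expected value of a single one), which matches the intuition that the state with more reachable futures has more power.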

The authors’ formalisation rests on using optimal policies to represent intelligent agents, and it considers the setting of a Markov Decision Process (MDP), which assumes the environment is fully observable. Under these premises, the authors show that certain graphical symmetries in MDPs cause optimal policies to tend to seek power. They note that optimal policies tend to avoid terminal states when possible and to prefer states with higher optionality, that is, states from which more futures remain reachable.
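The claim about terminal states can be checked empirically on a small example. The sketch below is again a hypothetical toy of my own, not the construction used in [1]: from a start state the agent either enters a terminal state it can never leave, or moves to a hub that keeps three options open. It samples reward functions uniformly (an assumption) and counts how often the optimal first move avoids the terminal state.

```python
import random

# Hypothetical toy MDP: from START the agent either enters TERMINAL, which it
# can never leave, or moves to HUB, which still offers three absorbing states.
START, TERMINAL, HUB = "start", "terminal", "hub"
TRANSITIONS = {
    START: [TERMINAL, HUB],
    TERMINAL: [TERMINAL],
    HUB: ["x", "y", "z"],
    "x": ["x"],
    "y": ["y"],
    "z": ["z"],
}


def optimal_values(transitions, reward, gamma, sweeps=200):
    """Approximate the optimal value of every state by value iteration."""
    values = {s: 0.0 for s in transitions}
    for _ in range(sweeps):
        values = {
            s: reward[s] + gamma * max(values[t] for t in successors)
            for s, successors in transitions.items()
        }
    return values


def fraction_avoiding_terminal(gamma=0.9, samples=1000, seed=0):
    """Share of sampled goals whose optimal first move is START -> HUB."""
    rng = random.Random(seed)
    avoided = 0
    for _ in range(samples):
        reward = {s: rng.random() for s in TRANSITIONS}
        values = optimal_values(TRANSITIONS, reward, gamma)
        # Both moves from START earn the same immediate reward, so the optimal
        # move is the one leading to the successor with higher optimal value.
        if values[HUB] > values[TERMINAL]:
            avoided += 1
    return avoided / samples


fraction = fraction_avoiding_terminal()
print(f"optimal policy avoids the terminal state for {fraction:.0%} of goals")
```

Under these assumptions, roughly seven in ten sampled goals make it optimal to avoid the terminal state. It is not all of them: when the terminal state happens to carry a high reward, entering it is optimal. This is why the result is stated as a tendency across goals rather than a guarantee for every goal.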

## References

[1] A. Turner, L. Smith, R. Shah, A. Critch, and P. Tadepalli, ‘Optimal Policies
Tend To Seek Power’, in *Advances in Neural Information Processing Systems*,
2021, vol. 34, pp. 23063–23074. [Online]. Available:
https://proceedings.neurips.cc/paper/2021/hash/c26820b8a4c1b3c2aa868d6d57e14a79-Abstract.html