Reinforcement Learning — Controversy over Reward

How To Make Autonomous Cars Trustworthy — IEEE Standards Association (07 Apr 2021)

Lately, reinforcement learning has been a source of controversy over whether reward alone is enough to make appropriate “intelligent” decisions. Reinforcement learning (RL), which unlike supervised deep learning does not require historical and/or labelled data, is based on a reward paradigm in which the agent (here, the self-driving vehicle) is rewarded as it navigates through its environment. The reward is computed by measuring the quality (value) of the agent’s overall navigation performance. The aim is to get the agent to act so as to maximize its reward while also considering the long-term results of its actions, such as decelerating ahead of a turn or avoiding obstacles. The environment is modelled as a stochastic finite state machine, with actions sent from the agent as inputs, and observations (data) and rewards sent back to the agent as outputs.
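The input/output view of the environment can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `CorridorEnv` class, its `slip` probability and the reward values are hypothetical and do not come from any particular vehicle simulator.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: a vehicle on a 1-D road with cells 0..4.

    Modelled as a stochastic finite state machine: the input is an action,
    the outputs are an observation (the new cell) and a reward.
    """

    def __init__(self, length=5, slip=0.1, seed=0):
        self.length = length
        self.slip = slip                  # probability that "forward" has no effect
        self.rng = random.Random(seed)
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        """action: 0 = stay, 1 = move forward. Returns (observation, reward, done)."""
        if action == 1 and self.rng.random() >= self.slip:
            self.state = min(self.state + 1, self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else -0.1    # small cost per step, bonus at the goal
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Agent-environment loop: actions go in, observations and rewards come back."""
    obs = env.reset()
    total = 0.0
    for _ in range(max_steps):
        action = policy(obs)
        obs, reward, done = env.step(action)
        total += reward
        if done:
            break
    return total


# A trivial agent that always drives forward.
total_reward = run_episode(CorridorEnv(), policy=lambda obs: 1)
```

The agent here does no learning at all; the point is only the shape of the loop, in which every action produces one observation and one reward.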

Figure 1: Actions taken by the agent (inputs) as it explores its environment are rewarded (observations and rewards as outputs).

Reinforcement learning is applied in a growing number of areas as researchers work out how to get a computer to calculate the value, that is, the potential future reward, that should be assigned to each action. In the simplest case each assigned value is stored in a large table, and the computer updates these values as it learns. When the table becomes too large, as with the positions on a Go board or the pixels on a screen during a computer game, the values are instead handled using deep learning (DL), a subset of ML.

RL was used in AlphaGo, a computer program developed by DeepMind, a subsidiary of Alphabet. It is now used to improve self-driving cars, to help robots grasp objects they have never seen before, and to find the optimal configuration for the equipment in a data center.

Overall, the RL approach addresses two types of problems:

- Prediction: estimating how much reward can be expected for every combination of possible future states.

- Control: finding, by interacting with the environment and exploring its state space, a combination of actions that maximizes reward, such as how to steer an autonomous vehicle.
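The difference between the two problems can be illustrated on a tiny example. The sketch below assumes a hypothetical, fully known, deterministic four-cell road (the `transitions` table and reward values are invented for illustration); a real RL agent would not be given this table and would have to learn from trial and error instead.

```python
# Hypothetical deterministic 4-state road: state 3 is the goal (terminal).
# transitions[state][action] = (next_state, reward); actions: 0 = stay, 1 = forward.
GAMMA = 0.9
TERMINAL = 3
transitions = {
    0: {0: (0, -0.1), 1: (1, -0.1)},
    1: {0: (1, -0.1), 1: (2, -0.1)},
    2: {0: (2, -0.1), 1: (3, 1.0)},
}


def evaluate_policy(policy, sweeps=100):
    """Prediction: expected discounted reward of each state under a fixed policy."""
    values = {s: 0.0 for s in list(transitions) + [TERMINAL]}
    for _ in range(sweeps):
        for s in transitions:
            nxt, reward = transitions[s][policy[s]]
            values[s] = reward + GAMMA * values[nxt]
    return values


def greedy_policy(values):
    """Control: in each state pick the action with the best one-step lookahead."""
    return {
        s: max(acts, key=lambda a: acts[a][1] + GAMMA * values[acts[a][0]])
        for s, acts in transitions.items()
    }


always_forward = {0: 1, 1: 1, 2: 1}
values = evaluate_policy(always_forward)   # prediction
improved = greedy_policy(values)           # control
```

Prediction answers "how good is each state if I keep behaving this way?", while control uses those estimates to decide how to behave.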

When applied to a self-driving car, an internal map allows the car to locate itself in the envisaged space (environment). A route-planning method is then used to determine the best route through the map, together with an obstacle-avoidance system.

Reward Controversy Issue

A recent paper by the DeepMind team argues that reward is enough to drive actions that exhibit most, if not all, attributes of intelligence. The argument put forward is that “rewards are all that is needed for agents in rich environments to develop multi-attribute intelligence of the sort needed to achieve artificial general intelligence”. The assumption made is that intelligence can be understood as a flexible ability to achieve goals, and that RL formalises the problem of goal-seeking intelligence.

A few years earlier, there was a similar controversy over inference from Big Data, sparked by the claim that “the data deluge makes the scientific method obsolete”. This led the Aspen Institute Roundtable on Information (AIRI) to convene a roundtable meeting in 2009. The meeting was attended by 25 people, including leaders, entrepreneurs and academics, who were called upon to examine the implications of these Big Data inferential issues for science as well as for people.

The outcome of the AIRI roundtable meeting was that Big Data alone is not enough to address inferential issues. The same could be said of the recent inferential claims based on reward alone. By analogy with evolutionary processes, in which genes and the environment interact, reward is the equivalent of the selection process in evolution, which on its own is not sufficient.

These inferential issues are linked to reasoning, since we use reason to form inferences and to take appropriate actions or decisions; reasoning, however, remains a major challenge for AI. In a self-driving vehicle the number of actions may be limited (steering, braking, changing gears and so on), but the decisions can be more complex, requiring rational judgement that goes beyond learning from rewards through trial-and-error interaction with the environment alone.

RL Main Concepts

RL is a system in which success is not given but learned by interacting with the environment through trial and error. The ‘Environment’ is a matrix of all possible alternative values or steps that can be taken, such as all possible moves of all pieces in a game of checkers. The Agent begins by randomly exploring alternative actions in the Environment and is reinforced to exploit the moves that prove successful. The ‘State’ is the current set of moves or values; it is modified after each try as the Agent seeks to optimize the reward via the feedback loop. The Agent must thus learn from its experience of the Environment as it explores the full range of possible States.

Compared with supervised or unsupervised learning, RL has no existing data (historical and/or labelled) to learn from; it has to generate its own data and learn from scratch through trial and error.

Q-Learning in RL

To evaluate the actions taken, RL uses a Q-learning function, an action-value function that gives the value of being in a certain state and taking a certain action in that state. Q stands for the “quality” of an action taken by the agent in a given state. Q-Learning is by far the best-known general-purpose algorithm in RL.

Q-Learning is thus based on the combination of a state, denoted by s, and an action, denoted by a, at time t. In its standard form, the value Q(s_t, a_t) is updated as

Q(s_t, a_t) ← Q(s_t, a_t) + α [r_t + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)]

where r_t is the reward observed for the current state s_t, α (alpha) is the learning rate, with values between 0 and 1, and γ (gamma) is the discount factor, which controls how much weight subsequent rewards, that is the future value estimate, carry relative to the current value.
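As a rough illustration, one update step can be written as a small Python function. This sketch assumes the standard tabular Q-learning rule, in which the new value moves a fraction alpha of the way towards the observed reward plus the discounted best value of the next state; the function name `q_update`, the state labels and the numbers are hypothetical.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update and return the new Q(state, action)."""
    # Best value the agent currently believes it can obtain from the next state.
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    # Move the old estimate a fraction alpha of the way towards the target.
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


q_table = defaultdict(float)            # unseen (state, action) pairs start at 0
q_table[("s1", "forward")] = 2.0        # pretend the agent already learned this
new_value = q_update(q_table, "s0", "forward", reward=1.0, next_state="s1",
                     actions=["stay", "forward"])
```

With these numbers the target is 1.0 + 0.9 × 2.0 = 2.8, so the estimate for ("s0", "forward") moves from 0 to roughly 0.28. A larger alpha would move it further in a single step, and a smaller gamma would make the future value count for less.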

Another function works on top of the Q-learning function to let the agent choose which action to perform. This is known as the agent’s “policy function”, conventionally denoted by π. This function takes the current environment state and returns an action.

Figure 2: Action taken by the agent exploring its environment involving state feedback.

The agent explores the state space, and state-action pair policies are created over episodes in that state space. The policy function selects the next action for the agent according to whether it is to explore or exploit the state space. Under an exploit policy, the function identifies the action with the largest Q-value and returns that action. Under an explore policy, the action is chosen probabilistically as a function of the Q-value, with probability given by the action’s Q-value over the sum of Q-values for the state.
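The two modes of the policy function might look as follows in Python. This is a minimal sketch, not a reference implementation: the `epsilon` switch between exploring and exploiting is an assumption added for illustration, the action names are hypothetical, and the proportional rule assumes non-negative Q-values.

```python
import random


def exploit(q_values):
    """Return the action with the largest Q-value."""
    return max(q_values, key=q_values.get)


def explore(q_values, rng=random):
    """Sample an action with probability Q(s, a) / sum of Q-values for the state.

    Assumes non-negative Q-values; falls back to a uniform choice when they
    sum to zero.
    """
    actions = list(q_values)
    total = sum(q_values.values())
    if total <= 0:
        return rng.choice(actions)
    weights = [q_values[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


def policy(q_values, epsilon=0.1, rng=random):
    """Explore with probability epsilon, otherwise exploit (epsilon is assumed)."""
    if rng.random() < epsilon:
        return explore(q_values, rng)
    return exploit(q_values)


# Q-values for a single state, keyed by action.
q_for_state = {"steer_left": 1.0, "steer_right": 3.0, "brake": 0.0}
chosen = exploit(q_for_state)
```

Under the proportional rule an action whose Q-value is zero is never sampled, which is one reason practical systems often mix in some uniform random exploration as well.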

Figure 3: Example of an environment or state space being explored

RL systems do not require neural nets, but increasingly the most interesting problems, like self-driving cars, involve such large and complex state spaces that the direct-observation policy gradient approach is not practical.

These Q-Learning situations are also frequently defined by their use of images, in particular pixel fields, as unlabeled inputs which are classified using a convolutional neural net (CNN) with some differences from standard image classification.

Reinforcement Learning — Algorithm

Under an RL algorithm, the machine is exposed to an environment (env_sa) in which it trains itself by trial and error. RL learns by taking actions (a) under continuously changing conditions or states (s). It is thus trained to learn from past experience and tries to capture the best possible knowledge to make accurate decisions.
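Putting these pieces together, a complete trial-and-error training loop can be sketched as below. It assumes tabular Q-learning with an epsilon-greedy choice of action on a hypothetical five-cell road; the dynamics, rewards and hyperparameters are invented for illustration and are not taken from any specific package.

```python
import random

N_STATES, GOAL = 5, 4          # hypothetical 1-D road; cell 4 is the goal
ACTIONS = (0, 1)               # 0 = move back, 1 = move forward


def step(state, action):
    """Deterministic toy dynamics: returns (next_state, reward, done)."""
    nxt = min(state + 1, GOAL) if action == 1 else max(state - 1, 0)
    done = nxt == GOAL
    return nxt, (1.0 if done else -0.1), done


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn a Q-table purely from interaction with the environment."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(200):                       # cap on episode length
            if rng.random() < epsilon:             # explore: random action
                action = rng.choice(ACTIONS)
            else:                                  # exploit: best known action
                action = max(ACTIONS, key=lambda a: q[state][a])
            nxt, reward, done = step(state, action)
            # No future value is added once the goal (terminal state) is reached.
            target = reward + (0.0 if done else gamma * max(q[nxt]))
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
            if done:
                break
    return q


q_table = train()
# Greedy action in each non-goal cell after training.
learned_policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(GOAL)]
```

The agent is never told that driving forward is the right thing to do; it only sees the rewards, and the preference for moving forward emerges from the accumulated Q-values.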

RL problems are commonly formalised as a Markov Decision Process (MDP), and there is a package for applying MDPs. Other algorithms and packages are also under development, such as the “ReinforcementLearning” package, which is intended to partially close this gap and offers the ability to perform model-free reinforcement learning in a highly customizable framework.

RL Walkthrough examples

Some RL examples with code are available on the OperAI GitHub, as well as in the book “Edge & Fog Analytics: The New Analytics Interface”.


1. Reinforcement Learning and AI by William Vorhies on September 13, 2016

2. Gregory Piatetsky (December 2017) Exclusive: Interview with Rich Sutton, the Father of Reinforcement Learning. KDnuggets.

3. Kevin Murphy (1998) A brief introduction to reinforcement learning, UBC, Canada

4. Will Knight Reinforcement Learning By experimenting… March/April 2017 MIT Technology Review

5. Reinforcement Learning and AI by William Vorhies on September 13, 2016

6. M. Tim Jones (2017) Train a software agent to behave rationally with reinforcement learning. Cognitive computing, IBM.

7. Nicolas Pröllochs & Stefan Feuerriegel (2017).

8. Herbert Roitblat (July 10, 2021) Training AI: Reward is not enough. VentureBeat.

9. Vermesan et al. (2014)

10. Chris Anderson (2008) The End of Theory: The Data Deluge Makes the Scientific Method Obsolete. Wired magazine.

11. David Silver, Satinder Singh, Doina Precup and Richard S. Sutton (2021), Reward is enough, Artificial Intelligence
