The world is evolving at a pace that was long considered unattainable.
With the advent of world-class technologies and strategic ways to leverage them, scalability has been transformed from a concept into a reality. A good example is the use of advanced machine learning techniques like reinforcement learning (RL), which allow robots (also called ‘agents’) to learn and adapt to complex decision-making tasks.
Most of us have heard of self-driving cars, which use similar techniques to operate without a human driver. There are even more technical and commercial applications in robotics, such as robot soccer leagues, where robots are trained to play soccer using RL techniques.
However, one of the major problems with these tasks is the large number of training trials needed to learn complicated behaviors.
For example, in the case of a self-driving car, one would ideally need to put it through millions of training drives to train it effectively. With a real robot (in this case, the car), this is costly and impractical.
One way to solve this problem is with human feedback: every time the robot takes an action, a human teacher provides feedback on whether the action is correct. The robot can then learn from its mistakes and eventually perform near-accurate decision-making.
In this blog, we will discuss COACH, a framework that employs corrective human feedback to learn a decision-making policy and execute a set of desired actions.
What is COACH?
COrrective Advice Communicated by Humans (COACH) is a reinforcement learning framework developed by Celemin and Ruiz-del-Solar in 2015. It is an algorithm that allows human teachers to shape the optimal policy by providing corrective feedback. COACH uses two regressors (or classifiers): the policy classifier and the human feedback classifier. (When deep neural networks are used to learn the policy and the human feedback, the method is called D-COACH.)
The idea of COACH is to model the human classifier so that it mimics the human teacher in providing corrective feedback to the policy classifier. Apart from bringing the human teacher's insight into the learning process, COACH has the added benefit of not needing any reward function to learn. In other words, we do not need any external reward signal to train COACH. The following is a basic architectural diagram for COACH:
Architecture for COACH (image taken from Celemin et al.)
The agent contains a supervised learner, which takes the human feedback and the state as inputs. The learner updates the policy based on these inputs and learns to predict the action. [Source]
The basic version of COACH uses radial basis function (RBF) networks for the prediction task. Here, a human teacher can provide feedback h of -1 if the teacher is not satisfied with the action taken, +1 if the action is encouraged, and 0 if there is no input from the teacher. Based on this corrective feedback, the policy classifier (which predicts actions) tunes itself toward the desired behavior.
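To make the setup concrete, here is a minimal sketch of RBF features and the feedback signal, assuming a one-dimensional state and hand-picked centers and width (none of these values come from the COACH paper):

```python
import math

def rbf_features(state, centers, width=0.5):
    """Radial basis function features: one Gaussian bump per center.

    `centers` and `width` are illustrative choices, not values
    from the COACH paper.
    """
    return [math.exp(-((state - c) ** 2) / (2 * width ** 2)) for c in centers]

# A 1-D state mapped onto three RBF centers.
features = rbf_features(0.0, centers=[-1.0, 0.0, 1.0])

# The teacher's feedback signal h takes one of three values:
H_FEEDBACK = {-1, 0, +1}
```

The feature closest to the state fires strongest (here, the center at 0.0), which is what lets a linear model over these features represent smooth, localized policies.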
Below, we present the basic version of the COACH algorithm and walk through it step by step:
How does COACH work?
We are given four parameters in COACH: the error magnitude e, initialized to a fixed value of, say, 0.05; the human model's learning rate β; the feature vector f(s), which represents the feature space for each state s; and a proportionality constant c, used later to set the policy's learning rate.
This block contains the crux of the algorithm. Here, the algorithm takes a state sk and computes an action from its features using the policy classifier; the policy classifier has a weight matrix, and ak is the action it predicts for the state sk. Next, the human teacher provides corrective advice h, which, as we saw, can take the value -1, +1, or 0. If the advice is anything other than 0, we compute the action predicted by the human classifier (or human model) from the features of sk. This prediction is denoted by H(sk).
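The policy's prediction step can be sketched as a linear regressor over the state features; the weights and feature values below are made up for illustration:

```python
def predict(weights, features):
    """Linear regressor over state features: a_k = w . f(s_k)."""
    return sum(w * f for w, f in zip(weights, features))

# Hypothetical 3-feature state representation and policy weights.
f_sk = [0.2, 1.0, 0.1]
w_policy = [0.5, -0.3, 0.8]

a_k = predict(w_policy, f_sk)  # action proposed for state s_k
h = -1                         # the teacher disapproves of a_k
```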
Note that the goal of the human classifier (let's call it H) is to simulate the human advice; that is, H tries to predict h, the actual human feedback. In the next step, we obtain the gradient of the human classifier's weight matrix.
Since the human classifier always tries to get as close as possible to h, this gradient is directly proportional to the difference between the actual human feedback (h) and the predicted feedback (H). Intuitively, the larger the difference between the two, the more drastically the weights must change for H to reach its goal faster; hence, the difference should be directly proportional to the gradient (the change in the human model's weight matrix). In other words, the quantity h − H(sk) is the error term for this model. We multiply this quantity by the human model's learning rate β. The f(sk) term comes from the standard gradient descent equation for a linear model. We next update the human model's weights (line 9).
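The human model's update can be sketched as one gradient step on a linear model; β and the feature values below are illustrative choices, not values from the paper:

```python
def human_model_update(w_h, features, h, beta=0.2):
    """One gradient step pulling H(s_k) toward the actual feedback h.

    delta_w = beta * (h - H(s_k)) * f(s_k): the error term times the
    feature vector, as in standard gradient descent on squared error.
    """
    H_sk = sum(w * f for w, f in zip(w_h, features))
    error = h - H_sk
    new_w = [w + beta * error * f for w, f in zip(w_h, features)]
    return new_w, H_sk

w_h = [0.0, 0.0, 0.0]        # human model starts with no opinion
f_sk = [0.2, 1.0, 0.1]
w_h, H_sk = human_model_update(w_h, f_sk, h=-1)
# H started at 0 and the feedback was -1, so the weights move negative.
```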
Now comes one of the most important intuitions on which COACH is built. In line 10, we define the policy classifier's learning rate. Ignore the constant term c for now, since it's just a proportionality constant. The main feature to note here is that the policy classifier's learning rate is directly proportional to |H|, i.e., the absolute value of the human model's predicted output. This is where the two models, the human model and the policy classifier, connect.
The latter's learning rate depends on the former's output. But why? The answer has been explained in detail in Daan's thesis, and I encourage you to take a peek at it. Here, however, I shall attempt to provide an easy-to-understand, intuitive answer.
A deeper exploration
COACH works primarily in continuous action spaces where one can arrive at one action by simply adding or subtracting a certain quantity from another action; in other words, COACH works where the actions are spatially correlated. An example of this is the Lunar Lander game from the Gym environment. You can learn more about Lunar Lander (LL) here. The following animation shows a Lunar Lander being directed to land on the landing pad:
The primary concept to focus on here is that the player can move the space vehicle up, down, left, or right in an attempt to guide the lander towards the landing pad. As you can guess, these actions are spatially correlated: adding a certain positive quantity to an action leads to ‘go right’, while subtracting the same quantity leads to ‘go left’. Thus, the human teacher can increase or decrease an action to give rise to another action in LL. These are the environments where COACH works. Now, the reason we define the learning rate of the policy classifier as proportional to |H| in line 10 can be understood as follows:
- The human classifier simulates the human feedback (-1, +1, or 0) for each input.
- When the human feedback is repeatedly -1, the human classifier’s output H tends to approach -1. Similarly, when the feedback is repeatedly +1, H tends to approach +1. In both cases, |H| approaches a high value, i.e., |H| ≈ 1. Since the learning rate of the policy classifier depends on |H|, the policy classifier learns fast (takes big steps) in this case. This is expected: when the human constantly gives feedback of -1 (or +1), it means they want the lander to go much further left (or right) in LL, say. In other words, the lander is far from the landing pad, and the policy classifier should learn fast and lead the lander toward the extreme left or right.
- When the human feedback alternates (-1, +1, -1, …), H cannot approach a high value and oscillates around 0. Thus, the learning rate of the policy classifier is also small. This case arises when the lander is close to the landing pad and the human is making fine adjustments to the trajectory. Here, the policy learner should take small steps instead of maneuvering drastically to the left or right.
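The two cases above can be checked numerically with a toy one-feature human model; β and the feedback streams below are arbitrary illustrative choices:

```python
def simulate_H(feedback_stream, beta=0.5):
    """Track the human model's scalar output H under a feedback stream.

    A single-feature toy model (f(s) = 1 throughout), so H is just a
    running estimate pulled toward each h; illustrative, not the
    paper's exact setup.
    """
    H = 0.0
    for h in feedback_stream:
        H += beta * (h - H)  # same update rule as the human model
    return H

H_repeated = simulate_H([-1] * 20)         # teacher keeps saying "go left"
H_alternating = simulate_H([-1, +1] * 10)  # teacher fine-tunes near the pad

# |H| ends near 1 for repeated feedback, but stays small when the
# feedback keeps flipping sign.
```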
Hence, making the policy's learning rate directly proportional to |H| makes sense: in line 11, the learning rate is used to define the gradient of the policy classifier’s weight matrix, which in turn is used to update the policy classifier’s weights in line 12. The higher the value of |H|, the faster the policy classifier learns (cases where an extreme left or extreme right action is suggested), and vice versa.
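Putting the pieces together, the adaptive policy step (lines 10-12 as described above) can be sketched as follows; the constants c and e, the weights, and the feature values are illustrative, and the exact update in the paper may differ in detail:

```python
def policy_update(w_p, features, h, H_sk, c=1.0, e=0.05):
    """Adaptive policy step: learning rate alpha = c * |H(s_k)|,
    with the step direction given by the teacher's feedback h and
    the state features.
    """
    alpha = c * abs(H_sk)            # line 10: adaptive learning rate
    delta = [alpha * e * h * f for f in features]  # line 11: gradient
    return [w + d for w, d in zip(w_p, delta)]     # line 12: update

f_sk = [0.2, 1.0, 0.1]
# A confident "go left" signal (|H| near 1) takes a comparatively big step;
# the same call with H_sk = -0.1 would barely move the weights.
w_p = policy_update([0.5, -0.3, 0.8], f_sk, h=-1, H_sk=-0.9)
```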
Modifications of COACH
There are various papers that have used modified versions of COACH. Among these, the two major techniques are D-COACH (or Deep-COACH) and GP-COACH (or Gaussian Process COACH). In the former, Deep Neural Networks are used to implement COACH, while in the latter, Gaussian Process Regression is used. We briefly describe D-COACH here.
D-COACH is quite similar to the basic version of COACH. One of the differences lies in its use of a buffer B. This buffer keeps past data for the purpose of continual learning, i.e., to avoid catastrophic forgetting: the tendency of a neural network to completely forget previously learned information upon learning new information. This occurs when a continuous stream of data feeds the network and the current data has a very different distribution from the past data the network has learned from. The buffer B acts as a lookup table for past data, so that the network can revisit old data even while learning from new data.
The algorithm is quite similar to COACH up to line 7. In line 8, an error term for the t-th timestep is calculated by taking the product of the human feedback h and a static error magnitude term e. The error magnitude simply scales the feedback given by the human teacher: if h = +1 or -1 and e = 0.05, error_t becomes +0.05 or -0.05. This error_t is the update term that is added (in line 9) to the action computed by the policy classifier in line 5. Note that this is the same as asking the lander to go a bit further left or right in our LL example.
We next update the policy network’s weights using the new action label y_label(t) and the state s. In line 11, we apply continual learning and once again update the weights of the network using a past example from the lookup buffer B; we then append the current (state, action) pair to the buffer (it will become past data in time). Lines 13 and 14 simply keep the buffer at a fixed size of K past examples, since maintaining a bigger buffer might lead to increased cost. Lines 15 and 16 retrain the policy classifier at a fixed interval using past buffer data.
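The D-COACH steps described above (lines 8-14) can be sketched as follows. The toy one-weight "network" and its update function are assumptions standing in for a real neural network, and the periodic retraining of lines 15-16 is omitted:

```python
import random
from collections import deque

def dcoach_step(model_update, predict, state, h, buffer, e=0.05):
    """One D-COACH-style update (a sketch of lines 8-14 as described).

    `model_update(state, target)` and `predict(state)` stand in for the
    policy network's SGD step and forward pass; both are assumptions.
    """
    error_t = h * e                      # line 8: scaled feedback
    y_label = predict(state) + error_t   # line 9: corrected action label
    model_update(state, y_label)         # line 10: fit current example
    if buffer:                           # line 11: replay one past example
        s_old, y_old = random.choice(buffer)
        model_update(s_old, y_old)
    buffer.append((state, y_label))      # line 12: store for later replay
    # deque(maxlen=K) already enforces lines 13-14 (fixed buffer size K)
    return y_label

# Toy linear "network": one weight, squared-error SGD step.
w = [0.0]
def predict(s): return w[0] * s
def model_update(s, y, lr=0.1): w[0] += lr * (y - predict(s)) * s

buffer = deque(maxlen=200)   # K = 200, an arbitrary choice
y = dcoach_step(model_update, predict, state=1.0, h=+1, buffer=buffer)
```

Using a `deque` with `maxlen` means the oldest example is dropped automatically once the buffer is full, which matches the fixed-size behavior of lines 13 and 14.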
Enhancing the learning rate
Similar to the basic version of COACH, D-COACH can also use a simulated human teacher (a human model) instead of the actual human feedback h. The details of its implementation can be found in the research paper linked above.
What this means for the human race is clear: we are on the verge of mastering the art of teaching algorithms through feedback and creating self-learning solutions. At some point, just as email replaced the telegraph, humans would act as co-pilots (or even “get off the wheel”) for every model introduced in the market. The world is definitely heading towards a brighter future, unless something like Skynet comes into the picture.
COACH: Celemin, C. and Ruiz-del-Solar, J. (2015). COACH: Learning continuous actions from corrective advice communicated by humans. In IEEE International Conference on Advanced Robotics (ICAR), pages 581–586.
D-COACH: arXiv