Understanding Reinforcement Learning

What is Reinforcement Learning?

Reinforcement learning (RL) involves an RL agent (a computer program) that is given data related to the task we want the agent to learn. The RL agent uses trial-and-error actions to try to achieve a particular goal. So let’s start with a simple example, winning a game of tic-tac-toe, and then we will build up to how the RL agent learns to trade stocks.

At the beginning of each turn, our RL agent is presented with a representation of the tic-tac-toe board. For the purposes of this example, we will use a Python list, as most of our code and research is done in Python. So, if our agent has to go first, it will see a list of nine positions, where '#' marks an empty square and 'X' or 'O' mark the players’ pieces.

board_state = ['#','#','#','#','#','#','#','#','#']

Now, because our RL agent has never played tic-tac-toe before but has to make a move, and all 9 squares are open, the agent will randomly pick a number from 0 to 8 (computers start counting at 0, not 1). Say the agent picks the number 0; that is the top-left square, since the numbers run left to right, top to bottom. Our agent plays X, so the board state will now look like this.

board_state = ['X','#','#','#','#','#','#','#','#']
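
To make this concrete, here is a minimal sketch of picking a random open square; the helper name random_move is mine for illustration, not from any library.

import random

EMPTY = '#'

def random_move(board_state):
    """Pick a random index among the open squares (0 through 8)."""
    open_squares = [i for i, mark in enumerate(board_state) if mark == EMPTY]
    return random.choice(open_squares)

board_state = [EMPTY] * 9
action = random_move(board_state)   # any of 0..8 on an empty board
board_state[action] = 'X'           # our agent plays X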

Now, to start learning, our agent will start keeping track of board states, the actions it took in them, a count of how many times each state was encountered, and the rewards for those actions. The count is used to keep a running average reward for each board state/action pair. The agent stores this information in a state/action/reward table (a Q-table). The rewards come at the end of the game: 5 points for a win, 0 points for a loss and 2 points for a draw. Generically, a row of the Q-table will look like this.

q_table = ['#','#','#','#','#','#','#','#','#',ACTION,COUNT,REWARD]

So after our agent’s first move, the agent’s Q-table would look like this.

q_table = ['#','#','#','#','#','#','#','#','#',0,1,None]

As you can see, the reward entry is still empty (None) – we will talk about the reward at the end of the game below.
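
Putting the pieces together, one way to build such a row in code might look like this sketch; the helper name new_row is mine, not from the article.

def new_row(board_state, action):
    """Build a Q-table row: 9 board squares + action + count + reward (unknown)."""
    return list(board_state) + [action, 1, None]

q_table = [new_row(['#'] * 9, 0)]   # the agent's first recorded move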

At the beginning of each turn, the agent will view the board state, then check its Q-table to determine whether it has seen a state like the current board, acted in it, and received a good reward. If the agent finds the state and a good reward for an action, the agent can take that action. If no such state was observed before, or there is a previous action for this state but the reward was not good, i.e. a low number, the agent can act randomly again. The agent does this until the end of the game. If the agent wins, the reward is 5, and the Q-table entries for every state/action pair visited during the game are updated with a new average based on the count and the previous value; the count for each of those state/action pairs is also incremented. The same is done for a draw or a loss, using rewards of 2 and 0.
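
Here is a sketch of that turn logic and the end-of-game update. The threshold for a "good" reward and the helper names are assumptions for illustration; the article does not pin down an exact policy.

import random

EMPTY = '#'
WIN, DRAW, LOSS = 5, 2, 0
GOOD_REWARD = DRAW   # assumption: "good" means a draw-or-better average

def choose_action(q_table, board_state):
    """Reuse a remembered action with a good average reward, else act randomly."""
    known = [row for row in q_table
             if row[:9] == list(board_state) and row[11] is not None]
    good = [row for row in known if row[11] >= GOOD_REWARD]
    if good:
        return max(good, key=lambda row: row[11])[9]   # best known action
    open_squares = [i for i, mark in enumerate(board_state) if mark == EMPTY]
    return random.choice(open_squares)

def update_game(q_table, history, reward):
    """At game end, average the reward into every state/action pair played."""
    for board_state, action in history:
        row = next((r for r in q_table
                    if r[:9] == list(board_state) and r[9] == action), None)
        if row is None:                    # never seen before: add a new row
            q_table.append(list(board_state) + [action, 1, reward])
        elif row[11] is None:              # logged this game, first reward
            row[11] = reward
        else:                              # repeat visit: update running average
            count = row[10]
            row[11] = (row[11] * count + reward) / (count + 1)
            row[10] = count + 1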

The agent continues playing games and updating its state/action/reward table – soon, without being explicitly programmed with any tic-tac-toe strategy, the RL agent is playing tic-tac-toe as well as any human.
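
The overall loop might look like the following sketch, reusing update_game and the reward constants from above; play_one_game is a hypothetical helper that plays a full game against an opponent and returns the move history and the outcome.

def train(q_table, num_games=100_000):
    """Self-play loop: play a game, then fold its reward into the Q-table."""
    for _ in range(num_games):
        history, outcome = play_one_game(q_table)   # hypothetical helper
        reward = {'win': WIN, 'draw': DRAW, 'loss': LOSS}[outcome]
        update_game(q_table, history, reward)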