[From the Stanford link above]

Unsurprisingly, there's a neural network at the core of things. The neural network $f_\theta$ is parameterised by $\theta$ and takes as input the state $s$ of the board. It has two outputs: a continuous value of the board state $v_\theta(s) \in [-1, 1]$ from the perspective of the current player, and a policy $\vec{p}_\theta(s)$ that is a probability vector over all possible actions.
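As a small illustration, a two-headed network of this shape can be sketched in plain numpy. The layer sizes, the single shared hidden layer, and the board encoding below are all assumptions for a 6x6 Othello-sized game, not the architecture from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for a 6x6 board: 36 input cells, 37 actions (incl. a pass move).
N_CELLS, N_ACTIONS, HIDDEN = 36, 37, 64

# Randomly initialised weights stand in for the learned parameters theta.
W1 = rng.normal(0, 0.1, (N_CELLS, HIDDEN))
Wv = rng.normal(0, 0.1, (HIDDEN, 1))
Wp = rng.normal(0, 0.1, (HIDDEN, N_ACTIONS))

def f_theta(s):
    """Return (v, p): a scalar value in [-1, 1] and a probability vector over actions."""
    h = np.tanh(s @ W1)                 # shared trunk
    v = np.tanh(h @ Wv).item()          # value head, squashed into [-1, 1] by tanh
    logits = h @ Wp
    p = np.exp(logits - logits.max())   # softmax -> probability vector over actions
    p /= p.sum()
    return v, p

s = rng.integers(-1, 2, N_CELLS).astype(float)  # board encoded as -1 / 0 / +1 per cell
v, p = f_theta(s)
```

The two heads share a trunk, so features useful for evaluating a position are reused for choosing moves.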

When training the network, at the end of each game of self-play, the neural network is provided training examples of the form $(s_t, \vec{\pi}_t, z_t)$. $\vec{\pi}_t$ is an estimate of the policy from state $s_t$ (we'll get to how $\vec{\pi}_t$ is arrived at in the next section), and $z_t \in \{-1, 1\}$ is the final outcome of the game from the perspective of the player at $s_t$ (+1 if the player wins, -1 if the player loses). The neural network is then trained to minimise the following loss function (excluding regularisation terms):

$$l = \sum_t \left( (v_\theta(s_t) - z_t)^2 - \vec{\pi}_t \cdot \log(\vec{p}_\theta(s_t)) \right)$$

The underlying idea is that over time, the network will learn what states eventually lead to wins (or losses). In addition, learning the policy would give a good estimate of what the best action is from a given state. The neural network architecture in general would depend on the game. Most board games such as Go can use a multi-layer CNN architecture. In the paper by DeepMind, they use 20 residual blocks, each with 2 convolutional layers. I was able to get a 4-layer CNN network followed by a few feedforward layers to work for 6x6 Othello.
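Written out in code, the unregularised loss over a batch of training examples might look like the following numpy sketch (the array shapes are assumptions):

```python
import numpy as np

def alphazero_loss(v_pred, z, pi, p_pred, eps=1e-12):
    """Sum over examples t of (v_theta(s_t) - z_t)^2 - pi_t . log(p_theta(s_t))."""
    value_loss = np.sum((v_pred - z) ** 2)            # squared error on the game outcome
    policy_loss = -np.sum(pi * np.log(p_pred + eps))  # cross-entropy against the MCTS policy
    return value_loss + policy_loss

# One example where the value prediction matches the outcome and the
# predicted policy matches the MCTS policy exactly:
v_pred = np.array([1.0]); z = np.array([1.0])
pi = np.array([[0.5, 0.5]]); p_pred = np.array([[0.5, 0.5]])
loss = alphazero_loss(v_pred, z, pi, p_pred)  # value term is 0; policy term is log 2
```

Note that the policy term is minimised (not driven to zero) when $\vec{p}_\theta$ matches $\vec{\pi}_t$, since it is a cross-entropy.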

Monte Carlo Tree Search for Policy Improvement

Given a state $s$, the neural network provides an estimate of the policy $\vec{p}_\theta$. During the training phase, we wish to improve these estimates. This is accomplished using a Monte Carlo Tree Search (MCTS). In the search tree, each node represents a board configuration. A directed edge exists between two nodes $i \to j$ if a valid action can cause a state transition from state $i$ to $j$. Starting with an empty search tree, we expand the search tree one node (state) at a time. When a new node is encountered, instead of performing a rollout, the value of the new node is obtained from the neural network itself. This value is propagated up the search path. Let's sketch this out in more detail.

For the tree search, we maintain the following:

- $Q(s, a)$: the expected reward for taking action $a$ from state $s$, i.e. the Q values
- $N(s, a)$: the number of times we took action $a$ from state $s$ across simulations
- $P(s, \cdot) = \vec{p}_\theta(s)$: the initial estimate of taking an action from the state $s$ according to the policy returned by the current neural network.

From these, we can calculate $U(s, a)$, the upper confidence bound on the Q-values, as

$$U(s, a) = Q(s, a) + c_{puct} \cdot P(s, a) \cdot \frac{\sqrt{\sum_b N(s, b)}}{1 + N(s, a)}$$

Here $c_{puct}$ is a hyperparameter that controls the degree of exploration. To use MCTS to improve the initial policy returned by the current neural network, we initialise our empty search tree with $s$ as the root. A single simulation proceeds as follows. We compute the action $a$ that maximises the upper confidence bound $U(s, a)$. If the next state $s'$ (obtained by playing action $a$ on state $s$) exists in our tree, we recursively call the search on $s'$. If it does not exist, we add the new state to our tree and initialise $P(s', \cdot) = \vec{p}_\theta(s')$ and the value $v(s') = v_\theta(s')$ from the neural network, and initialise $Q(s', a)$ and $N(s', a)$ to 0 for all $a$. Instead of performing a rollout, we then propagate $v(s')$ up along the path seen in the current simulation and update all $Q(s, a)$ values. On the other hand, if we encounter a terminal state, we propagate the actual reward (+1 if the player wins, else -1).

After a few simulations, the $N(s, a)$ values at the root provide a better approximation for the policy. The improved stochastic policy $\vec{\pi}(s)$ is simply the normalised counts $N(s, \cdot) / \sum_b N(s, b)$. During self-play, we perform MCTS and pick a move by sampling from the improved policy $\vec{\pi}(s)$. Below is a high-level implementation of one simulation of the search algorithm.
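As a tiny illustration of this last step (with made-up visit counts at the root), normalising the counts and sampling a move looks like:

```python
import numpy as np

counts = np.array([10, 30, 5, 55], dtype=float)  # hypothetical N(s, a) at the root
pi = counts / counts.sum()                       # improved stochastic policy pi(s)

rng = np.random.default_rng(0)
move = rng.choice(len(pi), p=pi)                 # sample a move during self-play
```

Sampling (rather than always taking the argmax) keeps self-play games diverse; in practice a temperature parameter is often used to control how greedy this choice is.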