# A generic implementation of a Nested Monte Carlo Search for single player games

A initial SameGame board has an upper bound of reachable board positions.

Therefore it can be extremely difficult to solve, even for computers.

A very promising approach to solving complex problems such as SameGame is the Nested Monte Carlo Search (NMCS). It is a very simple variation from the family of Monte Carlo Search algorithms, especially suited for single player games.

In this post we will see a generic implementation of the NMCS in Java that can easily be adapted to different problem domains.

But before we come to that, let’s look at how the NMCS works.

## Monte Carlo Tree Search

Monte Carlo Tree Search (MCTS) algorithms are extremely relevant today for state-of-the-art artificial intelligence implementations.

One recent famous example is Google’s AI AlphaGo that mastered the game of Go. AlphaGo utilizes a MCTS for one of its core components.

The MCTS algorithms produce very good solutions to problems where other traditional methods fail.

These problems usually tend to have a very large search space such as games like Go, Chess or SameGame.

### What’s the general idea?

A MCTS is a heuristic algorithm which is based on running many random simulations.

During a simulation (sometimes also called playout, rollout or sample) random legal actions are applied to an initial state until a terminal state is reached.

When the number of simulations approaches infinity, the result will converge to an optimal solution.

Of course, this is impractical.

Therefore the simulation strategy is combined with a tree search to guide the search to more promising branches of the search tree. This often produces very good results within a reasonable amount of time.

## Generic application

Monte Carlo Searches don’t need any domain specific knowledge. They can be applied to any problem that can be described in terms of

• state
• list of legal actions
• score
• function that takes a state and an action and returns a new state
• simulation strategy

## Nested Monte Carlo Search

In this post we will focus on a very basic Monte Carlo algorithm called Nested Monte Carlo Search (NMCS).

The NMCS is especially suited for complex single player games such as SameGame.

### Basic algorithm

A NMCS repeats the following steps until time runs out or until the search terminates.

#### Selection

During the first iteration the initial state (root) is selected.

Otherwise the action leading to the current best score is selected.

#### Simulation

For the previously selected state all legal actions are determined.

For each of these next actions a simulation is played out.

#### Backpropagation

The best result found during the previous simulation step is stored in memory.

### Search levels

The specialty of the NMCS is that during the simulation step the simulations themselves can be nested applications of the NMCS algorithm to all the children of the current state.

The depth of nesting can be described in terms of search levels.

#### Recursive definition

Level Description
0 One playout starting at the current node.
1 During simulation steps a level-0 search is conducted for all children.
n>1 During simulation steps a recursive level-(n-1) search is conducted for all children.

#### Example of a level-2 search step

The following diagram shows one iteration of a level-2 search:

Explanation:

1. Node 1 is selected.
2. Next, a complete 2 level deep subtree of the search space with 1 as its root is traversed. This subtree has 5 leafs, 48.
3. For each leaf a normal simulation is played out.
4. The best result is found by the simulation from node 7. The result is propagated back and the next node from the best result (node 3) is selected for the next iteration.

#### What’s the optimal search level?

Finding a good search level is a matter of manual tuning and experimenting.

It varies from case to case and depends on parameters such as size of the search space, duration of the search, etc.

NMCSs with higher search levels will find better results during simulations, but they will take longer to get deep down the search path.

### Playout policy

A level-0 search is a normal simulation and defined as one playout starting from a given node until a terminal state is reached.

On this path every action is chosen by a playout policy.

Playout policies could either randomly choose actions or they could use domain specific strategies to improve the results.

## Generic implementation

The NMCS algorithm can be applied to every problem, puzzle or game that can be described in terms of the generic interface INmcsState<TState, TAction>.

### Generic interface

The generic type TState represents the state of the problem, e.g. the board position.

The generic type TAction represents an action, e.g. a coordinate on a SameGame board, that can be applied to a state.

The implementation has to provide:

• a score
• a function to get all legal next actions
• a function that takes a state and an action and returns a new state
• an implementation of a playout policy for simulations
public interface INmcsState<TState, TAction> {
double getScore();
boolean isTerminalPosition(); // for convenience (same as findAllLegalActions().size() == 0)
List<TAction> findAllLegalActions();
INmcsState<TState, TAction> takeAction(TAction action);
Pair<Double, List<TAction>> simulation();
}


The implementation must be immutable. That means none of the operations is allowed to change the state of the object that the operation is called on.

If the state has to be updated, instead of updating the internal state a new updated version will have to be returned.

Of course, the implementation could be a wrapper around an existing mutable class where a copy is created before mutating. This would be an application of the adapter design pattern.

### Generic algorithm

The NMCS algorithm is defined by the function executeSearch.

It takes three parameters:

state
of type INmcsState<TState, TAction> which represents the current state
level
of type int which specifies the search level
isCanceled
of type Supplier<Boolean> which allows to inject code for gracefully stopping the execution, e.g. a timer

The return value is a tuple of a score of type Double and a list of actions of type List<TAction>. The list of actions describes the path from the root to the terminal state.

public static <TState, TAction> Pair<Double, List<TAction>> executeSearch(INmcsState<TState, TAction> state,
final int level, final Supplier<Boolean> isCanceled) {

// terminating case
if (level <= 0)
return state.simulation();

Pair<Double, List<TAction>> globalBestResult = Pair.of(state.getScore(), new LinkedList<TAction>());
final List<TAction> visitedNodes = new LinkedList<TAction>();

while (!state.isTerminalPosition() && !isCanceled.get()) {

Pair<Double, List<TAction>> currentBestResult = Pair.of(0.0, new LinkedList<TAction>());
TAction currentBestAction = null;

for (TAction action : state.findAllLegalActions()) {
final INmcsState<TState, TAction> currentState = state.takeAction(action);
// recursion
final Pair<Double, List<TAction>> simulationResult = executeSearch(currentState, level - 1, isCanceled);

if (simulationResult.item1 >= currentBestResult.item1) {
currentBestAction = action;
currentBestResult = simulationResult;
}
}

if (currentBestResult.item1 >= globalBestResult.item1) {
globalBestResult = currentBestResult;
} else {
currentBestAction = globalBestResult.item2.get(visitedNodes.size());
}

state = state.takeAction(currentBestAction);
}
return globalBestResult;
}


The algorithm terminates either when time runs out or when a node is selected during the selection step that has no child node and is a terminal state.

Here is an example of calling the executeSearch function:

final int[][] board = BoardGenerator.generateBoard("1,1,1;2,2,2;3,3,3;");
final int level = 2;
final long maxRunningTimeMs = 2 * 60 * 1000;

final INmcsState<SGBoard, Point> state = new SGNmctsState(board);

final long endTimeMs = System.currentTimeMillis() + maxRunningTimeMs;
final Pair<Double, List<Point>> result = NestedMonteCarloSearch.executeSearch(state, level, () -> {
return System.currentTimeMillis() > endTimeMs;
});


The complete code is available on GitHub.

## Enhancements and improvements

Here is a selection of potential improvements:

• Root parallelization

By running independent searches from the root, the risk of getting trapped in a local maximum is reduced.

• Guided playout policy

For some problem domains guided playouts may lead to better results than random sampling.

• More emphasis on exploration as opposed to exploitation

It could be profitable to try unpromising actions occasionally because they still may lead to the total maximum.

• Parallelization and performance optimization

More efficient implementations will provide better results.

• Choose randomly between equally good results

The implementation shown above is intentionally kept simple. If multiple playouts yield the same score, the last playout will always be selected. A random selection between equally good paths would be the superior strategy.