CCRL 40/15 Downloads and Statistics, January 16, 2021

Testing summary: 1'206'633 games played by 2'770 programs. White wins: 414'929 (34.4%); Black wins: 305'861 (25.3%); draws: 485'843 (40.3%); White score: 54.5%.

## Correlation between engines

### Introduction

We analyze all games played in our study in order to compare the playing styles of different engines. This is possible because we record not only the moves played, but also the position evaluation, the thinking time, and the expected opponent's move. By comparing those parameters on consecutive moves in a game, we can see whether two engines are similar or different.

## Ponder hit statistics

Here you can see statistics of expected moves, also called "ponder hits", in CCRL games. When two engines can predict most of each other's moves in their match, it means that they share a similar understanding of chess and a similar way of thinking. Ponder hit statistics show how similar they are. This data can be collected from nothing more than a database of played games, so it is a convenient way to find which engines are similar to or different from each other.

### The exact method of counting ponder hits

It looks simple: just count the predicted moves and divide by the number of all moves. It is simple, but there are a few things to consider. First, there are opening moves, where engines don't think. We don't count such moves in this experiment. Second, there are forced moves, where there is no other choice. Such moves should not be counted either. We detect such moves by the time spent on them, so all moves made in 0:00 seconds are not used for this analysis.

Then, there are tablebase moves and mating lines. Such lines are characterized by many forced moves, but they also contain many situations where it does not matter what is played. As a result, ponder hit statistics are not very meaningful in such lines. They are much more interesting in middlegame positions, where the move choice actually shows an engine's playing style and understanding. To limit this experiment to the middlegame only, we exclude all moves made with an evaluation of ±9 pawns or more (that is, 9 pawns or more in favor of either side).

Finally, there are boring 50-move lines where engines don't know what to do but still try to avoid a draw. In those lines engines play shuffle chess, and any ponder hit analysis is meaningless. What's worse, right at the 50th move they will move a pawn to avoid the draw, and the shuffle chess continues for another 50 moves. Such cases are difficult to detect automatically, so after a few experiments we decided to simply ignore drawn games completely. So, only decisive games are used for correlation analysis in our study.
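The counting rules above can be sketched in a few lines of Python. This is only an illustration, not the code used for the CCRL tables: the `Move` record, its field names, and the result string `"1/2-1/2"` are assumptions made for the example, as is the choice to apply the filters to the move being predicted rather than to the move that carried the prediction.

```python
from dataclasses import dataclass
from typing import Optional

EVAL_LIMIT = 9.0  # pawns; moves at or beyond this evaluation are excluded


@dataclass
class Move:
    # Hypothetical record layout, assumed for this sketch.
    played: str                    # move actually played, e.g. "Nf3"
    expected_reply: Optional[str]  # reply the mover expected, if recorded
    eval_pawns: Optional[float]    # reported evaluation; None for book moves
    seconds: int                   # thinking time in whole seconds
    from_book: bool                # True for opening-book moves


def is_counted(move: Move) -> bool:
    """Apply the exclusions described in the text to a single move."""
    if move.from_book:
        return False               # opening move, no thinking involved
    if move.seconds == 0:
        return False               # treated as a forced move
    if move.eval_pawns is None or abs(move.eval_pawns) >= EVAL_LIMIT:
        return False               # tablebase or mating territory
    return True


def ponder_hit_rate(games) -> float:
    """games: iterable of (result, moves) pairs; drawn games are skipped."""
    hits = total = 0
    for result, moves in games:
        if result == "1/2-1/2":
            continue               # only decisive games are used
        for prev, cur in zip(moves, moves[1:]):
            if prev.expected_reply is None or not is_counted(cur):
                continue
            total += 1
            if prev.expected_reply == cur.played:
                hits += 1
    return hits / total if total else 0.0
```

Calling `ponder_hit_rate` with the games of one pairing gives the fraction of counted moves that the opponent predicted. A per-engine breakdown would need the side to move tracked as well, which this sketch leaves out.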

## Evaluation difference

Here you can see a comparison of the position evaluations reported by different engines. Each engine reports its position evaluation when it makes a move. Then the opponent thinks, makes a move, and reports its evaluation as well. By comparing those evaluations we can see how similar the thinking of two engines is.

### The exact method of finding evaluation difference

It is easy to find the average evaluation difference for two engines: it is just the mean of all differences between the evaluations before and after a move, computed for all moves in their games. For example, engine A plays e4 with an evaluation of +0.15, then engine B plays c5, evaluating the position as +0.08, then engine A plays Nf3 with an evaluation of +0.25. For this sequence of three moves the average evaluation difference can be computed as ((0.15-0.08) + (0.25-0.08))/2 = 0.12 (in pawns).
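The arithmetic of this example can be reproduced with a short Python sketch. The function name is made up for illustration, and taking the absolute value of each difference is an assumption: the text does not say whether signed or absolute differences are averaged, and for this example both give the same result. All evaluations are assumed to be reported from the same point of view.

```python
def average_eval_difference(evals):
    """Mean difference between evaluations reported on consecutive moves.

    evals: evaluations in pawns, in game order, alternating between
    the two engines.
    """
    # Compare each evaluation with the one reported on the next move.
    diffs = [abs(a - b) for a, b in zip(evals, evals[1:])]
    return sum(diffs) / len(diffs)


# e4 (+0.15), c5 (+0.08), Nf3 (+0.25)
print(round(average_eval_difference([0.15, 0.08, 0.25]), 2))  # prints 0.12
```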

So far so good, but of course we should not just use all moves. We don't use opening moves and forced moves, and we don't use moves where either side has an evaluation beyond ±9 pawns. We also don't use drawn games at all, because of the 50-move sequences. So, we limit this study to the same set of moves we use for ponder hit analysis.

There is one more issue to consider here. Ideally we should compare how two engines evaluate exactly the same position. This is not possible when we use the game database as our input data, because each engine evaluates the position on its own turn, after the opponent has already moved. But what happens if an unexpected move is played? Suppose engine A moves and reports its evaluation. A expects a certain move from engine B, so A's evaluation is based on the assumption that B will make that move. If B makes a different move, a different position occurs on the board, not the one that A was expecting. So it does not seem right to compare the evaluations of A and B in that case, because they were thinking about different lines. Because of this we use only expected moves in this experiment. This is about half the number of moves used for ponder hit statistics.
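A minimal Python sketch of this restriction follows, under the same caveats as any illustration: the `Ply` record and its field names are hypothetical, absolute differences are an assumption, and the other filters (opening moves, forced moves, the ±9 pawn limit, drawn games) are left out to keep the example short.

```python
from typing import NamedTuple, Optional


class Ply(NamedTuple):
    # Hypothetical record layout, assumed for this sketch.
    played: str                    # move actually played
    expected_reply: Optional[str]  # reply the mover expected, if recorded
    eval_pawns: float              # evaluation reported with the move


def eval_difference_on_expected(plies):
    """Average evaluation difference over correctly predicted moves only."""
    diffs = []
    for prev, cur in zip(plies, plies[1:]):
        # Compare the two evaluations only when the reply was the one
        # the previous mover expected, so both engines were looking at
        # the same line.
        if prev.expected_reply == cur.played:
            diffs.append(abs(prev.eval_pawns - cur.eval_pawns))
    return sum(diffs) / len(diffs) if diffs else None
```

The function returns `None` when no move in the input was predicted, so a pairing with no usable data can be told apart from a pairing whose evaluations agree exactly.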

### Evaluation difference cross-tables

Created in 2005-2013 by CCRL team. Last games added on January 16, 2021.