For three months now, two tortured devices have been running simultaneously at full speed, almost without interruption, to help build this unique Android rapid chess rating list.
Working in a safe zone, inside Android, away from Windows and from any competition, doesn't mean I can allow myself to give up on the accuracy of the experiment. I do this not only to explore the strength of mobile chess programs but also to challenge the limits of statistical science, Don Quixote style! Who says I can't knock down a windmill?
True... in the beginning, my sceptical engineering mind pushed me hard toward an impossible mission: obtaining a reliable list with only 20 games per engine. I admit I simply lost that bet. No way!
There was no way around the need for enough samples, despite my efforts to vary openings and opponents and to randomize everything as much as possible to simulate a long run. Unfortunately, I had to extend the experiment to 150 games per engine before the results started to say something meaningful. About 100 games seems to be the minimum at which the error margins fall within +/-60 Elo; beyond that point, the fluctuations stabilize significantly. The graphical Elo trends by round in the image below show what happens in the long run across a wide population of engines.
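For readers who want to check the +/-60 Elo figure, here is a minimal sketch that estimates the 95% error margin of a rating measured over N games using a simple normal approximation. It assumes a score near 50% and a 30% draw rate (roughly typical of the mid-table engines in the list); both are illustrative assumptions, and BayesElo derives its own margins from a Bayesian fit over all games, so its numbers will not match exactly.

```python
import math

Z_95 = 1.96  # two-sided 95% quantile of the normal distribution (assumed confidence level)


def margin_elo(games, score=0.5, draw_rate=0.3):
    """Approximate 95% error margin (+/- Elo) for a rating measured over `games` games.

    Each game scores 1 (win), 0.5 (draw) or 0 (loss). With mean score p and
    draw rate d, the win rate is p - d/2, so the per-game variance is
    (p - d/2) + d/4 - p**2.
    """
    win_rate = score - draw_rate / 2.0
    variance = win_rate + draw_rate / 4.0 - score ** 2
    se_score = math.sqrt(variance / games)  # standard error of the mean score
    # Slope of the logistic Elo curve at p: converts a score error into Elo
    slope = 400.0 / (math.log(10) * score * (1.0 - score))
    return Z_95 * slope * se_score


if __name__ == "__main__":
    for n in (20, 50, 100, 150):
        print(f"{n:4d} games: +/-{margin_elo(n):.0f} Elo")
```

Under these assumptions the margin is roughly +/-127 Elo after 20 games, +/-57 after 100 and +/-47 after 150. Because the margin only shrinks with the square root of the number of games, 20 games cannot separate engines reliably.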
Nothing is clear before 100 games played.
Grrr! Why do they always keep moving up n down!?
This was another lesson in statistics for me. As for the ranking, the rating spread has definitely shrunk compared to my previous 5 sec/move blitzoid list. I think that's quite reasonable: more thinking time helps weaker engines resist stronger ones better or, put differently, it makes life a little harder for the top engines.
I've also found that engines with a big gap between their blitz and rapid results often turn out to have technical problems or bugs. For instance, IvanHoe performed worse at rapid, and after careful analysis it became obvious that this engine cannot use any hash memory, probably because of a bad compile. With more time per move an engine needs more help from its hash tables, but this build uses none, so its performance logically goes down.
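To illustrate why a missing hash table hurts more as thinking time grows, here is a toy sketch (not IvanHoe's code): a negamax search of a simple take-1-2-or-3-stones game, with and without a Python dictionary acting as a transposition table. Real engines use fixed-size tables keyed by Zobrist hashes and also store search depth and bounds; the dictionary and the toy game are simplifying assumptions.

```python
def negamax(pile, table, counter):
    """Return +1 if the side to move wins the take-1-2-or-3 game, else -1.

    `table` is a dict used as a toy transposition table (None disables it);
    `counter` is a one-element list that counts every node visited.
    """
    counter[0] += 1
    if table is not None and pile in table:
        return table[pile]  # position already searched: reuse the stored result
    if pile == 0:
        value = -1  # opponent took the last stone, so the side to move has lost
    else:
        value = max(-negamax(pile - k, table, counter) for k in (1, 2, 3) if k <= pile)
    if table is not None:
        table[pile] = value
    return value


def nodes(pile, use_table):
    """Count nodes visited when solving a pile of the given size."""
    counter = [0]
    negamax(pile, {} if use_table else None, counter)
    return counter[0]


if __name__ == "__main__":
    for pile in (10, 15, 20):
        print(pile, nodes(pile, use_table=False), nodes(pile, use_table=True))
```

In this toy, the same pile is reached through many different move orders, so without the table the node count grows exponentially with depth (600 nodes for a pile of 10), while with it each position is searched only once (28 nodes). A deeper search, which is what extra thinking time buys, therefore gains more from a working hash table and loses more when it is missing.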
My final comment is about Komodo 8, which simply disappointed. I'm sure it can't be "statistical noise" (Oh! What a popular term nowadays!) anymore. Android looks different from Windows here, and my guess is that it's somehow linked to how the engine binary is compiled. At present, Komodo looks strong enough to threaten Stockfish's TCEC crown on 16 cores of a dual-Xeon monster PC, but here in the modest Android 32-bit arena, Stockfish 5, already outdated code from May 2014, still clearly rules over all other engines.
If you ask me which is the strongest Android engine today, my confident answer is Stockfish 5!
I must also report that even the 12-Oct-14 development version, delivered with Droidfish 1.55, plays weaker than Stockfish 5 (details to come soon). On Windows, a 20-25 Elo gain over SF5 is confirmed. On Android, however, the picture is the opposite, perhaps because different compilers were used.
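To put a 20-25 Elo gap into perspective, this sketch converts an Elo difference into an expected score with the standard logistic Elo formula. It also estimates, with the same kind of normal approximation (50% score, 30% draws, 95% confidence, all of them assumptions), how many games it takes before a rating's error margin shrinks to the size of that gap.

```python
import math


def expected_score(elo_diff):
    """Expected score of the side that is `elo_diff` points stronger (logistic Elo model)."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


def games_needed(elo_gap, draw_rate=0.3, z=1.96):
    """Rough number of games before a rating's +/- margin shrinks to `elo_gap`.

    Normal approximation around a 50% score: per-game variance 0.25 - d/4,
    and about 695 Elo per unit of score (the slope of the Elo curve at 50%).
    """
    variance = 0.25 - draw_rate / 4.0
    slope = 400.0 / (math.log(10) * 0.25)
    return math.ceil(variance * (z * slope / elo_gap) ** 2)


if __name__ == "__main__":
    for gap in (20, 25):
        print(f"{gap} Elo: expected score {expected_score(gap):.3f}, "
              f"~{games_needed(gap)} games")
```

Under these assumptions a 20-25 Elo edge corresponds to an expected score of only about 53-54%, and it takes roughly 500 to 800 games before a single rating's margin is that narrow, so small differences between versions need long matches to settle.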
Now it's time to stop the blah-blah and let the list talk. You will notice that this time the number of cores and the operating system are also listed. Could that be an early warning about intruders from other operating systems? Just a wink here...
BAYES ELO RATINGS BASED ON 4036 GAMES BY 56 PROGRAMS
Key: c = cores, O/S = operating system (And32 = 32-bit Android), elo = BayesElo rating, +/- = upper/lower error margins, gam = games played, sco = score, oppo = average rating of opponents, drw = draw rate.
## Name                    c  O/S     elo   +   -  gam  sco  oppo  drw
01 Stockfish 5             4  And32  3139  51  49  142  76%  2959  36%
02 Komodo 8                4  And32  3083  48  46  142  67%  2971  44%
03 Critter 1.6a            4  And32  3004  45  45  142  55%  2972  50%
04 Firenzina 2.4.1         4  And32  2982  45  45  142  48%  2992  48%
05 BlackMamba 2.0          4  And32  2906  47  47  148  56%  2857  43%
06 RobboLito 0.085e4l      1  And32  2855  49  49  148  51%  2834  32%
07 Senpai 1.0              4  And32  2854  50  49  142  57%  2802  35%
08 Komodo32 3 AB           1  And32  2845  48  48  144  50%  2849  42%
09 Texel 1.05a8            1  And32  2794  49  49  148  56%  2745  28%
10 Gaviota v1.0-d          4  And32  2763  47  48  144  47%  2792  38%
11 Toga II 3.0             1  And32  2682  47  47  146  49%  2683  37%
12 Arasan 15.2 JA          4  And32  2664  47  47  152  53%  2646  34%
13 Deuterium v14.3.34.130  1  And32  2639  45  45  168  52%  2622  30%
14 DiscoCheck 4.3          1  And32  2623  47  47  156  47%  2649  28%
15 GNU Chess 5.50-32       1  And32  2613  47  47  148  51%  2610  36%
16 IvanHoe 9.46b           4  And32  2609  47  47  156  49%  2612  33%
17 Rhetoric 1.4.1          1  And32  2571  47  47  148  48%  2584  34%
18 RedQueen 1.1.3 TCEC JA  4  And32  2524  49  50  144  43%  2578  27%
19 Crafty_23.4.JA          1  And32  2508  49  49  144  48%  2520  25%
20 Rodent 1.00             1  And32  2488  49  48  148  51%  2472  26%
21 Alfil 12.10             1  And32  2485  49  48  140  51%  2474  29%
22 Daydreamer 1.75 JA      1  And32  2467  48  48  148  44%  2513  28%
23 Rotor 0.7a              1  And32  2448  48  48  140  50%  2450  31%
24 cheng3 1.07 JA          1  And32  2422  49  49  146  53%  2399  25%
25 GarboChess 3            1  And32  2393  48  48  144  46%  2426  28%
26 DanasahZ_0.4.JA_xb      1  And32  2392  50  50  144  48%  2405  26%
27 Sloppy_0.23.JA_xb       1  And32  2386  48  48  146  52%  2369  33%
28 Scorpio_2.7.JA_xb       1  And32  2381  48  48  146  51%  2374  25%
29 GNU Chess 6.0.2         1  And32  2373  49  48  144  55%  2338  24%
30 Tucano_1.04.AB_xb       1  And32  2324  51  51  142  54%  2297  17%
31 Pepito v1.59            1  And32  2295  47  47  150  48%  2317  29%
32 BetsabeII_1.30.JA_xb    1  And32  2289  51  50  150  57%  2234  15%
33 GreKo_9.0.JA_uci        1  And32  2286  47  48  150  47%  2308  28%
34 Typhoon_1.0.r358.JA_xb  1  And32  2275  48  49  146  49%  2281  26%
35 Diablo 0.5.1b JA        1  And32  2252  50  50  144  51%  2243  18%
36 Sungorus 1.4 JA         1  And32  2203  50  51  146  43%  2256  21%
37 Phalanx_XXIII.JA_xb     1  And32  2195  53  52  144  57%  2131  13%
38 Olithink_5.3.2.JA_xb    1  And32  2170  52  53  144  51%  2139  20%
39 TJchess 1.1U            1  And32  2134  50  51  140  46%  2158  21%
40 Natwarlal_0.14.JA_xb    1  And32  2133  52  52  142  52%  2106  19%
41 Myrddin_0.86.JA_xb      1  And32  2110  50  51  144  46%  2142  21%
42 Jazz 6.40 JA            1  And32  2091  50  51  144  48%  2107  23%
43 Scidlet_2.61b2.JA_xb    1  And32  2067  53  53  140  55%  2013  18%
44 KmtChess_1.21.JA_xb     1  And32  2049  52  52  140  50%  2044  21%
45 AdroitChess0.4 JA       1  And32  1952  54  54  138  52%  1926  17%
46 Sjeng_1.12.JA_xb        1  And32  1877  57  58  138  47%  1899  12%
47 BikJump v1.8            1  And32  1860  55  55  138  53%  1812  20%
48 ZCT-0.3.2500            1  And32  1780  60  61  138  51%  1753   9%
49 Leonidas_r83.JA_xb      1  And32  1771  57  57  138  56%  1696  14%
50 Sjaak_4.68.JA_xb        1  And32  1735  58  58  138  54%  1669  13%
51 Zzzzzz_3.5.1.JA_xb      1  And32  1566  58  58  138  51%  1569  22%
52 Tscp_1.8.1.AB_xb        1  And32  1536  60  60  138  45%  1577  12%
53 Rocinante 2.0 JA        1  And32  1518  62  62  138  49%  1535   7%
54 VIRUTOR CHESS 1.1.1     1  And32  1372  59  60  138  42%  1447   9%
55 Chess for Android       1  And32  1283  58  61  138  33%  1448  13%
56 Simplex 0.9.8           1  And32  1062  73  85  138  11%  1496   5%
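The elo, sco and oppo columns are related. As a rough cross-check, and not BayesElo's actual method, the sketch below computes the classic logistic performance rating: the average opponent rating plus 400 * log10(score / (1 - score)). BayesElo fits all games jointly with its own draw model and prior, so these results land close to, but not exactly on, the listed ratings.

```python
import math


def performance_rating(avg_opponent, score):
    """Logistic performance rating: opponents' average plus the Elo gap implied by the score."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return avg_opponent + 400.0 * math.log10(score / (1.0 - score))


# (name, oppo, sco) copied from the table; scores there are rounded percentages
ROWS = [
    ("Stockfish 5", 2959, 0.76),
    ("Komodo 8", 2971, 0.67),
    ("Critter 1.6a", 2972, 0.55),
]

if __name__ == "__main__":
    for name, oppo, sco in ROWS:
        print(f"{name}: {performance_rating(oppo, sco):.0f}")
```

For Stockfish 5 (76% against a 2959 average) this gives about 3159 versus the listed 3139, and for Komodo 8 about 3094 versus 3083.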
Rapidroid test platform specification:
* Samsung Galaxy Note II @ 1.6 GHz x 4 cores + 256 MB hash for SP & MP Android programs,
* Polypad 1010IPS tablet @ 1.6 GHz x 2 cores + 128 MB hash for SP Android programs,
* HTC Diamond @ 528 MHz for Windows Mobile programs, with 16 MB hash,
* i7 M620 @ 2.67 GHz + Arena 3.5 + 2 GB hash tables for Windows x64 programs,
* iPod Touch 64 GB @ 600 MHz for iOS programs,
* DOSBox 0.74 used to run DOS programs,
* WinVICE used to run Commodore 64 programs,
* Messtiny UCI adapters or CB-Emu2014 used to emulate Mephisto programs,
* Engines' own books disabled and replaced by 20-ply openings taken from Adam Hair's 10-move book, whenever possible,
* Openings selected for maximum variety, with queens on the board, no check or capture at the last ply, and preferably evaluated between +0.15 and +0.39 by Stockfish and Komodo,
* Each opening played twice with colors reversed, whenever possible,
* No repeated openings or twin games between the same two programs,
* Tablebases and pondering off,
* Time control: 15 to 30 sec/move or the closest possible, identical for both programs.
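The opening rules above can be sketched as follows. This is a hypothetical illustration, not the actual tooling behind the list: the `Opening` record, its fields and the engine names are assumptions, and the evaluation window is applied as a hard filter even though the criteria only say "preferably". Each pair of engines receives distinct openings, and every opening is played twice with colors reversed.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Opening:
    """Hypothetical record for one candidate 20-ply opening line."""
    moves: str
    eval_sf: float           # Stockfish evaluation, in pawns, from White's side
    eval_komodo: float       # Komodo evaluation, in pawns, from White's side
    queens_on: bool          # both queens still on the board
    last_ply_tactical: bool  # last ply is a check or a capture


def acceptable(o, low=0.15, high=0.39):
    """Keep quiet lines with queens on and both evaluations inside [low, high]."""
    return (o.queens_on and not o.last_ply_tactical
            and low <= o.eval_sf <= high
            and low <= o.eval_komodo <= high)


def schedule(engines, openings, encounters_per_pair=1):
    """Return (white, black, opening) games: distinct openings per pair, each played twice."""
    if not openings or encounters_per_pair > len(openings):
        raise ValueError("not enough openings to avoid repeats within a pair")
    games = []
    cursor = 0  # cycles through the book so consecutive encounters use different lines
    for a, b in combinations(engines, 2):
        for _ in range(encounters_per_pair):
            opening = openings[cursor % len(openings)]
            cursor += 1
            games.append((a, b, opening))  # a has White
            games.append((b, a, opening))  # same opening, colors reversed
    return games


# Hypothetical usage with made-up lines and engine names
book = [
    Opening("line A", 0.20, 0.25, True, False),
    Opening("line B", 0.60, 0.55, True, False),   # too unbalanced
    Opening("line C", 0.30, 0.18, False, False),  # queens already traded
    Opening("line D", 0.35, 0.30, True, False),
]
pool = [o for o in book if acceptable(o)]
games = schedule(["EngineA", "EngineB", "EngineC"], pool, encounters_per_pair=2)
```

Here only lines A and D pass the filter, and each of the three pairs plays both of them once with each color, for 12 games in total.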
Great work, Gurcan. Thanks for the list.