Creating my first test
- Keep your final change as small and neat as possible: https://tirania.org/blog/archive/2010/Dec-31.html
- Even if a change passes the fishtest tests, it is not a guarantee that it will be merged. Patches that add significant complexity will need to show a big benefit to be considered.
- Never submit a test that is a bundle of multiple ideas. Submit each idea individually as its own test.
- Participate in the fishcooking forum and the Stockfish Discord channel. These are the places to communicate with moderators and other developers. If your patch is complex, prior to making a pull request, please create a new thread on fishcooking and explain to others how your patch works. You will probably get valuable feedback.
To write patches for Stockfish and test them in the framework, you will need:
- a recent C++ compiler
- a working git on your system
- a GitHub account
- a git client on your computer (we recommend GitHub Desktop, for its simplicity)
Create a fork of the official Stockfish repository at https://github.com/official-stockfish/Stockfish and create a git clone of your fork. GitHub has good help on this process: https://help.github.com/articles/fork-a-repo
Before creating a new patch, you have to make sure that your master branch is up to date and has all the newest commits from the official Stockfish master branch. You can use the following script to ensure this (run it each time the official master branch changes). Save the script in a file named "sync-with-official.sh", then type the following command in a terminal: `sh sync-with-official.sh`

```sh
#!/bin/sh
# change to the directory containing this script
# (dirname also works when the script is invoked as "sh sync-with-official.sh")
cd "$(dirname "$0")" || exit 1
# go to the src directory for Stockfish on my hard drive (edit accordingly)
cd ./GitHub/stockfish/src || exit 1
echo
echo "This command will sync with master of official-stockfish"
echo
echo "Adding official Stockfish's public GitHub repository URL as a remote in my local git repository..."
# "git remote add" fails harmlessly if the remote already exists; set-url then updates it
git remote add official https://github.com/official-stockfish/Stockfish.git
git remote set-url official https://github.com/official-stockfish/Stockfish.git
echo
echo "Going to my local master branch..."
git checkout master
echo
echo "Downloading official Stockfish's branches and commits..."
git fetch official
echo
echo "Updating my local master branch with the new commits from official Stockfish's master..."
# WARNING: this discards any local commits on master
git reset --hard official/master
echo
echo "Pushing my local master branch to my online GitHub repository..."
git push origin master --force
echo
echo "Compiling new master..."
make clean
make build ARCH=x86-64-modern -j 8
make net
echo
echo "Done."
```
- In a terminal, browse to the Stockfish `src` folder.
- Create a branch for your work: `git checkout -b my_branch`
- Edit the Stockfish source to make your changes.
- Compile the source with `make build ARCH=x86-64`
- Get the branch signature from `./stockfish bench`, which prints a line like `Nodes searched : 4190940`.
- Commit your changes locally with `git commit -am "My commit message"`
- Push your branch to GitHub: `git push origin my_branch`
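If you want to grab the signature automatically, for example in a script that has captured the output of `./stockfish bench` (capture both streams, e.g. with `2>&1`, since the summary may be printed on stderr), a small parser is enough. The sketch below is a hypothetical helper (`bench_signature` is our name, not part of Stockfish); it assumes the signature is the integer after the colon on the line starting with `Nodes searched`, and it tolerates extra spaces before and after the colon.

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <sstream>
#include <string>

// Hypothetical helper: return the node count from the "Nodes searched" line
// of `./stockfish bench` output, or std::nullopt if no such line is found.
std::optional<unsigned long long> bench_signature(const std::string& output) {
    std::istringstream in(output);
    std::string line;
    while (std::getline(in, line)) {
        if (line.rfind("Nodes searched", 0) != 0)  // line must start with the label
            continue;
        std::size_t colon = line.find(':');
        if (colon == std::string::npos)
            continue;
        std::size_t pos = colon + 1;
        while (pos < line.size() && std::isspace(static_cast<unsigned char>(line[pos])))
            ++pos;  // skip spaces after the colon
        std::size_t end = pos;
        while (end < line.size() && std::isdigit(static_cast<unsigned char>(line[end])))
            ++end;  // collect the digits of the signature
        if (end > pos)
            return std::stoull(line.substr(pos, end - pos));
    }
    return std::nullopt;
}
```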
To test new nets, first upload them to fishtest (account needed). By uploading, you license your network under a CC0 license. The networks must follow the naming convention `nn-SHA.nnue`, where SHA is the first 12 hexadecimal digits of the SHA-256 checksum of the nn.nnue data file (`sha256sum nn.nnue | cut -c1-12`, or `shasum -a 256` on systems without `sha256sum`).
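The naming rule is easy to get wrong by hand. A minimal C++ sketch, assuming you already have the 64-character hex SHA-256 digest of the net file (for example from `sha256sum`); `net_file_name` is a hypothetical helper, not part of Stockfish or fishtest:

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: build the fishtest net name "nn-<first 12 hex digits>.nnue"
// from the SHA-256 digest of the net file, given as a hex string.
std::string net_file_name(const std::string& sha256_hex) {
    if (sha256_hex.size() < 12)
        throw std::invalid_argument("digest too short");
    std::string prefix;
    for (std::size_t i = 0; i < 12; ++i) {
        unsigned char c = static_cast<unsigned char>(sha256_hex[i]);
        if (!std::isxdigit(c))
            throw std::invalid_argument("digest is not hexadecimal");
        // sha256sum prints lowercase hex; normalize in case the input is uppercase
        prefix += static_cast<char>(std::tolower(c));
    }
    return "nn-" + prefix + ".nnue";
}
```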
On a Stockfish git branch, change the default value of `EvalFileDefaultName` (see evaluate.h) to the name of this net and proceed as usual with STC and LTC SPRT tests of this branch. Do not add an `EvalFile` test option: this is not supported on fishtest, and only the default net will be visible to the engine. Nets that require different features or have a different size can be tested as well, obviously with the needed changes made to the sources in the testing branch.
Elo measurements will be done as part of the usual progression/regression tests, which will probably be more frequent.
Enjoy the power of fishtest for making progress!
- In a terminal, browse to the Stockfish `src` folder.
- Create a branch for your work: `git checkout -b my_tuning_branch`
- Remove the `const` qualifiers from the variables in the source code that you want to tune.
- Flag the variables you want to tune with the `TUNE` macro. For example, if you have:

```cpp
int myKing = 10, myQueen = 20;
Score myBonus = S(5, 15);
Value myValue[][2] = { { V(100), V(20) }, { V(7), V(78) } };
```

simply add the following line somewhere after it:

```cpp
TUNE(myKing, myBonus, myValue);
```
The variables must be of type `int`, `Score`, or `Value`. They can be arrays of arbitrary dimensions. Note that the variables will only be allowed to vary in the range [0, 2 * v], where v is the initial value. See also the step on custom ranges below.
You can have multiple invocations of `TUNE` in different places. For example, the code below is equivalent to the one above:

```cpp
TUNE(myKing);
...
TUNE(myBonus, myValue);
```
Even more flexibility can be obtained with a custom range function. For example, the following uses a range of ±20 around the initial value of each variable, except for variables that are zero, which get an empty range and are therefore not tuned:

```cpp
auto myfunc = [](int m){return m == 0 ? std::pair<int, int>(0, 0) : std::pair<int, int>(m - 20, m + 20);};
TUNE(SetRange(myfunc), QuadraticOurs);
```
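To see which interval each parameter ends up with, the range rules described on this page can be emulated outside Stockfish. The sketch below is a standalone illustration, not Stockfish's tuning code: `default_range` is a hypothetical function implementing the [0, 2 * v] rule (for negative v we assume the ends are swapped, which the text does not specify), `is_tuned` is a hypothetical check for the empty-interval case, and `myfunc` is the custom range function from the example.

```cpp
#include <algorithm>
#include <utility>

using Range = std::pair<int, int>;  // [min, max] allowed for a parameter

// Hypothetical emulation of the default rule: v is tuned in [0, 2 * v].
// Swapping the ends for negative v is our assumption, not documented behavior.
Range default_range(int v) {
    return Range(std::min(0, 2 * v), std::max(0, 2 * v));
}

// The custom range function from the example: +-20 around the initial value,
// and an empty interval for zero-valued parameters.
auto myfunc = [](int m) {
    return m == 0 ? std::pair<int, int>(0, 0) : std::pair<int, int>(m - 20, m + 20);
};

// Empty intervals (min == max) are not tuned.
bool is_tuned(const Range& r) { return r.first < r.second; }
```

With these rules, `myKing = 10` gets [0, 20] by default, while any zero-valued parameter gets [0, 0] under both the default rule and `myfunc`, so it stays fixed.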
- If you have a function that needs to be called after the variables are updated, for example

```cpp
void my_post_update() {}
```

simply add its name to the `TUNE` arguments:

```cpp
TUNE(myKing, myBonus, myValue, my_post_update);
```

You can add multiple functions; they will be called in the order you add them.
- By default, a variable `v` is tuned in the range [0, 2 * v], and only that range is allowed for the parameter. You can change that by adding a custom range as another argument to `TUNE`:

```cpp
TUNE(SetRange(-100, 100), myKing, myQueen);
```

This changes the default range for all the variables that follow it. To customize further, you can set another range for the remaining variables:

```cpp
TUNE(SetRange(-100, 100), myKing, SetRange(-20, 20), myQueen);
```

Here `myKing` is tuned in [-100, 100] while `myQueen` is tuned in [-20, 20].
To return to the default range, use `SetDefaultRange`:

```cpp
TUNE(SetRange(-100, 100), myKing, SetDefaultRange, myQueen);
```

so that `myQueen` uses the default range.
Note: you can also change the range of each parameter manually when you enter them in fishtest, as shown below. However, that range must lie within the allowed range for the parameter, i.e. it can only be narrowed from what Stockfish prints out.
- After you are done specifying what you want to tune and how, compile the source by running `make build ARCH=x86-64-modern` (run `make help` to learn more about compiling).
- Run `./stockfish`. It prints a comma-separated list of the tuned parameters; copy that list somewhere.
- Get the branch signature from `./stockfish bench`, which prints a line like `Nodes searched : 4190940`. Here, 4190940 is the signature.
- Commit your changes locally by running `git commit -am "My commit message"`
- Push your changes to GitHub with `git push origin my_tuning_branch`
- Go to https://tests.stockfishchess.org/tests/run
- Fill in the "test branch" field with the name of your branch, e.g. `my_branch`. This MUST exactly match the actual name of the branch on GitHub! (It will not be deduced from your test repo link.)
- Fill in the "test signature" field with the bench output of your patch, e.g. `4190940`. If you added a line containing "Bench: [the bench of the patch]" to the commit message, you don't need to enter this information.
- Do the same for "base branch" and "base signature". (The base branch is usually `master`, and the base signature defaults to the bench output of the official repository's current master branch.)
- Fill in the "test repo" field with the link to the BASE GitHub repo of your fork of Stockfish, with no trailing slash, e.g. `https://github.com/yourname/Stockfish`. This is NOT a link to the repo of the test branch itself. Your "test repo" stays the same for all of your tests across all of your branches.
- Fill in the notes describing your change.
- Please follow the testing methodology below, unless you have a good reason to do otherwise.
- Click run test.
- If your STC test passes, you can move on to an LTC test (see below for parameters).
- If your LTC test passes, congratulations! Now please create a pull request against the official-stockfish repository, so that your changes can be code reviewed. Remember that even if a patch passes STC and LTC, it is not guaranteed to be committed.
- Go to https://tests.stockfishchess.org/tests/run
- Nodestime makes Stockfish budget its thinking time in nodes rather than wall-clock time, removing the noise introduced by inconsistent hardware speed. If the value you are tuning is unlikely to change the nps (nodes per second) significantly, change `Test options` to `Hash=128 nodestime=600`. For search parameters, normal TC tuning is usually preferred.
- Choose the appropriate TC. A short TC is appropriate for faster approximations or for checking the tune's correctness and parameters. A long TC (`60+0.6`) is best for getting values that scale well. If using a regular time control, set the appropriate TC and hash (`Hash=16` for `20+0.2`, `Hash=64` for `60+0.6`). If using nodestime, set the time control to `160+1.6` to get the equivalent of `60+0.6` with normal time management. The nodestime games will usually finish long before the time is up.
- Paste the list that you copied into the `SPSA Parameters` field. This is comma-separated data for `parameter, initial value, minimum, maximum, ck value, rk value`, in that order. Here you can also manually change the min and max values of parameters.
- A good tune run shows a significant change in the tuned values while minimizing random change. As a general rule, tuning many values at once (like a PSQT table) generates random change, while values that are only rarely used to score a position can have a hard time moving at all. Tweaking the ck value often helps to get a better result: a higher ck forces value change, and a lower one reduces random change.
Note: if you use the default range with an initial value of 0, the parameter will not be tuned, because the default range collapses to [0, 0] and empty intervals are not tuned.
- Make sure your test repo is correct (e.g. https://github.com/yourname/Stockfish).
- Fill in the notes describing your change.
- Click run test.
- If after a few thousand games the values are barely changing at all, the tuning run is useless and should be stopped. This usually happens when the ck value is too low for the parameter changes to influence the results of the tuning games more than random noise does.
Note: Do not modify the number of games in an SPSA test while it's running. This breaks the algorithm.
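When editing the `SPSA Parameters` list by hand, it can help to check each line before submitting. The sketch below parses one line in the `parameter, initial value, minimum, maximum, ck value, rk value` order described above and rejects an empty interval or an initial value outside [min, max]. The struct, the function name and the sample values are hypothetical; fishtest performs its own validation.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// One SPSA parameter line: name, initial value, minimum, maximum, ck, rk.
struct SpsaParam {
    std::string name;
    double init, min, max, ck, rk;
};

// Hypothetical helper: split a comma-separated line and validate it.
SpsaParam parse_spsa_line(const std::string& line) {
    std::vector<std::string> fields;
    std::stringstream ss(line);
    std::string field;
    while (std::getline(ss, field, ','))
        fields.push_back(field);
    if (fields.size() != 6)
        throw std::invalid_argument("expected 6 fields: " + line);

    SpsaParam p;
    p.name = fields[0];
    p.init = std::stod(fields[1]);  // std::stod skips leading whitespace
    p.min  = std::stod(fields[2]);
    p.max  = std::stod(fields[3]);
    p.ck   = std::stod(fields[4]);
    p.rk   = std::stod(fields[5]);

    // An empty or inverted interval cannot be tuned.
    if (!(p.min < p.max))
        throw std::invalid_argument("min must be below max for " + p.name);
    if (p.init < p.min || p.init > p.max)
        throw std::invalid_argument("initial value outside [min, max] for " + p.name);
    return p;
}
```

For instance, a hypothetical line `myKing,10,0,20,1,0.002` parses into name `myKing` with range [0, 20], while a zero-width range such as `0,0` is rejected.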
Definitions:
- hippopotamus = a patch which adds at least 10 lines of code. Please avoid them if possible; it is often better to make a minimal version of your idea.
- parameters tweak = changing the value of some constants in the code. The generated machine code has the same complexity (aiming at the same number of processor instructions).
- simplification = a way to make the code and the algorithms clearer. Most of the time, a necessary condition is that the number of lines in the source code of Stockfish goes down, or the number of processor instructions in the generated code goes down -- but this is by no means a sufficient condition, because it is of course unfortunately possible to lower the number of lines in the code while obfuscating it.
- bug = a bug which has been discussed in the fishcooking forum or as an issue in github and confirmed as such by the maintainers. Potential bug fix solutions shall be first discussed in the forum, then tested in the framework.
- STC = Short time control (10+0.1)
- LTC = Long time control (60+0.6)
- TC = time control, should almost always be STC or LTC, but exceptionally (e.g. testing time management) can be 40/10 or 40/10+0.1 (40 moves in 10s or 10s + increment), or 10+0 (sudden death).
- SPRT(x,y) = SPRT test with elo0 = x and elo1 = y
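The TC notations listed above can be read mechanically. A minimal sketch, assuming only the forms mentioned here (`base+inc`, `moves/base` and `moves/base+inc`, with times in seconds); `TimeControl` and `parse_tc` are hypothetical names, not fishtest code:

```cpp
#include <string>

// Parsed time control; moves == 0 means there is no repeating move period.
struct TimeControl {
    int moves = 0;     // e.g. 40 in "40/10"
    double base = 0;   // seconds
    double inc = 0;    // seconds added per move
};

TimeControl parse_tc(const std::string& tc) {
    TimeControl t;
    std::string rest = tc;
    std::string::size_type slash = rest.find('/');
    if (slash != std::string::npos) {
        t.moves = std::stoi(rest.substr(0, slash));  // moves per period
        rest = rest.substr(slash + 1);
    }
    std::string::size_type plus = rest.find('+');
    if (plus != std::string::npos) {
        t.base = std::stod(rest.substr(0, plus));
        t.inc = std::stod(rest.substr(plus + 1));
    } else {
        t.base = std::stod(rest);  // no increment given
    }
    return t;
}
```

For example, `parse_tc("10+0.1")` gives a 10 s base with a 0.1 s increment (STC), and `parse_tc("40/10")` gives 40 moves in 10 s.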
The following recommendations for choosing the right parameters are best understood with the graphical SPRT calculator, which draws curves showing the pass rate and average run length for various values of SPRT(x,y). When in doubt, stick with the standard test parameters.
We use these for almost all our tests. This mode is our workhorse, designed to commit only robust patches that almost surely work. Our goal is to minimize the possibility of regressions and to avoid adding unnecessary complexity.
- STC: SPRT<-0.5, 2.5>
- LTC: SPRT<0.5, 3.5>
Pass-rate and expected cost
Note that Fishtest SPRT bounds are expressed in normalized Elo https://hardy.uhasselt.be/Fishtest/normalized_elo_practical.pdf. This ensures that resource consumption is independent of the draw ratio and the opening book. Currently all tests have a peak expected duration of 116K games.
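Each SPRT run stops as soon as its log-likelihood ratio (LLR) leaves an interval fixed by the accepted error rates. With Wald's classical formulas and error rates alpha = beta = 0.05 (our assumption about the fishtest settings, consistent with the `(-2.94,2.94)` shown in the sample results further down this page), the bounds can be computed as in this sketch; `sprt_bounds` is a hypothetical name.

```cpp
#include <cmath>

// Wald's SPRT stopping bounds for false-positive rate alpha and
// false-negative rate beta: the test fails once LLR <= lower
// and passes once LLR >= upper.
struct SprtBounds {
    double lower, upper;
};

SprtBounds sprt_bounds(double alpha, double beta) {
    return { std::log(beta / (1.0 - alpha)), std::log((1.0 - beta) / alpha) };
}
```

`sprt_bounds(0.05, 0.05)` gives about -2.944 and +2.944, so a passed test shows an LLR of about 2.94 or slightly above.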
STC assumptions
Bounds: [-0.5,2.5]; draw ratio: 0.82; bias: 30; TC: 10+0.1; duration: 68 moves; time consumed: 92%.
Elo | %Pass | Kgames | CPUdays |
---|---|---|---|
-4.0 | 0.0% | 11.1 | 4.0 |
-3.0 | 0.0% | 14.4 | 5.2 |
-2.0 | 0.0% | 20.4 | 7.3 |
-1.0 | 0.1% | 34.7 | 12.4 |
0.0 | 12.3% | 89.3 | 31.9 |
0.5 | 59.9% | 114.8 | 41.1 |
1.0 | 94.1% | 74.2 | 26.5 |
1.5 | 99.4% | 44.8 | 16.0 |
2.0 | 99.9% | 31.0 | 11.1 |
3.0 | 100.0% | 19.0 | 6.8 |
4.0 | 100.0% | 13.7 | 4.9 |
LTC assumptions
Bounds: [0.5,3.5]; draw ratio: 0.924; bias: 23; TC: 60+0.6; duration: 68 moves; time consumed: 92%.
Elo | %Pass | Kgames | CPUdays |
---|---|---|---|
-4.0 | 0.0% | 7.0 | 15.0 |
-3.0 | 0.0% | 9.0 | 19.3 |
-2.0 | 0.0% | 12.5 | 26.8 |
-1.0 | 0.0% | 20.7 | 44.3 |
0.0 | 1.9% | 56.9 | 122.2 |
0.5 | 43.6% | 115.6 | 248.2 |
1.0 | 96.8% | 63.9 | 137.1 |
1.5 | 99.9% | 32.8 | 70.4 |
2.0 | 100.0% | 21.6 | 46.5 |
3.0 | 100.0% | 12.9 | 27.6 |
4.0 | 100.0% | 9.2 | 19.6 |
Among parameter tweaks, a special sub-case is the so-called union patch or combo patch, that is, a bundle of patches that failed SPRT but with a positive or near-positive score. Sometimes retesting the union as a whole passes SPRT. Due to the nature of the approach, and because each individual patch has already failed, a union has some constraints:
- Maximum 2 patches per union
- Each patch shall be trivial, like a parameter tweak. Patches that add/remove a concept/idea/feature shall pass individually.
These tests must be used for all functionality-changing simplifications, even one-liners, to check whether removing the code is detrimental to Stockfish's strength. We try to reject an Elo loss, and even a neutral patch can fail; nevertheless, because the code under test is simpler/smaller than the original, we don't require the stricter standard mode. These tests are also used for bug fixes and other special cases, but only after they have been discussed in the fishcooking forum and approved in advance, to prevent no-regression mode from becoming people's favorite toy instead of the stricter standard mode.
- STC: SPRT<-2.5, 0.5>
- LTC: SPRT<-2.5, 0.5>
If your patch is a non-functional change, you might still need to run it through fishtest, but there are exceptions.
Usually there is an open PR to collect small cleanups. They are trivial, don't change the bench, and are generally non-controversial. This might be typos, variable names, dead code, and similar.
Code refactoring and non-functional simplifications are a very wide family of patches and, by their nature, more subjective than other kinds of patches. Accordingly, their acceptance also relies more on maintainer knowledge, experience and sensibility.
Anything not fitting the above small cleanup category should typically be tested on fishtest, unless the code is not really exercised on fishtest (e.g. syzygy) and doesn't change bench. This approach verifies code correctness (no crashes), and makes sure there are no unexpected side effects or slowdowns.
- Regression tests will be SPRT<-2.5, 0.5> for STC and LTC or exceptionally other SPRT bounds according to maintainer judgement.
- Rejects are always possible if the patch is worse than the original.
These kinds of patches, although very important for long-term code quality, are also the ones most likely to raise discussions, because code style is largely subjective. So be prepared to accept a negative judgement from the maintainer: judging these patches is a hard job, so please do not take it personally or start an endless discussion if your patch is rejected. Simply move on to your next Elo-winning idea.
Test it on your machine by running `stockfish bench`. It is recommended to run it several times and compute the mean and standard deviation to see whether the improvement is statistically significant. On Linux/Windows you can use psybench; on Windows you can also use the specialized tools FishBench and BuildTester, both excellent.
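The mean/stdev comparison takes only a few lines of code. This is a minimal sketch, assuming you have collected the `Nodes/second` figures of several bench runs for the base and the patched binary; it reports the relative speedup and Welch's t statistic (a standard two-sample test we chose for illustration, not necessarily what the tools above use). Compare t against a Student's t table for your number of runs; all function names are hypothetical.

```cpp
#include <cmath>
#include <numeric>
#include <vector>

double mean(const std::vector<double>& x) {
    return std::accumulate(x.begin(), x.end(), 0.0) / x.size();
}

// Sample variance (n - 1 in the denominator); needs at least two runs.
double variance(const std::vector<double>& x) {
    double m = mean(x), s = 0.0;
    for (double v : x)
        s += (v - m) * (v - m);
    return s / (x.size() - 1);
}

// Relative speedup of the patch over the base, in percent.
double speedup_percent(const std::vector<double>& base, const std::vector<double>& test) {
    return (mean(test) / mean(base) - 1.0) * 100.0;
}

// Welch's t statistic for the difference of the two means.
double welch_t(const std::vector<double>& base, const std::vector<double>& test) {
    double se = std::sqrt(variance(base) / base.size() + variance(test) / test.size());
    return (mean(test) - mean(base)) / se;
}
```

For example, base runs of {1000, 1010, 990} and test runs of {1020, 1030, 1010} nps give a 2% speedup with t of about 2.45.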
If you have access to Linux, the most reliable way is to use the excellent perf tool:

```sh
$ sudo perf stat -r 5 -a -B -e cycles:u,instructions:u ./stockfish bench > /dev/null
```

This command runs bench 5 times, counting instructions and cycles and averaging them. At the end, a report is printed:

```
Performance counter stats for 'system wide' (5 runs):
22.747.981.856 cycles:u ( +- 0,05% )
28.409.592.052 instructions:u # 1,25 insn per cycle ( +- 0,06% )
4,331400608 seconds time elapsed ( +- 0,11% )
```

Note the very small error; if you get a noticeably larger one, please run the test again. If you run Linux in a VM, such as VMware, you have to enable performance counters in the virtual machine settings.
If the code is a trivial change, send us a pull request. Speedups need to be further verified, for at least 2 reasons:
- The speedup needs to be statistically significant and not just random noise.
- The speedup needs to be confirmed on different machines. Sometimes a speedup on one machine is a slowdown on another.
To be considered for inclusion, the speedup should be around 0.5%. If the patch is more complex, it will go through the normal STC+LTC fishtest tests. This requires about a 0.25% speedup at STC and about a 0.7% speedup at LTC for a 50% chance of passing; 1% speedups pass with an 85% chance. The rationale is that a speedup is entirely comparable to a normal patch: it adds complexity with the aim of improving Elo, so it makes sense to test it under the same conditions. Some data and discussion can be found in the following issue: https://github.com/official-stockfish/Stockfish/issues/2593
Once you are ready, once the tests with your nice idea have passed and/or you have enough speed data to support your improvements, congratulations: you can now open a pull request against the master branch of official-stockfish :-)
Guidelines for a great pull request:
- Is my pull request up-to-date with current master? The first thing to do before opening a pull request is to synchronize your master branch with the official master branch (see the script at the top of this page to do that).
- Does my pull request consist of a single commit? Consider creating a new branch where you squash your changes together if your development branch has a long history, then open the pull request from this new branch.
- Is my code really really clean? Employ a coding style that is similar to surrounding Stockfish code, remove all spaces on empty lines and trailing spaces everywhere, etc.
- Can I improve the quality of my commit message? Very important for the community: your git commit message should have a high-level description of the patch, explaining the reasoning behind the patch and why it improves on the current code — please describe your algorithm on an example, provide chess data or console output if relevant, etc. Note that it is often informative if a discussion of the potential weaknesses is also included. Use the same comment for the pull request comment, by copy-pasting the commit message to the pull request message on GitHub.
- Do I provide easy ways to check my data? All patches that required testing (this applies to 99% of the patches for Stockfish) should also report the results obtained in fishtest at STC and LTC, with links to the test pages, e.g.

```
STC:
LLR: 2.94 (-2.94,2.94) {-0.50,1.50}
Total: 6672 W: 1052 L: 898 D: 4722
Ptnml(0-2): 63, 600, 1868, 730, 75
https://tests.stockfishchess.org/tests/view/5f2d00b461e3b6af64881f21
LTC:
LLR: 2.96 (-2.94,2.94) {0.25,1.75}
Total: 7576 W: 573 L: 463 D: 6540
Ptnml(0-2): 8, 392, 2889, 480, 19
https://tests.stockfishchess.org/tests/view/5f2d052a61e3b6af64881f29
```
- How could we continue after the patch? Write a few words in the commit message to offer perspectives for future work. This is often a nice way to get momentum for continuing research on this subject in Stockfish, and maybe somebody else in the community will pick-up the challenge.
- What will be the next signature of Stockfish if my patch is accepted? Both the commit message and the pull request comment on GitHub must state whether the patch is 'No functional change' or changes the search. The last line of the commit message should be either 'No functional change' or 'Bench: XXXXXXX', where XXXXXXX is the number shown as 'Nodes searched : XXXXXXX' in the output of a `./stockfish bench` invocation.
- Is my patch portable? We have continuous integration testing, which checks for standard conformance and reproducibility of search using various compilers.
- Closes what pull request? Add a line containing the pull request number for easy checking of the original PR.
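Since the last line of the commit message has a fixed shape, it can be checked before pushing, e.g. from a local git hook. A minimal sketch of such a check; the function name is hypothetical, and it only enforces the two forms described above (`No functional change`, or `Bench: ` followed by digits):

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Return true if the last non-empty line of msg is "No functional change"
// or "Bench: <digits>".
bool has_valid_trailer(const std::string& msg) {
    std::size_t end = msg.find_last_not_of(" \t\r\n");
    if (end == std::string::npos)
        return false;  // empty message
    std::size_t start = msg.rfind('\n', end);
    start = (start == std::string::npos) ? 0 : start + 1;
    std::string last = msg.substr(start, end - start + 1);

    if (last == "No functional change")
        return true;
    const std::string prefix = "Bench: ";
    if (last.compare(0, prefix.size(), prefix) != 0 || last.size() == prefix.size())
        return false;
    for (std::size_t i = prefix.size(); i < last.size(); ++i)
        if (!std::isdigit(static_cast<unsigned char>(last[i])))
            return false;  // the signature must be a plain number
    return true;
}
```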
Here is a sample of a standard pull request:

```
STC:
LLR: 2.94 (-2.94,2.94) {-0.50,1.50}
Total: 6672 W: 1052 L: 898 D: 4722
Ptnml(0-2): 63, 600, 1868, 730, 75
https://tests.stockfishchess.org/tests/view/5f2d00b461e3b6af64881f21
LTC:
LLR: 2.96 (-2.94,2.94) {0.25,1.75}
Total: 7576 W: 573 L: 463 D: 6540
Ptnml(0-2): 8, 392, 2889, 480, 19
https://tests.stockfishchess.org/tests/view/5f2d052a61e3b6af64881f29
Closes https://github.com/official-stockfish/Stockfish/pull/2923
Bench: 4390086
```
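To get a feel for the `Total: ... W: ... L: ... D: ...` lines, the score and a plain logistic Elo estimate can be computed from the game counts. This is only a rough sketch with hypothetical function names: fishtest's bounds use normalized Elo and pentanomial statistics, so the number below is not what fishtest reports.

```cpp
#include <cmath>

// Score fraction from wins, losses and draws (a draw counts as half a point).
double score(int wins, int losses, int draws) {
    return (wins + 0.5 * draws) / (wins + losses + draws);
}

// Logistic Elo difference corresponding to an expected score s (0 < s < 1).
double logistic_elo(double s) {
    return -400.0 * std::log10(1.0 / s - 1.0);
}
```

For the STC sample above (W: 1052, L: 898, D: 4722), the score is about 0.5115 and the logistic Elo about +8.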
Thanks for reading, have fun :-)
The advanced options section of the test creation page contains options that should be toggled on or off only by advanced users. Currently, the only advanced option is:
- Auto-purge: toggles auto-purge on and off. Turning it off can be beneficial when testing time-management patches or patches that affect different operating systems differently.