v0.7.0
🚀 Streaming v0.7.0
Streaming v0.7.0 is released! Install via `pip`:
pip install --upgrade mosaicml-streaming==0.7.0
📈 Better Defaults for StreamingDataset (#479)
- The default values for `StreamingDataset` have been updated to be more performant and should suit most use cases. The changes are detailed below:
| Parameter | Old Value | New Value | Benefit |
|---|---|---|---|
| `shuffle_algo` | `py1s` | `py1e` | Better shuffle and balanced downloading |
| `num_canonical_nodes` | `64 * physical_nodes` | if `py1s` or `py2s`, `64 * physical_nodes`, otherwise `physical_nodes` | Consistently good shuffle for all shuffle algos |
| `shuffle_block_size` | `262,144` | `4,000,000 / num_canonical_nodes` | Consistently good shuffle for all `num_canonical_nodes` values |
| `predownload` | `max(batch_size, 256 * batch_size // num_canonical_nodes)` | `8 * batch_size` | Better balanced downloading |
| `partition_algo` | `orig` | `relaxed` | More flexible deterministic resumptions on nodes |
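
To make the table concrete, here is a minimal sketch that evaluates the new default formulas for a given setup. It is an illustration of the table only, not the library's own code: the helper names `new_defaults` and `old_predownload` are hypothetical, and the use of integer division for `shuffle_block_size` is an assumption, since the table only states `4,000,000 / num_canonical_nodes`.

```python
def new_defaults(shuffle_algo: str, physical_nodes: int, batch_size: int) -> dict:
    """Evaluate the v0.7.0 default formulas from the table (illustrative only)."""
    # py1s and py2s keep the 64x multiplier; other algorithms use the node count.
    if shuffle_algo in ("py1s", "py2s"):
        num_canonical_nodes = 64 * physical_nodes
    else:
        num_canonical_nodes = physical_nodes
    return {
        "shuffle_algo": shuffle_algo,
        "num_canonical_nodes": num_canonical_nodes,
        # Assumption: integer division; the table does not specify rounding.
        "shuffle_block_size": 4_000_000 // num_canonical_nodes,
        "predownload": 8 * batch_size,
        "partition_algo": "relaxed",
    }


def old_predownload(batch_size: int, num_canonical_nodes: int) -> int:
    """The pre-v0.7.0 predownload formula from the table, for comparison."""
    return max(batch_size, 256 * batch_size // num_canonical_nodes)


if __name__ == "__main__":
    # Two physical nodes, per-device batch size of 32, new default shuffle algorithm.
    print(new_defaults("py1e", physical_nodes=2, batch_size=32))
    print(old_predownload(batch_size=32, num_canonical_nodes=128))
```

The dictionary keys match the parameter names in the table, so the same values can be passed explicitly if you want to pin a run to particular settings rather than rely on the defaults.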
💎 New Features
🤖 Streaming Simulator: Easily simulate the performance of training configurations. (#385)
- After installing this version of streaming, simply run the command `simulator` in your terminal to open the simulation interface.
- Simulate throughput, network downloads, shuffle quality, and cache limit requirements for configurations.
- Easily de-risk runs and find performant parameter settings.
- Check out the docs for more information!
🔢 More flexible deterministic training and resumption (#476)
- Deterministic training and resumption are now possible on a wider range of node counts!
- Previously, the `num_canonical_nodes` parameter had to divide or be a multiple of the number of physical nodes for determinism.
- Now, deterministic training is possible on any number of nodes that also evenly divides your run's global batch size.
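
The two rules above can be sketched as simple divisibility checks. This is a rough illustration under a literal reading of the bullets, not Streaming's actual validation logic, and the function names are hypothetical.

```python
def deterministic_before(num_canonical_nodes: int, physical_nodes: int) -> bool:
    """Old rule: num_canonical_nodes divides, or is a multiple of, physical_nodes."""
    return (num_canonical_nodes % physical_nodes == 0
            or physical_nodes % num_canonical_nodes == 0)


def deterministic_now(global_batch_size: int, physical_nodes: int) -> bool:
    """New rule (literal reading): physical_nodes evenly divides the global batch size."""
    return global_batch_size % physical_nodes == 0


def candidate_node_counts(global_batch_size: int, max_nodes: int) -> list:
    """List the node counts up to max_nodes that satisfy the new rule."""
    return [n for n in range(1, max_nodes + 1)
            if deterministic_now(global_batch_size, n)]


if __name__ == "__main__":
    # With 8 canonical nodes, 6 physical nodes failed the old rule...
    print(deterministic_before(num_canonical_nodes=8, physical_nodes=6))
    # ...but 6 nodes evenly divide a global batch size of 96.
    print(deterministic_now(global_batch_size=96, physical_nodes=6))
    print(candidate_node_counts(global_batch_size=96, max_nodes=16))
```

A check like this can help when planning to resume a run on a different cluster size: pick a node count from the candidate list for your global batch size.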
🐛 Bug Fixes
- Check for invalid hash algorithm names (#486)
What's Changed
- Bump fastapi from 0.103.2 to 0.104.0 by @dependabot in #480
- Bump gitpython from 3.1.37 to 3.1.40 by @dependabot in #481
- Bump sphinx-tabs from 3.4.1 to 3.4.4 by @dependabot in #482
- do not remove local directory when out is local by @XiaohanZhangCMU in #477
- Update init.py by @XiaohanZhangCMU in #484
- Check for invalid hash algorithm name by @karan6181 in #486
- Relaxing divisibility constraints on num_canonical_nodes and num_physical_nodes by @snarayan21 in #476
- Better default values for StreamingDataset args by @snarayan21 in #479
- Update release yaml to not write anything to GitHub by @karan6181 in #487
- Bump pypandoc from 1.11 to 1.12 by @dependabot in #490
- Bump pytest from 7.4.2 to 7.4.3 by @dependabot in #491
- Bumping version for streaming v0.7.0 by @snarayan21 in #495
Full Changelog: v0.6.1...v0.7.0