Add CI/CD for unit tests #41

Merged Feb 16, 2024 · 107 commits

Changes from 93 commits
1c79951
add CI/CD for unit tests
xrsrke Jan 19, 2024
04491d3
fix
xrsrke Jan 19, 2024
fdd5d1e
fix syntax
xrsrke Jan 19, 2024
91208dd
fix
xrsrke Jan 19, 2024
8da087d
fix
xrsrke Jan 19, 2024
00875c0
update actions/checkout
xrsrke Jan 19, 2024
cca7e56
new runner label
glegendre01 Jan 19, 2024
338c042
fix typo
glegendre01 Jan 19, 2024
0c6433c
add workflow dispatch
glegendre01 Jan 19, 2024
6de2472
remove path filter for triggering
glegendre01 Jan 19, 2024
79b22d8
test ci
xrsrke Jan 23, 2024
c73623b
update python version
xrsrke Jan 23, 2024
5efc135
add code quality
xrsrke Jan 23, 2024
4fb80a4
refactor
xrsrke Jan 23, 2024
ceb21c2
only check src
xrsrke Jan 23, 2024
05aa557
fix
xrsrke Jan 23, 2024
0010cfa
use docker image
xrsrke Jan 23, 2024
dba1eed
fix
xrsrke Jan 23, 2024
b2af5d0
use python 10
xrsrke Jan 23, 2024
8914de7
change docker image
xrsrke Jan 24, 2024
368beba
fix pip install
xrsrke Jan 24, 2024
565e081
add fa2-related tests
xrsrke Jan 24, 2024
7b38326
fix
xrsrke Jan 24, 2024
906477b
update FA2 version
xrsrke Jan 24, 2024
4491ce7
add on push
xrsrke Jan 24, 2024
5b22ede
update FA2 to flash-attn>=2.5.0
xrsrke Jan 24, 2024
5f3ce67
Merge branch 'main' of github.com:huggingface/nanotron into xrsrke/se…
xrsrke Jan 29, 2024
9a03a04
add searching for free ports in unit tests
xrsrke Jan 29, 2024
1cf4da2
remove searching port
xrsrke Jan 29, 2024
f6d9847
move searching ports to distributed
xrsrke Jan 29, 2024
f675daf
Update 3d_parallelism_unit_tests.yaml
xrsrke Jan 29, 2024
0908b74
Update 3d_parallelism_unit_tests.yaml
xrsrke Jan 29, 2024
df7cb9d
Update distributed.py
xrsrke Jan 29, 2024
839677a
Update 3d_parallelism_unit_tests.yaml
xrsrke Jan 29, 2024
b631186
Update 3d_parallelism_unit_tests.yaml
xrsrke Jan 30, 2024
128eea5
Update distributed.py
xrsrke Jan 30, 2024
f96808a
Refactor test_clip_grads_with_tp parameters
NouamaneTazi Jan 31, 2024
d123d1b
Skip test cases for ALL_REDUCE mode with async communication
NouamaneTazi Jan 31, 2024
b899564
Update init_method to use env://localhost:port
NouamaneTazi Jan 31, 2024
ff32ddb
tests run for all PRs
NouamaneTazi Jan 31, 2024
abe42c6
Update branch filter in GitHub workflows
NouamaneTazi Jan 31, 2024
0a754a1
skip ALL_REDUCE with async comm
NouamaneTazi Jan 31, 2024
5d822bb
make sure total_norm in clip grad is a scalar
NouamaneTazi Jan 31, 2024
e5e2045
Merge branch 'main' of github.com:huggingface/nanotron into xrsrke/se…
xrsrke Jan 31, 2024
5d9652a
refactor
xrsrke Jan 31, 2024
063020a
zeros([]
NouamaneTazi Feb 1, 2024
741966b
Merge pull request #52 from huggingface/nouamane/fix_ci
NouamaneTazi Feb 1, 2024
e2ed85f
exclude sanity_checks.py from CoL
xrsrke Feb 1, 2024
91234fa
exclude sanity_checks.py from CoL
xrsrke Feb 1, 2024
a57cb9b
Merge branch 'main' of github.com:huggingface/nanotron into xrsrke/se…
xrsrke Feb 10, 2024
8a98cfc
fix expectation
xrsrke Feb 10, 2024
29672db
remove empty context manager in tp tests
xrsrke Feb 10, 2024
0a34e65
add reruning a tests if a port is in used
xrsrke Feb 10, 2024
e3c3d11
fix checking total_norm should be a scalar
xrsrke Feb 10, 2024
63ca0d2
fix
xrsrke Feb 10, 2024
44c0e05
add more retrying
xrsrke Feb 10, 2024
b8eeb1e
fix clip grads
xrsrke Feb 10, 2024
b553c4e
remove testing dim in clip grads
xrsrke Feb 10, 2024
0b97c38
fuk
xrsrke Feb 10, 2024
8c7355e
refactor
xrsrke Feb 10, 2024
2a4e735
run tests in parallel
xrsrke Feb 10, 2024
d47555e
not run fa2
xrsrke Feb 10, 2024
3b70271
only run 5 tests in parallel
xrsrke Feb 10, 2024
30b8004
only run a test at a time
xrsrke Feb 10, 2024
51a804c
add forking RNG
xrsrke Feb 10, 2024
cec0c04
fix circular import
xrsrke Feb 10, 2024
f42a43e
fix rng
xrsrke Feb 10, 2024
5b375f5
remove parallel tests
xrsrke Feb 10, 2024
081b17d
add python random seed
xrsrke Feb 11, 2024
4dce881
remove dist test, and add destroying process group after running a test
xrsrke Feb 11, 2024
00bb0bf
fix
xrsrke Feb 11, 2024
957826e
edit
xrsrke Feb 11, 2024
dc65581
fix
xrsrke Feb 11, 2024
0fe7bdd
fix
xrsrke Feb 11, 2024
de52fc6
removing destroy pg
xrsrke Feb 11, 2024
f2afea3
add destroying parallel_context in unit tests
xrsrke Feb 11, 2024
97ebff4
ignore layer norm
xrsrke Feb 11, 2024
6a5fd81
wtf is going on
xrsrke Feb 11, 2024
9c7e1a7
add small run
xrsrke Feb 13, 2024
b2c71b0
run small with dist test
xrsrke Feb 13, 2024
0d21bba
debug missing destroy
xrsrke Feb 13, 2024
6bb69ff
fuck
xrsrke Feb 13, 2024
b39c831
f
xrsrke Feb 13, 2024
3bd346d
.
NouamaneTazi Feb 13, 2024
dd0079e
.
NouamaneTazi Feb 13, 2024
91cf7e3
try timeout-minutes and --rm
NouamaneTazi Feb 13, 2024
7e0fcce
try -v
NouamaneTazi Feb 13, 2024
6dcb73d
try
NouamaneTazi Feb 13, 2024
b64f04f
bring back parallel_context.destroy()
NouamaneTazi Feb 13, 2024
2d44ec7
add 3d tests
xrsrke Feb 14, 2024
5d03579
add all cicd
xrsrke Feb 14, 2024
ab09576
run parallel tests
xrsrke Feb 14, 2024
77e0764
only run 1 test
xrsrke Feb 14, 2024
f43687f
add directly spawning processes
xrsrke Feb 15, 2024
004e7f4
refactor spawn function as init_distributed
xrsrke Feb 15, 2024
558b341
please work
xrsrke Feb 15, 2024
98046f8
catch overlaping port from find_free_port
xrsrke Feb 15, 2024
d96c7fa
clean up
xrsrke Feb 15, 2024
f56f8a7
fix circular import
xrsrke Feb 15, 2024
a48b7bf
skip fp8 tests in FA2
xrsrke Feb 15, 2024
033aca9
update code quality
xrsrke Feb 15, 2024
d4c27e7
fix
xrsrke Feb 15, 2024
39e5846
fix
xrsrke Feb 15, 2024
6f7e4b2
remove uncessary files
xrsrke Feb 15, 2024
cd51bd9
fix search free poorts
xrsrke Feb 15, 2024
6c30d2c
set ParallelContext in wrapper
xrsrke Feb 16, 2024
c705f4d
remove uncessary comments
xrsrke Feb 16, 2024
63 changes: 63 additions & 0 deletions .github/workflows/3d_parallelism_unit_tests.yaml
@@ -0,0 +1,63 @@
name: Run non-FA2-related unit tests

on:
  push:
    branches: [ main ]
    # Only run tests if we modify the following files
    paths:
      - "src/**/*.py"
      - "examples/**/*.py"
      - "tests/**/*.py"

  pull_request:
    branches: [ '**' ]
    paths:
      - "src/**/*.py"
      - "examples/**/*.py"
      - "tests/**/*.py"

jobs:
  tests:
    runs-on: [multi-gpu, nvidia-gpu, 8-t4, ci]
    container:
      image: runpod/pytorch:2.1.1-py3.10-cuda12.1.1-devel-ubuntu22.04
      ports:
        - 80
      options: --gpus all --shm-size "8G"
    steps:
      - uses: actions/checkout@v3
      - name: Python environment
        run: |
          which python
          python --version

      - name: Check Pytorch version
        run: |
          nvidia-smi
          python -c "import torch; print('torch:', torch.__version__, torch)"
          python -c "import torch; print('CUDA available:', torch.cuda.is_available())"

      - name: Install nanotron
        run: |
          python -m pip install --upgrade pip
          pip install packaging
          pip install wheel
          pip install -e .
          pip install -e .[dev]
          pip install -e .[test]

      - name: Show installed libraries and their versions
        run: pip freeze | tee installed.txt

      - name: Run tests
        # NOTE: -m "not fa2" runs all the unit tests that don't carry the
        # "fa2" mark (FA2-related tests, which can't run on T4s)
        run: |
          pytest \
          -n 1 \
          -m "not fa2" \
          --color=yes \
          --durations=0 \
          --ignore tests/kernels \
          --verbose \
          tests/
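
For context, pytest treats "fa2" as a known marker only if it is registered in the test configuration. A minimal sketch of such a registration, assuming a conftest.py at the test root (the project's actual marker registration may live elsewhere, for example in pyproject.toml):

# conftest.py (illustrative sketch only; nanotron's real conftest may differ)
def pytest_configure(config):
    # Register the custom "fa2" marker so that -m fa2 and -m "not fa2"
    # select tests without unknown-marker warnings.
    config.addinivalue_line("markers", "fa2: tests that require flash-attn (FA2) and a compatible GPU")
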
26 changes: 26 additions & 0 deletions .github/workflows/code_quality.yaml
@@ -0,0 +1,26 @@
name: Code Quality

on:
  workflow_dispatch:
  push:
    branches: [ main ]
    # Only run tests if we modify the following files
    paths:
      - "src/**/*.py"

  pull_request:
    branches: [ '**' ]
    paths:
      - "src/**/*.py"

jobs:
  cloc:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v3

      - name: Count Lines of Code (cloc)
        uses: djdefi/cloc-action@6
        with:
          options: --exclude-dir=docs,tests,examples --exclude-lang=YAML --exclude-list-file=sanity_checks.py
58 changes: 58 additions & 0 deletions .github/workflows/fa2_unit_tests.yaml
@@ -0,0 +1,58 @@
name: Run FA2-related unit tests

on:
  workflow_dispatch:
  push:
    branches: [ main ]
    # Only run tests if we modify the following files
    paths:
      - "src/**/*.py"
      - "examples/**/*.py"
      - "tests/**/*.py"

  pull_request:
    branches: [ '**' ]
    paths:
      - "src/**/*.py"
      - "examples/**/*.py"
      - "tests/**/*.py"

jobs:
  tests:
    runs-on: [single-gpu, nvidia-gpu, a10, ci]
    container:
      image: runpod/pytorch:2.1.1-py3.10-cuda12.1.1-devel-ubuntu22.04
      ports:
        - 80
      options: --gpus all --shm-size "8G"
    steps:
      - uses: actions/checkout@v3

      - name: Python environment
        run: |
          which python
          python --version

      - name: Check Pytorch version
        run: |
          nvidia-smi
          python -c "import torch; print('torch:', torch.__version__, torch)"
          python -c "import torch; print('CUDA available:', torch.cuda.is_available())"

      - name: Install nanotron
        run: |
          python -m pip install --upgrade pip
          pip install packaging
          pip install wheel
          pip install "flash-attn>=2.5.0" --no-build-isolation
          pip install -e .
          pip install -e .[dev]
          pip install -e .[test]

      - name: Show installed libraries and their versions
        run: pip freeze | tee installed.txt

      - name: Run tests
        # NOTE: -m fa2 runs only the unit tests that carry the "fa2" mark (FA2-related tests)
        run: pytest -m fa2 --color=yes --durations=0 --verbose tests/
1 change: 0 additions & 1 deletion .gitignore
@@ -160,6 +160,5 @@ cython_debug/
#.idea/

.vscode
-.github

checkpoints/
8 changes: 7 additions & 1 deletion src/nanotron/distributed.py
@@ -9,6 +9,8 @@
from torch.distributed import * # noqa
from torch.distributed.distributed_c10d import ProcessGroup

from nanotron.utils import find_free_port

torch_version_above_1_13 = version.parse(torch.__version__) >= version.parse("1.13.0")
Work = dist.Work if torch_version_above_1_13 else dist._Work
default_pg_timeout = datetime.timedelta(minutes=10)
@@ -257,5 +259,9 @@ def initialize_torch_distributed():
        backend = "gloo"

    # Call the init process.
-   dist.init_process_group(backend=backend, world_size=world_size, rank=rank, timeout=dist.default_pg_timeout)
+   port = find_free_port()
+   init_method = f"env://localhost:{port}"
+   dist.init_process_group(
+       init_method=init_method, backend=backend, world_size=world_size, rank=rank, timeout=dist.default_pg_timeout
+   )
    return True
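
The rendezvous now goes through find_free_port so that concurrent test runs do not collide on a fixed port. A rough, hypothetical illustration of the same idea in isolation (single process, gloo backend; a tcp:// URL is used here purely for demonstration, whereas the diff above builds an env:// init_method):

import torch.distributed as dist

from nanotron.utils import find_free_port

port = find_free_port()
# Illustration only: start a one-process group on the freshly discovered port.
dist.init_process_group(backend="gloo", init_method=f"tcp://127.0.0.1:{port}", rank=0, world_size=1)
dist.barrier()
dist.destroy_process_group()
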
4 changes: 2 additions & 2 deletions src/nanotron/optim/clip_grads.py
@@ -56,7 +56,7 @@ def clip_grad_norm(
                torch.stack([torch.linalg.vector_norm(g.detach(), ord=torch.inf, dtype=torch.float) for g in grads])
            )
        else:
-           total_norm = torch.zeros(1, dtype=torch.float, device=torch.device("cuda"))
+           total_norm = torch.zeros([], dtype=torch.float, device=torch.device("cuda"))
        dist.all_reduce(total_norm, group=mp_pg, op=dist.ReduceOp.MAX)

    else:
@@ -68,7 +68,7 @@
                dtype=torch.float,
            ).pow(norm_type)
        else:
-           total_norm = torch.zeros(1, dtype=torch.float, device=torch.device("cuda"))
+           total_norm = torch.zeros([], dtype=torch.float, device=torch.device("cuda"))
        dist.all_reduce(total_norm, group=mp_pg, op=dist.ReduceOp.SUM)
        total_norm.pow_(1.0 / norm_type)
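
The torch.zeros([]) change matters because torch.zeros(1) is a one-element vector while torch.zeros([]) is a true 0-dim scalar, which is what the "total_norm should be a scalar" check introduced in this PR expects. A quick illustration:

import torch

vec = torch.zeros(1)      # shape (1,), ndim 1: a one-element vector
scalar = torch.zeros([])  # shape (),  ndim 0: a genuine scalar tensor
assert vec.shape == (1,) and vec.ndim == 1
assert scalar.shape == () and scalar.ndim == 0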

9 changes: 8 additions & 1 deletion src/nanotron/parallel/context.py
@@ -35,7 +35,7 @@ def __init__(
        )

        if not dist.is_available():
-           raise ValueError("`torch.distributed is not available as a package, please install it.")
+           raise ValueError("torch.distributed is not available as a package, please install it.")

        self.tensor_parallel_size = tensor_parallel_size
        self.pipeline_parallel_size = pipeline_parallel_size
@@ -148,3 +148,10 @@ def get_3d_ranks(self, world_rank: int) -> Tuple[int, int, int]:
        dp_rank = (world_rank // self.tp_pg.size()) % self.dp_pg.size()
        tp_rank = world_rank % self.tp_pg.size()
        return (pp_rank, dp_rank, tp_rank)

    def destroy(self):
        if not dist.is_initialized():
            return

        dist.barrier()
        dist.destroy_process_group()
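
destroy() gives the test helpers a clean teardown hook between runs. A minimal sketch of how a test wrapper might call it (the helper shown here is illustrative and not part of this diff):

from nanotron.parallel import ParallelContext


def teardown(parallel_context: ParallelContext) -> None:
    # Safe to call unconditionally: destroy() returns early if no process group
    # was ever initialized, otherwise it synchronizes and tears the group down.
    parallel_context.destroy()
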
14 changes: 14 additions & 0 deletions src/nanotron/utils.py
@@ -4,6 +4,8 @@
import os
from contextlib import ExitStack, contextmanager
from typing import Callable, ContextManager, List, Optional
import random
import socket

import torch
from packaging import version
@@ -147,3 +149,15 @@ def tensor_from_untyped_storage(untyped_storage: torch.UntypedStorage, dtype: to
    tensor = torch.empty([], dtype=dtype, device=device)
    tensor.set_(source=untyped_storage)
    return tensor


def find_free_port(min_port: int = 2000, max_port: int = 65000) -> int:
    while True:
        port = random.randint(min_port, max_port)
        try:
            with socket.socket() as sock:
                sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
                sock.bind(("localhost", port))
                return port
        except OSError as e:
            raise e
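
Note that the except branch re-raises immediately, so a port collision surfaces as an error and the retry happens at the test level through the rerun_if_address_is_in_use helper added to tests/helpers/utils.py below. A hypothetical variant that retries inside the function instead could look roughly like this (illustration only, not part of this PR):

import random
import socket


def find_free_port_with_retry(min_port: int = 2000, max_port: int = 65000, max_tries: int = 100) -> int:
    # Keep sampling ports until one binds, instead of raising on the first collision.
    for _ in range(max_tries):
        port = random.randint(min_port, max_port)
        try:
            with socket.socket() as sock:
                sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
                sock.bind(("localhost", port))
                return port
        except OSError:
            continue
    raise RuntimeError(f"no free port found after {max_tries} attempts")
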
130 changes: 126 additions & 4 deletions tests/helpers/utils.py
@@ -1,10 +1,15 @@
import contextlib
import os
import random
import re
import time
import uuid
from typing import Any, Dict, List, Optional, Tuple
from inspect import signature
from typing import Any, Callable, Dict, List, Optional, Tuple

import torch.cuda
from nanotron.parallel import ParallelContext
from packaging import version
from torch.distributed.launcher import elastic_launch


@@ -72,10 +77,10 @@ def __init__(self, func, args, kwargs, tp: int, dp: int, pp: int):

    def __call__(self):
        with mock_os_environ(update_key_values={"WORLD_SIZE": f"{self.tp * self.dp * self.pp}"}):
+           # NOTE: we use a different random seed, so that each unit test doesn't generate the same port
+           random.seed(time.time())
            parallel_context = ParallelContext(
-               data_parallel_size=self.dp,
-               pipeline_parallel_size=self.pp,
-               tensor_parallel_size=self.tp,
+               data_parallel_size=self.dp, pipeline_parallel_size=self.pp, tensor_parallel_size=self.tp
            )

            assert "parallel_context" not in self.kwargs
@@ -185,3 +190,120 @@ def get_all_3d_configurations(gpus: int) -> List[Tuple[int, int, int]]:
        if tp * dp * pp == gpus:
            result.append((pp, dp, tp))
    return result
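
For intuition, get_all_3d_configurations enumerates every (pp, dp, tp) split whose product equals the GPU count. A small illustrative check (the import path mirrors the file location and the comparison uses a set because the loop order is truncated in this diff):

from tests.helpers.utils import get_all_3d_configurations

configs = get_all_3d_configurations(4)
assert set(configs) == {
    (1, 1, 4), (1, 2, 2), (1, 4, 1),
    (2, 1, 2), (2, 2, 1), (4, 1, 1),
}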


def rerun_if_address_is_in_use(max_try: int = 500):
    """
    This function reruns a wrapped function if "address already in use" occurs
    in testing spawned with torch.multiprocessing

    Credits: https://github.com/hpcaitech/ColossalAI/blob/adae123df3badfb15d044bd416f0cf29f250bc86/colossalai/testing/utils.py#L157

    Usage::

        @rerun_if_address_is_in_use()
        def test_something():
            ...

    """
    # check version
    torch_version = version.parse(torch.__version__)
    assert torch_version.major >= 1

    # only torch >= 1.8 has ProcessRaisedException
    if torch_version >= version.parse("1.8.0"):
        exception = torch.multiprocessing.ProcessRaisedException
    else:
        exception = Exception

    func_wrapper = rerun_on_exception(exception_type=exception, pattern=".*Address already in use.*", max_try=max_try)
    return func_wrapper


def rerun_on_exception(exception_type: Exception = Exception, pattern: str = None, max_try: int = 10) -> Callable:
    """
    A decorator on a function to re-run when an exception occurs.

    Credits: https://github.com/hpcaitech/ColossalAI/blob/adae123df3badfb15d044bd416f0cf29f250bc86/colossalai/testing/utils.py#L71

    Usage::

        # rerun for all kinds of exception
        @rerun_on_exception()
        def test_method():
            print('hey')
            raise RuntimeError('Address already in use')

        # rerun for RuntimeError only
        @rerun_on_exception(exception_type=RuntimeError)
        def test_method():
            print('hey')
            raise RuntimeError('Address already in use')

        # rerun for maximum 10 times if Runtime error occurs
        @rerun_on_exception(exception_type=RuntimeError, max_try=10)
        def test_method():
            print('hey')
            raise RuntimeError('Address already in use')

        # rerun for infinite times if Runtime error occurs
        @rerun_on_exception(exception_type=RuntimeError, max_try=None)
        def test_method():
            print('hey')
            raise RuntimeError('Address already in use')

        # rerun only if the exception message matches the pattern,
        # for infinite times if Runtime error occurs
        @rerun_on_exception(exception_type=RuntimeError, pattern="^Address.*$")
        def test_method():
            print('hey')
            raise RuntimeError('Address already in use')

    Args:
        exception_type (Exception, Optional): The type of exception to detect for rerun
        pattern (str, Optional): The pattern to match the exception message.
            If the pattern is not None and matches the exception message,
            the exception will be detected for rerun
        max_try (int, Optional): Maximum reruns for this function. The default value is 10.
            If max_try is None, it will rerun forever if the exception keeps occurring
    """

    def _match_lines(lines, pattern):
        for line in lines:
            if re.match(pattern, line):
                return True
        return False

    def _wrapper(func):
        def _run_until_success(*args, **kwargs):
            try_count = 0
            assert max_try is None or isinstance(
                max_try, int
            ), f"Expected max_try to be None or int, but got {type(max_try)}"

            while max_try is None or try_count < max_try:
                try:
                    try_count += 1
                    ret = func(*args, **kwargs)
                    return ret
                except exception_type as e:
                    error_lines = str(e).split("\n")
                    if try_count < max_try and (pattern is None or _match_lines(error_lines, pattern)):
                        print("Exception is caught, retrying...")
                        # when pattern is not specified, we always skip the exception
                        # when pattern is specified, we only skip when pattern is matched
                        continue
                    else:
                        print("Maximum number of attempts is reached or pattern is not matched, no more retrying...")
                        raise e

        # Override signature
        # otherwise pytest.mark.parameterize will raise the following error:
        # function does not use argument xxx
        sig = signature(func)
        _run_until_success.__signature__ = sig

        return _run_until_success

    return _wrapper
1 change: 1 addition & 0 deletions tests/kernels/test_layer_norm.py
@@ -23,6 +23,7 @@


# @pytest.mark.skipif(available_gpus() < 1, reason="Testing test_fused_layer_norm requires at least 1 gpus")
+@pytest.mark.fa2
@pytest.mark.parametrize(
    "hidden_size",
    [1024, 1025],  # fused layer norm supports 1024 as hidden size but not 1025