Skip to content

Commit

Permalink
Develop stream 2024-09-12 (#404)
Browse files Browse the repository at this point in the history
* Restore accidentally removed entries in .clang-format

There were duplicate BeforeElse and BeforeCatch but two merges removed them twice making formatting weird.

* Add inplace overloads of DeviceScan functions, extend tests

* Add inplace DeviceSelect::If and Flagged, extend tests

* Add DeviceReduce::TransformReduce

* Add DeviceSelect::UniqueByKey with equality_op, support large indices

* Extract single_index_iterator from scan tests

* Add test with large indices for UniqueByKey

* Fix compilation error on nvcc and CCCL 2.4.0 (std::ostream << float is not defined).

<iosfwd> does not declare operator<<(float).
It's in <ostream>.

* Added foreach to hipcub

* Add CubVector to hipcub for cuda parity

* Add hibcub vector tests

* Fixed vector test on cuda

* Added CubVector to CHANGELOG

* implement unified get_sizes()

* Added large sizes test for device_radix_sort

* Added extra test to device_segmented_radix_sort

* Add unsigned long long int and size_t for constructor half and bfloat

* Cleanup diagnostic handling for suppressing depcrecated warnings

* Adding hipgraph tests

* Changes for review

* ci: set up sccache

* Clang format fix

* Add doxyspinx to requirements to fix docs error

---------

Co-authored-by: Anton Gorenko <anton@streamhpc.com>
Co-authored-by: Jaap Blok <jaap@streamhpc.com>
Co-authored-by: Robin Voetter <robin@streamhpc.com>
  • Loading branch information
4 people authored Oct 28, 2024
1 parent 84d9c1c commit 07e6201
Show file tree
Hide file tree
Showing 47 changed files with 7,139 additions and 3,163 deletions.
2 changes: 2 additions & 0 deletions .clang-format
Original file line number Diff line number Diff line change
Expand Up @@ -58,6 +58,8 @@ BraceWrapping:
AfterStruct: true
AfterUnion: true
AfterExternBlock: false
BeforeCatch: true
BeforeElse: true
BeforeLambdaBody: true
BeforeWhile: true
IndentBraces: false
Expand Down
17 changes: 17 additions & 0 deletions .gitlab-ci.yml
Original file line number Diff line number Diff line change
Expand Up @@ -30,6 +30,7 @@ include:
- /deps-format.yaml
- /deps-rocm.yaml
- /deps-nvcc.yaml
- /deps-compiler-acceleration.yaml
- /gpus-rocm.yaml
- /gpus-nvcc.yaml
- /rules.yaml
Expand Down Expand Up @@ -66,9 +67,11 @@ copyright-date:
extends:
- .deps:rocm
- .deps:cmake-minimum
- .deps:compiler-acceleration
before_script:
- !reference [".deps:rocm", before_script]
- !reference [".deps:cmake-minimum", before_script]
- !reference [".deps:compiler-acceleration", before_script]
# Install rocPRIM from git
- BRANCH_NAME="$ROCPRIM_GIT_BRANCH"
- if [ "$CI_COMMIT_BRANCH" = develop -o "$CI_COMMIT_BRANCH" = master ]; then BRANCH_NAME="$CI_COMMIT_BRANCH"
Expand All @@ -83,6 +86,8 @@ copyright-date:
-D BUILD_EXAMPLE=OFF
-D ROCM_DEP_ROCMCORE=OFF
-D GPU_TARGETS="$GPU_TARGETS"
-D CMAKE_C_COMPILER_LAUNCHER=phc_sccache_c
-D CMAKE_CXX_COMPILER_LAUNCHER=phc_sccache_cxx
-B $CI_PROJECT_DIR/rocPRIM/build
-S $CI_PROJECT_DIR/rocPRIM
- cd $CI_PROJECT_DIR/rocPRIM/build
Expand All @@ -109,6 +114,8 @@ build:rocm:
-D GPU_TARGETS="$GPU_TARGETS"
-D GPU_TEST_TARGETS="$GPU_TARGETS"
-D ROCM_SYMLINK_LIBS=OFF
-D CMAKE_C_COMPILER_LAUNCHER=phc_sccache_c
-D CMAKE_CXX_COMPILER_LAUNCHER=phc_sccache_cxx
-B $CI_PROJECT_DIR/build
-S $CI_PROJECT_DIR
- cmake --build $CI_PROJECT_DIR/build
Expand Down Expand Up @@ -144,6 +151,8 @@ build:rocm-benchmark:
-D CMAKE_BUILD_TYPE=Release
-D BUILD_BENCHMARK=ON
-D GPU_TARGETS="$GPU_TARGETS"
-D CMAKE_C_COMPILER_LAUNCHER=phc_sccache_c
-D CMAKE_CXX_COMPILER_LAUNCHER=phc_sccache_cxx
-B $CI_PROJECT_DIR/build
-S $CI_PROJECT_DIR
- cmake --build $CI_PROJECT_DIR/build
Expand Down Expand Up @@ -283,9 +292,11 @@ test:rocm_install:
- .deps:nvcc
- .gpus:nvcc-gpus
- .deps:cmake-minimum
- .deps:compiler-acceleration
before_script:
- !reference [".deps:nvcc", before_script]
- !reference [".deps:cmake-minimum", before_script]
- !reference [".deps:compiler-acceleration", before_script]

build:nvcc:
stage: build
Expand All @@ -304,6 +315,9 @@ build:nvcc:
-D BUILD_EXAMPLE=ON
-D NVGPU_TARGETS="$GPU_TARGETS"
-D ROCM_SYMLINK_LIBS=OFF
-D CMAKE_C_COMPILER_LAUNCHER=phc_sccache_c
-D CMAKE_CXX_COMPILER_LAUNCHER=phc_sccache_cxx
-D CMAKE_CUDA_COMPILER_LAUNCHER=phc_sccache_cuda
-B $CI_PROJECT_DIR/build
-S $CI_PROJECT_DIR
- cmake --build $CI_PROJECT_DIR/build
Expand Down Expand Up @@ -337,6 +351,9 @@ build:nvcc-benchmark:
-D CMAKE_BUILD_TYPE=Release
-D BUILD_BENCHMARK=ON
-D NVGPU_TARGETS="$GPU_TARGETS"
-D CMAKE_C_COMPILER_LAUNCHER=phc_sccache_c
-D CMAKE_CXX_COMPILER_LAUNCHER=phc_sccache_cxx
-D CMAKE_CUDA_COMPILER_LAUNCHER=phc_sccache_cuda
-B $CI_PROJECT_DIR/build
-S $CI_PROJECT_DIR
- cmake --build $CI_PROJECT_DIR/build
Expand Down
14 changes: 13 additions & 1 deletion CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,12 @@
Documentation for hipCUB is available at
[https://rocm.docs.amd.com/projects/hipCUB/en/latest/](https://rocm.docs.amd.com/projects/hipCUB/en/latest/).

## (Unreleased) hipCUB-x.x.x for ROCm 6.4.0

### Added
* Added `ForEach`, `ForEachN`, `ForEachCopy`, `ForEachCopyN` and `Bulk` functions to have parity with CUB.
* Added the `hipcub::CubVector` type for CUB parity.

## (Unreleased) hipCUB-3.3.0 for ROCm 6.3.0

### Fixed
Expand All @@ -12,8 +18,14 @@ Documentation for hipCUB is available at
### Added
* Add support for large indices in `hipcub::DeviceSegmentedReduce::*`. rocPRIM's backend provides support for all reduce variants, but CUB's does not have support yet for `DeviceSegmentedReduce::Arg*`, so large indices support has been excluded for these as well in hipCUB.
* Add -t smoke option in rtest.py. It will run a subset of tests such that the total test time is in 5 minutes. Use python3 ./rtest.py --test smoke or python3 ./rtest.py -t smoke to execute smoke test.
* Add inplace overloads of `DeviceScan` functions.
* Add inplace overloads of `DeviceSelect::Flagged` and `DeviceSelect::If`.
* Add `DeviceReduce::TransformReduce`.
* Add `DeviceSelect::UniqueByKey` overload with `equality_op`.
* Add support for large indices in `DeviceSelect::UniqueByKey`.

### Changed
* The NVIDIA backend now requires CUB, Thrust and libcu++ 2.3.2. If it is not found it will be downloaded from the NVIDIA CCCL repository.
* The NVIDIA backend now requires CUB, Thrust and libcu++ 2.4.0. If it is not found it will be downloaded from the NVIDIA CCCL repository.

## (Unreleased) hipCUB-3.2.0 for ROCm 6.2.0

Expand Down
2 changes: 1 addition & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -45,7 +45,7 @@ python3 -m http.server
* Requires CMake 3.16.9 or later
* For NVIDIA GPUs:
* CUDA Toolkit
* CCCL library (>= 2.3.2)
* CCCL library (>= 2.4.0)
* Automatically downloaded and built by the CMake script
* Requires CMake 3.15.0 or later
* Python 3.6 or higher (for HIP on Windows only; this is only required for install scripts)
Expand Down
1 change: 1 addition & 0 deletions benchmark/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -82,6 +82,7 @@ add_hipcub_benchmark(benchmark_block_shuffle.cpp)
add_hipcub_benchmark(benchmark_device_adjacent_difference.cpp)
add_hipcub_benchmark(benchmark_device_batch_copy.cpp)
add_hipcub_benchmark(benchmark_device_batch_memcpy.cpp)
add_hipcub_benchmark(benchmark_device_for.cpp)
add_hipcub_benchmark(benchmark_device_histogram.cpp)
add_hipcub_benchmark(benchmark_device_memory.cpp)
add_hipcub_benchmark(benchmark_device_merge_sort.cpp)
Expand Down
160 changes: 160 additions & 0 deletions benchmark/benchmark_device_for.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,160 @@
// MIT License
//
// Copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.

// CUB's implementation of single_pass_scan_operators has maybe uninitialized parameters,
// disable the warning because all warnings are threated as errors:

#include "common_benchmark_header.hpp"

// HIP API
#include "hipcub/device/device_for.hpp"

#ifndef DEFAULT_N
const size_t DEFAULT_N = 1024 * 1024 * 32;
#endif

const unsigned int batch_size = 10;
const unsigned int warmup_size = 5;

template<class T>
struct op_t
{
unsigned int* d_count;

HIPCUB_DEVICE
void operator()(T i)
{
// The data is non zero so atomic will never be activated.
if(i == 0)
{
atomicAdd(d_count, 1);
}
}
};

template<class Value>
void run_benchmark(benchmark::State& state, hipStream_t stream, size_t size)
{
using T = Value;

// Generate data
std::vector<T> values_input(size, 4);

T* d_input;
HIP_CHECK(hipMalloc(&d_input, size * sizeof(T)));
HIP_CHECK(hipMemcpy(d_input, values_input.data(), size * sizeof(T), hipMemcpyHostToDevice));

unsigned int* d_count;
HIP_CHECK(hipMalloc(&d_count, sizeof(T)));
HIP_CHECK(hipMemset(d_count, 0, sizeof(T)));
op_t<T> device_op{d_count};

// Warm-up
for(size_t i = 0; i < warmup_size; i++)
{
HIP_CHECK(hipcub::ForEach(d_input, d_input + size, device_op, stream));
}
HIP_CHECK(hipDeviceSynchronize());

for(auto _ : state)
{
auto start = std::chrono::high_resolution_clock::now();

for(size_t i = 0; i < batch_size; i++)
{
HIP_CHECK(hipcub::ForEach(d_input, d_input + size, device_op, stream));
}
HIP_CHECK(hipStreamSynchronize(stream));

auto end = std::chrono::high_resolution_clock::now();
auto elapsed_seconds
= std::chrono::duration_cast<std::chrono::duration<double>>(end - start);
state.SetIterationTime(elapsed_seconds.count());
}
state.SetBytesProcessed(state.iterations() * batch_size * size * sizeof(T));
state.SetItemsProcessed(state.iterations() * batch_size * size);

HIP_CHECK(hipFree(d_count));
HIP_CHECK(hipFree(d_input));
}

#define CREATE_BENCHMARK(Value) \
benchmark::RegisterBenchmark(("for_each<Datatype:" #Value ">"), \
&run_benchmark<Value>, \
stream, \
size)

int main(int argc, char* argv[])
{
cli::Parser parser(argc, argv);
parser.set_optional<size_t>("size", "size", DEFAULT_N, "number of values");
parser.set_optional<int>("trials", "trials", -1, "number of iterations");
parser.run_and_exit_if_error();

// Parse argv
benchmark::Initialize(&argc, argv);
const size_t size = parser.get<size_t>("size");
const int trials = parser.get<int>("trials");

std::cout << "benchmark_device_reduce_by_key" << std::endl;

// HIP
hipStream_t stream = 0; // default
hipDeviceProp_t devProp;
int device_id = 0;
HIP_CHECK(hipGetDevice(&device_id));
HIP_CHECK(hipGetDeviceProperties(&devProp, device_id));
std::cout << "[HIP] Device name: " << devProp.name << std::endl;

using custom_double2 = benchmark_utils::custom_type<double, double>;

// Add benchmarks
std::vector<benchmark::internal::Benchmark*> benchmarks = {
CREATE_BENCHMARK(float),
CREATE_BENCHMARK(double),
CREATE_BENCHMARK(custom_double2),
CREATE_BENCHMARK(int8_t),
CREATE_BENCHMARK(float),
CREATE_BENCHMARK(double),
CREATE_BENCHMARK(long long),
};

// Use manual timing
for(auto& b : benchmarks)
{
b->UseManualTime();
b->Unit(benchmark::kMillisecond);
}

// Force number of iterations
if(trials > 0)
{
for(auto& b : benchmarks)
{
b->Iterations(trials);
}
}

// Run benchmarks
benchmark::RunSpecifiedBenchmarks();
return 0;
}
2 changes: 1 addition & 1 deletion cmake/Dependencies.cmake
Original file line number Diff line number Diff line change
Expand Up @@ -139,7 +139,7 @@ endif(USER_BUILD_BENCHMARK)

# CUB (only for CUDA platform)
if(HIP_COMPILER STREQUAL "nvcc")
set(CCCL_MINIMUM_VERSION 2.3.2)
set(CCCL_MINIMUM_VERSION 2.4.0)
if(NOT DOWNLOAD_CUB)
find_package(CUB ${CCCL_MINIMUM_VERSION} CONFIG)
find_package(Thrust ${CCCL_MINIMUM_VERSION} CONFIG)
Expand Down
2 changes: 2 additions & 0 deletions docs/sphinx/requirements.txt
Original file line number Diff line number Diff line change
Expand Up @@ -36,6 +36,8 @@ docutils==0.21.2
# myst-parser
# pydata-sphinx-theme
# sphinx
doxysphinx==3.3.8
# via rocm-docs-core
fastjsonschema==2.19.1
# via rocm-docs-core
gitdb==4.0.11
Expand Down
Loading

0 comments on commit 07e6201

Please sign in to comment.