
NEDeconvolutionLayer f16 performance issue #1129

Open
alvoron opened this issue Jul 25, 2024 · 4 comments

Comments

alvoron commented Jul 25, 2024

NEDeconvolutionLayer::run() with f16 tensors takes more time than the same run() with f32 tensors.
On Ampere, the f32 version takes ~66 ms, while the f16 version takes ~80 ms.

ACL build command:

scons arch=armv8.6-a neon=1 os=linux opencl=0 build=native -j 64 Werror=false validation_tests=1 fixed_format_kernels=1 multi_isa=1 openmp=0 cppthreads=1

Reproducer build command

g++ -O2 -g -I./ComputeLibrary -I./ComputeLibrary/include ~/avoron/acl_deconv.cpp -o bug -L./ComputeLibrary/build/ -larm_compute ./ComputeLibrary/build/tests/AssetsLibrary.o ./ComputeLibrary/build/tests/RawTensor.o ./ComputeLibrary/build/tests/framework/Exceptions.o -std=c++17

Reproducer run commands:

LD_LIBRARY_PATH=ComputeLibrary/build ./bug
LD_LIBRARY_PATH=ComputeLibrary/build ./bug 1

The first command uses f32 tensors; the second uses f16 tensors.

Reproducer:

#include "arm_compute/core/Error.h"
#include "arm_compute/core/TensorShape.h"
#include "arm_compute/runtime/Tensor.h"
#include "arm_compute/runtime/NEON/NEFunctions.h"
#include "tests/Utils.h"
#include "tests/NEON/Accessor.h"
#include "tests/AssetsLibrary.h"

#include <iostream>
#include <vector>
#include <chrono>
#include <random>   // std::random_device, std::uniform_real_distribution
#include <cstdlib>  // exit

using namespace arm_compute;
using namespace arm_compute::test;


int main(int argc, char *argv[]) {

    PadStrideInfo deconv_info = PadStrideInfo(3, 3, 0, 0, 0, 0, DimensionRoundingType::FLOOR);

    //f32 if no argument passed; f16 if any argument passed
    DataType dt = (argc == 1) ? DataType::F32 : DataType::F16;

    TensorInfo srcTensorInfo = TensorInfo(TensorShape(36, 640, 360, 1), 1, dt, DataLayout::NHWC);
    TensorInfo weiTensorInfo = TensorInfo(TensorShape(36, 3, 3, 4), 1, dt, DataLayout::NHWC);
    TensorInfo dstTensorInfo = TensorInfo(TensorShape(4, 1920, 1080, 1), 1, dt, DataLayout::NHWC);

    auto status = NEDeconvolutionLayer::validate(&srcTensorInfo, &weiTensorInfo, nullptr, &dstTensorInfo, deconv_info);
    if(status.error_code() != ErrorCode::OK) {
      std::cout << "ERROR: " << status.error_description().c_str() << std::endl;
      exit(1);
    }
    std::cout << "PASSED VALIDATION" << std::endl;

    Tensor srcTensor;
    Tensor weiTensor;
    Tensor dstTensor;

    srcTensor.allocator()->init(srcTensorInfo);
    weiTensor.allocator()->init(weiTensorInfo);
    dstTensor.allocator()->init(dstTensorInfo);
  
    NEDeconvolutionLayer deconv;
    deconv.configure(&srcTensor, &weiTensor, nullptr, &dstTensor, deconv_info);
    std::cout << "PASSED CONFIGURATION" << std::endl;

    srcTensor.allocator()->allocate();
    weiTensor.allocator()->allocate();
    dstTensor.allocator()->allocate();

    AssetsLibrary library(".", std::random_device()());
    std::uniform_real_distribution<> distribution{ 0.0f, 100.0f };
    library.fill(Accessor(srcTensor), distribution, 0);
    library.fill(Accessor(weiTensor), distribution, 0);

    //warm-up
    deconv.run();

    std::chrono::high_resolution_clock::time_point start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 100; i++) deconv.run();
    std::chrono::high_resolution_clock::time_point finish = std::chrono::high_resolution_clock::now();
    std::cout << "PASSED RUN: " << std::chrono::duration_cast<std::chrono::microseconds>(finish - start).count() / 100 << std::endl;

    srcTensor.allocator()->free();
    weiTensor.allocator()->free();
    dstTensor.allocator()->free();

    return 0;
}
@morgolock

Hi @alvoron

Thanks. I can reproduce the problem. FP32 performance for this specific configuration is better than FP16. It will require further investigation.

@morgolock

Hi @alvoron

The following patch solves the problem.

Make sure that in your test you enable fast_math when calling NEDeconvolutionLayer::configure().

Apply the following change in your test:

NEDeconvolutionLayer deconv;
deconv.configure(&srcTensor, &weiTensor, nullptr, &dstTensor, deconv_info, /* enable fast_math */ true);
std::cout << "PASSED CONFIGURATION" << std::endl;
[user@test_deconv]$ LD_LIBRARY_PATH=../ComputeLibrary/build/:$LD_LIBRARY_PATH ./test 1
F16
PASSED VALIDATION
PASSED CONFIGURATION
PASSED RUN: 151639
[user@test_deconv]$ LD_LIBRARY_PATH=../ComputeLibrary/build/:$LD_LIBRARY_PATH ./test 
F32
PASSED VALIDATION
PASSED CONFIGURATION
PASSED RUN: 221537

Hope this helps.
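For consistency, the same flag can also be passed to the validation step of the reproducer, so that validate() checks the same code path that configure() selects. This is a sketch; the assumption that NEDeconvolutionLayer::validate() mirrors the configure() overload with a trailing enable_fast_math parameter is based on the Compute Library headers and should be checked against the ACL version in use.

```cpp
// Hedged sketch: pass enable_fast_math to both validate() and configure(),
// assuming validate() exposes the same trailing flag as configure().
auto status = NEDeconvolutionLayer::validate(&srcTensorInfo, &weiTensorInfo,
                                             nullptr, &dstTensorInfo,
                                             deconv_info,
                                             /* enable_fast_math */ true);

NEDeconvolutionLayer deconv;
deconv.configure(&srcTensor, &weiTensor, nullptr, &dstTensor, deconv_info,
                 /* enable_fast_math */ true);
```

With fast math enabled, f16 kernels are allowed to use faster implementations that may trade a small amount of accuracy for speed, which is what closes the gap here.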


alvoron commented Sep 10, 2024

@morgolock thank you for the patch, it works for me as well.
However, the difference between f32 and f16 on my side is not as large as yours: I get 65-67 ms for f32 and 60-62 ms for f16.
Which machine did you use to get the results you shared above?

@morgolock

Hi @alvoron

I ran this on Neoverse N1.

I built the library with scons -j32 Werror=0 debug=0 neon=1 opencl=0 embed_kernels=0 validation_tests=1 os=linux arch=armv8a build=native multi_isa=1 fixed_format_kernels=1 openmp=1 cppthreads=0 asserts=0 logging=0 -j8

Make sure you use openmp=1 cppthreads=0

Hope this helps
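The scheduler backend (OpenMP vs. C++ threads) is fixed at build time by the openmp/cppthreads scons options, but the number of worker threads can still be pinned at runtime before calling run(). A minimal sketch, assuming the standard arm_compute::Scheduler API; the thread count of 8 is an arbitrary example:

```cpp
#include "arm_compute/runtime/Scheduler.h"

// Pin the worker-thread count of the active scheduler backend
// before the benchmark loop, so timings are comparable across builds.
arm_compute::Scheduler::get().set_num_threads(8); // example value
```

This can help when comparing numbers across machines with different core counts, since by default the scheduler typically uses all available cores.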

@morgolock morgolock added this to the v24.09 milestone Sep 12, 2024