
NEDeconvolutionLayer f16 performance issue #1129

Open
alvoron opened this issue Jul 25, 2024 · 4 comments

Comments

alvoron commented Jul 25, 2024

NEDeconvolutionLayer::run() with f16 tensors takes more time than the same run() with f32 tensors.
On Ampere, the f32 version takes ~66 ms, while the f16 version takes ~80 ms.

ACL build command:

scons arch=armv8.6-a neon=1 os=linux opencl=0 build=native -j 64 Werror=false validation_tests=1 fixed_format_kernels=1 multi_isa=1 openmp=0 cppthreads=1

Reproducer build command

g++ -O2 -g -I./ComputeLibrary -I./ComputeLibrary/include ~/avoron/acl_deconv.cpp -o bug -L./ComputeLibrary/build/ -larm_compute ./ComputeLibrary/build/tests/AssetsLibrary.o ./ComputeLibrary/build/tests/RawTensor.o ./ComputeLibrary/build/tests/framework/Exceptions.o -std=c++17

Reproducer run commands:

LD_LIBRARY_PATH=ComputeLibrary/build ./bug
LD_LIBRARY_PATH=ComputeLibrary/build ./bug 1

The first command uses f32 tensors; the second uses f16 tensors.

Reproducer:

#include "arm_compute/core/Error.h"
#include "arm_compute/core/TensorShape.h"
#include "arm_compute/runtime/Tensor.h"
#include "arm_compute/runtime/NEON/NEFunctions.h"
#include "tests/Utils.h"
#include "tests/NEON/Accessor.h"
#include "tests/AssetsLibrary.h"

#include <iostream>
#include <vector>
#include <chrono>
#include <random>   // std::random_device, std::uniform_real_distribution
#include <cstdlib>  // exit

using namespace arm_compute;
using namespace arm_compute::test;


int main(int argc, char *argv[]) {

    PadStrideInfo deconv_info = PadStrideInfo(3, 3, 0, 0, 0, 0, DimensionRoundingType::FLOOR);

    //f32 if no argument passed; f16 if any argument passed
    DataType dt = (argc == 1) ? DataType::F32 : DataType::F16;

    TensorInfo srcTensorInfo = TensorInfo(TensorShape(36, 640, 360, 1), 1, dt, DataLayout::NHWC);
    TensorInfo weiTensorInfo = TensorInfo(TensorShape(36, 3, 3, 4), 1, dt, DataLayout::NHWC);
    TensorInfo dstTensorInfo = TensorInfo(TensorShape(4, 1920, 1080, 1), 1, dt, DataLayout::NHWC);

    auto status = NEDeconvolutionLayer::validate(&srcTensorInfo, &weiTensorInfo, nullptr, &dstTensorInfo, deconv_info);
    if(status.error_code() != ErrorCode::OK) {
      std::cout << "ERROR: " << status.error_description().c_str() << std::endl;
      exit(1);
    }
    std::cout << "PASSED VALIDATION" << std::endl;

    Tensor srcTensor;
    Tensor weiTensor;
    Tensor dstTensor;

    srcTensor.allocator()->init(srcTensorInfo);
    weiTensor.allocator()->init(weiTensorInfo);
    dstTensor.allocator()->init(dstTensorInfo);
  
    NEDeconvolutionLayer deconv;
    deconv.configure(&srcTensor, &weiTensor, nullptr, &dstTensor, deconv_info);
    std::cout << "PASSED CONFIGURATION" << std::endl;

    srcTensor.allocator()->allocate();
    weiTensor.allocator()->allocate();
    dstTensor.allocator()->allocate();

    AssetsLibrary library(".", std::random_device()());
    std::uniform_real_distribution<> distribution{ 0.0f, 100.0f };
    library.fill(Accessor(srcTensor), distribution, 0);
    library.fill(Accessor(weiTensor), distribution, 0);

    //warm-up
    deconv.run();

    std::chrono::high_resolution_clock::time_point start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 100; i++) deconv.run();
    std::chrono::high_resolution_clock::time_point finish = std::chrono::high_resolution_clock::now();
    std::cout << "PASSED RUN: " << std::chrono::duration_cast<std::chrono::microseconds>(finish - start).count() / 100 << std::endl;

    srcTensor.allocator()->free();
    weiTensor.allocator()->free();
    dstTensor.allocator()->free();

    return 0;
}
@morgolock

Hi @alvoron

Thanks. I can reproduce the problem. FP32 performance for this specific configuration is better than FP16. It will require further investigation.

@morgolock

Hi @alvoron

The following patch solves the problem.

Make sure that in your test you enable fast_math when calling NEDeconvolutionLayer::configure().

Apply the following change in your test:

NEDeconvolutionLayer deconv;
deconv.configure(&srcTensor, &weiTensor, nullptr, &dstTensor, deconv_info, /* enable fast_math */ true);
std::cout << "PASSED CONFIGURATION" << std::endl;
[user@test_deconv]$ LD_LIBRARY_PATH=../ComputeLibrary/build/:$LD_LIBRARY_PATH ./test 1
F16
PASSED VALIDATION
PASSED CONFIGURATION
PASSED RUN: 151639
[user@test_deconv]$ LD_LIBRARY_PATH=../ComputeLibrary/build/:$LD_LIBRARY_PATH ./test 
F32
PASSED VALIDATION
PASSED CONFIGURATION
PASSED RUN: 221537

Hope this helps.
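For consistency, the same flag can also be passed to the validation step of the reproducer, so that validate() checks the same code path that configure() selects. This is a sketch; the assumption that NEDeconvolutionLayer::validate() mirrors the configure() overload with a trailing enable_fast_math parameter is based on the Compute Library headers and should be checked against the ACL version in use.

```cpp
// Hedged sketch: pass enable_fast_math to both validate() and configure(),
// assuming validate() exposes the same trailing flag as configure().
auto status = NEDeconvolutionLayer::validate(&srcTensorInfo, &weiTensorInfo,
                                             nullptr, &dstTensorInfo,
                                             deconv_info,
                                             /* enable_fast_math */ true);

NEDeconvolutionLayer deconv;
deconv.configure(&srcTensor, &weiTensor, nullptr, &dstTensor, deconv_info,
                 /* enable_fast_math */ true);
```

With fast math enabled, f16 kernels are allowed to use faster implementations that may trade a small amount of accuracy for speed, which is what closes the gap here.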


alvoron commented Sep 10, 2024

@morgolock thank you for the patch, it works for me as well.
However, the difference between f32 and f16 on my side is not as large as yours: I get 65-67 ms for f32 and 60-62 ms for f16.
Which machine did you use to get the results you shared above?

@morgolock

Hi @alvoron

I ran this on Neoverse N1.

I built the library with scons -j32 Werror=0 debug=0 neon=1 opencl=0 embed_kernels=0 validation_tests=1 os=linux arch=armv8a build=native multi_isa=1 fixed_format_kernels=1 openmp=1 cppthreads=0 asserts=0 logging=0 -j8

Make sure you use openmp=1 cppthreads=0

Hope this helps
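The scheduler backend (OpenMP vs. C++ threads) is fixed at build time by the openmp/cppthreads scons options, but the number of worker threads can still be pinned at runtime before calling run(). A minimal sketch, assuming the standard arm_compute::Scheduler API; the thread count of 8 is an arbitrary example:

```cpp
#include "arm_compute/runtime/Scheduler.h"

// Pin the worker-thread count of the active scheduler backend
// before the benchmark loop, so timings are comparable across builds.
arm_compute::Scheduler::get().set_num_threads(8); // example value
```

This can help when comparing numbers across machines with different core counts, since by default the scheduler typically uses all available cores.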

@morgolock morgolock added this to the v24.09 milestone Sep 12, 2024