Why is 1D convolution on CPU via NEConvolutionLayer so slow? #1119
Hi @poltomo

The first iteration is costly because ACL performs various transformations to the input and the weights so that the computation can then be done faster. I'd suggest you try two things:

1. Exclude the first call to run() from your timing and treat it as a warm-up.
2. Make sure your tensors use the NHWC data layout.

Hope this helps
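For illustration, timing that way looks roughly like the sketch below (an assumed outline with placeholder shapes mirroring the 1<<20-wide, length-3 case discussed in this thread, not code taken from this issue):

#include "arm_compute/runtime/NEON/NEFunctions.h"
#include "arm_compute/runtime/Tensor.h"
#include <chrono>
#include <iostream>

using namespace arm_compute;

int main()
{
    Tensor src, weights, dst;
    // NHWC, F32; a 1D signal expressed as a 1xW image with a single channel (placeholder shapes).
    src.allocator()->init(TensorInfo(TensorShape(1U, 1U << 20, 1U), 1, DataType::F32, DataLayout::NHWC));
    weights.allocator()->init(TensorInfo(TensorShape(1U, 3U, 1U), 1, DataType::F32, DataLayout::NHWC));
    dst.allocator()->init(TensorInfo(TensorShape(1U, (1U << 20) - 2U, 1U), 1, DataType::F32, DataLayout::NHWC));

    NEConvolutionLayer conv;
    conv.configure(&src, &weights, nullptr, &dst, PadStrideInfo(1, 1, 0, 0));

    src.allocator()->allocate();
    weights.allocator()->allocate();
    dst.allocator()->allocate();
    // ... fill src and weights here ...

    conv.run(); // warm-up: the one-off input/weight transformations happen on the first call

    auto t0 = std::chrono::steady_clock::now();
    conv.run(); // steady-state iteration, the one worth timing
    auto t1 = std::chrono::steady_clock::now();
    std::cout << "time " << std::chrono::duration<double>(t1 - t0).count() << "\n";
    return 0;
}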
Hi, @morgolock

1. I tried the warmup call, but it is still 10x slower than my implementation. I got 0.017 seconds.
2. I think I am using NHWC. Could you confirm that my tensor initializations are actually NHWC?

Tensor conv_input;
Tensor conv_weight;
Tensor conv_bias;
Tensor conv_output;
const int N = 1;
const int Hi = 1;
const int Wi = 1<<20;
const int Ci = 1;
const int Hf = 1;
const int Wf = 3;
const int Ho = Hi - Hf + 1;
const int Wo = Wi - Wf + 1;
const int Co = 1;
cout << "f_n = " << Wi << "\ng_n = " << Wf << "\nh_n = " << Wo << "\n";
conv_input.allocator()->init(TensorInfo(TensorShape(Ci, Wi, Hi), 1, DataType::F32, DataLayout::NHWC));
conv_weight.allocator()->init(TensorInfo(TensorShape(Hf, Wf, Ci), 1, DataType::F32, DataLayout::NHWC));
// conv_bias.allocator()->init(TensorInfo(TensorShape(Co), 1, DataType::F32));
conv_output.allocator()->init(TensorInfo(TensorShape(Co, Wo, Ho), 1, DataType::F32, DataLayout::NHWC));

Ci is input channels, Wi is input width, Wf is filter width, and so on.
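For reference (my reading of ACL's layout conventions, not something confirmed in this thread): with DataLayout::NHWC, TensorShape lists dimensions fastest-first, i.e. (C, W, H, N), so the (Ci, Wi, Hi) and (Co, Wo, Ho) shapes above are consistent with NHWC, while the convolution weights would normally be shaped (Ci, Wf, Hf, Co). A hedged example of that weight initialization:

// Assumed NHWC weight ordering: input channels fastest, then kernel width, kernel height, output channels.
conv_weight.allocator()->init(TensorInfo(TensorShape(Ci, Wf, Hf, Co), 1, DataType::F32, DataLayout::NHWC));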
Hi, @morgolock

I have 7 questions (see numbers).

1. High-level question: how do I get the exact inference time of ARM Compute Library's convolution implementations, minus any runtime/scheduler overhead? I found the implementation I want to benchmark here. How do I benchmark this alone? What is the window reference argument for?

I built ARM Compute Library just for Neon support. Here's how I built the library.

2. Please tell me if there are any flags I am missing out on. I want to be fair to this library: no OpenMP, OpenCL or CPPThreads (a typical invocation is sketched after this list).

3. Please let me know if the build configuration is not being fair to ARM Compute Library.

Here's my benchmark for ARM Compute Library.

Important questions:

4. Is
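Regarding (2) and (3), a typical scons invocation for that kind of CPU-only build looks roughly like this (an assumed example; the exact option set depends on the ACL version and the Android NDK setup):

# assumed scons invocation: Neon only, no OpenCL/OpenMP/CPPThreads, cross-compiled for Android arm64
scons -j8 os=android arch=arm64-v8a build=cross_compile neon=1 opencl=0 openmp=0 cppthreads=0 examples=0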
Hi @poltomo Thanks, we'll have a look at the performance for this specific configuration.
There is no easy way to do this; I would suggest having a look at our benchmark graph examples.
You'll get the best performance out of ACL if you build with
Yes, a lot is happening under the hood in NEConvolutionLayer. From the algorithm point of view, depending on the workload configuration (shapes, data_types, layouts, etc.), various transformations are used to prepare the data in memory in an optimal way to achieve maximum performance in the computation. This is the reason why the first iteration is costly and slower than the next ones. The options you use to build ACL will also affect the performance, enabling one of the schedulers (
ACL has been designed to efficiently run the most common workloads present in major models, like the ones you can see in our graph examples. Could you please let us know what the use case is for this specific shape and configuration you are running? Is this from a concrete ML model? I assume when you say optimal you mean from the performance perspective? What is the actual device and version of Android you are using to run your test?

Hope this helps
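As a small illustration of the scheduler point (a sketch with an arbitrary thread count, not code from this thread):

#include "arm_compute/runtime/Scheduler.h"

int main()
{
    // Example only: ask the active scheduler for 4 worker threads.
    // This takes effect with a cppthreads or openmp build; a single-threaded build ignores it.
    arm_compute::Scheduler::get().set_num_threads(4);
    // ... configure and run NEConvolutionLayer as usual ...
    return 0;
}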
Hi, I found out how to call the direct convolution kernel directly, without the runtime:

#include "src/cpu/kernels/directconv2d/nhwc/neon/fp32.cpp"
...
Window win = calculate_max_window(*conv_output.info(), Steps());
arm_compute::cpu::kernels::neon_fp32_nhwc_directconv2d(win, &conv_input, &conv_weight, &conv_output, PadStrideInfo(1,1,0,0));

Thankfully, this works in the 1D case. It's about 20 to 30x slower than my 1D convolution implementation, and it's slow for OpenMP builds as well as Neon-only builds. I guess that's alright since 2d is in the name of the kernel.

I'd be happy to just add the op to the library. How do I do that? I think this library will have to start fragmenting convolution implementations; there's just too much performance potential at stake, and it can be done without making things messy. NEConvolutionLayer already chooses an implementation for you, so why not explicitly implement popular convs like 3x3 stride 1, Winograd 3x3 and so on?
Hi @poltomo

Please see our contribution guide for more information on how to add a new operator.

ConvLayer has different convolution methods and there is a heuristic in place which selects the best method based on the workload configuration (shapes, types, layout, etc.). You would need to add your 1d kernel and make the necessary changes so that the heuristic selects it for this configuration.

Hope this helps
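For reference, the method the heuristic picks for a given configuration can be queried up front; a short sketch reusing the tensor infos from the snippet earlier in the thread (assumed usage, not code from this issue):

// Ask NEConvolutionLayer which ConvolutionMethod its heuristic would select for this workload.
ConvolutionMethod method = NEConvolutionLayer::get_convolution_method(
    conv_input.info(), conv_weight.info(), conv_output.info(), PadStrideInfo(1, 1, 0, 0));
std::cout << static_cast<int>(method) << "\n"; // 0 corresponds to ConvolutionMethod::GEMM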
Benchmark details: 1D convolution of a 2^16-wide input signal with a length-3 kernel. Both input and output channels are 1. There is no bias term.
Here's my benchmark (benchmark_acl.cpp) and its output:
The 0 in the output means that the first enum element, GEMM, is being used.
Convolution of 1, 2, 3, ... with 1, 2, 3 gives 14, 20, 26, 32, 38, ... (e.g. 1·1 + 2·2 + 3·3 = 14 and 2·1 + 3·2 + 4·3 = 20), so the correct answer is being computed.
Why is it so slow?
For reference, I made my own 1D direct convolution implementation and achieved:
time 0.00166391
This was without OpenMP multithreading, just a plain implementation with compiler optimizations.
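For illustration only, this is roughly the kind of scalar reference loop meant here (not the actual benchmark code from this issue):

#include <cstddef>
#include <vector>

// Plain 1D "valid" convolution in correlation form, matching the 14, 20, 26, ... example above.
std::vector<float> conv1d_valid(const std::vector<float> &x, const std::vector<float> &w)
{
    const std::size_t n_out = x.size() - w.size() + 1;
    std::vector<float> y(n_out, 0.0f);
    for (std::size_t i = 0; i < n_out; ++i)
        for (std::size_t k = 0; k < w.size(); ++k)
            y[i] += x[i + k] * w[k];
    return y;
}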
What could be the reason for this?
Also, here is ARM Compute Library's Direct Conv performance (output):
time 0.0609249
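For context, a hedged sketch of exercising the direct path through the public runtime rather than the internal kernel (tensor names reuse the declarations shown earlier in this thread; the configuration is assumed, not taken from the hidden benchmark file):

// Assumed setup: force the direct convolution path via NEDirectConvolutionLayer.
NEDirectConvolutionLayer direct_conv;
direct_conv.configure(&conv_input, &conv_weight, nullptr, &conv_output, PadStrideInfo(1, 1, 0, 0));

conv_input.allocator()->allocate();
conv_weight.allocator()->allocate();
conv_output.allocator()->allocate();

direct_conv.run(); // warm-up
direct_conv.run(); // time this call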
My device info:

I compiled against the latest Android CPU release shared lib.