
Optional use of GPU to offload computation #806

Open
sjamgade opened this issue Mar 9, 2018 · 8 comments
sjamgade commented Mar 9, 2018

What:
I wanted to have some discussion around a possible way to have gnocchi
offload its processing to GPGPU.

Why:
This would increase performance drastically and also increase
the volume of data that could be processed in one go.

How:
The types of computation currently done by Gnocchi are fairly simple
and straightforward, for example the sum function in the carbonara library.

Let's take an example:

Computation done in the benchmark function of the AggregatedTimeSerie class
(the example uses the pycuda library and was tested on an NVIDIA Quadro K620):

    import numpy
    import pycuda.autoinit  # noqa: F401 -- initializes the CUDA context
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule

    # in __init__(self):
    #     self.a = self._ts['values']
    #     self.a = self.a.copy(order='C').astype(numpy.float32)
    #     self.a_gpu = cuda.mem_alloc(self.a.size * self.a.dtype.itemsize)

    # number of points in the input data:
    # SplitKey.POINTS_PER_SPLIT = 3600 * 100
    def sum(self):
        # Each thread folds `perthread` consecutive samples into
        # a[threadIdx.x * perthread].
        summod = SourceModule("""
            __global__ void addit(float *a, int *i) {
                int perthread = i[0];
                for (int counter = i[0] - 1; counter; counter--)
                    a[threadIdx.x * perthread] += a[threadIdx.x * perthread + counter];
            }""")

        cuda.memcpy_htod(self.a_gpu, self.a)
        func = summod.get_function("addit")
        # numpy.array([6]) defaults to 64-bit ints on most platforms;
        # the kernel reads 32-bit ints, hence the explicit dtype.
        func(self.a_gpu, cuda.In(numpy.array([6], dtype=numpy.int32)),
             block=(32, 1, 1), grid=(1800, 1))
        result = numpy.empty_like(self.a)
        cuda.memcpy_dtoh(result, self.a_gpu)

        print("original array:")
        print(self.a)
        print("aggregated with kernel:")
        print(result)

With a kernel like the one shown above, one could beat the benchmark by a factor of at least 10.
Here I had to limit the computation, as my GPU was not really designed for GPGPU work.

The grid and block parameters are hardcoded in the example, but they could be
easily calculated based on SplitKey.POINTS_PER_SPLIT.
These values control the amount of parallel work to be done, and are a deterministic
function of (GPU compute capability, SplitKey.POINTS_PER_SPLIT, resample factor).
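To make that concrete, here is a minimal sketch of how the launch shape could be derived instead of hardcoded. The function name, the ceil-division scheme, and the default of 1024 threads per block (the common device limit, which pycuda can report per device) are all assumptions for illustration, not Gnocchi or pycuda API:

```python
# Hypothetical sketch: derive (block, grid) for a one-output-per-thread kernel
# from the split size and the resample factor.
def launch_shape(points_per_split, resample_factor, max_threads_per_block=1024):
    """Return (block, grid) tuples covering every output point."""
    # Each thread folds `resample_factor` input points into one output point.
    n_outputs = -(-points_per_split // resample_factor)  # ceil division
    block_x = min(n_outputs, max_threads_per_block)
    grid_x = -(-n_outputs // block_x)  # enough blocks to cover all outputs
    return (block_x, 1, 1), (grid_x, 1)

# With the example's numbers (3600 * 100 points, resample factor 6):
block, grid = launch_shape(3600 * 100, 6)
```

The returned tuples could then be passed straight to the pycuda kernel call in place of the hardcoded `block=(32, 1, 1), grid=(1800, 1)`.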

The value 6 passed in the kernel call is just the resample factor.
In the example, changing the granularity of the input data from 5s to 35s gives
35/5 = 7,
i.e. 7 values are combined to produce 1 value.
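The same downsampling can be expressed on the CPU with numpy's reshape-and-sum, which is also a convenient reference for checking a kernel's output. A small sketch, assuming the series length is an exact multiple of the factor (real code would have to pad or truncate):

```python
import numpy as np

# CPU reference for the resampling the kernel performs: combine every
# `factor` consecutive samples into one aggregated value.
def resample_sum(values, factor):
    return values.reshape(-1, factor).sum(axis=1)

# 5s granularity -> 35s granularity: 35 / 5 = 7 values per output point.
ts = np.arange(14, dtype=np.float32)  # two output buckets
out = resample_sum(ts, 7)
# out[0] aggregates samples 0..6, out[1] aggregates samples 7..13
```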

Since all the values can be easily calculated before launching the computation
on the GPU, this "algorithm" can be easily incorporated into the library as a subclass.
As for the problem of depending on pycuda (and CUDA in general):
there are libraries which help abstract out the hardware and the proprietary
blob, as this is an already-solved problem in areas such as machine learning.
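One possible shape for that subclass, sketched below. The class names and the fallback scheme are hypothetical (they do not match Gnocchi's actual carbonara classes); the idea is just that the GPU path is optional and degrades to the existing numpy implementation when no device or driver is present:

```python
import numpy as np

# Hypothetical sketch, not Gnocchi's real API: a base class with the numpy
# aggregation, and a subclass that opts into a GPU backend when available.
class AggregatedTimeSerie:
    def __init__(self, values):
        self.values = np.asarray(values, dtype=np.float32)

    def sum(self, factor):
        # Existing CPU path: fold `factor` consecutive samples into one.
        return self.values.reshape(-1, factor).sum(axis=1)

class GPUAggregatedTimeSerie(AggregatedTimeSerie):
    def __init__(self, values):
        super().__init__(values)
        try:
            import pycuda.autoinit  # noqa: F401 -- fails without a GPU/driver
            self._gpu = True
        except Exception:
            self._gpu = False  # silently fall back to the numpy path

    def sum(self, factor):
        if not self._gpu:
            return super().sum(factor)
        return self._sum_on_gpu(factor)

    def _sum_on_gpu(self, factor):
        raise NotImplementedError  # placeholder for the pycuda kernel call
```

With this layout, callers never need to know which backend ran; the dispatch is an internal detail of the subclass.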

Any thoughts ?

Which version of Gnocchi are you using

master git hash: 816fd83

@jd jd added the enhancement label Mar 9, 2018
@jd jd changed the title Feature: optional use of GPU to offload computation Optional use of GPU to offload computation Mar 9, 2018
jd commented Mar 9, 2018

I think it's a fantastic idea. Do you think there's any change we should or could make to Gnocchi to leverage that even more?

cdent commented Mar 9, 2018

This is an awesome idea.

sjamgade (Author) commented

Thanks for the support.

I think writing a CUDA kernel can be pretty challenging, and in the case of Gnocchi the kernel
has to be all-forgiving and well provisioned with defaults.

Since we rely on numpy's intelligence in the case of absent data, to deal
with datatypes, to take care of memory management, and for some other features, all of these
will have to be considered while implementing the kernels for sum and quantile.

I did this test on a very basic combination of input from the benchmark; the next task is
to make a kernel for the sum function which gives considerable speedups across the combinations
of benchmark input.
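To illustrate the "numpy's intelligence" point above: Gnocchi represents missing samples as NaN, and numpy offers NaN-aware aggregations, so a plain sum and a NaN-skipping sum give different answers. A CUDA kernel would need an explicit `isnan()` check to match the second behaviour. A small sketch:

```python
import numpy as np

# Missing samples are NaNs; the aggregation must skip them rather than
# let NaN propagate and poison the whole sum.
series = np.array([1.0, np.nan, 3.0, 4.0], dtype=np.float32)

naive = series.sum()      # NaN propagates: the result is nan
safe = np.nansum(series)  # NaN treated as missing: sums only real samples
```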

jd commented Mar 14, 2018

@sjamgade That sounds great. Feel free to ping us if you need hints or help. Also, don't hesitate to send small and early patches to get feedback, so we can have an understanding of what's going on. :)

@jd jd added the performance label Mar 14, 2018
sjamgade commented May 4, 2018

I have been working on this for some time already and have collected my experiments in the carbs repo.
Currently I am trying to explain my learnings in a blog post and wiki, and will post an update here once the rest of the posts (wiki) are ready.

I would really appreciate any kind of feedback

jd commented May 4, 2018

I've read your blog post and it's great, good job @sjamgade.

From what I've understood, you'd need a larger span of points to be more efficient? I'm not sure how this can be achieved, TBH. If you have suggestions, feel free to write about them.

Other than that, that looks like an interesting feature we'd be happy to merge when it's ready!

chungg commented May 4, 2018

I'm curious: in your initial implementation, are there any significant changes to the workflow or data model required to make Gnocchi work with a GPU? It'd be interesting to see if maybe there's another design direction we could take with Gnocchi.

sjamgade (Author) commented

I have a vague idea of the kind of changes Gnocchi could take to ease the offloading, but I like to imagine that as a last option. I will keep working on other parts of the library in the repo and possibly post results with full-scale testing.

It would be promising to have it tested in a more production-like deployment, but currently I am not aware of any probable opportunities.

In other news, I have published part 2 and the final part of my experiment, which are also available on the wiki.
