
Optional use of GPU to offload computation #806

Open
sjamgade opened this issue Mar 9, 2018 · 8 comments
sjamgade commented Mar 9, 2018

What:
I wanted to have some discussion around a possible way to have gnocchi
offload its processing to GPGPU.

Why:
This would increase performance drastically and also increase
the volume of data that could be processed in one go.

How:
The types of computation currently done by Gnocchi are fairly simple
and straightforward, for example the sum function in the carbonara library.

Let's take an example:

Computation done in the benchmark function of the AggregatedTimeSerie class
(the example uses the pycuda library and was tested on an NVIDIA Quadro K620):

    import numpy
    import pycuda.autoinit  # noqa: F401 -- initializes the CUDA context
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule

    # in __init__(self):
    #     self.a = self._ts['values']
    #     self.a = self.a.copy(order='C').astype(numpy.float32)
    #     self.a_gpu = cuda.mem_alloc(self.a.size * self.a.dtype.itemsize)

    # number of points in the input data:
    # SplitKey.POINTS_PER_SPLIT = 3600 * 100
    def sum(self):
        # Each thread folds `perthread` consecutive samples into
        # a[threadIdx.x * perthread].
        summod = SourceModule("""
            __global__ void addit(float *a, int *i) {
                int perthread = i[0];
                for (int counter = i[0] - 1; counter; counter--)
                    a[threadIdx.x * perthread] += a[threadIdx.x * perthread + counter];
            }""")

        cuda.memcpy_htod(self.a_gpu, self.a)
        func = summod.get_function("addit")
        # numpy.array([6]) defaults to 64-bit ints on most platforms;
        # the kernel reads 32-bit ints, hence the explicit dtype.
        func(self.a_gpu, cuda.In(numpy.array([6], dtype=numpy.int32)),
             block=(32, 1, 1), grid=(1800, 1))
        result = numpy.empty_like(self.a)
        cuda.memcpy_dtoh(result, self.a_gpu)

        print("original array:")
        print(self.a)
        print("aggregated with kernel:")
        print(result)

With a kernel like the one shown above, one could beat the benchmark by a factor of at least 10.
Here I had to limit the computation, as my GPU was not really designed for GPGPU work.

The grid and block parameters are hardcoded in the example, but they could be
easily calculated based on SplitKey.POINTS_PER_SPLIT.
These values control the amount of parallel work to be done, and are a deterministic
function of (GPU compute capability, SplitKey.POINTS_PER_SPLIT, resample factor).
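To make that concrete, here is a minimal sketch of how the launch shape could be derived instead of hardcoded. The function name, the ceil-division scheme, and the default of 1024 threads per block (the common device limit, which pycuda can report per device) are all assumptions for illustration, not Gnocchi or pycuda API:

```python
# Hypothetical sketch: derive (block, grid) for a one-output-per-thread kernel
# from the split size and the resample factor.
def launch_shape(points_per_split, resample_factor, max_threads_per_block=1024):
    """Return (block, grid) tuples covering every output point."""
    # Each thread folds `resample_factor` input points into one output point.
    n_outputs = -(-points_per_split // resample_factor)  # ceil division
    block_x = min(n_outputs, max_threads_per_block)
    grid_x = -(-n_outputs // block_x)  # enough blocks to cover all outputs
    return (block_x, 1, 1), (grid_x, 1)

# With the example's numbers (3600 * 100 points, resample factor 6):
block, grid = launch_shape(3600 * 100, 6)
```

The returned tuples could then be passed straight to the pycuda kernel call in place of the hardcoded `block=(32, 1, 1), grid=(1800, 1)`.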

The value 6 passed in the kernel call is just the resample factor.
In the example, changing the granularity of the input data from 5s to 35s gives
35/5 = 7,
i.e. 7 values are combined to produce 1 value.
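The same downsampling can be expressed on the CPU with numpy's reshape-and-sum, which is also a convenient reference for checking a kernel's output. A small sketch, assuming the series length is an exact multiple of the factor (real code would have to pad or truncate):

```python
import numpy as np

# CPU reference for the resampling the kernel performs: combine every
# `factor` consecutive samples into one aggregated value.
def resample_sum(values, factor):
    return values.reshape(-1, factor).sum(axis=1)

# 5s granularity -> 35s granularity: 35 / 5 = 7 values per output point.
ts = np.arange(14, dtype=np.float32)  # two output buckets
out = resample_sum(ts, 7)
# out[0] aggregates samples 0..6, out[1] aggregates samples 7..13
```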

Since all the values can be easily calculated before launching the computation
on the GPU, this "algorithm" can be easily incorporated into the library as a subclass.
As for the problem of depending on pycuda (and CUDA in general):
there are libraries which help abstract out the hardware and the proprietary
blob, as this is an already-solved problem in areas such as machine learning.
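One possible shape for that subclass, sketched below. The class names and the fallback scheme are hypothetical (they do not match Gnocchi's actual carbonara classes); the idea is just that the GPU path is optional and degrades to the existing numpy implementation when no device or driver is present:

```python
import numpy as np

# Hypothetical sketch, not Gnocchi's real API: a base class with the numpy
# aggregation, and a subclass that opts into a GPU backend when available.
class AggregatedTimeSerie:
    def __init__(self, values):
        self.values = np.asarray(values, dtype=np.float32)

    def sum(self, factor):
        # Existing CPU path: fold `factor` consecutive samples into one.
        return self.values.reshape(-1, factor).sum(axis=1)

class GPUAggregatedTimeSerie(AggregatedTimeSerie):
    def __init__(self, values):
        super().__init__(values)
        try:
            import pycuda.autoinit  # noqa: F401 -- fails without a GPU/driver
            self._gpu = True
        except Exception:
            self._gpu = False  # silently fall back to the numpy path

    def sum(self, factor):
        if not self._gpu:
            return super().sum(factor)
        return self._sum_on_gpu(factor)

    def _sum_on_gpu(self, factor):
        raise NotImplementedError  # placeholder for the pycuda kernel call
```

With this layout, callers never need to know which backend ran; the dispatch is an internal detail of the subclass.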

Any thoughts ?

Which version of Gnocchi are you using

master git hash: 816fd83

@jd jd added the enhancement label Mar 9, 2018
@jd jd changed the title Feature: optional use of GPU to offload computation Optional use of GPU to offload computation Mar 9, 2018
jd commented Mar 9, 2018

I think it's a fantastic idea. Do you think there's any change we should or could make to Gnocchi to leverage that even more?

cdent commented Mar 9, 2018

This is an awesome idea.

sjamgade (Author) commented

Thanks for the support.

I think writing a CUDA kernel can be pretty challenging, and in the case of Gnocchi the kernel
has to be all-forgiving and well provisioned with defaults.

Since we rely on numpy's intelligence in the case of absent data, to deal
with datatypes, to take care of memory management, and for some other features, all of these
will have to be considered while implementing the kernels for sum and quantile.

I did this test on a very basic combination of input from the benchmark; the next task is
to make a kernel for the sum function which gives considerable speedups across the combinations
of benchmark input.
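To illustrate the "numpy's intelligence" point above: Gnocchi represents missing samples as NaN, and numpy offers NaN-aware aggregations, so a plain sum and a NaN-skipping sum give different answers. A CUDA kernel would need an explicit `isnan()` check to match the second behaviour. A small sketch:

```python
import numpy as np

# Missing samples are NaNs; the aggregation must skip them rather than
# let NaN propagate and poison the whole sum.
series = np.array([1.0, np.nan, 3.0, 4.0], dtype=np.float32)

naive = series.sum()      # NaN propagates: the result is nan
safe = np.nansum(series)  # NaN treated as missing: sums only real samples
```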

jd commented Mar 14, 2018

@sjamgade That sounds great. Feel free to ping us if you need hints or help. Also, don't hesitate to send small and early patches to get feedback, so we can have an understanding of what's going on. :)

@jd jd added the performance label Mar 14, 2018
sjamgade commented May 4, 2018

I have been working on this for some time already and have collected my experiments in the carbs repo.
Currently I am trying to explain my learnings in a blog post and wiki, and will post an update here once the rest of the posts (wiki) are ready.

I would really appreciate any kind of feedback

jd commented May 4, 2018

I've read your blog post and it's great, good job @sjamgade.

From what I've understood, you'd need a larger span of points to be more efficient? I'm not sure how this can be achieved, TBH. If you have suggestions, feel free to write about them.

Other than that, that looks like an interesting feature we'd be happy to merge when it's ready!

chungg commented May 4, 2018

I'm curious: in your initial implementation, are there any significant changes to the workflow or data model required to make Gnocchi work with a GPU? It'd be interesting to see if maybe there's another design direction we could take with Gnocchi.

sjamgade (Author) commented

I have a vague idea of the kind of changes Gnocchi could take to ease the offloading, but I like to imagine that as a last option. I will keep working on other parts of the library in the repo and possibly post results with full-scale testing.

It would be promising to have it tested in a more production-like deployment, but currently I am not aware of any probable opportunities.

In other news, I have published part 2 and the final part of my experiment, which are also available on the wiki.
