- build a version that does most of the batch building on the GPU (using scikits.cuda / pycuda).
  * pycuda.gpuarray.GPUArray does not seem to support advanced indexing, so how would that work? Theano does support it, but mixing Theano and pycuda might not be the best idea.
  * alternatively, implement batch matrix inversion as a Theano op using scikits.cuda, and do everything in Theano.
  * problem with that: the factors can't really be updated in place on the GPU then, so they would have to live on the CPU, which would probably still be slow...
  * need to implement a pycuda kernel to do the indexing instead. Should not be too hard: it's just a fancy copy (gather) operation; see the sketch after this item.
Example of a 'harder' kernel: http://wiki.tiker.net/PyCuda/Examples/MatrixmulTiled
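
  A rough sketch of what that indexing kernel could look like (an assumption about the approach, not existing code; the name gather_rows and the float32 / int32, C-contiguous layout are illustrative): gather selected rows of a dense factor matrix into a batch buffer, entirely on the GPU.

    import numpy as np
    import pycuda.autoinit  # creates a CUDA context on import
    import pycuda.gpuarray as gpuarray
    from pycuda.compiler import SourceModule

    _mod = SourceModule("""
    __global__ void gather_rows(const float *src, const int *idx, float *dst,
                                int num_rows, int num_cols)
    {
        int row = blockIdx.x;  // one grid row per output row
        int col = blockIdx.y * blockDim.x + threadIdx.x;
        if (row < num_rows && col < num_cols)
            dst[row * num_cols + col] = src[idx[row] * num_cols + col];
    }
    """)
    _gather_rows = _mod.get_function("gather_rows")

    def gather_rows(src_gpu, idx_gpu, block_size=128):
        """dst_gpu[i] = src_gpu[idx_gpu[i]] for a C-contiguous float32 matrix
        src_gpu and an int32 index vector idx_gpu, both living on the GPU."""
        num_rows = int(idx_gpu.shape[0])
        num_cols = int(src_gpu.shape[1])
        dst_gpu = gpuarray.empty((num_rows, num_cols), dtype=np.float32)
        grid = (num_rows, (num_cols + block_size - 1) // block_size)
        _gather_rows(src_gpu.gpudata, idx_gpu.gpudata, dst_gpu.gpudata,
                     np.int32(num_rows), np.int32(num_cols),
                     block=(block_size, 1, 1), grid=grid)
        return dst_gpu

  With something like this the factor matrices could stay on the GPU and batch buffers could be filled without a round trip through numpy fancy indexing.
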
- build a version that uses multiprocessing WITHIN batch generation (instead of generating multiple batches simultaneously).
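
  A minimal sketch of what that could look like (assuming the per-row work can be expressed as a picklable function; build_rows and build_batch_parallel are made-up names): split the row indices of one batch over a worker pool and concatenate the parts.

    import multiprocessing as mp
    import numpy as np

    def build_rows(row_indices):
        # stand-in for the real per-row batch building work
        return np.asarray(row_indices, dtype=np.float32)[:, None] * 2.0

    def build_batch_parallel(row_indices, num_workers=4):
        chunks = np.array_split(row_indices, num_workers)
        pool = mp.Pool(num_workers)
        try:
            parts = pool.map(build_rows, chunks)  # one chunk per worker
        finally:
            pool.close()
            pool.join()
        return np.concatenate(parts, axis=0)

    if __name__ == '__main__':
        batch = build_batch_parallel(np.arange(1000))

  In practice the pool would presumably be created once and reused across batches, since forking workers for every batch is expensive.
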
- profile the code with solve_gpu to identify the slowest parts of the batch building code.
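
  For example with cProfile (run_epoch is a placeholder for whatever entry point ends up driving solve_gpu; both need to be importable where this runs):

    import cProfile
    import pstats

    cProfile.run('run_epoch(solver=solve_gpu)', 'batch_building.prof')
    stats = pstats.Stats('batch_building.prof')
    stats.sort_stats('cumulative').print_stats(20)  # 20 most expensive call paths
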
- The algorithm could be adapted to work with CSR representations stored on disk in e.g. HDF5 format (for very large matrices). Try this?
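
  A sketch of what that could look like with h5py (the file layout with separate data / indices / indptr datasets is an assumption): store the three CSR arrays as datasets and read back a block of rows on demand, without ever loading the whole matrix.

    import h5py
    import numpy as np
    import scipy.sparse as sp

    def save_csr(filename, m):
        m = m.tocsr()
        with h5py.File(filename, 'w') as f:
            f.create_dataset('data', data=m.data)
            f.create_dataset('indices', data=m.indices)
            f.create_dataset('indptr', data=m.indptr)
            f.attrs['shape'] = m.shape

    def load_rows(filename, start, stop):
        """Load rows [start, stop) as an in-memory CSR matrix."""
        with h5py.File(filename, 'r') as f:
            indptr = f['indptr'][start:stop + 1]
            lo, hi = indptr[0], indptr[-1]
            data = f['data'][lo:hi]
            indices = f['indices'][lo:hi]
            num_cols = f.attrs['shape'][1]
        return sp.csr_matrix((data, indices, indptr - lo),
                             shape=(stop - start, num_cols))
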
How to do a partial matrix update in PyCUDA: http://stackoverflow.com/questions/5708319/pycuda-gpuarray-slice-based-operations
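
  Summarising the idea there (a sketch under the assumption that the factors are stored row-major as float32): a block of consecutive rows of a C-contiguous GPUArray is one contiguous chunk of device memory, so it can be overwritten in place with a single memcpy at the right byte offset.

    import numpy as np
    import pycuda.autoinit  # creates a CUDA context on import
    import pycuda.driver as drv
    import pycuda.gpuarray as gpuarray

    def set_rows(dst_gpu, start_row, host_block):
        """Overwrite dst_gpu[start_row:start_row + len(host_block)] in place."""
        host_block = np.ascontiguousarray(host_block, dtype=dst_gpu.dtype)
        byte_offset = start_row * dst_gpu.strides[0]
        drv.memcpy_htod(int(dst_gpu.gpudata) + byte_offset, host_block)

    factors_gpu = gpuarray.zeros((1000, 40), dtype=np.float32)
    set_rows(factors_gpu, 10, np.ones((5, 40), dtype=np.float32))
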