
making Multinomial sampling slightly faster #786

Open

wants to merge 1 commit into master

Conversation

amartya18x
Contributor

This makes multinomial sampling slightly faster by generating a sample-sized array of uniformly sampled random numbers outside the kernel and passing it to the kernel. The different distributions in the kernel can share the same array of uniformly sampled numbers because they are independent distributions.

I have added a file called test/multinomial.lua to do some benchmarking. I can remove it later if the PR is accepted.
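
A minimal sketch of the idea, assuming illustrative names and a plain linear CDF scan rather than the actual cutorch kernel: the uniform draws are generated once outside the kernel (for example with cuRAND's host API or a Torch uniform tensor), and every distribution reuses the same n_sample values.

// Illustrative sketch only: names, launch shape, and the linear CDF scan
// are assumptions, not the kernel in this PR.
#include <cuda_runtime.h>

__global__ void sampleMultinomialWithPregenUniforms(
    const float* uniforms,   // n_sample uniform draws in [0, 1), shared by all distributions
    const float* prefixSum,  // numDist x numCategories row-wise CDFs
    long* dest,              // numDist x n_sample sampled category indices
    int numDist, int numCategories, int n_sample)
{
  int dist = blockIdx.x;
  if (dist >= numDist) return;
  const float* cdf = prefixSum + dist * numCategories;

  for (int s = threadIdx.x; s < n_sample; s += blockDim.x) {
    float r = uniforms[s];   // reuse the same draw across independent distributions
    int choice = numCategories - 1;          // fallback if r >= cdf[numCategories - 1]
    for (int c = 0; c < numCategories; ++c) {
      if (r < cdf[c]) { choice = c; break; } // first category whose CDF exceeds r
    }
    dest[dist * (long)n_sample + s] = choice;
  }
}

The host side would fill the uniforms buffer once per call (for example with curandGenerateUniform) before launching, instead of advancing a per-block RNG state inside the kernel.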

@amartya18x
Copy link
Contributor Author

The NEW block shows the timings for the updated code and the OLD block shows them for the previous code.

@amartya18x
Copy link
Contributor Author

@pavanky Could you have a look at this too?

Contributor

@pavanky left a comment

The downside of this is that the function uses more memory. I think there are ways to make it faster without generating the uniform array outside of the kernel.

n_sample,
THCudaLongTensor_data(state, self),
numDist, numCategories,
THCTensor_(data)(state, prefixSum));
Contributor

These lines are not aligned consistently.

@amartya18x
Contributor Author

@pavanky But I think generating just one array (of size n_sample instead of num_dist * n_sample) is a good improvement. Could that go in as an update for now? Maybe it can be improved further later?

@soumith
Member

soumith commented Jul 3, 2017

@amartya18x What speedups are you seeing? Can you give the output of the benchmark and specify which GPU you ran it on?

@wickedfoo
Contributor

Yeah, I'm not convinced.

Not only does it require an allocation and a deallocation (which is bad if one is not running with the caching allocator), it also replaces in-kernel generation with extra global memory reads and writes, which together with the allocator overhead may end up on par with what was there before. Do you have kernel timings?

@@ -292,7 +291,7 @@ sampleMultinomialWithoutReplacement(curandStateMtgp32* state,

// All threads must participate in this
T r = ScalarConvert<float, T>::to(curand_uniform(&state[blockIdx.x]));
Contributor

Plus, this one wasn't converted.

Contributor Author

Yeah, I didn't do it for this kernel; I only tried it out for the other one. If the idea convinces everyone, I could convert this one too.

print("")
print("Benchmarking multinomial with "..curr_n_class.." classes and "..curr_n_sample.." samples")
torch.seed()
local probs = torch.CudaDoubleTensor(n_dist, curr_n_class):uniform(0,1)
Contributor

I'm not sure that benchmarking using float64 is useful, since almost all work done with Torch is in float32. Furthermore, this will have very skewed results on different GPUs due to lack of float64 ALUs.

Contributor Author

Do you suggest using torch.FloatTensor?

a:reset()
for i = 1,10 do
torch.multinomial(probs, curr_n_sample, true)
cutorch.synchronize()
Contributor

Why are you synchronizing every time through the loop in the benchmark? One should only synchronize at the beginning and at the end.
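
For reference, the usual pattern, shown here in plain CUDA C with a placeholder someKernel rather than the Lua harness, is to synchronize once before starting the timer and once after all the launches, so the loop itself only measures launch and kernel time:

#include <cuda_runtime.h>
#include <stdio.h>

// placeholder kernel standing in for the multinomial launch being benchmarked
__global__ void someKernel(float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = i * 0.5f;
}

int main() {
  const int n = 1 << 20;
  float* d;
  cudaMalloc(&d, n * sizeof(float));

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaDeviceSynchronize();          // drain any prior work before timing
  cudaEventRecord(start);
  for (int i = 0; i < 10; ++i) {
    someKernel<<<(n + 255) / 256, 256>>>(d, n);
    // no synchronization here: the launches stay queued on the stream
  }
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);       // single synchronization at the end
  float ms = 0.f;
  cudaEventElapsedTime(&ms, start, stop);
  printf("10 launches: %.3f ms\n", ms);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  cudaFree(d);
  return 0;
}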

@amartya18x
Contributor Author

@soumith I had a comment here with the benchmark output showing the speedups. I think the rebase removed it. Let me check.

@pavanky
Contributor

pavanky commented Jul 3, 2017

@wickedfoo The problem was that the previous kernel was generating more random numbers than needed. This PR fixes that issue by creating only the required number of random numbers outside the kernel.

That said, this isn't the optimal fix. Ideally we would generate just the necessary random numbers within the kernel.

@wickedfoo
Contributor

Is generating more random numbers than needed really that much of an issue? If we're talking microseconds here, this isn't that big of a deal in my opinion.

The RNG being used requires the entire block to update the state at once, although you are right that only one value per warp is being used.

The alternative would be to see whether the warp divergence from the binary search part is acceptable, in which case every thread could use the value that it generated locally.
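
A rough sketch of that alternative, with an inlined binary search and illustrative names rather than cutorch's actual helpers: every thread of the block still participates in the MTGP32 update, but each one then consumes its own draw and searches the CDF independently.

#include <curand_kernel.h>

// find the first index where r < cdf[idx]; plain binary search over one row's CDF
__device__ int searchCdf(const float* cdf, int size, float r) {
  int lo = 0, hi = size - 1;
  while (lo < hi) {
    int mid = (lo + hi) / 2;
    if (r < cdf[mid]) hi = mid; else lo = mid + 1;
  }
  return lo;
}

__global__ void sampleOneDrawPerThread(curandStateMtgp32* states,
                                       const float* cdf,  // CDF for one distribution
                                       long* dest, int numCategories, int n_sample)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  // MTGP32 requires the whole block (at most 256 threads) to call curand_uniform
  // together, so every thread draws even if its result ends up unused
  float r = curand_uniform(&states[blockIdx.x]);
  if (idx < n_sample) {
    // divergent binary search, but every generated value is consumed
    dest[idx] = searchCdf(cdf, numCategories, r);
  }
}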

@pavanky
Contributor

pavanky commented Jul 3, 2017

@wickedfoo Another alternative would be to perform a parallel linear search instead of a binary search on a single thread.
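
A very rough sketch of what such a parallel linear search might look like, with illustrative names and launch shape that are not taken from this PR: the threads of one block scan strided slices of a single distribution's CDF for one draw, and the crossing index wins.

__global__ void parallelLinearSearch(
    const float* cdf,    // prefix sums for one distribution, length numCategories
    const float* draws,  // one uniform draw per sample
    long* dest, int numCategories, int n_sample)
{
  __shared__ int found;

  for (int s = blockIdx.x; s < n_sample; s += gridDim.x) {
    if (threadIdx.x == 0) found = numCategories - 1;  // fallback if r >= cdf[last]
    __syncthreads();

    float r = draws[s];
    // each thread checks a strided subset of categories for the crossing point
    for (int c = threadIdx.x; c < numCategories; c += blockDim.x) {
      bool crossed = (r < cdf[c]) && (c == 0 || r >= cdf[c - 1]);
      if (crossed) atomicMin(&found, c);
    }
    __syncthreads();

    if (threadIdx.x == 0) dest[s] = found;
    __syncthreads();  // keep 'found' stable until it has been written out
  }
}

Whether this beats a per-thread binary search would depend on numCategories and on how well the strided reads coalesce.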

Either way, the improvements here would be minimal. The more substantial improvements we are seeing come from #784.

@pavanky
Contributor

pavanky commented Jul 3, 2017

> The RNG being used requires the entire block to update the state at once, although you are right that only one value per warp is being used.

AFAIK this is only a limitation of the MTGP32 generator. It is also a bit slower than other generators in cuRAND. Any reason to stick with MTGP32?

@pavanky
Contributor

pavanky commented Jul 3, 2017

Nvm my previous comment, I didn't realize cutorch only supports this generator.
