making Multinomial sampling slightly faster #786
base: master
Conversation
The NEW block shows the time for the current code and the OLD block shows it for the previous code.
@pavanky Could you have a look at this too?
The downside of this is that the function uses more memory. I think there are ways to make it faster without generating the uniform array outside of the kernel.
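For illustration only, one way to generate the uniforms in-kernel without an extra buffer is a per-thread cuRAND Philox state, which, unlike MTGP32, does not require the whole block to participate. This is just a sketch of the idea, not cutorch code (cutorch is tied to the MTGP32 generator, as noted further down), and `sampleUniformInKernel` is a hypothetical name:

```cuda
#include <curand_kernel.h>
#include <cuda_runtime.h>

// Sketch: each thread owns a Philox state and draws exactly the uniforms it
// needs, so there is no extra global allocation and no extra global reads.
__global__ void sampleUniformInKernel(unsigned long long seed, float* out, int n) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid >= n) return;

  curandStatePhilox4_32_10_t state;
  // A distinct subsequence per thread keeps the streams independent.
  curand_init(seed, /*subsequence=*/tid, /*offset=*/0, &state);
  out[tid] = curand_uniform(&state);
}

int main() {
  const int n = 1024;
  float* d_out = nullptr;
  cudaMalloc(&d_out, n * sizeof(float));
  sampleUniformInKernel<<<(n + 255) / 256, 256>>>(1234ULL, d_out, n);
  cudaDeviceSynchronize();
  cudaFree(d_out);
  return 0;
}
```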
lib/THC/generic/THCTensorRandom.cu (Outdated)
n_sample,
THCudaLongTensor_data(state, self),
numDist, numCategories,
THCTensor_(data)(state, prefixSum));
The lines here are not aligned.
@pavanky But I think generating just an array (the size of n_sample) doesn't use that much extra memory.
@amartya18x what speedups are you seeing? Can you give the output of the benchmark and specify what GPU you ran them on?
Yeah, I'm not convinced. Not only does it require allocation and deallocation (which is bad if one is not running with the caching allocator), it also replaces in-kernel generation with extra global memory reads and writes, which together with the memory allocator overhead may be on par with what was there before. Do you have kernel timings?
@@ -292,7 +291,7 @@ sampleMultinomialWithoutReplacement(curandStateMtgp32* state,

// All threads must participate in this
T r = ScalarConvert<float, T>::to(curand_uniform(&state[blockIdx.x]));
Plus, this wasn't converted.
Yeah, I didn't do it for this one; I just tried it out for the other one. If the idea convinces everyone, I could do it for this one too.
print("") | ||
print("Benchmarking multinomial with "..curr_n_class.." classes and "..curr_n_sample.." samples") | ||
torch.seed() | ||
local probs = torch.CudaDoubleTensor(n_dist, curr_n_class):uniform(0,1) |
I'm not sure that benchmarking using float64 is useful, since almost all work done with Torch is in float32. Furthermore, this will have very skewed results on different GPUs due to the lack of float64 ALUs.
Do you suggest using torch.FloatTensor?
a:reset()
for i = 1,10 do
torch.multinomial(probs, curr_n_sample, true)
cutorch.synchronize()
Why are you synchronizing every time through the loop for the benchmark? One should only synchronize at the beginning and the end.
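The suggested pattern, translated into a standalone CUDA snippet for clarity (a sketch with a hypothetical `dummyKernel` standing in for the multinomial call; the actual benchmark in this PR is the Lua file): synchronize once before starting the timer, launch all iterations, and synchronize once at the end.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel being benchmarked.
__global__ void dummyKernel(float* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
  const int n = 1 << 20, iters = 10;
  float* d = nullptr;
  cudaMalloc(&d, n * sizeof(float));

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaDeviceSynchronize();                 // synchronize once, before timing
  cudaEventRecord(start);
  for (int i = 0; i < iters; ++i) {        // no synchronization inside the loop
    dummyKernel<<<(n + 255) / 256, 256>>>(d, n);
  }
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);              // synchronize once, after the loop

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  printf("average per launch: %.4f ms\n", ms / iters);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  cudaFree(d);
  return 0;
}
```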
@soumith I had the benchmarking results here in a comment showing the speedups. I think the rebasing messed it up. Let me check.
@wickedfoo the problem was that the previous kernel was generating more random numbers than needed. This PR fixes that by generating only the required number of random numbers outside the kernel. That said, this isn't the optimal fix. We should be generating the necessary random numbers within the kernel.
Is generating more random numbers than needed really that much of an issue? If we're talking microseconds here, this isn't that big of a deal in my opinion. The RNG being used requires the entire block to update the state at once, although you are right that only one value per warp is being used. The alternative would be to see if warp divergence for the binary search part isn't too bad, in which case every thread would use the value that it generated locally.
@wickedfoo Another alternative would be to perform a parallel linear search instead of a binary search on a single thread. Either way, the improvements here would be minimal. The more substantial improvements we are seeing come from here: #784
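To make that suggestion concrete, a warp-parallel linear search over the prefix sums could look roughly like the sketch below. This is not code from this PR; `warpLinearSearch` is a hypothetical helper, it assumes every lane of the warp calls it, and the `_sync` shuffle intrinsics require CUDA 9 or later:

```cuda
// Sketch: the 32 lanes of a warp scan the prefix-sum row cooperatively and
// agree on the first category whose cumulative probability exceeds r, instead
// of a single thread doing a binary search.
__device__ int warpLinearSearch(const float* prefixSum, int nCategories, float r) {
  const int lane = threadIdx.x & 31;
  int found = nCategories - 1;  // fallback: last category (handles r at the top edge)

  // Strided linear scan: lane k inspects categories k, k+32, k+64, ...
  for (int c = lane; c < nCategories; c += 32) {
    float lo = (c == 0) ? 0.0f : prefixSum[c - 1];
    if (lo <= r && r < prefixSum[c]) { found = c; break; }
  }

  // Warp min-reduction to pick the single matching category across all lanes.
  for (int offset = 16; offset > 0; offset >>= 1) {
    found = min(found, __shfl_down_sync(0xffffffff, found, offset));
  }
  // Broadcast lane 0's result so every lane returns the same category.
  return __shfl_sync(0xffffffff, found, 0);
}
```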
AFAIK this is only a limitation of the MTGP32 generator, which is also a bit slower than other generators in CURAND. Any reason to stick with MTGP32?
Nvm my previous comment, I didn't realize cutorch only supports this generator.
This makes multinomial sampling slightly faster by generating a sample-sized array of uniformly sampled random numbers outside the kernel and passing it to the kernel. The different distributions in the kernel can use the same array of uniformly sampled numbers because they are independent distributions. I have added a file called test/multinomial.lua to do some benchmarking; I can remove it later if the PR is accepted.
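For readers skimming the thread, the host-side shape of the change is roughly the following. This is a sketch using the raw cuRAND host API rather than the actual THCTensor/THCRandom calls in the diff, and `sampleMultinomialKernel` is a simplified stand-in (plain linear search, 0-based categories) for the real kernel:

```cuda
#include <curand.h>
#include <cuda_runtime.h>

// Simplified stand-in for the multinomial kernel: it only *reads* the
// pre-generated uniforms; it never calls the device RNG itself.
__global__ void sampleMultinomialKernel(const float* uniforms, long* out,
                                        const float* prefixSum,
                                        int numCategories, int nSample) {
  int dist = blockIdx.x;
  const float* row = prefixSum + dist * numCategories;
  for (int s = threadIdx.x; s < nSample; s += blockDim.x) {
    float r = uniforms[s];  // the same sample-sized array is shared by all
                            // distributions, since they are independent
    int cat = 0;
    while (cat < numCategories - 1 && r >= row[cat]) ++cat;
    out[dist * nSample + s] = cat;
  }
}

void multinomialWithPregeneratedUniforms(long* d_out, const float* d_prefixSum,
                                         int numDist, int numCategories, int nSample) {
  // 1. Allocate the sample-sized uniform buffer (the extra memory discussed above).
  float* d_uniforms = nullptr;
  cudaMalloc(&d_uniforms, nSample * sizeof(float));

  // 2. Fill it in one host-side cuRAND call, outside the sampling kernel.
  curandGenerator_t gen;
  curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);
  curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);
  curandGenerateUniform(gen, d_uniforms, nSample);

  // 3. Launch the sampling kernel, passing the pre-generated uniforms in.
  sampleMultinomialKernel<<<numDist, 256>>>(d_uniforms, d_out, d_prefixSum,
                                            numCategories, nSample);

  curandDestroyGenerator(gen);
  cudaFree(d_uniforms);
}
```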