Mapping speed improvement #791

Open · suhrig wants to merge 2 commits into master
Conversation

@suhrig (Contributor) commented Dec 8, 2019

Hi Alex,

This is a proposal for a mapping speed improvement. It is independent of the improvement suggested by alexey0308 and boosts throughput by 10-25%, depending on the operating system, compiler version, and CPU. I benchmarked it on two systems with human data:

  • Ubuntu 19.10 / GCC v9.2.1 / Intel Xeon E5-2660 2.0GHz
  • CentOS 7.4 / GCC v4.8.5 / AMD Opteron 23xx Gen3 2.1GHz.

I could not benchmark it on a Mac. This might be of interest, however, because `memset` is supposedly very efficient on macOS.

Here are the essentials: `stitchPieces` wastes a lot of time resetting the `winBin` array using `memset`. With every call, it rewrites ~200 kB of memory, although typically only a few hundred bytes of the array are accessed. I replaced `winBin` with a "smart" vector that only resets the elements that are actually accessed (`FastResetVector`). You pay a little overhead with every access, because upon each access the vector needs to check whether the element is stale, but you avoid rewriting the whole vector. This pays off when the vector is sparse (as is the case with `winBin`) and when there are not too many accesses to it. Stated the other way around: this implementation would hurt performance if many elements of `winBin` were written/read. According to my measurements, there are typically only a few hundred.

I'm curious, though, whether there could be a scenario where `winBin` is heavily accessed. I noticed that `alignWindowsPerReadNmax` is set to 10000 by default. Under what circumstances would this limit be reached? In my tests it barely reached 100, and only for a few reads. Let me know what you think. I lack the overview to judge whether my suggested patch would be infeasible in some settings.
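For illustration, here is a minimal sketch of the lazy-reset idea, assuming a generation-stamp scheme; the class name matches `FastResetVector`, but the member names and exact layout are illustrative rather than the PR's actual code:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch: "clearing" only bumps a generation counter; an element whose stamp
// is older than the current generation reads as the default value and is
// reset lazily on first access instead of via a full memset.
template <typename T>
class FastResetVector {
public:
    FastResetVector(std::size_t n, T defaultValue)
        : values(n, defaultValue), stamps(n, 0), defaultValue(defaultValue), generation(0) {}

    // O(1) reset: no element is rewritten here.
    void clear() { ++generation; }

    // Lazily reset stale elements before handing out a reference.
    T& operator[](std::size_t i) {
        if (stamps[i] != generation) {
            stamps[i] = generation;
            values[i] = defaultValue;
        }
        return values[i];
    }

private:
    std::vector<T> values;
    std::vector<std::uint64_t> stamps;
    T defaultValue;
    std::uint64_t generation;
};
```

The per-access branch is the overhead mentioned above; it only wins when far fewer elements are touched than the array holds, which is the sparse access pattern observed for `winBin`.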

There is a second (minor) code optimization: the `operator[]()` function of `PackedArray` seems to be a code hot spot. The bit shifting in this line turns out to be the most expensive part: `a1 = ((a1>>S)<<wordCompLength)>>wordCompLength;`. It can easily be replaced with a bit-mask operation. I did not really expect it to make a difference, but it does. The effect is a bit hard to measure, but taking the median of ten runs of STAR with and without the optimization confirmed a 1-2% performance improvement.
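As a rough sketch of the equivalence (variable names follow the quoted line; the precomputed mask is my assumption about how the replacement could look, not necessarily the PR's exact code):

```cpp
#include <cstdint>

// Both functions keep the low (64 - wordCompLength) bits of (a1 >> S),
// i.e. they extract one packed word starting at bit offset S.
// Assumes 0 < wordCompLength < 64.

std::uint64_t extractWithShifts(std::uint64_t a1, unsigned S, unsigned wordCompLength) {
    return ((a1 >> S) << wordCompLength) >> wordCompLength;  // original: two extra shifts per access
}

std::uint64_t extractWithMask(std::uint64_t a1, unsigned S, std::uint64_t bitMask) {
    return (a1 >> S) & bitMask;  // bitMask = ~0ULL >> wordCompLength, computed once up front
}
```

Since `wordCompLength` is fixed for a given `PackedArray`, the mask can be computed once when the word length is set, saving two shifts on every access.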

Regards,
Sebastian

@alexdobin (Owner) commented:
Hi Sebastian,

thanks a lot for this PR! This is definitely yet another part of the code that needs improvement.
I have been swamped by other things - hope to get to work on this and Alexey's PR next week.

Cheers
Alex

@Roleren commented May 25, 2020

As a note here, I reach the limit all the time when aligning Ribo-seq to tRNA databases from tRNA-scan.

Using a small tRNA database of 80 kB:

```
WARNING: not enough space allocated for transcript. Did not process all windows for read D00852:101:HNKC3BCX2:1:2204:17707:40093
   SOLUTION: increase alignTranscriptsPerReadNmax and re-run
WARNING: not enough space allocated for transcript. Did not process all windows for read D00852:98:HNK2YBCX2:2:1205:7697:75368
   SOLUTION: increase alignTranscriptsPerReadNmax and re-run
WARNING: not enough space allocated for transcript. Did not process all windows for read D00852:98:HNK2YBCX2:1:2109:6418:74123
   SOLUTION: increase alignTranscriptsPerReadNmax and re-run
```

I get thousands of these lines in the log.

I know I could join the genome with the trna database, or align to genome before trna, but I prefer it this way, as it is logically much simpler.

Alignment speed on these small databases is terrible, since the SA index must be something like 4, and STAR tries hard to map reads to the small tRNA database.

So if I get this right, this will make it even slower for me?

@suhrig (Contributor, Author) commented May 25, 2020

Hi @Roleren,

Thanks for the feedback. This could be a useful test case for some of the more unusual scenarios where the patch might be detrimental. Have you actually tried it, or is this speculation?

> I reach the limit all the time

Note that my post talks about `alignWindowsPerReadNmax`, while yours mentions `alignTranscriptsPerReadNmax`. They could be correlated, though; I'm not sure.

> So if I get this right, this will make it even slower for me?

The easiest and most reliable way to answer this question is to run a benchmark. Do you want to give my patch a try? Or can you share the tRNA database and a test sample with me? Then I could benchmark it myself.

Regards,
Sebastian

@Roleren commented May 26, 2020

This is the run for current STAR:
CentOS 7 (196-core Intel server, 2 TB RAM)
Using 50 cores

tRNA file (80 kB, 400 sequences of rat mature tRNAs)
Ribo-seq, 7 GB (65,925,239 reads, ~65 M reads)

```
Started job on | May 25 14:59:06
Started mapping on | May 25 14:59:18
Finished on | May 25 18:52:05
Mapping speed, Million of reads per hour | 16.99

                      Number of input reads |	65925239
                  Average input read length |	27
                                UNIQUE READS:
               Uniquely mapped reads number |	258
                    Uniquely mapped reads % |	0.00%
                      Average mapped length |	28.69
                   Number of splices: Total |	0
        Number of splices: Annotated (sjdb) |	0
                   Number of splices: GT/AG |	0
                   Number of splices: GC/AG |	0
                   Number of splices: AT/AC |	0
           Number of splices: Non-canonical |	0
                  Mismatch rate per base, % |	1.70%
                     Deletion rate per base |	0.09%
                    Deletion average length |	1.00
                    Insertion rate per base |	0.00%
                   Insertion average length |	0.00
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |	7620
         % of reads mapped to multiple loci |	0.01%
    Number of reads mapped to too many loci |	14956
         % of reads mapped to too many loci |	0.02%
                              UNMAPPED READS:
   % of reads unmapped: too many mismatches |	0.00%
             % of reads unmapped: too short |	99.92%
                 % of reads unmapped: other |	0.04%
                              CHIMERIC READS:
                   Number of chimeric reads |	0
                        % of chimeric reads |	0.00%

```

@Roleren commented May 26, 2020

Seeing these terrible stats, I think I should probably just change the way I do this instead of testing your commit :D
