Mapping speed improvement #791

Open · suhrig wants to merge 2 commits into master
Conversation

@suhrig (Contributor) commented Dec 8, 2019

Hi Alex,

This is a proposal for a mapping speed improvement. It is independent of the improvement suggested by alexey0308 and boosts throughput by 10-25%, depending on the operating system, compiler version, and CPU. I benchmarked it on two systems with human data:

  • Ubuntu 19.10 / GCC v9.2.1 / Intel Xeon E5-2660 2.0GHz
  • CentOS 7.4 / GCC v4.8.5 / AMD Opteron 23xx Gen3 2.1GHz.

I could not benchmark it on a Mac. This might be of interest, however, because `memset` is supposedly very efficient on macOS.

Here are the essentials: `stitchPieces` wastes a lot of time resetting the `winBin` array using `memset`. With every call, it rewrites ~200 kB of memory, although typically only a few hundred bytes of the array are accessed. I replaced `winBin` with a "smart" vector that only resets the elements that are actually accessed (`FastResetVector`). You pay a little overhead with every access, because upon each access the vector needs to check whether the element is stale, but you avoid rewriting the whole vector. This pays off when the vector is sparse (as is the case with `winBin`) and when there are not too many accesses to it. Stated the other way around: this implementation would hurt performance if many elements of `winBin` were written/read. According to my measurements, there are typically only a few hundred.

I'm curious, though, whether there could be a scenario where `winBin` is heavily accessed. I noticed that `alignWindowsPerReadNmax` is set to 10000 by default. Under what circumstances would this limit be reached? In my tests it barely reached 100, and only for a few reads. Let me know what you think. I lack the overview to judge whether my suggested patch would be infeasible in some settings.
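For illustration, here is a minimal sketch of the lazy-reset idea, assuming a generation-stamp scheme; the class name matches `FastResetVector`, but the member names and exact layout are illustrative rather than the PR's actual code:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch: "clearing" only bumps a generation counter; an element whose stamp
// is older than the current generation reads as the default value and is
// reset lazily on first access instead of via a full memset.
template <typename T>
class FastResetVector {
public:
    FastResetVector(std::size_t n, T defaultValue)
        : values(n, defaultValue), stamps(n, 0), defaultValue(defaultValue), generation(0) {}

    // O(1) reset: no element is rewritten here.
    void clear() { ++generation; }

    // Lazily reset stale elements before handing out a reference.
    T& operator[](std::size_t i) {
        if (stamps[i] != generation) {
            stamps[i] = generation;
            values[i] = defaultValue;
        }
        return values[i];
    }

private:
    std::vector<T> values;
    std::vector<std::uint64_t> stamps;
    T defaultValue;
    std::uint64_t generation;
};
```

The per-access branch is the overhead mentioned above; it only wins when far fewer elements are touched than the array holds, which is the sparse access pattern observed for `winBin`.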

There is a second (minor) code optimization: the `operator[]()` function of `PackedArray` seems to be a code hot spot. The bit shifting in this line turns out to be the most expensive part: `a1 = ((a1>>S)<<wordCompLength)>>wordCompLength;`. It can easily be replaced with a bit-mask operation. I did not really expect it to make a difference, but it does. The effect is a bit hard to measure, but taking the median of ten runs of STAR with and without the optimization confirmed a 1-2% performance improvement.
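As a rough sketch of the equivalence (variable names follow the quoted line; the precomputed mask is my assumption about how the replacement could look, not necessarily the PR's exact code):

```cpp
#include <cstdint>

// Both functions keep the low (64 - wordCompLength) bits of (a1 >> S),
// i.e. they extract one packed word starting at bit offset S.
// Assumes 0 < wordCompLength < 64.

std::uint64_t extractWithShifts(std::uint64_t a1, unsigned S, unsigned wordCompLength) {
    return ((a1 >> S) << wordCompLength) >> wordCompLength;  // original: two extra shifts per access
}

std::uint64_t extractWithMask(std::uint64_t a1, unsigned S, std::uint64_t bitMask) {
    return (a1 >> S) & bitMask;  // bitMask = ~0ULL >> wordCompLength, computed once up front
}
```

Since `wordCompLength` is fixed for a given `PackedArray`, the mask can be computed once when the word length is set, saving two shifts on every access.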

Regards,
Sebastian

@alexdobin (Owner) commented:
Hi Sebastian,

thanks a lot for this PR! This is definitely yet another part of the code that needs improvement.
I have been swamped by other things - hope to get to work on this and Alexey's PR next week.

Cheers
Alex

@Roleren commented May 25, 2020

As a note here, I reach the limit all the time when aligning Ribo-seq to tRNA databases from tRNA-scan.

Using a small tRNA database of 80 kB:

```
WARNING: not enough space allocated for transcript. Did not process all windows for read D00852:101:HNKC3BCX2:1:2204:17707:40093
   SOLUTION: increase alignTranscriptsPerReadNmax and re-run
WARNING: not enough space allocated for transcript. Did not process all windows for read D00852:98:HNK2YBCX2:2:1205:7697:75368
   SOLUTION: increase alignTranscriptsPerReadNmax and re-run
WARNING: not enough space allocated for transcript. Did not process all windows for read D00852:98:HNK2YBCX2:1:2109:6418:74123
   SOLUTION: increase alignTranscriptsPerReadNmax and re-run
```

I get thousands of these lines in the log.

I know I could join the genome with the trna database, or align to genome before trna, but I prefer it this way, as it is logically much simpler.

Alignment speed on these small databases is terrible, since the SA index must be something like 4, and STAR tries hard to map reads to the small tRNA database.

So if I get this right, this will make it even slower for me?

@suhrig (Contributor, Author) commented May 25, 2020

Hi @Roleren,

Thanks for the feedback. This could be a useful test case for some of the more unusual scenarios where the patch might be detrimental. Have you actually tried it, or is this speculation?

> I reach the limit all the time

Note that my post talks about `alignWindowsPerReadNmax`, while yours mentions `alignTranscriptsPerReadNmax`. They could be correlated, though; I'm not sure.

> So if I get this right, this will make it even slower for me?

The easiest and most reliable way to answer this question is to run a benchmark. Do you want to give my patch a try? Or can you share the tRNA database and a test sample with me? Then I could benchmark it myself.

Regards,
Sebastian

@Roleren commented May 26, 2020

This is the run for current STAR:
CentOS 7 (196-core Intel server, 2 TB RAM)
Using 50 cores

tRNA file (80 kB, 400 sequences of rat mature tRNAs)
Ribo-seq, 7 GB (65,925,239 reads, ~65 M reads)

```
Started job on | May 25 14:59:06
Started mapping on | May 25 14:59:18
Finished on | May 25 18:52:05
Mapping speed, Million of reads per hour | 16.99

                      Number of input reads |	65925239
                  Average input read length |	27
                                UNIQUE READS:
               Uniquely mapped reads number |	258
                    Uniquely mapped reads % |	0.00%
                      Average mapped length |	28.69
                   Number of splices: Total |	0
        Number of splices: Annotated (sjdb) |	0
                   Number of splices: GT/AG |	0
                   Number of splices: GC/AG |	0
                   Number of splices: AT/AC |	0
           Number of splices: Non-canonical |	0
                  Mismatch rate per base, % |	1.70%
                     Deletion rate per base |	0.09%
                    Deletion average length |	1.00
                    Insertion rate per base |	0.00%
                   Insertion average length |	0.00
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |	7620
         % of reads mapped to multiple loci |	0.01%
    Number of reads mapped to too many loci |	14956
         % of reads mapped to too many loci |	0.02%
                              UNMAPPED READS:
   % of reads unmapped: too many mismatches |	0.00%
             % of reads unmapped: too short |	99.92%
                 % of reads unmapped: other |	0.04%
                              CHIMERIC READS:
                   Number of chimeric reads |	0
                        % of chimeric reads |	0.00%

```

@Roleren commented May 26, 2020

Seeing these terrible stats, I think I should probably just change the way I do this instead of testing your commit :D
