Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up fai_retrieve() by reading whole lines at once #1799

Merged
merged 1 commit into from
Jul 8, 2024

Conversation

jmarshall
Copy link
Member

@jmarshall jmarshall commented Jul 1, 2024

While looking around for functions that could benefit from SIMD techniques a few weeks ago, I looked at fai_retrieve()'s search for (uncommon) non-graphic characters. However I soon realised that it already knows exactly where these characters are, so doesn't need to search or check for them at all!

Because fai_retrieve() is given only well-formatted input containing lines of the same length, it already knows exactly where the base and non-graphic characters are. So in general the interval to be read will look like

    ......ATGCAT    (read last six bases and line terminator)
    ATGCATGCATGC    (read complete line including line terminator)
    ATGCATGCATGC    (read complete line including line terminator)
    ATGC........    (read first four base characters)

and can be read a line at a time instead of a character at a time, with special handling for the partial first and last lines, and discarding the terminator characters at the end of each line read.

Timings in seconds for queries for regions of various lengths, with the iteration counts chosen so that develop took about a minute on each “repeatedly querying the same locus” test:

 len 6     len 41    len 100   len 400   len 4000
(on one FASTA line)  (spread over two / several / lots of FASTA lines)

Uncompressed FASTA, reading the same region repeatedly

 63.092    56.103    64.216    57.681    57.825    develop branch
 47.779    39.477    46.256    41.923    41.989    PR #1797: inline isgraph and bgzf_getc
 46.992    12.043     7.168     2.589     1.690    This PR using bgzf_read()
 42.097    10.667     5.844     2.054     1.070    This PR using inlined bgzf_read_small()

Uncompressed FASTA, jumping around to actually different regions

296.450   119.696    96.292    68.044    57.622    develop branch
283.962    99.184    76.719    52.291    41.920    PR #1797: inline isgraph and bgzf_getc
280.806    73.992    39.230    12.152     1.825    This PR using bgzf_read()
276.599    72.327    38.039    11.542     1.197    This PR using inlined bgzf_read_small()

Opened as a draft PR because this will cause test failures until #1798 is fixed.

@jkbonfield
Copy link
Contributor

I thought about this when doing my trivial edit before, but for some reason was misremembering my code. I used to use code that dealt with the EMBL sequence formats which had blocks of text with spacing in them (eg a white space every 10 chars for human readability when counting) and so discounted whole line reads as too fiddly.

However it doesn't take much thinking to realise this isn't compatible with the fai index format and it simply can't work for random access (although it is incorrectly parseable with samtools faidx and yields incorrect results, which is probably a minor bug, albeit "garbage in, garbage out").

Thanks for the update. It's a significant speed gain. I don' t know how much this actually impacts on real world pipelines, but I'm sure there are a few pathological cases out there.

Because fai_retrieve() is given only well-formatted input containing
lines of the same length, it already knows exactly where the base and
non-graphic characters are. So in general the interval to be read will
look like

    ......ATGCAT    (read last six bases and line terminator)
    ATGCATGCATGC    (read complete line including line terminator)
    ATGCATGCATGC    (read complete line including line terminator)
    ATGC........    (read first four base characters)

and can be read a line at a time instead of a character at a time,
with special handling for the partial first and last lines, and
discarding the terminator characters at the end of each line read.
@jmarshall
Copy link
Member Author

I used to use code that dealt with the EMBL sequence formats which had blocks of text with spacing in them (eg a white space every 10 chars for human readability when counting) and so discounted whole line reads as too fiddly.

Hilariously that could work with faidx if the file has Unix line terminators; e.g.

>funky
ATGCATGCAT AATTGGCCAT AAAATTTTGG GGCCCCATGC
ATGCGTCAAT AATT

If this was indexed as blen 10 len 11, faidx would retrieve the right bases without trouble! I thought this would Just Work, but unfortunately faidx indexing looks for \n and \r specifically rather than !isgraph

@whitwham whitwham merged commit 41ea68b into samtools:develop Jul 8, 2024
10 checks passed
@jmarshall jmarshall deleted the fai_retrieve branch July 8, 2024 20:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants