Speed up fai_retrieve() by reading whole lines at once #1799

jmarshall · 2024-07-01T11:28:00Z

While looking around for functions that could benefit from SIMD techniques a few weeks ago, I looked at fai_retrieve()'s search for (uncommon) non-graphic characters. However I soon realised that it already knows exactly where these characters are, so doesn't need to search or check for them at all!

Because fai_retrieve() is given only well-formatted input containing lines of the same length, it already knows exactly where the base and non-graphic characters are. So in general the interval to be read will look like

    ......ATGCAT    (read last six bases and line terminator)
    ATGCATGCATGC    (read complete line including line terminator)
    ATGCATGCATGC    (read complete line including line terminator)
    ATGC........    (read first four base characters)

and can be read a line at a time instead of a character at a time, with special handling for the partial first and last lines, and discarding the terminator characters at the end of each line read.

Timings in seconds for queries for regions of various lengths, with the iteration counts chosen so that develop took about a minute on each “repeatedly querying the same locus” test:

 len 6     len 41    len 100   len 400   len 4000
(on one FASTA line)  (spread over two / several / lots of FASTA lines)

Uncompressed FASTA, reading the same region repeatedly

 63.092    56.103    64.216    57.681    57.825    develop branch
 47.779    39.477    46.256    41.923    41.989    PR #1797: inline isgraph and bgzf_getc
 46.992    12.043     7.168     2.589     1.690    This PR using bgzf_read()
 42.097    10.667     5.844     2.054     1.070    This PR using inlined bgzf_read_small()

Uncompressed FASTA, jumping around to actually different regions

296.450   119.696    96.292    68.044    57.622    develop branch
283.962    99.184    76.719    52.291    41.920    PR #1797: inline isgraph and bgzf_getc
280.806    73.992    39.230    12.152     1.825    This PR using bgzf_read()
276.599    72.327    38.039    11.542     1.197    This PR using inlined bgzf_read_small()

~~Opened as a draft PR because this will cause test failures until #1798 is fixed.~~

jkbonfield · 2024-07-01T13:59:32Z

I thought about this when doing my trivial edit before, but for some reason was misremembering my code. I used to use code that dealt with the EMBL sequence formats which had blocks of text with spacing in them (eg a white space every 10 chars for human readability when counting) and so discounted whole line reads as too fiddly.

However it doesn't take much thinking to realise this isn't compatible with the fai index format and it simply can't work for random access (although it is incorrectly parseable with samtools faidx and yields incorrect results, which is probably a minor bug, albeit "garbage in, garbage out").

Thanks for the update. It's a significant speed gain. I don' t know how much this actually impacts on real world pipelines, but I'm sure there are a few pathological cases out there.

Because fai_retrieve() is given only well-formatted input containing lines of the same length, it already knows exactly where the base and non-graphic characters are. So in general the interval to be read will look like ......ATGCAT (read last six bases and line terminator) ATGCATGCATGC (read complete line including line terminator) ATGCATGCATGC (read complete line including line terminator) ATGC........ (read first four base characters) and can be read a line at a time instead of a character at a time, with special handling for the partial first and last lines, and discarding the terminator characters at the end of each line read.

jmarshall · 2024-07-08T11:05:32Z

I used to use code that dealt with the EMBL sequence formats which had blocks of text with spacing in them (eg a white space every 10 chars for human readability when counting) and so discounted whole line reads as too fiddly.

Hilariously that could work with faidx if the file has Unix line terminators; e.g.

>funky
ATGCATGCAT AATTGGCCAT AAAATTTTGG GGCCCCATGC
ATGCGTCAAT AATT

If this was indexed as blen 10 len 11, faidx would retrieve the right bases without trouble! I thought this would Just Work, but unfortunately faidx indexing looks for \n and \r specifically rather than !isgraph…

jmarshall mentioned this pull request Jul 1, 2024

Using bgzf_read_small() breaks subsequent bgzf_useek() calls #1798

Closed

jmarshall force-pushed the fai_retrieve branch from ef79099 to 4035a79 Compare July 1, 2024 20:12

jmarshall marked this pull request as ready for review July 1, 2024 20:13

daviesrob assigned whitwham Jul 4, 2024

daviesrob mentioned this pull request Jul 4, 2024

Windows builds via GitHub Actions #1796

Merged

whitwham force-pushed the fai_retrieve branch from 4035a79 to 3f0479b Compare July 8, 2024 11:02

whitwham merged commit 41ea68b into samtools:develop Jul 8, 2024
10 checks passed

jmarshall deleted the fai_retrieve branch July 8, 2024 20:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Speed up fai_retrieve() by reading whole lines at once #1799

Speed up fai_retrieve() by reading whole lines at once #1799

jmarshall commented Jul 1, 2024 •

edited

Loading

jkbonfield commented Jul 1, 2024

jmarshall commented Jul 8, 2024

Speed up fai_retrieve() by reading whole lines at once #1799

Speed up fai_retrieve() by reading whole lines at once #1799

Conversation

jmarshall commented Jul 1, 2024 • edited Loading

jkbonfield commented Jul 1, 2024

jmarshall commented Jul 8, 2024

jmarshall commented Jul 1, 2024 •

edited

Loading