API overhaul #80

jakobnissen · 2022-06-15T15:35:31Z

jakobnissen
Jun 15, 2022
Maintainer

TL;DR:

The current BioAlignments API is confusing and hard to use
We should make a breaking release which overhauls the entire API
I would like to, when I have time, make a flurry of PRs implementing this new API
I would like a discussion about it first so as not to break people's code without reason. This is that discussion.

The problem

About a year ago I taught a course where we had some exercises with BioAlignments.jl.
Unfortunately, both the students and teachers found the API to be quite unintuitive and difficult to use:

Needlessly nested types

When you create a pairwise alignment result, it creates this nested type with the layout:

PairwiseAlignmentResult{Int64, LongDNASeq, LongDNASeq}
  value: Int64 22
  isscore: Bool true
  aln: PairwiseAlignment{LongDNASeq, LongDNASeq}
    a: AlignedSequence{LongDNASeq}
      seq: LongDNASeq
      aln: Alignment
        anchors: Vector{AlignmentAnchor}
        firstref: Int64 1
        lastref: Int64 9
    b: LongDNASeq

Using this object requires the user to reach into these deeply nested types, e.g. aln.aln.a.aln.anchors. Messing with fields is un-idiomatic Julia, as fields are normally presumed to be internal.

It is also bad for discoverability - what on earth is the a field of a PairwiseAlignment? And what's the difference between PairwiseAlignment, Alignment and AlignedSequence? And why is Alignment with its generic name a sub-field of PairwiseAlignment? It's all deeply confusing.

Unclear what is internal and external

Another issue is that it's difficult to know what is part of the API and what isn't. See the two innermost integer fields, firstref and lastref. Currently, no code in BioAlignments reads these fields. So are they redundant? Who knows! We can't change the layout of any fields in case a user reads them directly. Reading the docs does not make it clear what is public and what is private, either.

Unpleasant API

Accessing these nested types is a pain
The shortest way to make a pairwise alignment I have found is pairalign(GlobalAlignment(), dna"TAG", dna"AAA", AffineGapScoreModel(EDNAFULL, gap_open=-12, gap_extend=-2)). Not exactly brief. We can do better.
Why are there no default models?

Needless complication

For all the different types in BioAlignments, it's not clear what they are used for or how to use them. For example:

What can I do with an Alignment that I can't do with an AlignedSequence?
What's the difference between an AbstractCostModel and an AbstractScoreModel?
And why are these even standalone types if there is no way to score an existing alignment?
Why is isscore a field in PairwiseAlignmentResult, if the only difference between this type and PairwiseAlignment is the presence of a score in the former?

The proposed solution

Rethink it again, and rewrite the package from the ground up.

What BioAlignments should do

What are our goals? As far as I can tell, BioAlignments currently can, and therefore should, support at least:

Pairwise alignment of different sequence types, using different scoring models, under different alignment algorithms
Evaluate an existing alignment under a different model
Given an alignment, switch between sequence, reference and alignment position (i.e. seq2ref and similar functions)
Given an alignment, compute deletions, insertions, mismatches etc.
Convert an alignment to/from a CIGAR string
Iterate over an alignment, getting seq/ref pairs.

Please let me know if I missed something
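
As a rough illustration of the CIGAR item in the list above, here is a minimal sketch of a round trip between a CIGAR string and run-length operations. It assumes an alignment can be modelled as a vector of (operation character, length) pairs; parse_cigar and to_cigar are hypothetical names, not existing BioAlignments functions.

```julia
# Hypothetical, self-contained sketch: not the BioAlignments API.
# An alignment is modelled as a vector of (operation character, run length) pairs.

# Parse a CIGAR string such as "3M1I2D" into run-length pairs.
function parse_cigar(cigar::AbstractString)
    ops = Tuple{Char,Int}[]
    len = 0
    seen_digit = false
    for c in cigar
        if isdigit(c)
            len = 10 * len + (c - '0')
            seen_digit = true
        else
            seen_digit || error("operation $c has no length")
            push!(ops, (c, len))
            len = 0
            seen_digit = false
        end
    end
    seen_digit && error("trailing length without an operation")
    return ops
end

# Write run-length pairs back out as a CIGAR string.
to_cigar(ops) = join(string(len, op) for (op, len) in ops)

ops = parse_cigar("3M1I2D")
println(ops)
println(to_cigar(ops))
```

The sketch does not validate operation characters; a real implementation would presumably reject anything outside the SAM operation set.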

Concrete proposals

Refactor tests

Splitting the tests into a lot of smaller files makes testing easier.

Simplify AlignmentAnchor

Currently, it takes up 32 bytes (!). I think we can encode it into 32 bits, using 8 times less data, by not storing the positions, but re-computing them.
This will make the seq2ref function O(n), but lookups can be cached.
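
To make the O(n) cost concrete, here is a minimal sketch of a lookup that recomputes positions by walking the operations. It assumes operations are (kind, length) pairs with kind one of :match, :insert or :delete; seq2ref_scan is a hypothetical name, and its return convention (nothing inside an insertion) is an assumption that may differ from the current seq2ref.

```julia
# Hypothetical sketch: positions are recomputed by walking the operations,
# instead of being stored in every anchor.

# Return the reference position aligned to sequence position `seqpos`,
# or `nothing` if that position falls inside an insertion.
function seq2ref_scan(ops, seqpos::Integer, firstref::Integer=1)
    seqpos >= 1 || error("sequence position must be positive")
    s = 0              # sequence positions consumed so far
    r = firstref - 1   # reference positions consumed so far
    for (kind, len) in ops
        if kind === :match
            if seqpos <= s + len
                return r + (seqpos - s)
            end
            s += len
            r += len
        elseif kind === :insert
            seqpos <= s + len && return nothing
            s += len
        elseif kind === :delete
            r += len
        else
            error("unsupported operation $kind")
        end
    end
    error("sequence position $seqpos is outside the alignment")
end

ops = [(:match, 3), (:insert, 2), (:delete, 1), (:match, 4)]
println(seq2ref_scan(ops, 6))  # reference position 5
```

Caching could then amount to storing the running totals s and r per operation once, so that repeated lookups can use a binary search instead of a full scan.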

The idea is to store only two types of information: What kind of operation it is, and how "long" it is.
We can use e.g. 5 bits for the former and 27 bits for the latter. It's exceedingly unlikely people are going to need indels of > 100_000_000 bp.
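
A minimal sketch of that 5 + 27 bit layout, assuming the operation is identified by a small integer code. PackedOp, opcode and oplength are hypothetical names chosen for illustration, not part of BioAlignments.

```julia
# Hypothetical 32-bit anchor: the top 5 bits hold the operation code,
# the low 27 bits hold the run length.
struct PackedOp
    bits::UInt32
end

const LEN_BITS = 27
const LEN_MASK = (UInt32(1) << LEN_BITS) - UInt32(1)
const MAX_LEN = Int(LEN_MASK)   # 2^27 - 1 = 134_217_727

function PackedOp(op::Integer, len::Integer)
    0 <= op < 32 || throw(ArgumentError("operation code must fit in 5 bits"))
    0 <= len <= MAX_LEN || throw(ArgumentError("length must fit in 27 bits"))
    return PackedOp((UInt32(op) << LEN_BITS) | UInt32(len))
end

opcode(p::PackedOp) = Int(p.bits >> LEN_BITS)
oplength(p::PackedOp) = Int(p.bits & LEN_MASK)

p = PackedOp(2, 1_000)
println((opcode(p), oplength(p), sizeof(PackedOp)))
```

With 27 bits the largest representable run is 134_217_727, which is where the rough 100_000_000 bp limit comes from.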

Merge cost models and score models

I don't understand the difference between these two. Can't they just be merged to one type?

Simplify PairwiseAlignment type structure

We can remove types PairwiseAlignmentResult, Alignment and AlignedSequence, AlignmentAnchor, and simply have

struct PairwiseAlignment{T, S, R}
    score::T
    seq::S
    ref::R
    ops::Vector{Operation}
end

I can't see anything the current 6 nested types can do that this type can't.
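
For instance, iterating seq/ref pairs only needs the two sequences and the operations. The following is a minimal sketch under the assumption that operations are run-length (kind, length) pairs; aligned_pairs is a hypothetical name, and plain strings stand in for biological sequences purely to keep the example self-contained.

```julia
# Hypothetical sketch: walk run-length operations and emit (seq, ref) symbol
# pairs, with '-' marking a gap.
function aligned_pairs(seq::AbstractString, ref::AbstractString, ops)
    pairs = Tuple{Char,Char}[]
    i = 1  # next index into seq
    j = 1  # next index into ref
    for (kind, len) in ops
        for _ in 1:len
            if kind === :match
                push!(pairs, (seq[i], ref[j]))
                i += 1
                j += 1
            elseif kind === :insert
                push!(pairs, (seq[i], '-'))
                i += 1
            elseif kind === :delete
                push!(pairs, ('-', ref[j]))
                j += 1
            else
                error("unsupported operation $kind")
            end
        end
    end
    return pairs
end

ops = [(:match, 1), (:insert, 1), (:match, 2), (:delete, 1)]
println(aligned_pairs("ACGT", "AGTT", ops))
```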

Remove alignment of `String`

Only BioSequence types should be able to be aligned - I think.
The alignment process internally converts all the elements to BioSymbol anyway.

Make all public fields accessible via a function

That is, to get the reference sequence of an alignment, all documentation and examples must use reference(aln), not aln.ref
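
A minimal sketch of what that could look like for the flat type proposed above. The accessor names other than reference are assumptions, Operation is a stand-in enum rather than the BioAlignments definition, and strings are used in place of BioSequence types only so the example runs without dependencies.

```julia
# Hypothetical sketch of the flat type with accessor functions.
@enum Operation OP_MATCH OP_INSERT OP_DELETE

struct PairwiseAlignment{T,S,R}
    score::T
    seq::S
    ref::R
    ops::Vector{Operation}
end

# Public accessors: documentation and examples use these, never the fields,
# so the field layout stays free to change.
score(aln::PairwiseAlignment) = aln.score
sequence(aln::PairwiseAlignment) = aln.seq
reference(aln::PairwiseAlignment) = aln.ref
operations(aln::PairwiseAlignment) = aln.ops

aln = PairwiseAlignment(22, "ACGT", "ACGA", fill(OP_MATCH, 4))
println(reference(aln))
```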

Misc

What is the defined field doing in SubstitutionMatrix? Does it matter?
AffineGapScoreModels need easier constructors.
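
One possible shape for an easier constructor is a keyword constructor where every argument has a default. This is only a sketch on a stand-in type: AffineGapModel and its fields are hypothetical, and the default values are placeholders (the gap values are copied from the pairalign example earlier in this post), not a proposal for the actual defaults.

```julia
# Hypothetical stand-in type, not the BioAlignments definition.
struct AffineGapModel
    match::Int
    mismatch::Int
    gap_open::Int
    gap_extend::Int
end

# Keyword constructor: callers override only what they care about.
# All default values below are placeholders for illustration.
function AffineGapModel(; match=5, mismatch=-4, gap_open=-12, gap_extend=-2)
    return AffineGapModel(match, mismatch, gap_open, gap_extend)
end

default_model = AffineGapModel()
strict_model = AffineGapModel(gap_open=-20)
println((default_model, strict_model))
```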

Ordered checklist

Refactor tests
Collapse PairwiseAlignment type
Collapse cost model and score models
Simplify AlignmentAnchor

kescobo · 2022-06-15T19:02:27Z

kescobo
Jun 15, 2022
Maintainer

I agree 100% with all of this. A couple of thoughts:

this is a small enough list that it might be worth opening Issues on all of them to see if anyone has additional thoughts.
I'd be happy to do the testing refactor, especially if I can use one of my new favorite packages - ReTest.jl
There was some discussion at some point (I haven't bothered to look for it) about what an "alignment primitive" or something would be. E.g., is there any reason to have an AlignmentsCore.jl that XAM.jl and other packages could use?

0 replies

MillironX · 2022-06-15T21:11:15Z

MillironX
Jun 15, 2022
Maintainer

I like the ideas here. A few points/counterpoints

Can you clarify what the API would look like for AlignmentAnchor? You wrote out a section on how to "Simplify AlignmentAnchor," and then later say "We can remove types ... AlignmentAnchor." I can see how AlignmentAnchor can be simplified, but am skeptical that it can be removed entirely.
I've got quite a bit of code depending on #64 (and consequently #44). I would really appreciate having most of the current PRs (#60, #62, #64, #76) merged and tagged as a "BioAlignments 3" before everything breaks again.

In the short term, it still might be worth writing getter functions and forwarded getter functions for the nested types. Example

# I think a few of these _are_ implemented, but I can't remember which ones, and they aren't well-documented
alignment(as::AlignedSequence) = as.aln
alignment(pa::PairwiseAlignment) = alignment(pa.a)
alignment(par::PairwiseAlignmentResult) = alignment(par.aln)
sequence(as::AlignedSequence) = as.seq
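# alignment(pa) returns the innermost Alignment, which holds no sequence, so forward through the wrappers instead
sequence(pa::PairwiseAlignment) = sequence(pa.a)
sequence(par::PairwiseAlignmentResult) = sequence(par.aln)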

Your list of "What BioAlignments should do" is spot-on: I can't think of anything to add or remove
I love the idea of speeding up AlignmentAnchor: my benchmarking shows that calling Alignment(cigar) is a major bottleneck in many of my programs.

2 replies

kescobo Jun 16, 2022
Maintainer

Is there a reason these changes would need another major version? They all seem like additional features that aren't breaking, or did I misunderstand?

I'm all for getting these things in and tagged before breaking everything, if it can be done in a timely way.

CiaranOMara Jun 16, 2022
Maintainer

The hotfix #76 could take v2 features. But, after that is merged, we'd need to address #44, which is breaking.

jgreener64 · 2022-06-15T21:53:25Z

jgreener64
Jun 15, 2022
Maintainer

On the list of dependent packages I can speak for BioStructures.jl (and consequently Molly.jl): BioAlignments.jl is only used in one place (https://github.com/BioJulia/BioStructures.jl/blob/master/src/spatial.jl#L151-L170) and that usage could benefit from an updated API.

0 replies

CiaranOMara · 2022-06-15T23:29:57Z

CiaranOMara
Jun 15, 2022
Maintainer

It might be worth setting up a project/roadmap for v3 and v4 of BioAlignments.

I've rounded up the BioAlignments.jl use cases in XAM.jl. There's certainly plenty of scope for improvement, as these still contain the original code from when XAM.jl was split/lifted from BioAlignments.jl.

XAM.jl mostly uses BioAlignments.jl to calculate the right-side position of a BAM.Record or SAM.Record via the alignment length. Otherwise, it is populating BioAlignments.Alignment.

BAM

https://github.com/BioJulia/XAM.jl/blob/5e7973c1c5e732d95096c54c667798eb5eee75e1/src/bam/record.jl#L385-L395

function extract_cigar_rle(data::Vector{UInt8}, offset, n)
    ops = Vector{BioAlignments.Operation}()
    lens = Vector{Int}()
    for i in offset:4:offset + (n - 1) * 4
        x = unsafe_load(Ptr{UInt32}(pointer(data, i)))
        op = BioAlignments.Operation(x & 0x0F)
        push!(ops, op)
        push!(lens, x >> 4)
    end
    return ops, lens
end
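
The packed layout read by the function above (low 4 bits for the operation, remaining 28 bits for the length) can also be decoded one word at a time without pointer arithmetic. A minimal sketch, assuming the standard BAM operation order MIDNSHP=X; decode_cigar_word and BAM_OPS are hypothetical names, not XAM.jl or BioAlignments functions.

```julia
# Hypothetical sketch: decode one BAM CIGAR word into (operation, length).
# Operation codes 0-8 map onto the characters below, in this order.
const BAM_OPS = "MIDNSHP=X"

function decode_cigar_word(x::UInt32)
    code = Int(x & 0x0f)   # low 4 bits: operation code
    code < length(BAM_OPS) || error("invalid CIGAR operation code $code")
    return (BAM_OPS[code + 1], Int(x >> 4))   # remaining bits: run length
end

word = (UInt32(10) << 4) | UInt32(2)
println(decode_cigar_word(word))  # ('D', 10)
```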

https://github.com/BioJulia/XAM.jl/blob/5e7973c1c5e732d95096c54c667798eb5eee75e1/src/bam/record.jl#L432-L456

function alignment(record::Record)::BioAlignments.Alignment
    checkfilled(record)
    if !ismapped(record)
        return BioAlignments.Alignment(BioAlignments.AlignmentAnchor[])
    end
    seqpos = 0
    refpos = position(record) - 1
    alnpos = 0
    anchors = [BioAlignments.AlignmentAnchor(seqpos, refpos, alnpos, BioAlignments.OP_START)]
    for (op, len) in zip(cigar_rle(record)...)
        if BioAlignments.ismatchop(op)
            seqpos += len
            refpos += len
        elseif BioAlignments.isinsertop(op)
            seqpos += len
        elseif BioAlignments.isdeleteop(op)
            refpos += len
        else
            error("operation $(op) is not supported")
        end
        alnpos += len
        push!(anchors, BioAlignments.AlignmentAnchor(seqpos, refpos, alnpos, op))
    end
    return BioAlignments.Alignment(anchors)
end

https://github.com/BioJulia/XAM.jl/blob/5e7973c1c5e732d95096c54c667798eb5eee75e1/src/bam/record.jl#L467-L478

function alignlength(record::Record)::Int
    offset = seqname_length(record)
    length::Int = 0
    for i in offset + 1:4:offset + n_cigar_op(record, false) * 4
        x = unsafe_load(Ptr{UInt32}(pointer(record.data, i)))
        op = BioAlignments.Operation(x & 0x0F)
        if BioAlignments.ismatchop(op) || BioAlignments.isdeleteop(op)
            length += x >> 4
        end
    end
    return length
end

SAM

https://github.com/BioJulia/XAM.jl/blob/5e7973c1c5e732d95096c54c667798eb5eee75e1/src/sam/record.jl#L315-L322

function alignment(record::Record)::BioAlignments.Alignment
    if ismapped(record)
        return BioAlignments.Alignment(cigar(record), 1, position(record))
    end
    return BioAlignments.Alignment(BioAlignments.AlignmentAnchor[])
end

https://github.com/BioJulia/XAM.jl/blob/5e7973c1c5e732d95096c54c667798eb5eee75e1/src/sam/record.jl#L333-L352

function alignlength(record::Record)::Int
    if length(record.cigar) == 1 && record.data[first(record.cigar)] == UInt8('*')
        return 0
    end
    ret::Int = 0
    len = 0  # operation length
    for i in record.cigar
        c = record.data[i]
        if in(c, UInt8('0'):UInt8('9'))
            len = len * 10 + (c - UInt8('0'))
            continue
        end
        op = convert(BioAlignments.Operation, Char(c))
        if BioAlignments.ismatchop(op) || BioAlignments.isdeleteop(op) #Note: reference consuming ops ('M', 'D', 'N', '=', 'X').
            ret += len
        end
        len = 0
    end
    return ret
end

0 replies

kescobo · 2022-06-16T01:08:47Z

kescobo
Jun 16, 2022
Maintainer

Rad, I'm glad to see this discussion picking up steam so fast. I'm in favor of @MillironX 's proposal to facilitate v2.x continuing to work while @jakobnissen and others work on the new API. I wonder if it would be worth breaking the later work into a fork for clarity, or at least having a separate main-for-v3 branch so that the two processes can work in parallel.

Either way, I'm 100% on board with setting up a github project, I quite like them. I can try to set that up tomorrow.

4 replies

kescobo Jun 16, 2022
Maintainer

Alright, I think I captured everything here: https://github.com/BioJulia/BioAlignments.jl/projects?type=beta

MillironX Jun 20, 2022
Maintainer

Alright, I think I captured everything here: BioJulia/BioAlignments.jl/projects?type=beta

Are projects restricted somehow? I don't see anything.

kescobo Jun 20, 2022
Maintainer

Huh - that's weird. I didn't think so. I'm on vacation without my laptop, is anyone else able to check?

kescobo Jun 20, 2022
Maintainer

I just checked, and indeed when I wasn't signed in, I couldn't see that. I couldn't see how to change that at a quick glance though :-/
