API overhaul #80
Replies: 5 comments 6 replies
-
I agree 100% with all of this. A couple of thoughts:
|
Beta Was this translation helpful? Give feedback.
-
I like the ideas here. A few points/counterpoints
|
Beta Was this translation helpful? Give feedback.
-
On the list of dependent packages I can speak for BioStructures.jl (and consequently Molly.jl): BioAlignments.jl is only used in one place (https://github.com/BioJulia/BioStructures.jl/blob/master/src/spatial.jl#L151-L170) and that usage could benefit from an updated API. |
Beta Was this translation helpful? Give feedback.
-
It might be worth setting up a project/roadmap for v3 and v4 of BioAlignments. I've rounded up the BioAlignments.jl use cases in XAM.jl - there's certainly plenty of scope for improvement as these still contain the original code from when XAM.jl was split/lifted from BioAlignemnts.jl. XAM.jl is mostly using BioAligments.jl to calculate BAMfunction extract_cigar_rle(data::Vector{UInt8}, offset, n)
ops = Vector{BioAlignments.Operation}()
lens = Vector{Int}()
for i in offset:4:offset + (n - 1) * 4
x = unsafe_load(Ptr{UInt32}(pointer(data, i)))
op = BioAlignments.Operation(x & 0x0F)
push!(ops, op)
push!(lens, x >> 4)
end
return ops, lens
end function alignment(record::Record)::BioAlignments.Alignment
checkfilled(record)
if !ismapped(record)
return BioAlignments.Alignment(BioAlignments.AlignmentAnchor[])
end
seqpos = 0
refpos = position(record) - 1
alnpos = 0
anchors = [BioAlignments.AlignmentAnchor(seqpos, refpos, alnpos, BioAlignments.OP_START)]
for (op, len) in zip(cigar_rle(record)...)
if BioAlignments.ismatchop(op)
seqpos += len
refpos += len
elseif BioAlignments.isinsertop(op)
seqpos += len
elseif BioAlignments.isdeleteop(op)
refpos += len
else
error("operation $(op) is not supported")
end
alnpos += len
push!(anchors, BioAlignments.AlignmentAnchor(seqpos, refpos, alnpos, op))
end
return BioAlignments.Alignment(anchors)
end function alignlength(record::Record)::Int
offset = seqname_length(record)
length::Int = 0
for i in offset + 1:4:offset + n_cigar_op(record, false) * 4
x = unsafe_load(Ptr{UInt32}(pointer(record.data, i)))
op = BioAlignments.Operation(x & 0x0F)
if BioAlignments.ismatchop(op) || BioAlignments.isdeleteop(op)
length += x >> 4
end
end
return length
end SAMfunction alignment(record::Record)::BioAlignments.Alignment
if ismapped(record)
return BioAlignments.Alignment(cigar(record), 1, position(record))
end
return BioAlignments.Alignment(BioAlignments.AlignmentAnchor[])
end function alignlength(record::Record)::Int
if length(record.cigar) == 1 && record.data[first(record.cigar)] == UInt8('*')
return 0
end
ret::Int = 0
len = 0 # operation length
for i in record.cigar
c = record.data[i]
if in(c, UInt8('0'):UInt8('9'))
len = len * 10 + (c - UInt8('0'))
continue
end
op = convert(BioAlignments.Operation, Char(c))
if BioAlignments.ismatchop(op) || BioAlignments.isdeleteop(op) #Note: reference consuming ops ('M', 'D', 'N', '=', 'X').
ret += len
end
len = 0
end
return ret
end |
Beta Was this translation helpful? Give feedback.
-
Rad, I'm glad to see this discussion picking up steam so fast. I'm in favor of @MillironX 's proposal to facilitate v2.x continuing to work while @jakobnissen and others work on the new API. I wonder if it would be worth breaking the later work into a fork for clarity, or at least having a separate Either way, I'm 100% on board with setting up a github project, I quite like them. I can try to set that up tomorrow. |
Beta Was this translation helpful? Give feedback.
-
TL;DR:
The problem
About a year ago I taught a course where we had some exercises with BioAlignments.jl.
Unfortunately, both the students and teachers found the API to be quite unintuitive and difficult to use:
Needlessly nested types
When you create a pairwise alignment result, it creates this nested type with the layout:
Using this object requires the user to access these deeply nested types doing e.g.
aln.aln.a.aln.anchors
. Missing with fields is un-idiomatic Julia, as fields are normally presumed to be internal.It is also bad for discoverability - what on earth is the
a
field of aPairwiseAlignment
? And what's the difference betweenPairwiseAlignment
,Alignment
andAlignedSequence
? And why isAlignment
with its generic name a sub-field ofPairwiseAlignment
? It's all deeply confusing.Unclear what is internal and external
Another issue is that it's difficult to know what is part of the API and what isn't. See the two innermost integer fields. Currently, no code in BioAlignments reads these fields. So are they redundant? Who knows! We can't change the layout of any fields in case a user reads it directly. Reading the docs does not make it clear either what is public and private
Unpleasant API
pairalign(GlobalAlignment(), dna"TAG", dna"AAA", AffineGapScoreModel(EDNAFULL, gap_open=-12, gap_extend=-2))
. Not exactly brief. We can do better.Needless complication
For all the different types in BioAlignments, it's not clear what they are used for or how to use them. For example:
Alignment
that I can't do with anAlignedSequence
?AbstractCostModel
and anAbstractScoreModel
?isscore
a field inPairwiseAlignmentResult
, if the only difference between this type andPairwiseAlignment
is the presence of a score in the former?The proposed solution
Rethink it again, and rewrite the package from the ground up.
What BioAlignments should do
What is our goals? As far as I can tell, BioAlignments currently can, and therefore should support at least:
seq2ref
and similar functions)Please let me know if I missed something
Concrete proposals
Refactor tests
Splitting the tests into a lot of smaller files makes testing easier.
Simplify AlignmentAnchor
Currently, it takes up 32 bytes (!). I think we can encode it into 32 bits, using 8 times less data, by not storing the positions, but re-computing them.
This will make
seq2ref
function O(n), but lookups can be cached.The idea is to store only two types of information: What kind of operation it is, and how "long" it is.
We can use e.g. 5 bits for the former and 27 bits for the latter. It's exceedingly unlikely people are going to need indels of > 100_000_000 bp.
Merge cost models and score models
I don't understand the difference between these two. Can't they just be merged to one type?
Simplify PairwiseAlignment type structure
We can remove types
PairwiseAlignmentResult
,Alignment
andAlignedSequence
,AlignmentAnchor
, and simply haveI can't see what this type can't do which the current 6 nested types is able to do.
Remove alignment of
String
Only
BioSequence
types should be able to be aligned - I think.The alignment process internally converts all the elements to
BioSymbol
anyway.Make all public fields accessible via a function
That is, to get the reference sequence of an alignment, all documentation and examples must use
reference(aln)
, notaln.ref
Misc
defined
field doing in SubstitutionMatrix? Does it matter?AffineGapScoreModel
s need easier constructors.Ordered checklist
Beta Was this translation helpful? Give feedback.
All reactions