Releases: naturalis/barcode-constrained-phylogeny
v1.2.0
v1.0.0 - Monkeys
This release is the first end-to-end implementation of the pipeline. It is made available here for reference purposes but must not be used in production settings. It has at least the following issues:
- BIN representatives are selected by their total length. This means that sequences that span multiple markers 'win', even if they have missing data within the focal marker. Instead, the longest complete sequence within the marker should be selected. The same is true for the clade-level exemplars.
- When clade-level outgroups are selected that end up being placed inside the ingroup, raxml-ng fails to root the resulting tree. Possibly this can be addressed with the `reroot_backbone.py` script, but perhaps the whole business can be avoided by doing phylogenetic placement followed by branch length optimization, as per this issue.
- We currently select the 'shallowest' exemplars, but this seems to produce suboptimal branch lengths: most of the backbone changes leading to the clade end up being placed as symplesiomorphies, i.e. on a long basal branch. Subsequently grafting the clade on top of that leads to 'tall' clades. Maybe use 'tallest' exemplars instead? Or 'most average'? More discussion in this issue.
- The number of families that go into the pipeline needs to be specified ahead of time. Parallelization is based on this and it helps create the DAG. However, this means having to know things that only become knowable once the SQLite database is populated. Maybe `dynamic()` can be used? Maybe `scattergather.split` can be updated?
- Logs are produced in large numbers in the parallelized steps. This is messy, so a subfolder for those steps would be better. Also, the logs are currently in an unparseable (at least, by PyCharm) format, so the configuration of `logging` needs to be updated to conform to a more standard format.
- There are no tests of any kind. It is supposedly possible to generate these after running the pipeline (this needs to be investigated). The presence of tests is a requirement for submission to WorkflowHub.
- The final tree only has process IDs at its tips. It would be nice if these (and ideally also interior nodes) were annotated with taxon names, BIN numbers, and OpenTree taxon IDs.
- It should also be possible to filter on country instead of taxonomy, so that we can produce the area phylogeny (e.g. of The Netherlands).
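
The first issue in the list (selecting BIN representatives by completeness within the focal marker rather than by total length) could be sketched roughly as follows. This is only an illustration under stated assumptions: the record layout, the names `BarcodeRecord`, `resolved_length` and `pick_bin_representatives`, and the idea of measuring completeness as the count of unambiguous A/C/G/T bases are all hypothetical and are not taken from the pipeline's code or its SQLite schema.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class BarcodeRecord:
    """Hypothetical record layout; the real database schema may differ."""
    process_id: str
    bin_uri: str
    marker: str
    sequence: str


def resolved_length(sequence: str) -> int:
    """Count unambiguous bases, ignoring gaps, Ns and other ambiguity codes."""
    return sum(1 for base in sequence.upper() if base in "ACGT")


def pick_bin_representatives(
    records: Iterable[BarcodeRecord], focal_marker: str
) -> Dict[str, BarcodeRecord]:
    """Per BIN, keep the record with the most resolved bases in the focal marker.

    Records for other markers are skipped, so a sequence cannot 'win' merely
    because it spans several markers. Ties keep the first record seen.
    """
    best: Dict[str, BarcodeRecord] = {}
    for rec in records:
        if rec.marker != focal_marker:
            continue
        current = best.get(rec.bin_uri)
        if current is None or resolved_length(rec.sequence) > resolved_length(current.sequence):
            best[rec.bin_uri] = rec
    return best


# Made-up example data, for illustration only.
records = [
    BarcodeRecord("P1", "BOLD:AAA0001", "COI-5P", "ACGTNNNNNN"),
    BarcodeRecord("P2", "BOLD:AAA0001", "COI-5P", "ACGTACGT"),
    BarcodeRecord("P3", "BOLD:AAA0001", "matK", "ACGTACGTACGTACGT"),
    BarcodeRecord("P4", "BOLD:AAA0002", "COI-5P", "AC-GT"),
]
reps = pick_bin_representatives(records, "COI-5P")
```

In this toy data, `P1` is longer than `P2` as a raw string but has fewer resolved bases, and `P3` is longest overall but belongs to a different marker, so `P2` is chosen for the first BIN. The same scoring could presumably be reused for clade-level exemplars.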
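
For the logging issue, one possible direction is a per-step log subfolder combined with a conventional timestamp/level/name layout from Python's standard `logging` module. The sketch below is an assumption about how this might be arranged, not the pipeline's current configuration: the helper `configure_step_logger`, the `logs` directory and the step name `family_fasta` are hypothetical, and whether a given IDE parses this format has not been verified here.

```python
import logging
from pathlib import Path


def configure_step_logger(step: str, log_dir: str = "logs") -> logging.Logger:
    """Return a logger that writes to <log_dir>/<step>/<step>.log."""
    step_dir = Path(log_dir) / step
    step_dir.mkdir(parents=True, exist_ok=True)

    logger = logging.getLogger(step)
    logger.setLevel(logging.INFO)

    handler = logging.FileHandler(step_dir / f"{step}.log")
    handler.setFormatter(
        logging.Formatter(
            fmt="%(asctime)s %(levelname)s %(name)s: %(message)s",
            datefmt="%Y-%m-%d %H:%M:%S",
        )
    )

    # Replace any handlers from an earlier call so lines are not duplicated.
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.propagate = False
    return logger
```

A parallelized step would then call, for example, `configure_step_logger("family_fasta").info("started")`, which keeps that step's output in its own subfolder.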