-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add 'PBS()' #2717
Comments
A pull request would be very welcome if you'd like to contribute @atongsa! If you're interested, we'd be happy to provide some concrete pointers on where to start. (The development docs should get you started, in the first instance) |
thank you, sir. I am learning the development docs and will try to add a PBS() function to tskit. |
i imagine the ts.PBS() logic is like,
but i dont find ways to read vcf into tree sequence (just write_vcf) in tskit document, can you give me some prompt |
As it's a "derived" statistic, I would imagine the implementation would look something like that of Fst. However, it may be that it's not as complicated as that. Perhaps you could try implementing this as a standalone function based on some simple simulations, and paste your code in here? We wouldn't be working with VCF here, just tskit tree sequence objects. |
Just FYI @atongsa, there is a long and complicated "inference" procedure to try and construct a (sometimes) reasonable tree sequence from the much more limited genetic variation data present in a VCF (e.g. via tsinfer). Moreover, even once you manage to infer a tree sequence, to calculate branch length stats like Fst, you need to date the tree sequence too (e.g. via tsdate). So to implement stats calculations in tskit, you'll want to work with directly simulated tree sequences, not VCF data. |
i write a demo function, and i will improve it later for sites and windows calculation (because i don't understand some function in trees.py, i need some time to learn), and then try to make a tskit.pbs() pull request def pbs(ts, test, ref1, ref2):
"""
Calculates the population branch statistic (PBS) between a test population and two reference populations.
Parameters
----------
ts: tree sequence
test : numpy.ndarray, nodes as the focus population
ref1 : numpy.ndarray, nodes as the far away population
ref2 : numpy.ndarray, nodes as the sister group of focus population
Returns
-------
PBS values of the focus population
"""
# Calculate pairwise Fst values
fst_test_ref1 = ts.Fst([test, ref1])
fst_test_ref2 = ts.Fst([test, ref2])
fst_ref1_ref2 = ts.Fst([ref1, ref2])
# Calculate PBS
pbs = (-math.log(1-fst_test_ref1, 10) + -math.log(1-fst_test_ref2, 10) - -math.log(1-fst_ref1_ref2, 10)) / 2
return pbs the test code below get a return value
|
Working through the motivation and implementation in #2777 here: so, the statistic is defined in terms of some variables called The main question I have at this point is whether we should really be using Now, our Fst is implemented as
and so letting
and so plugging this in to the definition of PBS we'd get that
I'm not seeing any particular simplification or insight here, just writing it down. Okay - I think defining PBS in terms of Fst is what we ought to do - at least, it agrees with the literature. It would be interesting to look at the other statistic we'd get by just putting in divergences (I think that's been called a "y-statistic"?), but that's a separate topic. |
PBS (population branch statistics) 1 is derived from Fst, which can detect positive selection signals in admixted populations .
the formula is
Footnotes
Yi, X. et al. Sequencing of 50 Human Exomes Reveals Adaptation to High Altitude. Science 329, 75–78 (2010). ↩
The text was updated successfully, but these errors were encountered: