Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pair_coalescence_rates fails with overall time window [0, float('inf')] and sometimes also with [0, 20000, 100000, float('inf')] #3035

Closed
Luker121 opened this issue Oct 17, 2024 · 5 comments · Fixed by #3038

Comments

@Luker121
Copy link

Hi,
This bug is similar to what was reported in #2986, where we discussed a similar issue that occurs with pairwise coalescent rate calculations. I have to say, that I am using the 0.6.0-dev version of tskit.

Here is my problem:

When trying to calculate the overall pairwise coalescent rates using the time window [0, float('inf')] and genomic windows, I get a similar bug, which we had before already in #2986:

Bug detected in lib/tskit/trees.c at line 9801. If you are using tskit directly please open an issue on GitHub, ideally with a reproducible example.

Here is the non-working code that produces the bug using this time window [0, float('inf')]:

import tskit
import numpy as np
import itertools

def calculate_overall_coalescent_rates(tree_seq, populations):
    sequence_length = tree_seq.sequence_length
    window_breaks = np.append(np.arange(0, sequence_length, 80000), sequence_length)
    
    
    overall_time_window = np.array([0, float('inf')])

    indexes = list(itertools.combinations_with_replacement(range(len(populations)), 2))
    
    # Calculations of coalescent rates for the entire time window
    try:
        print(f"Calculating overall coalescent rates for window: {overall_time_window}")
        overall_coal_rates = tree_seq.pair_coalescence_rates(
            time_windows=overall_time_window,
            sample_sets=populations,
            indexes=indexes,
            windows=window_breaks 
        )
        print(f"Overall coalescent rates shape: {overall_coal_rates.shape}")
    except Exception as e:
        print(f"Error calculating overall coalescent rates: {e}")


if __name__ == "__main__":
    tree_file_path = "test_tree.trees"
    tree_seq = tskit.load(tree_file_path)

    # I have 4 populations
    populations = [
        list(range(0, 30)),  # Population 1
        list(range(30, 50)),  # Population 2
        list(range(50, 62)),  # Population 3
        list(range(62, 74))   # Population 4
    ]

    calculate_overall_coalescent_rates(tree_seq, populations)

And here is the example of code that works when dividing the time windows into smaller intervals [0, 20000, 100000, float('inf')]:

import tskit
import numpy as np
import itertools

def calculate_pairwise_coalescent_rates(tree_seq, populations):
    sequence_length = tree_seq.sequence_length
    window_breaks = np.append(np.arange(0, sequence_length, 80000), sequence_length)
    
    time_windows = np.array([0, 20000, 100000, float('inf')])

    indexes = list(itertools.combinations_with_replacement(range(len(populations)), 2))
    
    # Calculation of pairwise coalescent rates for the defined time windows
    try:
        print(f"Calculating pairwise coalescent rates for windows: {window_breaks}")
        pairwise_coal_rates = tree_seq.pair_coalescence_rates(
            time_windows=time_windows,
            sample_sets=populations,
            indexes=indexes,
            windows=window_breaks 
        )
        print(f"Pairwise coalescent rates shape: {pairwise_coal_rates.shape}")
    except Exception as e:
        print(f"Error calculating pairwise coalescent rates: {e}")

if __name__ == "__main__":
    tree_file_path = "test_tree.trees"
    tree_seq = tskit.load(tree_file_path)

    # I have 4 populations
    populations = [
        list(range(0, 30)),  # Population 1
        list(range(30, 50)),  # Population 2
        list(range(50, 62)),  # Population 3
        list(range(62, 74))   # Population 4
    ]

    calculate_pairwise_coalescent_rates(tree_seq, populations)

The error seems to be triggered when using the time window [0, float('inf')] only. I attached my test_tree_SINGER.trees file as a zip file. This is actually a .trees file created by SINGER (ARG-inference by Yun Deng: https://github.com/popgenmethods/SINGER). But when using both scripts on a simulated .trees file created by SLiM, both scripts work perfectly fine. So there must be something in the .trees file created by SINGER triggering this bug.
I attached both files (test_tree_SINGER.trees and test_tree_slim.trees).
test_tree_files.zip

@Luker121 Luker121 changed the title pair_coalescence_rates fails with overall time window [0, float('inf')] pair_coalescence_rates fails with overall time window [0, float('inf')] and sometimes also with [0, 20000, 100000, float('inf')] Oct 17, 2024
@Luker121
Copy link
Author

Luker121 commented Oct 17, 2024

I realized, when trying my script with smaller time windows on different .trees files inferred by SINGER, I get the same error. So for example when using time windows like: [0, 20000, 100000, float('inf')] as in my "working" script. I get this error again:

Bug detected in lib/tskit/trees.c at line 9801. If you are using tskit directly please open an issue on GitHub, ideally with a reproducible example. (https://github.com/tskit-dev/tskit/issues) If you are using software that uses tskit, please report an issue to that software's issue tracker, at least initially.
Aborted (core dumped)

I attached now two .trees files inferred by SINGER. One of them is working, the other one is not. test_tree_SINGER_3.trees is working with the smaller time window steps and test_tree_SINGER_2.trees is not working with the smaller time window steps.

test_tree_SINGER_2_and_3.zip

I extracted also some general information for both files, but there is nothing obvious for me:

Statistics for test_tree_SINGER_2.trees:
File: test_tree_SINGER_2.trees
Sequence length: 1999999.0
Number of trees: 11500
Number of nodes: 11404
Number of edges: 39116
Number of sites: 7375
Number of mutations: 8128
Number of individuals: 37
Number of populations: 4
Number of samples: 74
Average tree height: 420705.5727773398
Max tree height: 3186686.3746369244
Min tree height: 1360.3180827641665


Statistics for test_tree_SINGER_3.trees:
File: test_tree_SINGER_3.trees
Sequence length: 1999999.0
Number of trees: 10874
Number of nodes: 10803
Number of edges: 36990
Number of sites: 7375
Number of mutations: 7882
Number of individuals: 37
Number of populations: 4
Number of samples: 74
Average tree height: 444325.56503201055
Max tree height: 1973965.6162056425
Min tree height: 1088.7423502283561

@nspope
Copy link
Contributor

nspope commented Oct 17, 2024

Thanks for the thorough report, I'll take a look!

@nspope
Copy link
Contributor

nspope commented Oct 17, 2024

This seems to just be floating point error, in a check that prop_coalescing_pairs[time_window] <= 1. This can happen when all the coalescence events (for a genomic window) are inside a single time window. We can get rid of the assertion, it's a holdover from development anyway. Will be fixed by #3038.

Just a usage note that the coalescence rate with a single time window [0, inf) (what you call "overall coalescence rate") should be equal to diversity (or divergence if it's a cross-coalescence rate between two populations), sans some slight differences in how missing sequence is handled. That is, aside from testing one shouldn't every really want to calculate coalescence rates with a single time window (use ts.divergence or ts.diversity instead).

@nspope
Copy link
Contributor

nspope commented Oct 21, 2024

Hey @Luker121 this bug should be fixed in the current github head of tskit. If you don't mind trying it out on your actual data, that'd be very helpful. Thanks!

@Luker121
Copy link
Author

Hi @nspope, I just tried it out and it works perfectly now (also on my inferred files by SINGER). Thanks so much for your fast help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants