Add cluster sampling strategies #40
Comments
/take |
Hi! @csjackson0 would love to help here! |
Hi! I thought I'd follow up once more. Please let me know, as I have some time now but might not have as much in the future. |
Hi all (@csjackson0 and @NAEV95). The plan is to simply change the current dataset structure: we will split each uniref30 cluster into a separate pickle file that contains all the colabfolddb sequences belonging to that cluster, plus one additional file that records the number of sequences in each of these files. Then we can write a dataset that samples these pickle files according to some function of the cluster size (this should be passed in as a callable; there is no need to have it in the dataset class). Does this sound like a plan? It should take very little work to get going. I will do the splitting into files myself. |
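To make the sampling idea above concrete, here is a minimal sketch of drawing clusters with probability proportional to a user-supplied function of cluster size. The names `sample_cluster_ids`, `cluster_sizes` and `weight_fn` are illustrative assumptions and not part of the project's code, and the sizes dictionary is made-up data standing in for the per-cluster counts file.

```python
import math
import random
from typing import Callable, Dict, List, Optional


def sample_cluster_ids(
    cluster_sizes: Dict[str, int],
    weight_fn: Callable[[int], float],
    k: int,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Draw k cluster ids (with replacement), weighted by weight_fn(size)."""
    rng = rng or random.Random()
    ids = list(cluster_sizes)
    # The weighting lives outside the dataset: any callable on the size works.
    weights = [weight_fn(cluster_sizes[cluster_id]) for cluster_id in ids]
    return rng.choices(ids, weights=weights, k=k)


# Hypothetical sizes, standing in for the file of per-cluster sequence counts.
sizes = {"cluster_a": 1, "cluster_b": 100, "cluster_c": 10_000}

# Square-root weighting dampens the dominance of very large clusters.
picked = sample_cluster_ids(sizes, weight_fn=math.sqrt, k=5, rng=random.Random(0))
```

Passing `lambda n: 1.0` instead would sample clusters uniformly, and `lambda n: n` would sample in proportion to cluster size, so the strategy can change without touching the dataset class.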
One important part is that the dataset should have a simple cache. Thus we can have a function like
This should allow us to tune our memory requirements while using the file system responsibly. |
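A minimal sketch of what such a cached loader could look like, assuming one pickle file per cluster named `<cluster_id>.pkl` under a common directory. `make_cluster_loader`, `load_cluster` and `max_cached` are illustrative names, not the function from the comment above; the bounded cache here is the standard library's `functools.lru_cache`.

```python
import pickle
from functools import lru_cache
from pathlib import Path
from typing import Callable, List


def make_cluster_loader(root: str, max_cached: int = 128) -> Callable[[str], List[str]]:
    """Return a loader that keeps at most max_cached unpickled clusters in memory."""

    @lru_cache(maxsize=max_cached)
    def load_cluster(cluster_id: str) -> List[str]:
        # Assumed layout: one pickle per cluster, holding a list of sequences.
        with open(Path(root) / f"{cluster_id}.pkl", "rb") as handle:
            return pickle.load(handle)

    return load_cluster
```

Here `max_cached` is the knob that trades memory for file reads: a larger value keeps more clusters resident, a smaller one re-reads files more often. Note that a cached call returns the same list object each time, so callers should not mutate it in place.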
@csjackson0 Yes let’s do this 💪 |
Hey @NAEV95, take a look at the PR I made and feel free to make suggestions.