
Add cluster sampling strategies #40

Open
pascalnotin opened this issue Sep 7, 2023 · 9 comments

Comments

@pascalnotin
Collaborator

  1. Create code that maps each sequence in the input dataset to a cluster representative, where clusters are formed at a maximum of 30% sequence similarity (e.g., mapping Uniref100 sequences to their corresponding Uniref30 cluster).
  2. Add two sampling strategies to form input training batches (using Uniref30/100 as the example; see the sketch below):
  • sample Uniref30 clusters uniformly at random and then, within each cluster, sample a Uniref100 sequence
  • sample Uniref30 clusters with probability weighted by log(uniref30_cluster_size) and then, within each cluster, sample a Uniref100 sequence
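A minimal sketch of the two strategies (the cluster_to_members mapping and the function name are placeholders for illustration, not anything in the codebase):

import math
import random

def sample_sequence(cluster_to_members, weighting="uniform", rng=random):
    # cluster_to_members: dict of Uniref30 cluster id -> list of Uniref100 member ids
    clusters = list(cluster_to_members)
    if weighting == "uniform":
        # strategy 1: every cluster is equally likely
        cluster = rng.choice(clusters)
    else:
        # strategy 2: cluster probability proportional to log(cluster size);
        # the +1 keeps singleton clusters (log 1 = 0) samplable
        weights = [math.log(len(cluster_to_members[c]) + 1) for c in clusters]
        cluster = rng.choices(clusters, weights=weights, k=1)[0]
    # within the chosen cluster, pick a member uniformly
    return rng.choice(cluster_to_members[cluster])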
@csjackson0
Contributor

/take

@pascalnotin pascalnotin moved this from Todo to In Progress in project-lm-scaling Sep 21, 2023
@NAEV95

NAEV95 commented Oct 9, 2023

Hi @csjackson0! I would love to help here!

@csjackson0
Contributor

@NAEV95 sounds good! @jamaliki do you need help on this?

@NAEV95

NAEV95 commented Oct 23, 2023

Hi! Just following up once more. Please let me know, as I have some time now but may not have as much later!

@jamaliki
Collaborator

Hi all (@csjackson0 and @NAEV95)

The plan is to change the current dataset structure: we will split each uniref30 cluster into a separate pickle file containing all of the colabfolddb sequences that correspond to that cluster, plus another file recording the number of sequences in each of these pickle files.

Then, we can write a dataset that samples these pickle files based on some function of the cluster size (this should be provided as a callable; there is no need to bake it into the dataset class). A sketch of what this could look like is below.

Does this sound like a plan? It should be very little work to get it running. I will handle the splitting into files.
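For concreteness, a rough sketch of such a dataset (a PyTorch-style IterableDataset; the per-cluster file layout, names, and the weight_fn signature here are assumptions, not the final design):

import pickle
import random
from pathlib import Path

from torch.utils.data import IterableDataset

class ClusterSamplingDataset(IterableDataset):
    # weight_fn maps a cluster's size to an unnormalized sampling weight,
    # e.g. lambda n: 1.0 for uniform or math.log for log-size weighting
    def __init__(self, cluster_dir, cluster_sizes, weight_fn):
        self.cluster_dir = Path(cluster_dir)
        self.cluster_names = list(cluster_sizes)
        self.weights = [weight_fn(cluster_sizes[name]) for name in self.cluster_names]
        self.rng = random.Random()

    def __iter__(self):
        while True:
            # pick a cluster by weight, load its pickle, then pick a member uniformly
            name = self.rng.choices(self.cluster_names, weights=self.weights, k=1)[0]
            with open(self.cluster_dir / f"{name}.pkl", "rb") as f:  # assumed file naming
                sequences = pickle.load(f)
            yield self.rng.choice(sequences)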

@jamaliki
Collaborator

One important part is that the dataset should have a simple cache. So we can have something like

import functools

def __init__(self, max_cache_size):
    # lru_cache's maxsize is fixed at decoration time, so self.max_cache_size
    # can't appear in a decorator; wrap the uncached loader per instance instead
    self.get_cluster = functools.lru_cache(maxsize=max_cache_size)(self._load_cluster)

This should allow us to tune our memory requirements while using the file system responsibly.
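Putting that together, a self-contained sketch (the class name, _load_cluster, and the pickle path are hypothetical):

import functools
import pickle
from pathlib import Path

class ClusterStore:
    def __init__(self, cluster_dir, max_cache_size):
        self.cluster_dir = Path(cluster_dir)
        # at most max_cache_size recently used clusters stay resident in memory
        self.get_cluster = functools.lru_cache(maxsize=max_cache_size)(self._load_cluster)

    def _load_cluster(self, cluster_name):
        with open(self.cluster_dir / f"{cluster_name}.pkl", "rb") as f:
            return pickle.load(f)

get_cluster(name) then only hits disk on a cache miss, so max_cache_size directly bounds memory use.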

@csjackson0
Contributor

@jamaliki Sounds like a plan. I can help with the Dataset class. @NAEV95 would you like to collaborate?

@NAEV95

NAEV95 commented Oct 26, 2023

@csjackson0 Yes let’s do this 💪

@csjackson0 csjackson0 removed their assignment Nov 7, 2023
@jamaliki
Collaborator

Hey @NAEV95, take a look at the PR I made; feel free to make suggestions, etc.
