
Add cluster sampling strategies #40

Open
pascalnotin opened this issue Sep 7, 2023 · 9 comments

Comments

@pascalnotin
Collaborator

  1. Create code that maps each sequence in the input dataset to a cluster representative, where clusters are formed at a maximum of 30% sequence similarity (e.g., mapping Uniref100 sequences to their corresponding Uniref30 cluster).
  2. Add two sampling strategies to form input training batches (using Uniref30/100 as the example; see the sketch below):
  • sample Uniref30 clusters uniformly at random and then, within each cluster, sample a Uniref100 sequence
  • sample Uniref30 clusters with probability weighted by log(uniref30_cluster_size) and then, within each cluster, sample a Uniref100 sequence
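A minimal sketch of the two strategies (the cluster_to_members mapping and the function name are placeholders for illustration, not anything in the codebase):

import math
import random

def sample_sequence(cluster_to_members, weighting="uniform", rng=random):
    # cluster_to_members: dict of Uniref30 cluster id -> list of Uniref100 member ids
    clusters = list(cluster_to_members)
    if weighting == "uniform":
        # strategy 1: every cluster is equally likely
        cluster = rng.choice(clusters)
    else:
        # strategy 2: cluster probability proportional to log(cluster size);
        # the +1 keeps singleton clusters (log 1 = 0) samplable
        weights = [math.log(len(cluster_to_members[c]) + 1) for c in clusters]
        cluster = rng.choices(clusters, weights=weights, k=1)[0]
    # within the chosen cluster, pick a member uniformly
    return rng.choice(cluster_to_members[cluster])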
@csjackson0
Contributor

/take

@pascalnotin pascalnotin moved this from Todo to In Progress in project-lm-scaling Sep 21, 2023
@NAEV95

NAEV95 commented Oct 9, 2023

Hi @csjackson0! I would love to help here!

@csjackson0
Contributor

@NAEV95 sounds good! @jamaliki do you need help on this?

@NAEV95

NAEV95 commented Oct 23, 2023

Hi! Just following up once more. Please let me know, as I have some time now but may not have as much later!

@jamaliki
Collaborator

Hi all (@csjackson0 and @NAEV95)

The plan is to change the current dataset structure: we will split each uniref30 cluster into a separate pickle file containing all of the colabfolddb sequences that correspond to that cluster, plus another file recording the number of sequences in each of these pickle files.

Then, we can write a dataset that samples these pickle files based on some function of the cluster size (this should be provided as a callable; there is no need to bake it into the dataset class). A sketch of what this could look like is below.

Does this sound like a plan? It should be very little work to get it running. I will handle the splitting into files.
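For concreteness, a rough sketch of such a dataset (a PyTorch-style IterableDataset; the per-cluster file layout, names, and the weight_fn signature here are assumptions, not the final design):

import pickle
import random
from pathlib import Path

from torch.utils.data import IterableDataset

class ClusterSamplingDataset(IterableDataset):
    # weight_fn maps a cluster's size to an unnormalized sampling weight,
    # e.g. lambda n: 1.0 for uniform or math.log for log-size weighting
    def __init__(self, cluster_dir, cluster_sizes, weight_fn):
        self.cluster_dir = Path(cluster_dir)
        self.cluster_names = list(cluster_sizes)
        self.weights = [weight_fn(cluster_sizes[name]) for name in self.cluster_names]
        self.rng = random.Random()

    def __iter__(self):
        while True:
            # pick a cluster by weight, load its pickle, then pick a member uniformly
            name = self.rng.choices(self.cluster_names, weights=self.weights, k=1)[0]
            with open(self.cluster_dir / f"{name}.pkl", "rb") as f:  # assumed file naming
                sequences = pickle.load(f)
            yield self.rng.choice(sequences)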

@jamaliki
Collaborator

One important part is that the dataset should have a simple cache. So we can have something like

import functools

def __init__(self, max_cache_size):
    # lru_cache's maxsize is fixed at decoration time, so self.max_cache_size
    # can't appear in a decorator; wrap the uncached loader per instance instead
    self.get_cluster = functools.lru_cache(maxsize=max_cache_size)(self._load_cluster)

This should allow us to tune our memory requirements while using the file system responsibly.
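Putting that together, a self-contained sketch (the class name, _load_cluster, and the pickle path are hypothetical):

import functools
import pickle
from pathlib import Path

class ClusterStore:
    def __init__(self, cluster_dir, max_cache_size):
        self.cluster_dir = Path(cluster_dir)
        # at most max_cache_size recently used clusters stay resident in memory
        self.get_cluster = functools.lru_cache(maxsize=max_cache_size)(self._load_cluster)

    def _load_cluster(self, cluster_name):
        with open(self.cluster_dir / f"{cluster_name}.pkl", "rb") as f:
            return pickle.load(f)

get_cluster(name) then only hits disk on a cache miss, so max_cache_size directly bounds memory use.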

@csjackson0
Contributor

@jamaliki Sounds like a plan. I can help with the Dataset class. @NAEV95 would you like to collaborate?

@NAEV95

NAEV95 commented Oct 26, 2023

@csjackson0 Yes let’s do this 💪

@csjackson0 csjackson0 removed their assignment Nov 7, 2023
@jamaliki
Collaborator

Hey @NAEV95, take a look at the PR I made; feel free to make suggestions, etc.
