
Scaffold Split Not Implemented Error #33

Closed

kc-chong opened this issue Sep 19, 2024 · 5 comments

Comments

@kc-chong

config = OptimizationConfig(
    data=Dataset(
        input_column=input_col,  # Typical names are "SMILES" and "smiles".
        response_column=value_col,  # Often a specific name (like here), or just "activity".
        training_dataset_file=train_data,
        test_dataset_file=test_data,  # Hidden during optimization.
    ),
    descriptors=descriptors,
    algorithms=algorithms,
    settings=OptimizationConfig.Settings(
        mode=ModelMode.REGRESSION,
        cross_validation=cv,
        cv_split_strategy=ScaffoldSplit(),
        n_trials=100,  # Total number of trials.
        n_startup_trials=50,  # Number of startup ("random") trials.
        random_seed=42,  # Seed for reproducibility.
        direction=OptimizationDirection.MAXIMIZATION,
    ),
)

File "~/miniforge3/envs/qsartuna/lib/python3.10/site-packages/optunaz/utils/preprocessing/splitter.py", line 450, in get_sklearn_splitter
    raise NotImplementedError()
NotImplementedError

I tried to use the scaffold split but got this error. It seems the scaffold splitting is not as straightforward as the other splitting strategies? Any clue how to use it to perform an evaluation split? Thanks!

@lewismervin1
Collaborator

lewismervin1 commented Oct 2, 2024

Hi @kc-chong.

The ScaffoldSplit is not currently supported for the cross-validation split used during hyperparameter optimisation.

You can, however, use ScaffoldSplit as a proper evaluation splitting strategy (used to determine the overall reported test performance of your model), which is part of the dataset in the configuration.

For example, see this example, where you may change the line split_strategy=Stratified(fraction=0.2) to e.g. split_strategy=ScaffoldSplit().

I hope this helps; let me know if you run into any problems.
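A minimal sketch of that change, mirroring the config posted above (the import path for ScaffoldSplit is an assumption based on the traceback in this issue; check your installed QSARtuna version):

```python
# Sketch: use ScaffoldSplit as the evaluation (dataset) split, not the CV split.
# Import path assumed from the traceback above; verify against your QSARtuna version.
from optunaz.utils.preprocessing.splitter import ScaffoldSplit

config = OptimizationConfig(
    data=Dataset(
        input_column=input_col,
        response_column=value_col,
        training_dataset_file=train_data,
        test_dataset_file=test_data,
        # Evaluation split by scaffold, replacing e.g.
        # split_strategy=Stratified(fraction=0.2).
        split_strategy=ScaffoldSplit(),
    ),
    descriptors=descriptors,
    algorithms=algorithms,
    settings=OptimizationConfig.Settings(
        mode=ModelMode.REGRESSION,
        cross_validation=cv,
        # Do NOT set cv_split_strategy=ScaffoldSplit() here; that path raises
        # the NotImplementedError shown in the traceback above.
        n_trials=100,
        n_startup_trials=50,
        random_seed=42,
        direction=OptimizationDirection.MAXIMIZATION,
    ),
)
```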

@kc-chong
Author

kc-chong commented Oct 4, 2024

@lewismervin1 Thanks for the response! Appreciate that :)

@kc-chong kc-chong closed this as completed Oct 7, 2024
@kc-chong
Author

kc-chong commented Oct 9, 2024

@lewismervin1 ,

I realise that in the example there's still cv in the OptimizationConfig, and when I look at the source code it seems cv is always active during fitting. So I'm a bit confused about the difference between the split in the dataset and the split in cv. Do you mind clarifying? More specifically, when I define ScaffoldSplit in the dataset, I suppose the data is split into train and validation sets by ScaffoldSplit, and the score from the validation set is used to inform the Optuna optimization? If that's the case, what's the role of cv here?

@kc-chong kc-chong reopened this Oct 9, 2024
@lewismervin1
Collaborator

lewismervin1 commented Oct 25, 2024

Hi @kc-chong, there are two different cross-validation approaches that can be customised in QSARtuna.

Overall performance split (split_strategy): This split is the one in the "Dataset" configuration (it is optional) and can be used to split your data into an external validation set and an internal train set (see the figure from the publication below). If you supply ScaffoldSplit here, your input is split into an external test set of distinct compound scaffolds (which will never be seen by the Optuna optimisation) and an internal train set of the compounds with the remaining scaffolds. The external set is only used to benchmark the performance of the optimised model later, during the "Build" stage (the data can also be merged into a final production model, see the "Model Building" box in the figure, in which case no external performance is reported).

Hyper-parameter optimisation split (cv_split_strategy): A second, hyper-parameter split, found in the "Settings", is applied to your training set (in our example, the internal train set of distinct scaffolds from above) and can be either Random or Stratified (see the "Score with CV" box in the figure). This splits your internal training set again, using multiple rounds of cross validation, for the Optuna optimisation (to benchmark the performance of different algorithm-descriptor pairs). The performance on this split is reported to the user as part of the "Optimisation" stage.

So to conclude, the test set from the first split ("split_strategy") is never exposed to the data in the "cv_split_strategy", to avoid leakage between the optimisation and build stages. The performance on both splits is reported to the user at Optimisation or Build time, respectively.
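The two-level scheme above can be sketched in plain Python (a hypothetical toy example, not the QSARtuna API): first hold out whole scaffolds as the external test set, then run random K-fold CV only on the internal train set, so no CV fold ever touches the external set.

```python
import random

# Toy data: (molecule_id, scaffold_id) pairs. Hypothetical illustration only.
molecules = [(f"mol{i}", f"scaffold{i % 5}") for i in range(20)]

# 1) Overall performance split (analogous to split_strategy=ScaffoldSplit()):
#    hold out entire scaffolds, so the external test set shares no scaffold
#    with the internal train set.
held_out_scaffolds = {"scaffold0"}
external_test = [m for m in molecules if m[1] in held_out_scaffolds]
internal_train = [m for m in molecules if m[1] not in held_out_scaffolds]

# 2) Hyper-parameter optimisation split (analogous to cv_split_strategy=Random()):
#    K-fold CV applied ONLY to the internal train set.
def kfold(data, k, seed=42):
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

folds = kfold(internal_train, k=4)

# The external test set never appears in any CV fold: no leakage
# between the optimisation and build stages.
for fold in folds:
    assert not set(fold) & set(external_test)
```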

Hope this helps; more information can be found in the QSARtuna publication here: https://pubs.acs.org/doi/10.1021/acs.jcim.4c00457

[Figure: data-splitting workflow from the QSARtuna publication, showing the split_strategy (external test set), "Score with CV" (cv_split_strategy) and "Model Building" stages]

@kc-chong
Author

Thanks for the clarification. Appreciate that! Will close the issue.
