Scaffold Split Not Implemented Error #33
Hi @kc-chong. ScaffoldSplit is not currently supported for the cross-validation split used during hyperparameter optimisation. You can, however, use ScaffoldSplit as the proper evaluation splitting strategy (used to determine the overall reported test performance of your model), which is part of the dataset section of the configuration. For example, see this example, where you may change the relevant line. I hope this helps; let me know if you run into any problems.
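As a rough sketch of what setting ScaffoldSplit as the dataset-level evaluation split might look like (the module paths, class names, and parameter names below are assumptions based on the QSARtuna documentation and may differ between versions, so verify them against your installed release):

```python
# Illustrative config fragment only -- check names against your QSARtuna version.
from optunaz.datareader import Dataset
from optunaz.utils.preprocessing.splitter import ScaffoldSplit

dataset = Dataset(
    input_column="canonical",           # hypothetical column holding SMILES
    response_column="molwt",            # hypothetical column holding the response
    training_dataset_file="train.csv",  # hypothetical input file
    split_strategy=ScaffoldSplit(),     # scaffold-based evaluation split
)
```

The key point is that `split_strategy` lives in the dataset configuration, not in the hyperparameter-optimisation settings.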
@lewismervin1 Thanks for the response! Appreciate that :)
I realise that in the example there is still cv in the OptimConfig, and looking at the source code it seems that cv is always active during fitting, so I'm a bit confused about the difference between the split in the dataset and the split in cv. Do you mind clarifying? More specifically, when I define ScaffoldSplit in the dataset, I suppose the data is split into train and validation sets by ScaffoldSplit, and the score from the validation set is used to inform the Optuna optimisation. If that's the case, what is the role of cv here?
Hi @kc-chong, there are two different cross-validation approaches that can be customised in QSARtuna.

1. Overall performance split (split_strategy): this is the split in the "Dataset" configuration (which is optional) and can be used to split your data into an external validation set and an internal train set (see the figure from the publication below). If you supply ScaffoldSplit here, your input is split into an external test set of distinct compound scaffolds (which will never be seen by the Optuna optimisation) and an internal train set of the compounds with the remaining scaffolds. The external set is only used to benchmark the performance of the optimised model later, during the "Build" stage (and can be merged into a final production model, with no external performance reported; see the "Model Building" box in the figure).

2. Hyper-parameter optimisation split (cv_split_strategy): a second split, found in the "Settings", is applied to your training set (in our example, the internal train set of distinct scaffolds from above) and can be either Random or Stratified (see the "Score with CV" box in the figure). It splits your internal training set again over multiple rounds of cross-validation during the Optuna optimisation, to benchmark the performance of different algorithm-descriptor pairs. The performance of this split is reported to the user as part of the "Optimisation" stage.

To conclude, the test set from the first split (split_strategy) is never exposed to the data in the second (cv_split_strategy), which avoids leakage between the optimisation and build stages. The performance of both splits is reported to the user at Optimisation or Build time, respectively. Hope this helps. More information can be found in the QSARtuna publication here: https://pubs.acs.org/doi/10.1021/acs.jcim.4c00457
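The two-level scheme described above can be illustrated with a minimal, self-contained sketch (this is not QSARtuna's actual code; scaffolds are mocked as the letter prefix of each compound ID rather than real Murcko scaffolds, and the group-assignment heuristic is a simplification):

```python
# Sketch of a two-level split: an external scaffold-held-out test set is
# carved off first, and only the internal train set is re-split by CV.
from collections import defaultdict

def scaffold_split(mols, scaffold_of, test_fraction=0.2):
    """Group molecules by scaffold, then assign whole scaffold groups to
    the external test set until it reaches roughly test_fraction."""
    groups = defaultdict(list)
    for m in mols:
        groups[scaffold_of(m)].append(m)
    # Put the largest scaffold groups into train first (a common heuristic).
    ordered = sorted(groups.values(), key=len, reverse=True)
    test_target = test_fraction * len(mols)
    train, test = [], []
    for g in ordered:
        (test if len(test) + len(g) <= test_target else train).extend(g)
    return train, test

def cv_folds(train, k=3):
    """Simple k-fold partition of the internal train set only."""
    return [train[i::k] for i in range(k)]

# Mock data: scaffold encoded as the letter prefix of each compound ID.
mols = ["A1", "A2", "A3", "B1", "B2", "C1", "C2", "C3", "C4", "D1"]
train, test = scaffold_split(mols, scaffold_of=lambda m: m[0])
folds = cv_folds(train)

# No scaffold appears in both sets, so the external test set is never
# seen during cross-validated hyperparameter optimisation.
assert {m[0] for m in train}.isdisjoint({m[0] for m in test})
```

Because the CV folds are drawn only from `train`, any score computed during the inner loop is structurally isolated from the external test compounds, which is the leakage guarantee described above.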
Thanks for the clarification. Appreciate that! Will close the issue. |
I tried to use the scaffold split but got this error; it seems scaffold splitting is not as straightforward as the other splitting strategies? Any clue how to use it to perform the evaluation split? Thanks!