Scaffold Split Not Implemented Error #33
Hi @kc-chong. ScaffoldSplit is not currently supported for the cross-validation split used during hyperparameter optimisation. You can, however, use ScaffoldSplit as the proper evaluation splitting strategy (used to determine the overall reported test performance of your model), which is part of the dataset section of the configuration. For example, see this example, where you may change the relevant line. I hope this helps; let me know if you run into any problems.
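As a rough sketch of what setting ScaffoldSplit as the dataset-level evaluation split might look like (the module paths, class names, and parameter names below are assumptions based on the QSARtuna documentation and may differ between versions, so verify them against your installed release):

```python
# Illustrative config fragment only -- check names against your QSARtuna version.
from optunaz.datareader import Dataset
from optunaz.utils.preprocessing.splitter import ScaffoldSplit

dataset = Dataset(
    input_column="canonical",           # hypothetical column holding SMILES
    response_column="molwt",            # hypothetical column holding the response
    training_dataset_file="train.csv",  # hypothetical input file
    split_strategy=ScaffoldSplit(),     # scaffold-based evaluation split
)
```

The key point is that `split_strategy` lives in the dataset configuration, not in the hyperparameter-optimisation settings.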
@lewismervin1 Thanks for the response! Appreciate that :)
I realise that in the example there is still cv in the OptimConfig, and looking at the source code it seems that cv is always active during fitting, so I'm a bit confused about the difference between the split in the dataset and the split in cv. Do you mind clarifying? More specifically, when I define ScaffoldSplit in the dataset, I suppose the data is split into train and validation sets by ScaffoldSplit, and the score from the validation set is used to inform the Optuna optimisation. If that's the case, what is the role of cv here?
Hi @kc-chong, there are two different cross-validation approaches that can be customised in QSARtuna.

1. Overall performance split (split_strategy): this is the split in the "Dataset" configuration (which is optional) and can be used to split your data into an external validation set and an internal train set (see the figure from the publication below). If you supply ScaffoldSplit here, your input is split into an external test set of distinct compound scaffolds (which will never be seen by the Optuna optimisation) and an internal train set of the compounds with the remaining scaffolds. The external set is only used to benchmark the performance of the optimised model later, during the "Build" stage (and can be merged into a final production model, with no external performance reported; see the "Model Building" box in the figure).

2. Hyper-parameter optimisation split (cv_split_strategy): a second split, found in the "Settings", is applied to your training set (in our example, the internal train set of distinct scaffolds from above) and can be either Random or Stratified (see the "Score with CV" box in the figure). It splits your internal training set again over multiple rounds of cross-validation during the Optuna optimisation, to benchmark the performance of different algorithm-descriptor pairs. The performance of this split is reported to the user as part of the "Optimisation" stage.

To conclude, the test set from the first split (split_strategy) is never exposed to the data in the second (cv_split_strategy), which avoids leakage between the optimisation and build stages. The performance of both splits is reported to the user at Optimisation or Build time, respectively. Hope this helps. More information can be found in the QSARtuna publication here: https://pubs.acs.org/doi/10.1021/acs.jcim.4c00457
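The two-level scheme described above can be illustrated with a minimal, self-contained sketch (this is not QSARtuna's actual code; scaffolds are mocked as the letter prefix of each compound ID rather than real Murcko scaffolds, and the group-assignment heuristic is a simplification):

```python
# Sketch of a two-level split: an external scaffold-held-out test set is
# carved off first, and only the internal train set is re-split by CV.
from collections import defaultdict

def scaffold_split(mols, scaffold_of, test_fraction=0.2):
    """Group molecules by scaffold, then assign whole scaffold groups to
    the external test set until it reaches roughly test_fraction."""
    groups = defaultdict(list)
    for m in mols:
        groups[scaffold_of(m)].append(m)
    # Put the largest scaffold groups into train first (a common heuristic).
    ordered = sorted(groups.values(), key=len, reverse=True)
    test_target = test_fraction * len(mols)
    train, test = [], []
    for g in ordered:
        (test if len(test) + len(g) <= test_target else train).extend(g)
    return train, test

def cv_folds(train, k=3):
    """Simple k-fold partition of the internal train set only."""
    return [train[i::k] for i in range(k)]

# Mock data: scaffold encoded as the letter prefix of each compound ID.
mols = ["A1", "A2", "A3", "B1", "B2", "C1", "C2", "C3", "C4", "D1"]
train, test = scaffold_split(mols, scaffold_of=lambda m: m[0])
folds = cv_folds(train)

# No scaffold appears in both sets, so the external test set is never
# seen during cross-validated hyperparameter optimisation.
assert {m[0] for m in train}.isdisjoint({m[0] for m in test})
```

Because the CV folds are drawn only from `train`, any score computed during the inner loop is structurally isolated from the external test compounds, which is the leakage guarantee described above.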
Thanks for the clarification. Appreciate that! Will close the issue. |
I tried to use the scaffold split but got this error; it seems scaffold splitting is not as straightforward as the other splitting strategies? Any clue how to use it to perform the evaluation split? Thanks!