Revisit data split approach for training and validation #82

Open
Tracked by #91
emmamendelsohn opened this issue Mar 15, 2024 · 4 comments

@emmamendelsohn (Collaborator)

  • Initial split: randomly by date-district combinations. While we are not doing any spatial extraction, we need spatial splits because the outbreak data is so clustered in time that we wouldn't otherwise have enough coverage in our splits.
  • Mask from the training set the three months following the holdout dates for the given district and the surrounding districts (see the sketch after this list). The reasoning is that (a) our data has three-month lags for weather and NDVI, so we want to avoid leakage from the holdout set into the lags of our training set, and (b) spatial masking prevents the model from relying too heavily on surrounding districts to make predictions for the holdout district. Given the logic of preventing leakage from the holdout set, we should also mask the surrounding districts for the following three months, in case the "future" surrounding data has an impact on the "current" predictions.
  • Cross-validation on the training set should basically mirror the approach above: enforce at least three months between test dates, and mask out surrounding districts.
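A minimal sketch of the masking step (Python/pandas, since the actual implementation isn't shown here; the `date` and `district` columns and the `neighbors` adjacency mapping are illustrative placeholders, not our real schema):

```python
import pandas as pd

# `df` and `holdout` have one row per district-date; `neighbors` maps each
# district to its surrounding districts. All names are placeholders.
def mask_around_holdout(df, holdout, neighbors, months=3):
    """Drop rows in the `months` following each holdout date, for the
    holdout district and its surrounding districts."""
    masked = df.copy()
    for _, row in holdout.iterrows():
        affected = {row["district"], *neighbors.get(row["district"], [])}
        start = row["date"]
        end = start + pd.DateOffset(months=months)
        leak = (
            masked["district"].isin(affected)
            & (masked["date"] >= start)
            & (masked["date"] < end)
        )
        masked = masked[~leak]
    return masked
```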

We talked about how the immunity and recent-outbreak layers present a data-leakage challenge from the holdout set into the training set. Because they are longer-term cumulative layers, it isn't possible to mask them out as we do with the three-month lags, so "future" information could hide in the training set. If anything, this could be more of a giveaway than future NDVI or weather data, because it is explicitly about the outcome variable. While we may not be able to solve the problem entirely, we could try masking out more than three months, maybe one year, to at least deal with the leakage from the recent-outbreak layer. We would have to take a look at how much data that leaves us to work with.
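To gauge the cost of a longer window, something like this would do (reusing the hypothetical `mask_around_holdout()` sketched above; `train`, `holdout`, and `neighbors` are placeholders):

```python
# Rough check of how much training data survives longer mask windows,
# e.g. 12 months to blunt leakage through the recent-outbreak layer.
for months in (3, 12):
    remaining = mask_around_holdout(train, holdout, neighbors, months=months)
    print(f"{months}-month mask keeps {len(remaining)} of {len(train)} rows")
```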

cc @noamross


emmamendelsohn added a commit that referenced this issue Mar 29, 2024
@emmamendelsohn (Collaborator, Author)

emmamendelsohn commented Mar 30, 2024

Step 2 above reduces the training data from 17,721 to 399 rows. So, we can't mask dates AND surrounding districts for 3 months.

Now trying (a) masking the district itself for 3 months and (b) masking the surrounding districts only on the holdout date itself (sketch below). This leaves 6,575 rows and 58 of the 192 outbreaks in the training dataset. There will be further data reduction when we apply the masking within each split.
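A sketch of this relaxed variant, building on the hypothetical helper above (same placeholder column names):

```python
import pandas as pd

# Relaxed masking: the holdout district itself is masked for `months`,
# while its neighbors are masked only on the holdout date.
def mask_relaxed(df, holdout, neighbors, months=3):
    masked = df.copy()
    for _, row in holdout.iterrows():
        start = row["date"]
        end = start + pd.DateOffset(months=months)
        self_leak = (
            (masked["district"] == row["district"])
            & (masked["date"] >= start)
            & (masked["date"] < end)
        )
        nbr_leak = (
            masked["district"].isin(neighbors.get(row["district"], []))
            & (masked["date"] == start)
        )
        masked = masked[~(self_leak | nbr_leak)]
    return masked
```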

So far I have been applying the holdout mask to the full training dataset. I think we could probably mask only the analysis splits of the CV, i.e., leave masked data in the assessment splits. This wouldn't improve data availability for training, but it could give us more assessment data points.
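Roughly, that CV change would look like this (a sketch only: `fold_dates` — one set of holdout dates per fold — and the split logic are hypothetical, and it reuses `mask_around_holdout()` from above):

```python
# Apply the mask only to the analysis half of each split; the assessment
# half keeps the masked rows, giving more points for evaluation.
def cv_splits(df, fold_dates, neighbors, months=3):
    for dates in fold_dates:
        assessment = df[df["date"].isin(dates)]
        analysis = df[~df["date"].isin(dates)]
        # mask only the analysis rows; assessment stays untouched
        analysis = mask_around_holdout(analysis, assessment, neighbors, months)
        yield analysis, assessment
```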

@emmamendelsohn (Collaborator, Author)

emmamendelsohn commented Apr 23, 2024

I think the bigger concern will be the immunity and recent outbreak layers, as these can cause data leakage of the outcome variable, whereas lagged weather/NDVI data is much less informative.

I tried masking only the surrounding districts on the given holdout day, and that still reduces the dataset by 70%.

That said, the surrounding districts on the given holdout day are likely a true leakage issue.

@emmamendelsohn (Collaborator, Author)

Our current approach is to split the data so that training includes pre-2018 outbreaks and validation covers the 2018 outbreak onward (see the sketch below). This presents a challenge because 2018 was a single outbreak in a single district, so we may need to rearrange the splits to include more outbreaks in the validation set.
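For reference, a minimal sketch of that temporal split (pandas; `df` and its `date` column are stand-ins for the actual modeling dataset):

```python
import pandas as pd

cutoff = pd.Timestamp("2018-01-01")
train = df[df["date"] < cutoff]        # pre-2018 outbreaks
validation = df[df["date"] >= cutoff]  # 2018 outbreak onward
```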

@emmamendelsohn (Collaborator, Author)

I believe this leaves just two points in the validation dataset, so we may need to revisit this approach. I would say focus on building out the model before revisiting this.

@emmamendelsohn emmamendelsohn changed the title Training split approach Revisit data split approach for training and validation Jun 28, 2024