Prepare a RAMP event
If you have built a starting kit and want to launch a (public or private) data challenge, follow the additional steps below. In practice, these steps usually precede writing the starting kit, since this is where we partition the data into public training data and private test data.
The repository that will hold your data set is typically created in the ramp-data organization. The data must stay private to allow proper cross-validation and model testing, so either keep this repository private or ensure the privacy of the data by other means.
In the case of iris, prepare_data.py
first reads the full data from /data/iris.csv
and splits it into /data/train.csv
and /data/test.csv:
import os
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv(os.path.join('data', 'iris.csv'))
df_train, df_test = train_test_split(df, test_size=0.2, random_state=57)
df_train.to_csv(os.path.join('data', 'train.csv'), index=False)
df_test.to_csv(os.path.join('data', 'test.csv'), index=False)
/data/test.csv is the private test data, used to compute the scores on the private leaderboard, which is visible only to RAMP administrators. /data/train.csv is the public training data, on which we run cross-validation to compute the scores on the public leaderboard. You do not need to follow this exact naming convention; what matters is that your convention matches what the problem.py file of the corresponding starting kit does, because when we pull your data repository on the backend, we test it with the same testing.py script that submitters use to test their submissions.
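To illustrate how the file names must line up, here is a minimal sketch of the data-reading part of a problem.py, assuming the iris naming above (data/train.csv and data/test.csv). The function names get_train_data and get_test_data, the helper _read_data, and the target column name 'species' are assumptions made for illustration; check them against your own starting kit.

```python
import os

import pandas as pd

# Assumed name of the label column; adapt it to your data set.
_target_column_name = 'species'


def _read_data(path, f_name):
    """Read <path>/data/<f_name> and split it into features and labels."""
    data = pd.read_csv(os.path.join(path, 'data', f_name))
    y_array = data[_target_column_name].values
    X_array = data.drop(columns=[_target_column_name]).values
    return X_array, y_array


def get_train_data(path='.'):
    # The file name must match what prepare_data.py writes.
    return _read_data(path, 'train.csv')


def get_test_data(path='.'):
    # The file name must match what prepare_data.py writes.
    return _read_data(path, 'test.csv')
```

If prepare_data.py used different file names, only the two string literals here would need to change.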
In the case of titanic, the train and test files were already prepared, so prepare_data.py simply reads them:
df_train = pd.read_csv(os.path.join('data', 'train.csv'))
df_test = pd.read_csv(os.path.join('data', 'test.csv')) # noqa
After preparing the backend data sets, we usually also prepare the public starting kit data sets that we will upload to the starting kit repo. It is good practice to make the public data independent of both the training and test data on the backend, but it is also fine if the public data is the same as the backend training data (e.g., when there is little data to spare), since "cheaters" can be caught by inspecting their code and by spotting submissions that overfit the public leaderboard. It is crucial, on the other hand, not to leak the private test data.
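One way to guard against such a leak is to count the rows that the private test set shares with each public file before uploading anything. The sketch below is only an illustration: count_shared_rows is a hypothetical helper, not part of RAMP, and it assumes both frames have the same columns.

```python
import pandas as pd


def count_shared_rows(df_a, df_b):
    """Return the number of distinct rows that appear in both frames."""
    shared = df_a.drop_duplicates().merge(
        df_b.drop_duplicates(), how='inner', on=list(df_a.columns))
    return len(shared)


if __name__ == '__main__':
    # Hypothetical toy frames standing in for the real CSV files.
    private_test = pd.DataFrame({'x': [1, 2], 'y': ['a', 'b']})
    public_train = pd.DataFrame({'x': [3, 4], 'y': ['c', 'd']})
    assert count_shared_rows(private_test, public_train) == 0
```

A non-zero count is not always a leak: if the raw data contains exact duplicate observations, they can legitimately end up on both sides of a split, so inspect the shared rows before concluding anything.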
It is assumed that ramp-kits and ramp-data are installed in the same directory, but prepare_data.py also needs to accept a ramp_kits_dir argument that specifies where to copy the public train and test files. In the case of iris, we do:
df_public = df_train
df_public_train, df_public_test = train_test_split(
    df_public, test_size=0.2, random_state=57)
df_public_train.to_csv(os.path.join('data', 'public_train.csv'), index=False)
df_public_test.to_csv(os.path.join('data', 'public_test.csv'), index=False)
# copy starting kit files to <ramp_kits_dir>/<ramp_name>/data
copyfile(
    os.path.join('data', 'public_train.csv'),
    os.path.join(ramp_kits_dir, ramp_name, 'data', 'train.csv')
)
copyfile(
    os.path.join('data', 'public_test.csv'),
    os.path.join(ramp_kits_dir, ramp_name, 'data', 'test.csv')
)
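The ramp_kits_dir argument mentioned above could be wired up along the following lines. This is a sketch under assumptions, not the actual iris script: the --ramp_kits_dir flag spelling, its default value (which presumes the script runs from inside ramp-data/<ramp_name>), and the helper names parse_args and copy_public_files are all hypothetical.

```python
import argparse
import os
from shutil import copyfile


def parse_args(argv=None):
    parser = argparse.ArgumentParser(
        description='Prepare backend and starting kit data.')
    parser.add_argument(
        '--ramp_kits_dir',
        # Assumed default: ramp-kits sits next to ramp-data.
        default=os.path.join('..', '..', 'ramp-kits'),
        help='directory that contains the starting kit repositories')
    return parser.parse_args(argv)


def copy_public_files(ramp_kits_dir, ramp_name, data_dir='data'):
    """Copy the public CSV files into <ramp_kits_dir>/<ramp_name>/data."""
    kit_data_dir = os.path.join(ramp_kits_dir, ramp_name, 'data')
    # Create the destination if the kit has no data directory yet.
    os.makedirs(kit_data_dir, exist_ok=True)
    for public_name, kit_name in [('public_train.csv', 'train.csv'),
                                  ('public_test.csv', 'test.csv')]:
        copyfile(os.path.join(data_dir, public_name),
                 os.path.join(kit_data_dir, kit_name))
```

At the end of prepare_data.py, one would then call copy_public_files(parse_args().ramp_kits_dir, 'iris') once the public files have been written.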
The notebook named <ramp_kit_name>_starting_kit.ipynb (for example titanic_starting_kit.ipynb) should describe the predictive problem, the data set, and the workflow, and it usually presents some exploratory analysis and data visualization. This notebook will be rendered on the RAMP site.
On the backend, we will pull the data repo into ramp-data and the kit repo into ramp-kits, and test both with testing.py.
In the case of titanic,
$ mkdir ramp-data ramp-kits
$ git clone https://github.com/ramp-data/titanic.git ramp-data/titanic
$ git clone https://github.com/ramp-kits/titanic.git ramp-kits/titanic
$ python ramp-data/titanic/prepare_data.py
$ ramp_test_submission --ramp_data_dir=ramp-data/titanic --ramp_kit_dir=ramp-kits/titanic
$ ramp_test_submission --ramp_data_dir=ramp-kits/titanic --ramp_kit_dir=ramp-kits/titanic
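When several events have to be checked, the two ramp_test_submission calls above can be scripted. The following is a small sketch that only reuses the flags shown in the commands; build_test_commands and run_tests are hypothetical helper names, and the directory layout is assumed to be the one created by the mkdir and git clone steps.

```python
import subprocess


def build_test_commands(ramp_name, data_root='ramp-data',
                        kits_root='ramp-kits'):
    """Return the two ramp_test_submission invocations as argument lists."""
    data_dir = '{}/{}'.format(data_root, ramp_name)
    kit_dir = '{}/{}'.format(kits_root, ramp_name)
    return [
        # Backend (private) data with the kit's workflow.
        ['ramp_test_submission',
         '--ramp_data_dir={}'.format(data_dir),
         '--ramp_kit_dir={}'.format(kit_dir)],
        # Public starting kit data with the kit's workflow.
        ['ramp_test_submission',
         '--ramp_data_dir={}'.format(kit_dir),
         '--ramp_kit_dir={}'.format(kit_dir)],
    ]


def run_tests(ramp_name):
    for cmd in build_test_commands(ramp_name):
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(cmd, check=True)
```

Calling run_tests('titanic') from the directory that contains ramp-data and ramp-kits would run both checks and stop at the first failure.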
Copyright (c) 2014 - 2018 Paris-Saclay Center for Data Science (http://www.datascience-paris-saclay.fr/)