
Some Training Data Duplicated in Heldout Data #5

Open
bjascob opened this issue Sep 10, 2018 · 10 comments

bjascob commented Sep 10, 2018

While using the preprocessed data from http://www.statmt.org/lm-benchmark/, I noticed that some of the training data is duplicated in the heldout (a.k.a. test) data. This is in addition to train/news.en-00000-of-00100, which appears to be a complete copy of all the heldout data.

Using a simple Python script that puts the sentences into a dict, I see 303,465 unique heldout sentences, of which 3,223 are duplicates of sentences in the training directory. Attached is a file, bw_duplicates.txt, with the duplicates. You can easily verify this by grep'ing for them in the training directory.
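
Roughly, the check looks like this (a minimal sketch, not the exact script I ran; it uses a set instead of a dict, the paths assume the benchmark's default directory layout, and holding all training sentences in memory needs several GB of RAM):

```python
import glob

# Collect every training sentence for fast membership tests.
# Shard news.en-00000-of-00100 is skipped, since it appears to be
# a copy of the heldout data and is counted separately.
train_sents = set()
for fn in glob.glob("training-monolingual.tokenized.shuffled/news.en-*-of-00100"):
    if fn.endswith("news.en-00000-of-00100"):
        continue
    with open(fn, encoding="utf-8") as f:
        train_sents.update(line.strip() for line in f)

# Count unique heldout sentences and those that also appear in training.
heldout_sents = set()
duplicates = []
for fn in glob.glob("heldout-monolingual.tokenized.shuffled/news.en.heldout-*-of-00050"):
    with open(fn, encoding="utf-8") as f:
        for line in f:
            sent = line.strip()
            if sent not in heldout_sents:
                heldout_sents.add(sent)
                if sent in train_sents:
                    duplicates.append(sent)

print(len(heldout_sents), "unique heldout sentences")
print(len(duplicates), "of them also appear in the training shards")

with open("bw_duplicates.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(duplicates) + "\n")
```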

Is this a known issue? My concern is that many people use this data for benchmarking language models, and roughly 1% of the test data also appears in the training data. That probably won't change the results much, but it isn't desirable either.

ciprian-chelba (Owner) commented Sep 10, 2018 via email

bjascob (Author) commented Sep 10, 2018

Yes, that was my understanding.
I'm saying that the sentences in the attached bw_duplicates.txt file show up in both
training-monolingual.tokenized.shuffled/news.en-00000-of-00100 and
training-monolingual.tokenized.shuffled/news.en-000xx-of-00100 (where xx is 1 to 99).
For instance, the first duplicate sentence in the list, "Bush is remembered by many Haitians -- ", shows up in training/news.en-00000-of-00100 and training/news.en-00056-of-00100 with exactly the same text.
(Note that shards 59 and 75 also contain a nearly identical sentence, differing only by the "--".)
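
A quick way to reproduce this for a single sentence (a sketch; the quote above is only the start of the line, so the check matches on a prefix, and the "--" is left off so the near-identical variants in shards 59 and 75 show up too):

```python
import glob

# List every training shard containing a line that starts with the quoted
# sentence. Dropping the "--" from the prefix also reports the near matches.
target_prefix = "Bush is remembered by many Haitians"

for fn in sorted(glob.glob("training-monolingual.tokenized.shuffled/news.en-*-of-00100")):
    with open(fn, encoding="utf-8") as f:
        for line in f:
            if line.startswith(target_prefix):
                print(fn, line.rstrip())
```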

ciprian-chelba (Owner) commented Sep 10, 2018 via email

ciprian-chelba (Owner) commented Sep 10, 2018 via email

ciprian-chelba (Owner) commented Sep 10, 2018 via email

bjascob (Author) commented Sep 10, 2018

For all 50 heldout shards, I see 303,465 unique sentences, of which 3,223 are duplicates of sentences in the training directory, so roughly 1%.

ciprian-chelba (Owner) commented Sep 10, 2018 via email

ciprian-chelba (Owner) commented Sep 10, 2018 via email

bjascob (Author) commented Sep 11, 2018

For news.en.heldout-00000-of-00050 there were 6,005 unique sentences, 70 of which were duplicates of training data. The duplicate sentences are listed in the following file: bw_dup_shard0.txt.
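
The per-shard numbers come from the same kind of check restricted to one heldout file (again a sketch, with the same assumed paths and the same exclusion of training shard 00000):

```python
import glob

# Build the set of training sentences, skipping the shard that looks like
# a copy of the heldout data.
train_sents = set()
for fn in glob.glob("training-monolingual.tokenized.shuffled/news.en-*-of-00100"):
    if fn.endswith("news.en-00000-of-00100"):
        continue
    with open(fn, encoding="utf-8") as f:
        train_sents.update(line.strip() for line in f)

# Check a single heldout shard and write out its duplicates.
shard = "heldout-monolingual.tokenized.shuffled/news.en.heldout-00000-of-00050"
with open(shard, encoding="utf-8") as f:
    shard_sents = {line.strip() for line in f}

dups = sorted(s for s in shard_sents if s in train_sents)
print(len(shard_sents), "unique sentences;", len(dups), "duplicated in training")

with open("bw_dup_shard0.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(dups) + "\n")
```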

ciprian-chelba (Owner) commented Sep 11, 2018 via email
