Some Training Data Duplicated in Heldout Data #5

bjascob (original issue):
While using the preprocessed data from http://www.statmt.org/lm-benchmark/ I noticed that some of the training data is duplicated in the heldout (a.k.a. test) data. This is in addition to train/news.en-00000-of-00100, which appears to be a complete copy of all the heldout data.
Using a simple Python script to put the sentences into a dict, I see 303,465 unique heldout sentences and 3,223 that are duplicates of sentences in the training directory. Attached is a file, bw_duplicates.txt <https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark/files/2367910/bw_duplicates.txt>, with the duplicates. You can easily verify this by grep'ing for them in the training directory.
Is this a known issue? My concern is that many people use this data for benchmarking language models, and the test data has about 1% of the training data mixed into it. That's probably not going to change the results much, but it isn't desirable either.
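For reference, here is a minimal sketch of the kind of set-based overlap check described in the issue; it is an editorial illustration, not the reporter's actual script. The file paths follow the directory layout named in this thread, and the training glob deliberately skips news.en-00000-of-00100, since that shard mirrors the held-out data.

```python
# Hypothetical sketch: count held-out sentences that also appear verbatim in the
# training shards (excluding news.en-00000-of-00100, which mirrors the held-out data).
import glob

heldout_files = glob.glob(
    "heldout-monolingual.tokenized.shuffled/news.en.heldout-000??-of-00050")
train_files = [
    f for f in glob.glob(
        "training-monolingual.tokenized.shuffled/news.en-000??-of-00100")
    if not f.endswith("news.en-00000-of-00100")
]

heldout = set()
for path in heldout_files:
    with open(path, encoding="utf-8") as f:
        heldout.update(line.rstrip("\n") for line in f)

duplicates = set()
for path in train_files:
    with open(path, encoding="utf-8") as f:
        duplicates.update(s for s in (line.rstrip("\n") for line in f) if s in heldout)

print(f"{len(heldout)} unique held-out sentences, "
      f"{len(duplicates)} of them also appear in the training shards")
```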
Comments
Hi,
The training data is:
1-billion-word-language-modeling-benchmark/training-monolingual.tokenized.shuffled/news.en-000??-of-00100
As you will notice, the fileglob expansion does not include the
news.en-00000-of-00100 file, which is used as held-out data:
1-billion-word-language-modeling-benchmark/heldout-monolingual.tokenized.shuffled/news.en-00000-of-00100
Since that is a bit large, we sharded it 50-way, giving us 50 smaller sets
for evaluation, parameter tuning, etc. The test set on which we reported
results in the paper is:
1-billion-word-language-modeling-benchmark/heldout-monolingual.tokenized.shuffled/news.en.heldout-00000-of-00050
You can find more details on all this in the README files at:
https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark
Hope this answers your questions,
-Ciprian
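To make the 50-way sharding mentioned above concrete, here is a hypothetical sketch of splitting the single held-out file into the news.en.heldout-000??-of-00050 shards. The benchmark's actual sharding is done by scripts/split-input-data.perl; the round-robin scheme below is an illustrative assumption, not that script's exact logic.

```python
# Illustrative only: split one large held-out file into 50 shards, round-robin by line.
NUM_SHARDS = 50
src = "heldout-monolingual.tokenized.shuffled/news.en-00000-of-00100"

shards = [
    open(f"heldout-monolingual.tokenized.shuffled/"
         f"news.en.heldout-{i:05d}-of-{NUM_SHARDS:05d}", "w", encoding="utf-8")
    for i in range(NUM_SHARDS)
]
with open(src, encoding="utf-8") as f:
    for n, line in enumerate(f):
        shards[n % NUM_SHARDS].write(line)  # line n goes to shard n mod 50
for out in shards:
    out.close()
```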
bjascob:
Yes, that was my understanding.
On Mon, Sep 10, 2018 at 3:21 PM bjascob wrote:

> Yes, that was my understanding.

Sorry, I thought your concern was that somehow the entire held-out set is also part of the training set.

> I'm saying that sentences in the attached bw_duplicates.txt file show up in both
> training-monolingual.tokenized.shuffled/news.en-00000-of-00100 and
> training-monolingual.tokenized.shuffled/news.en-000xx-of-00100 (where xx is 1 to 99).
> For instance, the first duplicate sentence in the list, "Bush is remembered by many Haitians -- ",
> shows up in training/news.en-00000-of-00100 and training/news.en-00056-of-00100, exactly the same.
> (Note that shards 59 and 75 also contain nearly the same sentence, but they differ by the "--".)
Well, as I mention at:
https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark/blob/master/scripts/split-input-data.perl#L17
after tokenization I ran:

# $ sort -u --parallel=10 news.20XX.en.shuffled.tokenized
#     --output=news.20XX.en.shuffled.tokenized.sorted
# to get the input data.
#
# A sample command line for running this:
# ./scripts/split-input-data.perl
#     --output_file_base="$PWD/training-monolingual.tokenized.shuffled.perl/news.en"
#     --num_shards=100
#     --input_file=./training-monolingual.tokenized/news.20XX.en.shuffled.tokenized.sorted
So the problem is in the Unix sort then?! Hard to believe. Perhaps what starts out as different UTF-8 character sequences at sort time gets normalized later to the same sequence?! Not sure where the duplicates could come from...

This was originally done in MapReduce internally, but that data could not be released due to legal concerns; I could only release code. As I explained at point 3 in README.corpus_generation
<https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark/blob/master/README.corpus_generation>,
we went through a few iterations of making sure the results I was getting on my machine were the same as the ones that Tony got on his; I guess some bugs survived. :)

How much of the test set 1-billion-word-language-modeling-benchmark/heldout-monolingual.tokenized.shuffled/news.en.heldout-00000-of-00050 overlaps with the training data?

Generally speaking, it was pointed out to me by users of this data that some level of overlap between training and test is to be expected in practice, especially at the short-sentence end. So in that sense de-duping the data is not ideal either... But we started from a far worse situation.
--
-Ciprian
Never mind, here is probably the reason. Reading through:
https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark/blob/master/scripts/get-data.sh#L40
the unique sort of the data was done before running punctuation normalization and tokenization, so that explains the origin of the duplicates.
--
-Ciprian
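To make the explanation above concrete, here is a toy sketch (editorial illustration, not code from the benchmark) of how deduplicating the raw text before normalization/tokenization can leave duplicates afterwards. The example echoes the "Bush is remembered by many Haitians --" duplicate mentioned earlier, and normalize() is a hypothetical stand-in for the actual Perl punctuation normalization and tokenization.

```python
import re

raw_lines = [
    "Bush is remembered by many Haitians--",   # two raw variants that differ only in
    "Bush is remembered by many Haitians --",  # spacing around "--" survive `sort -u`
]

def normalize(s: str) -> str:
    # Stand-in for the punctuation normalization + tokenization step.
    s = re.sub(r"\s*--\s*", " -- ", s)
    return " ".join(s.split())

unique_raw = sorted(set(raw_lines))             # what `sort -u` keeps on the raw text
processed = [normalize(s) for s in unique_raw]  # both normalize to the same sentence

print(len(unique_raw), "unique raw lines ->", len(set(processed)), "unique processed lines")
# 2 unique raw lines -> 1 unique processed lines
```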
P.p.s. Looking at the history on get-data.sh, it was the last commit
<3780330?diff=split>
that moved the sorting of the data before the normalization/tokenization. The commit description explains the decision:

> Sort the date before doing the perl pre-processing.
> This will make sure the training/held-out partitioning of the data is
> the same irrespective of how Perl handles Unicode peculiarities in the
> raw text.

So it seems that (other than typing "date" instead of "data") we could not have done it better. I am now relieved. :-)
Thanks for pointing this out! It would be great to know what percentage of the test set sentences are observed as such in the training data.
-Ciprian
bjascob:
For all 50 heldout shards, I see 303,465 unique sentences and 3,223 that are duplicates of sentences in the training directory, so roughly 1%.
Would it be hard to get the exact number for
1-billion-word-language-modeling-benchmark/heldout-monolingual.tokenized.shuffled/news.en-00000-of-00100
only? I could do it too, but if you have the script handy and are willing to re-run it...
--
-Ciprian
Wrong copy/paste, sorry. I meant:
1-billion-word-language-modeling-benchmark/heldout-monolingual.tokenized.shuffled/news.en.heldout-00000-of-00050
--
-Ciprian
bjascob:
For news.en.heldout-00000-of-00050 there were 6,005 unique sentences and 70 that were duplicates of training data. The duplicate sentences are listed in the following file: bw_dup_shard0.txt
<https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark/files/2368867/bw_dup_shard0.txt>.
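As a quick arithmetic check on the "roughly 1%" figures in this thread (simple division of the counts reported above, nothing more):

```python
# Overlap rates implied by the reported counts.
all_heldout_shards = 3223 / 303465   # all 50 held-out shards
test_shard_0 = 70 / 6005             # news.en.heldout-00000-of-00050 only
print(f"all held-out shards: {all_heldout_shards:.2%}")  # ~1.06%
print(f"test shard 0:        {test_shard_0:.2%}")        # ~1.17%
```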
Thanks!
--
-Ciprian