Some Training Data Duplicated in Heldout Data #5

bjascob (original issue):
While using the preprocessed data from http://www.statmt.org/lm-benchmark/ I noticed that some of the training data is duplicated in the heldout (a.k.a. test) data. This is in addition to train/news.en-00000-of-00100, which appears to be a complete copy of all the heldout data.
Using a simple Python script to put the sentences into a dict, I see 303,465 unique heldout sentences and 3,223 that are duplicates of sentences in the training directory. Attached is a file, bw_duplicates.txt <https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark/files/2367910/bw_duplicates.txt>, with the duplicates. You can easily verify this by grep'ing for them in the training directory.
Is this a known issue? My concern is that many people use this data for benchmarking language models, and the test data has about 1% of the training data mixed into it. That's probably not going to change the results much, but it isn't desirable either.
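For reference, here is a minimal sketch of the kind of set-based overlap check described in the issue; it is an editorial illustration, not the reporter's actual script. The file paths follow the directory layout named in this thread, and the training glob deliberately skips news.en-00000-of-00100, since that shard mirrors the held-out data.

```python
# Hypothetical sketch: count held-out sentences that also appear verbatim in the
# training shards (excluding news.en-00000-of-00100, which mirrors the held-out data).
import glob

heldout_files = glob.glob(
    "heldout-monolingual.tokenized.shuffled/news.en.heldout-000??-of-00050")
train_files = [
    f for f in glob.glob(
        "training-monolingual.tokenized.shuffled/news.en-000??-of-00100")
    if not f.endswith("news.en-00000-of-00100")
]

heldout = set()
for path in heldout_files:
    with open(path, encoding="utf-8") as f:
        heldout.update(line.rstrip("\n") for line in f)

duplicates = set()
for path in train_files:
    with open(path, encoding="utf-8") as f:
        duplicates.update(s for s in (line.rstrip("\n") for line in f) if s in heldout)

print(f"{len(heldout)} unique held-out sentences, "
      f"{len(duplicates)} of them also appear in the training shards")
```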
Comments
Hi,
The training data is:
1-billion-word-language-modeling-benchmark/training-monolingual.tokenized.shuffled/news.en-000??-of-00100
As you will notice, the fileglob expansion does not include the
news.en-00000-of-00100 file, which is used as held-out data:
1-billion-word-language-modeling-benchmark/heldout-monolingual.tokenized.shuffled/news.en-00000-of-00100
Since that is a bit large, we sharded it 50-way, giving us 50 smaller sets
for evaluation, parameter tuning, etc. The test set on which we reported
results in the paper is:
1-billion-word-language-modeling-benchmark/heldout-monolingual.tokenized.shuffled/news.en.heldout-00000-of-00050
You can find more details on all this in the README files at:
https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark
Hope this answers your questions,
-Ciprian
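To make the 50-way sharding mentioned above concrete, here is a hypothetical sketch of splitting the single held-out file into the news.en.heldout-000??-of-00050 shards. The benchmark's actual sharding is done by scripts/split-input-data.perl; the round-robin scheme below is an illustrative assumption, not that script's exact logic.

```python
# Illustrative only: split one large held-out file into 50 shards, round-robin by line.
NUM_SHARDS = 50
src = "heldout-monolingual.tokenized.shuffled/news.en-00000-of-00100"

shards = [
    open(f"heldout-monolingual.tokenized.shuffled/"
         f"news.en.heldout-{i:05d}-of-{NUM_SHARDS:05d}", "w", encoding="utf-8")
    for i in range(NUM_SHARDS)
]
with open(src, encoding="utf-8") as f:
    for n, line in enumerate(f):
        shards[n % NUM_SHARDS].write(line)  # line n goes to shard n mod 50
for out in shards:
    out.close()
```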
bjascob:
Yes, that was my understanding.
On Mon, Sep 10, 2018 at 3:21 PM bjascob wrote:

> Yes, that was my understanding.

Sorry, I thought your concern was that somehow the entire held-out set is also part of the training set.

> I'm saying that sentences in the attached bw_duplicates.txt file show up in both
> training-monolingual.tokenized.shuffled/news.en-00000-of-00100 and
> training-monolingual.tokenized.shuffled/news.en-000xx-of-00100 (where xx is 1 to 99).
> For instance, the first duplicate sentence in the list, "Bush is remembered by many Haitians -- ",
> shows up in training/news.en-00000-of-00100 and training/news.en-00056-of-00100, exactly the same.
> (Note that shards 59 and 75 also contain nearly the same sentence, but they differ by the "--".)
Well, as I mention at:
https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark/blob/master/scripts/split-input-data.perl#L17
after tokenization I ran:

# $ sort -u --parallel=10 news.20XX.en.shuffled.tokenized
#     --output=news.20XX.en.shuffled.tokenized.sorted
# to get the input data.
#
# A sample command line for running this:
# ./scripts/split-input-data.perl
#     --output_file_base="$PWD/training-monolingual.tokenized.shuffled.perl/news.en"
#     --num_shards=100
#     --input_file=./training-monolingual.tokenized/news.20XX.en.shuffled.tokenized.sorted
So the problem is in the Unix sort then?! Hard to believe. Perhaps what starts out as different UTF-8 character sequences at sort time gets normalized later to the same sequence?! Not sure where the duplicates could come from...

This was originally done in MapReduce internally, but that data could not be released due to legal concerns; I could only release code. As I explained at point 3 in README.corpus_generation
<https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark/blob/master/README.corpus_generation>,
we went through a few iterations of making sure the results I was getting on my machine were the same as the ones that Tony got on his; I guess some bugs survived. :)

How much of the test set 1-billion-word-language-modeling-benchmark/heldout-monolingual.tokenized.shuffled/news.en.heldout-00000-of-00050 overlaps with the training data?

Generally speaking, it was pointed out to me by users of this data that some level of overlap between training and test is to be expected in practice, especially at the short-sentence end. So in that sense de-duping the data is not ideal either... But we started from a far worse situation.
--
-Ciprian
Never mind, here is probably the reason. Reading through:
https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark/blob/master/scripts/get-data.sh#L40
the unique sort of the data was done before running punctuation normalization and tokenization, so that explains the origin of the duplicates.
--
-Ciprian
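To make the explanation above concrete, here is a toy sketch (editorial illustration, not code from the benchmark) of how deduplicating the raw text before normalization/tokenization can leave duplicates afterwards. The example echoes the "Bush is remembered by many Haitians --" duplicate mentioned earlier, and normalize() is a hypothetical stand-in for the actual Perl punctuation normalization and tokenization.

```python
import re

raw_lines = [
    "Bush is remembered by many Haitians--",   # two raw variants that differ only in
    "Bush is remembered by many Haitians --",  # spacing around "--" survive `sort -u`
]

def normalize(s: str) -> str:
    # Stand-in for the punctuation normalization + tokenization step.
    s = re.sub(r"\s*--\s*", " -- ", s)
    return " ".join(s.split())

unique_raw = sorted(set(raw_lines))             # what `sort -u` keeps on the raw text
processed = [normalize(s) for s in unique_raw]  # both normalize to the same sentence

print(len(unique_raw), "unique raw lines ->", len(set(processed)), "unique processed lines")
# 2 unique raw lines -> 1 unique processed lines
```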
P.p.s. Looking at the history on get-data.sh, it was the last commit
<3780330?diff=split>
that moved the sorting of the data before the normalization/tokenization. The commit description explains the decision:

> Sort the date before doing the perl pre-processing.
> This will make sure the training/held-out partitioning of the data is
> the same irrespective of how Perl handles Unicode peculiarities in the
> raw text.

So it seems that (other than typing "date" instead of "data") we could not have done it better. I am now relieved. :-)
Thanks for pointing this out! It would be great to know what percentage of the test set sentences are observed as such in the training data.
-Ciprian
bjascob:
For all 50 heldout shards, I see 303,465 unique sentences and 3,223 that are duplicates of sentences in the training directory, so roughly 1%.
Would it be hard to get the exact number for
1-billion-word-language-modeling-benchmark/heldout-monolingual.tokenized.shuffled/news.en-00000-of-00100
only? I could do it too, but if you have the script handy and are willing to re-run it...
--
-Ciprian
Wrong copy/paste, sorry. I meant:
1-billion-word-language-modeling-benchmark/heldout-monolingual.tokenized.shuffled/news.en.heldout-00000-of-00050
--
-Ciprian
bjascob:
For news.en.heldout-00000-of-00050 there were 6,005 unique sentences and 70 that were duplicates of training data. The duplicate sentences are listed in the following file: bw_dup_shard0.txt
<https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark/files/2368867/bw_dup_shard0.txt>.
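As a quick arithmetic check on the "roughly 1%" figures in this thread (simple division of the counts reported above, nothing more):

```python
# Overlap rates implied by the reported counts.
all_heldout_shards = 3223 / 303465   # all 50 held-out shards
test_shard_0 = 70 / 6005             # news.en.heldout-00000-of-00050 only
print(f"all held-out shards: {all_heldout_shards:.2%}")  # ~1.06%
print(f"test shard 0:        {test_shard_0:.2%}")        # ~1.17%
```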
Thanks!
--
-Ciprian