-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathREADME
57 lines (39 loc) · 1.99 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
Europarl v6 Preprocessing Tools
===============================
written by Philipp Koehn and Josh Schroeder
Sentence Splitter
=================
Usage ./split-sentences.perl -l [en|de|...] < textfile > splitfile
Uses punctuation and Capitalization clues to split paragraphs of
sentences into files with one sentence per line. For example:
This is a paragraph. It contains several sentences. "But why," you ask?
goes to:
This is a paragraph.
It contains several sentences.
"But why," you ask?
See more information in the Nonbreaking Prefixes section.
Nonbreaking Prefixes Directory
==============================
Nonbreaking prefixes are loosely defined as any word ending in a
period that does NOT indicate an end of sentence marker. A basic
example is Mr. and Ms. in English.
The sentence splitter and tokenizer included with this release
both use the nonbreaking prefix files included in this directory.
To add a file for other languages, follow the naming convention
nonbreaking_prefix.?? and use the two-letter language code you
intend to use when calling split-sentences.perl and tokenizer.perl.
Both split-sentences and tokenizer will first look for a file for the
language they are processing, and fall back to English if a file
for that language is not found. If the nonbreaking_prefixes directory does
not exist at the same location as the split-sentences.perl and tokenizer.perl
files, they will not run.
For the splitter, normally a period followed by an uppercase word
results in a sentence split. If the word preceeding the period
is a nonbreaking prefix, this line break is not inserted.
For the tokenizer, a nonbreaking prefix is not separated from its
period with a space.
A special case of prefixes, NUMERIC_ONLY, is included for special
cases where the prefix should be handled ONLY when before numbers.
For example, "Article No. 24 states this." the No. is a nonbreaking
prefix. However, in "No. It is not true." No functions as a word.
See the example prefix files included here for more examples.