Text Cleaner

This script automates text cleaning over one or more text files. Input should consist of one or more messages separated by pipe dividers ('|').
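
As a rough illustration of the input format, the sketch below splits raw file contents into individual messages on the pipe divider. The function name `split_messages` is hypothetical and is not taken from text_clean.py; trimming whitespace and skipping empty pieces are assumptions as well.

```python
def split_messages(raw_text):
    """Split raw file contents into messages on the '|' divider (hypothetical helper)."""
    # Assumption: surrounding whitespace is not significant, and empty
    # pieces (for example after a trailing '|') are not messages.
    return [part.strip() for part in raw_text.split("|") if part.strip()]


print(split_messages("I am John John John | 2 + 2 |"))
# ['I am John John John', '2 + 2']
```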

To get started with Python 2.7, run the following at the Terminal prompt:

$ python2 text_clean.py --input example_texts

or (for Python 3.x):

$ python3 text_clean.py --input example_texts

The core spell-correcting function requires the Python nltk package. To install it in your current environment, enter the following at your Terminal prompt:

$ pip install nltk

Currently the following cleaning functions are supported for each message in the text file:

1.) Repeating tokens are removed; first instance is kept:

"I am John John John" >>> "I am John"

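A minimal sketch of this rule, assuming "repeating" means a token that immediately follows an identical token. The function name `drop_repeated_tokens` is hypothetical, and the script's actual implementation may differ.

```python
def drop_repeated_tokens(message):
    """Keep the first token of each run of identical adjacent tokens."""
    kept = []
    for token in message.split():
        # Only append when the token differs from the one just kept.
        if not kept or token != kept[-1]:
            kept.append(token)
    return " ".join(kept)


print(drop_repeated_tokens("I am John John John"))
# I am John
```
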
2.) Mixed-type tokens are removed, e.g. "John23", "$Max$". However, some special cases are kept:

Dollars ($5, $5,000)
Percentages (2%, 2,000%)
HH:MM Times (4:00, 17:00)
Ordinals (5th, 22nd, 33rd, 71st)
Punctuation at end of token (Hello!, Yes?, Jacks', 101,)
Apostrophe tokens (Don't, Didn't, Jack's)
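
One way to picture the special cases above is as a list of token shapes that are allowed through. The regular expressions below are assumptions written to match the listed examples, and the names `KEEP_PATTERNS` and `is_mixed_type` are hypothetical; this is not the script's own logic.

```python
import re

# Hypothetical patterns for token shapes that are NOT treated as mixed type.
KEEP_PATTERNS = [
    r"[A-Za-z]+",                   # plain words
    r"\d+",                         # plain numbers
    r"[^A-Za-z0-9\s]+",             # symbol-only tokens (left to other rules)
    r"\$\d+(?:,\d{3})*",            # dollars: $5, $5,000
    r"\d+(?:,\d{3})*%",             # percentages: 2%, 2,000%
    r"\d{1,2}:\d{2}",               # HH:MM times: 4:00, 17:00
    r"\d+(?:st|nd|rd|th)",          # ordinals: 5th, 22nd, 33rd, 71st
    r"(?:[A-Za-z]+|\d+)[.,!?;:']",  # punctuation at end: Hello!, Jacks', 101,
    r"[A-Za-z]+'[A-Za-z]+",         # apostrophe tokens: Don't, Jack's
]
# \Z anchors the match at the end of the token, so the whole token must fit.
KEEP_RE = re.compile(r"(?:" + "|".join(KEEP_PATTERNS) + r")\Z")


def is_mixed_type(token):
    """Return True when the token fits none of the allowed shapes."""
    return KEEP_RE.match(token) is None


print([t for t in "John23 paid $5,000 at 17:00".split() if not is_mixed_type(t)])
# ['paid', '$5,000', 'at', '17:00']
```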

3.) Long tokens are removed (tokens longer than 13 characters):

"1234567890123455"
"nowthishereisareallylongword"

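The length rule can be sketched as a simple filter. The names `MAX_TOKEN_LENGTH` and `drop_long_tokens` are hypothetical; only the threshold of 13 comes from the description above.

```python
MAX_TOKEN_LENGTH = 13  # threshold stated in the rule above


def drop_long_tokens(message, max_length=MAX_TOKEN_LENGTH):
    """Remove tokens whose length is greater than max_length."""
    return " ".join(t for t in message.split() if len(t) <= max_length)


print(drop_long_tokens("call 1234567890123455 now"))
# call now
```
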
4.) Tokens with three or more repeating characters are removed:

"Rogggger"
"1000000"

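A sketch of this check using a backreference, assuming the repeated characters must be adjacent, as in both examples. The names `CHAR_RUN_RE`, `has_char_run`, and `drop_char_run_tokens` are hypothetical.

```python
import re

# Any character followed by two more copies of itself: a run of three or more.
CHAR_RUN_RE = re.compile(r"(.)\1\1")


def has_char_run(token):
    """Return True when the token contains three identical characters in a row."""
    return CHAR_RUN_RE.search(token) is not None


def drop_char_run_tokens(message):
    """Remove every token that contains such a run."""
    return " ".join(t for t in message.split() if not has_char_run(t))


print(drop_char_run_tokens("Rogggger that 1000000 times"))
# that times
```

Applied on its own, this check would also flag a token such as "$5,000"; the description does not say how the rules take precedence over one another.
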
5.) All non-punctuation symbols are removed (@, ^, #, etc.). However, math expressions are kept:

"2 + 2"
"5 * 5"
"7 - 7 = 0"

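The following is only a sketch of how such a rule could work for symbols that stand alone as tokens. The operator set, the punctuation set, and the requirement that an operator sit between two numbers are all assumptions, and the name `drop_symbol_tokens` is hypothetical.

```python
import re

MATH_OPERATORS = {"+", "-", "*", "/", "="}  # assumed operator set
PUNCTUATION = set(".,!?;:'\"")              # assumed punctuation set
NUMBER_RE = re.compile(r"\d+\Z")


def is_number(token):
    return NUMBER_RE.match(token) is not None


def drop_symbol_tokens(message):
    """Drop symbol-only tokens unless they are operators between two numbers."""
    tokens = message.split()
    kept = []
    for i, token in enumerate(tokens):
        has_alnum = any(ch.isalnum() for ch in token)
        only_punct = all(ch in PUNCTUATION for ch in token)
        if has_alnum or only_punct:
            kept.append(token)  # words, numbers, and plain punctuation pass through
        elif (token in MATH_OPERATORS and 0 < i < len(tokens) - 1
              and is_number(tokens[i - 1]) and is_number(tokens[i + 1])):
            kept.append(token)  # operator sitting between two numbers
        # Any other symbol-only token is dropped.
    return " ".join(kept)


print(drop_symbol_tokens("hello @ world # 2 + 2"))
# hello world 2 + 2
```
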
6.) Repeating quad-groups, tri-groups, and bi-groups (runs of four, three, or two tokens) are removed; first instance is kept:

"I am watching I am watching I am watching I am watching" >>> "I am watching"

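A sketch of group removal, assuming a group counts as repeating only when an identical group follows it immediately, and that group sizes are checked from largest to smallest. The function names are hypothetical.

```python
def collapse_repeated_groups(tokens, size):
    """Delete any group of `size` tokens that directly repeats the group before it."""
    tokens = list(tokens)
    i = 0
    while i + 2 * size <= len(tokens):
        if tokens[i:i + size] == tokens[i + size:i + 2 * size]:
            # Drop the repeat and check the same position again,
            # so longer chains of repeats collapse to one instance.
            del tokens[i + size:i + 2 * size]
        else:
            i += 1
    return tokens


def drop_repeated_groups(message):
    """Collapse repeated quad-groups, then tri-groups, then bi-groups."""
    tokens = message.split()
    for size in (4, 3, 2):
        tokens = collapse_repeated_groups(tokens, size)
    return " ".join(tokens)


print(drop_repeated_groups("I am watching I am watching I am watching I am watching"))
# I am watching
```
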
7.) Gibberish tokens are removed (this is based on the author's subjective discretion):

"alskdjfaasdlfjkasd"
"s"
"iaaiuuuuwu"

shellshock1911/Text-Cleaner
