A program that counts the words in PDF files, using NLTK word tokenization.
Note that this program does not work on scanned PDF files.
To run this program, follow the steps below:
- Install python3, pip, PyPDF2, and nltk.
- Clone the project Word_counter
- NLTK library to identify stopwords
- About stopwords Read more...
- NLTK library to word tokenize
- About word tokenize Read more...
- Sample input file
- Sample program output
The NLTK stopwords and punkt data packages are required.
To install them on your system, run the following code:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
To change the input file, modify the filename variable:
filename = 'Your_file.pdf'
To change the number of words in the output, modify the count_word variable:
count_word = 30
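For orientation, here is one way filename and count_word could feed into the counting step. This is a minimal sketch under stated assumptions, not the project's actual code: the helper names extract_text, tokenize, top_words, and count_pdf_words are hypothetical, the PDF reading assumes PyPDF2 2.0 or later (the PdfReader class), and the stopword list is assumed to be the English one.

```python
from collections import Counter


def extract_text(filename):
    """Return the text of every page in a PDF (assumes PyPDF2 >= 2.0)."""
    from PyPDF2 import PdfReader  # imported lazily; only needed for PDF input

    reader = PdfReader(filename)
    # extract_text() returns little or nothing for scanned, image-only pages
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def tokenize(text):
    """Split text into tokens with NLTK (needs the 'punkt' data)."""
    from nltk.tokenize import word_tokenize

    return word_tokenize(text)


def top_words(tokens, stop_words, count_word=30):
    """Count alphabetic, non-stopword tokens; return the most frequent ones."""
    words = [t.lower() for t in tokens if t.isalpha()]
    words = [w for w in words if w not in stop_words]
    return Counter(words).most_common(count_word)


def count_pdf_words(filename, count_word=30):
    """Hypothetical end-to-end helper: PDF file -> list of (word, frequency)."""
    from nltk.corpus import stopwords  # needs the 'stopwords' data

    stop_words = set(stopwords.words("english"))  # assumption: English text
    return top_words(tokenize(extract_text(filename)), stop_words, count_word)
```

Calling count_pdf_words('Your_file.pdf', 30) would then return up to 30 (word, frequency) pairs, most frequent first. Lowercasing and dropping non-alphabetic tokens are design choices of this sketch; the real program may count differently.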
- Create a CSV file
- Create a word cloud
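The two items above could be approached roughly as follows. This is only a sketch: the function names write_counts_csv and save_wordcloud are hypothetical, the column names are invented for illustration, and the word cloud part assumes the third-party wordcloud package, which is not among the packages listed in the install step.

```python
import csv


def write_counts_csv(counts, out):
    """Write (word, frequency) pairs to an open text file as CSV rows."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["word", "count"])  # header names are an assumption
    writer.writerows(counts)


def save_wordcloud(counts, image_path):
    """Render the counts as a word cloud image (assumes the wordcloud package)."""
    from wordcloud import WordCloud  # third-party; imported lazily

    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate_from_frequencies(dict(counts))
    cloud.to_file(image_path)
```

To write to disk, open the target with open('word_counts.csv', 'w', newline='', encoding='utf-8') and pass the file object as out; the file name here is just an example.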