Keywords for research: Latent Semantic Indexing, WordNet
Uses the scrapy and nltk libraries to automatically generate a wordcloud for each project listed on the Eclipse IoT site.
- Define the following inputs:
- A project list: a JSON file with one entry per project; the steps below repeat for every project in the list. For example:

[
  {
    "IsCrawled": true,
    "CrawlDepthLevel": 1,
    "IsWordcloudGenerated": true,
    "SiteUrl": "http://www.eclipse.org/paho/",
    "ProjectName": "paho"
  }
]

- A protocol keyword list: a JSON file containing an array of keyword groups. For example:

[
  {
    "id": 1,
    "description": "Message Queuing Telemetry Transport",
    "keys": [
      "MQTT",
      "ZMQ",
      "RabbitMQ"
    ]
  }
]
- Crawl the site to the pre-defined depth level and extract all text into a .txt file (a series of paragraphs; we can use this later to find relationships between projects)
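A minimal sketch of this crawl with scrapy and beautifulsoup4; the spider name, output file layout, and CrawlerProcess entry point are illustrative assumptions, not necessarily how this repo wires it up:

import scrapy
from scrapy.crawler import CrawlerProcess
from bs4 import BeautifulSoup

class SiteTextSpider(scrapy.Spider):
    # Hypothetical spider; the repo's real spider may be organized differently
    name = "site_text"
    custom_settings = {"DEPTH_LIMIT": 1}  # mirrors CrawlDepthLevel in sitelist.json

    def __init__(self, site_url, project_name, **kwargs):
        super().__init__(**kwargs)
        self.start_urls = [site_url]
        self.project_name = project_name

    def parse(self, response):
        # Keep only visible paragraph text, appended to <project>.txt
        soup = BeautifulSoup(response.text, "html.parser")
        paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
        with open(f"{self.project_name}.txt", "a", encoding="utf-8") as f:
            f.write("\n".join(paragraphs) + "\n")
        # Follow links; DEPTH_LIMIT cuts off the recursion
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)

process = CrawlerProcess()
process.crawl(SiteTextSpider, site_url="http://www.eclipse.org/paho/", project_name="paho")
process.start()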
- Preprocess the crawled data (a sketch follows this list)
- Sentence-tokenize the paragraphs
- Word-tokenize the sentences
- Stem the words (since at this point, we don't need the other forms of a word)
- Remove all stop words (using the English stop word list plus our pre-defined stop words, e.g. eclips, github, project, etc.)
- Extract all programming languages (if we want to include programming languages in the wordcloud, we must choose between the language a project is written in and the languages it supports)
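Put together, these preprocessing steps correspond roughly to the following nltk pipeline (the function name and the exact custom stop word set are illustrative):

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

CUSTOM_STOPWORDS = {"eclips", "github", "project"}  # our pre-defined stop words

def preprocess(text):
    stemmer = PorterStemmer()
    stop = set(stopwords.words("english")) | CUSTOM_STOPWORDS
    tokens = []
    for sentence in sent_tokenize(text):       # sentence-tokenize the paragraphs
        for word in word_tokenize(sentence):   # word-tokenize the sentences
            stem = stemmer.stem(word.lower())  # stem; other word forms are not needed
            if stem.isalpha() and stem not in stop:  # drop punctuation and stop words
                tokens.append(stem)
    return tokens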
- Draw the wordcloud (a sketch follows this list)
- Get the frequency distribution of the keywords from step 3 and select the 50 most common (you can choose any number, not just 50 😃)
- Feed the frequencies to the wordcloud Python library to create the picture, then save/serve it
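A sketch of these two steps with nltk's FreqDist and the wordcloud package; the image size and output file name are arbitrary choices:

from nltk import FreqDist
from wordcloud import WordCloud

def draw_wordcloud(tokens, project_name, top_n=50):
    freq = FreqDist(tokens)  # frequency distribution of the preprocessed tokens
    wc = WordCloud(width=800, height=400, background_color="white")
    wc.generate_from_frequencies(dict(freq.most_common(top_n)))
    wc.to_file(f"{project_name}_wordcloud.png")  # save the picture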
- Analyze the crawled data with NLP (a sketch follows this list)
- Split sentences and tokenize
- Find the sentences containing the keywords
- Use grammar rules to identify the relationship implied by each sentence
- Generate a graph of the relationships
- Draw the graph
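A sketch of the keyword filter and grammar step, assuming nltk's perceptron tagger and a toy noun-phrase grammar (the project's real grammar rules are presumably richer):

import nltk

# Toy grammar: chunk noun phrases around the protocol keywords
GRAMMAR = r"NP: {<DT>?<JJ>*<NN.*>+}"

def parse_if_relevant(sentence, keywords):
    tokens = nltk.word_tokenize(sentence)
    lowered = {t.lower() for t in tokens}
    # Only parse sentences that mention at least one protocol keyword
    if not any(k.lower() in lowered for k in keywords):
        return None
    tagged = nltk.pos_tag(tokens)  # needs averaged_perceptron_tagger
    return nltk.RegexpParser(GRAMMAR).parse(tagged)

tree = parse_if_relevant("Paho provides MQTT client implementations.", ["MQTT"])
# tree.draw() renders the parse tree (see the Windows note below)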
For now, drawing the grammar tree only works on Windows.
After each run, the ptidejWordcloud/sitelist.json file marks each processed project by setting IsCrawled and/or IsWordcloudGenerated to true. Reset these values if you want to re-run a project.
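For example, to force a re-run of a single project you can reset its flags; plain json is used here for illustration (the project itself also depends on jsonpickle):

import json

with open("ptidejWordcloud/sitelist.json", encoding="utf-8") as f:
    sites = json.load(f)

for site in sites:
    if site["ProjectName"] == "paho":
        site["IsCrawled"] = False             # re-crawl the site
        site["IsWordcloudGenerated"] = False  # regenerate the wordcloud

with open("ptidejWordcloud/sitelist.json", "w", encoding="utf-8") as f:
    json.dump(sites, f, indent=2)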
Requirements:
- Python 3.7 or above
- Java 8 or above
Install Ghostscript
Needed to export images from the .ps files produced by the nltk grammar scan.
# for Windows: using Chocolatey package manager
choco install ghostscript
# for Linux:
[sudo] apt-get install ghostscript
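Once installed, a .ps parse tree can be converted to a PNG directly; the file names here are placeholders, and on Windows the console binary is gswin64c rather than gs:
# convert an nltk-exported PostScript tree to PNG at 150 dpi
gs -sDEVICE=png16m -r150 -o tree.png tree.ps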
Install Python packages
[sudo] python3 -m pip install Twisted
# for windows:
# pip install Twisted[windows_platform]
[sudo] python3 -m pip install Scrapy
[sudo] python3 -m pip install beautifulsoup4
[sudo] python3 -m pip install matplotlib
[sudo] python3 -m pip install Pillow
[sudo] python3 -m pip install Wordcloud
[sudo] python3 -m pip install tabulate
[sudo] python3 -m pip install pandas
[sudo] python3 -m pip install --upgrade gensim
[sudo] python3 -m pip install jsonpickle
Install Natural Language Processing toolkit (nltk)
Install package
[sudo] python3 -m pip install nltk
Download nltk.data
python3
>>> import nltk
>>> nltk.download('stopwords')
>>> nltk.download('punkt')
>>> nltk.download('averaged_perceptron_tagger')
>>> nltk.download('wordnet')
>>> quit()
For Windows machines
Using Python on a Windows machine requires the Microsoft Visual C++ Build Tools.
You can get the build tools at https://visualstudio.microsoft.com/downloads/.
See the nltk documentation for more about downloading nltk data.
The Stanford POS Tagger is resource-consuming. You will need to increase the Java heap size to avoid a java.lang.OutOfMemoryError exception.
Add/modify these parameters in your VS Code Java settings:
"java.jdt.ls.vmargs": "-Xmx4G -Xms512m [existing settings]"
Run
cd rootProjectFolder
[sudo] python3 auto_runner.py
For debugging with Visual Studio Code:
- Choose "Python: Run Scrapy and NLTK" from the debug configuration list
- Put a breakpoint anywhere in the Python code
- Press F5 or the debug button to start debugging