My final year project aims to help self-learners who use the internet as an active source of knowledge. The project is expected to have three main modules, namely:
- Summarizing
- Question answering
- Integration of all learning platforms
This part focuses on summarizing large content so that it is easy to skim.
It summarizes PDFs by extracting their text with the PyPDF2 library and applying a few cleansing steps to it.
It summarizes website links too. Regular expressions are used to cleanse the HTML returned from web scraping.
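Here is a minimal sketch of the kind of regex-based HTML cleansing described above. The function name `clean_html` and the exact patterns are my assumptions for illustration, not the project's actual code, and regular expressions will not cope with every malformed page.

```python
import html
import re


def clean_html(raw_html: str) -> str:
    """Strip markup from scraped HTML and return plain text (hypothetical helper)."""
    # Drop <script> and <style> blocks together with their contents.
    text = re.sub(r"(?is)<(script|style)\b.*?>.*?</\1\s*>", " ", raw_html)
    # Drop HTML comments.
    text = re.sub(r"(?s)<!--.*?-->", " ", text)
    # Replace every remaining tag with a space so words do not run together.
    text = re.sub(r"<[^>]+>", " ", text)
    # Decode entities such as &amp; into their characters.
    text = html.unescape(text)
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    page = "<p>Hello &amp; <b>welcome</b></p><script>var x = 1;</script>"
    print(clean_html(page))
```

The cleaned text can then be passed to the summarizer in the same way as text extracted from a PDF.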
I have used the TF-IDF algorithm to extract the most important sentences.
STEPS:
- Tokenise, lemmatize and remove special characters.
- Take up the noun and verb tokens, which are basically what gives a sentence its importance.
- Find their frequencies.
- Calculate TF and IDF using the formulae.
- Sort the sentences based on their importance score.
- Select the required percentage of sentences from the sorted list.
- Return them in the order of their occurrence.
TA-DA!!😁
Check out this awesome link that I referred: https://medium.com/voice-tech-podcast/automatic-extractive-text-summarization-using-tfidf-3fc9a7b26f5
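The summarizing steps above can be sketched roughly as follows. This is a simplified, standard-library-only illustration and not the project's code: it skips lemmatization, uses a small hand-written stopword list as a stand-in for keeping only noun and verb tokens, and treats each sentence as a "document" when computing IDF. The names `STOPWORDS`, `split_sentences`, `tokenize` and `summarize` are all hypothetical.

```python
import math
import re
from collections import Counter

# Assumption: a tiny stopword list stands in for PoS-based noun/verb filtering.
STOPWORDS = {
    "a", "an", "the", "and", "or", "of", "to", "in", "on", "at",
    "is", "are", "was", "were", "it", "this", "that", "for", "with", "by",
}


def split_sentences(text):
    """Split text into sentences at ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase, keep alphabetic words only, and drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOPWORDS]


def summarize(text, ratio=0.3):
    """Return roughly `ratio` of the sentences, ranked by TF-IDF, in original order."""
    sentences = split_sentences(text)
    if not sentences:
        return []
    tokens = [tokenize(s) for s in sentences]
    n = len(sentences)
    # Number of sentences in which each term appears.
    df = Counter(term for toks in tokens for term in set(toks))
    scores = []
    for toks in tokens:
        if not toks:
            scores.append(0.0)
            continue
        tf = Counter(toks)
        # Sum of TF * IDF over the distinct terms of the sentence.
        scores.append(sum((c / len(toks)) * math.log(n / df[t]) for t, c in tf.items()))
    keep = max(1, round(n * ratio))
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:keep]
    # Restore the order of occurrence before returning.
    return [sentences[i] for i in sorted(top)]


if __name__ == "__main__":
    sample = ("Cats sleep a lot. Cats sleep during the day. "
              "Dogs chase cars on busy roads. Cats sleep at night.")
    for line in summarize(sample, ratio=0.5):
        print(line)
```

Sentences made of words that are rare elsewhere in the text score highest here, so repetitive sentences tend to be dropped first.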
The question answering module is a model that helps identify the answer to a question or doubt posed by the user. In order to do this, the system retrieves the top results for the subjects of the user query (found by PoS tagging) and searches the retrieved documents for the best answer.
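To make the idea concrete, here is a very rough sketch of the "search the retrieved documents for the best answer" step. It is not the project's model: a stopword filter stands in for PoS tagging when picking the key terms of the query, the documents are assumed to be already retrieved, and the answer is simply the sentence sharing the most key terms with the question. `QUESTION_STOPWORDS`, `key_terms` and `best_answer` are hypothetical names.

```python
import re

# Assumption: dropping these words approximates keeping the query's subjects.
QUESTION_STOPWORDS = {
    "what", "who", "where", "when", "why", "how", "which",
    "is", "are", "was", "were", "do", "does", "did",
    "the", "a", "an", "of", "in", "on", "to", "for", "and", "or",
}


def key_terms(question):
    """Return the set of non-stopword terms in the question."""
    words = re.findall(r"[a-z]+", question.lower())
    return {w for w in words if w not in QUESTION_STOPWORDS}


def best_answer(question, documents):
    """Return the sentence with the largest key-term overlap, or None if nothing matches."""
    terms = key_terms(question)
    best_sentence, best_score = None, 0
    for doc in documents:
        for sentence in re.split(r"(?<=[.!?])\s+", doc):
            words = set(re.findall(r"[a-z]+", sentence.lower()))
            score = len(terms & words)
            if score > best_score:
                best_sentence, best_score = sentence.strip(), score
    return best_sentence


if __name__ == "__main__":
    docs = [
        "Python is a programming language. It was created by Guido van Rossum.",
        "Paris is the capital of France. France is in Europe.",
    ]
    print(best_answer("What is the capital of France?", docs))
```

A real system would replace the overlap count with a stronger relevance score, but the flow of query terms, candidate sentences and best match stays the same.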
The idea of the integration module is to bring together all of the active courses from various MOOC platforms. This helps users keep track of their active courses and schedule their day accordingly.