Uncover the brilliance: Explore profiles, groundbreaking work, and cutting-edge research by the exceptional minds of Coventry University.
The Coventry PureHub Search Engine is a web application for searching research publications and authors affiliated with Coventry University. It uses natural language processing techniques such as stemming and TF-IDF, together with an inverted index, to return relevant results through a user-friendly interface.
- Research Publication Search: Users can search for research publications by entering relevant keywords or phrases. The search engine uses stemming and TF-IDF to match the query against the indexed publication data.
- Author Search: Users can also search for specific authors by name or by related keywords. The search engine applies the same techniques to match the input against the indexed author data.
- Stemming and TF-IDF: Stemming reduces words to their base or root form, so a query also matches other inflections of the same word. TF-IDF weights each term by its importance within a document relative to the whole collection, and these weights produce the relevance scores used to rank search results.
- Inverted Index: The search engine builds an inverted index over the publication and author data, mapping each term to the records that contain it, so that relevant records can be retrieved efficiently.
- Multinomial Naïve Bayes Classification: The search engine uses a Multinomial Naïve Bayes classifier to assign publications to subject categories.
- Cron job: The crawler is scheduled with the cron expression "0 0 * * 0" and the command file "Scrapper.py", so it runs every Sunday at midnight. This keeps the indexed data up to date by retrieving fresh information at the start of each week.
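To make the stemming and TF-IDF feature concrete, here is a minimal standard-library sketch of the ranking idea. It is not the project's code: `toy_stem` is a deliberately crude stand-in for a real stemmer such as NLTK's, the sample titles are invented, and the IDF formula is the smoothed form that scikit-learn's `TfidfVectorizer` uses by default.

```python
import math
import re
from collections import Counter


def toy_stem(word):
    """Crude suffix stripper; a stand-in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase, split on non-letters, and stem every token."""
    return [toy_stem(tok) for tok in re.findall(r"[a-z]+", text.lower())]


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document plus the shared IDF table."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query, docs):
    """Return document indices ordered by descending similarity to the query."""
    vectors, idf = tfidf_vectors(docs)
    # Query terms that never occur in the collection carry no weight.
    q_vec = {t: tf * idf[t] for t, tf in Counter(tokenize(query)).items() if t in idf}
    scores = [(cosine(q_vec, vec), i) for i, vec in enumerate(vectors)]
    return [i for score, i in sorted(scores, key=lambda p: (-p[0], p[1])) if score > 0]


# Hypothetical publication titles, for illustration only.
titles = [
    "Deep learning for medical imaging",
    "Learning analytics in higher education",
    "Sustainable transport planning",
]
print(rank("machine learning images", titles))  # [0, 1]
```

Because "images" and "imaging" reduce to the same stem, the first title matches on two terms and ranks above the second, which matches only on "learning".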
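The inverted index feature can be illustrated with a small sketch. This is an assumed, simplified structure (term to sorted document ids, with AND semantics for multi-word queries), not necessarily how the project stores its index; the function names and sample documents are hypothetical.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in re.findall(r"[a-z]+", text.lower()):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def lookup(index, query):
    """Return ids of documents that contain every query term."""
    terms = re.findall(r"[a-z]+", query.lower())
    if not terms:
        return []
    postings = [set(index.get(term, ())) for term in terms]
    return sorted(set.intersection(*postings))


# Hypothetical documents, for illustration only.
docs = [
    "Deep learning for medical imaging",
    "Learning analytics in higher education",
    "Sustainable transport planning",
]
index = build_inverted_index(docs)
print(index["learning"])                    # [0, 1]
print(lookup(index, "learning education"))  # [1]
```

A lookup touches only the posting lists of the query terms, which is why this structure scales better than scanning every document for each search.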
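For the Multinomial Naïve Bayes feature, the following from-scratch sketch shows how such a classifier scores a text against each category using word counts and Laplace smoothing. The category names and training texts are invented for illustration, and a real project would more likely rely on a library implementation such as scikit-learn's `MultinomialNB`.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def train_multinomial_nb(samples, alpha=1.0):
    """Fit log priors and smoothed log likelihoods from (text, label) pairs."""
    label_counts = Counter(label for _, label in samples)
    term_counts = defaultdict(Counter)
    for text, label in samples:
        term_counts[label].update(tokenize(text))
    vocab = {term for counts in term_counts.values() for term in counts}
    log_prior = {c: math.log(n / len(samples)) for c, n in label_counts.items()}
    log_like = {}
    for c in label_counts:
        # Laplace smoothing: add alpha to every term count in the vocabulary.
        total = sum(term_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {t: math.log((term_counts[c][t] + alpha) / total) for t in vocab}
    return log_prior, log_like


def predict(model, text):
    """Return the label with the highest posterior log score."""
    log_prior, log_like = model
    scores = {}
    for c in log_prior:
        # Terms outside the training vocabulary are ignored.
        scores[c] = log_prior[c] + sum(
            log_like[c][t] for t in tokenize(text) if t in log_like[c]
        )
    return max(scores, key=scores.get)


# Hypothetical training data; the real subject categories are not listed here.
samples = [
    ("patient care and nursing outcomes", "Health"),
    ("clinical trial of diabetes treatment", "Health"),
    ("engine design and vehicle dynamics", "Engineering"),
    ("battery systems for electric vehicle", "Engineering"),
]
model = train_multinomial_nb(samples)
print(predict(model, "electric vehicle engine"))  # Engineering
```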
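To see what the cron expression "0 0 * * 0" means in practice, the sketch below checks a timestamp against it and computes the next run time. It hard-codes this one schedule rather than parsing cron syntax in general, and the function names are illustrative.

```python
from datetime import datetime, timedelta


def matches_weekly_schedule(moment):
    """True when `moment` satisfies the cron expression "0 0 * * 0"."""
    # Fields: minute 0, hour 0, any day of month, any month, weekday 0.
    cron_weekday = moment.isoweekday() % 7  # cron counts Sunday as 0
    return moment.minute == 0 and moment.hour == 0 and cron_weekday == 0


def next_run(after):
    """Return the first Sunday 00:00 strictly after `after`."""
    candidate = after.replace(hour=0, minute=0, second=0, microsecond=0)
    candidate += timedelta(days=1)
    while not matches_weekly_schedule(candidate):
        candidate += timedelta(days=1)
    return candidate


print(next_run(datetime(2024, 1, 3, 15, 30)))  # 2024-01-07 00:00:00
```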
- Clone the repository:
git clone https://github.com/maladeep/Coventry-PureHub-Search-Engine.git
- Install the required dependencies:
pip install -r requirements.txt
- Run locally from the cloned repository directory, replacing <app-file>.py with the application's entry-point script:
streamlit run <app-file>.py
- Open the provided URL in your web browser.
- Enter your search query, select the search filter and search type, and click the "SEARCH" button.
- View the search results displayed in cards.
- Scroll down to view more search results.
The Coventry PureHub Search Engine relies on the following dependencies:
- streamlit: The web application framework used for building the user interface.
- Pillow: A library for opening and manipulating images, used to display an image in the streamlit application.
- ujson: A fast JSON encoder and decoder library, used to load JSON data.
- scikit-learn: A machine learning library, used for text preprocessing, TF-IDF vectorization, and cosine similarity calculation.
- nltk: The Natural Language Toolkit, used for tokenization, stemming, and stop-word removal.
- numpy: A powerful library for numerical computations in Python.
- pandas: A data manipulation library, used for handling and processing structured data.
- seaborn: A data visualization library, used for creating attractive and informative plots.
- matplotlib: A versatile plotting library, used for generating various types of charts and graphs.
- scikit-multilearn: A library for multi-label classification, used for advanced search features.
- requests: A library for making HTTP requests, used for fetching external resources.
- beautifulsoup4: A library for web scraping, used for extracting data from web pages.
- selenium: A library for web automation, used for interacting with web pages.
- webdriver_manager: A library for managing web drivers, used for browser automation.
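If something fails at startup, it can help to confirm that every dependency listed above is actually installed. The sketch below uses only the standard library and takes the distribution names from the list in this README; the pinned names in requirements.txt may differ, so treat `REQUIRED` as an assumption.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names copied from the dependency list (assumed to match PyPI).
REQUIRED = [
    "streamlit", "Pillow", "ujson", "scikit-learn", "nltk", "numpy", "pandas",
    "seaborn", "matplotlib", "scikit-multilearn", "requests", "beautifulsoup4",
    "selenium", "webdriver_manager",
]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


report = installed_versions(REQUIRED)
missing = [name for name, ver in report.items() if ver is None]
print("Missing:", ", ".join(missing) or "none")
```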
Contributions to this project are welcome. If you find any issues or would like to suggest improvements, please open an issue or submit a pull request.
This project is licensed under the MIT License. See the LICENSE file for more information.
This work was done in partial fulfillment of the STW7071CEM Information Retrieval coursework at Coventry University.