Showing 625 changed files with 135,341 additions and 0 deletions.
@@ -0,0 +1,37 @@
## Setting Up the Environment

### Prerequisites
- **Python 3.x**
- **pip** (Python package manager)

### Installation

1. **Clone the repository**:
```bash
git clone [Your Repository URL]
cd [Your Repository Directory]
```

2. **Install the dependencies**:
```bash
pip install beautifulsoup4 lxml
```

## Using the Parser

1. Organize your HTML files:
Place all the HTML files you want to parse in a directory named `article_html` inside the main repository directory.

2. Run the parser:
```bash
python parser.py
```
The script processes each HTML file, cleans the text, and writes a corresponding JSON file to a directory named `parsed_json`.
3. Check the results:
```bash
ls parsed_json
```
You should see one JSON file for each HTML file in the `article_html` directory.

Open the generated JSON files in the `parsed_json` directory to inspect the parsed and cleaned data.
If an article returns null for both the "question" and "article_text" fields, its filename is saved in `null_articles.txt` for review.
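
The null check described above could be reproduced after the fact with something like the following. This is a minimal sketch, not the actual logic of `parser.py`: the function name `find_null_articles` is hypothetical, and it assumes every file in `parsed_json` is a JSON object with `question` and `article_text` keys.

```python
import json
from pathlib import Path


def find_null_articles(json_dir):
    """Return the names of JSON files whose "question" and "article_text" are both null."""
    null_files = []
    for path in sorted(Path(json_dir).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            data = json.load(f)
        # JSON null is loaded as Python None; a missing key is treated the same way.
        if data.get("question") is None and data.get("article_text") is None:
            null_files.append(path.name)
    return null_files
```

Calling `find_null_articles("parsed_json")` returns a sorted list of filenames that can be compared with the contents of `null_articles.txt`.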
@@ -0,0 +1,14 @@
Some articles give me this:
{
  "question": null,
  "article_text": null
}

I want to get the article_text, but it is null. Why?

The escape \xa0 represents a non-breaking space (U+00A0) in Unicode, and \n represents a newline character. Both are common in HTML content and are often unwanted in cleaned text, especially when preparing data for AI training.

To make the text even cleaner, we can modify the cleaning function to replace these characters:

Replace all occurrences of \xa0 with a regular space.
Replace every run of multiple newline characters (\n\n, \n\n\n, etc.) with a single newline character to remove excessive line breaks.
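
A minimal sketch of those two replacements, assuming the cleaning step receives a plain Python string. The function name `clean_text` is hypothetical and is not taken from the actual parser.

```python
import re


def clean_text(text):
    """Replace non-breaking spaces and collapse runs of newlines."""
    # \xa0 (U+00A0, non-breaking space) becomes a regular space.
    text = text.replace("\xa0", " ")
    # Two or more consecutive newlines become a single newline.
    text = re.sub(r"\n{2,}", "\n", text)
    return text
```

For example, `clean_text("Reset\xa0password\n\n\nStep 1")` returns `"Reset password\nStep 1"`.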
@@ -0,0 +1,70 @@
# AI Chatbot with MongoDB Integration and Web Interface

This project integrates a generative AI model with a MongoDB database to fetch relevant articles through a web interface. It involves parsing HTML articles, storing the parsed data in MongoDB, developing a conversational UX powered by AI, and presenting results in a user-friendly web interface.

## Table of Contents
- [Parsing HTML Articles](#parsing-html-articles)
- [Storing Data in MongoDB](#storing-data-in-mongodb)
- [AI Chatbot Development](#ai-chatbot-development)
- [Web Interface Development](#web-interface-development)
- [Image References Handling](#image-references-handling)
- [Metrics and Success Tracking](#metrics-and-success-tracking)

## Parsing HTML Articles
1. Extract content from the provided HTML articles.
   - Extract the question from the element with `id="kb_article_question"`.
   - Extract the article text from the element with `id="kb_article_text"`.
   - Identify and record image sources for later handling.
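
The extraction steps above can be sketched as follows. This is an illustration under stated assumptions, not the project's parser: it uses the standard-library `html.parser` module in place of a third-party HTML library, the names `ArticleExtractor` and `extract_article` are hypothetical, it records the `src` of every `<img>` in the document, and it assumes the markup closes its tags explicitly.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ArticleExtractor(HTMLParser):
    """Collect the text inside the target ids and the src of every <img>."""

    TARGET_IDS = ("kb_article_question", "kb_article_text")

    def __init__(self):
        super().__init__()
        self.parts = {target: [] for target in self.TARGET_IDS}
        self.image_sources = []
        self._current = None  # id of the element being captured, if any
        self._depth = 0       # nesting depth inside that element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            self.image_sources.append(attrs["src"])
        if tag in VOID_TAGS:
            return
        if self._current is not None:
            self._depth += 1
        elif attrs.get("id") in self.TARGET_IDS:
            self._current = attrs["id"]
            self._depth = 1

    def handle_endtag(self, tag):
        if self._current is None or tag in VOID_TAGS:
            return
        self._depth -= 1
        if self._depth == 0:
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self.parts[self._current].append(data)


def extract_article(html):
    """Return the question, article text, and image sources found in html."""
    parser = ArticleExtractor()
    parser.feed(html)
    parser.close()

    def joined(target):
        # An absent or empty element yields None, which serializes to JSON null.
        text = "".join(parser.parts[target]).strip()
        return text or None

    return {
        "question": joined("kb_article_question"),
        "article_text": joined("kb_article_text"),
        "images": parser.image_sources,
    }
```

A page that contains neither id produces `None` for both fields, which is one way an article can end up with null values in its output.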

## Storing Data in MongoDB
2. After parsing, structure the data and store it in MongoDB. For instance, a document can be organized as:

```json
{
  "_id": ObjectId("someid"),
  "question": "How to reset my password?",
  "text": "To reset your password, follow these steps...",
  "images": [{
    "src": "/help/chasek/images/Image/Bacula/Screen%20Shot%202023-08-11%20at%2011_40_43%20AM.png",
    "alt": "Description of image",
    "width": 502,
    "height": 387
  }]
}
```

The project also ensures proper display and accessibility of images referenced in the articles.

## AI Chatbot Development
- Use a generative AI model such as GPT or a similar alternative.
- Train or integrate the AI chatbot so that it understands natural-language queries.
- Connect the chatbot to the MongoDB database so that it can search for matching articles.
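
To make the matching step concrete, here is a deliberately simple stand-in: a keyword-overlap ranking over an in-memory list of documents shaped like the example under "Storing Data in MongoDB". It is not a generative model and it does not query MongoDB. The names `tokenize` and `search_articles` and the `limit` parameter are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search_articles(query, articles, limit=3):
    """Rank articles by how many query words appear in their question or text."""
    query_words = tokenize(query)
    scored = []
    for article in articles:
        words = tokenize(article.get("question") or "") | tokenize(article.get("text") or "")
        score = len(query_words & words)
        if score > 0:
            scored.append((score, article))
    # Highest overlap first; the sort is stable, so ties keep their input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [article for _, article in scored[:limit]]
```

In a real deployment the list would be replaced by a database query and the overlap score by whatever relevance signal the chosen model or index provides.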

## Web Interface Development
- Use the Meteor framework with React for the web interface.
- Style the interface with Bootstrap 5 to ensure a responsive design.
- Implement a chatbox where users can enter their questions.
- Display results fetched from the MongoDB database when relevant matches are found.
- Ensure that images from the articles render properly.

## Image References Handling
- If the images are hosted on the main site (https://www.hawaii.edu/), prepend this URL to the `src` attribute of each image.
- For images not hosted on the main site, consider moving them to your own server and updating their `src` paths so that they remain accessible.
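
Prepending the site URL can be done with `urllib.parse.urljoin`, which also leaves an already absolute `src` untouched. A minimal sketch, assuming image records are dictionaries with a `src` key as in the example document; the function name `absolutize_image_sources` is hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.hawaii.edu/"


def absolutize_image_sources(images, base_url=BASE_URL):
    """Return copies of the image records with each src resolved against base_url."""
    # urljoin leaves a src that is already an absolute URL unchanged.
    return [{**image, "src": urljoin(base_url, image["src"])} for image in images]
```

The input list is not modified, so the original relative paths stay available if the images are later moved to another server.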

## Metrics and Success Tracking
- Integrate a system to monitor and track metrics:
  - Monitor successful searches.
  - Track reductions in Help Desk ticket submissions.
- Use the metrics to evaluate the effectiveness of the implemented solution.
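
One possible shape for the search-tracking part is sketched below. It assumes a search counts as successful when it returns at least one article, which is an assumption rather than a definition from this project. The class name `SearchMetrics` is hypothetical, and Help Desk ticket counts would have to come from the ticketing system, which is not modeled here.

```python
from collections import Counter


class SearchMetrics:
    """Count searches and how many of them returned at least one article."""

    def __init__(self):
        self.counts = Counter()

    def record_search(self, result_count):
        self.counts["searches"] += 1
        if result_count > 0:
            self.counts["successful_searches"] += 1

    def success_rate(self):
        # Fraction of recorded searches that returned results; 0.0 before any search.
        total = self.counts["searches"]
        return self.counts["successful_searches"] / total if total else 0.0
```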

---

**Note:** Follow best practices for development, testing, and deployment to keep the solution robust and reliable.