Porting project to Regex repo
ThanhLy1 committed Oct 28, 2023
1 parent 6bc4fa9 commit 6c8b59b
Showing 625 changed files with 135,341 additions and 0 deletions.
37 changes: 37 additions & 0 deletions PARSER.md
@@ -0,0 +1,37 @@
## Setting Up the Environment

### Prerequisites
- **Python 3.x**
- **pip** (Python package manager)

### Installation

1. **Clone the repository**:
```bash
git clone [Your Repository URL]
cd [Your Repository Directory]
```

2. **Install the dependencies**:
```bash
pip install beautifulsoup4 lxml
```

## Using the Parser

1. **Organize your HTML files**:
Place all the HTML files you want to parse in a directory named `article_html` inside the main repository directory.

2. **Run the parser**:
```bash
python parser.py
```
The script processes each HTML file, cleans the text, and writes a corresponding JSON file to a directory named `parsed_json`.

3. **Check the results**:
```bash
ls parsed_json
```
You should see one JSON file for each HTML file in the `article_html` directory. Open the generated files to inspect the parsed and cleaned data.

If any articles returned null values for both the "question" and "article_text" fields, their filenames will be saved in null_articles.txt for review.
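
The workflow described above can be sketched roughly as follows. This is not the actual `parser.py`; `run_parser` and the `parse_html` callback are hypothetical names, and the sketch assumes that `parse_html` returns a dictionary with `"question"` and `"article_text"` keys. Only the directory and file names (`article_html`, `parsed_json`, `null_articles.txt`) come from this document.

```python
import json
from pathlib import Path


def run_parser(parse_html, html_dir="article_html", json_dir="parsed_json",
               null_log="null_articles.txt"):
    """Parse every .html file in html_dir and write one JSON file per article.

    parse_html is a caller-supplied function (hypothetical here) that takes
    the HTML source as a string and returns a dict with the keys
    "question" and "article_text".
    """
    html_dir, json_dir = Path(html_dir), Path(json_dir)
    json_dir.mkdir(parents=True, exist_ok=True)
    null_articles = []

    for html_path in sorted(html_dir.glob("*.html")):
        html = html_path.read_text(encoding="utf-8", errors="replace")
        record = parse_html(html)

        # An article is logged only when BOTH fields came back as null.
        if record.get("question") is None and record.get("article_text") is None:
            null_articles.append(html_path.name)

        # The JSON file is written either way, so null results stay visible.
        out_path = json_dir / (html_path.stem + ".json")
        out_path.write_text(json.dumps(record, ensure_ascii=False, indent=2),
                            encoding="utf-8")

    Path(null_log).write_text("\n".join(null_articles), encoding="utf-8")
    return null_articles
```

Passing the extraction step in as a function keeps the file handling separate from the HTML parsing, so either part can be changed or tested on its own.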
14 changes: 14 additions & 0 deletions PROBLEM.md
@@ -0,0 +1,14 @@
Some articles give me this output:

{
"question": null,
"article_text": null
}

I want to get the article_text, but it is null. Why?

The character `\xa0` is the Unicode non-breaking space (U+00A0), and `\n` is the newline character. Both often appear in text extracted from HTML and are usually unwanted in cleaned text, especially when preparing data for AI training.

To make the text even cleaner, we can modify the cleaning function to replace these characters:

- Replace all occurrences of `\xa0` with a regular space.
- Replace every run of consecutive newline characters (`\n\n`, `\n\n\n`, etc.) with a single newline character to remove excessive line breaks.
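
A minimal sketch of those two replacements is shown below. The function name `clean_text` is an assumption; the project's actual cleaning function is not shown in this commit.

```python
import re


def clean_text(text):
    """Normalize non-breaking spaces and collapse repeated newlines."""
    if text is None:
        # Keep null fields as null instead of failing on them.
        return None
    # Replace every non-breaking space (U+00A0) with a regular space.
    text = text.replace("\xa0", " ")
    # Collapse any run of two or more newlines into a single newline.
    return re.sub(r"\n{2,}", "\n", text)
```

For example, `clean_text("a\xa0b\n\n\nc")` returns `"a b\nc"`.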

70 changes: 70 additions & 0 deletions TODO.md
@@ -0,0 +1,70 @@
# AI Chatbot with MongoDB Integration and Web Interface

This project integrates a generative AI model with a MongoDB database to fetch relevant articles through a web interface. It involves parsing HTML articles, storing parsed data in MongoDB, developing a conversational UX powered by AI, and presenting results in a user-friendly web interface.

## Table of Contents
- [Parsing HTML Articles](#parsing-html-articles)
- [Storing Data in MongoDB](#storing-data-in-mongodb)
- [AI Chatbot Development](#ai-chatbot-development)
- [Web Interface Development](#web-interface-development)
- [Image References Handling](#image-references-handling)
- [Metrics and Success Tracking](#metrics-and-success-tracking)

## Parsing HTML Articles
1. Extract content from the provided HTML articles.
   - Extract the question from the element with `id="kb_article_question"`.
   - Extract the article text from the element with `id="kb_article_text"`.
   - Identify and note image sources for later handling.
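
The extraction steps above could look roughly like the sketch below. It deliberately uses only the standard library `html.parser` module so that it runs without dependencies; it is not the project's parser, which installs BeautifulSoup and lxml. The names `ArticleExtractor` and `extract_article` are hypothetical, and the sketch assumes the target elements are properly closed.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ArticleExtractor(HTMLParser):
    """Collect text under two element ids plus the <img> sources inside them."""

    TARGETS = {"kb_article_question": "question",
               "kb_article_text": "article_text"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = {"question": [], "article_text": []}
        self.images = []
        self._field = None  # field currently being captured, if any
        self._depth = 0     # nesting depth inside the captured element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._field is None:
            # Start capturing when an element carries one of the target ids.
            field = self.TARGETS.get(attrs.get("id"))
            if field and tag not in VOID_TAGS:
                self._field, self._depth = field, 1
            return
        if tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])
        if tag not in VOID_TAGS:
            self._depth += 1

    def handle_endtag(self, tag):
        if self._field is not None and tag not in VOID_TAGS:
            self._depth -= 1
            if self._depth == 0:
                self._field = None  # left the captured element

    def handle_data(self, data):
        if self._field is not None:
            self.chunks[self._field].append(data)


def extract_article(html):
    """Return question, article_text (None when missing) and image sources."""
    parser = ArticleExtractor()
    parser.feed(html)
    parser.close()
    result = {}
    for field, parts in parser.chunks.items():
        text = " ".join(part.strip() for part in parts if part.strip())
        result[field] = text or None
    result["images"] = parser.images
    return result
```

A field is returned as `None` when the page has no element with the expected id, which is one way both fields can end up null for the same article.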

## Storing Data in MongoDB
2. After parsing, structure and store the data in MongoDB. For instance, the data can be organized as:

```javascript
{
  "_id": ObjectId("someid"),
  "question": "How to reset my password?",
  "text": "To reset your password, follow these steps...",
  "images": [{
    "src": "/help/chasek/images/Image/Bacula/Screen%20Shot%202023-08-11%20at%2011_40_43%20AM.png",
    "alt": "Description of image",
    "width": 502,
    "height": 387
  }]
}
```
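
As a rough sketch, a document of this shape could be built and inserted with PyMongo as shown below. The function names, the connection URI, and the database and collection names (`helpdesk`, `articles`) are assumptions for illustration, not part of this project. The `_id` field is left out so that MongoDB generates an `ObjectId` on insert.

```python
def build_document(question, text, images=()):
    """Build a MongoDB-ready dict from parsed article data."""
    return {
        "question": question,
        "text": text,
        "images": [
            {
                "src": img["src"],
                "alt": img.get("alt", ""),
                "width": img.get("width"),
                "height": img.get("height"),
            }
            for img in images
        ],
    }


def store_document(document, uri="mongodb://localhost:27017",
                   db_name="helpdesk", collection_name="articles"):
    """Insert one document and return its generated _id.

    The URI, database name and collection name are placeholder assumptions.
    """
    # Imported here so build_document works without PyMongo installed
    # (third-party package: pip install pymongo).
    from pymongo import MongoClient

    client = MongoClient(uri)
    try:
        result = client[db_name][collection_name].insert_one(document)
        return result.inserted_id
    finally:
        client.close()
```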

## AI Chatbot Development
- Utilize a Generative AI model such as GPT or a similar alternative.
- Train or integrate the AI chatbot to understand natural language queries.
- Connect the chatbot to search the MongoDB database for matching articles.
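
One possible way to look up matching articles is MongoDB's `$text` query operator, sketched below. This is only an assumption about how the search might work; it requires a text index on the searched fields, for example `collection.create_index([("question", "text"), ("text", "text")])`, and `collection` is expected to be a PyMongo collection. The function names are hypothetical.

```python
def build_text_search(query):
    """Return the filter and projection for a relevance-scored text search."""
    search_filter = {"$text": {"$search": query}}
    # Adds a "score" field holding the text-match relevance of each document.
    projection = {"score": {"$meta": "textScore"}}
    return search_filter, projection


def search_articles(collection, query, limit=3):
    """Return up to `limit` articles, best match first.

    Assumes `collection` is a PyMongo collection with a text index.
    """
    search_filter, projection = build_text_search(query)
    cursor = collection.find(search_filter, projection)
    return list(cursor.sort([("score", {"$meta": "textScore"})]).limit(limit))
```

The chatbot could pass the user's question to `search_articles` and use the returned documents as context for its answer.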

## Web Interface Development
- Use the Meteor framework with React for the web interface.
- Style the interface using Bootstrap 5 to ensure a responsive design.
- Implement a chatbox where users can input their questions.
- Display results fetched from the MongoDB database when relevant matches are found.
- Ensure proper rendering of images from the articles.

## Image References Handling
- If the images are hosted on the main site (https://www.hawaii.edu/), prepend this URL to the `src` attribute of each image.
- For images not hosted on the main site, consider moving them to your server and updating their `src` paths accordingly to ensure accessibility.
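
Prepending the base URL can be sketched with the standard library's `urljoin`, which resolves site-relative paths and leaves already absolute URLs unchanged. The function name `absolutize_images` and the image dictionary shape (a `src` key per image) are assumptions based on the example document in this file.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.hawaii.edu/"


def absolutize_images(images, base_url=BASE_URL):
    """Return copies of the image dicts with src resolved against base_url."""
    return [{**img, "src": urljoin(base_url, img["src"])} for img in images]
```

Because the input dictionaries are copied rather than modified, the original parsed data stays intact if the base URL later changes.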

## Metrics and Success Tracking
- Integrate a system to monitor and track metrics:
  - Monitor successful searches.
  - Track reductions in Help Desk ticket submissions.
- Use the metrics to evaluate the effectiveness and success of the implemented solution.

---

**Note:** Always make sure to follow best practices for development, testing, and deployment to ensure the robustness and reliability of the solution.