Porting project to Regex repo

HACC2023 · Oct 28, 2023 · 6c8b59b
1 parent 6bc4fa9
commit 6c8b59b

Showing 625 changed files with 135,341 additions and 0 deletions.
diff --git a/PARSER.md b/PARSER.md
@@ -0,0 +1,37 @@
+## Setting Up the Environment
+
+### Prerequisites
+- **Python 3.x**
+- **pip** (Python package manager)
+
+### Installation
+
+1. **Clone the repository**:
+   ```bash
+   git clone [Your Repository URL]
+   cd [Your Repository Directory]
+   ```
+2. **Install the dependencies**:
+    ```bash
+    pip install beautifulsoup4 lxml
+    ```
+
+## Using the Parser
+
+1. **Organize your HTML files**:
+   Place all the HTML files you want to parse in a directory named `article_html` in the repository root.
+
+
+2. **Run the parser**:
+   ```bash
+   python parser.py
+   ```
+   This script processes each HTML file, cleans the text, and writes a corresponding JSON file to a directory named `parsed_json`.
+3. **Check the results**:
+   ```bash
+   ls parsed_json
+   ```
+   You should see one JSON file for each HTML file in the `article_html` directory.
+
+   Open the generated JSON files to inspect the parsed and cleaned data.
+   If an article returned null for both the `"question"` and `"article_text"` fields, its filename is saved in `null_articles.txt` for review.
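The output behaviour PARSER.md describes (one JSON file per HTML file, plus a list of fully-null articles) could be sketched as follows. This is only an illustration, not the code in `parser.py`: the function name `write_results` and the shape of `records` are assumptions; only the `parsed_json` and `null_articles.txt` names come from the README.

```python
import json
from pathlib import Path


def write_results(records, out_dir="parsed_json", null_log="null_articles.txt"):
    """Write one JSON file per parsed article and log fully-null articles.

    `records` (hypothetical shape) maps an HTML filename such as
    "article.html" to a dict with the keys "question" and "article_text";
    either value may be None. Returns the filenames whose fields are both None.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    null_articles = []
    for html_name, record in sorted(records.items()):
        # article.html -> <out_dir>/article.json
        json_file = out_path / (Path(html_name).stem + ".json")
        json_file.write_text(
            json.dumps(record, ensure_ascii=False, indent=2), encoding="utf-8"
        )
        if record.get("question") is None and record.get("article_text") is None:
            null_articles.append(html_name)

    # One filename per line, for manual review.
    Path(null_log).write_text(
        "".join(f"{name}\n" for name in null_articles), encoding="utf-8"
    )
    return null_articles
```

An article with only one null field still gets its JSON file but is not logged, which matches the "both fields" wording in the README.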
diff --git a/PROBLEM.md b/PROBLEM.md
@@ -0,0 +1,14 @@
+Some articles give me this output: {
+"question": null,
+"article_text": null
+}
+
+I want to get the `article_text`, but it is null. Why?
+
+The escape sequence `\xa0` represents a non-breaking space (U+00A0), and `\n` represents a newline character. Both are common in text extracted from HTML and are often unwanted in cleaned text, especially when preparing data for AI training.
+
+To make the text cleaner, we can modify the cleaning function to:
+
+- Replace all occurrences of `\xa0` with a regular space.
+- Replace runs of consecutive newline characters (`\n\n`, `\n\n\n`, etc.) with a single newline character to remove excessive line breaks.
+
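The two replacements described in PROBLEM.md can be sketched as a small helper. The name `clean_text` is an assumption; the cleaning function actually used by the parser is not shown in this diff.

```python
import re


def clean_text(text):
    """Normalize whitespace in text extracted from HTML (hypothetical helper)."""
    if text is None:
        # Leave missing fields untouched so null articles can still be detected.
        return None
    # Non-breaking spaces become regular spaces.
    text = text.replace("\xa0", " ")
    # Runs of two or more newlines collapse to a single newline.
    return re.sub(r"\n{2,}", "\n", text)
```

For example, `clean_text("a\xa0b\n\n\nc")` returns `"a b\nc"`. Single newlines are kept as they are, so paragraph breaks survive as one line break each.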
diff --git a/TODO.md b/TODO.md
@@ -0,0 +1,70 @@
+# AI Chatbot with MongoDB Integration and Web Interface
+
+This project integrates a generative AI model with a MongoDB database to fetch relevant articles through a web interface. It involves parsing HTML articles, storing the parsed data in MongoDB, developing a conversational UX powered by AI, presenting results in a user-friendly web interface, and ensuring that images referenced in the articles display correctly and remain accessible.
+
+## Table of Contents
+- [Parsing HTML Articles](#parsing-html-articles)
+- [Storing Data in MongoDB](#storing-data-in-mongodb)
+- [AI Chatbot Development](#ai-chatbot-development)
+- [Web Interface Development](#web-interface-development)
+- [Image References Handling](#image-references-handling)
+- [Metrics and Success Tracking](#metrics-and-success-tracking)
+
+## Parsing HTML Articles
+1. Extract content from the provided HTML articles.
+    - Extract the question from the element with `id="kb_article_question"`.
+    - Extract the article text from the element with `id="kb_article_text"`.
+    - Identify and note image sources for later handling.
+
+## Storing Data in MongoDB
+2. After parsing, structure and store the data in MongoDB. For instance, the data can be organized as:
+
+```json
+{
+  "_id": ObjectId("someid"),
+  "question": "How to reset my password?",
+  "text": "To reset your password, follow these steps...",
+  "images": [{
+    "src": "/help/chasek/images/Image/Bacula/Screen%20Shot%202023-08-11%20at%2011_40_43%20AM.png",
+    "alt": "Description of image",
+    "width": 502,
+    "height": 387
+  }]
+}
+```
+
+## AI Chatbot Development
+- Utilize a Generative AI model such as GPT or a similar alternative.
+- Train or integrate the AI chatbot to understand natural language queries.
+- Connect the chatbot to search the MongoDB database for matching articles.
+
+## Web Interface Development
+- Use the Meteor framework with React for the web interface.
+- Style the interface using Bootstrap 5 to ensure a responsive design.
+- Implement a chatbox where users can input their questions.
+- Display results fetched from the MongoDB database when relevant matches are found.
+- Ensure proper rendering of images from the articles.
+
+## Image References Handling
+- If the images are hosted on the main site (https://www.hawaii.edu/), prepend this URL to the `src` attribute of each image, taking care not to double the slash when `src` already starts with `/`.
+- For images not hosted on the main page, consider moving them to your server and updating their `src` paths accordingly to ensure accessibility.
+
+## Metrics and Success Tracking
+- Integrate a system to monitor and track metrics:
+    - Monitor successful searches.
+    - Track reductions in Help Desk ticket submissions.
+- Utilize the metrics to evaluate the effectiveness and success of the implemented solution.
+
+---
+
+**Note:** Always make sure to follow best practices for development, testing, and deployment to ensure the robustness and reliability of the solution.
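For the "Image References Handling" item in TODO.md, one possible way to turn site-relative `src` values into absolute URLs is the standard library's `urljoin`, which also avoids doubling the slash. This is a sketch under assumptions: the function name `absolutize_images` is hypothetical, and the image records are assumed to look like the `images` entries in the MongoDB example above.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.hawaii.edu/"  # main site named in TODO.md


def absolutize_images(images, base_url=BASE_URL):
    """Return copies of image records with site-relative `src` made absolute."""
    resolved = []
    for image in images:
        updated = dict(image)  # copy so the caller's records are not mutated
        # urljoin resolves "/help/..." against the base URL and returns
        # already-absolute URLs unchanged.
        updated["src"] = urljoin(base_url, image["src"])
        resolved.append(updated)
    return resolved
```

Images hosted elsewhere pass through unchanged, so they can still be identified afterwards and moved to your own server as the TODO suggests.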