Preprocess updates #46

Merged
merged 5 commits into main from Preprocess-Updates
Nov 21, 2024

Conversation

RileyLePrell
Collaborator

Staff Facing App
The app is built in Streamlit and designed with a user-friendly interface comprising three main pages:

  1. Home Page

    • Serves as an introduction to the website.
    • Provides a high-level overview of the app's purpose and functionality.
  2. Upload Files

    • Allows staff to upload various types of files, including:
      • Audio recordings (e.g., meeting recordings).
      • Minutes (PDF format).
      • Agendas (PDF format).
    • Metadata is required before uploading files for better organization.
  3. View Documents

    • Displays all files currently stored in Azure Blob Storage.
    • Allows staff to view files at various stages of the preprocessing pipeline.
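The metadata requirement on the Upload Files page could be enforced with a small validation step before any file is accepted. The following is a minimal sketch, not the app's actual code: the field names (`meeting_date`, `meeting_type`, `file_type`), the `UploadMetadata` class, and the allowed file-type values are assumptions chosen to match the metadata and file kinds listed in this description.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical allowed values; the PR only names the three kinds of upload.
ALLOWED_FILE_TYPES = {"audio", "minutes", "agenda"}


@dataclass(frozen=True)
class UploadMetadata:
    meeting_date: date
    meeting_type: str
    file_type: str


def validate_metadata(meeting_date, meeting_type, file_type):
    """Return (metadata, errors); metadata is None when any field is invalid."""
    errors = []
    if not isinstance(meeting_date, date):
        errors.append("meeting_date is required")
    if not meeting_type or not meeting_type.strip():
        errors.append("meeting_type is required")
    normalized_type = (file_type or "").strip().lower()
    if normalized_type not in ALLOWED_FILE_TYPES:
        errors.append(f"file_type must be one of {sorted(ALLOWED_FILE_TYPES)}")
    if errors:
        return None, errors
    return UploadMetadata(meeting_date, meeting_type.strip(), normalized_type), []
```

In a Streamlit page, the file uploader would only be processed once `errors` comes back empty, so every stored file carries complete metadata.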

Preprocessing Pipeline
The pipeline converts files uploaded through the app into cleaned, chunked, and embedded text, then stores the result in Weaviate Cloud for later querying. Below is a breakdown of the pipeline steps:

  1. Metadata Input

    • Users input essential metadata (e.g., meeting date, type, and file type) before uploading files.
  2. File Conversion

    • Audio files: Transcribed using AssemblyAI and stored as a "dirty" text file.
    • PDF files (minutes/agendas): Converted directly into "dirty" text files.
  3. Text Cleaning

    • Uses an OpenAI GPT model with a tailored prompt to clean and standardize the "dirty" text.
  4. Tokenization and Vector Embedding

    • Tokenizer: Tiktoken, with the clean text split into chunks of 250 tokens.
    • Embedder: Utilizes OpenAI’s "text-embedding-ada-002" model to embed text.
  5. Storage

    • The chunks and their embeddings are stored in Weaviate Cloud, where they can be searched.
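The 250-token chunking in step 4 can be sketched as below. This is an illustration under assumptions, not the code in this PR: the function names are hypothetical, the chunks are assumed to be consecutive and non-overlapping (the description does not say whether they overlap), and `cl100k_base` is assumed because it is the encoding tiktoken associates with text-embedding-ada-002.

```python
from typing import List, Sequence

CHUNK_SIZE = 250  # tokens per chunk, as stated in the PR description


def chunk_token_ids(token_ids: Sequence[int], chunk_size: int = CHUNK_SIZE) -> List[List[int]]:
    """Split token IDs into consecutive, non-overlapping chunks of at most chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [list(token_ids[i:i + chunk_size]) for i in range(0, len(token_ids), chunk_size)]


def chunk_text(text: str, chunk_size: int = CHUNK_SIZE) -> List[str]:
    """Tokenize with tiktoken, chunk the token IDs, and decode each chunk back to text."""
    import tiktoken  # third-party; imported here so chunk_token_ids works without it

    # Assumption: cl100k_base, the encoding used by text-embedding-ada-002.
    encoding = tiktoken.get_encoding("cl100k_base")
    token_ids = encoding.encode(text)
    return [encoding.decode(chunk) for chunk in chunk_token_ids(token_ids, chunk_size)]
```

Each string returned by `chunk_text` would then be sent to the embedding model. A fixed-size split can cut a sentence in half at a chunk boundary, so an overlapping window may be worth considering if retrieval quality suffers.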

Cloud Functionality

  1. Azure Blob Storage:

    • Serves as the file repository.
    • The Streamlit app stores files in Azure Blob Storage at each stage of the pipeline (e.g., raw, dirty, clean).
  2. Weaviate Cloud:

    • Hosts the tokenized and embedded vectors.
    • Facilitates future querying and advanced analytics.
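One way to keep the raw, dirty, and clean versions of a file apart in Blob Storage is a prefix per stage. The sketch below assumes that layout; the PR does not show how blobs are actually named, and `blob_name_for` and `upload_stage_file` are hypothetical helpers. The Azure calls used (`BlobServiceClient.from_connection_string`, `get_blob_client`, `upload_blob`) come from the `azure-storage-blob` package.

```python
from pathlib import PurePosixPath

# Stage names come from the PR description; the prefix-per-stage layout is an assumption.
STAGES = ("raw", "dirty", "clean")


def blob_name_for(stage: str, original_filename: str) -> str:
    """Build a blob name such as 'dirty/meeting.txt' from a stage and a filename."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    name = PurePosixPath(original_filename).name  # drop any directory part
    if stage != "raw":
        # Dirty and clean outputs are text files regardless of the source format.
        name = PurePosixPath(name).with_suffix(".txt").name
    return f"{stage}/{name}"


def upload_stage_file(connection_string: str, container: str, stage: str,
                      filename: str, data: bytes) -> str:
    """Upload data under its stage prefix and return the blob name that was used."""
    from azure.storage.blob import BlobServiceClient  # third-party: azure-storage-blob

    blob_name = blob_name_for(stage, filename)
    service = BlobServiceClient.from_connection_string(connection_string)
    blob = service.get_blob_client(container=container, blob=blob_name)
    blob.upload_blob(data, overwrite=True)
    return blob_name
```

With this naming, the View Documents page could list one stage at a time by filtering on the prefix.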

Current Status

  • Pipeline: Fully operational.
  • Staff Facing App:
    • Pages implemented and functional.
    • Metadata handling and file upload integrated with preprocessing pipeline.
  • View Documents Page: Displays uploaded files from Azure Blob Storage.
  • Tokenizer and Embedder:
    • Tokenization via Tiktoken.
    • Embedding with text-embedding-ada-002.

RileyLePrell and others added 5 commits November 13, 2024 14:40
- Preliminary outline of the Staff Facing App; it includes three pages:

1. Home Page - Intro to the website.
2. Upload Files - Where staff can upload audio recordings, minutes, or agendas.
3. View Documents - Lets staff view what's currently in the cloud.

I've also created a preprocessing pipeline.

1. Users input metadata before uploading a file.
2. An audio file is transcribed by AssemblyAI and saved as a dirty text file. A minutes/agenda PDF is converted to a dirty text file.
3. Uses OpenAI + prompt engineering to clean the text files.
4. Splits the clean text into token chunks and embeds them as vectors.
5. The vectors sit in Weaviate Cloud.
- Created an Azure Blob; Streamlit interacts with it and stores files at different stages of the preprocessing pipeline.

- Pipeline is up and running

- View Documents page to see uploaded files
- Added tokenizer: Tiktoken
- Chunks set to 250 tokens
- Embedder: "text-embedding-ada-002"
- Files were capped at 200 MB; increased this to 1 GB.
- Formatting issues + fixed some problems with Streamlit not running
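For reference, Streamlit's upload cap is controlled by the `server.maxUploadSize` option, which is given in megabytes and defaults to 200. The fragment below is a guess at what the change looks like, assuming "1 gig" means 1024 MB and that the setting lives in the project's `.streamlit/config.toml`; the PR text does not show the actual file.

```toml
# .streamlit/config.toml
[server]
# Value is in megabytes; Streamlit's default is 200.
maxUploadSize = 1024
```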

# Step 2: Upload the audio file to AssemblyAI
print("Uploading audio file to AssemblyAI...")
upload_response = requests.post(

Check warning (Code scanning / Bandit): Call to requests without timeout

if speaker_labels:
    transcription_payload["speaker_labels"] = True

transcription_response = requests.post(

Check warning (Code scanning / Bandit): Call to requests without timeout

# Step 4: Poll for transcription result
while True:
    status_response = requests.get(

Check warning (Code scanning / Bandit): Call to requests without timeout
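The three Bandit warnings share one cause: `requests` waits indefinitely unless a `timeout` is passed, and the `while True` polling loop has no upper bound either. A minimal sketch of one way to address both follows. The helper names, the timeout values, and the overall wait limit are hypothetical, and the `"completed"` / `"error"` status strings are assumed from AssemblyAI's transcript API rather than taken from this PR's code.

```python
import time
from typing import Callable

# Hypothetical defaults; tune them to the service and the file sizes involved.
REQUEST_TIMEOUT = (5, 30)  # (connect, read) seconds, the tuple form requests accepts


def fetch_status_http(url: str, headers: dict) -> dict:
    """GET the transcript status with an explicit timeout."""
    import requests  # third-party; imported lazily so the polling helper has no dependencies

    response = requests.get(url, headers=headers, timeout=REQUEST_TIMEOUT)
    response.raise_for_status()
    return response.json()


def poll_until_done(
    fetch_status: Callable[[], dict],
    max_wait_s: float = 1800.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll until status is 'completed'; raise on 'error' or once max_wait_s has elapsed."""
    deadline = clock() + max_wait_s
    while True:
        result = fetch_status()
        status = result.get("status")
        if status == "completed":
            return result
        if status == "error":
            raise RuntimeError(f"transcription failed: {result.get('error')}")
        if clock() >= deadline:
            raise TimeoutError(f"no result after {max_wait_s} seconds")
        sleep(interval_s)
```

Usage would look like `poll_until_done(lambda: fetch_status_http(status_url, headers))`, where `status_url` and `headers` are whatever the existing code already builds. The same `timeout=` argument applies to the two `requests.post` calls flagged above, although a large audio upload may need a longer read timeout.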
neal-logan merged commit 8920d30 into main on Nov 21, 2024 (5 of 7 checks passed)
neal-logan deleted the Preprocess-Updates branch on November 22, 2024 00:08