Preprocess updates #46
- Preliminary outline of the Staff Facing App. It includes three pages:
  1. Home Page: intro to the website.
  2. Upload Files: where staff can upload audio recordings, minutes, or agendas.
  3. View Documents: shows what is currently in the cloud.

  I've created a preprocessing pipeline as well:
  1. Users input metadata before uploading a file.
  2. Audio files are transcribed by AssemblyAI and converted into a dirty text file. Minutes and agendas are converted from PDF to a dirty text file.
  3. OpenAI plus prompt engineering is used to clean the text files.
  4. The text is split into chunks, tokenized, and embedded as vectors.
  5. The result sits in Weaviate Cloud.
- Created Azure Blob storage; Streamlit interacts with it and stores files at different stages of the preprocessing pipeline.
- Pipeline is up and running.
- Added the View Documents page to see uploaded files.
- Added tokenizer: Tiktoken.
- Chunks set to 250.
- Embedder: "text-embedding-ada-002".
- Files were capped at 200 MB; increased this to 1 GB.
- Fixed formatting issues and some problems with Streamlit not running.
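The chunking step mentioned above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: it assumes the chunk size of 250 counts tokens (the commit message does not say) and that chunks do not overlap. The function name `chunk_tokens` is hypothetical. With Tiktoken, the input token list would typically come from something like `tiktoken.get_encoding("cl100k_base").encode(text)`, which is the encoding used by "text-embedding-ada-002".

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[int], chunk_size: int = 250) -> List[List[int]]:
    """Split a token sequence into consecutive, non-overlapping chunks.

    Assumption: chunk_size is measured in tokens; the last chunk may be shorter.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [list(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]


if __name__ == "__main__":
    # Stand-in token ids; a real run would use the tokenizer's output instead.
    fake_tokens = list(range(600))
    print([len(chunk) for chunk in chunk_tokens(fake_tokens)])  # prints [250, 250, 100]
```

Each chunk would then be decoded back to text and sent to the embedder. Whether adjacent chunks should overlap is a design choice the PR does not describe.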
# Step 2: Upload the audio file to AssemblyAI
print("Uploading audio file to AssemblyAI...")
upload_response = requests.post(
Check warning (Code scanning / Bandit): Call to requests without timeout
if speaker_labels:
    transcription_payload["speaker_labels"] = True

transcription_response = requests.post(
Check warning (Code scanning / Bandit): Call to requests without timeout
# Step 4: Poll for transcription result
while True:
    status_response = requests.get(
Check warning (Code scanning / Bandit): Call to requests without timeout
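One possible way to address the Bandit warnings is to pass an explicit `timeout=` to every `requests` call and to bound the polling loop so it cannot spin forever. The sketch below shows this for the polling step only. The function name `poll_transcript`, the timeout values, and the `interval`/`max_wait` parameters are assumptions for illustration. The endpoint and the `completed`/`error` status values follow AssemblyAI's public v2 REST API and should be checked against the code in this PR.

```python
import time

# Assumed from AssemblyAI's public v2 REST API; verify against the PR's code.
TRANSCRIPT_URL = "https://api.assemblyai.com/v2/transcript"
REQUEST_TIMEOUT = (5, 30)  # assumed (connect, read) timeouts in seconds


def poll_transcript(session, transcript_id, api_key,
                    interval=3.0, max_wait=600.0, sleep=time.sleep):
    """Poll until the transcript completes, fails, or max_wait seconds of sleeping pass.

    `session` is any object with a requests-style get(), e.g. requests.Session().
    """
    headers = {"authorization": api_key}
    waited = 0.0
    while True:
        # timeout= bounds each HTTP call, which is what the Bandit warning asks for.
        response = session.get(f"{TRANSCRIPT_URL}/{transcript_id}",
                               headers=headers, timeout=REQUEST_TIMEOUT)
        response.raise_for_status()
        result = response.json()
        if result["status"] == "completed":
            return result
        if result["status"] == "error":
            raise RuntimeError(result.get("error", "transcription failed"))
        if waited >= max_wait:
            raise TimeoutError(f"transcript {transcript_id} not ready after {max_wait}s")
        sleep(interval)
        waited += interval
```

In the app this would be called as `poll_transcript(requests.Session(), transcript_id, api_key)`. The two `requests.post(` calls flagged above can take the same `timeout=REQUEST_TIMEOUT` argument, though a large audio upload may need a longer read timeout.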
Staff Facing App
The app is built in Streamlit and has three main pages:
- Home Page
- Upload Files
- View Documents
Preprocessing Pipeline
The pipeline processes files uploaded through the app and stores the resulting vectors in Weaviate Cloud so they can be queried. Below is a breakdown of the pipeline steps:
1. Metadata Input
2. File Conversion
3. Text Cleaning
4. Tokenization and Vector Embedding
5. Storage
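As a rough illustration of the File Conversion step, the sketch below routes an uploaded file to either audio transcription or PDF text extraction, matching the PR's description that audio goes through AssemblyAI and minutes or agendas are converted from PDF. The function name `conversion_route`, the route labels, and the set of audio extensions are hypothetical; the PR does not state which file types the app accepts.

```python
from pathlib import Path

# Assumed extension sets; the PR does not list the accepted file types.
AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a"}
PDF_EXTENSIONS = {".pdf"}


def conversion_route(filename: str) -> str:
    """Return a label for the conversion path an uploaded file should take."""
    suffix = Path(filename).suffix.lower()
    if suffix in AUDIO_EXTENSIONS:
        return "transcribe_audio"   # audio -> AssemblyAI -> dirty text file
    if suffix in PDF_EXTENSIONS:
        return "extract_pdf_text"   # minutes/agenda PDF -> dirty text file
    raise ValueError(f"unsupported file type: {suffix or filename}")
```

For example, `conversion_route("agenda.pdf")` returns `"extract_pdf_text"`. Both paths end in a dirty text file that is handed to the Text Cleaning step.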
Cloud Functionality
Azure Blob Storage:
Weaviate Cloud:
Current Status