DataScout is an intelligent data scraping tool designed to streamline data retrieval and structured information extraction using AI-driven automation. It reads an uploaded dataset, either a CSV file or a Google Sheet, and performs web searches based on custom prompts defined by the user. A Large Language Model (LLM) parses the search results and extracts the requested information for each entity (such as a company name) in a column selected by the user.
- Google Sheets Support: Import and export your data directly from Google Sheets.
- Custom Search Queries: Define custom search prompts to tailor the data extraction process.
- AI-Powered Search: Utilize LLMs to refine search queries and extract precise information from the search results.
- Parallel Processing: Handles multiple search queries simultaneously.
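As a rough illustration of the parallel processing feature, the sketch below runs several queries concurrently with a thread pool. It is not DataScout's actual implementation: `run_search` is a hypothetical stand-in for a real web search call, and the worker count is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def run_search(query: str) -> str:
    """Hypothetical stand-in for one web search; a real call would hit a search API."""
    return f"results for: {query}"


def run_searches_in_parallel(queries: List[str], max_workers: int = 4) -> List[str]:
    """Run all queries concurrently and return the results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the same order as the input queries.
        return list(pool.map(run_search, queries))


if __name__ == "__main__":
    for line in run_searches_in_parallel(["Acme email", "Globex email"]):
        print(line)
```

Threads suit this kind of workload because each search spends most of its time waiting on the network rather than on the CPU.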
DataScout is a boilerplate for any custom data scraping project. Here are some things you can build with DataScout:
- Retrieve company profiles, industry trends, and product information for competitors.
- Extract key data points like headquarters, CEO names, and recent financial highlights.
- Gather information on potential clients or partners from specific sectors or regions.
- Collect job openings, roles, and company hiring trends from public job boards.
- Analyze job descriptions to extract essential skills and qualifications.
- Collect data points from multiple sources for use in academic projects or reports.
- Collect and analyze reviews of specific products or services from online platforms.
DataScout's user interface is simple and intuitive. The whole workflow takes 3 simple steps:
- Choose Data Source: Upon opening the dashboard, you’ll be prompted to select either a CSV file upload or a Google Sheets connection.
- Preview Data: Once the file is uploaded or the Google Sheet is connected, a preview of your data will be displayed. Verify the data to ensure you have the correct file and columns.
- Select Column: From the dropdown menu, choose the primary column containing the entities you want to search (e.g., "Company" or "Product").
- Enter Search Prompt: Define a custom search prompt to specify the type of information you want to retrieve. Use placeholders, such as {company}, to dynamically insert each entity name.
  - Example Prompt: Get the email address and description for {company}.
- Refine Prompt: Click the Refine button to let DataScout enhance the prompt, optimizing it for accurate search results.
- Start Search: Click the Scout the Internet button to initiate the search. DataScout will conduct a web search for each entity based on the refined prompt, leveraging AI to extract the specific information.
- View Results: Once the search is complete, the extracted data will be displayed in a structured table on the dashboard.
- Export Options: Choose the format to export your results to.
- Confirmation: After exporting, you will receive a confirmation message that your data has been successfully written to the chosen format.
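The Select Column and Enter Search Prompt steps amount to reading one column and substituting each value into the prompt. The sketch below shows one way this could work using only the standard library. The function names are hypothetical and this is not necessarily how DataScout does it internally.

```python
import csv
import io
from typing import List


def read_entities(csv_text: str, column: str) -> List[str]:
    """Return the non-empty values of one column from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    entities = []
    for row in reader:
        # Missing cells come back as None, so fall back to an empty string.
        value = (row.get(column) or "").strip()
        if value:
            entities.append(value)
    return entities


def fill_prompt(template: str, placeholder: str, entity: str) -> str:
    """Insert one entity name into the prompt template."""
    # str.replace leaves any other braces in the prompt untouched,
    # which str.format would try to interpret.
    return template.replace("{" + placeholder + "}", entity)


if __name__ == "__main__":
    data = "Company,Country\nAcme,US\nGlobex,DE\n"
    template = "Get the email address and description for {company}"
    for name in read_entities(data, "Company"):
        print(fill_prompt(template, "company", name))
```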
To give DataScout access to Google Sheets, create a service account in Google Cloud:
- Go to Google Cloud Console.
- Select the project dropdown and click New Project to create a new project.
- Navigate to APIs & Services > Library in the Cloud Console.
- Go to Google Sheets API and select Enable.
- Go to APIs & Services > Credentials.
- Select Create Credentials > Service Account, fill in the required details, and assign the Editor role for Sheets access.
- In the Keys tab of your service account, select Add Key > Create New Key, choose JSON, and download the key file securely for later use.
- Open APIs & Services > Credentials in the Google Cloud Console.
- Create OAuth Client ID:
- Configure the OAuth consent screen (only required once).
- Under Credentials, select Create Credentials > OAuth client ID.
- Select Web application as the application type.
- Under Authorized redirect URIs, add http://localhost:8501 and http://localhost:8500 (or any other URIs that you might be using for DataScout).
- Download the JSON file containing your OAuth 2.0 credentials.
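A mismatch between the registered redirect URIs and the address the app runs on is a common cause of OAuth errors. As a quick sanity check, the sketch below reports which required URIs are missing from the downloaded client JSON. The helper name is hypothetical, and it assumes the web-application layout in which the settings sit under a top-level "web" key.

```python
import json
from typing import List


def missing_redirect_uris(client_json: str, required: List[str]) -> List[str]:
    """Return the required URIs that are absent from the client's redirect_uris."""
    config = json.loads(client_json)
    # Assumption: a web-application client keeps its settings under "web".
    registered = config.get("web", {}).get("redirect_uris", [])
    return [uri for uri in required if uri not in registered]


if __name__ == "__main__":
    sample = '{"web": {"client_id": "123", "redirect_uris": ["http://localhost:8501"]}}'
    print(missing_redirect_uris(sample, ["http://localhost:8501", "http://localhost:8500"]))
```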
- Sign up on the Tavily website.
- From the home page, generate a new API key and save it securely for future integration.
- Visit Google AI Studio or Groq.
- Sign up, create a new project, and navigate to API settings to generate an API key.
- Save this API key securely for connecting to the LLM.
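Since either a Gemini or a Groq key can be used, an application has to decide which provider to call. The sketch below shows one possible selection rule. The key names match the secrets.toml example in this README, but the function itself and the preference for Gemini when both keys are set are assumptions, not DataScout's documented behavior.

```python
from typing import Mapping


def pick_llm_provider(secrets: Mapping[str, str]) -> str:
    """Return "gemini" or "groq" depending on which API key is filled in."""
    # The search key is needed regardless of which LLM is used.
    if not secrets.get("TAVILY_API_KEY"):
        raise ValueError("TAVILY_API_KEY is missing")
    # Assumption: Gemini is preferred when both LLM keys are present.
    if secrets.get("GOOGLE_API_KEY"):
        return "gemini"
    if secrets.get("GROQ_API_KEY"):
        return "groq"
    raise ValueError("Set GOOGLE_API_KEY or GROQ_API_KEY")


if __name__ == "__main__":
    print(pick_llm_provider({"TAVILY_API_KEY": "t", "GROQ_API_KEY": "g"}))
```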
- Ensure Python 3.7 or a later version is installed on your system. Download Python if needed.
- Clone the Repository:
    git clone https://github.com/AhmedBaari/DataScout.git
    cd DataScout
- Install Dependencies:
    pip install -r requirements.txt
- Set up API Keys and Environment Variables: In the .streamlit directory, rename the secrets.toml.example file to secrets.toml and follow the instructions below.
  - Enter your Google Gemini API key (or Groq API key) and Tavily API key, which you can get from the respective websites.
    GOOGLE_API_KEY = "your gemini api key here"
    GROQ_API_KEY = "or place the groq api key here"
    TAVILY_API_KEY = "your tavily search api key here"
  - Enter your Google Sheets API credentials and OAuth 2.0 credentials by copying the respective JSON files' contents into the secrets.toml file. The service account JSON key file may look like this:
    {
      "type": "service_account",
      "project_id": "datascout",
      "private_key_id": "",
      "private_key": "",
      "client_email": "ABC@X.iam.gserviceaccount.com",
      "client_id": "123",
      "auth_uri": "https://accounts.google.com/o/oauth2/auth",
      "token_uri": "https://oauth2.googleapis.com/token",
      "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
      "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/X",
      "universe_domain": "googleapis.com"
    }
    Example of the corresponding part of the secrets.toml file:
    type = "service_account"
    project_id = "datascout"
    private_key_id = ""
    private_key = ""
    client_email = "ABC@X.iam.gserviceaccount.com"
    client_id = "123"
    auth_uri = "https://accounts.google.com/o/oauth2/auth"
    token_uri = "https://oauth2.googleapis.com/token"
    auth_provider_x509_cert_url = "https://www.googleapis.com/oauth2/v1/certs"
    client_x509_cert_url = "https://www.googleapis.com/robot/v1/metadata/x509/X"
    universe_domain = "googleapis.com"
  - Enter your Google OAuth 2.0 credentials by copying the respective JSON file's contents into the secrets.toml file. The OAuth 2.0 client JSON file may look like this:
    {
      "web": {
        "client_id": "123",
        "project_id": "datascout",
        "auth_uri": "https://accounts.google.com/o/oauth2/auth",
        "token_uri": "https://oauth2.googleapis.com/token",
        "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
        "client_secret": "abc",
        "redirect_uris": ["http://localhost:8501", "http://localhost:8500"],
        "javascript_origins": ["http://localhost:8501", "http://localhost:8500"]
      }
    }
    Example of the corresponding part of the secrets.toml file:
    client_id = "123"
    project_id = "datascout"
    auth_uri = "https://accounts.google.com/o/oauth2/auth"
    token_uri = "https://oauth2.googleapis.com/token"
    auth_provider_x509_cert_url = "https://www.googleapis.com/oauth2/v1/certs"
    client_secret = "abc"
    redirect_uris = ["http://localhost:8501", "http://localhost:8500"]
    javascript_origins = ["http://localhost:8501", "http://localhost:8500"]
- Run the Application:
    streamlit run src/app.py
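Copying the JSON credentials into secrets.toml by hand, as described in the setup step above, is error-prone, especially for a multi-line private key. The sketch below converts a flat JSON object into TOML key/value lines using only the standard library. The function name is hypothetical, and any section header that secrets.toml.example expects around these lines is not shown in this README, so it is left out here.

```python
import json


def json_to_toml_lines(json_text: str) -> str:
    """Convert a flat JSON object into TOML key/value lines."""
    data = json.loads(json_text)
    # OAuth client files wrap their fields in a single "web" object; unwrap it.
    if set(data) == {"web"}:
        data = data["web"]
    lines = []
    for key, value in data.items():
        if not isinstance(value, (str, list)):
            raise TypeError(f"unsupported value type for {key!r}")
        # For the printable strings and string lists found in these files,
        # the JSON encoding is also a valid TOML basic string or array.
        lines.append(f"{key} = {json.dumps(value, ensure_ascii=False)}")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = '{"web": {"client_id": "123", "redirect_uris": ["http://localhost:8501"]}}'
    print(json_to_toml_lines(sample))
```

Newlines inside a private key stay escaped as `\n` in the output, so each value remains on a single line.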
Contributions are welcome! Open an issue or submit a pull request.