This repository includes the core of a pipeline that generates a reliable news database. This database is meant to be used by an AI browser extension, powered by retrieval-augmented LLMs, that finds low-credibility posts on Facebook and allows users to generate bridging responses rooted in reliable news.
Note:
- serp is used to search Google News for recent US articles from a specific list of domains.
- The text of these articles is programmatically scraped.
- We use newspaper3k to do this automatically. Note that scraping works as of July 2024 for the current list of domains. If you change the domain list or a great deal of time has passed, the scraping process may not work anymore. You should check your data!
- Each article is summarized using OpenAI's ChatGPT-3.5 Turbo.
- Summary text is inserted into a vector database for fast semantic search by the browser extension.
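
Because scraping can silently break when sites change, a quick sanity check on the scraped text can help. The following is a minimal sketch, not the repository's actual code: the function names `scrape_article` and `find_suspect_articles`, the dictionary layout, and the `MIN_WORDS` threshold are all hypothetical. It only assumes newspaper3k's `Article` class with its `download()` and `parse()` methods.

```python
from typing import Dict, List

# Hypothetical threshold: articles with fewer words than this are flagged.
MIN_WORDS = 100


def scrape_article(url: str) -> Dict[str, str]:
    """Download and parse one article with newspaper3k."""
    # Imported lazily so the check below can run without newspaper3k installed.
    from newspaper import Article

    article = Article(url)
    article.download()
    article.parse()
    return {"url": url, "title": article.title, "text": article.text}


def find_suspect_articles(
    articles: List[Dict[str, str]], min_words: int = MIN_WORDS
) -> List[str]:
    """Return URLs whose scraped text is missing or shorter than min_words."""
    suspect = []
    for item in articles:
        text = (item.get("text") or "").strip()
        if len(text.split()) < min_words:
            suspect.append(item["url"])
    return suspect
```

Running `find_suspect_articles` on a batch of scraped articles before summarization gives a list of URLs to inspect by hand. A sudden jump in flagged URLs for one domain suggests that scraping no longer works for that site.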
The entire pipeline is run by a single bash script: code/collect_summarize_update_vdb.sh
- code/: contains all code/scripts
- data/: contains all data