This project contains a collection of Python scripts that scrape PDF files hosted on the U.S. Securities and Exchange Commission (SEC) website. The scripts extract information about companies that either use specific software solutions or are involved in cryptocurrency-related activities. The extracted data can show how widely certain technologies have been adopted and how common cryptocurrency-related activities are among publicly traded companies.
- PDF Scraping: Utilizes `pypdf` to extract text and data from PDF documents.
- Keyword Search: Searches for specific keywords related to software usage or cryptocurrency activities within the extracted text.
- Data Output: Provides structured data output, such as CSV files or database entries, for further analysis.
- Customizable: Easily customizable to adapt to different search criteria or PDF formats.
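
As a rough illustration of how the PDF scraping, keyword search, and data output features fit together, here is a minimal sketch. It is not the code in `main.py` or `cryptor.py`: the function names (`extract_pdf_text`, `count_keywords`, `counts_to_csv`), the whole-word matching rule, and the CSV column layout are all assumptions made for this example. Only `PdfReader`, `reader.pages`, and `page.extract_text()` come from `pypdf`.

```python
import csv
import io
import re
from typing import Dict, Iterable, List, Tuple


def extract_pdf_text(path: str) -> str:
    """Return the concatenated text of every page in the PDF at `path`."""
    # Imported lazily so the keyword and CSV helpers work without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    # Guard against pages that yield no text (for example, scanned images).
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def count_keywords(text: str, keywords: Iterable[str]) -> Dict[str, int]:
    """Count case-insensitive, whole-word occurrences of each keyword."""
    counts: Dict[str, int] = {}
    for keyword in keywords:
        # Lookarounds keep "bitcoin" from matching inside "bitcoins".
        pattern = re.compile(
            r"(?<!\w)" + re.escape(keyword) + r"(?!\w)", re.IGNORECASE
        )
        counts[keyword] = len(pattern.findall(text))
    return counts


def counts_to_csv(
    rows: List[Tuple[str, Dict[str, int]]], keywords: List[str]
) -> str:
    """Render (source, counts) pairs as CSV text, one column per keyword."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["source", *keywords])
    for source, counts in rows:
        writer.writerow([source, *(counts[k] for k in keywords)])
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical keywords and sample text, used instead of a real PDF.
    keywords = ["bitcoin", "Salesforce"]
    sample = "The adviser holds Bitcoin and uses Salesforce for client records."
    print(counts_to_csv([("sample.pdf", count_keywords(sample, keywords))], keywords))
```

To run this against a real document, pass the result of `extract_pdf_text("some_filing.pdf")` to `count_keywords`; the file name here is a placeholder. Whole-word matching is one possible design choice, and substring matching may suit some search criteria better.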
- `main.py`: Scrapes PDF documents to identify companies that mention specific software solutions.
- `cryptor.py`: Extracts information about companies involved in cryptocurrency-related activities from PDF files.
- Clone the repository to your local machine:
git clone https://github.com/otisscott/sec_scraping.git
- Install the required dependencies:
pip install -r requirements.txt
- Run the desired script, providing necessary arguments such as keywords or file paths:
python main.py
- Python 3.x
- Dependencies listed in `requirements.txt`
- Access to the internet to download PDF files from the SEC website.
- A locally saved copy of the XML file containing all of the registered investment advisers found here: https://adviserinfo.sec.gov/compilation
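
The adviser XML file can be large, and this README does not describe its layout. As a starting point, the sketch below streams through an XML file and tallies element names so the structure can be inspected before writing extraction logic. It is an assumption-laden example rather than part of the project's scripts: `tally_tags` is a hypothetical helper, and the element names in the demo are placeholders, not the real SEC schema.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def tally_tags(source) -> Counter:
    """Count how often each element tag appears, streaming through the file.

    `source` may be a file path or a binary file-like object.
    """
    counts: Counter = Counter()
    # iterparse yields each element once its closing tag has been read.
    for _event, elem in ET.iterparse(source, events=("end",)):
        counts[elem.tag] += 1
        # Drop the element's children and text to keep memory use low.
        elem.clear()
    return counts


if __name__ == "__main__":
    # Placeholder document; the real file's element names may differ.
    demo = io.BytesIO(b"<records><record><name/></record><record><name/></record></records>")
    for tag, count in tally_tags(demo).most_common():
        print(tag, count)
```

Passing the path of the locally saved XML file to `tally_tags` lists the element names it contains, which can then guide which tags and attributes to read.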
Contributions are welcome! If you find any issues or have suggestions for improvements, please open an issue or submit a pull request.
This project is intended for educational and research purposes only. The information extracted from SEC filings should be verified and used responsibly. The creators of this project are not responsible for any misuse of the data obtained through these scripts.