A robust Python project for downloading and parsing SEC EDGAR filings with parallel processing capabilities, proxy support, and metadata extraction.
sec_edgar_bulker/
├── __init__.py  # Package initialization
├── cli.py  # Command-line interface
├── config.py  # Configuration handling
├── exceptions.py  # Custom exceptions
├── core/  # Core functionality
│   ├── __init__.py
│   └── downloader.py  # Main downloader implementation
├── models/  # Data models
│   ├── __init__.py
│   ├── exhibit.py  # Exhibit data models
│   └── filing.py  # Filing data models
├── network/  # Network handling
│   ├── __init__.py
│   ├── headers.py  # HTTP headers management
│   ├── proxy.py  # Proxy management
│   └── session.py  # Session management
├── parsers/  # Filing parsers
│   ├── __init__.py
│   ├── parser.py  # Main parser implementation
│   └── saver.py  # File saving utilities
├── utils/  # Utility functions
│   ├── __init__.py
│   ├── config_generator.py  # Config file generation
│   ├── logging.py  # Logging configuration
│   ├── progress.py  # Progress tracking
│   └── validation.py  # Input validation
└── tests/  # Test suite
    ├── __init__.py
    ├── conftest.py  # Test configuration
    ├── test_filing_parser.py
    ├── test_metadata_download.py
    ├── data/  # Test data
    │   └── proxies.txt
    ├── integration/  # Integration tests
    │   └── test_downloader.py
    ├── network/  # Network tests
    │   ├── test_headers.py
    │   └── test_session.py
    └── parsers/  # Parser tests
        ├── test_filing_header.py
        └── test_filing_processor.py
- Quarterly Bulk Downloads: Download entire quarters of SEC filings in parallel
- Smart Resume: Automatically continues from last downloaded file if interrupted
- Proxy Support: Rotates through multiple proxies to avoid rate limiting
- Configurable Retry Logic: Attempts up to 10 retries on network failures
- Progress Tracking: Real-time progress bars using tqdm
- Form Types (coming soon): Downloads various SEC forms, including:
  - 10-K (Annual Reports)
  - 10-Q (Quarterly Reports)
  - 8-K (Current Reports)
  - 20-F (Foreign Private Issuer Reports)
  - DEF 14A (Proxy Statements)
- Parallel Processing: Multi-threaded downloading and parsing
- Raw File Storage: Saves unprocessed files for custom parsing
- Incremental Updates: Only downloads new or modified filings
- Company Filtering: Filter downloads by CIK or company name
- Date Range Filtering: Specify custom date ranges for downloads
- Index File Processing: Parses master.idx files for efficient lookups
- Graceful Error Recovery: Continues processing despite network issues
- Detailed Logging: Comprehensive error and warning messages
- Network Error Handling: Manages proxy timeouts and connection issues
- Corrupt File Detection: Validates downloaded files for integrity
- Structured Output: Organizes files by year, quarter, and form type
- Metadata Extraction: Pulls key information from filing headers
- CIK Directory Structure: Maintains company-specific file organization
- Index Generation: Creates searchable indices of downloaded content
- Memory Efficient: Streams large files instead of loading entirely
- Bandwidth Management: Throttles downloads to respect SEC limits
- Cache Management: Maintains local cache of frequently accessed data
- Resource Monitoring: Tracks CPU and memory usage during processing
- Progress Visualization: Shows download and processing progress
- Status Updates: Displays current operation and estimated completion
- Error Reporting: Clear error messages with suggested solutions
- Activity Logs: Maintains detailed logs of all operations
- API Integration: Easy integration with data analysis pipelines
- Parsing Capabilities: Exports metadata in JSONL format
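
The index-file processing listed above can be pictured with a minimal sketch. It assumes the standard EDGAR `master.idx` layout (a short preamble, a dashed separator line, then pipe-delimited `CIK|Company Name|Form Type|Date Filed|Filename` rows). The function name `parse_master_idx` is hypothetical and is not the project's actual API; the output keys follow the ones that appear in this project's resume log message.

```python
from typing import Dict, Iterable, Iterator


def parse_master_idx(lines: Iterable[str], master_file: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per submission row of a master.idx file (hypothetical helper)."""
    in_body = False
    for raw in lines:
        line = raw.rstrip("\n")
        if not in_body:
            # Data rows begin after the dashed separator under the column header.
            in_body = line.startswith("-----")
            continue
        parts = line.split("|")
        if len(parts) != 5:
            continue  # skip blank or malformed rows
        cik, company_name, form_type, date_filed, filename = parts
        # "edgar/data/<cik>/<accession>.txt" -> accession number without dashes
        accession = filename.rsplit("/", 1)[-1].split(".")[0].replace("-", "")
        yield {
            "cik": cik,
            "company_name": company_name,
            "form_type": form_type,
            "date_filed": date_filed,
            "filename": filename,
            "accession_number": accession,
            "master_file": master_file,
        }


# Hand-written sample in the master.idx layout, using values from this README.
sample = [
    "Description: Master Index of EDGAR Dissemination Feed\n",
    "\n",
    "CIK|Company Name|Form Type|Date Filed|Filename\n",
    "--------------------------------------------------------------------------------\n",
    "1821534|Exodus Movement, Inc.|1-U|2021-11-16|edgar/data/1821534/0001140361-21-038137.txt\n",
]
rows = list(parse_master_idx(sample, "master42021"))
print(rows[0]["accession_number"])
```

Because the function consumes any iterable of lines, an open file handle can be passed directly, which keeps memory use flat for large index files.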
# Required configurations
years: [2022, 2023]  # List of years to process
quarters: [1, 2, 3, 4]  # List of quarters to process
http:
  settings:
    static_headers:
      User-Agent: 'Mozilla/5.0 ... SEC-Downloader your.email@example.com'  # SEC requires your email
      Accept-Encoding: 'gzip, deflate'
      Host: 'www.sec.gov'
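
To make the effect of `years` and `quarters` concrete, the sketch below expands them into quarterly index locations. The URL pattern is EDGAR's public full-index layout, and the `master{quarter}{year}` label mirrors names such as `master12022.idx` seen in this project's logs. The function name `index_targets` is an assumption for illustration only.

```python
from itertools import product
from typing import List, Sequence, Tuple

# Public EDGAR full-index location for a given year and quarter.
EDGAR_INDEX_URL = "https://www.sec.gov/Archives/edgar/full-index/{year}/QTR{quarter}/master.idx"


def index_targets(years: Sequence[int], quarters: Sequence[int]) -> List[Tuple[str, str]]:
    """Return (label, url) pairs for every configured year/quarter (hypothetical helper)."""
    targets = []
    for year, quarter in product(years, quarters):
        if quarter not in (1, 2, 3, 4):
            raise ValueError(f"quarter must be 1-4, got {quarter!r}")
        label = f"master{quarter}{year}"  # naming assumed from the log output
        targets.append((label, EDGAR_INDEX_URL.format(year=year, quarter=quarter)))
    return targets


targets = index_targets([2022, 2023], [1, 2, 3, 4])
for label, url in targets[:2]:
    print(label, url)
```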
While proxies are not strictly required, they are recommended for large-scale downloads to avoid rate limiting. Proxies should be provided in a separate file (proxies.txt) in the format:
ip:port:username:password
Example proxy entry (one per line in proxies.txt):
191.96.104.139:5876:username:password
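
As an illustration of the `ip:port:username:password` format, here is a minimal sketch that turns one line into a proxies mapping of the kind accepted by HTTP clients such as `requests`. The helper name `parse_proxy_line` and the choice of the `http://` scheme are assumptions; the project's own `proxy.py` may handle this differently.

```python
from typing import Dict
from urllib.parse import quote


def parse_proxy_line(line: str) -> Dict[str, str]:
    """Convert 'ip:port:username:password' into a proxies mapping (hypothetical helper)."""
    # maxsplit=3 keeps any ':' characters inside the password intact.
    parts = line.strip().split(":", 3)
    if len(parts) != 4:
        raise ValueError(f"expected ip:port:username:password, got {line!r}")
    host, port, username, password = parts
    # Percent-encode credentials so special characters survive inside a URL.
    proxy_url = f"http://{quote(username, safe='')}:{quote(password, safe='')}@{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}


print(parse_proxy_line("191.96.104.139:5876:username:password")["http"])
```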
Configure proxy usage in config.yaml:
proxies:
  enabled: true
  file: "proxies.txt"  # Path to proxy list file
  max_retries: 10  # Retries per proxy
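
One way to read "retries per proxy" is sketched below: each proxy gets up to `max_retries` attempts before the next one is tried. This is an interpretation of the config comment rather than the project's confirmed behavior, and `fetch_with_proxies` and its `fetch` callback are hypothetical names.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def fetch_with_proxies(
    fetch: Callable[[str], T],
    proxies: Sequence[str],
    max_retries: int = 10,
) -> T:
    """Call fetch(proxy), retrying each proxy up to max_retries times (sketch)."""
    last_error: Optional[Exception] = None
    for proxy in proxies:
        for _attempt in range(1, max_retries + 1):
            try:
                return fetch(proxy)
            except (ConnectionError, TimeoutError) as exc:
                last_error = exc  # remember the failure, then retry or move on
    raise RuntimeError("all proxies exhausted") from last_error
```

Passing the network call in as a callback keeps the retry policy independent of any particular HTTP library.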
The downloader saves files in JSONL format with the following structure:
{"_id": "11d89d83-e718-440b-bca2-d56d3731610f", "url": "https://www.sec.gov/Archives/edgar/data/1821534/000114036121038137/brhc10030255_ex99-1.htm", "sec_document": "0001140361-21-038137.txt : 20211116", "document_type": "ADD EXHB", "content_type": "text", "metadata": {"cik": "1821534", "company_name": "Exodus Movement, Inc.", "form_type": "1-U", "date_filed": "2021-11-16", "accession_number": "000114036121038137", "master_file": "master42021", "sec_document": "0001140361-21-038137.txt : 20211116", "filename": "brhc10030255_ex99-1.htm", "sequence": "2", "description": "EXHIBIT 99.1", "title": "", "filing_form_type": "1-U", "conformed_submission_type": "1-U", "standard_industrial_classification": "FINANCE SERVICES", "classification_number": "6199"}, "document_content": "<DOCUMENT>\n<TYPE>ADD EXHB\n<SEQUENCE>2\n<FILENAME>brhc10030255_ex99-1.htm\n<DESCRIPTION>EXHIBIT 99.1\n<TEXT>\n<html>\n <head>\n <title></title>\n <!-- Licensed to: Broadridge\n Document created using EDGARfilings PROfile 8.0.0.0\n Copyright 1995 - 2021 Broadridge -->\n </head></DOCUMENT>\n"}
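
Since each output line is a standalone JSON object, downstream code can stream the file one record at a time. The sketch below counts records by `metadata.form_type`; the helper names are illustrative, and the sample lines are abbreviated, made-up records that only keep the keys used here.

```python
import json
from collections import Counter
from typing import Dict, Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one parsed record per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_by_form_type(lines: Iterable[str]) -> Counter:
    """Count records by the form_type stored under 'metadata'."""
    return Counter(record["metadata"]["form_type"] for record in iter_records(lines))


# Abbreviated, made-up records for illustration only.
sample_lines = [
    '{"_id": "a", "document_type": "ADD EXHB", "metadata": {"form_type": "1-U"}}',
    "",
    '{"_id": "b", "document_type": "EX-10.1", "metadata": {"form_type": "8-K"}}',
    '{"_id": "c", "document_type": "EX-99.1", "metadata": {"form_type": "8-K"}}',
]
counts = count_by_form_type(sample_lines)
print(counts)
```

With a real output file, an open handle (`open(path, encoding="utf-8")`) can be passed in place of `sample_lines`.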
download:
  pdfs: true  # Download PDF versions if available
  metadata_only: false  # Only download metadata, skip documents
  max_workers: 400  # Parallel download workers
  worker_timeout: 4000  # Seconds before worker timeout
  batch_size: 100  # Items per batch for processing
output:
  filings_to_disk: false  # Save raw filings to disk
  whole_filings: false  # Save entire filing as single JSON
directories:
  output: "output"  # Main output directory
  exhibits: "exhibits_download"
  filings: "filings_download"
  logs: "logs"
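
The interaction between `max_workers` and `batch_size` can be sketched as follows: items are cut into batches, and each batch is fanned out to a bounded pool. This uses a thread pool purely for illustration; the project's actual concurrency model may differ, and both function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process_in_batches(
    items: Iterable[T],
    worker: Callable[[T], R],
    max_workers: int,
    batch_size: int,
) -> List[R]:
    """Run worker over items, one batch at a time, with a bounded thread pool."""
    results: List[R] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for batch in batched(items, batch_size):
            # pool.map preserves input order within each batch.
            results.extend(pool.map(worker, batch))
    return results


squares = process_in_batches(range(10), lambda n: n * n, max_workers=4, batch_size=3)
print(squares)
```

Smaller batches bound how many results are held in flight at once, while `max_workers` caps the number of simultaneous requests.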
The downloader processes each document with the following structure:
document:
  output_format:
    main_keys:  # Primary document fields
      - _id  # Unique document identifier
      - url  # SEC.gov URL
      - sec_document  # SEC document identifier
      - type  # Document type
      - document_content  # Actual content
    metadata_keys:  # Metadata fields
      - cik
      - company_name
      - form_type
      - date_filed
      - accession_number
      - master_file
      - content_type
      - filename
      - sequence
      - description
      - title
Available metadata fields:
- Default Fields (Always Extracted):
  - cik
  - company_name
  - form_type
  - date_filed
  - accession_number
  - master_file
  - sec_document
  - filename
  - sequence
  - description
  - title
  - filing_form_type
  - conformed_submission_type
  - standard_industrial_classification
  - classification_number
- Optional Header Fields:

      header:
        enabled: false
        fields:
          - acceptance_datetime
          - public_document_count
          - conformed_period_of_report
          - filed_as_of_date
          - date_as_of_change
          - sec_act
          - sec_file_number
          - film_number

- Optional Company Data:

      company:
        enabled: false
        fields:
          - irs_number
          - state_of_incorporation
          - fiscal_year_end
          - business_address
          - mail_address

The downloader supports custom header parsing patterns:
header_parsing:
  use_config: false  # Use default parsing if false
  default:
    patterns:
      accession_number: "ACCESSION NUMBER:\\s*(\\S+)"
      company_name: "COMPANY CONFORMED NAME:\\s*(.+?)\\n"
      cik: "CENTRAL INDEX KEY:\\s*(\\d+)"
  custom:
    sections:
      - name: "IDENTIFICATION"
        start: "<SEC-HEADER>"
        end: "FILING VALUES"
        fields:
          - name: "accession_number"
            pattern: "ACCESSION NUMBER:\\s*(\\S+)"
          - name: "company_name"
            pattern: "COMPANY CONFORMED NAME:\\s*(.+?)\\n"
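
To show what these patterns capture, the sketch below applies the three default regular expressions to a short header. In YAML double-quoted strings `\\s` becomes the regex `\s`, so the Python raw strings here are the equivalent patterns. The sample header is hand-written and abbreviated, using values from the JSONL example in this README, and `extract_header_fields` is a hypothetical helper.

```python
import re
from typing import Dict, Optional

# Same patterns as the `default.patterns` config, written as raw strings.
PATTERNS = {
    "accession_number": r"ACCESSION NUMBER:\s*(\S+)",
    "company_name": r"COMPANY CONFORMED NAME:\s*(.+?)\n",
    "cik": r"CENTRAL INDEX KEY:\s*(\d+)",
}


def extract_header_fields(header_text: str, patterns: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Return the first capture group of each pattern, or None when it does not match."""
    fields: Dict[str, Optional[str]] = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, header_text)
        fields[name] = match.group(1).strip() if match else None
    return fields


# Hand-written, abbreviated header for illustration only.
sample_header = (
    "<SEC-HEADER>0001140361-21-038137.txt : 20211116\n"
    "ACCESSION NUMBER:\t\t0001140361-21-038137\n"
    "CONFORMED SUBMISSION TYPE:\t1-U\n"
    "FILER:\n"
    "\tCOMPANY DATA:\n"
    "\t\tCOMPANY CONFORMED NAME:\t\t\tExodus Movement, Inc.\n"
    "\t\tCENTRAL INDEX KEY:\t\t\t1821534\n"
)
fields = extract_header_fields(sample_header, PATTERNS)
print(fields)
```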
Required headers for SEC.gov access:
navigation_headers:
  use_random: false  # Use static headers (recommended)
  settings:
    static_headers:
      User-Agent: 'Mozilla/5.0 ... SEC-Downloader your.email@example.com'
      Accept-Encoding: 'gzip, deflate'
      Host: 'www.sec.gov'
- Worker Configuration:
  - Adjust `max_workers` based on available CPU cores
  - Set `worker_timeout` based on network conditions
  - Tune `batch_size` for memory optimization
- Network Settings:
  - Configure proxy timeouts (default: 300 seconds)
  - Set retry attempts (default: 10)
  - Adjust request delays to comply with SEC.gov rate limits
- Storage Options:
  - Enable `filings_to_disk` for raw file storage
  - Use `metadata_only` for quick indexing
  - Enable `whole_filings` for complete filing preservation
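
The request-delay advice can be illustrated with a small limiter that spaces calls at least `1 / max_per_second` apart. This is a single-threaded sketch and not the project's implementation. The figure of 10 requests per second in the usage comment reflects the SEC's published fair-access guidance at the time of writing and should be checked against the current policy. The class name and its injectable `clock` and `sleep` parameters are assumptions made for testability.

```python
import time
from typing import Callable


class RateLimiter:
    """Space successive calls at least 1/max_per_second seconds apart (sketch)."""

    def __init__(
        self,
        max_per_second: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0  # earliest time the next call may proceed

    def wait(self) -> float:
        """Block until the next call is allowed; return the delay that was applied."""
        now = self._clock()
        delay = max(self._next_allowed - now, 0.0)
        if delay:
            self._sleep(delay)
        self._next_allowed = max(now, self._next_allowed) + self.min_interval
        return delay


# Usage: call limiter.wait() immediately before each request.
limiter = RateLimiter(max_per_second=10)
limiter.wait()
```

With many workers or several proxies, a shared limiter would also need a lock; that is left out here to keep the idea visible.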
logging:
  level: "INFO"
  format: "%(asctime)s - %(levelname)s - %(message)s"
  file_enabled: true
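
For orientation, the settings above map naturally onto Python's standard `logging` module. The sketch below is one possible wiring, not the project's `utils/logging.py`; the function name `build_logger`, the logger name, and the `log_path` parameter are assumptions.

```python
import logging
from typing import Dict, Optional

LOGGING_CONFIG = {
    "level": "INFO",
    "format": "%(asctime)s - %(levelname)s - %(message)s",
    "file_enabled": True,
}


def build_logger(
    config: Dict,
    name: str = "sec_edgar_bulker",
    log_path: Optional[str] = None,
) -> logging.Logger:
    """Create a logger from the `logging` config section (hypothetical helper)."""
    logger = logging.getLogger(name)
    logger.setLevel(getattr(logging, config["level"]))
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    formatter = logging.Formatter(config["format"])

    console = logging.StreamHandler()
    console.setFormatter(formatter)
    logger.addHandler(console)

    # Only write to a file when enabled and a path is supplied by the caller.
    if config.get("file_enabled") and log_path:
        file_handler = logging.FileHandler(log_path, encoding="utf-8")
        file_handler.setFormatter(formatter)
        logger.addHandler(file_handler)
    return logger


logger = build_logger(LOGGING_CONFIG)
logger.info("Found %d submissions", 291583)
```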
Monitor progress through:
- Download statistics
- Worker status
- Network errors
- Processing completion rates
[Progress Bars]
Processing master12022.idx: 100%|██████████████| 104259/104259 [27:37<00:00, 62.88it/s]
[Information Messages]
2024-XX-XX XX:XX:XX - INFO - Found 291583 submissions in masterXXXXX.idx
2024-XX-XX XX:XX:XX - INFO - Resuming from last checkpoint
[Warning Messages]
2024-XX-XX XX:XX:XX - ERROR - Network error with proxy XXX.XXX.XXX.XXX:XXXX (attempt X/10)
2024-XX-XX XX:XX:XX - ERROR - Timeout error with proxy XXX.XXX.XXX.XXX:XXXX (attempt X/10)
[Start Messages]
2024-12-08 19:21:08,618 - INFO - No progress file found
2024-12-08 19:21:09,397 - INFO - Found 291583 submissions in master22022.idx
2024-12-08 19:21:09,400 - INFO - Found 291583 total submissions
2024-12-08 19:21:09,950 - INFO - Found 291583 pending submissions
2024-XX-XX XX:XX:XX - INFO - Resuming from {'cik': 'XXXXXXX', 'company_name': 'ACME INC', 'form_type': 'XX-X', 'date_filed': 'YYYY-MM-DD', 'filename': 'edgar/data/XXXXXXX/XXXXXXXXXX-XX-XXXXXX.txt', 'accession_number': 'XXXXXXXXXXXXX', 'master_file': 'masterXXXXX'}
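
The resume behavior suggested by these messages can be sketched as follows: given the last recorded submission, skip everything up to and including it. How the project actually stores and matches its checkpoint is not documented here, so the function name `pending_after` and the choice to match on `accession_number` are assumptions.

```python
from typing import Dict, List, Optional, Sequence


def pending_after(
    submissions: Sequence[Dict[str, str]],
    checkpoint: Optional[Dict[str, str]],
) -> List[Dict[str, str]]:
    """Return the submissions that still need processing (sketch)."""
    if checkpoint is None:
        # Corresponds to "No progress file found": everything is pending.
        return list(submissions)
    for position, submission in enumerate(submissions):
        if submission["accession_number"] == checkpoint["accession_number"]:
            return list(submissions[position + 1:])
    # Checkpoint not present in this index: process everything.
    return list(submissions)


# Placeholder accession numbers for illustration only.
all_submissions = [
    {"accession_number": "A"},
    {"accession_number": "B"},
    {"accession_number": "C"},
]
print(pending_after(all_submissions, {"accession_number": "B"}))
```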
- Typical quarter processing time: ~1 hour
- Processing speed: ~60-70 items per second
- Network dependent: May vary based on proxy performance
We recommend using uv for fast, reliable Python package management:
```bash
pip install sec-edgar-downloader
```
## Usage
```python
import asyncio

from sec_edgar_downloader import SECDownloader, DownloaderConfig


async def main() -> None:
    # Load configuration from config.yaml
    config = DownloaderConfig.from_yaml()
    # Initialize downloader
    downloader = SECDownloader(config)
    # Download filings for the configured years (awaitable, so it needs an event loop)
    await downloader.download_years()
    # Check statistics
    print(f"Statistics: {downloader.stats}")


asyncio.run(main())
```
The downloader keeps track of the following statistics:
- `total_processed`: Total number of filings processed
- `ex10_matches`: Number of filings containing EX-10 exhibits
- `not_ex10_matches`: Number of filings without EX-10 exhibits
MIT License