Time Magazine Scraper, Text Extraction (OCR), and Data Exploration with Topic Modelling
01.ipynb: Code
Open in Colab to explore the topics (and their dominant terms) or run the code.
Part 1: Scraping the Time Vault, 1923 to 2015.
Scraped Data
- Part 1 (1923 to 1930)
- Part 2 (1931 to 1940)
- Part 3 (1941 to 1950)
- Part 4 (1951 to 1960)
- Part 5 (1961 to 1970)
- Part 6 (1971 to 1980)
- Part 7 (1981 to 1990)
- Part 8 (1991 to 2000)
- Part 9 (2001 to 2015)
Part 2: Text Extraction with Tesseract OCR.
Currently, text has been extracted only for 2000-2015, because the OCR process is slow.
The extracted text is also quite noisy.
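One way to reduce OCR noise before topic modelling is a simple line-level filter. The sketch below is an assumption, not the notebook's actual code: the helper name `clean_ocr_text` and its thresholds (`min_alpha_ratio`, `min_words`) are hypothetical, and the sample string is invented for illustration.

```python
import re


def clean_ocr_text(raw: str, min_alpha_ratio: float = 0.7, min_words: int = 3) -> str:
    """Keep only lines that look like prose (hypothetical helper, illustrative thresholds)."""
    kept = []
    for line in raw.splitlines():
        # Collapse runs of whitespace that OCR often inserts between glyphs.
        line = re.sub(r"\s+", " ", line).strip()
        if not line:
            continue
        non_space = [ch for ch in line if not ch.isspace()]
        # Share of letters among the visible characters on this line.
        alpha_ratio = sum(ch.isalpha() for ch in non_space) / len(non_space)
        # Count word-like tokens of at least two letters.
        words = re.findall(r"[A-Za-z]{2,}", line)
        if alpha_ratio >= min_alpha_ratio and len(words) >= min_words:
            kept.append(line)
    return "\n".join(kept)


# Invented sample resembling noisy OCR output.
sample = (
    "The President spoke to Congress on Monday.\n"
    "~~ |_ 3# %% ..\n"
    "\n"
    "il 1 |\n"
    "Markets rallied after the announcement."
)
print(clean_ocr_text(sample))
```

The filter drops lines made mostly of symbols or stray fragments and keeps full sentences. The thresholds would likely need tuning against real pages, since short headlines and captions can be discarded along with the noise.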
Part 3: Data Exploration with Topic Modelling.
TODO: Run topic modelling for all years and interpret the resulting topics.
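To illustrate what "topics and their dominant terms" means, here is a minimal pure-Python sketch of LDA fitted with collapsed Gibbs sampling. It is not the notebook's implementation: the function names (`tokenize`, `fit_lda`, `dominant_terms`), the tiny stopword list, the hyperparameters, and the toy documents are all assumptions made for illustration. For the real corpus, a library implementation would normally be preferable.

```python
import random
import re

# Tiny illustrative stopword list (assumption, not the notebook's).
STOPWORDS = {"the", "and", "for", "was", "are", "with", "after", "again", "another"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens longer than two letters, drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower())
            if len(w) > 2 and w not in STOPWORDS]


def fit_lda(docs, n_topics=2, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; returns (vocab, topic-word counts)."""
    rng = random.Random(seed)
    tokens = [tokenize(d) for d in docs]
    vocab = sorted({w for doc in tokens for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_vocab = len(vocab)

    doc_topic = [[0] * n_topics for _ in tokens]           # tokens per (doc, topic)
    topic_word = [[0] * n_vocab for _ in range(n_topics)]  # tokens per (topic, word)
    topic_total = [0] * n_topics                           # tokens per topic
    assignments = []

    # Random initial topic for every token.
    for d, doc in enumerate(tokens):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(tokens):
            for i, w in enumerate(doc):
                wid = word_id[w]
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                # Conditional weight of each topic given all other assignments.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wid] + beta)
                    / (topic_total[k] + n_vocab * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1
    return vocab, topic_word


def dominant_terms(vocab, topic_word, top_n=5):
    """Return the top_n highest-count terms for each topic."""
    return [
        [vocab[i] for i in sorted(range(len(vocab)), key=lambda i: -row[i])[:top_n]]
        for row in topic_word
    ]


# Invented toy documents, standing in for cleaned OCR text.
docs = [
    "The senate passed the budget bill after a long debate.",
    "Voters and the senate debate the budget again.",
    "The team won the championship game in the final inning.",
    "Fans cheered as the team won another game.",
]
vocab, topic_word = fit_lda(docs, n_topics=2)
topics = dominant_terms(vocab, topic_word, top_n=3)
for k, terms in enumerate(topics):
    print(f"Topic {k}: {', '.join(terms)}")
```

On a corpus this small the topics are not meaningful. The point is only the shape of the output: one ranked list of terms per topic, which is what the Colab notebook lets you explore.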