📇 Tools to Work with the Web Archive Ecosystem in R
Updated Aug 20, 2017 - R
From WARC records to MongoDB documents
Part of my 2022 summer internship, mainly about web scraping.
Parse and create Web ARChive (WARC) files with Node.js
Discovering French Digital Literature (LIFRANUM ANR project)
Process web archives (WARC format) with StormCrawler and index content into Elasticsearch or Solr
Parser for WARC (aka WebArchive) files
metawarc: a command-line tool for extracting metadata from files stored in WARC (Web ARChive) archives
Process Common Crawl data with Python and Spark
Common Crawl's processing tools