Add ability to archive the monthly text/gzip files
philgyford committed Oct 12, 2013
2 parents 35ef67e + faa8890 commit 0b560b9
Showing 4 changed files with 84 additions and 12 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -1 +1,2 @@
*.pyc
MailmanArchiveScraper.cfg
4 changes: 2 additions & 2 deletions MailmanArchiveScraper.py
@@ -1,6 +1,6 @@
"""
* Scrapes the archive pages of one or more lists in a Mailman installation and republishes the contents, with an optional RSS feed.
-* v1.13, 2010-01-15
+* v1.2, 2013-10-12
* http://github.com/philgyford/mailman-archive-scraper/
*
* Only works with Monthly archives at the moment.
@@ -33,7 +33,7 @@ def publish_extensions(self, handler):
PyRSS2Gen._opt_element(handler, "content:encoded", self.content)


-class MailmanArchiveScraper:
+class MailmanArchiveScraper(object):
"""
Scrapes the archive pages of one or more lists in a Mailman installation and republishes the contents.
"""
43 changes: 43 additions & 0 deletions MailmanGzTextScraper.py
@@ -0,0 +1,43 @@
import os
from BeautifulSoup import BeautifulSoup
from MailmanArchiveScraper import MailmanArchiveScraper

"""
Download the gzip text file with the month's messages.
"""
class MailmanGzTextScraper(MailmanArchiveScraper):

def __init__(self):
super(MailmanGzTextScraper, self).__init__()
self.local_dir = self.publish_dir + 'text'
if not os.path.exists(self.local_dir):
os.mkdir(self.local_dir)

"""
fetch the the whole month's message as gzipped text
"""
def scrapeList(self):
source = self.fetchPage(self.list_url)
filtered_source = self.filterPage(source)
soup = BeautifulSoup(source)


for row in soup.first('table')('tr')[1:]:
rel_url = row('td')[2]('a')[0].get('href')
source = self.fetchPage(self.list_url + '/' + rel_url)

local_month = open(self.local_dir + '/' + rel_url, 'w')
local_month.write(source)
local_month.close()



def main():
scraper = MailmanGzTextScraper()
scraper.scrape()


if __name__ == "__main__":
main()
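
Once fetched, a month's file can be read back with Python's standard `gzip` module. A minimal sketch, using a hypothetical path built from a `publish_dir` of `archive/` plus the `text` subdirectory (Mailman names monthly files like `2013-October.txt.gz`):

    import gzip

    # Hypothetical example path; adjust to your own publish_dir.
    path = 'archive/text/2013-October.txt.gz'
    f = gzip.open(path, 'rb')
    messages = f.read()
    f.close()
    print(messages[:200])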


48 changes: 38 additions & 10 deletions README.markdown
@@ -1,19 +1,19 @@
# Mailman Archive Scraper

By Phil Gyford <phil@gyford.com>
-v1.13, 2010-01-15
+v1.2, 2013-10-12

Latest version is available from <http://github.com/philgyford/mailman-archive-scraper/>

-This script will scrape the archive pages generated by the Mailman mailing list manager <http://www.gnu.org/software/mailman/index.html> and republish them as files on the local file system. In addition it can optionally do a number of things:
+These scripts will scrape the archive pages generated by the Mailman mailing list manager <http://www.gnu.org/software/mailman/index.html> and republish them as files on the local file system. In addition they can optionally do a number of things:

* Create an RSS feed of recent messages.
* Scrape private Mailman archives (if you have a valid email address and password).
* Remove all email addresses from the files (both those in 'phil@gyford.com' and 'phil at gyford dot com' format; see the sketch after this list).
* Replace the URL for the 'more info on this list' links with another.
* Remove one or more levels of quoted emails.
* Search and replace any custom strings you specify.
-* Add custom HTML into the <head></head> section of the re-published pages.
+* Add custom HTML into the `<head></head>` section of the re-published pages.
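
As an illustration of the address-removal feature noted above, both formats could be matched with regular expressions along these lines; this is a sketch of the idea, not the script's actual implementation:

    import re

    def scrub_addresses(text):
        # 'phil@gyford.com' style addresses.
        text = re.sub(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+', 'address removed', text)
        # 'phil at gyford dot com' style addresses.
        text = re.sub(r'\b[\w.+-]+ at [\w-]+(?: dot [\w-]+)+\b', 'address removed', text)
        return text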

Why would you want to do this? Three reasons:

@@ -33,28 +33,56 @@ This script doesn't store any state locally between sessions so every time it's
## Installation

1. Put the directory containing the MailmanArchiveScraper.py script somewhere you want to run it from.
-2. Make a copy of the MailmanArchiveScraper-example.cfg file and call it MailmanArchiveScraper.cfg.
+2. Make a copy of the `MailmanArchiveScraper-example.cfg` file and name it `MailmanArchiveScraper.cfg`.

3. Set the configuration options in that file (see below).

4. Install the required extra python modules:
* BeautifulSoup <http://www.crummy.com/software/BeautifulSoup/>
* ClientForm <http://wwwsearch.sourceforge.net/ClientForm/>
* Mechanize <http://wwwsearch.sourceforge.net/mechanize/>
* PyRSS2Gen <http://www.dalkescientific.com/Python/PyRSS2Gen.html>

Best done with [pip](https://pypi.python.org/pypi/pip) and `pip install -r requirements.txt` (a possible `requirements.txt` is sketched below).

-5. Make sure the MailmanArchiveScraper.py script is executable (chmod +x).
+5. Make sure the `MailmanArchiveScraper.py` script is executable (`chmod +x`), and the `MailmanGzTextScraper.py` script too if you need it.
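
If you need to assemble `requirements.txt` yourself, it would simply list the four modules above by their PyPI package names; exact versions are an assumption, so none are pinned in this sketch:

    BeautifulSoup
    ClientForm
    mechanize
    PyRSS2Gen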


## Configuration

There is help in the configuration file for each setting. The minimum things you'll need to set are:

-1. domain -- The domain name that your Mailman pages are on.
-2. list_name -- Name of your mailing list.
-3. email and password -- Required if your Mailman archive is password protected.
-4. publish_dir -- The path to the local directory the files should be republished to.
-5. publish_url - If you're going to publish the messages to a website.
+1. `domain` -- The domain name that your Mailman pages are on.
+2. `list_name` -- Name of your mailing list.
+3. `email` and `password` -- Required if your Mailman archive is password protected.
+4. `publish_dir` -- The path to the local directory the files should be republished to.
+5. `publish_url` -- If you're going to publish the messages to a website.
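
As an illustration, a minimal `MailmanArchiveScraper.cfg` might look like this; the section name and all values here are assumptions for the sketch, so follow the comments in `MailmanArchiveScraper-example.cfg` for the real settings:

    [DEFAULT]
    domain: lists.example.com
    list_name: mylist
    email: you@example.com
    password: secret
    publish_dir: /var/www/mylist-archive/
    publish_url: http://example.com/mylist-archive/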


## Usage

Once configuration is done, run the script:

$ python ./MailmanArchiveScraper.py

All being well, the HTML archive files will be downloaded. Set the `verbose` setting in the configuration file to see a list of which files are being fetched.

If you want to download the plaintext files that Mailman saves for each month's messages (which may be gzipped), then run this script:

$ python ./MailmanGzTextScraper.py

After an initial run, you can run the script via cron to keep an updated copy of the HTML and/or text files. Note the `hours_to_go_back` setting in the config file, which will probably need to be different for the first run than for subsequent, regular runs.
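
For example, a crontab entry along these lines (the install path here is an assumption) would refresh the archive every hour:

    0 * * * * cd /path/to/mailman-archive-scraper && python MailmanArchiveScraper.py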


## What would also be nice:

* Sending each message on as an email. I can't see how to do this simply, given that the script retains no state between runs, so it can't tell which emails have already been sent.


## Contributors

Many thanks to:

* [CyberRodent](https://github.com/cyberrodent) for the text/gzip file archiving.
