Add ability to archive the monthly text/gzip files
philgyford committed Oct 12, 2013
2 parents 35ef67e + faa8890 commit 0b560b9
Showing 4 changed files with 84 additions and 12 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -1 +1,2 @@
*.pyc
MailmanArchiveScraper.cfg
4 changes: 2 additions & 2 deletions MailmanArchiveScraper.py
@@ -1,6 +1,6 @@
"""
* Scrapes the archive pages of one or more lists in a Mailman installation and republishes the contents, with an optional RSS feed.
-* v1.13, 2010-01-15
+* v1.2, 2013-10-12
* http://github.com/philgyford/mailman-archive-scraper/
*
* Only works with Monthly archives at the moment.
@@ -33,7 +33,7 @@ def publish_extensions(self, handler):
PyRSS2Gen._opt_element(handler, "content:encoded", self.content)


-class MailmanArchiveScraper:
+class MailmanArchiveScraper(object):
"""
Scrapes the archive pages of one or more lists in a Mailman installation and republishes the contents.
"""
43 changes: 43 additions & 0 deletions MailmanGzTextScraper.py
@@ -0,0 +1,43 @@
import os
from BeautifulSoup import BeautifulSoup
from MailmanArchiveScraper import MailmanArchiveScraper

"""
Download the gzip text file with the month's messages.
"""
class MailmanGzTextScraper(MailmanArchiveScraper):

def __init__(self):
super(MailmanGzTextScraper, self).__init__()
self.local_dir = self.publish_dir + 'text'
if not os.path.exists(self.local_dir):
os.mkdir(self.local_dir)

"""
fetch the the whole month's message as gzipped text
"""
def scrapeList(self):
source = self.fetchPage(self.list_url)
filtered_source = self.filterPage(source)
soup = BeautifulSoup(source)


for row in soup.first('table')('tr')[1:]:
rel_url = row('td')[2]('a')[0].get('href')
source = self.fetchPage(self.list_url + '/' + rel_url)

local_month = open(self.local_dir + '/' + rel_url, 'w')
local_month.write(source)
local_month.close()



def main():
scraper = MailmanGzTextScraper()
scraper.scrape()


if __name__ == "__main__":
main()
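
Once fetched, a month's file can be read back with Python's standard `gzip` module. A minimal sketch, using a hypothetical path built from a `publish_dir` of `archive/` plus the `text` subdirectory (Mailman names monthly files like `2013-October.txt.gz`):

    import gzip

    # Hypothetical example path; adjust to your own publish_dir.
    path = 'archive/text/2013-October.txt.gz'
    f = gzip.open(path, 'rb')
    messages = f.read()
    f.close()
    print(messages[:200])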


48 changes: 38 additions & 10 deletions README.markdown
@@ -1,19 +1,19 @@
# Mailman Archive Scraper

By Phil Gyford <phil@gyford.com>
-v1.13, 2010-01-15
+v1.2, 2013-10-12

Latest version is available from <http://github.com/philgyford/mailman-archive-scraper/>

-This script will scrape the archive pages generated by the Mailman mailing list manager <http://www.gnu.org/software/mailman/index.html> and republish them as files on the local file system. In addition it can optionally do a number of things:
+These scripts will scrape the archive pages generated by the Mailman mailing list manager <http://www.gnu.org/software/mailman/index.html> and republish them as files on the local file system. In addition they can optionally do a number of things:

* Create an RSS feed of recent messages.
* Scrape private Mailman archives (if you have a valid email address and password).
* Remove all email addresses from the files (both those in 'phil@gyford.com' and 'phil at gyford dot com' format; see the sketch after this list).
* Replace the URL for the 'more info on this list' links with another.
* Remove one or more levels of quoted emails.
* Search and replace any custom strings you specify.
-* Add custom HTML into the <head></head> section of the re-published pages.
+* Add custom HTML into the `<head></head>` section of the re-published pages.
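
As an illustration of the address-removal feature noted above, both formats could be matched with regular expressions along these lines; this is a sketch of the idea, not the script's actual implementation:

    import re

    def scrub_addresses(text):
        # 'phil@gyford.com' style addresses.
        text = re.sub(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+', 'address removed', text)
        # 'phil at gyford dot com' style addresses.
        text = re.sub(r'\b[\w.+-]+ at [\w-]+(?: dot [\w-]+)+\b', 'address removed', text)
        return text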

Why would you want to do this? Three reasons:

@@ -33,28 +33,56 @@ This script doesn't store any state locally between sessions so every time it's
## Installation

1. Put the directory containing the MailmanArchiveScraper.py script somewhere you want to run it from.
-2. Make a copy of the MailmanArchiveScraper-example.cfg file and call it MailmanArchiveScraper.cfg.
+2. Make a copy of the `MailmanArchiveScraper-example.cfg` file and name it `MailmanArchiveScraper.cfg`.

3. Set the configuration options in that file (see below).

4. Install the required extra python modules:
* BeautifulSoup <http://www.crummy.com/software/BeautifulSoup/>
* ClientForm <http://wwwsearch.sourceforge.net/ClientForm/>
* Mechanize <http://wwwsearch.sourceforge.net/mechanize/>
* PyRSS2Gen <http://www.dalkescientific.com/Python/PyRSS2Gen.html>

Best done with [pip](https://pypi.python.org/pypi/pip) and `pip install -r requirements.txt` (a possible `requirements.txt` is sketched below).

-5. Make sure the MailmanArchiveScraper.py script is executable (chmod +x).
+5. Make sure the `MailmanArchiveScraper.py` script is executable (`chmod +x`), and the `MailmanGzTextScraper.py` script too if you need it.
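
If you need to assemble `requirements.txt` yourself, it would simply list the four modules above by their PyPI package names; exact versions are an assumption, so none are pinned in this sketch:

    BeautifulSoup
    ClientForm
    mechanize
    PyRSS2Gen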


## Configuration

There is help in the configuration file for each setting. The minimum things you'll need to set are:

-1. domain -- The domain name that your Mailman pages are on.
-2. list_name -- Name of your mailing list.
-3. email and password -- Required if your Mailman archive is password protected.
-4. publish_dir -- The path to the local directory the files should be republished to.
-5. publish_url - If you're going to publish the messages to a website.
+1. `domain` -- The domain name that your Mailman pages are on.
+2. `list_name` -- Name of your mailing list.
+3. `email` and `password` -- Required if your Mailman archive is password protected.
+4. `publish_dir` -- The path to the local directory the files should be republished to.
+5. `publish_url` -- If you're going to publish the messages to a website.
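
As an illustration, a minimal `MailmanArchiveScraper.cfg` might look like this; the section name and all values here are assumptions for the sketch, so follow the comments in `MailmanArchiveScraper-example.cfg` for the real settings:

    [DEFAULT]
    domain: lists.example.com
    list_name: mylist
    email: you@example.com
    password: secret
    publish_dir: /var/www/mylist-archive/
    publish_url: http://example.com/mylist-archive/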


## Usage

Once configuration is done, run the script:

$ python ./MailmanArchiveScraper.py

All being well, the HTML archive files will be downloaded. Set the `verbose` setting in the configuration file to see a list of which files are being fetched.

If you want to download the plaintext files that Mailman saves for each month's messages (which may be gzipped), then run this script:

$ python ./MailmanGzTextScraper.py

After an initial run, you can run the script via cron to keep an updated copy of the HTML and/or text files. Note the `hours_to_go_back` setting in the config file, which will probably need to be different for the first run than for subsequent, regular runs.
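
For example, a crontab entry along these lines (the install path here is an assumption) would refresh the archive every hour:

    0 * * * * cd /path/to/mailman-archive-scraper && python MailmanArchiveScraper.py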


## What would also be nice:

* Sending each message on as an email. I can't see how to do this simply, given that the script retains no state between runs, so it can't tell which emails have already been sent.


## Contributors

Many thanks to:

* [CyberRodent](https://github.com/cyberrodent) for the text/gzip file archiving.
