Skip to content

igrishaev/remus

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

51 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Remus

An attentive RSS and Atom feed parser for Clojure. It's built on top of well-known and powerful ROME Tools Java library. Remus deals with weird encoding and non-standard XML tags. The library fetches as much information from a feed as possible.

Table of Contents

Benefits

  • Gets all the known fields from a feed and turns them into plain Clojure data structures;
  • relies on up-to-date ROME release;
  • uses the power of clj-http client instead of deprecated ROME Fetcher;
  • preserves all the non-standard XML tags for further processing (see example below).

Installation

Leiningen/Boot:

[remus "0.2.4"]

Clojure CLI/deps.edn

remus/remus {:mvn/version "0.2.4"}

Usage

The library provides a one-word top namespace remus so it's easier to remember.

(ns your.project
  (:require [remus :refer [parse-url parse-file]]))

or:

(require '[remus :refer [parse-url parse-file]])

Parsing a URL

Let's parse Planet Clojure:

(def result (parse-url "http://planet.clojure.in/atom.xml"))

The variable result is a map of two keys: :response and :feed. These are an HTTP response and a parsed feed. Below, there is a truncated version of a feed:

(def feed (:feed result))

(println feed)

;;;;
;; just a small subset
;;;;

{:description nil,
 :feed-type "atom_1.0"
 :entries
 [{:description nil
   :updated-date #inst "2018-08-13T10:00:00.000-00:00"
   :extra {:tag :extra, :attrs nil, :content ()}
   :title
   "PurelyFunctional.tv Newsletter 287: DataScript, GraphQL, CRDTs"
   :author "Eric Normand"
   :link
   "https://purelyfunctional.tv/issues/purelyfunctional-tv-newsletter-287-datascript-graphql-crdts/"
   :uri "https://purelyfunctional.tv/?p=28660"
   :contents
   ({:type "html"
     :mode nil
     :value
     "<div class=\" reset\">\n<p><em>Issue 287 August 13, 2018 <a href=\"https://purelyfunctional.tv/newsletter-archives/\">Archives</a> <a href=\"https://purelyfunctional.tv/newsletter/\" title=\"Thanks, Jeff!\">Subscribe</a></em></p>\n<p>Hi Clojurationists,</p>\n<p>I've just been digging <a href=\"https://twitter.com/puredanger/status/1028103654241443840\" title=\"\">this lovely tweet from Alex Miller</a>.</p>\n<p>Rock on!<br /><a href=\"http://twitter.com/ericnormand\">Eric Normand</a> &lt;<a href=\"mailto: ... "}),
 :published-date #inst "2018-08-13T11:59:11.000-00:00"
 :entry-links
 ({:rel "alternate"
   :href "http://planet.clojure.in/"
   :length 0}
  {:rel "self"
   :href "http://planet.clojure.in/atom.xml",
   :length 0})
 :title "Planet Clojure"
 :language nil
 :link "http://planet.clojure.in/"
 :uri "http://planet.clojure.in/atom.xml"
 :authors ()}

As for HTTP response, it's the same data structure that clj-http.client/response function returns. You might need that data to save some of the HTTP headers for further requests (see below).

Parsing a file

(def feed (parse-file "/path/to/some/atom.xml"))

This function just returns a parsed feed.

Parsing an input stream

Just in case you're getting a feed from a stream, here is a function for that:

(def feed (parse-stream (clojure.java.io/input-stream some-source)))

Like parse-file, it returns a parsed feed as a data structure.

HTTP tweaks

Since Remus relies on clj-http library for HTTP communication, you are welcome to use all its features. For example, to control redirects, security validation, authentication, etc. When calling parse-url, pass an optional map with HTTP parameters:

;; Do not check an untrusted SSL certificate.
(parse-url "http://planet.clojure.in/atom.xml"
           {:insecure? true})


;; Parse a user/pass protected HTTP resource.
(parse-url "http://planet.clojure.in/atom.xml"
           {:basic-auth ["username" "password"]})


;; Pretending being a browser. Some sites protect access by "User-Agent" header.
(parse-url "http://planet.clojure.in/atom.xml"
           {:headers {"User-Agent" "Mozilla/5.0 (Macintosh; Intel Mac...."}})

Remus overrides just one option which is :as. No matter what you put into it, the value becomes :stream. We need a streamed HTTP response because ROME relies on an input stream.

Errors and exceptions

It's up to you how to deal with non-200 HTTP responses. Even if you pass {:throw-exceptions false}, the feed only be parsed when the status code is 200.

(let [result (parse-url "http://example.com/non-existing-url"
                               {:throw-exceptions false})
             {:keys [response feed]} result]
         (when-not feed
           (process-non-200 response)))

Or just skip the :throw-exceptions flag and wrap everything into the standard try/catch form:

(try
  (parse-url "http://non-existing-url")
  (catch clojure.lang.ExceptionInfo e
    (let [response (ex-data e)
          {:keys [status headers]} response]
      (println status headers)
      ;; do anything you want
      )))

Alternately, you may use the Slingshot approach to catch HTTP-thrown exceptions as the official manual describes.

Saving headers

When parsing a URL, a good option would be to pass the If-None-Match and If-Modified-Since headers with the values from the Etag and Last-Modified ones from the previous response. This trick is know as conditional GET. It might prevent server from sending the data you've already received before:

;; returns the whole feed
(def result (parse-url "http://planet.lisp.org/rss20.xml"))

;; split the result
(def feed (:feed result))
(def response (:response result))

;; ensure we got the data
(:length response)
48082

;; save the headers
(def etag (-> response :headers :etag))
;; "5b71766f-2f597"

(def last-modified (-> response :headers :last-modified))
;; Mon, 19 Oct 2020 12:15:27 GMT

;;;;
;; Now, try to fetch data passing conditionals headers:
;;;;

(def result-new
  (parse-url "http://planet.lisp.org/rss20.xml"
             {:headers {"If-None-Match" etag
                        "If-Modified-Since" last-modified}}))

(-> result-new :response :status)
304

(-> result-new :response :length)
0

(-> result-new :feed)
nil

Since the server returned non-200 but positive status code (304 in our case), we don't parse the response at all. So the :feed field in the result-new variable will be nil.

Non-standard tags

Sometimes, a feed ships additional data with non-standard tags. A good example might be a typical YouTube feed. Let's examine one of its entries:

<entry>
  <id>yt:video:TbthtdBw93w</id>
  <yt:videoId>TbthtdBw93w</yt:videoId>
  <yt:channelId>UCaLlzGqiPE2QRj6sSOawJRg</yt:channelId>
  <title>Datomic Ions in Seven Minutes</title>
  <link rel="alternate" href="https://www.youtube.com/watch?v=TbthtdBw93w"/>
  <author>
    <name>ClojureTV</name>
    <uri>
      https://www.youtube.com/channel/UCaLlzGqiPE2QRj6sSOawJRg
    </uri>
  </author>
  <published>2018-07-03T21:16:16+00:00</published>
  <updated>2018-08-09T16:29:51+00:00</updated>
  <media:group>
    <media:title>Datomic Ions in Seven Minutes</media:title>
    <media:content url="https://www.youtube.com/v/TbthtdBw93w?version=3" type="application/x-shockwave-flash" width="640" height="390"/>
    <media:thumbnail url="https://i1.ytimg.com/vi/TbthtdBw93w/hqdefault.jpg" width="480" height="360"/>
    <media:description>
      Stuart Halloway introduces Ions for Datomic Cloud on AWS.
    </media:description>
    <media:community>
      <media:starRating count="67" average="5.00" min="1" max="5"/>
      <media:statistics views="1977"/>
    </media:community>
  </media:group>
</entry>

In addition to the standard fields, the feed carries information about the video ID, channel ID and statistics: views count, the number of times the video was starred and its average rating. You would probably want to use that data.

Alternately, if you parse a geo-related feed, you'll get lat/lot coordinates, location names, tracks, etc.

Other RSS parsers either drop this data or require you to write a custom extension. Remus provides all the non-standard tags as a parsed XML structure. It puts that data into an :extra field for each entry and on the top level of a feed. This is how you can reach it:

(def result (parse-url "https://www.youtube.com/feeds/videos.xml?channel_id=UCaLlzGqiPE2QRj6sSOawJRg"))

(def feed (:feed result))

;;;;
;; Get entry-specific custom data
;;;;

;; Extra data from the first entry:
(-> feed :entries first :extra)

{:tag :rome/extra
 :attrs nil
 :content
 ({:tag :yt/videoId :attrs nil :content ["faoXSarGgEI"]}
  {:tag :yt/channelId :attrs nil :content ["UCaLlzGqiPE2QRj6sSOawJRg"]}
  {:tag :media/group
   :attrs nil
   :content
   ({:tag :media/title :attrs nil :content ["Datomic Cloud - Datoms"]}
    {:tag :media/content
     :attrs
     {:url "https://www.youtube.com/v/faoXSarGgEI?version=3"
      :type "application/x-shockwave-flash"
      :width "640"
      :height "390"}
     :content nil}
    {:tag :media/thumbnail
     :attrs
     {:url "https://i3.ytimg.com/vi/faoXSarGgEI/hqdefault.jpg"
      :width "480"
      :height "360"}
     :content nil}
    {:tag :media/description
     :attrs nil
     :content
     ["Check out the live animated tutorial: https://docs.datomic.com/cloud/livetutorial/datoms.html\n\nYour Datomic database consists of datoms. What are Datoms?"]}
    {:tag :media/community
     :attrs nil
     :content
     ({:tag :media/starRating
       :attrs {:count "72" :average "5.00" :min "1" :max "5"}
       :content nil}
      {:tag :media/statistics :attrs {:views "2014"} :content nil})})})}


;;;;
;; Get feed-specific extra:
;;;;

(-> feed :extra)

{:tag :rome/extra
 :attrs nil
 :content
 ({:tag :yt/channelId :attrs nil :content ["UCaLlzGqiPE2QRj6sSOawJRg"]})}

The :extra fields follow the standard XML-friendly structure so they can be processed with any XML-related technics like walking, zippers, etc.

Encoding issues

All the parse-<something> functions mentioned above take additional ROME-related options. Use them to solve XML-decoding issues when dealing with weird or non-set HTTP headers. ROME's got a solid algorithm to guess encoding, but sometimes it might need your help.

At the moment, Remus supports :lenient, :encoding and content-type options with has the following meaning:

  • lenient: a boolean flag which makes Rome to be more loyal to some mistakes in XML markup;

  • encoding: a string which represents the encoding of the feed. When parsing a URL, it comes from the Content-Encoding HTTP header. Possible values are listed here: https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html

  • content-type: a string meaning the MIME type of the feed, e.g. application/rss or something. When parsing a URL, it comes from the Content-Type header.

Dealing with Windows encoding and unset Content-type or Content-Encoding headers:

(parse-url "https://some/rss.xml" nil {:lenient true :encoding "cp1251"})

The same options work for parsing a file or a stream:

(parse-file "https://another/atom.xml" {:lenient true :encoding "cp1251"})

(parse-stream in-source {:lenient true :encoding "cp1251"})

License

Copyright © 2020 Ivan Grishaev

Distributed under the Eclipse Public License either version 1.0 or (at your option) any later version.