Troubleshooting, testing and live coding

methylene edited this page Oct 17, 2014 · 6 revisions

Troubleshooting

Catching data errors with traps

You can use Cascading traps with Cascalog to capture tuples whose processing fails. To write those tuples to a sink tap (for example a local file via lfs-textline, or hfs-textline on HDFS), add a (:trap ...) predicate that names the error sink to the query:

(def errors (lfs-textline "file:///tmp/people.bad_records" :sinkmode :replace)) 
;; or (stdout) or (hfs-textline "hdfs:///tmp/...") if running on Hadoop

(<- [?name ?age]
    (people ?name ?age)
    (:trap errors)
    (< ?age 40))

Testing

You can use the functions and macros from the cascalog.testing namespace together with clojure.test to test your queries. See Cascalog's own tests for examples.

They use, for example, fact?- to execute a query and compare its output with the expected tuples, or a Midje-style assertion such as (facts query => (produces [[3 10] [1 5] [5 11]])), where query is defined as (def query (<- ...)). Read Sam Ritchie's blog post "Cascalog Testing 2.0" for more details and examples using midje-cascalog 0.4.0.
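As a minimal sketch of the clojure.test route, the test below checks an age filter against an in-memory source. It assumes Cascalog 1.x, where the test?<- macro lives in cascalog.testing (other releases may place it in a different namespace); the namespace example.people-test and the sample data are hypothetical.

```clojure
(ns example.people-test ; hypothetical namespace
  (:require [clojure.test :refer [deftest]]
            [cascalog.testing :refer [test?<-]]))

;; In-memory source: a plain vector of [name age] tuples.
(def people [["ben" 21] ["jim" 42]])

(deftest under-40-test
  ;; test?<- builds the query from the output vars and predicates,
  ;; runs it, and asserts that the result matches the expected tuples.
  (test?<- [["ben" 21]]
           [?name ?age]
           (people ?name ?age)
           (< ?age 40)))
```

Such a test runs like any other clojure.test test, for example with lein test or (clojure.test/run-tests) at the REPL.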

Live coding

There are certain features that support live, interactive coding:

  • Use simple Clojure collections as data sources ((def people [["ben" 21] ["jim" 42]]))
  • During development you can easily turn parts of your Cascalog code into standard Clojure functions and call them from the REPL, for example by replacing (defaggregateop with (defn in a custom operator.
  • Queries can, of course, be executed from the REPL.
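A small sketch of the second point: the three arities below are the ones a defaggregateop body would have (initial state, update, result), written as a plain defn so they can be called directly at the REPL. The name sum-ages and the summing logic are hypothetical; people is the vector from the first bullet.

```clojure
;; Plain Clojure collection standing in for a real tap.
(def people [["ben" 21] ["jim" 42]])

;; During development: defn instead of defaggregateop, same body.
(defn sum-ages
  ([] 0)                        ; initial aggregation state
  ([total age] (+ total age))   ; fold one value into the state
  ([total] [total]))            ; turn the final state into a result tuple

;; Drive the aggregator by hand: start from the initial state, fold in
;; every age, then ask for the final result tuple.
(def total-age
  (sum-ages (reduce sum-ages (sum-ages) (map second people))))
```

Once the logic behaves as expected, changing defn back to defaggregateop makes the same body usable as an aggregator inside a query.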

Logging with Log4j (local mode only)

When all the taps in a job are lfs-textlines or vectors (or stdout), you can run the -main in your jar directly using java -jar, instead of submitting it with hadoop jar. This is sometimes called local mode.
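A minimal sketch of such an entry point, assuming a hypothetical namespace example.core that is compiled with :gen-class and configured as the jar's main class. Every tap in it is either a vector or stdout, so nothing here needs a Hadoop cluster.

```clojure
(ns example.core ; hypothetical namespace
  (:require [cascalog.api :refer :all])
  (:gen-class))

;; In-memory source, so the job does not touch HDFS.
(def people [["ben" 21] ["jim" 42]])

(defn under-40
  "Query for the [name age] tuples in source whose age is below 40."
  [source]
  (<- [?name ?age]
      (source ?name ?age)
      (< ?age 40)))

(defn -main [& args]
  ;; ?- executes the query and writes its tuples to the stdout sink.
  (?- (stdout) (under-40 people)))
```

Packaged as an uberjar, this would be started with java -jar followed by the jar's path instead of being submitted with hadoop jar.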

When your jobs run in local mode, you can get detailed logging from log4j simply by placing a standard log4j.xml at the classpath root of your jar. Any exceptions thrown in jobs are then written to the configured log file with their full stack traces.