Troubleshooting, testing and live coding
You can use Cascading Traps with Cascalog to capture tuples whose processing fails. To store those tuples in a sink tap (for example a local file or hfs-textline), use the :trap keyword with an error sink:
(def errors (lfs-textline "file:///tmp/people.bad_records" :sinkmode :replace))
;; or (stdout) or (hfs-textline "hdfs:///tmp/...") if running on Hadoop
(<- [?name ?age]
    (people ?name ?age)
    (:trap errors)
    (< ?age 40))
You can use the functions and macros from the cascalog.testing namespace together with clojure.test to test your queries. See Cascalog's own tests for examples.
They use, for example, fact?- to execute a query and compare its output with the expected tuples, or a form such as (facts query => (produces [[3 10] [1 5] [5 11]])) where (def query (<- ...)). Read Sam Ritchie's blog post Cascalog Testing 2.0 for more details and examples of midje-cascalog 0.4.0.
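As a rough illustration of testing a query with clojure.test, the sketch below assumes a Cascalog 1.x release in which cascalog.testing provides the test?<- macro (expected tuples first, then the query's output variables and predicates). The namespace name example.query-test and the people data are made up for the example.

```clojure
(ns example.query-test
  (:use clojure.test
        cascalog.api
        cascalog.testing))

;; Hypothetical in-memory source: a plain vector of [name age] tuples.
(def people [["ben" 21] ["jim" 42]])

(deftest people-under-40-test
  ;; Assumption: test?<- runs the query and asserts that it
  ;; produces exactly the expected tuples (order is ignored).
  (test?<- [["ben" 21]]
           [?name ?age]
           (people ?name ?age)
           (< ?age 40)))
```

Because the source is a vector, a test like this should need no input files; it can be run with the usual clojure.test runner.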
There are certain features that support live, interactive coding:
- Use simple Clojure collections as data sources, for example (def people [["ben" 21] ["jim" 42]])
- During development you can easily change parts of your Cascalog code into standard Clojure functions and call them from the REPL, for example turn a custom operator into a plain function by replacing (defaggregateop with (defn
- Queries can of course be executed from the REPL
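The operator-to-function swap mentioned in the list can be sketched in plain Clojure. The aggregator sum-ages is hypothetical; the sketch assumes the usual three-arity body of a defaggregateop (initial state, fold step, final result), so that changing the defining form to defn leaves a function you can call by hand.

```clojure
;; In the job this would start with (defaggregateop sum-ages
;; Swapping in defn makes it an ordinary multi-arity function.
(defn sum-ages
  ([] 0)                        ; initial state
  ([total age] (+ total age))   ; fold one value into the state
  ([total] [total]))            ; emit the result tuple

;; A simple collection standing in for a real data source.
(def people [["ben" 21] ["jim" 42]])

;; Drive the three arities manually, roughly as an aggregation would.
(def result
  (sum-ages (reduce sum-ages (sum-ages) (map second people))))
;; result is [63]
```

Once the logic behaves as expected at the REPL, the form can be switched back to defaggregateop and used inside a query again.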
When all the taps in a job are lfs-textline taps or vectors (or stdout), you can run the -main in your jar directly using java -jar, instead of submitting it with hadoop jar. This is sometimes called local mode.
When your jobs are running in this local mode, you can have a lot of information logged with log4j just by putting a standard log4j.xml in the classpath root of your jar. Any exceptions thrown in jobs will be printed to the configured log file with their full stacktrace.
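A job suitable for the local mode described above might look like the following minimal sketch. The namespace example.core and the two command-line arguments are assumptions for illustration; only lfs-textline taps are used, so nothing in it refers to HDFS.

```clojure
(ns example.core
  (:use cascalog.api)
  (:gen-class))

(defn -main
  "Hypothetical entry point: copies every line of a local text file
   into a local output directory."
  [in-path out-path]
  (let [source (lfs-textline in-path)
        sink   (lfs-textline out-path :sinkmode :replace)]
    ;; ?<- defines the query and executes it immediately.
    (?<- sink
         [?line]
         (source ?line))))
```

Packaged into a standalone jar with this namespace as the main class, it could be started with java -jar followed by the input and output paths.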