- utils.R - Check if dependencies are serialized correctly
- Similar to `stats.py` in Python, add support for mean, median, stdev etc.
- Extend `addPackage` so that any given R file can be sourced in the worker before functions are run.
- Add a `lookup` method to get an element of a pair RDD object by key.
- `hashCode` support for arbitrary R objects.
- Support for other storage types like storing RDDs on disk.
- Extend input formats to support `sequenceFile`.
- Write hash functions in C and use .Call to call into them.
- Use long-running R worker daemons to avoid forking a process each time.
- Memoize frequently queried values of an RDD, such as numPartitions, count etc.
- Pipelined RRDD to execute multiple functions with one call.
- Profile serialization overhead and see if there is anything better we can do.
- Integration with ML Lib to run ML algorithms from R.
- RRDDs are distributed lists. Extend them to create a distributed data frame.
- Support accumulators in R.
- Reduce code duplication between SparkR and PySpark.
- Add more machine learning examples and some performance benchmarks.
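
As a rough illustration of what the mean/stdev item above could involve, here is a minimal sketch of a summary object that is computed per partition and then merged, so that statistics need only one pass over the data. It is written in Python because the item points at the Python side as the model. The names `RunningStats` and `summarize` are hypothetical and are not part of SparkR or PySpark, and the plain nested lists stand in for RDD partitions. Median is left out on purpose, since it cannot be merged from fixed-size partial summaries in this way.

```python
import math


class RunningStats:
    """Hypothetical helper: count, mean and variance that merge across partitions."""

    def __init__(self):
        self.n = 0        # number of values seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # sum of squared deviations from the mean

    def add(self, x):
        # Welford's single-pass update for one new value.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return self

    def merge(self, other):
        # Combine two partial summaries (pairwise formula of Chan et al.).
        if other.n == 0:
            return self
        if self.n == 0:
            self.n, self.mean, self.m2 = other.n, other.mean, other.m2
            return self
        total = self.n + other.n
        delta = other.mean - self.mean
        self.mean += delta * other.n / total
        self.m2 += other.m2 + delta * delta * self.n * other.n / total
        self.n = total
        return self

    def stdev(self):
        # Population standard deviation; NaN for an empty summary.
        return math.sqrt(self.m2 / self.n) if self.n else float("nan")


def summarize(partitions):
    # Stand-in for a per-partition map followed by a reduce over the results.
    total = RunningStats()
    for part in partitions:
        local = RunningStats()
        for x in part:
            local.add(x)
        total.merge(local)
    return total
```

Calling `summarize([[1.0, 2.0, 3.0], [4.0, 5.0]])` yields a summary whose `n`, `mean` and `stdev()` cover all five values. An R version would presumably follow the same shape, with `add` applied inside each partition and `merge` used as the reduce function.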