spark-xarray is an open source project and Python package that seeks to integrate PySpark and xarray for Climate Data Analysis. It is built on top of PySpark - Spark Python API and xarray.
spark-xarray was originally conceived during the Summer of 2017 as part of PySpark for "Big" Atmospheric & Oceanic Data Analysis - A CISL/SIParCS Research Project.
It is currently maintained by Anderson Banihirwe.
Documentation is available at https://andersy005.github.io/spark-xarray/.
We will guide you how to install spark-xarray. However, we will assume that an Apache Spark installation is available.
For the installation of spark-xarray
, the following packages are required:
- Spark 2.0+
- netcdf4-python (>=1.2.8)
xarray (>=0.9.5)
dask (>=0.15.1)
toolz (>=0.8.2)
Clone the repository directly from GitHub and install it aftwards using $ python setup.py
. This will also resolve possible missing dependencies.
$ git clone https://github.com/andersy005/spark-xarray.git
$ cd spark-xarray
$ python setup.py install
We welcome new contributors of all experience levels.
- Official source code repo: https://github.com/andersy005/spark-xarray
- Issue tracker: https://github.com/andersy005/spark-xarray/issues
>>> from sparkxarray.reader import ncread
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.appName('spark-rdd').getOrCreate()
>>> sc = spark.SparkContext
>>> filepath='spark-xarray/sparkxarray/tests/data/air.sig995.2012.nc'
>>> # Create an RDD
>>> rdd = ncread(sc, filepath, mode='single', partition_on=['time'], partitions=100)
>>> rdd.first() # Get the first element
<xarray.Dataset>
Dimensions: (lat: 73, lon: 144, time: 1)
Coordinates:
* lat (lat) float32 90.0 87.5 85.0 82.5 80.0 77.5 75.0 72.5 70.0 67.5 ...
* lon (lon) float32 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 22.5 ...
* time (time) datetime64[ns] 2012-01-01
Data variables:
air (time, lat, lon) float64 234.5 234.5 234.5 234.5 234.5 234.5 ...
Attributes:
Conventions: COARDS
title: mean daily NMC reanalysis (2012)
history: created 2011/12 by Hoop (netCDF2.3)
description: Data is from NMC initialized reanalysis\n(4x/day). These a...
platform: Model
references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
>>> rdd.count() # Get a count of elements in the rdd
366
>>> # The count above corresponds to number of timesteps in the netCDF file
>>> rdd.getNumPartitions() # Get the number of partitions
100
>>> # Compute the daily average for each day (element) in RDD
>>> daily_average = rdd.map(lambda x: x.mean(dim=['lat', 'lon']))
>>> daily_average.take(3)
[<xarray.Dataset>
Dimensions: (time: 1)
Coordinates:
* time (time) datetime64[ns] 2012-01-01
Data variables:
air (time) float64 277.0, <xarray.Dataset>
Dimensions: (time: 1)
Coordinates:
* time (time) datetime64[ns] 2012-01-02
Data variables:
air (time) float64 276.8, <xarray.Dataset>
Dimensions: (time: 1)
Coordinates:
* time (time) datetime64[ns] 2012-01-03
Data variables:
air
>>> from sparkxarray.reader import ncread
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.appName('spark-rdd').getOrCreate()
>>> sc = spark.SparkContext
>>> paths='spark-xarray/sparkxarray/tests/data/NCEP/*.nc'
>>> multi_rdd = ncread(sc, paths, mode='multi', partition_on=['lat', 'lon'], partitions=300)
>>> multi_rdd.count()
16020
>>> multi_rdd.first()
<xarray.Dataset>
Dimensions: (lat: 1, lon: 1, nv: 2, time: 4, zlev: 1)
Coordinates:
* zlev (zlev) float32 0.0
* lat (lat) float32 -88.0
* lon (lon) float32 0.0
* time (time) datetime64[ns] 1854-01-15 1854-02-15 1854-03-15 1854-04-15
Dimensions without coordinates: nv
Data variables:
lat_bnds (time, lat, nv) float32 -89.0 -87.0 -89.0 -87.0 -89.0 -87.0 ...
lon_bnds (time, lon, nv) float32 -1.0 1.0 -1.0 1.0 -1.0 1.0 -1.0 1.0
sst (time, zlev, lat, lon) float64 nan nan nan nan
anom (time, zlev, lat, lon) float64 nan nan nan nan
Attributes:
Conventions: CF-1.6
Metadata_Conventions: CF-1.6, Unidata Dataset Discovery v1.0
metadata_link: C00884
id: ersst.v4.185401
naming_authority: gov.noaa.ncdc
title: NOAA Extended Reconstructed Sea Surface Tempe...
summary: ERSST.v4 is developped based on v3b after rev...
institution: NOAA/NESDIS/NCDC
creator_name: Boyin Huang
creator_email: boyin.huang@noaa.gov
date_created: 2014-10-24
production_version: Beta Version 4
history: Version 4 based on Version 3b
publisher_name: Boyin Huang
publisher_email: boyin.huang@noaa.gov
publisher_url: http://www.ncdc.noaa.gov
creator_url: http://www.ncdc.noaa.gov
license: No constraints on data access or use
time_coverage_start: 1854-01-15T000000Z
time_coverage_end: 1854-01-15T000000Z
geospatial_lon_min: -1.0f
geospatial_lon_max: 359.0f
geospatial_lat_min: -89.0f
geospatial_lat_max: 89.0f
geospatial_lat_units: degrees_north
geospatial_lat_resolution: 2.0
geospatial_lon_units: degrees_east
geospatial_lon_resolution: 2.0
spatial_resolution: 2.0 degree grid
cdm_data_type: Grid
processing_level: L4
standard_name_vocabulary: CF Standard Name Table v27
keywords: Earth Science > Oceans > Ocean Temperat...
keywords_vocabulary: NASA Global Change Master Directory (GCMD) Sc...
project: NOAA Extended Reconstructed Sea Surface Tempe...
platform: Ship and Buoy SSTs from ICOADS R2.5 and NCEP GTS
instrument: Conventional thermometers
source: ICOADS R2.5 SST, NCEP GTS SST, HadISST ice, N...
comment: SSTs were observed by conventional thermomete...
references: Huang et al, 2014: Extended Reconstructed Sea...
climatology: Climatology is based on 1971-2000 SST, Xue, Y...
description: In situ data: ICOADS2.5 before 2007 and NCEP ...