This repo contains all the material necessary to build a Docker container that runs notebooks demonstrating replicable and scalable data download pipelines, data extraction, and analysis. The data pipelines included are representative of pipelines now in production, scaled up on a larger Kubernetes cluster to serve AidData's GeoQuery platform.
Three datasets are retrieved for this demo: 1) NDVI, 2) population, and 3) the locations of mining-related Chinese development finance projects.
The LTDR (Long-Term Data Record) is a project at NASA that "produces, validates and distributes a climate data record." NDVI (Normalized Difference Vegetation Index) "provides continuity with NOAA's AVHRR NDVI time series record for historical and climate applications." The NDVI pipeline downloads daily global NDVI data at 5km resolution, unpacks them from HDF files, converts them into the GeoTIFF format, and then creates monthly and yearly aggregates.
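As a rough illustration of the aggregation stage, the sketch below composites one month of daily NDVI GeoTIFFs into a single monthly raster. The paths and the max-value compositing choice are assumptions for illustration; `code/ndvi.ipynb` defines the actual processing.

```python
# A minimal sketch of monthly aggregation, assuming the daily NDVI has
# already been unpacked from HDF to GeoTIFF. Paths and the max-value
# composite are illustrative; the notebook's method may differ.
import glob

import numpy as np
import rasterio

daily_paths = sorted(glob.glob("data/ndvi/daily/2020_01_*.tif"))

bands = []
for path in daily_paths:
    with rasterio.open(path) as src:
        profile = src.profile  # reuse georeferencing from the daily files
        bands.append(src.read(1).astype("float32"))

# Max-value compositing is a common way to reduce cloud contamination
# in AVHRR-derived NDVI records.
monthly = np.nanmax(np.stack(bands), axis=0)

profile.update(dtype="float32")
with rasterio.open("data/ndvi/monthly/2020_01.tif", "w", **profile) as dst:
    dst.write(monthly, 1)
```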
The population count data downloaded by the notebook is from the WorldPop Hub. It contains population data in the form of global mosaicked 1km resolution datasets.
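For reference, fetching one of these mosaics looks roughly like the sketch below; the URL pattern and local paths are assumptions, and `code/pop.ipynb` builds the actual requests.

```python
# A minimal sketch of downloading one WorldPop global mosaic; the URL
# pattern and output path are assumptions, not the notebook's exact code.
import requests

year = 2020
url = (
    "https://data.worldpop.org/GIS/Population/Global_2000_2020/"
    f"{year}/0_Mosaicked/ppp_{year}_1km_Aggregated.tif"
)

with requests.get(url, stream=True, timeout=60) as response:
    response.raise_for_status()
    with open(f"data/pop/ppp_{year}_1km.tif", "wb") as f:
        # stream to disk in 1MB chunks since the mosaics are large
        for chunk in response.iter_content(chunk_size=1 << 20):
            f.write(chunk)
```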
AidData's Geospatial Chinese Development Finance dataset (GeoGCDF v3) provides the locations of a wide range of projects financed by China around the world between 2000 and 2023. For this use case, we isolate seven projects and their locations that are associated with mining activities. More information on the GeoGCDF can be found in the dataset's GitHub repository as well as the associated publication in Nature's Scientific Data journal.
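Conceptually, the filtering step looks like the sketch below; the input file name and the `Sector` attribute are assumptions about the GeoGCDF v3 schema, and `code/gcdf.ipynb` uses the dataset's actual fields.

```python
# A hedged sketch of isolating mining-related project locations. The
# file name and "Sector" field are assumptions about the GeoGCDF v3
# schema; the gcdf notebook uses the real field names.
import geopandas as gpd

gcdf = gpd.read_file("data/gcdf_v3.gpkg")
mining = gcdf[gcdf["Sector"].str.contains("mining", case=False, na=False)]
print(f"{len(mining)} mining-related features retained")
mining.to_file("data/mining_projects.gpkg", driver="GPKG")
```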
2. Generate a token:
   - Navigate to the LAADS DAAC website
   - Click on "Login" at the top right of the screen
   - Click on "Generate Token"
   - Copy the generated token into `code/config.ini` (note that the token currently listed in the file is inactive and will not work); a sketch of how the token is used follows this list
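Once the token is in place, the notebooks can authenticate their LAADS DAAC requests with it. The sketch below shows the typical bearer-token pattern; the config section/key names and the file URL are placeholders, not the repo's actual values.

```python
# A sketch of the usual LAADS DAAC auth pattern: send the EarthData
# token as a bearer token on download requests. Section/key names and
# the URL are placeholders; the notebooks read code/config.ini directly.
import configparser

import requests

config = configparser.ConfigParser()
config.read("code/config.ini")
token = config["main"]["token"]  # hypothetical section/key names

# placeholder; replace with an actual LAADS archive file URL
file_url = "https://ladsweb.modaps.eosdis.nasa.gov/archive/..."

response = requests.get(
    file_url,
    headers={"Authorization": f"Bearer {token}"},
    timeout=60,
)
response.raise_for_status()
```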
3. Customize `code/config.ini` to meet your needs:
   - Choose which years you'd like to download and process
   - Set your raw and output directories
   - Add your EarthData token (see step 2)
   - You can also modify the run options to scale up the backend methods and number of workers for increased performance, but we suggest starting with the defaults (see the sketch after this list)
     - The source code which runs the backend processing is available on GitHub here.
     - Performance of alternative run options will depend on the hardware you build and run the container on and can vary significantly (i.e., running a container locally on a laptop vs. deploying in a Kubernetes cluster).
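To give a sense of what the worker option controls, the sketch below fans per-file work out over a process pool; `process_file` and the paths are placeholders, and the actual task runner lives in the backend repository linked above.

```python
# An illustrative sketch of the worker run option: fanning per-file
# work out over a process pool. process_file and the paths are
# placeholders, not the backend's actual task runner.
from concurrent.futures import ProcessPoolExecutor

def process_file(path: str) -> str:
    # placeholder for the per-file download / conversion work
    return path

if __name__ == "__main__":
    paths = [f"raw/day_{i:03d}.hdf" for i in range(365)]
    # analogous to the number-of-workers run option in config.ini
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(process_file, paths))
```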
4. Build and run the Docker container:
   - Define your local directory to use as the mount/source (typically the folder where this README file resides):

     ```
     src_dir=/home/userx/git/geo-container-demo
     ```

   - Open permissions on the source directory:

     ```
     chmod -R 777 $src_dir
     ```

   - Set the tag you want to use:

     ```
     demo_tag=geodemo
     ```

   - Run the build command in your terminal (use `--no-cache` if you need a fresh rebuild); you may need to prefix it with `sudo`:

     ```
     docker build -t geodemo:$demo_tag ./
     ```

   - After the build has completed, run:

     ```
     docker run --rm -it -p 8888:8888 -v $src_dir:/home/jovyan:Z localhost/geodemo:$demo_tag
     ```

   - From the terminal output, copy the link given and open it in your browser to access the notebook environment where you will run the subsequent steps.
The data pipelines are made available through three Python notebooks in the `code` directory, within the Jupyter environment you built and opened in the previous steps.

Note: If you only want to replicate the analysis, the output of the data extraction (`/data/extract.csv`) is included in the repository and you can skip to step 9.

Note: Running the full pipelines requires nearly 40GB of space for downloads and final data products using the default settings.
5. First, open the `code/ndvi.ipynb` notebook and run through all the steps. Depending on the run options and years selected (and your internet connection), this can take from a couple of hours to more than a day.
6. Open the `code/pop.ipynb` notebook and run through all the steps. The run time will similarly vary, but with the default options it should not take more than an hour on a typical laptop and internet connection.
7. Open the `code/gcdf.ipynb` notebook and run through all the steps. This should only take a few minutes.
8. Open the `code/extract.ipynb` notebook and run through all the steps. This should take a few minutes at most. (A hedged sketch of this kind of extraction follows this step.)
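For intuition, extracting raster values at the project locations can be as simple as the point-sampling sketch below; the file names and the sampling approach are assumptions, and `code/extract.ipynb` defines the actual extraction that produces `extract.csv`.

```python
# A hedged sketch of one plausible extraction: sampling a yearly NDVI
# raster at each mining project location. File names and point sampling
# (rather than, e.g., zonal statistics) are assumptions.
import geopandas as gpd
import rasterio

points = gpd.read_file("data/mining_projects.gpkg")
coords = [(geom.x, geom.y) for geom in points.geometry.centroid]

with rasterio.open("data/ndvi/yearly/2020.tif") as src:
    points["ndvi_2020"] = [values[0] for values in src.sample(coords)]

points.drop(columns="geometry").to_csv("data/extract.csv", index=False)
```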
9. Open the `code/analysis.ipynb` notebook and run through all the steps. This should run nearly instantly.
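If you skipped straight to the analysis using the bundled extraction output, a quick sanity check looks like the sketch below; no column names are assumed, so it only loads and summarizes the table.

```python
# A minimal sanity check of the bundled extraction output before
# running the analysis notebook; no column names are assumed here.
import pandas as pd

extract = pd.read_csv("data/extract.csv")
print(extract.shape)
print(extract.head())
print(extract.describe(include="all"))
```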