The Challenge
How do we work with very large raster data?
Specifically...
How do we work with the NASA NEX Downscaled Climate Projections (NEX-DCP30) open data set?
What is NEX Climate Projection data?
General Circulation Models (GCMs)
Models for predicting world temperature and precipitation.
IPCC Assessment Report
- IPCC = Intergovernmental Panel on Climate Change
- Assessment Report 5 (AR5) published in 2014.
- More than 800 authors
3 Key Categories:
- Model
  - 33 different models
  - Model Ensembling
- Dataset
  - Temperature MAX
  - Temperature MIN
  - Precipitation
- Scenario
  - Representative Concentration Pathways (RCPs)
NEX Downscaled Data
- Monthly data over conterminous US
- Historical from 1950 - 2005
- 4 RCP scenarios from 2006 - 2099
- 8190 netCDF files on S3 - s3://nasanex/NEX-DCP30
- 15.3 TB in compressed GeoTiff tiles.
- For RCP 8.5, the largest dataset/model combination is 90.92 GB
Our workflow for processing NEX data
GeoTrellis
- Scala library for doing all things geospatial.
- Framework for doing distributed raster processing on Akka and Spark.
- Includes local, zonal, focal, and global operations on rasters.
- Currently in incubation at LocationTech.
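To make the operation categories concrete, here is a minimal pure-Python sketch of a local, a focal, and a zonal operation on a tiny raster. It is an illustration only and not the GeoTrellis API: the function names and the list-of-rows raster layout are assumptions made for this example, and global operations are left out.

```python
# Toy map-algebra operations on a raster stored as a list of rows.
# Illustration only; these names are hypothetical, not GeoTrellis functions.

def local_add(raster, k):
    """Local: each output cell depends only on the same input cell."""
    return [[v + k for v in row] for row in raster]

def focal_mean(raster):
    """Focal: each output cell is the mean of its 3x3 neighborhood,
    clipped at the raster edges."""
    rows, cols = len(raster), len(raster[0])
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            window = [raster[rr][cc]
                      for rr in range(max(0, r - 1), min(rows, r + 2))
                      for cc in range(max(0, c - 1), min(cols, c + 2))]
            out_row.append(sum(window) / len(window))
        out.append(out_row)
    return out

def zonal_max(raster, zones):
    """Zonal: aggregate the cells that share a zone id in a second raster."""
    result = {}
    for row, zone_row in zip(raster, zones):
        for v, z in zip(row, zone_row):
            result[z] = max(result.get(z, v), v)
    return result

temps = [[1, 2], [3, 4]]   # made-up cell values
zones = [[0, 0], [1, 1]]   # made-up zone ids, one per cell
```

For example, `zonal_max(temps, zones)` reports the maximum value per zone id, while `focal_mean(temps)` smooths each cell with its neighbors.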
Spark
- Fast and general engine for large-scale data processing.
- Does things Hadoop doesn't, like caching intermediate results in memory.
- Written in Scala!
- Also has bindings for Python and Java.
Accumulo
- Implementation of Google's Bigtable design
- Stores entries sorted by key
- Wide-column key/value store
- Also used by GeoMesa, another Scala project at LocationTech
Step 1:
Export the netCDF data into 512x512 GeoTiff tiles.
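As a rough sketch of what cutting a grid into 512x512 tiles involves, the helper below enumerates the pixel windows that cover a raster. The function name and the 1200x600 raster size are hypothetical and chosen only for illustration; edge tiles are clipped here, whereas a real export might pad them instead.

```python
def tile_windows(cols, rows, tile_size=512):
    """Yield (col_off, row_off, width, height) for each tile covering
    a raster of cols x rows pixels. Edge tiles are clipped, not padded."""
    for row_off in range(0, rows, tile_size):
        for col_off in range(0, cols, tile_size):
            width = min(tile_size, cols - col_off)
            height = min(tile_size, rows - row_off)
            yield (col_off, row_off, width, height)

# Hypothetical raster size, used only to show the output shape.
windows = list(tile_windows(1200, 600))
```

Each window can then be read from the source grid and written out as its own GeoTiff.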
Step 2:
Ingest the data into Accumulo using GeoTrellis-Spark.
- Ingest the GeoTiffs to Accumulo in parallel across a cluster.
- Ingest consists of
  - reprojection
  - mosaicing to tile scheme (TMS)
  - pyramiding up zoom levels
- Calculate index splits.
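The tiling and pyramiding steps can be sketched with the usual tile arithmetic. This assumes a global Web Mercator layout with TMS row numbering (row 0 at the south), which the slides do not spell out; the function names are hypothetical and this is not the GeoTrellis ingest code.

```python
import math

def tms_tile(lon, lat, zoom):
    """Return the (x, y) TMS tile containing a lon/lat point at a zoom level.
    Assumes a global Web Mercator tile layout; TMS counts rows from the south."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    # Row index counted from the north (XYZ convention).
    y_from_north = int(
        (1.0 - math.log(math.tan(lat_rad) + 1.0 / math.cos(lat_rad)) / math.pi)
        / 2.0 * n
    )
    return x, n - 1 - y_from_north

def parent_tile(x, y):
    """Pyramiding: the tile one zoom level up that covers tile (x, y)."""
    return x // 2, y // 2

tile_z2 = tms_tile(-77.0, 38.9, 2)   # a point on the US east coast
tile_z1 = parent_tile(*tile_z2)
```

Pyramiding up zoom levels then amounts to combining the four child tiles that share a parent into one lower-resolution tile.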
Analysis of NEX data
Live coding session...
Thanks!
Take it away Johan...
How do you read GeoTiffs on the JVM?
GDAL, Geospatial C lib, fast!
GeoTools, Geospatial Java lib, speed?
Why yet another GeoTiff Reader?
- GeoTools is a large dependency
- GDAL Java bindings are hard to install
- GeoTiff is the go-to raster file format at GeoTrellis
- GeoTrellis is all about speed: everything is optimized and benchmarked
What is the GeoTiff file format?
- Extension to the TIFF file format
- Used for images with geospatial metadata
- Stores the bounding box and the coordinate reference system (CRS) in TIFF tags
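A minimal sketch of the two pieces involved: reading the 8-byte TIFF header, and deriving a bounding box from the GeoTIFF ModelTiepointTag (33922) and ModelPixelScaleTag (33550) values. It assumes a north-up image and that the tag values have already been read from the image file directory. The function names and sample numbers are hypothetical, and this is not the GeoTrellis reader.

```python
import struct

def read_tiff_header(data):
    """Parse the 8-byte TIFF header: byte order, magic number 42, first IFD offset."""
    order = data[:2]
    if order == b"II":
        endian = "<"   # little-endian ("Intel")
    elif order == b"MM":
        endian = ">"   # big-endian ("Motorola")
    else:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("bad TIFF magic number")
    return endian, ifd_offset

def bounding_box(tiepoint, pixel_scale, cols, rows):
    """Compute (xmin, ymin, xmax, ymax) for a north-up image.
    tiepoint is (i, j, k, x, y, z); pixel_scale is (sx, sy, sz)."""
    i, j, _, x, y, _ = tiepoint
    sx, sy, _ = pixel_scale
    xmin = x - i * sx
    ymax = y + j * sy
    return xmin, ymax - rows * sy, xmin + cols * sx, ymax

# Synthetic little-endian header whose first IFD starts at byte 8.
header = b"II" + struct.pack("<HI", 42, 8)
# Made-up tag values for a 100 x 50 pixel image.
bbox = bounding_box((0, 0, 0, -125.0, 50.0, 0), (0.5, 0.5, 0), 100, 50)
```

The coordinate reference system lives in a separate tag, the GeoKeyDirectoryTag (34735), which this sketch does not decode.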
Geodata?
- Bounding Box easy to read
- Coordinate Reference System horrible to read
- Turn it into a proj4 string and use the proj4j lib to parse it
Compressions
- Huffman, CCITT3, CCITT4, PackBits
- LZW
- Zip (Deflate)
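PackBits is the simplest of these schemes, so it makes a compact example of what a decompressor has to do. The sketch below follows the run-length rules from the TIFF specification; it is written in Python for illustration, the function name is hypothetical, and it is not the GeoTrellis decoder.

```python
def packbits_decode(data):
    """Decode a PackBits-compressed byte string.
    Each run starts with a header byte n (read as unsigned here):
      0..127   -> copy the next n + 1 bytes literally
      129..255 -> repeat the next byte 257 - n times
      128      -> no-op"""
    out = bytearray()
    i = 0
    while i < len(data):
        n = data[i]
        i += 1
        if n < 128:
            out += data[i:i + n + 1]
            i += n + 1
        elif n > 128:
            out += bytes([data[i]]) * (257 - n)
            i += 1
        # n == 128 carries no data and is skipped
    return bytes(out)

# 0xFE: repeat 0xAA three times; 0x02: copy the next three bytes.
decoded = packbits_decode(bytes([0xFE, 0xAA, 0x02, 0x80, 0x00, 0x2A]))
```

The loop works on raw bytes with index arithmetic only, in the spirit of the arrays-and-while-loops approach the slides recommend.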
Benchmark Disclaimer
- Ran on my development computer
- Conducted with Caliper
- Microbenchmarks: look at relative speed, not absolute speed
- GDAL is called through its Java bindings, and the result is converted into GeoTrellis rasters
- GeoTools output is also converted into GeoTrellis rasters
[Benchmark charts not reproduced; results are about the same for CCITT3 and CCITT4.]
Sidenote about Speed
- Scala is slow when using functional mappings
- Use arrays, while loops, and bit operations instead
- Skip Big-O time complexity analysis (O(n), duh) and use microbenchmarks
Future?
- Tons of compressions left to support; JPEG is hard but needed
- Keep up to date with custom tags
- Add a shapefile reader (GeoTools is fast!)