Distributed Tile Processing

with

GeoTrellis and Spark

Rob Emanuele / @lossyrob

The Challenge

How do we work with very large raster data?

Specifically...

How do we work with the
NASA NEX Downscaled Climate Projections (NEX-DCP30)
open data set?

What is NEX Climate Projection data?

Global Circulation Models

Models for predicting world temperature and precipitation.

IPCC Assessment Report

IPCC = Intergovernmental Panel on Climate Change
Assessment Report 5 (AR5) published in 2014.
More than 800 authors

3 Key Categories:

Model
- 33 different models
- Model Ensembling
Dataset
- Temperature MAX
- Temperature MIN
- Precipitation
Scenario
- Historical
- Future RCPs

Representative Concentration Pathways

NEX Downscaled Data

Monthly data over conterminous US
- Historical from 1950 - 2006
- 4 RCP scenarios from 2006 - 2099
8190 netCDF files on S3 - s3://nasanex/NEX-DCP30
15.3 TB in compressed GeoTiff tiles.
RCP 8.5, max for datatype/model combo: 90.92 GB

Our workflow for processing NEX data

The Tools

GeoTrellis

Scala library for doing all things geospatial.
Framework for doing distributed raster processing on Akka and Spark.
Includes local, zonal, focal, and global operations on rasters.
Currently in incubation at LocationTech.
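
To make the four operation families listed above concrete, here is a minimal pure-Python sketch on a raster stored as a list of rows. The function names (`local_add`, `focal_mean`, `zonal_sum`, `global_max`) are hypothetical and are not the GeoTrellis API; they only illustrate what each family computes.

```python
from typing import Dict, List

Raster = List[List[float]]


def local_add(a: Raster, b: Raster) -> Raster:
    """Local operation: each output cell depends only on the same cell of the inputs."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def focal_mean(r: Raster) -> Raster:
    """Focal operation: each output cell is the mean of its 3x3 neighborhood,
    clipped at the raster edges."""
    rows, cols = len(r), len(r[0])
    out = []
    for i in range(rows):
        out_row = []
        for j in range(cols):
            window = [r[y][x]
                      for y in range(max(0, i - 1), min(rows, i + 2))
                      for x in range(max(0, j - 1), min(cols, j + 2))]
            out_row.append(sum(window) / len(window))
        out.append(out_row)
    return out


def zonal_sum(r: Raster, zones: List[List[int]]) -> Dict[int, float]:
    """Zonal operation: aggregate cell values per zone id from a second raster."""
    totals: Dict[int, float] = {}
    for value_row, zone_row in zip(r, zones):
        for value, zone in zip(value_row, zone_row):
            totals[zone] = totals.get(zone, 0) + value
    return totals


def global_max(r: Raster) -> float:
    """Global operation: the result depends on every cell of the raster."""
    return max(max(row) for row in r)
```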

Spark

Fast and general engine for large-scale data processing
Does things Hadoop doesn't, like cache intermediate results in memory.
Written in Scala!
Also has bindings for Python and Java

Accumulo

Bigtable implementation
Has sorted indexing
Columnar database
Also used by GeoMesa, another Scala project at LocationTech

Strategies for working with Big Rasters

Tiles

Source: http://www.geovista.psu.edu/publications/teamJRM/figs/Fig6b.JPG

Indexing tiles

Source: http://www.mathcurve.com/fractals/lebesgue/zcurve.gif
Source: http://media.tumblr.com/tumblr_m0dlqv2Vpq1qir7tc.png
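
The first source image shows a Z-order (Morton) curve. It turns a two-dimensional tile coordinate into a single sortable number by interleaving the bits of the column and the row, so tiles that are close in space tend to be close in the index. A minimal Python sketch follows; `z_index` and `z_decode` are hypothetical names, not GeoTrellis functions.

```python
from typing import Tuple


def z_index(col: int, row: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of col and row into one Morton code."""
    z = 0
    for i in range(bits):
        z |= ((col >> i) & 1) << (2 * i)       # col bits land on even positions
        z |= ((row >> i) & 1) << (2 * i + 1)   # row bits land on odd positions
    return z


def z_decode(z: int, bits: int = 16) -> Tuple[int, int]:
    """Invert z_index: pull the even bits back into col and the odd bits into row."""
    col = row = 0
    for i in range(bits):
        col |= ((z >> (2 * i)) & 1) << i
        row |= ((z >> (2 * i + 1)) & 1) << i
    return col, row
```

Walking the four tiles of a 2x2 block in index order (0, 1, 2, 3) visits (0,0), (1,0), (0,1), (1,1), which traces the "Z" shape that gives the curve its name.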

RasterRDD[K]

K is key type, based on tile indexing.

SpatialKey
TemporalKey
SpaceTimeKey
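
As a rough illustration of the three key types, here are Python analogues written as frozen dataclasses. The field names (`col`, `row`, `time`) and the helper `series_by_location` are assumptions made for this sketch; the real types are Scala classes in GeoTrellis and may differ in detail.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict, List, Tuple


@dataclass(frozen=True, order=True)
class SpatialKey:
    col: int
    row: int


@dataclass(frozen=True, order=True)
class TemporalKey:
    time: datetime


@dataclass(frozen=True, order=True)
class SpaceTimeKey:
    col: int
    row: int
    time: datetime

    def spatial_key(self) -> SpatialKey:
        return SpatialKey(self.col, self.row)


def series_by_location(
    tiles: Dict[SpaceTimeKey, Any]
) -> Dict[SpatialKey, List[Tuple[datetime, Any]]]:
    """Group tiles keyed by space and time into one time-ordered series per location."""
    series = defaultdict(list)
    for key in sorted(tiles):  # sorts by (col, row, time)
        series[key.spatial_key()].append((key.time, tiles[key]))
    return dict(series)
```

Grouping by the spatial part of the key is the kind of step a per-pixel time-series analysis of monthly climate tiles would start from.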

Data loading

Step 1:

Export the netCDF data into 512x512 GeoTiff tiles.

Python code using GDAL and rasterio.
AWS Auto scaling groups and SQS.
Code: https://github.com/lossyrob/nex-chunker-worker
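
The chunking step can be pictured as cutting each raster into 512x512 pixel windows, with smaller windows along the right and bottom edges. The sketch below only computes the window offsets and sizes; `tile_windows` is a hypothetical helper, and the actual nex-chunker-worker code may handle edges and I/O differently.

```python
from typing import Iterator, Tuple

Window = Tuple[int, int, int, int]  # (col_off, row_off, width, height)


def tile_windows(width: int, height: int, tile_size: int = 512) -> Iterator[Window]:
    """Yield pixel windows that cover a width x height raster, row by row."""
    for row_off in range(0, height, tile_size):
        for col_off in range(0, width, tile_size):
            w = min(tile_size, width - col_off)    # clip the last column of tiles
            h = min(tile_size, height - row_off)   # clip the last row of tiles
            yield col_off, row_off, w, h
```

Each window could then be handed to a raster library to read that block and write it out as its own GeoTiff.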

Step 2:

Ingest the data into Accumulo using GeoTrellis-Spark.

Ingest the GeoTiffs to Accumulo in parallel across a cluster.
Ingest consists of
- reprojection
- mosaicing to tile scheme (TMS)
- pyramiding up zoom levels
- calculating index splits
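
Two of the ingest steps above lend themselves to a small sketch: pyramiding and index splits. In a power-of-two tile pyramid, the tile at (col, row) on zoom level z falls under (col // 2, row // 2) on level z - 1, so up to four children feed each parent. For splits, one plausible approach (an assumption here, since the slides do not say how GeoTrellis computes them) is to pick evenly spaced values from the sorted tile indices. All function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Key = Tuple[int, int]  # (col, row) within one zoom level


def parent_key(key: Key) -> Key:
    """Key of the tile one zoom level up that covers this tile."""
    col, row = key
    return col // 2, row // 2


def pyramid_up(keys: List[Key]) -> Dict[Key, List[Key]]:
    """Group child keys by parent; each group would be merged and resampled into one tile."""
    groups: Dict[Key, List[Key]] = defaultdict(list)
    for key in keys:
        groups[parent_key(key)].append(key)
    return dict(groups)


def index_splits(sorted_indices: List[int], partitions: int) -> List[int]:
    """Pick partitions - 1 split points that divide the sorted indices into even ranges."""
    n = len(sorted_indices)
    return [sorted_indices[(i * n) // partitions] for i in range(1, partitions)]
```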

Analysis of NEX data

Live coding session...

Thanks!

Take it away Johan...

The GeoTiff File Format

with

GeoTrellis and Scala

Johan Stenberg / @johanstenbergg

How do you read GeoTiffs on the JVM?

GDAL, Geospatial C lib, fast!

GeoTools, Geospatial Java lib, speed?

Why yet another GeoTiff Reader?

GeoTools large dependency

GDAL Java bindings hard to install

Go-To raster file format at GeoTrellis

GeoTrellis is all about speed, everything optimized and benchmarked

What is the GeoTiff file format?

Extension to the Tiff File Format

Used for images with Geospatial Metadata

Adds a bounding box and the CRS through tags

Geodata?

Bounding Box easy to read

Coordinate Reference System horrible to read

Turn it into a proj4 string and use the proj4j lib to read
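
A hedged sketch of both halves of the geodata problem. The bounding box follows from the standard GeoTIFF ModelTiepoint and ModelPixelScale tag values plus the image size, assuming a north-up image with no rotation. For the CRS, the sketch only shows what a proj4 string looks like once it has been built, by splitting it into parameters; `bounding_box` and `parse_proj4` are hypothetical helpers, not part of GeoTrellis or proj4j.

```python
from typing import Dict, Sequence, Tuple, Union


def bounding_box(
    tiepoint: Sequence[float],     # (i, j, k, x, y, z): raster point -> model point
    pixel_scale: Sequence[float],  # (scale_x, scale_y, scale_z)
    width: int,
    height: int,
) -> Tuple[float, float, float, float]:
    """Return (xmin, ymin, xmax, ymax) for a north-up, unrotated image."""
    i, j, _k, x, y, _z = tiepoint
    sx, sy, _sz = pixel_scale
    xmin = x - i * sx          # model x of raster column 0
    ymax = y + j * sy          # model y of raster row 0 (y decreases downwards)
    return xmin, ymax - height * sy, xmin + width * sx, ymax


def parse_proj4(definition: str) -> Dict[str, Union[str, bool]]:
    """Split a proj4 string such as '+proj=longlat +datum=WGS84 +no_defs' into parameters."""
    params: Dict[str, Union[str, bool]] = {}
    for token in definition.split():
        if not token.startswith("+"):
            continue
        key, sep, value = token[1:].partition("=")
        params[key] = value if sep else True  # flags like +no_defs carry no value
    return params
```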

Compressions

Huffman, CCITT3, CCITT4, Packbits
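
Of the schemes listed, PackBits is the simplest to show in a few lines. It is a byte-oriented run-length encoding: a header byte n in 0..127 means "copy the next n + 1 bytes literally", a header in 129..255 means "repeat the next byte 257 - n times", and 128 is a no-op. The decoder below is a generic Python sketch of that scheme, not the GeoTrellis implementation (which is written in Scala).

```python
def packbits_decode(data: bytes) -> bytes:
    """Decode a PackBits run-length encoded byte string."""
    out = bytearray()
    i = 0
    while i < len(data):
        n = data[i]
        i += 1
        if n < 128:
            # Literal run: copy the next n + 1 bytes unchanged.
            out += data[i:i + n + 1]
            i += n + 1
        elif n > 128:
            # Repeat run: the next byte is repeated 257 - n times.
            out += bytes([data[i]]) * (257 - n)
            i += 1
        # n == 128 is a no-op header and is skipped.
    return bytes(out)
```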

Benchmark Time!

Benchmark Disclaimer

Ran on my development computer

Conducted with Caliper

Microbenchmarks: look at relative speed, not absolute speed

GDAL is read through the Java bindings, into GeoTrellis rasters

GeoTools is also turned into GeoTrellis rasters

~same for CCITT3 and CCITT4

Sidenote about Speed

Scala slow when using functional mappings

Arrays, while loops and bit operations

Skip Big-O time complexity analysis (O(n) - duh), use microbenchmarks

Future?

Tons of compressions, JPEG hard but needed

Keep up to date with custom tags

Add a shape file reader (GeoTools is fast!)

QUESTIONS?

Benchmarks found at https://github.com/geotrellis/benchmark

http://geotrellis.io
