Distributed Tile Processing with GeoTrellis and Spark

Rob Emanuele / @lossyrob

The Challenge


How do we work with very large raster data?

Specifically...


How do we work with the NASA NEX Downscaled Climate Projections (NEX-DCP30) open data set?

What is NEX Climate Projection data?

General Circulation Models

Models for predicting world temperature and precipitation.

IPCC Assessment Report


  • IPCC = Intergovernmental Panel on Climate Change
  • Assessment Report 5 (AR5) published in 2014.
  • More than 800 authors

3 Key Categories:

  • Model

    • 33 different models
    • Model Ensembling
  • Dataset

    • Temperature MAX
    • Temperature MIN
    • Precipitation
  • Scenario

    • Historical
    • Future RCPs

Representative Concentration Pathways

NEX Downscaled Data

  • Monthly data over conterminous US
    • Historical from 1950 - 2005
    • 4 RCP scenarios from 2006 - 2099
  • 8190 netCDF files on S3 - s3://nasanex/NEX-DCP30
  • 15.3 TB in compressed GeoTiff tiles.
  • RCP 8.5, max for datatype/model combo: 90.92 GB

Our workflow for processing NEX data

The Tools

GeoTrellis

  • Scala library for doing all things geospatial.
  • Framework for doing distributed raster processing on Akka and Spark.
  • Includes local, zonal, focal, and global operations on rasters.
  • Currently in incubation at LocationTech.
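
To make two of the operation categories concrete, here is a minimal sketch in plain Scala. It does not use the GeoTrellis API: `MiniRaster`, `localAdd`, and `focalSum` are hypothetical names chosen for illustration. A local operation looks at one cell at a time, while a focal operation combines each cell with its neighborhood (assumed here to be a 3x3 window, clipped at the raster edges).

```scala
// Hypothetical stand-in for a raster tile: row-major Int cells.
final case class MiniRaster(cols: Int, rows: Int, cells: Array[Int]) {
  require(cells.length == cols * rows, "cell count must match dimensions")
  def get(col: Int, row: Int): Int = cells(row * cols + col)
}

object MiniOps {
  // Local operation: each output cell depends only on the same input cell.
  def localAdd(r: MiniRaster, n: Int): MiniRaster = {
    val out = new Array[Int](r.cells.length)
    var i = 0
    while (i < out.length) { out(i) = r.cells(i) + n; i += 1 }
    MiniRaster(r.cols, r.rows, out)
  }

  // Focal operation: each output cell is the sum of its 3x3 neighborhood,
  // with the window clipped at the raster edges.
  def focalSum(r: MiniRaster): MiniRaster = {
    val out = new Array[Int](r.cells.length)
    var row = 0
    while (row < r.rows) {
      var col = 0
      while (col < r.cols) {
        var sum = 0
        var dr = -1
        while (dr <= 1) {
          var dc = -1
          while (dc <= 1) {
            val c = col + dc
            val rr = row + dr
            if (c >= 0 && c < r.cols && rr >= 0 && rr < r.rows) sum += r.get(c, rr)
            dc += 1
          }
          dr += 1
        }
        out(row * r.cols + col) = sum
        col += 1
      }
      row += 1
    }
    MiniRaster(r.cols, r.rows, out)
  }
}
```

Zonal operations (aggregating cells per zone) and global operations (where any cell may depend on the whole raster) follow the same array-based pattern but are not shown.
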

Spark

  • Fast and general engine for large-scale data processing
  • Does things Hadoop MapReduce doesn't, like caching intermediate results in memory.
  • Written in Scala!
  • Also has bindings for Python and Java

Accumulo

  • Implementation of Google's BigTable design
  • Has sorted indexing
  • Columnar database
  • Also used by GeoMesa, another Scala project at LocationTech

Strategies for working with Big Rasters

Tiles


Indexing tiles

RasterRDD[K]



K is the key type, based on how the tiles are indexed.

  • SpatialKey
  • TemporalKey
  • SpaceTimeKey
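
One way to turn a two-dimensional tile position into a sortable one-dimensional index is a Z-order (Morton) curve, which interleaves the bits of the column and row. The sketch below is an illustration only: `GridKey` and `ZOrder` are hypothetical names, not the GeoTrellis key types listed above, and whether the index used in this talk is a Z-order curve is an assumption.

```scala
// Hypothetical key type for illustration; not the GeoTrellis SpatialKey.
final case class GridKey(col: Int, row: Int)

object ZOrder {
  // Spread the low 16 bits of x so that a zero bit sits between each pair.
  private def spread(x: Int): Long = {
    var v = x.toLong & 0xFFFFL
    v = (v | (v << 8)) & 0x00FF00FFL
    v = (v | (v << 4)) & 0x0F0F0F0FL
    v = (v | (v << 2)) & 0x33333333L
    v = (v | (v << 1)) & 0x55555555L
    v
  }

  // Interleave bits: col in the even positions, row in the odd positions.
  def index(key: GridKey): Long = spread(key.col) | (spread(key.row) << 1)
}
```

Sorting keys by this index tends to keep spatially neighboring tiles close together in a sorted store, which is what makes range scans over an area practical.
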

Data loading

Step 1:

Export the netCDF data into 512x512 GeoTiff tiles.
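
As a rough sketch of what cutting a raster into 512x512 tiles involves, the code below computes the pixel window of every tile for a raster of a given size. `TileWindow` and `TileLayout` are hypothetical names, and clipping the edge tiles (rather than padding them to full size) is an assumption; the actual export may have handled edges differently.

```scala
// Pixel window of one tile within the source raster (max bounds exclusive).
final case class TileWindow(colMin: Int, rowMin: Int, colMax: Int, rowMax: Int)

object TileLayout {
  // Split a cols x rows raster into windows of at most tileSize x tileSize.
  def windows(cols: Int, rows: Int, tileSize: Int = 512): Seq[TileWindow] = {
    val tilesAcross = (cols + tileSize - 1) / tileSize // ceiling division
    val tilesDown   = (rows + tileSize - 1) / tileSize
    for {
      tr <- 0 until tilesDown
      tc <- 0 until tilesAcross
    } yield TileWindow(
      tc * tileSize,
      tr * tileSize,
      math.min((tc + 1) * tileSize, cols), // clip the last column of tiles
      math.min((tr + 1) * tileSize, rows)  // clip the last row of tiles
    )
  }
}
```
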

Step 2:

Ingest the data into Accumulo using GeoTrellis-Spark.

  • Ingest the GeoTiffs to Accumulo in parallel across a cluster.
  • Ingest consists of
    • reprojection
    • mosaicing to tile scheme (TMS)
    • pyramiding up zoom levels
    • calculating index splits
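
The pyramiding step can be sketched at the level of tile keys. In a power-of-two tile scheme such as TMS, each tile at zoom z covers the same area as a quarter of one tile at zoom z - 1, so the parent key is found by halving the column and row. `ZoomKey` and `Pyramid` are hypothetical names for illustration; merging and resampling the pixel data of the four child tiles is not shown.

```scala
// Hypothetical key for illustration: tile position at a zoom level.
final case class ZoomKey(zoom: Int, col: Int, row: Int)

object Pyramid {
  // In a power-of-two tile scheme, four tiles share one parent a level up.
  def parent(k: ZoomKey): ZoomKey = {
    require(k.zoom > 0, "zoom 0 has no parent")
    ZoomKey(k.zoom - 1, k.col / 2, k.row / 2)
  }

  // Group the tiles of one level by the parent tile they contribute to.
  def levelUp(keys: Seq[ZoomKey]): Map[ZoomKey, Seq[ZoomKey]] =
    keys.groupBy(parent)
}
```

Applying `levelUp` repeatedly, starting from the most detailed zoom level, yields the keys for every coarser level of the pyramid.
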

Analysis of NEX data

Live coding session...

Thanks!

Take it away Johan...

The GeoTiff File Format with GeoTrellis and Scala

Johan Stenberg / @johanstenbergg

How do you read GeoTiffs on the JVM?


  • GDAL, Geospatial C lib, fast!


  • GeoTools, Geospatial Java lib, speed?

Why yet another GeoTiff Reader?


  • GeoTools is a large dependency

  • GDAL Java bindings are hard to install

  • GeoTiff is the go-to raster file format at GeoTrellis

  • GeoTrellis is all about speed, everything optimized and benchmarked

What is the GeoTiff file format?


  • Extension to the Tiff File Format

  • Used for images with Geospatial Metadata

  • Adds a bounding box and the CRS through tags
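
Because a GeoTiff is a Tiff with extra tags, reading one starts with the ordinary 8-byte Tiff header: a byte-order mark ("II" for little-endian, "MM" for big-endian), the magic number 42, and the offset of the first image file directory (IFD), where the tags live. The sketch below is not the GeoTrellis reader; `TiffHeader` and `TiffHeaderReader` are hypothetical names. The three tag ids are the ones the GeoTiff specification defines for georeferencing.

```scala
import java.nio.{ByteBuffer, ByteOrder}

final case class TiffHeader(byteOrder: ByteOrder, firstIfdOffset: Long)

object TiffHeaderReader {
  // GeoTiff tag ids that carry the georeferencing.
  val ModelPixelScaleTag = 33550
  val ModelTiepointTag = 33922
  val GeoKeyDirectoryTag = 34735

  def parse(bytes: Array[Byte]): TiffHeader = {
    require(bytes.length >= 8, "a TIFF header is 8 bytes long")
    val order =
      if (bytes(0) == 'I' && bytes(1) == 'I') ByteOrder.LITTLE_ENDIAN
      else if (bytes(0) == 'M' && bytes(1) == 'M') ByteOrder.BIG_ENDIAN
      else throw new IllegalArgumentException("not a TIFF: bad byte-order mark")
    val buf = ByteBuffer.wrap(bytes).order(order)
    val magic = buf.getShort(2)
    require(magic == 42, s"not a classic TIFF: magic number $magic")
    // The offset is an unsigned 32-bit value, so widen it to Long.
    TiffHeader(order, buf.getInt(4) & 0xFFFFFFFFL)
  }
}
```
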

Geodata?


  • Bounding Box easy to read

  • Coordinate Reference System horrible to read

  • Turn it into a proj4 string and use the proj4j lib to read
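
The "easy" half can be sketched in a few lines. For a north-up image described by one tiepoint and a pixel scale (the ModelTiepointTag and ModelPixelScaleTag values), the bounding box follows from simple arithmetic. This sketch assumes no rotation and a single tiepoint; `BoundingBox` and `GeoTiffExtent` are hypothetical names, and the CRS half is not covered.

```scala
final case class BoundingBox(xmin: Double, ymin: Double, xmax: Double, ymax: Double)

object GeoTiffExtent {
  // tiepoint = (i, j, k, x, y, z): raster position (i, j) maps to world (x, y).
  // pixelScale = (scaleX, scaleY, scaleZ): world units per pixel.
  def fromTags(cols: Int, rows: Int,
               tiepoint: Array[Double], pixelScale: Array[Double]): BoundingBox = {
    require(tiepoint.length >= 6 && pixelScale.length >= 2, "incomplete tag values")
    // Walk back from the tiepoint to the raster's upper-left corner.
    val xmin = tiepoint(3) - tiepoint(0) * pixelScale(0)
    val ymax = tiepoint(4) + tiepoint(1) * pixelScale(1)
    // Rows grow downward, so y decreases as the row number increases.
    BoundingBox(xmin, ymax - rows * pixelScale(1), xmin + cols * pixelScale(0), ymax)
  }
}
```
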

Compressions


  • Huffman, CCITT3, CCITT4, PackBits

  • LZW

  • Zip
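
PackBits is the simplest of these schemes and shows what a decoder looks like. The stream is a sequence of runs, each introduced by a signed header byte n: for n from 0 to 127 the next n + 1 bytes are copied literally, for n from -127 to -1 the next byte is repeated 1 - n times, and -128 is a no-op. The sketch below is not the GeoTrellis implementation; the `PackBits` object is a hypothetical name, and the code assumes well-formed input and a known decoded size.

```scala
object PackBits {
  // Decode a PackBits stream into exactly `expected` bytes.
  def decode(in: Array[Byte], expected: Int): Array[Byte] = {
    val out = new Array[Byte](expected)
    var i = 0 // read position
    var o = 0 // write position
    while (i < in.length && o < expected) {
      val n = in(i).toInt // signed header byte
      i += 1
      if (n >= 0) {
        // Literal run: copy the next n + 1 bytes unchanged.
        val len = n + 1
        System.arraycopy(in, i, out, o, len)
        i += len
        o += len
      } else if (n != -128) {
        // Repeat run: write the next byte 1 - n times.
        val len = 1 - n
        java.util.Arrays.fill(out, o, o + len, in(i))
        i += 1
        o += len
      }
      // n == -128 is a no-op: only the header byte is consumed.
    }
    out
  }
}
```
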

Benchmark Time!


Benchmark Disclaimer


  • Ran on my development computer

  • Conducted with Caliper

  • Microbenchmarks: look at relative speed, not absolute speed

  • GDAL is read through the Java bindings, into GeoTrellis rasters

  • GeoTools is also turned into GeoTrellis rasters

Results are roughly the same for CCITT3 and CCITT4.

Sidenote about Speed


  • Scala is slow when using functional mappings

  • Use arrays, while loops and bit operations instead

  • Skip Big-O time complexity analysis (O(n) - duh), use microbenchmarks
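
To illustrate the two styles being compared, here is a small sketch that thresholds an array of cells both ways and reads one pixel out of a bit-packed byte array. The names `Thresholds`, `withMap`, `withWhile`, and `bitAt` are hypothetical, the most-significant-bit-first packing is an assumption, and no timings are claimed here; which version is faster on a given JVM is exactly what a microbenchmark should decide.

```scala
object Thresholds {
  // Functional version: concise, goes through a function call per element.
  def withMap(cells: Array[Int], limit: Int): Array[Int] =
    cells.map(c => if (c > limit) 1 else 0)

  // Imperative version: a preallocated array and a while loop.
  def withWhile(cells: Array[Int], limit: Int): Array[Int] = {
    val out = new Array[Int](cells.length)
    var i = 0
    while (i < cells.length) {
      out(i) = if (cells(i) > limit) 1 else 0
      i += 1
    }
    out
  }

  // Bit operations: read pixel i from 1-bit data packed eight pixels per byte,
  // most significant bit first (an assumption about the packing order).
  def bitAt(packed: Array[Byte], i: Int): Int =
    (packed(i >> 3) >> (7 - (i & 7))) & 1
}
```

Both threshold versions return the same result, so swapping one for the other is purely a performance decision.
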

Future?


  • Tons of compressions, JPEG hard but needed

  • Keep up to date with custom tags

  • Add a shapefile reader (GeoTools is fast!)

QUESTIONS?


Benchmarks found at https://github.com/geotrellis/benchmark


http://geotrellis.io