Collection Config Guide

Introduction

The Collection Config is a configuration file that defines the collections to be ingested and maintained in SDAP. It currently supports two kinds of collection: NetCDF data, which is processed into the custom NEXUS protobuf tile format, and gridded Zarr data, which SDAP can use directly without any processing. The SDAP Ingester supports source data stored in AWS S3 or on the local filesystem, but currently not both at the same time.

This guide will explain how to set up both protobuf and Zarr collections.

Basic Structure

The Collection Config is a YAML file containing a single list named collections:

collections: []

Each item in this list defines one collection and has the following basic structure:

- id: <single variable collection name>
  path: <root collection location. Local path or S3 URI>
  priority: <queue priority>
  projection: <Grid | Swath>
  dimensionNames:
    latitude: <name of the latitude coordinate in the data>
    longitude: <name of the longitude coordinate in the data>
    time: <name of the time coordinate in the data>
    variable: <variable name>
- id: <multi variable collection name>
  path: <root collection location. Local path or S3 URI>
  priority: <queue priority>
  projection: <GridMulti | SwathMulti>
  dimensionNames:
    latitude: <name of the latitude coordinate in the data>
    longitude: <name of the longitude coordinate in the data>
    time: <name of the time coordinate in the data>
    variables:
    - <variable name 1>
    - <variable name 2>
    - <variable name 3>

There are slight variations and additions to this structure depending on the type of collection, which will be covered below.
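As a rough illustration of the structure above, the following Python sketch checks one parsed collection entry (a plain dict) for the keys shown in the two templates. The helper check_collection is hypothetical and not part of SDAP. It assumes that Grid and Swath take a single variable entry while GridMulti and SwathMulti take a variables list, as the templates suggest; the Ingester itself may be more lenient.

```python
# Hypothetical helper, not part of SDAP: checks one collection entry
# (a dict, as produced by parsing the YAML) against the basic structure.
REQUIRED_KEYS = ("id", "path", "priority", "projection", "dimensionNames")
SINGLE_VAR = {"Grid", "Swath"}
MULTI_VAR = {"GridMulti", "SwathMulti"}


def check_collection(collection):
    """Return a list of problems found in one collection entry."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in collection]

    projection = collection.get("projection")
    dims = collection.get("dimensionNames", {})

    # The three coordinate names appear in both templates.
    for coord in ("latitude", "longitude", "time"):
        if coord not in dims:
            problems.append(f"dimensionNames is missing: {coord}")

    # Assumption: single-variable projections use "variable",
    # Multi projections use a "variables" list.
    if projection in SINGLE_VAR:
        if "variable" not in dims:
            problems.append(f"{projection} expects dimensionNames.variable")
    elif projection in MULTI_VAR:
        if not isinstance(dims.get("variables"), list):
            problems.append(f"{projection} expects a dimensionNames.variables list")
    elif projection is not None:
        problems.append(f"unknown projection: {projection}")

    return problems


example = {
    "id": "example-collection",
    "path": "/data/example/",
    "priority": 1,
    "projection": "GridMulti",
    "dimensionNames": {
        "latitude": "lat",
        "longitude": "lon",
        "time": "time",
        "variable": "sst",  # wrong key for a Multi projection
    },
}
print(check_collection(example))
```

Here the only reported problem is the missing variables list, since the entry uses the single-variable key with a Multi projection.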

NetCDF - Protobuf Collections

For NetCDF data, you also need to tell the Ingester how large to make the tiles. This is set with the slices object, a dictionary mapping dimension names to slice lengths. Omitted dimensions are assumed to have a slice length of 1. Choose tile sizes carefully: tiles that are too large cause unnecessary data transfer, while tiles that are too small cause an explosion in the number of generated tiles, which leads to excessive metadata storage overhead and possible performance degradation. For gridded data, we recommend tile sizes between 30 x 30 and 100 x 100. We also strongly recommend that swath tiles be no larger than 15 x 15, because the current method for handling swath data is very memory inefficient and its memory use grows rapidly with tile size.

Note

The source dataset dimension names are used in slice definitions, not the coordinate names as in the dimensionNames object. In gridded datasets, these names are often the same, but this is not the case for swath data.
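To get a feel for the tile-size trade-off, the sketch below estimates how many tiles one file would produce for a given slices setting. It assumes each dimension is cut into consecutive slices with a partial slice at the edge, which may not match the Ingester's exact behaviour. The function tiles_per_granule and the grid dimensions are hypothetical and chosen only for illustration; the keys are source dataset dimension names, as the note above requires.

```python
import math


def tiles_per_granule(dim_sizes, slices):
    """Estimate how many tiles one granule yields.

    dim_sizes: dimension name -> length in the source file
    slices:    dimension name -> slice length (omitted dimensions use 1)
    """
    count = 1
    for name, size in dim_sizes.items():
        step = slices.get(name, 1)
        # A partial slice at the edge still counts as one tile.
        count *= math.ceil(size / step)
    return count


# Hypothetical 0.25-degree global grid with one time step per file.
grid = {"time": 1, "lat": 720, "lon": 1440}

print(tiles_per_granule(grid, {"lat": 100, "lon": 100}))  # 8 * 15 = 120
print(tiles_per_granule(grid, {"lat": 30, "lon": 30}))    # 24 * 48 = 1152
print(tiles_per_granule(grid, {}))                        # 720 * 1440 = 1036800
```

The last call shows why omitting dimensions matters: with every slice length defaulting to 1, each grid cell becomes its own tile.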

Example:

collections:
- id: MUR25-JPL-L4-GLOB-v04.2
  path: s3://mur-sst/zarr-v1/
  priority: 1
  projection: Grid
  dimensionNames:
    latitude: lat
    longitude: lon
    time: time
    variable: analysed_sst
  slices:
    lat: 100
    lon: 100
    time: 1
- id: ASCATB-L2-Coastal
  path: s3://example-bucket/swath-path/
  priority: 1
  projection: SwathMulti
  dimensionNames:
    latitude: lat
    longitude: lon
    time: time
    variables:
    - wind_speed
    - wind_dir
  slices:
    NUMROWS: 15
    NUMCELLS: 15

Zarr Collections

To specify a collection as a Zarr collection, simply add storeType: zarr to the collection object. If the data is local, this is all you need to do.

id: <collection name>
path: <root collection location. Local path>
priority: <queue priority>
projection: <Grid | GridMulti>
storeType: zarr
dimensionNames:
  latitude: <name of the latitude coordinate in the data>
  longitude: <name of the longitude coordinate in the data>
  time: <name of the time coordinate in the data>
  variable: <variable name>

For data in S3, you need to provide information on how to access the data. This is currently done with the config.aws object.

Either provide credentials for the bucket (accessKeyID and secretAccessKey) or mark the bucket as public with public: true.

Example:

collections:
- id: MUR_SST
  path: s3://mur-sst/zarr-v1/
  priority: 1
  projection: Grid
  storeType: zarr
  dimensionNames:
    latitude: lat
    longitude: lon
    time: time
    variable: analysed_sst
  config:
    aws:
      public: true
- id: private_data
  path: s3://example-bucket/zarr/path/
  priority: 1
  projection: GridMulti
  storeType: zarr
  dimensionNames:
    latitude: lat
    longitude: lon
    time: time
    variables:
    - var1
    - var2
    - var3
  config:
    aws:
      accessKeyID: <secret>
      secretAccessKey: <secret>
      public: false
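The access rules for Zarr collections can be summarised in a small check. This is only a sketch: check_zarr_access is a hypothetical helper, not part of SDAP, and it assumes that S3 paths always start with s3:// and that a non-public bucket needs both accessKeyID and secretAccessKey.

```python
def check_zarr_access(collection):
    """Return a list of problems with a Zarr collection's access settings."""
    problems = []
    if collection.get("storeType") != "zarr":
        return problems  # not a Zarr collection, nothing to check here

    if not str(collection.get("path", "")).startswith("s3://"):
        return problems  # local data needs no config.aws block

    aws = collection.get("config", {}).get("aws")
    if aws is None:
        return ["S3 path given but config.aws is missing"]

    if aws.get("public") is True:
        return problems  # public bucket, no credentials needed

    # Assumption: both keys are required when the bucket is not public.
    for key in ("accessKeyID", "secretAccessKey"):
        if not aws.get(key):
            problems.append(f"config.aws.{key} is required for a non-public bucket")
    return problems


public_collection = {
    "id": "MUR_SST",
    "path": "s3://mur-sst/zarr-v1/",
    "storeType": "zarr",
    "config": {"aws": {"public": True}},
}
incomplete_collection = {
    "id": "private_data",
    "path": "s3://example-bucket/zarr/path/",
    "storeType": "zarr",
    "config": {"aws": {"accessKeyID": "<secret>", "public": False}},
}

print(check_zarr_access(public_collection))
print(check_zarr_access(incomplete_collection))
```

The first entry passes, while the second is flagged because secretAccessKey is absent.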