Remote LLC4320 Access (Kerchunk + Dask) for https://mghp.osn.xsede.org

This module provides lazy, remote access to LLC4320 model output and grid files using kerchunk references, xarray, and Dask. Data are accessed directly from an S3-compatible endpoint without downloading raw binaries.

Raw Data Source

There are multiple sources for the LLC4320 data. This codebase currently supports only one: https://mghp.osn.xsede.org. The data were kindly archived by Spencer Jones.

Most of our code for accessing data from this source is taken from this notebook: https://github.com/cspencerjones/OSN_LLC4320/blob/main/Open_llc4320_surface_velocities.ipynb

This dataset is a subset of the full LLC4320 output. It contains only the surface level and the following fields: Theta, U, V, W, Salt, Eta. The fields are stored as float32 at hourly intervals. Spencer Jones is the expert on this dataset.

Design Overview

Kerchunk JSON files map LLC4320 binaries to Zarr-style references.

xarray.open_dataset(…, engine="zarr") opens these references lazily.

Dask defers all I/O and enables parallel reads across faces.

Individual LLC faces are merged using xr.combine_by_coords.

Custom close handlers ensure all underlying file references are released.
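
The first two points above can be sketched as follows. This is a minimal illustration of the general kerchunk pattern, not necessarily how get_raw_data implements it. The function names and the reference_json argument are hypothetical, reference_json is assumed to be a local kerchunk JSON path or an already-loaded reference dict, and the sketch assumes fsspec, s3fs, zarr, and xarray are installed.

```python
def reference_fs_kwargs(reference_json, endpoint_url):
    """Build keyword arguments for fsspec's "reference" filesystem.

    reference_json: hypothetical local path to a kerchunk JSON file,
                    or an already-loaded reference dict.
    endpoint_url:   S3-compatible endpoint that holds the raw binaries.
    """
    return {
        "fo": reference_json,
        "remote_protocol": "s3",
        "remote_options": {
            "anon": True,  # anonymous, read-only access
            "client_kwargs": {"endpoint_url": endpoint_url},
        },
    }


def open_reference(reference_json, endpoint_url):
    """Open one kerchunk reference lazily as an xarray Dataset."""
    # Imported here so the helper above stays usable without these packages.
    import fsspec
    import xarray as xr

    fs = fsspec.filesystem(
        "reference", **reference_fs_kwargs(reference_json, endpoint_url)
    )
    return xr.open_dataset(
        fs.get_mapper(""),
        engine="zarr",
        backend_kwargs={"consolidated": False},
        chunks={},  # keep variables as lazy Dask arrays
    )
```

Nothing is read from the endpoint until a variable is computed or loaded; open_reference only parses the reference metadata.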

Import

import data_ingestion.get_raw_data as get_raw_data

Dask Setup and Usage

First, set up a Dask client for efficient loading.

from dask.distributed import Client
client = Client()  # uses all local cores by default

Next, set up the range of iterations you want to sample.

LLC4320 output is indexed by iteration number.

Important values:

  • First valid wind/forcing record: ~1180

  • Last iteration: 1495008

  • Timesteps per hour: 144 (the model time step is 25 seconds, so 3600 / 25 = 144 steps per hour, but snapshots are only available once per hour)

Example iteration range, starting at the first valid wind record and covering the next 12 hours:

import numpy as np

timestep_hours = 12                     # how many hours to load
sampling_step = 1                       # stride in hourly snapshots
ts_per_hour = 144                       # model timesteps per hour (constant)

iter_step = sampling_step * ts_per_hour  # stride in model iterations

start_record = 1180                     # first valid wind/forcing record

# 10368 is the iteration offset of hourly record 0
start_iter = 10368 + start_record * ts_per_hour
end_iter = start_iter + timestep_hours * ts_per_hour

iter_range = np.arange(start_iter, end_iter, iter_step)
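
The record-to-iteration arithmetic above can be wrapped in small helpers that also catch misaligned or out-of-range iterations before any remote read is attempted. This is a sketch under stated assumptions: the helper names are hypothetical, and treating 10368 as the iteration of hourly record 0 is inferred from the formula in the example rather than stated by the data source.

```python
FIRST_ITER = 10368          # offset used in the example above (assumed record 0)
LAST_ITER = 1495008         # last iteration listed under "Important values"
TS_PER_HOUR = 3600 // 25    # 25-second model steps, i.e. 144 per hour


def record_to_iter(record):
    """Convert an hourly record index to a model iteration number."""
    return FIRST_ITER + record * TS_PER_HOUR


def iter_to_record(iteration):
    """Convert a model iteration number back to its hourly record index."""
    offset = iteration - FIRST_ITER
    if offset < 0 or offset % TS_PER_HOUR != 0:
        raise ValueError(f"iteration {iteration} is not on an hourly snapshot")
    if iteration > LAST_ITER:
        raise ValueError(f"iteration {iteration} is past the last iteration")
    return offset // TS_PER_HOUR


def check_iter_range(iterations):
    """Raise ValueError if any iteration is misaligned or out of range."""
    for it in iterations:
        iter_to_record(int(it))
```

For instance, check_iter_range(iter_range) passes silently for the 12-hour range built above.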

Load in grid file

endpoint_url = "https://mghp.osn.xsede.org"
co = get_raw_data.get_remote_gridfile(endpoint_url)

Loads and combines grid variables (XC, YC, metrics, CS/SN, etc.) for all 13 faces.

get_remote_llc_data loads all requested faces (face_range) for a single iteration, so it should be called inside a loop over iter_range.

it = iter_range[0]
ds = get_raw_data.get_remote_llc_data(endpoint_url, it, face_range)

Returns surface-level fields only (time=0, k=0, k_l=0)

Data remain lazy until .compute() is called

Example:

theta_mean = ds["Theta"].mean().compute()
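
The per-iteration loop mentioned above could look like the following sketch. The function name and its arguments are hypothetical; the loader is passed in as a callable so the pattern stays independent of this module, and each dataset is closed after use so its file references are released.

```python
def summarize_iterations(load_one, iterations, summarize):
    """Load one iteration at a time, reduce it, and close it.

    load_one:  callable taking an iteration number and returning a dataset
               (any object with a close() method).
    summarize: callable that turns the dataset into a small in-memory result.
    Returns a dict mapping iteration number to that result.
    """
    results = {}
    for it in iterations:
        ds = load_one(int(it))
        try:
            # Trigger the actual computation while the dataset is still open.
            results[int(it)] = summarize(ds)
        finally:
            ds.close()
    return results
```

With this module, load_one could be lambda it: get_raw_data.get_remote_llc_data(endpoint_url, it, face_range), and summarize could be lambda ds: float(ds["Theta"].mean().compute()). Reducing each iteration before moving on keeps memory use bounded by a single iteration.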

Notes

Anonymous, read-only S3 access

One kerchunk JSON per face per iteration

Horizontal chunking applied (i, j)