Remote LLC4320 Access (Kerchunk + Dask) for https://mghp.osn.xsede.org
This module provides lazy, remote access to LLC4320 model output and grid files using kerchunk references, xarray, and Dask. Data are accessed directly from an S3-compatible endpoint without downloading raw binaries.
Raw Data Source
There are multiple sources for the LLC4320 data. This codebase currently supports only one of them: https://mghp.osn.xsede.org. The data were kindly archived by Spencer Jones.
Most of our code for reading data from this source is taken from this notebook: https://github.com/cspencerjones/OSN_LLC4320/blob/main/Open_llc4320_surface_velocities.ipynb
This dataset is a subset of the full LLC4320 output. It contains only the surface level and the following fields: Theta, U, V, W, Salt, Eta. The fields are stored as float32 at hourly intervals. Spencer Jones is the expert on this dataset.
Design Overview
Kerchunk JSON files map LLC4320 binaries to Zarr-style references.
xarray.open_dataset(…, engine="zarr") opens these references lazily.
Dask defers all I/O and enables parallel reads across faces.
Individual LLC faces are merged using xr.combine_by_coords.
Custom close handlers ensure all underlying file references are released.
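The face-merging step can be illustrated without any remote access. The sketch below builds three small synthetic datasets, each holding a single face, and merges them with xr.combine_by_coords. The variable layout (a face dimension plus j and i) and the helper make_face are assumptions made for illustration only; the real per-face datasets come from the kerchunk references.

```python
import numpy as np
import xarray as xr


def make_face(face_id, n=4):
    """Hypothetical stand-in for one face's dataset (not the real loader)."""
    data = np.full((1, n, n), float(face_id), dtype="float32")
    return xr.Dataset(
        {"Theta": (("face", "j", "i"), data)},
        coords={"face": [face_id], "j": np.arange(n), "i": np.arange(n)},
    )


# Faces may arrive in any order; combine_by_coords orders them by the
# "face" coordinate and concatenates along that dimension.
faces = [make_face(f) for f in (2, 0, 1)]
combined = xr.combine_by_coords(faces)
```

Because j and i are identical in every input, only the face coordinate varies, so it is the single dimension along which the datasets are joined.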
Import
import data_ingestion.get_raw_data as get_raw_data
Dask Setup and Usage
First, set up a Dask client for efficient loading.
from dask.distributed import Client
client = Client() # uses all local cores by default
Next, choose the range of iterations you want to sample.
LLC4320 output is indexed by iteration number.
Important values:
First valid wind/forcing record: ~1180
Last iteration: 1495008
Timesteps per hour: 144 (the model time step is 25 seconds, so 3600 / 25 = 144 iterations separate consecutive hourly snapshots)
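The arithmetic behind these values can be written out as two small helpers. This is a sketch under one assumption: that iteration 10368, the offset used in this section's example range, corresponds to hourly record 0. The function names are hypothetical and not part of this module.

```python
MODEL_DT_SECONDS = 25   # model time step, from the list above
FIRST_ITER = 10368      # assumed iteration of record 0 (offset used in this README)
LAST_ITER = 1495008     # last iteration, from the list above

TS_PER_HOUR = 3600 // MODEL_DT_SECONDS  # 144 model steps per hourly snapshot


def record_to_iteration(record):
    """Convert an hourly record index to a model iteration number."""
    iteration = FIRST_ITER + record * TS_PER_HOUR
    if record < 0 or iteration > LAST_ITER:
        raise ValueError(f"record {record} is outside the archived range")
    return iteration


def iteration_to_record(iteration):
    """Convert a model iteration number back to its hourly record index."""
    offset = iteration - FIRST_ITER
    if offset < 0 or iteration > LAST_ITER or offset % TS_PER_HOUR:
        raise ValueError(f"iteration {iteration} is not an hourly snapshot")
    return offset // TS_PER_HOUR
```

Under that assumption, record 1180 maps to iteration 180288, and the last iteration falls exactly on an hourly record.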
Example iteration range, starting at the first valid wind record and covering the next 12 hours:
import numpy as np
timestep_hours = 12 # how many hours to load
sampling_step = 1 # stride in hourly records (1 = every hour)
ts_per_hour = 144 # model timesteps per hour (constant)
iter_step = sampling_step * ts_per_hour
start_record = 1180 # first valid wind/forcing record
start_iter = 10368 + start_record * ts_per_hour # 10368 is the iteration offset applied to record 0
end_iter = start_iter + timestep_hours * ts_per_hour
iter_range = np.arange(start_iter, end_iter, iter_step)
Load the grid file
endpoint_url = "https://mghp.osn.xsede.org" # assumed to be the source endpoint named above
co = get_raw_data.get_remote_gridfile(endpoint_url)
Loads and combines grid variables (XC, YC, metrics, CS/SN, etc.) for all 13 faces.
Load all requested faces for a single iteration. To cover several iterations, call this inside a loop over iter_range; face_range selects which faces to load.
it = iter_range[0]
ds = get_raw_data.get_remote_llc_data(endpoint_url, it, face_range)
Returns surface-level fields only (time=0, k=0, k_l=0)
Data remain lazy until .compute() is called
Example:
theta_mean = ds["Theta"].mean().compute()
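Because each call returns a single iteration, a multi-hour analysis usually loops over the iteration range, reduces each snapshot, and closes it before moving on. The following is a minimal sketch of that pattern, not this module's implementation: summarize_iterations, FakeSnapshot, and fake_load are hypothetical names, and the loader is passed in as an argument so that get_raw_data.get_remote_llc_data can be supplied in real use.

```python
def summarize_iterations(load, endpoint_url, iterations, face_range, reduce_fn):
    """Apply reduce_fn to one snapshot per iteration, closing each afterwards."""
    results = {}
    for it in iterations:
        ds = load(endpoint_url, int(it), face_range)
        try:
            results[int(it)] = reduce_fn(ds)
        finally:
            ds.close()  # release the underlying file references
    return results


class FakeSnapshot:
    """Hypothetical stand-in for a loaded dataset, used only in this demo."""

    def __init__(self, iteration):
        self.iteration = iteration
        self.closed = False

    def close(self):
        self.closed = True


opened = []


def fake_load(endpoint_url, iteration, face_range):
    """Hypothetical loader that records every snapshot it hands out."""
    snapshot = FakeSnapshot(iteration)
    opened.append(snapshot)
    return snapshot


demo = summarize_iterations(
    fake_load,
    "https://mghp.osn.xsede.org",
    [180288, 180432],
    [0, 1],  # placeholder face selection; ignored by fake_load
    reduce_fn=lambda ds: ds.iteration,
)
```

In real use, the reducer might be something like lambda ds: float(ds["Theta"].mean().compute()), so that only a scalar per iteration is kept in memory.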
Notes
Anonymous, read-only S3 access
One kerchunk JSON per face per iteration
Horizontal chunking applied (i, j)
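The first two notes correspond to the storage options that fsspec's reference filesystem accepts. The sketch below shows one common way to open a single kerchunk JSON anonymously; it is an assumption about how such references are typically opened, not a copy of this module's code. The function names and the ref_json argument are hypothetical, ref_json is assumed to be a locally readable path or an already-loaded reference dict, and xarray is imported inside open_reference so that the options helper can be inspected on its own.

```python
def reference_storage_options(ref_json, endpoint_url):
    """Build fsspec 'reference' filesystem options for anonymous S3 reads."""
    return {
        "fo": ref_json,  # one kerchunk JSON (per face, per iteration)
        "remote_protocol": "s3",
        "remote_options": {
            "anon": True,  # anonymous, read-only access
            "client_kwargs": {"endpoint_url": endpoint_url},
        },
    }


def open_reference(ref_json, endpoint_url):
    """Open one kerchunk reference lazily as an xarray Dataset."""
    import xarray as xr  # imported lazily; only needed when actually opening

    return xr.open_dataset(
        "reference://",
        engine="zarr",
        backend_kwargs={
            "consolidated": False,
            "storage_options": reference_storage_options(ref_json, endpoint_url),
        },
        chunks={},  # keep variables as lazy Dask arrays
    )


# "refs.json" is a placeholder file name, not a real reference in the archive.
options = reference_storage_options("refs.json", "https://mghp.osn.xsede.org")
```

Keeping the options in a separate helper makes it easy to confirm that anonymous access and the custom endpoint are set before any network request is made.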