# Remote LLC4320 Access (Kerchunk + Dask) for https://mghp.osn.xsede.org

This module provides lazy, remote access to LLC4320 model output and grid files using kerchunk references, xarray, and Dask. Data are accessed directly from an S3-compatible endpoint without downloading raw binaries.

## Raw Data Source
LLC4320 data are available from several sources. This codebase currently supports only one of them: https://mghp.osn.xsede.org.
The data were kindly archived there by Spencer Jones.

The majority of our code to download data from this source is taken from this repo: https://github.com/cspencerjones/OSN_LLC4320/blob/main/Open_llc4320_surface_velocities.ipynb 

This dataset is a subset of the full LLC4320 output. It contains only the surface level of the following fields: Theta, U, V, W, Salt, and Eta. The fields are stored as float32 at hourly intervals. Spencer Jones is the expert on this archive.

## Design Overview

- Kerchunk JSON files map LLC4320 binaries to Zarr-style references.
- `xarray.open_dataset(..., engine="zarr")` opens these references lazily.
- Dask defers all I/O and enables parallel reads across faces.
- Individual LLC faces are merged using `xr.combine_by_coords`.
- Custom close handlers ensure all underlying file references are released.
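The first two points can be illustrated with a minimal sketch. This is not the module's actual implementation: the helper names `reference_open_kwargs` and `open_reference` are hypothetical, and the sketch assumes the commonly used fsspec `reference://` pattern for kerchunk JSON files with anonymous S3 access. It needs xarray, zarr, fsspec, and s3fs installed, and the exact options may differ between library versions.

```python
def reference_open_kwargs(ref_json, endpoint_url):
    """Build keyword arguments for xarray.open_dataset from one kerchunk JSON file.

    ref_json     : path to a kerchunk reference file (hypothetical name)
    endpoint_url : S3-compatible endpoint that holds the raw binaries
    """
    # Anonymous, read-only access to the S3-compatible store.
    s3_options = {"anon": True, "client_kwargs": {"endpoint_url": endpoint_url}}
    return {
        "engine": "zarr",
        "chunks": {},  # back variables with Dask arrays so reads stay lazy
        "backend_kwargs": {
            "consolidated": False,  # do not require consolidated metadata
            "storage_options": {
                "fo": ref_json,               # the reference file itself
                "remote_protocol": "s3",      # where the referenced bytes live
                "remote_options": s3_options,
            },
        },
    }


def open_reference(ref_json, endpoint_url):
    """Open one reference file lazily (assumed pattern, not the module's code)."""
    import xarray as xr  # imported here so the helper above has no dependencies

    return xr.open_dataset("reference://", **reference_open_kwargs(ref_json, endpoint_url))
```

If the reference JSON is itself stored remotely rather than on local disk, fsspec's `target_protocol` and `target_options` would probably also be needed in `storage_options`.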

## Import 
```
import data_ingestion.get_raw_data as get_raw_data
```

## Dask Setup and Usage

First, set up a Dask client for efficient loading.
```
from dask.distributed import Client
client = Client()  # uses all local cores by default
```

Next, choose the range of iterations you want to sample.

LLC4320 output is indexed by iteration number.

Important values:
- First valid wind/forcing record: ~1180
- Last iteration: 1495008
- Timesteps per hour: 144 (the LLC model has a 25-second timestep, but snapshots are available only once per hour)


Example iteration range, starting at the first valid wind record and covering the next 12 hours:
```
import numpy as np

timestep_hours = 12                     # how many hours to load
sampling_step = 1                       # stride in hours (1 = every hourly snapshot)
ts_per_hour = 144                       # model timesteps per hour (constant)

iter_step = sampling_step * ts_per_hour # stride in model timesteps

start_record = 1180                     # approximate first valid wind/forcing record

start_iter = 10368 + start_record * ts_per_hour   # 10368 is the iteration of record 0
end_iter = start_iter + timestep_hours * ts_per_hour

iter_range = np.arange(start_iter, end_iter, iter_step)   # end_iter is exclusive
```
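The same arithmetic can be wrapped in a small helper that also checks the request against the archive bounds. This is a sketch only: `hourly_iterations` and the constant names are hypothetical, and the bounds are simply the values quoted in this README (10368 as the iteration of record 0, 1495008 as the last iteration).

```python
FIRST_ITER = 10368      # iteration of record 0, as used in the example above
LAST_ITER = 1495008     # last iteration listed under "Important values"
TS_PER_HOUR = 144       # 3600 s per hour / 25 s per model timestep


def hourly_iterations(start_record, n_hours, stride_hours=1):
    """Return iteration numbers covering n_hours of snapshots from start_record."""
    if n_hours <= 0 or stride_hours <= 0:
        raise ValueError("n_hours and stride_hours must be positive")
    start = FIRST_ITER + start_record * TS_PER_HOUR
    stop = start + n_hours * TS_PER_HOUR            # exclusive upper bound
    iterations = list(range(start, stop, stride_hours * TS_PER_HOUR))
    # Reject requests that start before record 0 or run past the last iteration.
    if start < FIRST_ITER or iterations[-1] > LAST_ITER:
        raise ValueError("requested range falls outside the archived iterations")
    return iterations
```

With a stride of one hour, `np.asarray(hourly_iterations(1180, 12))` holds the same values as `iter_range` in the example above.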

Load the grid file:
```
endpoint_url = "https://mghp.osn.xsede.org"   # the only source currently supported
co = get_raw_data.get_remote_gridfile(endpoint_url)
```

This loads and combines the grid variables (XC, YC, metrics, CS/SN, etc.) for all 13 faces.

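Load all requested faces for one iteration. In practice, this call belongs inside a loop over `iter_range`.

```
# face_range (the faces to load) must be defined by the caller beforehand
it = iter_range[0]
ds = get_raw_data.get_remote_llc_data(endpoint_url, it, face_range)
```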
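A loop over iterations could be sketched as follows. The loader is passed in as a callable so the pattern does not depend on the exact signature of `get_remote_llc_data`; the names `reduce_over_iterations`, `load_fn`, and `reduce_fn` are hypothetical. The sketch assumes each loaded object has a `.close()` method, as xarray datasets do.

```python
def reduce_over_iterations(load_fn, iterations, reduce_fn):
    """Load one snapshot per iteration, reduce it, and release it before the next."""
    results = {}
    for it in iterations:
        ds = load_fn(int(it))                  # expected to be a lazy open
        try:
            results[int(it)] = reduce_fn(ds)   # the reduction triggers the reads
        finally:
            ds.close()                         # release references even on failure
    return results


# Possible usage with this module (endpoint_url, face_range, and iter_range
# are assumed to be defined by the caller):
# means = reduce_over_iterations(
#     lambda it: get_raw_data.get_remote_llc_data(endpoint_url, it, face_range),
#     iter_range,
#     lambda ds: float(ds["Theta"].mean().compute()),
# )
```

Reducing each snapshot before opening the next keeps only one dataset open at a time, at the cost of not overlapping reads across iterations.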

The returned dataset contains surface-level fields only (time=0, k=0, k_l=0).

Data remain lazy until `.compute()` is called.

Example:
```
theta_mean = ds["Theta"].mean().compute()
```

## Notes

- Anonymous, read-only S3 access
- One kerchunk JSON per face per iteration
- Horizontal chunking applied (i, j)