Dataset

class earth_data_kit.stitching.Dataset(name, source, engine, format, clean=True)

The Dataset class is the main entry point of the stitching module. It wraps a single remote dataset, which may consist of multiple files.

Initialize a new dataset instance.

Parameters:
  • name (str) – Unique identifier for the dataset

  • source (str) – Source identifier (S3 URI or Earth Engine collection ID)

  • engine (str) – Data source engine - s3, earth_engine or stac

  • format (str) – Data format - geotiff, netcdf, earth_engine or stac_asset

  • clean (bool, optional) – Whether to clean temporary files before processing. Defaults to True

Raises:

Exception – If the provided engine is not supported

Example

>>> from earth_data_kit.stitching.classes.dataset import Dataset
>>> # Earth Engine example
>>> ds = Dataset("example_dataset", "LANDSAT/LC08/C01/T1_SR", "earth_engine", "earth_engine")
>>> # S3 example
>>> ds = Dataset("example_dataset", "s3://your-bucket/path", "s3", "netcdf")
>>> # STAC example
>>> ds = Dataset("example_dataset", "https://planetarycomputer.microsoft.com/api/stac/v1/collections/sentinel-2-l2a", "stac", "stac_asset")

set_timebounds(start, end, resolution=None)

Set time bounds for data download and optional temporal resolution for combining images.

Parameters:
  • start (datetime.datetime) – Start of the time range to download

  • end (datetime.datetime) – End of the time range to download

  • resolution (str, optional) – Temporal resolution used to combine images, e.g. 'D' (daily) or 'M' (monthly). Defaults to None

Example

>>> import datetime
>>> from earth_data_kit.stitching import Dataset
>>> ds = Dataset("example_dataset", "LANDSAT/LC08/C01/T1_SR", "earth_engine", "earth_engine", clean=True)
>>> ds.set_timebounds(datetime.datetime(2020, 1, 1), datetime.datetime(2020, 12, 31))
>>> # Set daily resolution
>>> ds.set_timebounds(datetime.datetime(2020, 1, 1), datetime.datetime(2020, 12, 31), resolution='D')
>>> # Set monthly resolution
>>> ds.set_timebounds(datetime.datetime(2020, 1, 1), datetime.datetime(2020, 12, 31), resolution='M')

set_spacebounds(bbox, grid_dataframe=None)

Configure spatial constraints for the dataset using a bounding box and, optionally, a grid dataframe.

This method sets the spatial filtering parameters from a bounding box of four coordinates in EPSG:4326. If a grid dataframe is provided, it is used to pinpoint the scene files to download based on the spatial variables in the source path.

Parameters:
  • bbox (tuple[float, float, float, float]) – A tuple of four coordinates in the order (min_longitude, min_latitude, max_longitude, max_latitude)/(xmin, ymin, xmax, ymax) defining the spatial extent.

  • grid_dataframe (geopandas.GeoDataFrame, optional) – A GeoDataFrame containing grid cells with columns that match the spatial variables in the source path (e.g., ‘h’, ‘v’ for MODIS grid). Each row should have a geometry column defining the spatial extent of the grid cell.

Example

>>> import earth_data_kit as edk
>>> import geopandas as gpd
>>> ds = edk.stitching.Dataset("example_dataset", "s3://your-bucket/path/{h}/{v}/B01.TIF", "s3", "geotiff")
>>>
>>> # Setting spatial bounds using a bounding box:
>>> ds.set_spacebounds((19.3044861183, 39.624997667, 21.0200403175, 42.6882473822))
>>>
>>> # Setting spatial bounds with a grid dataframe:
>>> gdf = gpd.GeoDataFrame()
>>> # Assume gdf has columns 'h', 'v' that match the spatial variables in the source path
>>> # and a 'geometry' column with the spatial extent of each grid cell
>>> ds.set_spacebounds((19.3044861183, 39.624997667, 21.0200403175, 42.6882473822), grid_dataframe=gdf)
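To make the grid-dataframe path concrete, here is a minimal, self-contained sketch (with hypothetical 'h'/'v' values and toy extents) of how grid cells intersecting a bounding box identify which {h}/{v} files to fetch:

```python
import geopandas as gpd
from shapely.geometry import box

# Hypothetical 2-cell grid whose 'h'/'v' columns match the {h}/{v}
# variables in the source path; geometries are toy extents in EPSG:4326.
cells = gpd.GeoDataFrame(
    {"h": ["18", "19"], "v": ["04", "04"]},
    geometry=[box(0, 40, 10, 50), box(10, 40, 20, 50)],
    crs="EPSG:4326",
)

# Cells intersecting the bbox tell the stitcher which {h}/{v} files to fetch.
bbox = box(19.30, 39.62, 21.02, 42.69)
hits = cells[cells.intersects(bbox)]
print(hits[["h", "v"]])
```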

discover(band_locator='description')

Scans the dataset source to identify, catalog, and save the intersecting tiles based on the provided time and spatial constraints.

Parameters:

band_locator (str, optional) – Specifies how to locate bands in the dataset. Defaults to “description”. Valid options are “description”, “color_interp”, “filename”.

Returns:

None

Raises:

Exception – Propagates any exceptions encountered during scanning, metadata retrieval, spatial filtering, or catalog saving.

Example

>>> import datetime
>>> import earth_data_kit as edk
>>> import geopandas as gpd
>>> ds = edk.stitching.Dataset(
...     "modis-pds",
...     "s3://modis-pds/MCD43A4.006/{h}/{v}/%Y%j/*_B0?.TIF",
...     "s3",
...     "geotiff",
...     True
... )
>>> ds.set_timebounds(datetime.datetime(2017, 1, 1), datetime.datetime(2017, 1, 2))
>>> # Load grid dataframe
>>> gdf = gpd.read_file("tests/fixtures/modis.kml")
>>> gdf['h'] = gdf['Name'].str.split(' ').str[0].str.split(':').str[1].astype(int).astype(str).str.zfill(2)
>>> gdf['v'] = gdf['Name'].str.split(' ').str[1].str.split(':').str[1].astype(int).astype(str).str.zfill(2)
>>> ds.set_spacebounds((19.30, 39.62, 21.02, 42.69), grid_dataframe=gdf)
>>> ds.discover() # This will scan the dataset and save the catalog of intersecting tiles

get_bands()

Retrieve unique band configurations from tile metadata.

Aggregates metadata from each tile by extracting attributes such as resolution (x_res, y_res) and coordinate reference system (crs). The data is then grouped by the following columns: band index within the tile (source_idx), band description, data type (dtype), x_res, y_res, and crs.

Returns:

A DataFrame with unique band configurations. Each row represents a unique band configuration with the following columns:

  • source_idx – Band index within the source files

  • description – Band description

  • dtype – Data type of the band

  • x_res – X resolution

  • y_res – Y resolution

  • crs – Coordinate reference system

  • tiles – List of Tile objects that contain this band configuration

Return type:

pd.DataFrame

Example

>>> import datetime
>>> import earth_data_kit as edk
>>> import geopandas as gpd
>>> # Initialize the dataset
>>> ds = edk.stitching.Dataset("modis-pds", "s3://modis-pds/MCD43A4.006/{h}/{v}/%Y%j/*_B0?.TIF", "s3", "geotiff", True)
>>> ds.set_timebounds(datetime.datetime(2017, 1, 1), datetime.datetime(2017, 1, 2))
>>> # Load grid dataframe
>>> gdf = gpd.read_file("tests/fixtures/modis.kml")
>>> gdf['h'] = gdf['Name'].str.split(' ').str[0].str.split(':').str[1].astype(int).astype(str).str.zfill(2)
>>> gdf['v'] = gdf['Name'].str.split(' ').str[1].str.split(':').str[1].astype(int).astype(str).str.zfill(2)
>>> ds.set_spacebounds((19.30, 39.62, 21.02, 42.69), grid_dataframe=gdf)
>>> ds.discover()
>>> bands_df = ds.get_bands()
>>> print(bands_df.head())
   source_idx                description    dtype  x_res  y_res         crs                                              tiles
0           1  Nadir_Reflectance_Band1  uint16   30.0   30.0   EPSG:4326  [<earth_data_kit.stitching.classes.tile.Tile object...
1           1  Nadir_Reflectance_Band2  uint16   30.0   30.0   EPSG:4326  [<earth_data_kit.stitching.classes.tile.Tile object...
2           1  Nadir_Reflectance_Band3  uint16   30.0   30.0   EPSG:4326  [<earth_data_kit.stitching.classes.tile.Tile object...

Notes

The ‘source_idx’ column typically represents the band index within the source files. In some cases, this value will be 1 for all bands, especially when each band is stored in a separate file.
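As a sketch of working with the returned DataFrame (using a toy stand-in with hypothetical values, plain strings in place of real Tile objects), the tiles carrying a given band description can be pulled out with ordinary pandas selection:

```python
import pandas as pd

# Toy stand-in for the DataFrame returned by Dataset.get_bands();
# column names match the documented schema, values are hypothetical.
bands_df = pd.DataFrame({
    "source_idx": [1, 1],
    "description": ["Nadir_Reflectance_Band1", "Nadir_Reflectance_Band2"],
    "dtype": ["uint16", "uint16"],
    "x_res": [30.0, 30.0],
    "y_res": [30.0, 30.0],
    "crs": ["EPSG:4326", "EPSG:4326"],
    "tiles": [["tile_a", "tile_b"], ["tile_a"]],
})

# Select the list of tiles that carry a given band description.
band1_tiles = bands_df.loc[
    bands_df["description"] == "Nadir_Reflectance_Band1", "tiles"
].iloc[0]
print(band1_tiles)  # ['tile_a', 'tile_b']
```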

mosaic(bands, sync=False, overwrite=False, resolution=None, dtype=None, crs=None)

Identifies and extracts the required bands from the tile metadata for each unique date. For each band, it mosaics the matching tiles into a single-band VRT. These single-band mosaics are then stacked into a multi-band VRT following the ordered band list provided.

Parameters:
  • bands (list[str]) – Ordered list of band descriptions to output as VRTs.

  • sync (bool, optional) – Whether to sync the remote data sources before processing. Default False.

  • overwrite (bool, optional) – Whether to overwrite existing synced files. Default False.

  • resolution (float, optional) – Desired output resolution in meters. If provided, reprojects all data to this resolution. If not provided, the output resolution is determined by the input data.

  • dtype (str, optional) – Desired output data type. If provided, casts all data to this dtype. If not provided, the output dtype is determined by the input data.

  • crs (str, optional) – Desired output CRS. If provided, reprojects all data to this CRS. If not provided, the output CRS is determined by the input data.

Example

>>> import datetime
>>> import earth_data_kit as edk
>>> ds = edk.stitching.Dataset("example_dataset", "s3://your-bucket-name/path/to/data", "s3", "geotiff")
>>> ds.set_timebounds(datetime.datetime(2020, 1, 1), datetime.datetime(2020, 1, 31))
>>> ds.discover()  # Discover available scene files before stitching
>>> bands = ["red", "green", "blue"]
>>> ds.mosaic(bands, sync=True, overwrite=True, resolution=10.0, dtype="float32", crs="EPSG:4326")
>>> ds.save()  # Save the output VRTs to a JSON file

save()

Saves the mosaicked VRTs into a combined JSON file.

This method should be called after the mosaic() method to save the generated VRTs. The resulting JSON path is stored in the json_path attribute.

Returns:

None

to_dataarray()

Converts the dataset to an xarray DataArray.

This method opens the JSON file created by save() using xarray with the ‘edk_dataset’ engine and returns the DataArray corresponding to this dataset.

Returns:

A DataArray containing the dataset’s data with dimensions for time, bands, and spatial coordinates.

Return type:

xarray.DataArray

Example

>>> import earth_data_kit as edk
>>> import datetime
>>> ds = edk.stitching.Dataset("example_dataset", "s3://your-bucket/path", "s3", "geotiff")
>>> ds.set_timebounds(datetime.datetime(2020, 1, 1), datetime.datetime(2020, 1, 31))
>>> ds.discover()
>>> ds.mosaic(bands=["red", "green", "blue"])
>>> ds.save()
>>> data_array = ds.to_dataarray()

Note

This method requires that mosaic() and save() have been called first to generate the JSON file.

static Dataset.dataarray_from_file(json_path)

Creates an xarray DataArray from a JSON file created by the save() method.

Automatically determines optimal chunking based on the underlying raster block size.

Parameters:

json_path (str) – Path to the JSON file containing dataset information.

Returns:

DataArray with dimensions for time, bands, and spatial coordinates.

Return type:

xarray.DataArray

Example

>>> import earth_data_kit as edk
>>> data_array = edk.stitching.Dataset.dataarray_from_file("path/to/dataset.json")

Note

Loads a previously saved dataset without needing to recreate the Dataset object.
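The exact chunking heuristic is internal, but the idea of aligning chunks to the raster's native block size can be sketched as follows (block_aligned_chunk is a hypothetical helper, not part of the API):

```python
def block_aligned_chunk(target, block):
    # Round the requested chunk size up to the nearest multiple of the
    # raster's native block size, using at least one full block. Reading
    # whole blocks avoids re-reading the same block for adjacent chunks.
    return max(block, ((target + block - 1) // block) * block)

# e.g. a 1000-pixel target chunk over 256-pixel blocks becomes 1024
print(block_aligned_chunk(1000, 256))  # 1024
```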

static Dataset.combine(ref_da, das, method=None)

Combine a list of DataArrays by interpolating each to the grid of the reference DataArray, using the specified interpolation methods for each DataArray.

The reference DataArray (ref_da) and the DataArrays in das are typically returned by the to_dataarray() method and are expected to have the dimensions “time”, “band”, “x”, and “y”.

Parameters:
  • ref_da (xarray.DataArray) – The reference DataArray whose grid will be used for interpolation.

  • das (list of xarray.DataArray) – List of DataArrays to combine (excluding the reference DataArray).

  • method (str or list of str, optional) – Interpolation method(s) to use for each DataArray in das. If a single string is provided, it is used for all DataArrays. If a list is provided, it must be the same length as das. If None, 'linear' is used for all.

Returns:

Concatenated DataArray with a new ‘band’ dimension, with the reference DataArray as the first band.

Return type:

xarray.DataArray
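combine has no doctest here, so below is a minimal sketch of the regrid-and-stack behaviour it describes, using xarray directly (combine_sketch is a stand-in for illustration, not the library implementation; linear interpolation via xarray requires SciPy):

```python
import numpy as np
import xarray as xr

def combine_sketch(ref_da, das, method="linear"):
    # Normalize method to one entry per DataArray, as combine() documents.
    methods = [method] * len(das) if isinstance(method, str) else method
    # Interpolate each DataArray onto the reference spatial grid.
    regridded = [
        da.interp(x=ref_da["x"], y=ref_da["y"], method=m)
        for da, m in zip(das, methods)
    ]
    # Stack the reference first, then the regridded arrays, along 'band'.
    return xr.concat([ref_da] + regridded, dim="band")

# Toy arrays with the documented dimensions: time, band, y, x.
ref = xr.DataArray(
    np.zeros((1, 1, 4, 4)),
    dims=("time", "band", "y", "x"),
    coords={"time": [0], "band": ["ref"],
            "y": np.linspace(0.0, 3.0, 4), "x": np.linspace(0.0, 3.0, 4)},
)
other = xr.DataArray(
    np.ones((1, 1, 8, 8)),
    dims=("time", "band", "y", "x"),
    coords={"time": [0], "band": ["b1"],
            "y": np.linspace(0.0, 3.0, 8), "x": np.linspace(0.0, 3.0, 8)},
)
combined = combine_sketch(ref, [other])
print(dict(combined.sizes))  # band is 2, on the 4x4 reference grid
```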