Dataset¶
- class earth_data_kit.stitching.Dataset(name, source, engine, format, clean=True)
The Dataset class is the main class implemented by the stitching module. It acts as a dataset wrapper and maps to a single remote dataset. A remote dataset can contain multiple files.
Initialize a new dataset instance.
- Parameters:
name (str) – Unique identifier for the dataset
source (str) – Source identifier (S3 URI, Earth Engine collection ID, or STAC collection URL)
engine (str) – Data source engine - s3, earth_engine or stac
format (str) – Data format - geotiff, netcdf, earth_engine or stac_asset
clean (bool, optional) – Whether to clean temporary files before processing. Defaults to True
- Raises:
Exception – If the provided engine is not supported
Example
>>> from earth_data_kit.stitching.classes.dataset import Dataset
>>> # Earth Engine example
>>> ds = Dataset("example_dataset", "LANDSAT/LC08/C01/T1_SR", "earth_engine", "earth_engine")
>>> # S3 example
>>> ds = Dataset("example_dataset", "s3://your-bucket/path", "s3", "netcdf")
>>> # STAC example
>>> ds = Dataset("example_dataset", "https://planetarycomputer.microsoft.com/api/stac/v1/collections/sentinel-2-l2a", "stac", "stac_asset")
- set_timebounds(start, end, resolution=None)
Set time bounds for data download and optional temporal resolution for combining images.
- Parameters:
start (datetime) – Start date
end (datetime) – End date (inclusive)
resolution (str, optional) – Temporal resolution (e.g., ‘D’ for daily, ‘W’ for weekly, ‘M’ for monthly) See pandas offset aliases for full list: https://pandas.pydata.org/docs/user_guide/timeseries.html#timeseries-offset-aliases
Example
>>> import datetime
>>> from earth_data_kit.stitching import Dataset
>>> ds = Dataset("example_dataset", "LANDSAT/LC08/C01/T1_SR", "earth_engine", "earth_engine", clean=True)
>>> ds.set_timebounds(datetime.datetime(2020, 1, 1), datetime.datetime(2020, 12, 31))
>>> # Set daily resolution
>>> ds.set_timebounds(datetime.datetime(2020, 1, 1), datetime.datetime(2020, 12, 31), resolution='D')
>>> # Set monthly resolution
>>> ds.set_timebounds(datetime.datetime(2020, 1, 1), datetime.datetime(2020, 12, 31), resolution='M')
- set_spacebounds(bbox, grid_dataframe=None)
Configure spatial constraints for the dataset using a bounding box and, optionally, a grid dataframe.
This method sets up the spatial filtering parameters by specifying a bounding box defined by four coordinates in EPSG:4326. Additionally, if a grid dataframe is provided, the method will utilize it to accurately pinpoint the scene files to download based on the spatial variables in the source path.
- Parameters:
bbox (tuple[float, float, float, float]) – A tuple of four coordinates in the order (min_longitude, min_latitude, max_longitude, max_latitude)/(xmin, ymin, xmax, ymax) defining the spatial extent.
grid_dataframe (geopandas.GeoDataFrame, optional) – A GeoDataFrame containing grid cells with columns that match the spatial variables in the source path (e.g., ‘h’, ‘v’ for MODIS grid). Each row should have a geometry column defining the spatial extent of the grid cell.
Example
>>> import earth_data_kit as edk
>>> import geopandas as gpd
>>> ds = edk.stitching.Dataset("example_dataset", "s3://your-bucket/path/{h}/{v}/B01.TIF", "s3", "geotiff")
>>>
>>> # Setting spatial bounds using a bounding box:
>>> ds.set_spacebounds((19.3044861183, 39.624997667, 21.0200403175, 42.6882473822))
>>>
>>> # Setting spatial bounds with a grid dataframe:
>>> gdf = gpd.GeoDataFrame()
>>> # Assume gdf has columns 'h', 'v' that match the spatial variables in the source path
>>> # and a 'geometry' column with the spatial extent of each grid cell
>>> ds.set_spacebounds((19.3044861183, 39.624997667, 21.0200403175, 42.6882473822), grid_dataframe=gdf)
- discover(band_locator='description')
Scans the dataset source to identify, catalog, and save the intersecting tiles based on provided time and spatial constraints.
- Parameters:
band_locator (str, optional) – Specifies how to locate bands in the dataset. Defaults to “description”. Valid options are “description”, “color_interp”, “filename”.
- Returns:
None
- Raises:
Exception – Propagates any exceptions encountered during scanning, metadata retrieval, spatial filtering, or catalog saving.
Example
>>> import datetime
>>> import earth_data_kit as edk
>>> import geopandas as gpd
>>> ds = edk.stitching.Dataset(
...     "modis-pds",
...     "s3://modis-pds/MCD43A4.006/{h}/{v}/%Y%j/*_B0?.TIF",
...     "s3",
...     "geotiff",
...     True
... )
>>> ds.set_timebounds(datetime.datetime(2017, 1, 1), datetime.datetime(2017, 1, 2))
>>> # Load grid dataframe
>>> gdf = gpd.read_file("tests/fixtures/modis.kml")
>>> gdf['h'] = gdf['Name'].str.split(' ').str[0].str.split(':').str[1].astype(int).astype(str).str.zfill(2)
>>> gdf['v'] = gdf['Name'].str.split(' ').str[1].str.split(':').str[1].astype(int).astype(str).str.zfill(2)
>>> ds.set_spacebounds((19.30, 39.62, 21.02, 42.69), grid_dataframe=gdf)
>>> ds.discover()  # This will scan the dataset and save the catalog of intersecting tiles
- get_bands()
Retrieve unique band configurations from tile metadata.
Aggregates metadata from each tile by extracting attributes such as resolution (x_res, y_res) and coordinate reference system (crs). The data is then grouped by columns: band index inside tile (source_idx), band description, data type (dtype), x_res, y_res, and crs.
- Returns:
- A DataFrame with unique band configurations, where each row represents a unique band configuration with the following columns:
  - source_idx: Band index within the source files
  - description: Band description
  - dtype: Data type of the band
  - x_res: X resolution
  - y_res: Y resolution
  - crs: Coordinate reference system
  - tiles: List of Tile objects that contain this band configuration
- Return type:
pd.DataFrame
Example
>>> import datetime
>>> import earth_data_kit as edk
>>> import geopandas as gpd
>>> # Initialize the dataset
>>> ds = edk.stitching.Dataset("modis-pds", "s3://modis-pds/MCD43A4.006/{h}/{v}/%Y%j/*_B0?.TIF", "s3", "geotiff", True)
>>> ds.set_timebounds(datetime.datetime(2017, 1, 1), datetime.datetime(2017, 1, 2))
>>> # Load grid dataframe
>>> gdf = gpd.read_file("tests/fixtures/modis.kml")
>>> gdf['h'] = gdf['Name'].str.split(' ').str[0].str.split(':').str[1].astype(int).astype(str).str.zfill(2)
>>> gdf['v'] = gdf['Name'].str.split(' ').str[1].str.split(':').str[1].astype(int).astype(str).str.zfill(2)
>>> ds.set_spacebounds((19.30, 39.62, 21.02, 42.69), grid_dataframe=gdf)
>>> ds.discover()
>>> bands_df = ds.get_bands()
>>> print(bands_df.head())
   source_idx              description   dtype  x_res  y_res        crs                                              tiles
0           1  Nadir_Reflectance_Band1  uint16   30.0   30.0  EPSG:4326  [<earth_data_kit.stitching.classes.tile.Tile object...
1           1  Nadir_Reflectance_Band2  uint16   30.0   30.0  EPSG:4326  [<earth_data_kit.stitching.classes.tile.Tile object...
2           1  Nadir_Reflectance_Band3  uint16   30.0   30.0  EPSG:4326  [<earth_data_kit.stitching.classes.tile.Tile object...
Notes
The ‘source_idx’ column typically represents the band index within the source files. In some cases, this value will be 1 for all bands, especially when each band is stored in a separate file.
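The grouping that get_bands() performs can be illustrated with a small pandas sketch. The rows, tile names, and values below are hypothetical stand-ins; the real method builds this table from tile metadata collected during discover().

```python
import pandas as pd

# Hypothetical per-band metadata rows, one per (tile, band) pair,
# as might be collected from tile metadata.
rows = pd.DataFrame([
    {"source_idx": 1, "description": "Nadir_Reflectance_Band1", "dtype": "uint16",
     "x_res": 30.0, "y_res": 30.0, "crs": "EPSG:4326", "tile": "t1"},
    {"source_idx": 1, "description": "Nadir_Reflectance_Band1", "dtype": "uint16",
     "x_res": 30.0, "y_res": 30.0, "crs": "EPSG:4326", "tile": "t2"},
    {"source_idx": 1, "description": "Nadir_Reflectance_Band2", "dtype": "uint16",
     "x_res": 30.0, "y_res": 30.0, "crs": "EPSG:4326", "tile": "t1"},
])

# Group by the configuration columns; each unique combination becomes
# one output row, with the matching tiles collected into a list.
keys = ["source_idx", "description", "dtype", "x_res", "y_res", "crs"]
bands = rows.groupby(keys)["tile"].agg(list).reset_index(name="tiles")
```

Here `bands` has two rows: Band1 (present in both tiles) and Band2 (present in one), mirroring the shape of the DataFrame shown in the example above.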
- mosaic(bands, sync=False, overwrite=False, resolution=None, dtype=None, crs=None)
Identifies and extracts the required bands from the tile metadata for each unique date. For each band, it creates single-band VRTs that are then mosaiced together. These individual band mosaics are finally stacked into a multi-band VRT according to the ordered band arrangement provided.
- Parameters:
bands (list[str]) – Ordered list of band descriptions to output as VRTs.
sync (bool, optional) – Whether to sync the remote data sources before processing. Default False.
overwrite (bool, optional) – Whether to overwrite existing synced files. Default False.
resolution (float, optional) – Desired output resolution in meters. If provided, reprojects all data to this resolution. If not provided, the output resolution is determined by the input data.
dtype (str, optional) – Desired output data type. If provided, casts all data to this dtype. If not provided, the output dtype is determined by the input data.
crs (str, optional) – Desired output CRS. If provided, reprojects all data to this CRS. If not provided, the output CRS is determined by the input data.
Example
>>> import datetime
>>> import earth_data_kit as edk
>>> ds = edk.stitching.Dataset("example_dataset", "s3://your-bucket-name/path/to/data", "s3", "geotiff")
>>> ds.set_timebounds(datetime.datetime(2020, 1, 1), datetime.datetime(2020, 1, 31))
>>> ds.discover()  # Discover available scene files before stitching
>>> bands = ["red", "green", "blue"]
>>> ds.mosaic(bands, sync=True, overwrite=True, resolution=(10, -10), dtype="float32", crs="EPSG:4326")
>>> ds.save()  # Save the output VRTs to a JSON file
- save()
Saves the mosaiced VRTs into a combined JSON file.
This method should be called after the mosaic() method to save the generated VRTs. The resulting JSON path is stored in the json_path attribute.
- Returns:
None
- to_dataarray()
Converts the dataset to an xarray DataArray.
This method opens the JSON file created by save() using xarray with the ‘edk_dataset’ engine and returns the DataArray corresponding to this dataset.
- Returns:
A DataArray containing the dataset’s data with dimensions for time, bands, and spatial coordinates.
- Return type:
xarray.DataArray
Example
>>> import earth_data_kit as edk
>>> import datetime
>>> ds = edk.stitching.Dataset("example_dataset", "s3://your-bucket/path", "s3", "geotiff")
>>> ds.set_timebounds(datetime.datetime(2020, 1, 1), datetime.datetime(2020, 1, 31))
>>> ds.discover()
>>> ds.mosaic(bands=["red", "green", "blue"])
>>> ds.save()
>>> data_array = ds.to_dataarray()
Note
This method requires that mosaic() and save() have been called first to generate the JSON file.
- static Dataset.dataarray_from_file(json_path)
Creates an xarray DataArray from a JSON file created by the save() method.
Automatically determines optimal chunking based on the underlying raster block size.
- Parameters:
json_path (str) – Path to the JSON file containing dataset information.
- Returns:
DataArray with dimensions for time, bands, and spatial coordinates.
- Return type:
xarray.DataArray
Example
>>> import earth_data_kit as edk
>>> data_array = edk.stitching.Dataset.dataarray_from_file("path/to/dataset.json")
Note
Loads a previously saved dataset without needing to recreate the Dataset object.
- static Dataset.combine(ref_da, das, method=None)
Combine a list of DataArrays by interpolating each to the grid of the reference DataArray, using the specified interpolation methods for each DataArray.
The reference DataArray (ref_da) and the DataArrays in das are typically returned by the to_dataarray() method and are expected to have dimensions: “time”, “band”, “x”, and “y”.
- Parameters:
ref_da (xarray.DataArray) – The reference DataArray whose grid will be used for interpolation.
das (list of xarray.DataArray) – List of DataArrays to combine (excluding the reference DataArray).
method (str or list of str, optional) – Interpolation method(s) to use for each DataArray in das. If a single string is provided, it is used for all DataArrays. If a list is provided, it must be the same length as das. Default is “linear” for all.
- Returns:
Concatenated DataArray with a new ‘band’ dimension, with the reference DataArray as the first band.
- Return type:
xarray.DataArray
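Example
The regrid-and-stack behavior combine() describes can be sketched with plain xarray. The arrays, sizes, and coordinates below are hypothetical; combine() itself takes DataArrays produced by to_dataarray() and applies the per-array interpolation methods internally.

```python
import numpy as np
import xarray as xr

# Hypothetical stand-ins for DataArrays returned by to_dataarray(),
# with the documented dimensions: time, band, y, x.
ref_da = xr.DataArray(
    np.random.rand(1, 1, 4, 4),
    dims=("time", "band", "y", "x"),
    coords={"x": np.linspace(0, 3, 4), "y": np.linspace(0, 3, 4)},
)
coarse = xr.DataArray(
    np.random.rand(1, 1, 2, 2),
    dims=("time", "band", "y", "x"),
    coords={"x": np.linspace(0, 3, 2), "y": np.linspace(0, 3, 2)},
)

# Conceptually what combine() does: interpolate each DataArray onto the
# reference grid, then concatenate along the 'band' dimension with the
# reference DataArray first.
regridded = coarse.interp(x=ref_da["x"], y=ref_da["y"], method="linear")
combined = xr.concat([ref_da, regridded], dim="band")
```

After this, `combined` shares the reference grid (4x4 here) and carries one band per input array, with the reference band first.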