Adding the MEaSUREs data to the Google Cloud Storage bucket
In the ice flow chapter we loaded some widely used Antarctic surface velocity data from the cloud. This notebook demonstrates how to download those data, manipulate them into a format that makes cloud computing efficient, and upload them to a Google Cloud Storage bucket.
To actually download and upload these data you will need your own NSIDC credentials (for the download) and a Google Cloud Storage token (for the upload).
Download
To download the data from NSIDC to your local machine, run the following command. You will need a free NASA Earthdata Login account; you can create one at https://urs.earthdata.nasa.gov. Then replace USERNAME and PASSWORD in the command below with your Earthdata Login username and password.
!wget --http-user=USERNAME --http-password=PASSWORD https://n5eil01u.ecs.nsidc.org/MEASURES/NSIDC-0484.002/1996.01.01/antarctica_ice_velocity_450m_v2.nc
--2022-12-12 10:46:24-- https://n5eil01u.ecs.nsidc.org/MEASURES/NSIDC-0484.002/1996.01.01/antarctica_ice_velocity_450m_v2.nc
Resolving n5eil01u.ecs.nsidc.org (n5eil01u.ecs.nsidc.org)... 128.138.97.102
Connecting to n5eil01u.ecs.nsidc.org (n5eil01u.ecs.nsidc.org)|128.138.97.102|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://urs.earthdata.nasa.gov/oauth/authorize?app_type=401&client_id=_JLuwMHxb2xX6NwYTb4dRA&response_type=code&redirect_uri=https%3A%2F%2Fn5eil01u.ecs.nsidc.org%2FOPS%2Fredirect&state=aHR0cHM6Ly9uNWVpbDAxdS5lY3MubnNpZGMub3JnL01FQVNVUkVTL05TSURDLTA0ODQuMDAyLzE5OTYuMDEuMDEvYW50YXJjdGljYV9pY2VfdmVsb2NpdHlfNDUwbV92Mi5uYw [following]
--2022-12-12 10:46:24-- https://urs.earthdata.nasa.gov/oauth/authorize?app_type=401&client_id=_JLuwMHxb2xX6NwYTb4dRA&response_type=code&redirect_uri=https%3A%2F%2Fn5eil01u.ecs.nsidc.org%2FOPS%2Fredirect&state=aHR0cHM6Ly9uNWVpbDAxdS5lY3MubnNpZGMub3JnL01FQVNVUkVTL05TSURDLTA0ODQuMDAyLzE5OTYuMDEuMDEvYW50YXJjdGljYV9pY2VfdmVsb2NpdHlfNDUwbV92Mi5uYw
Resolving urs.earthdata.nasa.gov (urs.earthdata.nasa.gov)... 198.118.243.33
Connecting to urs.earthdata.nasa.gov (urs.earthdata.nasa.gov)|198.118.243.33|:443... connected.
HTTP request sent, awaiting response... 401 Unauthorized
Authentication selected: Basic realm="Please enter your Earthdata Login credentials. If you do not have a Earthdata Login, create one at https://urs.earthdata.nasa.gov//users/new"
Reusing existing connection to urs.earthdata.nasa.gov:443.
HTTP request sent, awaiting response... 302 Found
Location: https://n5eil01u.ecs.nsidc.org/OPS/redirect?code=22ade1fdebe174b1ee00d376b4166731b0aed7ee1cc461732d93296ecf4bef39&state=aHR0cHM6Ly9uNWVpbDAxdS5lY3MubnNpZGMub3JnL01FQVNVUkVTL05TSURDLTA0ODQuMDAyLzE5OTYuMDEuMDEvYW50YXJjdGljYV9pY2VfdmVsb2NpdHlfNDUwbV92Mi5uYw [following]
--2022-12-12 10:46:24-- https://n5eil01u.ecs.nsidc.org/OPS/redirect?code=22ade1fdebe174b1ee00d376b4166731b0aed7ee1cc461732d93296ecf4bef39&state=aHR0cHM6Ly9uNWVpbDAxdS5lY3MubnNpZGMub3JnL01FQVNVUkVTL05TSURDLTA0ODQuMDAyLzE5OTYuMDEuMDEvYW50YXJjdGljYV9pY2VfdmVsb2NpdHlfNDUwbV92Mi5uYw
Connecting to n5eil01u.ecs.nsidc.org (n5eil01u.ecs.nsidc.org)|128.138.97.102|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://n5eil01u.ecs.nsidc.org/MEASURES/NSIDC-0484.002/1996.01.01/antarctica_ice_velocity_450m_v2.nc [following]
--2022-12-12 10:46:25-- https://n5eil01u.ecs.nsidc.org/MEASURES/NSIDC-0484.002/1996.01.01/antarctica_ice_velocity_450m_v2.nc
Reusing existing connection to n5eil01u.ecs.nsidc.org:443.
HTTP request sent, awaiting response... No data received.
Retrying.
--2022-12-12 10:46:26-- (try: 2) https://n5eil01u.ecs.nsidc.org/MEASURES/NSIDC-0484.002/1996.01.01/antarctica_ice_velocity_450m_v2.nc
Connecting to n5eil01u.ecs.nsidc.org (n5eil01u.ecs.nsidc.org)|128.138.97.102|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6814851561 (6.3G) [application/x-netcdf]
Saving to: ‘antarctica_ice_velocity_450m_v2.nc’
antarctica_ice_velo 100%[===================>] 6.35G 10.1MB/s in 16m 54s
2022-12-12 11:03:20 (6.41 MB/s) - ‘antarctica_ice_velocity_450m_v2.nc’ saved [6814851561/6814851561]
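Before going further it is worth checking that the file arrived intact. A quick sketch, comparing the local file size against the Length the server reported in the log above:
import os

# The server reported a Content-Length of 6814851561 bytes (see the wget log).
local_size = os.path.getsize('antarctica_ice_velocity_450m_v2.nc')
assert local_size == 6814851561, f"partial download? got {local_size} bytes"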
Load
Load the data lazily (so that it isn't all loaded into memory at once) using xarray.
import xarray as xr
ds = xr.open_dataset('antarctica_ice_velocity_450m_v2.nc')
Inspect the size of the dataset and take a look at the coordinates, variables and dimensions.
print(f"the dataset is {ds.nbytes/1e9} Gb")
the dataset is 6.814832221 Gb
ds
<xarray.Dataset>
Dimensions:       (x: 12445, y: 12445)
Coordinates:
  * x             (x) float64 -2.8e+06 -2.8e+06 -2.799e+06 ... 2.799e+06 2.8e+06
  * y             (y) float64 2.8e+06 2.8e+06 2.799e+06 ... -2.799e+06 -2.8e+06
    lat           (y, x) float64 -54.67 -54.68 -54.68 ... -54.68 -54.68 -54.68
    lon           (y, x) float64 315.0 315.0 315.0 315.0 ... 135.0 135.0 135.0
Data variables:
    coord_system  |S1 b''
    VX            (y, x) float32 nan nan nan nan nan nan ... nan nan nan nan nan
    VY            (y, x) float32 nan nan nan nan nan nan ... nan nan nan nan nan
    STDX          (y, x) float32 nan nan nan nan nan nan ... nan nan nan nan nan
    STDY          (y, x) float32 nan nan nan nan nan nan ... nan nan nan nan nan
    ERRX          (y, x) float32 nan nan nan nan nan nan ... nan nan nan nan nan
    ERRY          (y, x) float32 nan nan nan nan nan nan ... nan nan nan nan nan
    CNT           (y, x) int32 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0
Attributes: (12/26)
    Conventions:                CF-1.6
    Metadata_Conventions:       CF-1.6, Unidata Dataset Discovery v1.0, GDS v2.0
    standard_name_vocabulary:   CF Standard Name Table (v22, 12 February 2013)
    id:                         vel_nsidc.CF16.nc
    title:                      MEaSURES Antarctica Ice Velocity Map 450m spacing
    product_version:            ...
    ...
    spatial_resolution:         450m
    time_coverage_start:        1995-01-01
    time_coverage_end:          2016-12-31
    project:                    NASA/MEaSUREs
    creator_name:               J. Mouginot
    license:                    No restrictions on access or use
Rechunk
Zarr stores hold multi-dimensional data in a way that is optimized for fast access from distributed cloud computing. Zarr stores use a concept called chunks: a chunk is the smallest unit of data that is read in one go. It is best to make chunks smaller than the total size of the dataset, so that you can avoid downloading the full ~7 Gb every time you only need part of it, but making them too small introduces overheads that slow things down. The chunk size the dataset has by default after loading from a netCDF (as we did above) may not be ideal, so one needs to inspect the chunk size and 'rechunk' if necessary.
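Before rechunking, it can help to see what chunking you are starting from. A minimal sketch; note that the encoding keys depend on how the netCDF file was written, so chunksizes may be None:
# ds was opened without a `chunks` argument, so it is backed by NumPy rather
# than dask and has no dask chunks yet. Any chunking of the netCDF file
# itself is recorded in each variable's encoding.
print(ds.chunks)                         # empty mapping: no dask chunks yet
print(ds.VX.encoding.get('chunksizes'))  # on-disk chunk shape, or None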
For this dataset, it turns out that splitting each variable into four chunks gives about the right chunk size. The following cell does this.
import numpy as np

# number of grid points in each dimension
nx = ds.x.shape[0]
ny = ds.y.shape[0]

# split each 2D variable into 2 x 2 = 4 chunks; np.ceil handles odd lengths,
# and we cast to int because chunk sizes must be integers
ds_rechunked = ds.chunk({'y': int(np.ceil(ny/2)), 'x': int(np.ceil(nx/2))})
ds_rechunked
<xarray.Dataset>
Dimensions:       (x: 12445, y: 12445)
Coordinates:
  * x             (x) float64 -2.8e+06 -2.8e+06 -2.799e+06 ... 2.799e+06 2.8e+06
  * y             (y) float64 2.8e+06 2.8e+06 2.799e+06 ... -2.799e+06 -2.8e+06
    lat           (y, x) float64 dask.array<chunksize=(6223, 6223), meta=np.ndarray>
    lon           (y, x) float64 dask.array<chunksize=(6223, 6223), meta=np.ndarray>
Data variables:
    coord_system  |S1 b''
    VX            (y, x) float32 dask.array<chunksize=(6223, 6223), meta=np.ndarray>
    VY            (y, x) float32 dask.array<chunksize=(6223, 6223), meta=np.ndarray>
    STDX          (y, x) float32 dask.array<chunksize=(6223, 6223), meta=np.ndarray>
    STDY          (y, x) float32 dask.array<chunksize=(6223, 6223), meta=np.ndarray>
    ERRX          (y, x) float32 dask.array<chunksize=(6223, 6223), meta=np.ndarray>
    ERRY          (y, x) float32 dask.array<chunksize=(6223, 6223), meta=np.ndarray>
    CNT           (y, x) int32 dask.array<chunksize=(6223, 6223), meta=np.ndarray>
Attributes: (12/26)
    Conventions:                CF-1.6
    Metadata_Conventions:       CF-1.6, Unidata Dataset Discovery v1.0, GDS v2.0
    standard_name_vocabulary:   CF Standard Name Table (v22, 12 February 2013)
    id:                         vel_nsidc.CF16.nc
    title:                      MEaSURES Antarctica Ice Velocity Map 450m spacing
    product_version:            ...
    ...
    spatial_resolution:         450m
    time_coverage_start:        1995-01-01
    time_coverage_end:          2016-12-31
    project:                    NASA/MEaSUREs
    creator_name:               J. Mouginot
    license:                    No restrictions on access or use
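As a sanity check on "about the right size", you can compute how much memory one of these chunks occupies. A quick sketch:
# one 6223 x 6223 float32 chunk: big enough to amortize per-request overhead,
# far smaller than the full ~7 Gb dataset
chunk_bytes = np.prod(ds_rechunked.VX.data.chunksize) * ds_rechunked.VX.dtype.itemsize
print(f"{chunk_bytes / 1e6:.0f} MB per chunk")
This works out to roughly 155 MB per chunk, in line with the commonly suggested order of magnitude (~100 MB) for dask chunks.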
Write to bucket
We will write both the default (small-chunked) dataset and the rechunked dataset to the Google Cloud Storage bucket, for use elsewhere in the book.
Writing to the bucket requires an authentication token, which is private. To do this yourself you will need your own bucket and a token specific to that bucket.
import zarr
import json
import gcsfs
import xarray as xr
The cell below uses the token to generate a 'file-like object' called mapper, which can then be used with the xarray method to_zarr to write the dataset to the Zarr store.
# load the private token for the bucket from a local JSON file
with open('/Users/jkingslake/Documents/science/ldeo-glaciology-bc97b12df06b.json') as token_file:
    token = json.load(token_file)

# authenticate with Google Cloud Storage and create a mapper for each store
gcs = gcsfs.GCSFileSystem(token=token)
mapper = gcs.get_mapper('gs://ldeo-glaciology/measures/measures')
mapper_rechunked = gcs.get_mapper('gs://ldeo-glaciology/measures/measures_rechunked')
ds.to_zarr(mapper)
ds_rechunked.to_zarr(mapper_rechunked)
<xarray.backends.zarr.ZarrStore at 0x145e98270>
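One optional refinement, not used above: Zarr metadata can be consolidated into a single key at write time, so that a later open from the bucket fetches all the metadata in one request rather than one per array. A minimal sketch using the consolidated keyword of to_zarr and open_zarr:
# Optional (not run above): rewrite the store with consolidated metadata so
# that opening it from the bucket needs a single metadata read.
ds_rechunked.to_zarr(mapper_rechunked, mode='w', consolidated=True)
ds_check = xr.open_zarr(mapper_rechunked, consolidated=True)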
Reload
To check that the data were uploaded correctly, reload both datasets using the syntax that will be used in the main pages that make use of these data.
import fsspec
mapper_reload = fsspec.get_mapper('gs://ldeo-glaciology/measures/measures')
ds_reloaded = xr.open_zarr(mapper_reload)
ds_reloaded
<xarray.Dataset>
Dimensions:       (y: 12445, x: 12445)
Coordinates:
    lat           (y, x) float64 dask.array<chunksize=(389, 778), meta=np.ndarray>
    lon           (y, x) float64 dask.array<chunksize=(389, 778), meta=np.ndarray>
  * x             (x) float64 -2.8e+06 -2.8e+06 -2.799e+06 ... 2.799e+06 2.8e+06
  * y             (y) float64 2.8e+06 2.8e+06 2.799e+06 ... -2.799e+06 -2.8e+06
Data variables:
    CNT           (y, x) int32 dask.array<chunksize=(778, 778), meta=np.ndarray>
    ERRX          (y, x) float32 dask.array<chunksize=(778, 778), meta=np.ndarray>
    ERRY          (y, x) float32 dask.array<chunksize=(778, 778), meta=np.ndarray>
    STDX          (y, x) float32 dask.array<chunksize=(778, 778), meta=np.ndarray>
    STDY          (y, x) float32 dask.array<chunksize=(778, 778), meta=np.ndarray>
    VX            (y, x) float32 dask.array<chunksize=(778, 778), meta=np.ndarray>
    VY            (y, x) float32 dask.array<chunksize=(778, 778), meta=np.ndarray>
    coord_system  |S1 ...
Attributes: (12/26)
    Conventions:                CF-1.6
    Metadata_Conventions:       CF-1.6, Unidata Dataset Discovery v1.0, GDS v2.0
    cdm_data_type:              Grid
    creator_name:               J. Mouginot
    date_created:               2017-04-06T17:47:44.00004923343322Z
    geospatial_lat_max:         -60
    ...
    spatial_resolution:         450m
    standard_name_vocabulary:   CF Standard Name Table (v22, 12 February 2013)
    summary:
    time_coverage_end:          2016-12-31
    time_coverage_start:        1995-01-01
    title:                      MEaSURES Antarctica Ice Velocity Map 450m spacing
mapper_reload = fsspec.get_mapper('gs://ldeo-glaciology/measures/measures_rechunked')
ds_rechunked_reloaded = xr.open_zarr(mapper_reload)
ds_rechunked_reloaded
<xarray.Dataset>
Dimensions:       (y: 12445, x: 12445)
Coordinates:
    lat           (y, x) float64 dask.array<chunksize=(6223, 6223), meta=np.ndarray>
    lon           (y, x) float64 dask.array<chunksize=(6223, 6223), meta=np.ndarray>
  * x             (x) float64 -2.8e+06 -2.8e+06 -2.799e+06 ... 2.799e+06 2.8e+06
  * y             (y) float64 2.8e+06 2.8e+06 2.799e+06 ... -2.799e+06 -2.8e+06
Data variables:
    CNT           (y, x) int32 dask.array<chunksize=(6223, 6223), meta=np.ndarray>
    ERRX          (y, x) float32 dask.array<chunksize=(6223, 6223), meta=np.ndarray>
    ERRY          (y, x) float32 dask.array<chunksize=(6223, 6223), meta=np.ndarray>
    STDX          (y, x) float32 dask.array<chunksize=(6223, 6223), meta=np.ndarray>
    STDY          (y, x) float32 dask.array<chunksize=(6223, 6223), meta=np.ndarray>
    VX            (y, x) float32 dask.array<chunksize=(6223, 6223), meta=np.ndarray>
    VY            (y, x) float32 dask.array<chunksize=(6223, 6223), meta=np.ndarray>
    coord_system  |S1 ...
Attributes: (12/26)
    Conventions:                CF-1.6
    Metadata_Conventions:       CF-1.6, Unidata Dataset Discovery v1.0, GDS v2.0
    cdm_data_type:              Grid
    creator_name:               J. Mouginot
    date_created:               2017-04-06T17:47:44.00004923343322Z
    geospatial_lat_max:         -60
    ...
    spatial_resolution:         450m
    standard_name_vocabulary:   CF Standard Name Table (v22, 12 February 2013)
    summary:
    time_coverage_end:          2016-12-31
    time_coverage_start:        1995-01-01
    title:                      MEaSURES Antarctica Ice Velocity Map 450m spacing
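The reprs look right, but a stronger check is to compare a small window of the reloaded data against the original. A sketch, assuming ds, ds_reloaded and ds_rechunked_reloaded are all still in memory:
import numpy as np

# Compare a 100 x 100 window of one variable from each store against the
# original, to avoid pulling the full ~7 Gb over the network. equal_nan=True
# because ice-free areas are stored as NaN.
window = dict(x=slice(6000, 6100), y=slice(6000, 6100))
assert np.allclose(ds.VX.isel(window), ds_reloaded.VX.isel(window), equal_nan=True)
assert np.allclose(ds.VX.isel(window), ds_rechunked_reloaded.VX.isel(window), equal_nan=True)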