Sec. 8 Lab exercise 3: Navigating across three dimensions of data

Now that we have acquired our GCM output data (Lab 2, Sec. 6), we need to learn more about how to access it and begin to analyze it. In this lab, you will find a spatial and temporal subset of your GCM data matching your observed climate baseline using two different methods, and will begin to understand the future climate using change factors. By the end of this lab, you will have data from more than 50 GCMs. In Lab 4 (Sec. 9 and 10), we will see how we can narrow down our selection of the models that will inform our climate change impact assessment.

We are going to begin by exploring some of the attributes of the three-dimensional data contained in the downloaded CMIP6 netCDF files. We are going to use the netCDF4 module to access the information in our netCDF files. It is strongly recommended that you have a browser tab open with the netCDF4 library documentation for reference.

Once we have a better sense of the structure of our data, we will then use Google Cloud and the xarray package to access the CMIP6 data in zarr format in a more efficient way.

We begin, as always, by importing the libraries that we will need.

Python

import datetime as dt
import numpy as np
import pandas as pd
import re
from matplotlib import pyplot as plt
from netCDF4 import Dataset, date2index, num2date, date2num

Now, create a list of the netCDF files that we will examine. We have created some miniature netCDF files for use in this book. If you have been following this manual in order, then you should substitute the directories and filenames below with paths leading to the netCDF files that you downloaded in Lab 2.

Python

past = "data/mini/tas_Amon_CanESM5_historical_r1i1p1f1_gn_185001-201412_mini.nc"
future = ["data/mini/tas_Amon_CanESM5_ssp119_r1i1p1f1_gn_201501-210012_mini.nc",
          "data/mini/tas_Amon_CanESM5_ssp126_r1i1p1f1_gn_201501-210012_mini.nc",
          "data/mini/tas_Amon_CanESM5_ssp245_r1i1p1f1_gn_201501-210012_mini.nc",
          "data/mini/tas_Amon_CanESM5_ssp370_r1i1p1f1_gn_201501-210012_mini.nc",
          "data/mini/tas_Amon_CanESM5_ssp585_r1i1p1f1_gn_201501-210012_mini.nc"]

8.1 The past (from scratch)

We will begin by looking at our “historical” netCDF file. We can load this file into memory using the Dataset() function that we imported from netCDF4.

Python

nc = Dataset(past)

First, we recommend that you get familiar with this data set. For instance, what are the dimensions of your data?

Python

nc.dimensions
## {'lat': <class 'netCDF4._netCDF4.Dimension'>: name = 'lat', size = 7,
##  'bnds': <class 'netCDF4._netCDF4.Dimension'>: name = 'bnds', size = 2,
##  'lon': <class 'netCDF4._netCDF4.Dimension'>: name = 'lon', size = 32,
##  'time': <class 'netCDF4._netCDF4.Dimension'> (unlimited): name = 'time', size = 1980}
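
For a tidier overview, you can loop over the dimensions and print just the pieces you care about. This is a quick sketch using the netCDF4 Dimension API: len() gives the size, and isunlimited() reports whether the dimension can grow, as time does here.

Python

for name, dim in nc.dimensions.items():
    print(f"{name}: size = {len(dim)}, unlimited = {dim.isunlimited()}")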

Notice that we have three-dimensional data, stored along X (longitude), Y (latitude), and T (time). Some netCDF files also have a Z dimension (e.g. height). There is no hard limit to the number of dimensions a netCDF file can have, though it is suggested not to exceed 1024. Rest assured, it is unlikely that you will encounter a 1024-D file in the wild. To get a list of the variables stored in your netCDF file, query the variables attribute of your loaded data set.

Python

nc.variables
## {'height': <class 'netCDF4._netCDF4.Variable'>
## float64 height()
##     units: m
##     axis: Z
##     positive: up
##     long_name: height
##     standard_name: height
## unlimited dimensions: 
## current shape = ()
## filling on, default _FillValue of 9.969209968386869e+36 used, 'lat': <class 'netCDF4._netCDF4.Variable'>
## float64 lat(lat)
##     bounds: lat_bnds
##     units: degrees_north
##     axis: Y
##     long_name: Latitude
##     standard_name: latitude
## unlimited dimensions: 
## current shape = (7,)
## filling on, default _FillValue of 9.969209968386869e+36 used, 'lat_bnds': <class 'netCDF4._netCDF4.Variable'>
## float64 lat_bnds(lat, bnds)
## unlimited dimensions: 
## current shape = (7, 2)
## filling on, default _FillValue of 9.969209968386869e+36 used, 'lon': <class 'netCDF4._netCDF4.Variable'>
## float64 lon(lon)
##     bounds: lon_bnds
##     units: degrees_east
##     axis: X
##     long_name: Longitude
##     standard_name: longitude
## unlimited dimensions: 
## current shape = (32,)
## filling on, default _FillValue of 9.969209968386869e+36 used, 'lon_bnds': <class 'netCDF4._netCDF4.Variable'>
## float64 lon_bnds(lon, bnds)
## unlimited dimensions: 
## current shape = (32, 2)
## filling on, default _FillValue of 9.969209968386869e+36 used, 'tas': <class 'netCDF4._netCDF4.Variable'>
## float32 tas(time, lat, lon)
##     standard_name: air_temperature
##     long_name: Near-Surface Air Temperature
##     comment: ST+273.16, CMIP_table_comment: near-surface (usually, 2 meter) air temperature
##     units: K
##     original_name: ST
##     history: degctok 2019-04-30T17:32:17Z altered by CMOR: Treated scalar dimension: 'height'. 2019-04-30T17:32:17Z altered by CMOR: Reordered dimensions, original order: lat lon time. 2019-04-30T17:32:17Z altered by CMOR: replaced missing value flag (1e+38) with standard missing value (1e+20).
##     cell_methods: area: time: mean
##     cell_measures: area: areacella
##     coordinates: height
##     missing_value: 1e+20
##     _FillValue: 1e+20
## unlimited dimensions: time
## current shape = (1980, 7, 32)
## filling on, 'time': <class 'netCDF4._netCDF4.Variable'>
## float64 time(time)
##     bounds: time_bnds
##     units: days since 1850-01-01 0:0:0.0
##     calendar: 365_day
##     axis: T
##     long_name: time
##     standard_name: time
## unlimited dimensions: time
## current shape = (1980,)
## filling on, default _FillValue of 9.969209968386869e+36 used, 'time_bnds': <class 'netCDF4._netCDF4.Variable'>
## float64 time_bnds(time, bnds)
## unlimited dimensions: time
## current shape = (1980, 2)
## filling on, default _FillValue of 9.969209968386869e+36 used}

You can query a specific variable as well, e.g. time.

Python

nc.variables['time']
## <class 'netCDF4._netCDF4.Variable'>
## float64 time(time)
##     bounds: time_bnds
##     units: days since 1850-01-01 0:0:0.0
##     calendar: 365_day
##     axis: T
##     long_name: time
##     standard_name: time
## unlimited dimensions: time
## current shape = (1980,)
## filling on, default _FillValue of 9.969209968386869e+36 used

The above shows us some attributes of our variable, such as its units and current shape, and whether any of its dimensions are unlimited.
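
These attributes are also exposed as plain Python attributes on the variable object, so you can read any of them directly. For example:

Python

print(nc.variables['time'].units)     # days since 1850-01-01 0:0:0.0
print(nc.variables['time'].calendar)  # 365_day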

We can store the variable in a Python object, which will help us to get its values.

Python

nc_time = nc.variables['time']
nc_time[0:10]
## masked_array(data=[ 15.5,  45. ,  74.5, 105. , 135.5, 166. , 196.5, 227.5,
##                    258. , 288.5],
##              mask=False,
##        fill_value=1e+20)

8.1.1 Get subset indices

The time variable isn’t very easy for us to read, as it is stored as “days since 1850-01-01”. Interpreting the time variable is where xarray excels and why we will ultimately use that package going forward. Nevertheless, it is instructive to work through how to handle the time dimension in the absence of xarray. Luckily, netCDF4 provides a few helper functions for finding our period of interest: 1981 to 2010.

Python

time_start = date2index(dt.datetime(1981, 1, 15), nc_time, select="nearest")
time_end = date2index(dt.datetime(2010, 12, 15), nc_time, select="nearest")

Note that we have selected the index that contains data nearest to our desired end date.

Python

num2date(nc_time[time_end], units=nc_time.units, calendar=nc_time.calendar)
## cftime.DatetimeNoLeap(2010, 12, 16, 12, 0, 0, 0, has_year_zero=True)

We now know the slice of T that we need. How about X and Y? We will select a single grid cell, based on our station target. First, I will define that target. I have chosen to use a Python dictionary, because I have a tendency to get my “lats” and my “lons” confused. You can use a regular list if you prefer.

Python

# Toronto station coords
target = {"lat": 43.67, "lon": -79.40}

We are used to using negatives for western longitudes (i.e. 180°W to 180°E), but the CMIP6 models use longitudes from 0°E to 360°E instead. We’ll need to convert our longitude.

Python

if target['lon'] < 0:
    target['lon'] = 360 + target['lon']

target['lon']
## 280.6

Now we can get the data for our latitude and longitude variables. Notice the syntax below. We pull each variable out of nc.variables, select the full range of its data using [:], which returns a masked array, and then extract the values from that array with .data.

Python

nc_lat = nc.variables['lat'][:].data
nc_lon = nc.variables['lon'][:].data

Now we can find the latitude and longitude cell that is closest to our target.

Python

lat_cell = np.argmin(np.abs(nc_lat - target['lat']))
lon_cell = np.argmin(np.abs(nc_lon - target['lon']))
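
It is worth printing the coordinates of the selected cell as a quick sanity check (a small sketch; the exact values depend on the model grid):

Python

print(f"Nearest grid cell centre: {nc_lat[lat_cell]:.2f}°N, {nc_lon[lon_cell]:.2f}°E")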

Not every netCDF file includes “bounds” variables, but when present they can be very useful. It is not always clear what the “latitude” and “longitude” of a cell refer to: is it the centre point? Is it the top-left corner of the grid box? This is where the bounds variables come in. The latitude bounds provide the latitude of the top and bottom of each grid box, and the longitude bounds provide the left and right sides of the box. Using the bounds can be a safer way to identify the grid box that we are interested in. Depending on the model you have selected, look for a variable called lat_bnds or lat_bounds.

Python

nc_lat_bnds = nc.variables["lat_bnds"][:].data
lat_cell = np.where((nc_lat_bnds[:,0] < target['lat']) & (nc_lat_bnds[:,1] > target['lat']))[0][0]
nc_lon_bnds = nc.variables["lon_bnds"][:].data
lon_cell = np.where((nc_lon_bnds[:,0] < target['lon']) & (nc_lon_bnds[:,1] > target['lon']))[0][0]

8.1.2 Getting the data

Now that we know the indices of X, Y, and T that we are interested in, we can pull these out of the netCDF file.

Python

dat = nc.variables['tas'][time_start:(time_end + 1), lat_cell, lon_cell]
dat[0:10]
## masked_array(data=[266.87363, 261.24707, 272.11676, 278.9586 , 288.98645,
##                    292.85455, 295.75702, 294.00974, 288.09775, 283.35312],
##              mask=False,
##        fill_value=1e+20,
##             dtype=float32)

You may have noticed temperatures in the 200s. Our data is in Kelvin. To convert to Celsius, we can subtract 273.15 from every value. For convenience, I will also pull the values out of the masked array using .data.

Python

dat = np.subtract(dat.data, 273.15)

Let’s make sure that we got the right amount of time and data: 12 months times 30 years for 360 data points.

Python

time_end - time_start + 1
## 360
len(dat)
## 360

We can convert our time deltas into dates as follows:

Python

dates = num2date(nc_time[time_start:time_end + 1], units=nc_time.units, calendar=nc_time.calendar)

You may have noticed that, in choosing a single grid cell from a 3D gridded data set, we’ve ended up dropping both spatial dimensions entirely and are left with a one-dimensional array of temperature values. However, we want to retain important label information, such as the time for which each value was projected, so we need a 2D data structure, and will make use of the same tools that we used for our observed data in Sec. 4. We can take our two separate (1D) arrays (Date and MeanTemp) and use pandas to combine this information into a data frame.

Python

df = pd.DataFrame({'Date': dates, 'MeanTemp': dat, 'Experiment': 'historical'})
df.head()
##                   Date   MeanTemp  Experiment
## 0  1981-01-16 12:00:00  -6.276367  historical
## 1  1981-02-15 00:00:00 -11.902924  historical
## 2  1981-03-16 12:00:00  -1.033234  historical
## 3  1981-04-16 00:00:00   5.808594  historical
## 4  1981-05-16 12:00:00  15.836456  historical

Throughout the labs in this book, we have focused our analysis on a single spatial point (or grid cell). pandas makes this easy. As you gain skill and as the spatial focus of your research broadens, you may become interested in performing analysis across multiple spatial points, and you will find that the pandas code here cannot be easily generalized across dimensions. The powerful xarray package borrows concepts from pandas and applies them to \(n\)-dimensional arrays. To learn more about xarray, refer to the documentation. Another fantastic resource is the book An Introduction to Earth and Environmental Data Science by Ryan Abernathey, which includes a chapter introducing xarray.
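
To give a flavour of how this generalizes: with an xarray Dataset (like the ds object we will open in Sec. 8.2), selecting a whole region and averaging over it is a couple of chained calls rather than manual index arithmetic. A hypothetical sketch (the latitude/longitude box here is arbitrary):

Python

# hypothetical sketch: regional mean of 'tas' over a lat/lon box
# ds is an xarray Dataset such as the one we open in Sec. 8.2
regional_mean = ds['tas'].sel(lat=slice(40, 50), lon=slice(270, 290)).mean(dim=['lat', 'lon'])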

Even though we are using monthly data, the Date column is still formatted with a full YYYY-MM-DD hh:mm:ss date! The date and time represent approximately the middle of the month (probably a result of averaging values from a smaller time scale). In addition, the calendar used in CanESM5 is a “365 day” calendar, which ignores leap years. The num2date function that we used above converts the values of the time dimension into special cftime.DatetimeNoLeap objects. Unfortunately, this date format does not behave intuitively in plots. Given that we are using monthly data, it doesn’t matter to us whether there were 28 or 29 days in February of any given year. As such, let’s convert each cftime.DatetimeNoLeap object to a regular Python datetime.date with a YYYY-MM-DD format.

Python

df['Date'] = [dt.date(i.year, i.month, i.day) for i in df.Date]

Let’s see what our data frame looks like.

Python

df.head()
##          Date   MeanTemp  Experiment
## 0  1981-01-16  -6.276367  historical
## 1  1981-02-15 -11.902924  historical
## 2  1981-03-16  -1.033234  historical
## 3  1981-04-16   5.808594  historical
## 4  1981-05-16  15.836456  historical

Python

df.tail()
##            Date   MeanTemp  Experiment
## 355  2010-08-16  23.870422  historical
## 356  2010-09-16  18.445435  historical
## 357  2010-10-16  12.228333  historical
## 358  2010-11-16   4.120392  historical
## 359  2010-12-16  -1.534607  historical

We can plot the results of our analysis up to this point.

Python

plt.plot(df.Date, df.MeanTemp)
plt.title(r'CanESM5 Monthly $T_\mathrm{mean}$ at Toronto (1981‒2010)')
plt.xlabel("Year")
plt.ylabel("Mean Temperature (°C)")

Figure 8.1: Monthly \(T_\mathrm{mean}\) at Toronto (1981‒2010), as simulated by CanESM5.

Looks good! The pronounced seasonal cycle and the winter-to-summer range look reasonable for the Toronto location.

It is always a good idea to close our netCDF file when we are done with it.

Python

nc.close()
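
As an aside, Dataset also supports Python’s context-manager protocol, which closes the file automatically when you are done. A minimal sketch:

Python

# the file is closed automatically when the block exits
with Dataset(past) as nc:
    print(nc.variables['tas'].shape)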

8.2 The past (GCS + xarray)

Now that you have a better sense of the netCDF data structure, we are going to simplify the above process using Google Cloud and xarray. First, let’s import the additional packages we need.

Python

import xarray as xr
import zarr
import gcsfs

Now, let’s access the Google Cloud CMIP6 data catalog as we did in Lab 2 (Sec. 6).

Python

cmip6_cat = pd.read_csv('https://storage.googleapis.com/cmip6/cmip6-zarr-consolidated-stores.csv')

We will proceed by querying the same model (CanESM5), variable (monthly tas, aka MeanTemp) and historical experiment as we did above.

Python

files = cmip6_cat.query("table_id == 'Amon' & source_id == 'CanESM5' & variable_id == 'tas' & experiment_id == 'historical' & member_id =='r1i1p1f1'")
files
##        activity_id institution_id source_id experiment_id member_id  ...  \
## 100305        CMIP          CCCma   CanESM5    historical  r1i1p1f1  ...   
## 
##        variable_id grid_label  \
## 100305         tas         gn   
## 
##                                                    zstore dcpp_init_year  \
## 100305  gs://cmip6/CMIP6/CMIP/CCCma/CanESM5/historical...            NaN   
## 
##          version  
## 100305  20190429  
## 
## [1 rows x 11 columns]

There’s the file we are looking for. Great! As in Lab 2, we will now access the file by establishing a connection to Google Cloud.

Python

# this only needs to be created once
gcs = gcsfs.GCSFileSystem(token='anon')

# get the path to a specific zarr store
zstore = files.zstore.values[0]

# create a mutable-mapping-style interface to the store
mapper = gcs.get_mapper(zstore)

# open it using xarray and zarr
ds = xr.open_zarr(mapper, consolidated=True)
ds
<xarray.Dataset>
Dimensions:    (lat: 64, bnds: 2, lon: 128, time: 1980)
Coordinates:
    height     float64 ...
  * lat        (lat) float64 -87.86 -85.1 -82.31 -79.53 ... 82.31 85.1 87.86
    lat_bnds   (lat, bnds) float64 dask.array<chunksize=(64, 2), meta=np.ndarray>
  * lon        (lon) float64 0.0 2.812 5.625 8.438 ... 348.8 351.6 354.4 357.2
    lon_bnds   (lon, bnds) float64 dask.array<chunksize=(128, 2), meta=np.ndarray>
  * time       (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
    time_bnds  (time, bnds) object dask.array<chunksize=(1980, 2), meta=np.ndarray>
Dimensions without coordinates: bnds
Data variables:
    tas        (time, lat, lon) float32 dask.array<chunksize=(600, 64, 128), meta=np.ndarray>
Attributes: (12/56)
    CCCma_model_hash:            3dedf95315d603326fde4f5340dc0519d80d10c0
    CCCma_parent_runid:          rc3-pictrl
    CCCma_pycmor_hash:           33c30511acc319a98240633965a04ca99c26427e
    CCCma_runid:                 rc3.1-his01
    Conventions:                 CF-1.7 CMIP-6.2
    YMDH_branch_time_in_child:   1850:01:01:00
    ...                          ...
    variable_id:                 tas
    variant_label:               r1i1p1f1
    version:                     v20190429
    status:                      2019-10-25;created;by nhn2@columbia.edu
    netcdf_tracking_ids:         hdl:21.14100/872062df-acae-499b-aa0f-9eaca76...
    version_id:                  v20190429

Notice the structure of the data. You should see all the same dimensions, coordinates and variables as we saw above, but presented in a much more visually intuitive way.
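
Individual pieces are just as easy to reach: coordinates, data variables, and their attributes are all accessible from the Dataset object. For example (a small sketch; these CF metadata attributes appear in the netCDF dump above):

Python

print(ds['tas'].attrs['units'])          # K
print(ds['tas'].attrs['standard_name'])  # air_temperature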

As above, we will now extract the location and time period we are interested in. We did something similar in Lab 2 for daily temperature data. Let’s extend our time period to the year 2014, so that we can more easily append our future experiments, the SSPs, to our data frame.

Python

# Select grid box nearest to the Toronto station latitude and longitude
dat = ds.sel(lat=43.67, lon=280.6, method='nearest')

# Select time range using slicing
dat = dat.sel(time=slice("1981","2014")) - 273.15 # convert from Kelvin to degC

# Load data into memory
dat.load()
<xarray.Dataset>
Dimensions:    (bnds: 2, time: 408)
Coordinates:
    height     float64 2.0
    lat        float64 43.25
    lat_bnds   (bnds) float64 41.86 44.66
    lon        float64 281.2
    lon_bnds   (bnds) float64 279.8 282.7
  * time       (time) object 1981-01-16 12:00:00 ... 2014-12-16 12:00:00
    time_bnds  (time, bnds) object 1981-01-01 00:00:00 ... 2015-01-01 00:00:00
Dimensions without coordinates: bnds
Data variables:
    tas        (time) float32 -6.276 -11.9 -1.033 5.809 ... 9.475 4.363 -3.205
Attributes: (12/56)
    CCCma_model_hash:            3dedf95315d603326fde4f5340dc0519d80d10c0
    CCCma_parent_runid:          rc3-pictrl
    CCCma_pycmor_hash:           33c30511acc319a98240633965a04ca99c26427e
    CCCma_runid:                 rc3.1-his01
    Conventions:                 CF-1.7 CMIP-6.2
    YMDH_branch_time_in_child:   1850:01:01:00
    ...                          ...
    variable_id:                 tas
    variant_label:               r1i1p1f1
    version:                     v20190429
    status:                      2019-10-25;created;by nhn2@columbia.edu
    netcdf_tracking_ids:         hdl:21.14100/872062df-acae-499b-aa0f-9eaca76...
    version_id:                  v20190429

Next, we are going to convert our xarray Dataset into a pandas data frame to ease the integration of weather station data into our analysis later on. Recall that our earlier data frame had three columns: Date, MeanTemp, and Experiment. We can extract the experiment name using the column names from our CMIP6 catalog query.

Python

# get the experiment id to add to our dataframe
exp = files.experiment_id.values[0]
print(exp)
## historical

To integrate the time variable into pandas, we need to convert it from a cftime object (the netCDF time format) to a datetime object.

Python

dates = dat.indexes['time'].to_datetimeindex() #convert cftime to datetime
## <string>:1: RuntimeWarning: Converting a CFTimeIndex with dates from a non-standard calendar, 'noleap', to a pandas.DatetimeIndex, which uses dates from the standard calendar.  This may lead to subtle errors in operations that depend on the length of time between dates.

Finally, let’s package all the pieces together into a data frame.

Python

df = pd.DataFrame({'Date': dates, 'MeanTemp': dat['tas'].values, 'Experiment': exp})
df.head()
##                  Date   MeanTemp  Experiment
## 0 1981-01-16 12:00:00  -6.276367  historical
## 1 1981-02-15 00:00:00 -11.902924  historical
## 2 1981-03-16 12:00:00  -1.033234  historical
## 3 1981-04-16 00:00:00   5.808594  historical
## 4 1981-05-16 12:00:00  15.836456  historical

As above, let’s simplify the Date variable:

Python

df['Date'] = [dt.date(i.year, i.month, i.day) for i in df.Date]

The final data frame should look exactly like the first one we created.

Python

df.head()
##          Date   MeanTemp  Experiment
## 0  1981-01-16  -6.276367  historical
## 1  1981-02-15 -11.902924  historical
## 2  1981-03-16  -1.033234  historical
## 3  1981-04-16   5.808594  historical
## 4  1981-05-16  15.836456  historical

Python

df.tail()
##            Date   MeanTemp  Experiment
## 403  2014-08-16  25.186005  historical
## 404  2014-09-16  20.401062  historical
## 405  2014-10-16   9.474548  historical
## 406  2014-11-16   4.363190  historical
## 407  2014-12-16  -3.205353  historical

8.3 The future

The first thing we need to do is query and locate the future experiment files for CanESM5, i.e. the SSP files.

Python

files2 = cmip6_cat.query("table_id == 'Amon' & source_id == 'CanESM5' & variable_id == 'tas' & experiment_id == ['ssp119','ssp126','ssp245','ssp370','ssp585'] & member_id =='r1i1p1f1'")
files2
##         activity_id institution_id source_id experiment_id member_id  ...  \
## 86050   ScenarioMIP          CCCma   CanESM5        ssp370  r1i1p1f1  ...   
## 94452   ScenarioMIP          CCCma   CanESM5        ssp585  r1i1p1f1  ...   
## 110673  ScenarioMIP          CCCma   CanESM5        ssp245  r1i1p1f1  ...   
## 130759  ScenarioMIP          CCCma   CanESM5        ssp119  r1i1p1f1  ...   
## 151017  ScenarioMIP          CCCma   CanESM5        ssp126  r1i1p1f1  ...   
## 
##        variable_id grid_label  \
## 86050          tas         gn   
## 94452          tas         gn   
## 110673         tas         gn   
## 130759         tas         gn   
## 151017         tas         gn   
## 
##                                                    zstore dcpp_init_year  \
## 86050   gs://cmip6/CMIP6/ScenarioMIP/CCCma/CanESM5/ssp...            NaN   
## 94452   gs://cmip6/CMIP6/ScenarioMIP/CCCma/CanESM5/ssp...            NaN   
## 110673  gs://cmip6/CMIP6/ScenarioMIP/CCCma/CanESM5/ssp...            NaN   
## 130759  gs://cmip6/CMIP6/ScenarioMIP/CCCma/CanESM5/ssp...            NaN   
## 151017  gs://cmip6/CMIP6/ScenarioMIP/CCCma/CanESM5/ssp...            NaN   
## 
##          version  
## 86050   20190429  
## 94452   20190429  
## 110673  20190429  
## 130759  20190429  
## 151017  20190429  
## 
## [5 rows x 11 columns]

Let’s sort this data frame by the experiment. This will come in handy later.

Python

files2 = files2.sort_values(by=['experiment_id'])

Now we are going to repeat, for each of the five SSP experiments, the same process that we used for the historical experiment. Repetition calls for a function! Let’s define a function that accesses a single file (i.e., a single row) of a Google Cloud query data frame.

Python

def open_dataset(files, i):
    # pull the zarr store path and experiment name from row i of the query results
    zstore = files.zstore.values[i]
    exp_id = files.experiment_id.values[i]
    # open the store with xarray, as we did for the historical experiment
    ds = xr.open_zarr(gcs.get_mapper(zstore), consolidated=True)
    return ds, exp_id

We can use this function inside a quick for loop over the five SSP files to append the SSP data onto the historical data frame that we have already created.

This loop will process the data from each file in the same way that we processed the historical data, and will create one data frame containing all of our data.

Python

for i in range(len(files2)):
    tmp,exp = open_dataset(files2,i)

    #extract location and time period of interest
    tmp = tmp.sel(lat="43.67", lon="280.6",method='nearest')
    tmp = tmp.sel(time=slice("2015","2100")) - 273.15 #convert to degC

    #note: some models will have time already as a datetime object. If so, comment out the below line and uncomment the next line.
    dates = tmp.indexes['time'].to_datetimeindex(unsafe=True) #convert cftime to datetime
    #dates = tmp.indexes['time']

    #create data frame
    df2 = pd.DataFrame({'Date': dates, 'MeanTemp': tmp['tas'].values, 'Experiment': exp})
    df2['Date'] = [dt.date(d.year, d.month, d.day) for d in df2.Date]

    #concatenate the new ssp data frame (df2) to original historical data frame (df), such that df grows each time through the loop
    df = pd.concat([df,df2], ignore_index=True, sort=True)
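
Before moving on, it is worth a quick check that all six experiments made it into the combined data frame:

Python

print(df.Experiment.unique())
# expect: 'historical' plus the five SSPs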

The table that we have created above is a “long” data frame. A long data frame is what is referred to as “tidy data” in the R universe. Tidy data is characterized by a few simple principles: each column is a variable, each row is an observation, and each table contains a unique type of observational unit.
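
For example, computing the annual mean temperature for every experiment is a single grouped operation on the long frame (a quick sketch; the Year column is ours, added on the fly):

Python

# group the long frame by experiment and year, then average
annual = (df.assign(Year=[d.year for d in df.Date])
            .groupby(['Experiment', 'Year'])
            .MeanTemp.mean())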

A long layout table is very useful for many operations such as grouping and subsetting, but can be a challenge to plot. For instance:

Python

df.plot()

Figure 8.2: CanESM5 Monthly \(T_\mathrm{mean}\) at Toronto (2015‒2100) using five SSP scenarios

Somewhere in Figure 8.2 is the CanESM5 Monthly \(T_\mathrm{mean}\) at Toronto (2015‒2100) using our five SSP scenarios, but it looks like we need some more sophisticated code to plot this! We can fix the above by filtering the data frame for each scenario and adding each line to the plot manually.

Python

plt.plot(df.Date[df.Experiment == "ssp119"], df.MeanTemp[df.Experiment == "ssp119"], 'c-', alpha=0.7, label = 'SSP1-1.9')
plt.plot(df.Date[df.Experiment == "ssp126"], df.MeanTemp[df.Experiment == "ssp126"], 'b-', alpha=0.7, label = 'SSP1-2.6')
plt.plot(df.Date[df.Experiment == "ssp245"], df.MeanTemp[df.Experiment == "ssp245"], 'y-', alpha=0.7, label = 'SSP2-4.5')
plt.plot(df.Date[df.Experiment == "ssp370"], df.MeanTemp[df.Experiment == "ssp370"], 'r-', alpha=0.7, label = 'SSP3-7.0')
plt.plot(df.Date[df.Experiment == "ssp585"], df.MeanTemp[df.Experiment == "ssp585"], 'darkred', alpha=0.7, label = 'SSP5-8.5')
plt.legend(loc='lower right')
plt.ylim(-20,40)
plt.title(r'CanESM5 Monthly $T_\mathrm{mean}$ at Toronto (2015‒2100)')
plt.xlabel("Year")
plt.ylabel("Mean Temperature (°C)")
## (-20.0, 40.0)

Figure 8.3: Monthly \(T_\mathrm{mean}\) at Toronto (2015‒2100), as simulated by CanESM5 using five SSP scenarios.

What if we want to include the “historical” data? In Figure 8.3 we manually plotted each scenario as a separate line on the same plot. The seaborn library can also make this process a little easier. Using seaborn we can identify what it is that makes each data series unique and how we want to differentiate between series. We will pass hue="Experiment" to our seaborn plot, which indicates that we want a different hue (i.e. colour) for each unique value in "Experiment".

Python

import seaborn as sns

ax = sns.lineplot(data=df, x="Date", y="MeanTemp", hue="Experiment",
                  palette=["k", "c", "b", "y", "r", "darkred"])
ax.set(ylabel="Mean Temperature (°C)",
       title=r'CanESM5 Monthly $T_\mathrm{mean}$ at Toronto (1981‒2100)')

Figure 8.4: Historical and future monthly \(T_\mathrm{mean}\) at Toronto (1981‒2100), as simulated by CanESM5. Future projections are forced using five SSP scenarios.

A long data frame is computationally efficient, but isn’t as intuitive to navigate by the human eye. As such, if we want to include a table in a report or publication, the wide layout is usually more print-friendly. You can create a wide layout table using df.pivot like this:

Python

df_wide = df.pivot(index='Date', columns='Experiment')['MeanTemp']
df_wide.head()
## Experiment  historical  ssp119  ssp126  ssp245  ssp370  ssp585
## Date                                                          
## 1981-01-16   -6.276367     NaN     NaN     NaN     NaN     NaN
## 1981-02-15  -11.902924     NaN     NaN     NaN     NaN     NaN
## 1981-03-16   -1.033234     NaN     NaN     NaN     NaN     NaN
## 1981-04-16    5.808594     NaN     NaN     NaN     NaN     NaN
## 1981-05-16   15.836456     NaN     NaN     NaN     NaN     NaN

Python

df_wide.tail()
## Experiment  historical     ssp119     ssp126     ssp245     ssp370     ssp585
## Date                                                                         
## 2100-08-16         NaN  23.564697  26.388550  26.977051  31.046631  31.213440
## 2100-09-16         NaN  19.886261  20.563843  25.151825  26.710999  30.756989
## 2100-10-16         NaN  11.292908  14.421967  16.221588  17.903748  22.165222
## 2100-11-16         NaN   6.061523   5.788635   8.628082  10.470490  13.136444
## 2100-12-16         NaN  -2.109161  -2.885834   3.456177   4.007507   5.829773

This is also easier to plot with the plot method built into pandas.

Python

df_wide.plot(style={'historical': 'k', 'ssp119': 'c', 'ssp126': 'b',
                    'ssp245': 'y', 'ssp370': 'r', 'ssp585': 'darkred'},
             title=r'CanESM5 Monthly $T_\mathrm{mean}$ at Toronto (1981‒2100)')
plt.ylabel("Mean Temperature (°C)")

Figure 8.5: Historical and future monthly \(T_\mathrm{mean}\) at Toronto (1981‒2100), as simulated by CanESM5. Future projections are forced using five SSP scenarios.

8.4 Change Factors

In our next lab, we are going to use change factors to describe changes in our baseline climate. The change factor method (sometimes called \(\Delta T\)) calculates the projected temperature by adding the observed, baseline temperature time series to the change in the temperature projected by a GCM integration (typically using monthly GCM data). This process is intended to remove the mean bias in GCM historical and future simulated climate conditions, and represents a simplified form of downscaling. Change factors are described by the following equation:

\[T_{\mathrm{FUTURE}} = T_{\mathrm{OBSERVED}} + \Delta T_{\mathrm{GCM}}\]

The first step for a change factor analysis is to calculate the change projected by a GCM over its baseline period. To do so, we will use 30-year “tridecades”, including our baseline (1981–2010), and three common climate change tridecades: the 2020s (2011–2040), the 2050s (2041–2070), and the 2080s (2071–2100). In this case, \(T_{\mathrm{OBSERVED}}\) is the observed 30-year average mean temperature over the baseline period, and \(\Delta T_{\mathrm{GCM}}\) is the difference between the model-projected 30-year average mean temperature for the baseline period and the model-projected 30-year average mean temperature for each of the tridecades. In your research, you may opt to use a seasonal or monthly average instead of the annual average presented here.
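
In other words, for each scenario and tridecade, the change factor is simply the difference between two 30-year model means:

\[\Delta T_{\mathrm{GCM}} = \overline{T}_{\mathrm{GCM,\ FUTURE}} - \overline{T}_{\mathrm{GCM,\ BASELINE}}\]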

Let’s begin by filtering our “long” data frame to our time periods of interest. I will only provide examples for the baseline and the 2020s, but you should complete this work for the 2050s and 2080s as well.

Python

df_base = df[(df.Date >= dt.date(1981, 1, 1)) & (df.Date <= dt.date(2010, 12, 31))]
df_2020s = df[(df.Date >= dt.date(2011, 1, 1)) & (df.Date <= dt.date(2040, 12, 31))]

Next, we compute the 30-year averages for our baseline and the 2020s. Recall that the 2020s tridecade includes a few years from the historical experiment: the years 2011–2014.

Python

bsln = df_base[df_base.Experiment == "historical"].MeanTemp.mean()

Now we can calculate the change factor (or anomaly) for each of our tridecadal periods.

Python

cf2020s_ssp119 = df_2020s[df_2020s.Experiment.isin(["historical","ssp119"])].MeanTemp.mean() - bsln
cf2020s_ssp126 = df_2020s[df_2020s.Experiment.isin(["historical","ssp126"])].MeanTemp.mean() - bsln
cf2020s_ssp245 = df_2020s[df_2020s.Experiment.isin(["historical","ssp245"])].MeanTemp.mean() - bsln
cf2020s_ssp370 = df_2020s[df_2020s.Experiment.isin(["historical","ssp370"])].MeanTemp.mean() - bsln
cf2020s_ssp585 = df_2020s[df_2020s.Experiment.isin(["historical","ssp585"])].MeanTemp.mean() - bsln
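
The five near-identical lines above are crying out for a loop. Here is a sketch of a helper that computes the change factor for any scenario and any tridecade (the function name and signature are ours, not part of the lab); the same call with 2041/2070 and 2071/2100 will fill in the 2050s and 2080s:

Python

def change_factor(df, ssp, start_year, end_year, bsln):
    # 30-year mean for one scenario (plus any overlapping historical years), minus the baseline
    period = df[(df.Date >= dt.date(start_year, 1, 1)) & (df.Date <= dt.date(end_year, 12, 31))]
    return period[period.Experiment.isin(["historical", ssp])].MeanTemp.mean() - bsln

cf2020s = {ssp: change_factor(df, ssp, 2011, 2040, bsln)
           for ssp in ["ssp119", "ssp126", "ssp245", "ssp370", "ssp585"]}
# e.g. cf2020s['ssp119'] matches cf2020s_ssp119 above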

We can summarize these data in a table like so:

Python

pd.DataFrame({"Model": np.repeat("CanESM5", 5),
              "Ensemble": np.repeat("r1i1p1f1", 5),
              "Scenario": ["SSP1-1.9","SSP1-2.6", "SSP2-4.5","SSP3-7.0", "SSP5-8.5"],
              "Baseline (°C)": np.repeat([bsln], 5),
              "2020s": [cf2020s_ssp119, cf2020s_ssp126, cf2020s_ssp245, cf2020s_ssp370,cf2020s_ssp585],
              "2050s": np.repeat([''], 5),
              "2080s": np.repeat([''], 5)})
##      Model  Ensemble  Scenario  Baseline (°C)     2020s 2050s 2080s
## 0  CanESM5  r1i1p1f1  SSP1-1.9       8.782319  1.941514            
## 1  CanESM5  r1i1p1f1  SSP1-2.6       8.782319  1.865280            
## 2  CanESM5  r1i1p1f1  SSP2-4.5       8.782319  2.065468            
## 3  CanESM5  r1i1p1f1  SSP3-7.0       8.782319  2.133127            
## 4  CanESM5  r1i1p1f1  SSP5-8.5       8.782319  2.131548

8.5 Exercises (what to submit)

  • Complete a table similar to the one above for a different model (this can be the model that you used in Lab 2, or you can choose from the list of model names available here). [2 marks]
  • The table below provides the baseline mean temperature (°C) and the 2020s change factor for SSP1-2.6 and SSP2-4.5 for all five available ensemble members of CanESM5. In some cases, the change factor for the SSP1-2.6 scenario is higher than for the SSP2-4.5 scenario. What are some possible explanations for this? Hint: Consider the sources of uncertainty in climate model projections and how these relate to two key terms, “pathway” and “ensemble member”. [2 marks]
Ensemble   Scenario   Baseline (°C)   2011–2040 (°C)
r1i1p1f1   SSP1-2.6   8.8             1.9
r1i1p1f1   SSP2-4.5   8.8             2.1
r2i1p1f1   SSP1-2.6   8.6             2.0
r2i1p1f1   SSP2-4.5   8.6             1.8
r3i1p1f1   SSP1-2.6   8.8             1.9
r3i1p1f1   SSP2-4.5   8.8             1.9
r4i1p1f1   SSP1-2.6   8.8             2.1
r4i1p1f1   SSP2-4.5   8.8             1.9
r5i1p1f1   SSP1-2.6   8.7             1.8
r5i1p1f1   SSP2-4.5   8.7             2.0
  • Download the following .csv file. This file contains CMIP6 data for Toronto for all of the models that are available for ensemble r1i1p1f1 for annual mean temperature, tas. Open the file with pandas and take a look at the column names. Based on the lab exercise, describe what the column names correspond to. [2 marks]
  • Compare the output of the models in the .csv file. Are the model projections (the change factors) largely similar, or different? What might explain some of these similarities or differences? Consider similarities/differences across scenarios, across time periods, and across models. [2 marks]
  • Provide a brief summary of the output in the .csv file. Specifically, which model(s) simulates the warmest and coldest baseline? Which model projects the largest change in temperature (overall, and for each scenario and tridecadal period)? Which projects the smallest? What is the multi-model mean baseline and change factor for each tridecadal period and scenario? Compile your data in a table (or tables). Think about how best to present this data in tabular form (model names and the corresponding values). If you are more comfortable doing this analysis with Excel, you are welcome to, but I encourage you to attempt some analysis using Python; Lab 4 (Sec. 9) offers some guidance! [4 marks]