Sec. 8 Lab exercise 3: Navigating across three dimensions of data
Last update: Tue Feb 14 08:28:19 2023 -0500 (3570082)
Now that we have acquired our GCM output data (Lab 2, Sec. 6), we need to learn more about how to access it and begin to analyze it. In this lab, you will find a spatial and temporal subset of your GCM data matching your observed climate baseline using two different methods, and will begin to understand the future climate using change factors. By the end of this lab, you will have data from more than 50 GCMs. In Lab 4 (Sec. 9 and 10), we will see how we can narrow down our selection of the models that will inform our climate change impact assessment.
We are going to begin by exploring some of the attributes of three dimensional data contained in the downloaded CMIP6 netCDF files. We are going to use the netCDF4 module to access the information in our netCDF files. It is strongly recommended that you have a browser tab open with the netCDF4 library documentation for reference.
Once we have a better sense of the structure of our data, we will then use Google Cloud and the xarray package to access the CMIP6 data in zarr format in a more efficient way.
We begin, as always, by importing the libraries that we will need.
Python
import datetime as dt
import numpy as np
import pandas as pd
import re
from matplotlib import pyplot as plt
from netCDF4 import Dataset, date2index, num2date, date2num
Now, create a list of the netCDF files that we will examine. We have created some miniature netCDF files for use in this book. If you have been following this manual in order, then you should substitute the directories and filenames below with paths leading to the netCDF files that you downloaded in Lab 2.
Python
past = "data/mini/tas_Amon_CanESM5_historical_r1i1p1f1_gn_185001-201412_mini.nc"
future = ["data/mini/tas_Amon_CanESM5_ssp119_r1i1p1f1_gn_201501-210012_mini.nc",
          "data/mini/tas_Amon_CanESM5_ssp126_r1i1p1f1_gn_201501-210012_mini.nc",
          "data/mini/tas_Amon_CanESM5_ssp245_r1i1p1f1_gn_201501-210012_mini.nc",
          "data/mini/tas_Amon_CanESM5_ssp370_r1i1p1f1_gn_201501-210012_mini.nc",
          "data/mini/tas_Amon_CanESM5_ssp585_r1i1p1f1_gn_201501-210012_mini.nc"]
8.1 The past (from scratch)
We will begin by looking at our “historical” netCDF file. We can load this file into memory using the Dataset()
function that we imported from netCDF4.
Python
nc = Dataset(past)
First, we recommend that you get familiar with this data set. For instance, what are the dimensions of your data?
Python
nc.dimensions
## {'lat': <class 'netCDF4._netCDF4.Dimension'>: name = 'lat', size = 7, 'bnds': <class 'netCDF4._netCDF4.Dimension'>: name = 'bnds', size = 2, 'lon': <class 'netCDF4._netCDF4.Dimension'>: name = 'lon', size = 32, 'time': <class 'netCDF4._netCDF4.Dimension'> (unlimited): name = 'time', size = 1980}
Notice that we have three-dimensional data, stored along X (longitude), Y (latitude), and T (time). Some netCDF files also have a Z dimension (e.g. height). There is no hard limit to the number of dimensions a netCDF file can have, though it is suggested not to exceed 1024. Rest assured, it is unlikely that you will encounter a 1024-D file in the wild. To get a list of the variables stored in your netCDF file, query the variables
attribute of your loaded data set.
Python
nc.variables
## {'height': <class 'netCDF4._netCDF4.Variable'>
## float64 height()
## units: m
## axis: Z
## positive: up
## long_name: height
## standard_name: height
## unlimited dimensions:
## current shape = ()
## filling on, default _FillValue of 9.969209968386869e+36 used, 'lat': <class 'netCDF4._netCDF4.Variable'>
## float64 lat(lat)
## bounds: lat_bnds
## units: degrees_north
## axis: Y
## long_name: Latitude
## standard_name: latitude
## unlimited dimensions:
## current shape = (7,)
## filling on, default _FillValue of 9.969209968386869e+36 used, 'lat_bnds': <class 'netCDF4._netCDF4.Variable'>
## float64 lat_bnds(lat, bnds)
## unlimited dimensions:
## current shape = (7, 2)
## filling on, default _FillValue of 9.969209968386869e+36 used, 'lon': <class 'netCDF4._netCDF4.Variable'>
## float64 lon(lon)
## bounds: lon_bnds
## units: degrees_east
## axis: X
## long_name: Longitude
## standard_name: longitude
## unlimited dimensions:
## current shape = (32,)
## filling on, default _FillValue of 9.969209968386869e+36 used, 'lon_bnds': <class 'netCDF4._netCDF4.Variable'>
## float64 lon_bnds(lon, bnds)
## unlimited dimensions:
## current shape = (32, 2)
## filling on, default _FillValue of 9.969209968386869e+36 used, 'tas': <class 'netCDF4._netCDF4.Variable'>
## float32 tas(time, lat, lon)
## standard_name: air_temperature
## long_name: Near-Surface Air Temperature
## comment: ST+273.16, CMIP_table_comment: near-surface (usually, 2 meter) air temperature
## units: K
## original_name: ST
## history: degctok 2019-04-30T17:32:17Z altered by CMOR: Treated scalar dimension: 'height'. 2019-04-30T17:32:17Z altered by CMOR: Reordered dimensions, original order: lat lon time. 2019-04-30T17:32:17Z altered by CMOR: replaced missing value flag (1e+38) with standard missing value (1e+20).
## cell_methods: area: time: mean
## cell_measures: area: areacella
## coordinates: height
## missing_value: 1e+20
## _FillValue: 1e+20
## unlimited dimensions: time
## current shape = (1980, 7, 32)
## filling on, 'time': <class 'netCDF4._netCDF4.Variable'>
## float64 time(time)
## bounds: time_bnds
## units: days since 1850-01-01 0:0:0.0
## calendar: 365_day
## axis: T
## long_name: time
## standard_name: time
## unlimited dimensions: time
## current shape = (1980,)
## filling on, default _FillValue of 9.969209968386869e+36 used, 'time_bnds': <class 'netCDF4._netCDF4.Variable'>
## float64 time_bnds(time, bnds)
## unlimited dimensions: time
## current shape = (1980, 2)
## filling on, default _FillValue of 9.969209968386869e+36 used}
You can query a specific variable as well, e.g. time.
Python
nc.variables['time']
## <class 'netCDF4._netCDF4.Variable'>
## float64 time(time)
## bounds: time_bnds
## units: days since 1850-01-01 0:0:0.0
## calendar: 365_day
## axis: T
## long_name: time
## standard_name: time
## unlimited dimensions: time
## current shape = (1980,)
## filling on, default _FillValue of 9.969209968386869e+36 used
The above shows us some attributes of our variable, such as units, current shape, and whether it is an unlimited variable or not.
We can store the variable in a Python object, which will help us to get its values.
Python
nc_time = nc.variables['time']
nc_time[0:10]
## masked_array(data=[ 15.5, 45. , 74.5, 105. , 135.5, 166. , 196.5, 227.5,
## 258. , 288.5],
## mask=False,
## fill_value=1e+20)
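The values above are offsets in days from the origin given in the variable's units attribute ("days since 1850-01-01"). For a standard calendar, we could decode such an offset by hand with the standard library. The sketch below is only to build intuition about what the offsets mean; because CanESM5 uses a 365-day calendar, hand-decoding with datetime is not safe in general, which is why we rely on the num2date helper in this lab.

```python
import datetime as dt

# Decode a "days since 1850-01-01" offset by hand (standard calendar only)
origin = dt.datetime(1850, 1, 1)
offset = 15.5  # the first value of the time variable, in days

decoded = origin + dt.timedelta(days=offset)
print(decoded)  # 1850-01-16 12:00:00, i.e. the middle of January 1850
```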
8.1.1 Get subset indices
The time variable isn’t very easy for us to read, as it is stored as “Days since January 1, 1850”. Interpreting the time variable is where xarray excels and why we will ultimately use this package going forward. Nevertheless, it is instructive to work through how to handle the time dimension in the absence of xarray. Luckily, there are a few helper functions to help us get our period of interest: 1981 to 2010.
Python
time_start = date2index(dt.datetime(1980, 12, 1), nc_time, select="nearest")
time_end = date2index(dt.datetime(2010, 10, 31), nc_time, select="nearest")
Note that we have selected the index that contains data nearest to our desired end date.
Python
num2date(nc_time[time_end], units=nc_time.units, calendar=nc_time.calendar)
## cftime.DatetimeNoLeap(2010, 12, 16, 12, 0, 0, 0, has_year_zero=True)
We now know the slice of T that we need. How about X and Y? We will select a single grid cell, based on our station target. First, I will define that target. I have chosen to use a Python dictionary, because I have a tendency to get my “lats” and my “lons” confused. You can use a regular list if you prefer.
Python
# Toronto station coords
target = {"lat": 43.67, "lon": -79.40}
We are used to using negatives for western longitudes (i.e. 180°W to 180°E), but the CMIP6 models use longitudes from 0°E to 360°E instead. We’ll need to convert our longitude.
Python
if target['lon'] < 0:
    target['lon'] = 360 + target['lon']

target['lon']
## 280.6
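A compact alternative to the if statement above (a sketch, not part of the lab code) is Python's modulo operator, which maps any longitude onto the [0, 360) convention and works for scalars and NumPy arrays alike:

```python
import numpy as np

def to_0_360(lon):
    """Map longitudes from the [-180, 180] convention onto [0, 360)."""
    return lon % 360

print(to_0_360(-79.40))                    # 280.6, same result as above
print(to_0_360(np.array([-79.40, 10.0])))  # works element-wise on arrays too
```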
Now we can get the data for our latitude and longitude variables. Notice the syntax below. We query the variable via nc.variables, then we choose the full range of the data using [:], which returns a masked array. We pull the values out of that array with .data.
Python
nc_lat = nc.variables['lat'][:].data
nc_lon = nc.variables['lon'][:].data
Now we can find the latitude and longitude cell that is closest to our target.
Python
lat_cell = np.argmin(np.abs(nc_lat - target['lat']))
lon_cell = np.argmin(np.abs(nc_lon - target['lon']))
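To see what this nearest-neighbour lookup does, here is a toy example with made-up grid-cell latitudes (illustrative values only, not the CanESM5 grid):

```python
import numpy as np

# Hypothetical grid-cell centre latitudes (not the CanESM5 grid)
lats = np.array([40.0, 42.8, 45.6])
target_lat = 43.67

# np.argmin of the absolute differences gives the index of the closest cell
cell = np.argmin(np.abs(lats - target_lat))
print(cell, lats[cell])  # 1 42.8
```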
The minimal netCDF files that we’ve created for this book do not include “bounds” variables; however, these can be useful. It is not always clear what the “latitude” and “longitude” of a cell refer to: is it the centre point? Is it the top-left corner of the grid box? This is where the bounds variables are really useful. The latitude bounds provide the latitude of the top and bottom of each grid box. The longitude bounds provide the right and left sides of the box. Using the bounds can be a safer way to identify the grid box that we are interested in. Depending on the model you have selected, look for a variable called lat_bnds or lat_bounds.
Python
nc_lat_bnds = nc.variables["lat_bnds"][:].data
lat_cell = np.where((nc_lat_bnds[:,0] < target['lat']) & (nc_lat_bnds[:,1] > target['lat']))[0][0]
nc_lon_bnds = nc.variables["lon_bnds"][:].data
lon_cell = np.where((nc_lon_bnds[:,0] < target['lon']) & (nc_lon_bnds[:,1] > target['lon']))[0][0]
8.1.2 Getting the data
Now that we know the indices of X, Y, and T that we are interested in, we can pull these out of the netCDF file.
Python
dat = nc.variables['tas'][time_start:(time_end + 1), lat_cell, lon_cell]
dat[0:10]
## masked_array(data=[266.87363, 261.24707, 272.11676, 278.9586 , 288.98645,
## 292.85455, 295.75702, 294.00974, 288.09775, 283.35312],
## mask=False,
## fill_value=1e+20,
## dtype=float32)
You may have noticed temperatures in the 200s. Our data is in Kelvin. To convert to Celsius, we can subtract 273.15 from every value. For convenience, I will also pull the values out of the masked array using .data.
Python
dat = np.subtract(dat.data, 273.15)
Let’s make sure that we got the right amount of data: 12 months times 30 years gives 360 data points.
Python
time_end - time_start + 1
## 360
len(dat)
## 360
We can convert our time deltas into dates as follows:
Python
dates = num2date(nc_time[time_start:time_end + 1], units=nc_time.units, calendar=nc_time.calendar)
You may have noticed that, in choosing a single grid cell from a 3D gridded data set, we’ve dropped both spatial dimensions entirely and are left with a one-dimensional array of temperature values. However, we want to retain important label information, such as the time for which each value was projected, so we need a 2D data structure, and will make use of the same tools that we used for our observed data in Sec. 4. We can take our two separate (1D) arrays (Date and MeanTemp) and use pandas to combine this information into a data frame.
Python
df = pd.DataFrame({'Date': dates, 'MeanTemp': dat, 'Experiment': 'historical'})
df.head()
## Date MeanTemp Experiment
## 0 1981-01-16 12:00:00 -6.276367 historical
## 1 1981-02-15 00:00:00 -11.902924 historical
## 2 1981-03-16 12:00:00 -1.033234 historical
## 3 1981-04-16 00:00:00 5.808594 historical
## 4 1981-05-16 12:00:00 15.836456 historical
Throughout the labs in this book, we have focused our analysis on a single spatial point (or grid cell). pandas makes this easy. As you gain skill and as the spatial focus of your research broadens, you may become interested in performing analysis across multiple spatial points, and you will find that the pandas code here cannot be easily generalized across dimensions. The powerful xarray package borrows concepts from pandas and applies them to \(n\)-dimensional arrays. To learn more about xarray refer to the documentation. Another fantastic resource is the book An Introduction to Earth and Environmental Data Science by Ryan Abernathey, which includes a chapter introducing xarray.
Even though we are using monthly data, the date column is still formatted with a full YYYY-MM-DD hh:mm:ss date! The date and time represent approximately the middle of the month (probably from averaging values from a smaller time scale). In addition, the calendar used in CanESM5 is a “365 day” calendar, which ignores leap years. The num2date function that we used above converts the values of the time dimension into a special cftime.DatetimeNoLeap object. Unfortunately, this date format does not behave intuitively in plots. Given that we are using monthly data, it doesn’t matter to us whether there were 28 or 29 days in February of any given year. As such, let’s convert the cftime.DatetimeNoLeap objects to regular Python datetime.date objects with a YYYY-MM-DD format.
Python
df['Date'] = [dt.date(i.year, i.month, i.day) for i in df.Date]
Let’s see what our data frame looks like.
Python
df.head()
## Date MeanTemp Experiment
## 0 1981-01-16 -6.276367 historical
## 1 1981-02-15 -11.902924 historical
## 2 1981-03-16 -1.033234 historical
## 3 1981-04-16 5.808594 historical
## 4 1981-05-16 15.836456 historical
Python
df.tail()
## Date MeanTemp Experiment
## 355 2010-08-16 23.870422 historical
## 356 2010-09-16 18.445435 historical
## 357 2010-10-16 12.228333 historical
## 358 2010-11-16 4.120392 historical
## 359 2010-12-16 -1.534607 historical
We can plot the results of our analysis up to this point.
Python
plt.plot(df.Date, df.MeanTemp)
plt.title(r'CanESM5 Monthly $T_\mathrm{mean}$ at Toronto (1981‒2010)')
plt.xlabel("Year")
plt.ylabel("Mean Temperature (°C)")
Looks good! The pronounced seasonal cycle and the winter to summer range looks reasonable for the Toronto location.
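One way to confirm that seasonal cycle quantitatively is to group by calendar month and check that the coldest and warmest months fall where we expect. The sketch below uses synthetic stand-in data (named demo_df so it does not clobber our real df), since the check only needs Date and MeanTemp columns like those in the frame we just built:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df: ten years of monthly means with a built-in
# seasonal cycle (coldest in January, warmest in July)
demo_dates = pd.date_range("1981-01-01", periods=120, freq="MS")
demo_temps = 8 + 15 * np.sin(2 * np.pi * (demo_dates.month - 4) / 12)
demo_df = pd.DataFrame({"Date": demo_dates, "MeanTemp": demo_temps})

# Monthly climatology: the average for each calendar month across all years
clim = demo_df.groupby(demo_df.Date.dt.month).MeanTemp.mean()
print(clim.idxmin(), clim.idxmax())  # 1 7 (January coldest, July warmest)
```

Applied to the real df, the same groupby would give the model's 1981–2010 monthly climatology at Toronto.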
It is always a good idea to close our netCDF file when we are done with it.
Python
nc.close()
8.2 The past (GCS + xarray)
Now that you have a better sense of the netCDF data structure, we are going to simplify the above process using Google Cloud and xarray. First, let’s import the additional packages we need.
Python
import xarray as xr
import zarr
import gcsfs
Now, let’s access the Google Cloud CMIP6 data catalog as we did in Lab 2 (Sec. 6).
Python
cmip6_cat = pd.read_csv('https://storage.googleapis.com/cmip6/cmip6-zarr-consolidated-stores.csv')
We will proceed by querying the same model (CanESM5), variable (monthly tas, aka MeanTemp) and historical experiment as we did above.
Python
files = cmip6_cat.query("table_id == 'Amon' & source_id == 'CanESM5' & variable_id == 'tas' & experiment_id == 'historical' & member_id == 'r1i1p1f1'")
files
## activity_id institution_id source_id experiment_id member_id ... \
## 100305 CMIP CCCma CanESM5 historical r1i1p1f1 ...
##
## variable_id grid_label \
## 100305 tas gn
##
## zstore dcpp_init_year \
## 100305 gs://cmip6/CMIP6/CMIP/CCCma/CanESM5/historical... NaN
##
## version
## 100305 20190429
##
## [1 rows x 11 columns]
There’s the file we are looking for. Great! As in Lab 2, we will now access the file by establishing a connection to Google Cloud.
Python
# this only needs to be created once
gcs = gcsfs.GCSFileSystem(token='anon')

# get the path to a specific zarr store
zstore = files.zstore.values[0]

# create a mutable-mapping-style interface to the store
mapper = gcs.get_mapper(zstore)

# open it using xarray and zarr
ds = xr.open_zarr(mapper, consolidated=True)
ds
<xarray.Dataset>
Dimensions:    (lat: 64, bnds: 2, lon: 128, time: 1980)
Coordinates:
    height     float64 ...
  * lat        (lat) float64 -87.86 -85.1 -82.31 -79.53 ... 82.31 85.1 87.86
    lat_bnds   (lat, bnds) float64 dask.array<chunksize=(64, 2), meta=np.ndarray>
  * lon        (lon) float64 0.0 2.812 5.625 8.438 ... 348.8 351.6 354.4 357.2
    lon_bnds   (lon, bnds) float64 dask.array<chunksize=(128, 2), meta=np.ndarray>
  * time       (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
    time_bnds  (time, bnds) object dask.array<chunksize=(1980, 2), meta=np.ndarray>
Dimensions without coordinates: bnds
Data variables:
    tas        (time, lat, lon) float32 dask.array<chunksize=(600, 64, 128), meta=np.ndarray>
Attributes: (12/56)
    CCCma_model_hash:           3dedf95315d603326fde4f5340dc0519d80d10c0
    CCCma_parent_runid:         rc3-pictrl
    CCCma_pycmor_hash:          33c30511acc319a98240633965a04ca99c26427e
    CCCma_runid:                rc3.1-his01
    Conventions:                CF-1.7 CMIP-6.2
    YMDH_branch_time_in_child:  1850:01:01:00
    ...                         ...
    variable_id:                tas
    variant_label:              r1i1p1f1
    version:                    v20190429
    status:                     2019-10-25;created;by nhn2@columbia.edu
    netcdf_tracking_ids:        hdl:21.14100/872062df-acae-499b-aa0f-9eaca76...
    version_id:                 v20190429
Notice the structure of the data. You should see all the same dimensions, coordinates and variables as we saw above, but presented in a much more visually intuitive way.
As above, we will now extract the location and time period we are interested in. We did something similar in Lab 2 for daily temperature data. Let’s extend our time period to the year 2014, so that we can more easily append our future experiments, the SSPs, to our data frame.
Python
# Select grid box nearest to the Toronto station latitude and longitude
dat = ds.sel(lat=43.67, lon=280.6, method='nearest')

# Select time range using slicing
dat = dat.sel(time=slice("1981", "2014")) - 273.15  # convert from Kelvin to degC
# Load data into memory
dat.load()
<xarray.Dataset>
Dimensions:    (bnds: 2, time: 408)
Coordinates:
    height     float64 2.0
    lat        float64 43.25
    lat_bnds   (bnds) float64 41.86 44.66
    lon        float64 281.2
    lon_bnds   (bnds) float64 279.8 282.7
  * time       (time) object 1981-01-16 12:00:00 ... 2014-12-16 12:00:00
    time_bnds  (time, bnds) object 1981-01-01 00:00:00 ... 2015-01-01 00:00:00
Dimensions without coordinates: bnds
Data variables:
    tas        (time) float32 -6.276 -11.9 -1.033 5.809 ... 9.475 4.363 -3.205
Attributes: (12/56)
    CCCma_model_hash:           3dedf95315d603326fde4f5340dc0519d80d10c0
    CCCma_parent_runid:         rc3-pictrl
    CCCma_pycmor_hash:          33c30511acc319a98240633965a04ca99c26427e
    CCCma_runid:                rc3.1-his01
    Conventions:                CF-1.7 CMIP-6.2
    YMDH_branch_time_in_child:  1850:01:01:00
    ...                         ...
    variable_id:                tas
    variant_label:              r1i1p1f1
    version:                    v20190429
    status:                     2019-10-25;created;by nhn2@columbia.edu
    netcdf_tracking_ids:        hdl:21.14100/872062df-acae-499b-aa0f-9eaca76...
    version_id:                 v20190429
Next, we are going to convert our xarray data set into a pandas data frame to ease the integration of weather station data into our analysis later on. Recall that our data frame above had three columns: Date, MeanTemp, and Experiment. We can extract the experiment name using the column names from our CMIP6 catalog query.
Python
# get the experiment id to add to our dataframe
exp = files.experiment_id.values[0]
print(exp)
## historical
To integrate the time variable into pandas, we need to convert it from a cftime object (the netCDF time format) to a datetime object.
Python
dates = dat.indexes['time'].to_datetimeindex()  # convert cftime to datetime
## <string>:1: RuntimeWarning: Converting a CFTimeIndex with dates from a non-standard calendar, 'noleap', to a pandas.DatetimeIndex, which uses dates from the standard calendar. This may lead to subtle errors in operations that depend on the length of time between dates.
Finally, let’s package all the pieces together into a data frame.
Python
df = pd.DataFrame({'Date': dates, 'MeanTemp': dat['tas'].values, 'Experiment': exp})
df.head()
## Date MeanTemp Experiment
## 0 1981-01-16 12:00:00 -6.276367 historical
## 1 1981-02-15 00:00:00 -11.902924 historical
## 2 1981-03-16 12:00:00 -1.033234 historical
## 3 1981-04-16 00:00:00 5.808594 historical
## 4 1981-05-16 12:00:00 15.836456 historical
As above, let’s simplify the Date variable:
Python
df['Date'] = [dt.date(i.year, i.month, i.day) for i in df.Date]
The final data frame should look exactly like the first one we created.
Python
df.head()
## Date MeanTemp Experiment
## 0 1981-01-16 -6.276367 historical
## 1 1981-02-15 -11.902924 historical
## 2 1981-03-16 -1.033234 historical
## 3 1981-04-16 5.808594 historical
## 4 1981-05-16 15.836456 historical
Python
df.tail()
## Date MeanTemp Experiment
## 403 2014-08-16 25.186005 historical
## 404 2014-09-16 20.401062 historical
## 405 2014-10-16 9.474548 historical
## 406 2014-11-16 4.363190 historical
## 407 2014-12-16 -3.205353 historical
8.3 The future
The first thing we need to do is query and locate the future experiment files for CanESM5, i.e. the SSP files.
Python
files2 = cmip6_cat.query("table_id == 'Amon' & source_id == 'CanESM5' & variable_id == 'tas' & experiment_id == ['ssp119','ssp126','ssp245','ssp370','ssp585'] & member_id == 'r1i1p1f1'")
files2
## activity_id institution_id source_id experiment_id member_id ... \
## 86050 ScenarioMIP CCCma CanESM5 ssp370 r1i1p1f1 ...
## 94452 ScenarioMIP CCCma CanESM5 ssp585 r1i1p1f1 ...
## 110673 ScenarioMIP CCCma CanESM5 ssp245 r1i1p1f1 ...
## 130759 ScenarioMIP CCCma CanESM5 ssp119 r1i1p1f1 ...
## 151017 ScenarioMIP CCCma CanESM5 ssp126 r1i1p1f1 ...
##
## variable_id grid_label \
## 86050 tas gn
## 94452 tas gn
## 110673 tas gn
## 130759 tas gn
## 151017 tas gn
##
## zstore dcpp_init_year \
## 86050 gs://cmip6/CMIP6/ScenarioMIP/CCCma/CanESM5/ssp... NaN
## 94452 gs://cmip6/CMIP6/ScenarioMIP/CCCma/CanESM5/ssp... NaN
## 110673 gs://cmip6/CMIP6/ScenarioMIP/CCCma/CanESM5/ssp... NaN
## 130759 gs://cmip6/CMIP6/ScenarioMIP/CCCma/CanESM5/ssp... NaN
## 151017 gs://cmip6/CMIP6/ScenarioMIP/CCCma/CanESM5/ssp... NaN
##
## version
## 86050 20190429
## 94452 20190429
## 110673 20190429
## 130759 20190429
## 151017 20190429
##
## [5 rows x 11 columns]
Let’s sort this data frame by the experiment. This will come in handy later.
Python
files2 = files2.sort_values(by=['experiment_id'])
Now, we are going to repeat for the five SSP experiments the same process that we followed for the historical experiment. Repetition calls for a function! Let’s define a function that accesses each file (i.e., each row) of a Google Cloud query data frame.
Python
def open_dataset(files, i):
    zstore = files.zstore.values[i]
    exp_id = files.experiment_id.values[i]
    ds = xr.open_zarr(gcs.get_mapper(zstore), consolidated=True)
    return ds, exp_id
We can use this function to append the SSP data onto the historical data frame that we have already created, using a quick for loop across the five SSP files.
This loop will process the data from each file in the same way that we processed the historical data, and will create one data frame containing all of our data.
Python
for i in range(len(files2)):
    tmp, exp = open_dataset(files2, i)

    # extract location and time period of interest
    tmp = tmp.sel(lat=43.67, lon=280.6, method='nearest')
    tmp = tmp.sel(time=slice("2015", "2100")) - 273.15  # convert to degC

    # note: some models will have time already as a datetime object. If so,
    # comment out the line below and uncomment the next one.
    dates = tmp.indexes['time'].to_datetimeindex(unsafe=True)  # convert cftime to datetime
    # dates = tmp.indexes['time']

    # create data frame
    df2 = pd.DataFrame({'Date': dates, 'MeanTemp': tmp['tas'].values, 'Experiment': exp})
    df2['Date'] = [dt.date(d.year, d.month, d.day) for d in df2.Date]

    # concatenate the new SSP data frame (df2) onto the original historical
    # data frame (df), so that df grows each time through the loop
    df = pd.concat([df, df2], ignore_index=True, sort=True)
The table that we have created above is a “long” data frame. A long data frame is what is referred to as “tidy data” in the R universe. Tidy data is characterized by a few simple principles: each column is a variable, each row is an observation, and each table contains a unique type of observational unit.
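To make those tidy-data principles concrete, here is a toy long-format frame and a grouped summary (a sketch with made-up values, not our CanESM5 output):

```python
import pandas as pd

# A toy long ("tidy") frame: one row per observation
long_df = pd.DataFrame({
    "Date": ["2020-01", "2020-01", "2020-02", "2020-02"],
    "MeanTemp": [-5.0, -3.0, -4.0, -1.0],
    "Experiment": ["ssp126", "ssp585", "ssp126", "ssp585"],
})

# Grouping and subsetting are one-liners on a long frame:
# the mean temperature for each experiment
means = long_df.groupby("Experiment").MeanTemp.mean()
print(means["ssp126"], means["ssp585"])  # -4.5 -2.0
```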
A long layout table is very useful for many operations such as grouping and subsetting, but can be a challenge to plot. For instance:
Python
df.plot()
Somewhere in Figure 8.2 is the CanESM5 Monthly \(T_\mathrm{mean}\) at Toronto (2015‒2100) using our five SSP scenarios, but it looks like we need some more sophisticated code to plot this! We can fix the above by filtering for each projection and adding each line manually.
Python
plt.plot(df.Date[df.Experiment == "ssp119"], df.MeanTemp[df.Experiment == "ssp119"], 'c-', alpha=0.7, label='SSP1-1.9')
plt.plot(df.Date[df.Experiment == "ssp126"], df.MeanTemp[df.Experiment == "ssp126"], 'b-', alpha=0.7, label='SSP1-2.6')
plt.plot(df.Date[df.Experiment == "ssp245"], df.MeanTemp[df.Experiment == "ssp245"], 'y-', alpha=0.7, label='SSP2-4.5')
plt.plot(df.Date[df.Experiment == "ssp370"], df.MeanTemp[df.Experiment == "ssp370"], 'r-', alpha=0.7, label='SSP3-7.0')
plt.plot(df.Date[df.Experiment == "ssp585"], df.MeanTemp[df.Experiment == "ssp585"], 'darkred', alpha=0.7, label='SSP5-8.5')
plt.legend(loc='lower right')
plt.ylim(-20, 40)
plt.title(r'CanESM5 Monthly $T_\mathrm{mean}$ at Toronto (2015‒2100)')
plt.xlabel("Year")
plt.ylabel("Mean Temperature (°C)")
## (-20.0, 40.0)
What if we want to include the “historical” data? In Figure 8.3 we manually plotted each scenario as a separate line on the same plot. The seaborn library can make this process a little easier. Using seaborn, we identify what makes each data series unique and how we want to differentiate between series. We will pass hue="Experiment" to our seaborn plot, which indicates that we want a different hue (i.e. colour) for each unique value in "Experiment".
Python
import seaborn as sns
sns.lineplot(data=df, x="Date", y="MeanTemp", hue="Experiment", palette=["k", "c", "b", "y", "r", "darkred"]).set(ylabel="Mean Temperature (°C)", title=r'CanESM5 Monthly $T_\mathrm{mean}$ at Toronto (1981‒2100)')
A long data frame is computationally efficient, but isn’t as intuitive to navigate by the human eye. As such, if we want to include a table in a report or publication, the wide layout is usually more print-friendly. You can create a wide layout table using df.pivot like this:
Python
df_wide = df.pivot(index='Date', columns='Experiment')['MeanTemp']
df_wide.head()
## Experiment historical ssp119 ssp126 ssp245 ssp370 ssp585
## Date
## 1981-01-16 -6.276367 NaN NaN NaN NaN NaN
## 1981-02-15 -11.902924 NaN NaN NaN NaN NaN
## 1981-03-16 -1.033234 NaN NaN NaN NaN NaN
## 1981-04-16 5.808594 NaN NaN NaN NaN NaN
## 1981-05-16 15.836456 NaN NaN NaN NaN NaN
Python
df_wide.tail()
## Experiment historical ssp119 ssp126 ssp245 ssp370 ssp585
## Date
## 2100-08-16 NaN 23.564697 26.388550 26.977051 31.046631 31.213440
## 2100-09-16 NaN 19.886261 20.563843 25.151825 26.710999 30.756989
## 2100-10-16 NaN 11.292908 14.421967 16.221588 17.903748 22.165222
## 2100-11-16 NaN 6.061523 5.788635 8.628082 10.470490 13.136444
## 2100-12-16 NaN -2.109161 -2.885834 3.456177 4.007507 5.829773
This is also easier to plot with the plot method built into pandas.
Python
df_wide.plot(style={'historical': 'k', 'ssp119': 'c', 'ssp126': 'b', 'ssp245': 'y', 'ssp370': 'r', 'ssp585': 'darkred'}, title=r'CanESM5 Monthly $T_\mathrm{mean}$ at Toronto (1981‒2100)')
plt.ylabel("Mean Temperature (°C)")
8.4 Change Factors
In our next lab, we are going to use change factors to describe changes in our baseline climate. The change factor method (sometimes called \(\Delta T\)) calculates the projected temperature by adding the observed, baseline temperature time series to the change in temperature projected by a GCM integration (typically using monthly GCM data). This process is intended to remove the mean bias in GCM historical and future simulated climate conditions, and represents a simplified form of downscaling. Change factors are described by the following equation:

\[T_{\mathrm{PROJECTED}} = T_{\mathrm{OBSERVED}} + \Delta T_{\mathrm{GCM}}\]
The first step for a change factor analysis is to calculate the change projected by a GCM over its baseline period. To do so, we will use 30-year “tridecades”, including our baseline (1981–2010), and three common climate change tridecades: the 2020s (2011–2040), the 2050s (2041–2070), and the 2080s (2071–2100). In this case, \(T_{\mathrm{OBSERVED}}\) is the observed 30-year average mean temperature over the baseline period, and \(\Delta T_{\mathrm{GCM}}\) is the difference between the model-projected 30-year average mean temperature for the baseline period and the model-projected 30-year average mean temperature for each of the tridecades. In your research, you may opt to use a seasonal or monthly average instead of the annual average presented here.
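As a worked numeric sketch of the method (all values made up for illustration): the GCM's own bias never enters the projection; only its projected change does.

```python
# Illustrative numbers only, not CanESM5 output
t_observed   = 8.8   # observed 30-year baseline mean, degC
gcm_baseline = 7.5   # GCM mean over the same baseline period, degC
gcm_future   = 10.0  # GCM mean over a future tridecade (e.g. the 2050s), degC

# The change factor is the change projected by the GCM...
delta_t_gcm = gcm_future - gcm_baseline   # 2.5 degC

# ...which is added to the observed baseline, cancelling the GCM's mean bias
t_projected = t_observed + delta_t_gcm
print(round(t_projected, 1))  # 11.3
```

Note that the GCM's 1.3 °C cold bias (7.5 °C vs. the observed 8.8 °C) has no effect on the projected 11.3 °C.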
Let’s begin by filtering our “long” data frame to our time periods of interest. I will only provide examples for the baseline and the 2020s, but you should complete this work for the 2050s and 2080s as well.
Python
df_base = df[(df.Date >= dt.date(1981, 1, 1)) & (df.Date <= dt.date(2010, 12, 31))]
df_2020s = df[(df.Date >= dt.date(2011, 1, 1)) & (df.Date <= dt.date(2040, 12, 31))]
Next, we compute the 30-year average for our baseline. Recall that the 2020s period will contain a few years from the historical experiment: the years 2011–2014.
Python
bsln = df_base[df_base.Experiment == "historical"].MeanTemp.mean()
Now we can calculate the change factor (or anomaly) for each of our tridecadal periods.
Python
cf2020s_ssp119 = df_2020s[df_2020s.Experiment.isin(["historical", "ssp119"])].MeanTemp.mean() - bsln
cf2020s_ssp126 = df_2020s[df_2020s.Experiment.isin(["historical", "ssp126"])].MeanTemp.mean() - bsln
cf2020s_ssp245 = df_2020s[df_2020s.Experiment.isin(["historical", "ssp245"])].MeanTemp.mean() - bsln
cf2020s_ssp370 = df_2020s[df_2020s.Experiment.isin(["historical", "ssp370"])].MeanTemp.mean() - bsln
cf2020s_ssp585 = df_2020s[df_2020s.Experiment.isin(["historical", "ssp585"])].MeanTemp.mean() - bsln
We can summarize these data in a table like so:
Python
pd.DataFrame({"Model": np.repeat("CanESM5", 5),
              "Ensemble": np.repeat("r1i1p1f1", 5),
              "Scenario": ["SSP1-1.9", "SSP1-2.6", "SSP2-4.5", "SSP3-7.0", "SSP5-8.5"],
              "Baseline (°C)": np.repeat([bsln], 5),
              "2020s": [cf2020s_ssp119, cf2020s_ssp126, cf2020s_ssp245, cf2020s_ssp370, cf2020s_ssp585],
              "2050s": np.repeat([''], 5),
              "2080s": np.repeat([''], 5)})
## Model Ensemble Scenario Baseline (°C) 2020s 2050s 2080s
## 0 CanESM5 r1i1p1f1 SSP1-1.9 8.782319 1.941514
## 1 CanESM5 r1i1p1f1 SSP1-2.6 8.782319 1.865280
## 2 CanESM5 r1i1p1f1 SSP2-4.5 8.782319 2.065468
## 3 CanESM5 r1i1p1f1 SSP3-7.0 8.782319 2.133127
## 4 CanESM5 r1i1p1f1 SSP5-8.5 8.782319 2.131548
8.5 Exercises (what to submit)
- Complete a table similar to the one above for a different model (This can be the model that you used in Lab 2 or you can see a list of model names to choose from here). [2 marks]
- The table, below, provides the baseline mean temperature (°C) and 2020s change factor for SSP1-2.6 and SSP2-4.5 for all five available ensemble members of CanESM5. In some cases, the change factor for the SSP1-2.6 scenario is higher than the SSP2-4.5 scenario. What are some possible explanations for this? Hint: Consider the sources of uncertainty in climate model projections and how these relate to two key terms, “pathway” and “ensemble member”. [2 marks]
| Ensemble | Scenario | Baseline | 2011-2040 |
|---|---|---|---|
| r1i1p1f1 | SSP1-2.6 | 8.8 | 1.9 |
| | SSP2-4.5 | 8.8 | 2.1 |
| r2i1p1f1 | SSP1-2.6 | 8.6 | 2.0 |
| | SSP2-4.5 | 8.6 | 1.8 |
| r3i1p1f1 | SSP1-2.6 | 8.8 | 1.9 |
| | SSP2-4.5 | 8.8 | 1.9 |
| r4i1p1f1 | SSP1-2.6 | 8.8 | 2.1 |
| | SSP2-4.5 | 8.8 | 1.9 |
| r5i1p1f1 | SSP1-2.6 | 8.7 | 1.8 |
| | SSP2-4.5 | 8.7 | 2.0 |
- Download the following .csv file. This file contains CMIP6 data for Toronto for all of the models that are available for ensemble r1i1p1f1 for annual mean temperature, tas. Open the file with pandas and take a look at the column names. Based on the lab exercise, describe what the column names correspond to. [2 marks]
- Compare the output of the models in the .csv file. Are the model projections (the change factors) largely similar, or different? What might explain some of these similarities or differences (consider similarities/differences across scenarios, across time periods, and across models)? [2 marks]
- Provide a brief summary of the output in the .csv file. Specifically, which model(s) simulates the warmest and coldest baseline? Which model projects the largest change in temperature (overall and for each scenario and tridecadal period)? Which projects the smallest? What is the multi-model mean baseline and change factor for each tridecadal period and scenario? Compile your data in a table(s). Think about how best to present this data in tabular form (model names and the corresponding values). If you are more comfortable doing this analysis with Excel, you are welcome to, but I encourage you to attempt some analysis using Python; Lab 4 (Sec. 9 and 10) offers some guidance! [4 marks]