EOF Analysis in IDL

QUESTION: Someone told me I should analyze my data in IDL using Principal Component (PC) analysis. Someone else said, no, use Empirical Orthogonal Function (EOF) analysis. I don't know anything about either one of them. Can you help?

ANSWER: Well, I can try to help, but I strongly recommend you get a copy of Statistical Methods in the Atmospheric Sciences, Second Edition, by Daniel S. Wilks. It may be the most practical statistical book I've ever read. I couldn't do my work without it.

First of all, PC analysis and EOF analysis are one and the same. Edward Lorenz, a Boulder resident who died recently, coined the name Empirical Orthogonal Function analysis when he developed this multivariate statistical technique in the 1950s. (This is the same Lorenz who deeply investigated chaos theory and is famous for illuminating the “butterfly effect.” The text of a talk he gave on the subject, “Predictability: Does the Flap of a Butterfly's Wings in Brazil Set off a Tornado in Texas?” is available for your reading pleasure.) The purpose of EOF analysis is to “reduce a data set containing a large number of variables to a data set containing fewer (hopefully many fewer) new variables.”

In practical terms, we often want to investigate the variability of something like temperature in the Arctic over time. We typically have many thousands of measurements, k, over some finite number of times, n, and k is much greater than n. In EOF analysis, we are trying to reduce this many-dimensional vector space to a small set of orthogonal vectors that can be said to “explain” the variance in the data. It is not unusual in atmospheric science data, which is often highly correlated (temperatures in regions of the Earth near each other do not vary greatly from one another), to find that a handful of vectors, typically five or six of what we call modes, can explain more than 95% of the data variability. Of course, just because we can “explain” the data mathematically doesn't mean we can explain the underlying physical processes that create the variability in the first place, but that's a science problem, not a programming problem, and I leave that part to you. Many papers, some of them convincing, are written on this topic.

Obtaining the Data to Analyze

To give us something practical to work on, I have decided to analyze NCEP/NCAR temperature data over the Arctic for the month of September. (Ice concentration minimums are in September, so the idea is that we might understand more about why the Arctic ice is melting if we can see patterns in Arctic temperature variability.) We will begin our analysis in 1987 when ice concentration measurements from satellites became reliable.

The NCEP/NCAR reanalysis data is readily available. The monthly air temperature data set I am interested in is named air.mon.mean.nc and is stored in netCDF format. The data contains monthly mean temperatures from 1948 to the present, at 17 different pressure levels (I will be using the 925 mbar level in this example). The data is gridded on a global 2.5 degree latitude/longitude grid (144x73). Longitudes range from 0 to 357.5 degrees and latitudes range from -90 to 90 degrees. The complete data set is 144x73x17x722. The last dimension is the number of months since January 1948; the time coordinate itself is given in units of hours since 01-01-0001. (Don't ask me.) According to the documentation, the air temperature must be prepared for use by multiplying by 0.01 and adding an offset of 127.65.

Here is the code I used to create the data I am going to use in the analysis. To read the data, I used the NCDF_DATA object from the Coyote Library. Running the code results in a data set, named air_temp, containing air temperatures in a 144x18x21 array. In other words, air temperatures measured at 144 by 18 locations (2592 locations), for 21 years. It also results in corresponding latitude and longitude arrays, named lat_ncep and lon_ncep, but these are only used to plot the data from the analysis; they are not used in the analysis itself.
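If you don't have the Coyote Library handy, something along the following lines, using only IDL's built-in netCDF routines, should produce a similar air_temp array. This is a rough sketch of the preparation step, not the program used for this article: the netCDF variable names ('air', 'lat', 'lon'), the position of the 925 mbar level, the latitude ordering, and the January 1948 start of the record are my assumptions about the file, so check them against the file's metadata before relying on it.

   ; A rough sketch of the data preparation, NOT the article's program.
   ; It assumes the packed temperatures live in a variable named 'air'
   ; with 'scale_factor' and 'add_offset' attributes, that 925 mbar is
   ; the second pressure level, that latitudes run from 90 southward,
   ; and that the record begins in January 1948.
   fileID = NCDF_Open('air.mon.mean.nc')
   airID = NCDF_VarID(fileID, 'air')
   NCDF_VarGet, fileID, airID, packed                        ; [144, 73, 17, ntime]
   NCDF_VarGet, fileID, NCDF_VarID(fileID, 'lat'), lats      ; [73]
   NCDF_VarGet, fileID, NCDF_VarID(fileID, 'lon'), lons      ; [144]
   NCDF_AttGet, fileID, airID, 'scale_factor', scale
   NCDF_AttGet, fileID, airID, 'add_offset', offset
   NCDF_Close, fileID

   ; Unpack to physical units. The article quotes a scale of 0.01 and an
   ; offset of 127.65; reading the attributes covers either case.
   airK = packed * scale[0] + offset[0]

   ; Keep the 925 mbar level (assumed index 1) and the 18 northernmost
   ; latitude rows (47.5 to 90 degrees, assuming 90-to-south ordering).
   airArctic = Reform(airK[*, 0:17, 1, *])                   ; [144, 18, ntime]

   ; Pull out September (month index 8) for 1987 through 2007, assuming
   ; the first time slice is January 1948.
   septIndices = (1987 - 1948)*12 + 8 + Indgen(21)*12
   air_temp = airArctic[*, *, septIndices]                   ; [144, 18, 21]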

EOF Analysis

The first step in the EOF analysis is to multiply the air temperature values by a weighting factor, if one is required. In this case it is, because each grid cell in the data array is the same size, but each grid cell on the Earth is a different size, depending on its latitude. It is common, then, to multiply the data by a “cosine of latitude” weighting factor. I am actually going to use the square root of the cosine of latitude, because I think this is more accurate.

Note that my latitude values go from 47.5 to 90.0 degrees of latitude, in 2.5 degree increments. I am going to subtract 1.25 degrees from each of these latitude values, in order to locate the latitudes in the center of the grid, and to avoid multiplying by zero for all the latitudes at 90 degrees.

   dims = Size(air_temp, /Dimensions)
   nlon = dims[0] & nlat = dims[1] & ntime = dims[2]
   lon = Reform(lon_ncep[*,0]) & dlon = Abs(lon[1]-lon[0])          ; longitude vector and spacing
   lat = Reform(lat_ncep[0,*]) & dlat = Abs(lat[1]-lat[0])          ; latitude vector and spacing
   weights = Sqrt(Cos((lat_ncep - dlat/2.) * !DtoR))                ; sqrt(cos(latitude)) area weighting
   FOR j=0,ntime-1 DO air_temp[*,*,j] = air_temp[*,*,j] * weights   ; weight each year's grid

The next step is to reorganize the data into a two-dimensional array of spatial measurements versus time (k versus n, with k >> n). This is easily done in IDL, and results in a 2592 by 21 array.

   air_temp = Reform(air_temp, nlon*nlat, ntime, /OVERWRITE) 

Normally, the means are subtracted from each spatial point, over time. That is, for each point in space, we calculate the mean value of the 21 time measurements for that point, and subtract that mean from each spatial value. This gives us the data anomalies, or differences. Sometimes the anomalies are divided by the standard deviation of each measurement as well, but since I am going to be doing a covariance analysis, rather than a correlation analysis, this is not necessary here.

   air_anomalies = air_temp
   FOR j=0,nlon*nlat-1 DO air_anomalies[j,*] = air_temp[j,*] - Mean(air_temp[j,*])   ; remove the time mean at each point

Creating the Covariance Matrix: A Trick for Fast Analysis

Since we are concerned with spatial patterns of variance in this analysis, we wish to find out how the spatial patterns vary over time. Thus, the covariance matrix you would normally want to build would be a 2592x2592 matrix. This is extremely large and difficult to deal with on many computers. If I perform this analysis on a 32-bit computer with 2 GBytes of RAM, it typically takes something in the neighborhood of 25 minutes to run. Running it on the complete NCEP temperature array could take hours or days.

However, Wilks points out, on page 499 of Statistical Methods, that since k is greater than n, the covariance matrix is singular, which implies that all the eigenvalues after the nth eigenvalue will be approximately zero. Given this, it is pretty much pointless to calculate these additional eigenvalues and their associated eigenvectors, since in practical applications, they will be ignored anyway.

He goes on to point out that rather than creating a k-by-k covariance matrix, the spatial eigenvectors (in other words, the spatial patterns we are looking for) can be recreated by using an n-by-n covariance matrix and reconstructing the n spatial eigenvectors from the n time domain eigenvectors produced by the smaller matrix. This is because the n eigenvalues produced from using the smaller matrix are identical to the n non-zero eigenvalues produced from using the larger matrix. An analysis using this computational trick produces the same results and runs in less than half a second!
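In case it helps to see why the trick works, here is the argument written out (my paraphrase of Wilks, not a quotation). Let X be the k-by-n matrix of anomalies, with one spatial point per row and one time per column. Then the full and reduced covariance matrices are

   S_k = \frac{1}{n-1}\, X X^{\mathsf T} \quad (k \times k),
   \qquad
   S_n = \frac{1}{n-1}\, X^{\mathsf T} X \quad (n \times n).

If u is an eigenvector of the small matrix, so that S_n u = \lambda u, then

   S_k (X u) = \frac{1}{n-1}\, X X^{\mathsf T} X u = X (S_n u) = \lambda\, (X u),

so X u, normalized to unit length, is an eigenvector of the full k-by-k matrix with exactly the same eigenvalue. That is all the eigenvector reconstruction loop farther down does.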

Naturally, we are going to use this trick, since you can't believe how much it speeds up program development and debugging! The covariance matrix is calculated like this.

   matrix = (1.0/(ntime-1)) * (Double(air_anomalies) ## Transpose(air_anomalies))

Just a couple of important points about this. I've tried using the Correlate function to build the covariance matrix, and I find it gives me inaccurate results. My advice to you is to always build the matrix yourself, using code similar to the code above. Note, too, that you must divide the matrix by 1 less than the number of time points you have in the analysis. And keep in mind that the ## operator in IDL multiplies the rows of the first argument by the columns of the second argument, which is what the normal matrix multiplication operator does in linear algebra. You should always use the ## operator, rather than the # operator, in these kinds of equations if you wish to be able to read the damn textbook at the same time you are coding in IDL. (It seems to me half my time is spent just trying to wrap my head around the differences between the (row,col) matrices in the book and IDL's (col,row) arrays in my code! Understanding exactly what the ## operator does helps a lot.)
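If a concrete example helps, here is a small throwaway demonstration (mine, not part of the analysis) of the difference between the two operators. In textbook terms, A below is a 2x3 matrix and B is a 3x2 matrix.

   ; A small demonstration, not part of the analysis.
   A = [[1, 2, 3], $       ; IDL dimensions [3, 2]: 3 columns, 2 rows
        [4, 5, 6]]
   B = [[1, 2], $          ; IDL dimensions [2, 3]: 2 columns, 3 rows
        [3, 4], $
        [5, 6]]
   Print, A ## B           ; the textbook product A B: rows [22, 28] and [49, 64]
   Print, A # B            ; columns times rows, i.e. the product B A: a 3x3 result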

Calculating the Eigenvectors and Eigenvalues

The next step in the analysis is to calculate the eigenvalues and eigenvectors of the covariance matrix. There are a number of ways to do this in IDL. In this example, I am going to use LA_SVD, which uses the mathematical technique called singular value decomposition. This technique decomposes the covariance matrix into a left matrix of eigenvectors, a vector of singular values (which, because the covariance matrix is symmetric, are the eigenvalues), and a right matrix.

    LA_SVD, matrix, W, U, V 

The variables W, U, and V are the singular values (as a vector), the left-hand matrix, and the right-hand matrix, respectively. These are all output variables and are returned as a result of the decomposition.

The eigenvectors (of the time domain space) are in the columns of U. These will have to be transformed into the eigenvectors of the spatial domain (in the variable eof, below) by the Wilks trick referred to above. In other words, the n-length time domain eigenvectors are converted into the k-length spatial domain eigenvectors. The code looks like this:

   dims = Size(air_anomalies, /Dimensions)
   eof = FltArr(dims[1], dims[0])                    ; one spatial eigenvector per column
   FOR j=0,dims[1]-1 DO BEGIN
      t = Transpose(air_anomalies) ## U[j,*]         ; project the jth time-domain eigenvector onto the anomalies
      eof[j,*] = t / Sqrt(Total(t^2))                ; normalize to unit length
   ENDFOR

The eigenvectors simply represent the pattern of variance in the spatial measurements. The units are completely arbitrary. The only important characteristic of the values is whether they are positive or negative. To interpret the patterns, we have to look to what are sometimes called the principal components, which are the anomaly data projected onto, or transformed by, or rotated by (we hear all three of these terms) the eigenvectors. The principal component attached to the corresponding eigenvector spatial pattern provides the sign and overall amplitude of the pattern as a function of time. To calculate the principal component matrix, we execute these commands. The principal component vectors will be the columns of the pc array, and will be of length n.

   pc = FltArr(dims[1], dims[1])                               ; one principal component per column
   FOR j=0,dims[1]-1 DO pc[j,*] = air_anomalies ## eof[j,*]    ; project the anomalies onto each spatial eigenvector

We sometimes refer to the columns of the eof and pc arrays as the modes. In other words, the first mode is the first column or spatial pattern in the eof array, combined with the first column or PC in the pc array. The first value in the singular value vector corresponds to the first mode, and, as a percentage of the total values, tells you how much the first mode vector “explains” the data variance. We sometimes compute and display this as a percent variance.

    percent_variance = W / Total(W) * 100.0 

In our case, the first element of the percent_variance vector is about 26, indicating that the first mode explains, or predicts, about 26% of the variance in the data. The second mode explains about 15% of the variance, and so forth.
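If you want to connect this back to the claim at the top that a handful of modes usually accounts for most of the variance, it is easy to add up the leading terms. This snippet is just a quick check of my own; no particular value is promised.

   ; Cumulative variance explained by the first five modes.
   Print, Total(percent_variance[0:4]), $
      FORMAT='("First five modes explain ", F6.2, " percent of the variance")'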

Plotting the EOF Analysis

To see how to plot the results of the EOF analysis, let's consider the first mode. We first obtain the first eigenvector (in the first column of the eof variable) and reform it into its spatial dimensions. A number of programs from the Coyote Library are used in this step (e.g., cgScaleVector, cgColor, cgImage, cgLoadCT, and cgSymcat, among others).

   mode = 1
   theEOF = eof[mode-1,*]
   theEOF = Reform(theEOF, nlon, nlat, /OVERWRITE) 

Next, we open a window and set up the colors for display.

   cgDisplay, 800, 500, /FREE, TITLE='Analysis of NCEP Sept Temperature at 925 mBars'
   cgLoadCT, 3, NCOLORS=128, CLIP=[55, 255]
   cgLoadCT, 1, NCOLORS=127, CLIP=[55, 255], BOTTOM=128, /REVERSE
   TVLCT, cgColor('white', /TRIPLE), 255 

The next step is to create a map projection, here an equal area polar projection, and warp the first mode pattern as an image into the map projection. Note the use of the latitude and longitude arrays in the Map_Patch command. Care is taken to scale the data correctly so that negative values are scaled into negative colors, and positive values are scaled into positive colors. Finally, a color bar is added to the plot to give the user some idea of the color scaling. Remember, the actual values are completely arbitrary.

   cgMAP_SET, /LAMBERT, 90, 0, POSITION=[0.025, 0.0, 0.525, 0.8], $
      LIMIT=[45, -180, 90, 180], /NOERASE, /NOBORDER
   value = Max([Abs(Min(theEOF)), Abs(Max(theEOF))] )
   negvalues = Where(theEOF LT 0.0, COMPLEMENT=posvalues)  
   negvector = cgScaleVector(findgen(128), -value, 0)
   posvector = cgScaleVector(findgen(127), 0, value)
   negscaled = Value_Locate(negvector, theEOF[negvalues])
   posscaled = Value_Locate(posvector, theEOF[posvalues])
   scaled = theEOF
   scaled[negvalues] = negscaled
   scaled[posvalues] = posscaled + 128.0
   warp = Map_Patch(scaled, lon_ncep, lat_ncep, /TRIANGULATE, MISSING=255)
   cgImage, warp, POSITION=[0.025, 0.0, 0.525, 0.8]
   cgMap_Continents, COLOR='charcoal'
   cgLoadCT, 3, NCOLORS=128, CLIP=[55, 255]
   cgLoadCT, 1, NCOLORS=127, CLIP=[55, 255], BOTTOM=128, /REVERSE
   cgColorbar, RANGE=[-value, value], DIVISIONS=2, FORMAT='(F8.2)', $
      POSITION=[0.085, 0.875, 0.425,  0.925], ANNOTATECOLOR='navy', XMINOR=0, $
      TITLE='Mode 1 (Sep)', FONT=0, NCOLORS=255 

Next, we plot the Principal Component of mode 1.

    time = 21
   years = Indgen(time) + 1987
   cgPlot, years, pc[mode-1,*], /NoErase, COLOR='navy', /NoDATA, $
      TITLE='PC of Mode ' + Strtrim(mode,2) + ' for Sept', FONT=0, $
      POSITION=[0.65, 0.55, 0.95, 0.85], XTITLE='Years', XSTYLE=1
   cgOPlot, years, pc[mode-1,*], COLOR='indian red', THICK=2
   cgOPlot, years, pc[mode-1,*], COLOR='dodger blue', PSYM=16, SYMSIZE=0.75
   cgOPLOT, [!X.CRange[0], !X.CRange[1]], [0, 0], LINESTYLE=2, COLOR='charcoal'

And, finally, we plot the percent variance.

   cgPlot, Indgen(time)+1, percent_variance, /NoErase, COLOR='navy', /NoDATA, $
      TITLE='Percent Variance', FONT=0, XTITLE='Modes', $
      POSITION=[0.65, 0.15, 0.95, 0.38]
   cgOPlot, Indgen(time)+1, percent_variance, COLOR='indian red', THICK=2
   cgOPlot, Indgen(time)+1, percent_variance, COLOR='dodger blue', PSYM=16, SYMSIZE=0.75
   cgPlotS, (Indgen(time)+1)[mode-1], percent_variance[mode-1], COLOR='orange', $
      PSYM=16, SYMSIZE=1.25 

You can see the results in the figure below. Here is the code I used to run the program.

The EOF Analysis
The EOF pattern of the first mode is shown on a map projection on the left, and the Principal Component and Percent Variance of the first mode are shown on the right.

Acknowledgement

My deep appreciation to James McCreight and Mary Jo Brodzik of the National Snow and Ice Data Center and to Balaji Rajagopalan of the University of Colorado for their help in understanding the details of EOF Analysis. Thanks, too, to Tao Wang, who provided me with robust code to use as an example program.

Version of IDL used to prepare this article: IDL 7.0.1.
