Averaging Time Series Components

QUESTION: I have been trying to average components of a time series when they meet certain criteria. For example, I have values in which I use a Where statement to specify the critera:

   criteria = Where(time LT 1000)

I get an array returned which looks like this:

   data = [0,1,2,3,4,6,8,9,10]

Now, given this array, I would like to specify individual arrays for any “block” of subsequent numbers with size greater than three. In this example, the result would look like this:

   b1 = [0,1,2,3,4]
   b2 = [8,9,10]

I tried using the complement keyword in the “Where” statement to put the null values into an array and try to subscript my way through the answer. But that turned out to be quite messy, especially since I am using multiple files which all have a different “patterns of three or more”.

I've messed around with FOR loops and IF statements, but again I have to change those for each individual file. It would be nice to know of a technique that could find multiple sequences of consecutive integer series in an array and make individual arrays out of those sequences. This seems like a simple problem, but I haven't been able to figure it out. Can you help?

ANSWER: As usual, our guide for this journey is JD Smith.

The solution to this problem is actually not as simple as it seems. Here is a method that uses LABEL_REGION and HISTOGRAM for the solution:

   data = [0,1,2,3,4,6,8,9,10]
   labels = Label_Region([0L, (Shift(data,-1) - data) EQ 1, 0L])
   h = Histogram(labels[1:N_Elements(labels)-2], MIN=1, REVERSE_INDICES=ri)
   wh = Where(h GE 3-1, count)
   p = PtrArr(count)
   FOR j=0,count-1 DO p[j] = Ptr_New(data[ri[ri[wh[j]]]:ri[ri[wh[j]+1]-1]+1])

To see the answer:

   IDL> FOR j=0,count-1 DO Print, *p[j]
       0       1       2       3       4
       8       9      10

There are a couple of tricky bits in this solution that are worth pointing out. First, note that we have to bracket the LABEL_REGION array with zeros. This is because LABEL_REGION has the annoying habit of labelling the end points as “no region”. If we fail to do this, there is a good chance we will not have the correct number of points in the end sequences. For example, compare this:

   IDL> Print, Label_Region( (Shift(data,-1) - data) EQ 1 )
       0       1       1       1       0       0       2       2       0

To this:

   IDL> Print, Label_Region( [0, (Shift(data,-1) - data), 0] EQ 1 )
       0       1       1       1       1       0       0       2       2       0       0

And be very careful of the usage of REVERSE_INDICES here. Instead of the normal semantic:


which is like data[index_vector], I instead use an explicit range:

which is like data[low:high].

Stare at this until you see the difference. Here I put the two descriptions together so you can see the difference more easily.


I did this because I knew that the elements in the labeled regions had consecutive indices, and because we are always missing one member of the “consecutive run” (due to the SHIFT call). The latter is also the reason for the final +1. In general the indices in a given HISTOGRAM bucket are not adjacent in the original array, so while either of these forms would be correct, the second is more convenient for our purposes.

Another way to say this is that it's just the difference between data[[1,2,3,4,5]] and data[1:5]. No big deal. If you want to add “6”, it's easier to use the second form than the first form.

Be sure to free up your pointers when you are finished with them.

   IDL> Ptr_Free, p

Web Coyote's Guide to IDL Programming