Calculating the Statistical Mode

QUESTION: I have an integer array and I would like to calculate the statistical MODE of the array. How do I do this in IDL?

ANSWER: I'm trying to remember all the way back to grade school, I think, but isn't the statistical mode the value having the largest frequency distribution? I remember that the mode isn't unique, because two values could be represented with the same frequency in the sample (bimodal, I guess), and it is theoretically possible to have no mode at all (all distribution frequencies in the sample are the same).

But saying all that, if you have an integer array, I think I would calculate the mode of the array like this:

   IDL> array = [1, 1, 2 , 4, 1, 3, 3, 2, 4, 5, 3, 2, 2, 1, 2, 6, -3]
   IDL> distfreq = Histogram(array, MIN=Min(array))
   IDL> maxfreq= Max(distfreq)
   IDL> mode = Where(distfreq EQ maxfreq) + Min(array)
   IDL> Print, mode
        2

As usual, going for readability is not always the “IDL way.” JD Smith offers this correction.

Just a hint on your HISTOGRAM usage. This might be slightly preferred, since it skips the WHERE and MIN:

   IDL> array = [1, 1, 2 , 4, 1, 3, 3, 2, 4, 5, 3, 2, 2, 1, 2, 6]
   IDL> void = Max(Histogram(array,OMIN=mn), mxpos)
   IDL> mode = mn + mxpos

This method, of course, will be very problematic if you have, for example:

   array = [1, 1, 2 , 4, 1, 3, 3, 2, 4, 5, 3, 2, 2, 1, 2, 6, 10000000]

Another option, if you worry about this, would be to re-cast the solution as a sorting problem, using the method discussed in a recent IDL newsgroup thread:

   array = array[Sort(array)] 
   wh = where(array ne Shift(array,-1), cnt)
   if cnt eq 0 then mode = array[0] else begin
      void = Max(wh-[-1,wh], mxpos)
      mode = array[wh[mxpos]]
   endelse 
   Print, mode   

Both methods will give you the lowest number in the case of ties for the mode. The second will be slower, but more robust against large dynamic range in your array. You could use both, deciding which to use by the min/max of the array.

If I were going to turn this simple idea of mine into a (readable) function, I would probably add some warnings and error messages.

FUNCTION Mode, array

   ; Calculates the MODE (value with the maximum frequency distribution)
   ; of an array. Works ONLY with integer data.

   On_Error, 2
   
   ; Check for arguments.
   IF N_Elements(array) EQ 0 THEN Message, 'Must pass an array argument.'
   
   ; Is the data an integer type? If not, exit.
   dataType = Size(array, /Type)
   IF ((dataType GT 3) AND (dataType LT 12)) THEN $
	Message, 'Data is not INTEGER type.'
   
   ; Calculate the distribution frequency
   distfreq = Histogram(array, MIN=Min(array))
   
   ; Find the maximum of the frequency distribution.
   maxfreq = Max(distfreq)
   
   ; Find the mode.
   mode = Where(distfreq EQ maxfreq, count) + Min(array)
  
   ; Warn the user if the mode is not singular.
   IF count NE 1 THEN ok = Dialog_Message('The MODE is not singular.')
   
   RETURN, mode
   
END

Several years after this article was written, Allan Whiteford had this to add.

I think all the discussions on finding the mode probably pre-dated the ++ operator. It could be that using the vectorised ++ operator is a better way to do it. (Although I doubt it. Normally if histogram can do something then histogram will be the best way!)

It would make this example into something like this:

   array = [1, 1, 2 , 4, 1, 3, 3, 2, 4, 5, 3, 2, 2, 1, 2, 6,-3]
   f = IntArr(Max(array) - Min(array)+1)
   f[array - Min(array)]++
   void = Max(f,idx)
   mode = idx + Min(array)
   Print, mode

Again, I have no idea whether this would be more efficient. If you're doing analysis on measurements (typically non-integers) then you'd need to invoke histogram anyway to bin them before trying to find the mode.

Google
 
Web Coyote's Guide to IDL Programming