Are These Arrays Equal?
QUESTION: I have data in two floating point arrays. I would like to know if the arrays are “equal ”or not. By equal, I mean they have the same number of elements and that matching elements in the two arrays are equal. Since these are floating point arrays, I notice that I do not get good results when I use the EQ operator to compare them. I presume this is because of different round-off errors in the calculations that fill the two arrays. Can you suggest a method to compare these two arrays and establish equality?
ANSWER: In general, the method is to subtract the two arrays from one another, take the absolute value of the difference, and compare that to a number that is “sufficiently close” to zero. The only trick, if there is one, is how to choose a number that is “sufficiently close” to zero. Here is how you might write a function to do the comparison for you.
FUNCTION FLOATS_EQUAL, array_1, array_2 ; Error handling. Return to caller on error. ON_ERROR, 2 IF N_Params() NE 2 THEN Message, 'Must pass two arrays to compare.' ; Arrays not equal if they are not the same length. IF N_Elements(array_1) NE N_Elements(array_2) THEN RETURN, 0 ; Choose a number "sufficiently close" to zero for comparison. NUMBER = 1e-6 ; Compare the arrays. IF Total(Abs(array_1 - array_2) LE NUMBER) EQ N_Elements(array_1) THEN $ RETURN, 1 ELSE RETURN, 0 END
There was an interesting discussion in the IDL newsgroup (see reference below) on how NUMBER should be chosen. Mati Meron also had some relevant comments on this issue in this discussion.
Basically, the arguments boil down to two mutually exclusive choices that both appear to be correct. So, you can choose. :-)
One argument is that NUMBER should be chosen empirically. You know what your data is about. You know the magnitude of the expected values in the arrays. You know what it is you are trying to accomplish with your data analysis. You ought to be able to choose a number that is sufficiently close to zero for your purposes. That is the kind of number I chose in the example above.
The other argument is that the difference in floating point values is related to how good your computer is at storing floats internally, and this can be discovered by using the MACHAR function. So NUMBER should be calculated something like this.
epsilon = (MACHAR()).eps NUMBER = epsilon
In practice, this generally works, but it depends on the magnitude of your values. For “normal magnitude” values (i.e., in the range of counting numbers) it works great. Unfortunately, not all values are in this range. In fact, science numbers used in research can vary over many, many magnitudes. Thus, it is thought that a better solution is to use the machine precision, but scaled by the size the numbers you are comparing.
Mati Meron explains the situation like this.
The floating number is stored as two parts, mantissa and power (for the garden variety float you have 24 bits for the mantissa and 8 for the power). The mantissa specifies the significant digits, which are then multiplied by the appropriate power. The storage is binary, of course, but for the purpose of this argument we may look at decimal. So, if you store, say, 7 significant digits, your number is of the form 0.abcdefg * 10^p, where a...f are digits between 0 and 9. If you take two numbers such that their true (as opposed to stored) expansion has same first 7 significant digits while differing at the 8th, they'll be stored as same number. [For example, 1234567e28 and 1234567e-28.] So, roughly, one can say that the accuracy of the stored number is 0.00000005 *10^p (note, 7 zeroes for the significant digits, then half the maximum for the next). So, the storage error, for fixed number of decimal places, is relative, not absolute, it is around 0.00000005/0.abcdefg. As the magnitude of the number grows, so does the error.
This implies that NUMBER should be chosen like this.
epsilon = (MACHAR()).eps NUMBER = (Abs(array_1) > Abs(array_2)) * epsilon
If you are comparing double precision values, you will want to make sure you are getting the machine precision in double precision, too. So you will need to set the DOUBLE keyword on MACHAR in this case.
epsilon = (MACHAR(/DOUBLE)).eps NUMBER = (Abs(array_1) > Abs(array_2)) * epsilon
Finally, there is often some kind of a “scaling factor” added to the NUMBER. In FORTRAN code this is often called the Unit in the Last Place, or ULP. It is the gap or difference between two floating point numbers in the last significant digit. This number should be a positive integer. I'll add this value as a ULP keyword to the program, and give it a default value of 1. If you are comparing double precision values, make sure the ULP value gets passed in as a double, too.
The final code looks like this. You can find the final program here.
FUNCTION FLOATS_EQUAL, array_1, array_2, ULP=ulp ; Error handling. Return to caller on error. ON_ERROR, 2 IF N_Params() NE 2 THEN Message, 'Must pass two arrays to compare.' ; Are we comparing double precision values? IF Size(array_1, /TNAME) EQ 'DOUBLE' OR Size(array_2, /TNAME) EQ 'DOUBLE' THEN $ double = 1 ELSE double = 0 ; Check keyword. IF N_Elements(ulp) EQ 0 THEN ulp = 1.0D ELSE ulp = ROUND(ABS(ulp)) ; Arrays not equal if they are not the same length. IF N_Elements(array_1) NE N_Elements(array_2) THEN RETURN, 0 ; Choose a number "sufficiently close" to zero for comparison. epsilon = (MACHAR(DOUBLE=double)).eps NUMBER = (Abs(array_1) > Abs(array_2)) * epsilon * ulp ; Compare the arrays. IF Total(Abs(array_1 - array_2) LE NUMBER) EQ N_Elements(array_1) THEN $ RETURN, 1 ELSE RETURN, 0 END
If you are interested in following an IDL newsgroup discussion of this topic, you can find it here.
Copyright © 2007 David W. Fanning
Last Updated 29 August 2007