|
Re: What? You can't histogram a string array? [message #51544 is a reply to message #51543] |
Tue, 28 November 2006 10:43   |
David Fanning
Messages: 11724 Registered: August 2001
|
Senior Member |
|
|
JD Smith writes:
> This is not good, and much worse than a minor nitpick. The
> IND_INT_SORT algorithm relies on SORT doing the right thing. That is,
> for two identical elements in the concatenated vector [a,b], SORT
> should place the first one first, i.e. the matching elements from 'a'
> will show up before those from 'b'. That's the only reason it
> works. There was always the concern that IDL's SORT would change and
> this would no longer be the case (the element from b would come
> first), in which case the algorithm would be broken.
In running the test program, I get immediate out-of-bounds
errors with IDL's SORT routine. But nothing of the sort
(a pun!) with the NASA BSORT routine I always use when I
need to sort something "for real".
Running on Windows XP.
Cheers,
David
--
David Fanning, Ph.D.
Fanning Software Consulting, Inc.
Coyote's Guide to IDL Programming: http://www.dfanning.com/
Sepore ma de ni thui. ("Perhaps thou speakest truth.")
|
|
|
Re: What? You can't histogram a string array? [message #51545 is a reply to message #51544] |
Tue, 28 November 2006 10:12   |
JD Smith
Messages: 850 Registered: December 1999
|
Senior Member |
|
|
On Tue, 28 Nov 2006 09:52:06 -0800, Braedley wrote:
>
> Braedley wrote:
>> JD, a small nitpick: ind_int_sort will occasionally take the index from
>> [a, b], and not from just a. This can quickly lead to out of bounds
>> conditions if the user doesn't want to index [a, b], but just wants to
>> index a. In my case, a is a column from a 2D string array, where b is
>> just a 1D string array. I think a where statement is all that is
>> needed to fix this (I know, it'll slow it down for large sets).
>>
>> Braedley
>
> Actually, the fix was much easier than previously thought. Instead of
> return, srt[wh]
> use
> return, srt[wh]<srt[wh+1]
>
> I haven't done any tests, but it shouldn't take much longer for sparse
> or small sets.
That is a clever fix, but if the ordering of elements from a and b is
random, and if you have a repeated set in a match a repeated set in b, and
their interleaved sorted order is random, you'll get back a random number
of the matching repeats (not 1, as was intended).
See my other post though, and let me know your findings w.r.t. SORT.
Thanks,
JD
|
|
|
Re: What? You can't histogram a string array? [message #51546 is a reply to message #51545] |
Tue, 28 November 2006 10:08   |
JD Smith
Messages: 850 Registered: December 1999
|
Senior Member |
|
|
On Tue, 28 Nov 2006 09:16:12 -0800, Braedley wrote:
> JD, a small nitpick: ind_int_sort will occasionally take the index from
> [a, b], and not from just a. This can quickly lead to out of bounds
> conditions if the user doesn't want to index [a, b], but just wants to
> index a. In my case, a is a column from a 2D string array, where b is
> just a 1D string array. I think a where statement is all that is
> needed to fix this (I know, it'll slow it down for large sets).
This is not good, and much worse than a minor nitpick. The
IND_INT_SORT algorithm relies on SORT doing the right thing. That is,
for two identical elements in the concatenated vector [a,b], SORT
should place the first one first, i.e. the matching elements from 'a'
will show up before those from 'b'. That's the only reason it
works. There was always the concern that IDL's SORT would change and
this would no longer be the case (the element from b would come
first), in which case the algorithm would be broken.
Can you provide an example where this isn't happening? I just tried
it on a simulated set of 100,000 random 6 character strings, and it
didn't show this behavior: all ~30 matching elements were selected
from a. I then ran this test 100 times, and in all cases it behaved
as expected. Perhaps it depends on the machine/OS? I'm actually not
sure if SORT calls a library sort function (which might make the
algorithm non-portable), or uses its own. You can try this test
yourself, like this:
for i=1,100 do begin
a=string(byte(randomu(sd,6,100000)*26)+65b)
b=string(byte(randomu(sd,6,100000)*26)+65b)
s=ind_int_sort(a,b)
print,strtrim(n_elements(s),2),' matches found'
m=max(s)
if m ge 100000 then begin
print,'Out of bounds: ',m
break
endif
endfor
Let me know if it runs through without error for you. For anyone else
who wants to test this, it would be appreciated. Here I run:
IDL> help,!VERSION,/st
** Structure !VERSION, 8 tags, length=76, data length=76:
ARCH STRING 'x86'
OS STRING 'linux'
OS_FAMILY STRING 'unix'
OS_NAME STRING 'linux'
RELEASE STRING '6.3'
BUILD_DATE STRING 'Mar 23 2006'
MEMORY_BITS INT 32
FILE_OFFSET_BITS
INT 64
BTW, if you only want the *values*, not the positions, where match
occurred, replace:
return,srt[wh]
with
return,s[wh]
and this will "solve" the problem for you (with this change, it's
equivalent to the CONTAIN function I posted long long ago). This is
insensitive to the ordering of a or b SORT performs.
Also note that IND_INT_SORT only returns *one* match for repeated
elements, which may or may not be what you want.
JD
|
|
|
Re: What? You can't histogram a string array? [message #51547 is a reply to message #51546] |
Tue, 28 November 2006 09:52   |
Braedley
Messages: 57 Registered: September 2006
|
Member |
|
|
Braedley wrote:
> JD, a small nitpick: ind_int_sort will occasionally take the index from
> [a, b], and not from just a. This can quickly lead to out of bounds
> conditions if the user doesn't want to index [a, b], but just wants to
> index a. In my case, a is a column from a 2D string array, where b is
> just a 1D string array. I think a where statement is all that is
> needed to fix this (I know, it'll slow it down for large sets).
>
> Braedley
Actually, the fix was much easier than previously thought. Instead of
return, srt[wh]
use
return, srt[wh]<srt[wh+1]
I haven't done any tests, but it shouldn't take much longer for sparse
or small sets.
Braedley
|
|
|
|
|
|
|
|
|
|
|
|
Re: What? You can't histogram a string array? [message #51680 is a reply to message #51544] |
Tue, 28 November 2006 13:17  |
JD Smith
Messages: 850 Registered: December 1999
|
Senior Member |
|
|
On Tue, 28 Nov 2006 11:43:20 -0700, David Fanning wrote:
> JD Smith writes:
>
>> [quoted text muted]
>
> In running the test program, I get immediate out-of-bounds
> errors with IDL's SORT routine. But nothing of the sort
> (a pun!) with the NASA BSORT routine I always use when I
> need to sort something "for real".
OK, so far OSX and Windows XP throw out of bounds errors. Can anyone
on Linux confirm that this runs without error? I checked libidl.so,
and it mentions qsort, of the GLIBC variety. So it must be my
implementation of qsort in my GLIBC preserves order, but others do
not. Ouch.
Might want to add a note to that page. If you don't have repeated
elements, then the fix Braedley offered works fine. BSORT from
Nasalib sorts and then reorders duplicates to preserve the original
order. It will compromise speed somewhat, but is a good alternative.
JD
P.S. How long as it been the case that SORT scrambles order on Windows?
I'm surprised the issue with IND_INT_SORT didn't come up before.
|
|
|
Re: What? You can't histogram a string array? [message #51689 is a reply to message #51545] |
Tue, 28 November 2006 11:23  |
Braedley
Messages: 57 Registered: September 2006
|
Member |
|
|
JD Smith wrote:
> On Tue, 28 Nov 2006 09:52:06 -0800, Braedley wrote:
>
>>
>> Braedley wrote:
>>> JD, a small nitpick: ind_int_sort will occasionally take the index from
>>> [a, b], and not from just a. This can quickly lead to out of bounds
>>> conditions if the user doesn't want to index [a, b], but just wants to
>>> index a. In my case, a is a column from a 2D string array, where b is
>>> just a 1D string array. I think a where statement is all that is
>>> needed to fix this (I know, it'll slow it down for large sets).
>>>
>>> Braedley
>>
>> Actually, the fix was much easier than previously thought. Instead of
>> return, srt[wh]
>> use
>> return, srt[wh]<srt[wh+1]
>>
>> I haven't done any tests, but it shouldn't take much longer for sparse
>> or small sets.
>
> That is a clever fix, but if the ordering of elements from a and b is
> random, and if you have a repeated set in a match a repeated set in b, and
> their interleaved sorted order is random, you'll get back a random number
> of the matching repeats (not 1, as was intended).
>
> See my other post though, and let me know your findings w.r.t. SORT.
>
> Thanks,
>
> JD
I hit an out of bounds on my first try. Running MacOSX, 10.4.8,
IDLv6.2. Unfortunately, I do need the indices, as I pointed out
earlier. Perhaps I'll use BSORT instead.
|
|
|
Re: What? You can't histogram a string array? [message #51691 is a reply to message #51543] |
Tue, 28 November 2006 10:53  |
news.qwest.net
Messages: 137 Registered: September 2005
|
Senior Member |
|
|
"David Fanning" <news@dfanning.com> wrote in message
news:MPG.1fd632bc5a98f848989df5@news.frii.com...
> David Fanning writes:
>
>> In running the test program, I get immediate out-of-bounds
>> errors with IDL's SORT routine. But nothing of the sort
>> (a pun!) with the NASA BSORT routine I always use when I
>> need to sort something "for real".
>
> Whoops! I forgot the mandatory link:
>
> http://www.dfanning.com/tips/sort.html
>
You know, those ad links are actually pretty good. I'm looking
at the Princeton Instruments brochure right now, to see the latest
in near infrared imaging systems. I wonder if the tech has gotten to
the point where we can use off the shelf stuff now, instead of building
our own.
|
|
|