comp.lang.idl-pvwave archive: archive » Re: How to Sort/Uniq a list and keep its original index


Re: How to Sort/Uniq a list and keep its original index [message #50698]

Thu, 12 October 2006 11:16

JD Smith
Messages: 850
Registered: December 1999

Senior Member

On Wed, 11 Oct 2006 17:12:07 -0600, David Fanning wrote:
>
> I haven't tested this, but just off the top of my head:
>
> I = Where(Histogram(indexU, Min=0, Max=N_Elements(testTotal)-1) $
> EQ 0, count)

I think that will leave out one of the duplicates of each set (since
one of them by definition is unique).

If you're going to use HISTOGRAM, you could use it to do the whole
thing:

h=histogram(testTotal,REVERSE_INDICES=ri)
wh=where(h gt 1,cnt) ;; bins with duplicates
for i=0,cnt-1 do do_something_with,ri[ri[wh[i]]:ri[wh[i]+1]-1]

since it's faster than SORT for well-behaved data. Notice that I didn't
explicitly test for empty bins, since I'm only looping over those bins
with 2 or more entries. If most of your duplicate counts are low (2x, 3x,
etc.), you can see another big speedup by binning the resulting histogram.
Standard sparse data warnings apply.
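To make the REVERSE_INDICES loop concrete, here is a minimal sketch on a small made-up array. The sample values and the PRINT call (standing in for do_something_with) are illustrative assumptions, not part of the original post. It is written as a main-level program, so it would be saved to a file and run with .run rather than typed at the prompt.

```idl
; Hypothetical sample data: small integers with some repeated values.
testTotal = [5, 3, 5, 9, 3, 3, 7]

; One bin per integer value (BINSIZE defaults to 1 for integer input).
h = histogram(testTotal, REVERSE_INDICES=ri)

; Bins that hold two or more entries correspond to duplicated values.
wh = where(h gt 1, cnt)

for i = 0, cnt-1 do begin
   ; Indices into testTotal of every element that fell in bin wh[i].
   idx = ri[ri[wh[i]]:ri[wh[i]+1]-1]
   print, 'value', testTotal[idx[0]], ' found at indices', idx
endfor
end
```

For this sample the loop visits two bins: the value 3 (original indices 1, 4 and 5) and the value 5 (original indices 0 and 2). Because the reverse indices point into testTotal itself, no separate mapping back to the original order is needed.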

If you want to use SORT anyway (for simplicity, or for instance
because the data could be very sparse), you could just do the
opposite of what UNIQ does:

indexDUP=where((test eq shift(test,-1)) OR (test eq shift(test,1)))

JD
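Since indexDUP in the SORT version indexes the sorted array, getting back to positions in the original data takes one more lookup through indexS. A minimal sketch, assuming the variable names from the original question and a small made-up array; origDUP is a hypothetical name introduced here:

```idl
; Hypothetical sample data standing in for the 650,000-record array.
testTotal = [5, 3, 5, 9, 3, 3, 7]

indexS = sort(testTotal)      ; permutation that sorts testTotal
test   = testTotal[indexS]    ; sorted copy: 3 3 3 5 5 7 9

; An element is a duplicate if it equals either neighbour in sorted order.
indexDUP = where((test eq shift(test,-1)) OR (test eq shift(test,1)), cnt)

; indexDUP refers to positions in test; indexS maps them back to testTotal.
if cnt gt 0 then begin
   origDUP = indexS[indexDUP]
   print, origDUP
endif
end
```

Here origDUP should contain the original indices 0, 1, 2, 4 and 5 in some order, i.e. every copy of the values 3 and 5. One edge case to keep in mind: SHIFT wraps around, so a one-element array compares equal to itself and would be reported as a duplicate.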


Re: How to Sort/Uniq a list and keep its original index [message #50703 is a reply to message #50698]

Thu, 12 October 2006 09:19

Jean H.
Messages: 472
Registered: July 2006

Senior Member

Without the histogram, you could try:

tmp = lindgen(650000)
tmp[indexU] = -1
duplicate = tmp[where(tmp ne -1)]

Jean
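A small worked sketch of the masking idea above, on made-up data, may help show what it returns. The sample array and the names keep and origIdx are assumptions for illustration. Note that UNIQ returns the index of the last element of each run of equal values, so this approach yields the extra copies only, as positions in the sorted array:

```idl
; Hypothetical sample data.
testTotal = [5, 3, 5, 9, 3, 3, 7]

indexS = sort(testTotal)
test   = testTotal[indexS]          ; sorted copy: 3 3 3 5 5 7 9
indexU = uniq(test)                 ; last index of each run: 2 4 5 6

tmp = lindgen(n_elements(test))     ; 0, 1, ..., n-1
tmp[indexU] = -1                    ; knock out the positions UNIQ kept
keep = where(tmp ne -1, cnt)        ; positions UNIQ skipped: 0 1 3

if cnt gt 0 then begin
   duplicate = tmp[keep]            ; indices into the sorted array
   origIdx   = indexS[duplicate]    ; the same elements, indexed in testTotal
   print, origIdx
endif
end
```

For this sample, duplicate is 0 1 3: two of the three 3s and one of the two 5s. One member of each duplicated set is therefore left out, which matters if every copy of a duplicated record is wanted.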

Dilkushi@gmail.com wrote:
> Dear all
> I have to sort a file with 650,000 records in search of duplicate
> records.. and I need a list of duplicates (not a list without
> duplicates)...
> indexS=sort(testTotal)
> test=testTotal[indexS]
> indexU=uniq(test)
>
> indexU is an index with no duplicates..
> how do I get an index pertaining to the duplicates only?..
> please help..
> thanks in advance
> dilkushi
>


Re: How to Sort/Uniq a list and keep its original index [message #50726 is a reply to message #50703]

Wed, 11 October 2006 16:12

David Fanning
Messages: 11724
Registered: August 2001

Senior Member

Dilkushi@gmail.com writes:

> I have to sort a file with 650,000 records in search of duplicate
> records.. and I need a list of duplicates (not a list without
> duplicates)...
> indexS=sort(testTotal)
> test=testTotal[indexS]
> indexU=uniq(test)
>
> indexU is an index with no duplicates..
> how do I get an index pertaining to the duplicates only?..

I haven't tested this, but just off the top of my
head:

I = Where(Histogram(indexU, Min=0, Max=N_Elements(testTotal)-1) $
EQ 0, count)

Cheers,

David

--
David Fanning, Ph.D.
Fanning Software Consulting, Inc.
Coyote's Guide to IDL Programming: http://www.dfanning.com/
Sepore ma de ni thui. ("Perhaps thou speakest truth.")


Re: How to Sort/Uniq a list and keep its original index [message #50752 is a reply to message #50698]

Wed, 18 October 2006 11:36

Dilkushi@gmail.com
Messages: 21
Registered: August 2006

Junior Member

Thank you JD,
this is what I was looking for... perfect...
dilkushi

JD Smith wrote:

> On Wed, 11 Oct 2006 17:12:07 -0600, David Fanning wrote:
>>
>> I haven't tested this, but just off the top of my head:
>>
>> I = Where(Histogram(indexU, Min=0, Max=N_Elements(testTotal)-1) $
>> EQ 0, count)
>
> I think that will leave out one of the duplicates of each set (since
> one of them by definition is unique).
>
> If you're going to use HISTOGRAM, you could use it to do the whole
> thing:
>
> h=histogram(testTotal,REVERSE_INDICES=ri)
> wh=where(h gt 1,cnt) ;; bins with duplicates
> for i=0,cnt-1 do do_something_with,ri[ri[wh[i]]:ri[wh[i]+1]-1]
>
> since it's faster than SORT for well-behaved data. Notice that I didn't
> explicitly test for empty bins, since I'm only looping over those bins
> with 2 or more entries. If most of your duplicate counts are low (2x, 3x,
> etc.), you can see another big speedup by binning the resulting histogram.
> Standard sparse data warnings apply.
>
> If you want to use SORT anyway (for simplicity, or for instance
> because the data could be very sparse), you could just do the
> opposite of what UNIQ does:
>
> indexDUP=where((test eq shift(test,-1)) OR (test eq shift(test,1)))
>
> JD

