Removing (or replacing) substrings in a string array [message #87262]
Wed, 22 January 2014 06:13
Mats Löfdahl
Say I have a string array where each string may or may not begin with a dot (".") and may or may not end with a dot. What is an efficient (non-loop?) way of removing those dots in IDL 7?
The reason I specify IDL 7 is that I realize this could be done with strsplit in IDL 8, but in IDL 7 strsplit does not support string arrays.
Alternatively, if I could make it so those dots are not there in the first place, that's even better. What I'm trying to do is to parse an array of strings where the fields are separated by dots, but I cannot identify the substrings by their position. What I can do is identify them with regular expressions, because they all have different forms. And stregex supports string arrays in IDL 7 too.
So I can do something like strmid(stregex(strlist,'\.[0-9]{5}\.',/extr),1,5) if I know I have a five-digit number surrounded by dots. I remove the dots with the strmid command.
But any field can also be first or last, so to find it I need to do stregex(strlist,'(\.|^)[0-9]{5}(\.|$)',/extr). But then one of the dots will be missing in those cases when the field really is first or last, so the simple strmid operation does not work.
And some fields can have variable lengths but be identified by an initial character, so I could find it with something like stregex(strlist,'(\.|^)f[+-][0-9]+(\.|$)',/extr). In this case strmid does not work even when there is a dot at the end, because I don't know its position.
Would be great if I could tell stregex to return everything but the dots (if present) but I don't know if that is possible. Can't you do something like that with regular expressions in the shell? Or was it in elisp?
I wanted to educate myself and went looking for “Making Regular Expressions Your Friends” by Mike Galloy but the links seem to be dead in http://michaelgalloy.com/2006/06/11/regular-expressions.html and the link in http://www.exelisvis.com/docs/Learning_About_Regular_E.html leads to a page telling me that "The mgunit project site has moved to Github" and when I follow the provided link to github I can't find the document.
Hmmm... I guess if I substitute '.'+strlist+'.' for strlist in all stregex calls I don't have to worry about the case where there are no dots. But that seems not so elegant and it still does not solve the problem with unknown lengths.

Re: Removing (or replacing) substrings in a string array [message #87264 is a reply to message #87262]
Wed, 22 January 2014 07:19
Matthew Argall
I used the following site to learn about regular expressions. It is a bit wordy, but gets the job done.
http://www.regular-expressions.info/tutorial.html
;Strings
myStr1 = '.aldfa09741_{}+=!@#$%^&*(.'
myStr2 = 'aldfa09741_{}+=!@#$%^&*(.'
myStr3 = '.aldfa09741_{}+=!@#$%^&*('
;Stregex
regex1 = stregex(myStr1, '^\.?([^.]*)\.?$', /SUBEXP, /EXTRACT)
regex2 = stregex(myStr2, '^\.?([^.]*)\.?$', /SUBEXP, /EXTRACT)
regex3 = stregex(myStr3, '^\.?([^.]*)\.?$', /SUBEXP, /EXTRACT)
;Print
print, regex1[1]
print, regex2[1]
print, regex3[1]
'^\.?' -- look for an optional (?) dot (\.) at the beginning of the string (^)
'([^.]*)' -- look for any character except the dot ([^.]) any number of times (*) and extract it ()
'\.?$' -- look for an optional (?) dot (\.) at the end of the string ($)
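The same pattern also works on a whole string array in a single call, which is what the original question asks for. A minimal sketch, assuming a made-up three-element array called strs: with /SUBEXP and /EXTRACT, STREGEX returns one column per input string, where row 0 is the full match and row 1 is the first parenthesized subexpression.

```idl
; Hypothetical test data: single fields with optional leading/trailing dots
strs = ['.abc.', 'abc.', '.abc']

; One call handles the whole array; the result has dimensions [2, 3]
res = stregex(strs, '^\.?([^.]*)\.?$', /SUBEXP, /EXTRACT)

; Row 1 holds the captured text without the dots;
; REFORM drops the leading dimension of length 1
stripped = reform(res[1, *])
print, stripped   ; all three elements should be 'abc'
```

Note that this only covers strings with no interior dots, because [^.]* cannot match a dot.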

Re: Removing (or replacing) substrings in a string array [message #87270 is a reply to message #87264]
Wed, 22 January 2014 07:34
Mats Löfdahl
On Wednesday, 22 January 2014 at 16:19:14 UTC+1, Matthew Argall wrote:
> I used the following site to learn about regular expressions. It is a bit wordy, but gets the job done.
> http://www.regular-expressions.info/tutorial.html
Thanks, I'll have a look at it.
> ;Strings
> myStr1 = '.aldfa09741_{}+=!@#$%^&*(.'
> myStr2 = 'aldfa09741_{}+=!@#$%^&*(.'
> myStr3 = '.aldfa09741_{}+=!@#$%^&*('
>
> ;Stregex
>
> regex1 = stregex(myStr1, '^\.?([^.]*)\.?$', /SUBEXP, /EXTRACT)
> regex2 = stregex(myStr2, '^\.?([^.]*)\.?$', /SUBEXP, /EXTRACT)
> regex3 = stregex(myStr3, '^\.?([^.]*)\.?$', /SUBEXP, /EXTRACT)
>
> ;Print
>
> print, regex1[1]
> print, regex2[1]
> print, regex3[1]
>
> '^\.?' -- look for an optional (?) dot (\.) at the beginning of the string (^)
> '([^.]*)' -- look for any character except the dot ([^.]) any number of times (*) and extract it ()
> '\.?$' -- look for an optional (?) dot (\.) at the end of the string ($)
Maybe that is part of the solution. I hadn't realized you can use the subexp that way. But it fails when there are more fields. The dots are (in general) the separators between multiple fields.
IDL> mystr4= 'gag.aldfa09741_{}+=!@#$%^&*(.sdf'
IDL> regex4 = stregex(myStr4, '^\.?([^.]*)\.?$', /SUBEXP, /EXTRACT)
Then regex4 is an array of two empty strings.
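The match fails because the pattern is anchored at both ends and [^.]* cannot cross the interior dots, so STREGEX finds no match and returns empty strings. If the goal is only to strip one leading and one trailing dot while keeping the interior separators, one possible variant (a sketch that is not from this thread, with made-up test strings) lets the captured part contain dots but requires it to start and end with a non-dot character:

```idl
; Hypothetical multi-field strings, with and without outer dots
strs = ['gag.aldfa09741.sdf', '.gag.aldfa09741.sdf.']

; Outer group: one non-dot character, optionally followed by anything
; that ends in a non-dot character. Interior dots are allowed, but the
; optional outer dots stay outside the captured text.
; Strings with two or more consecutive outer dots will not match.
res = stregex(strs, '^\.?([^.](.*[^.])?)\.?$', /SUBEXP, /EXTRACT)

; Row 0 is the full match, row 1 the outer group, row 2 the inner group
print, reform(res[1, *])   ; both elements should be 'gag.aldfa09741.sdf'
```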

Re: Removing (or replacing) substrings in a string array [message #87274 is a reply to message #87270]
Wed, 22 January 2014 07:58
Mats Löfdahl
On Wednesday, 22 January 2014 at 16:34:26 UTC+1, Mats Löfdahl wrote:
>
> Maybe that is part of the solution.
Indeed it is!
In my own example I had these two calls:
stregex(strlist,'(\.|^)[0-9]{5}(\.|$)',/extr)
stregex(strlist,'(\.|^)f[+-][0-9]+(\.|$)',/extr)
If I rewrite them like this
stregex(strlist,'(\.|^)([0-9]{5})(\.|$)',/extr,/subexp)
stregex(strlist,'(\.|^)(f[+-][0-9]+)(\.|$)',/extr,/subexp)
I get 2D string arrays. And I can get a list of the field I'm interested in by doing
(stregex(strlist,'(\.|^)([0-9]{5})(\.|$)',/extr,/subexp))[2, *]
(stregex(strlist,'(\.|^)(f[+-][0-9]+)(\.|$)',/extr,/subexp))[2, *]
Thank you Matthew!
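For reference, here is a small worked example of the calls above. The contents of strlist are invented for illustration, and REFORM is added only to turn the [1, n] slice into a plain 1D array:

```idl
; Hypothetical dot-separated strings; the fields appear in varying positions
strlist = ['cam1.00042.f+123.im', '00017.f-5', 'cam2.f+0.12345']

; Subexpression 1 is the leading dot (or start of string), 2 is the field
; itself, 3 is the trailing dot (or end of string), so row 2 is the field
num = reform((stregex(strlist, '(\.|^)([0-9]{5})(\.|$)', /EXTRACT, /SUBEXP))[2, *])
foc = reform((stregex(strlist, '(\.|^)(f[+-][0-9]+)(\.|$)', /EXTRACT, /SUBEXP))[2, *])

print, num   ; expected: 00042 00017 12345
print, foc   ; expected: f+123 f-5 f+0
```

A string that does not contain the field yields an empty string in the corresponding element, so missing fields can be detected with a comparison against ''.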

Re: Removing (or replacing) substrings in a string array [message #87278 is a reply to message #87262]
Wed, 22 January 2014 10:14
wlandsman
A very simple and fast (but not general) method is to convert the string to a byte array and remove all occurrences of the byte value for a period. This is what the routine remchar does:
http://idlastro.gsfc.nasa.gov/ftp/pro/misc/remchar.pro
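The byte-array idea can be sketched for a single string as follows. This is only an illustration of the technique with a made-up input, not the source of remchar itself. Note that it removes every dot, including the interior separators, so neighbouring fields run together:

```idl
; Hypothetical input string
s = '.gag.00042.sdf.'

; Convert to a byte array and find the positions that are not a period
b = byte(s)
dot = (byte('.'))[0]
keep = where(b ne dot, nkeep)

; Rebuild the string from the remaining bytes (empty if nothing is left)
if nkeep gt 0 then s = string(b[keep]) else s = ''
print, s   ; expected: gag00042sdf
```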

Re: Removing (or replacing) substrings in a string array [message #87283 is a reply to message #87262]
Wed, 22 January 2014 14:57
Michael Galloy
Check out MG_STREPLACE:
https://github.com/mgalloy/mglib/blob/master/src/strings/mg_streplace.pro
I have that "Making Regular Expressions Your Friends" article here
somewhere too. I will update the link on my website and post here when I
find it. Have to run now...
Mike
--
Michael Galloy
www.michaelgalloy.com
Modern IDL: A Guide to IDL Programming (http://modernidl.idldev.com)
Research Mathematician
Tech-X Corporation

Re: Removing (or replacing) substrings in a string array [message #87284 is a reply to message #87283]
Wed, 22 January 2014 18:50
Michael Galloy
I updated the article at:
http://michaelgalloy.com/2006/06/11/regular-expressions.html
with working links to the article and the example code. I would use
MG_STREPLACE from my Github repo over the STR_REPLACE linked to in the
article.
Mike