comp.lang.idl-pvwave archive
Messages from Usenet group comp.lang.idl-pvwave, compiled by Paulo Penteado

byte/unicode mismatch [message #63772] Thu, 20 November 2008 02:19
R.Bauer
Hi

The ASCII table is gone.

IDL> print, byte('ü')
195 188

A character now takes two bytes:

IDL> help, byte('ü')
<Expression> BYTE = Array[2]

This means that all of the fast string-replacement routines that rely on
ISO-encoded, one-byte characters are broken in 7.0.

What is the name of the function that converts byte('ü') into 252b?

cheers
Reimar
Re: byte/unicode mismatch [message #63841 is a reply to message #63772] Thu, 20 November 2008 11:08
Heinz Stege
On Thu, 20 Nov 2008 09:23:52 -0800 (PST), mgalloy@gmail.com wrote:

> On Nov 20, 3:19 am, Reimar Bauer <R.Ba...@fz-juelich.de> wrote:
>> Hi
>>
>> the ascii table is gone.
>>
>> IDL> print,byte('ü')
>> 195 188
>>
>> A char has now two bytes
>>
>> IDL> help, byte('ü')
>> <Expression> BYTE = Array[2]
>>
>> This means all of the fast string replacing routines which are related
>> to iso encoded ascii one byte characters are broken in 7.0
>>
>> What is the name of the function to convert byte('ü') into 252b ?
>
> I guess it is how you type/enter the ü:
>
> IDL> u = string(252B)
> IDL> print, u
> ü
> IDL> help, u
> U STRING = 'ü'
> IDL> print, byte(u)
> 252
>
> Mike

Most readers here probably don't have a ü key on their keyboard, so
here is another example:

IDL> print,!version
{ x86 Win32 Windows Microsoft Windows 7.0 Oct 25 2007 32 64}
IDL> mu='µ' ; (the Greek letter)
IDL> help,mu
MU STRING = 'µ'
IDL> help,byte(mu)
<Expression> BYTE = Array[2]
IDL> print,byte(mu)
194 181

The string entered on the workbench command line is encoded in UTF-8.
Using this string as a title in direct graphics produces a mu
preceded by an "A" with a hat: direct graphics doesn't understand UTF-8
and would need string(181b) for a mu.
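
A minimal sketch of the difference, building both forms of the mu from explicit byte values so that nothing depends on how the editor encodes a typed character. The variable names and the plot title are made up for illustration, and the PLOT call assumes a direct graphics device is available:

```idl
; Both forms of the Greek mu, built from byte values (illustrative names).
mu_utf8   = string([194B, 181B])   ; UTF-8: two bytes
mu_latin1 = string(181B)           ; ISO-8859-1: one byte

; Show that the two strings really differ in length at the byte level.
help, byte(mu_utf8), byte(mu_latin1)

; Per the observation above, direct graphics wants the one-byte form.
plot, findgen(10), title = 'Thickness (' + mu_latin1 + 'm)'
```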

Unless I'm missing something, Reimar is asking for a function that
converts a UTF-8 string to ISO 8859(?).

Heinz
Re: byte/unicode mismatch [message #63847 is a reply to message #63772] Thu, 20 November 2008 09:23
Michael Galloy
On Nov 20, 3:19 am, Reimar Bauer <R.Ba...@fz-juelich.de> wrote:
> Hi
>
> the ascii table is gone.
>
> IDL> print,byte('ü')
>  195 188
>
> A char has now two bytes
>
> IDL> help, byte('ü')
> <Expression>    BYTE      = Array[2]
>
> This means all of the fast string replacing routines which are related
> to iso encoded ascii one byte characters are broken in 7.0
>
> What is the name of the function to convert byte('ü') into 252b ?

I guess it is how you type/enter the ü:

IDL> u = string(252B)
IDL> print, u
ü
IDL> help, u
U STRING = 'ü'
IDL> print, byte(u)
252

Mike
--
www.michaelgalloy.com
Tech-X Corporation
Associate Research Scientist
Re: byte/unicode mismatch [message #63880 is a reply to message #63772] Fri, 21 November 2008 09:51
Allan Whiteford
Reimar Bauer wrote:
> That is all orthogonal.
>
> How can I decode and how can I encode?
>
> cheers
> Reimar
>

Reimar,

The question (and the answer) isn't all that straightforward: byte values
over 127 aren't well defined without an encoding or a code page.

However, the answer you're probably looking for is:

b=byte('ü') ; assumption 2
print,b[1]+(b[0] eq 195)*64 ; assumption 1

which assumes:

1) you want to go from (two-byte) UTF-8 byte values to ISO-8859-1

and

2) that the u-umlaut character entered the interpreter from a UTF-8
environment.

Please don't just cut and paste the above assuming all will be well.
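
The same idea can be generalised into a pair of helper functions, sketched here under the same two assumptions. The names utf8_to_latin1 and latin1_to_utf8 are hypothetical (they are not IDL built-ins), only code points up to 255 are handled, and any byte that is not part of a valid two-byte sequence is copied unchanged:

```idl
; Sketch only: decode two-byte UTF-8 sequences (U+0080..U+00FF) to
; single ISO-8859-1 bytes. Hypothetical helper, not part of IDL.
function utf8_to_latin1, str
  compile_opt idl2
  b = byte(str)                 ; raw bytes of the string
  n = n_elements(b)
  out = bytarr(n)               ; the result is never longer than the input
  i = 0L
  j = 0L
  while i lt n do begin
    ; Lead bytes 194 and 195 start the two-byte sequences for
    ; U+0080..U+00FF; the next byte must be a continuation byte (128..191).
    if (i + 1 lt n) && (b[i] eq 194B || b[i] eq 195B) && $
       (b[i+1] ge 128B) && (b[i+1] le 191B) then begin
      out[j] = byte((b[i] and 3B) * 64 + (b[i+1] and 63B))
      i += 2
    endif else begin
      out[j] = b[i]             ; ASCII or unrecognised byte: copy as is
      i += 1
    endelse
    j += 1
  endwhile
  return, string(out[0:j-1])
end

; Sketch only: encode ISO-8859-1 bytes as UTF-8. Hypothetical helper.
function latin1_to_utf8, str
  compile_opt idl2
  b = byte(str)
  n = n_elements(b)
  out = bytarr(2 * n)           ; worst case: every byte becomes two
  j = 0L
  for i = 0L, n - 1 do begin
    if b[i] lt 128B then begin
      out[j] = b[i]             ; ASCII is identical in both encodings
      j += 1
    endif else begin
      out[j]   = 192B + b[i] / 64B       ; lead byte: 194 or 195
      out[j+1] = 128B + (b[i] and 63B)   ; continuation byte: 128..191
      j += 2
    endelse
  endfor
  return, string(out[0:j-1])
end
```

For the character in this thread, print, byte(utf8_to_latin1(string([195B, 188B]))) should give 252, and feeding string(252B) to latin1_to_utf8 should give back the bytes 195 188. Neither function tries to detect which encoding its input is in, so calling the wrong one will silently produce garbage.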

Thanks,

Allan

Re: byte/unicode mismatch [message #63914 is a reply to message #63772] Fri, 21 November 2008 02:10
R.Bauer
That is all orthogonal.

How can I decode and how can I encode?

cheers
Reimar

Re: byte/unicode mismatch [message #63917 is a reply to message #63841] Fri, 21 November 2008 02:04
Allan Whiteford
Heinz Stege wrote:
> On Thu, 20 Nov 2008 09:23:52 -0800 (PST), mgalloy@gmail.com wrote:
>
>
>> On Nov 20, 3:19 am, Reimar Bauer <R.Ba...@fz-juelich.de> wrote:
>>
>>> Hi
>>>
>>> the ascii table is gone.
>>>
>>> IDL> print,byte('ü')
>>> 195 188
>>>

> The string entered in the workbench command line is encoded in UTF8.

Picking up on this point (and the one made by Mike): it's mostly to do
with your editor. The workbench seems to be Unicode-aware, so it really
is passing a two-byte representation of ü to the interpreter.

If I use the simple command-line interface running in an xterm
(X.Org 6.8.99.903), which I guess isn't Unicode-aware, then I get 252 with
the same version of IDL:

IDL> print,!version
{ x86 linux unix linux 7.0 Oct 25 2007 32 64}
IDL> print,byte('ü')
252

but with the workbench:

IDL> print,!version
{ x86 linux unix linux 7.0 Oct 25 2007 32 64}
IDL> print,byte('ü')
195 188

I would expect that if you read the character from a file (either as
data or in a .pro file) it depends on the program which wrote the file
and whether your editor was unicode-aware.

In saying all this, I don't understand unicode properly (does anyone?!?)
- I'm just reporting on the fact that it isn't just the IDL interpreter
which is the issue, it's to do with the editor which sends the character
to the interpreter.

This has already been said - I've just rephrased it using more
(unnecessary?) words. I hope it's helpful.

Thanks,

Allan
Re: byte/unicode mismatch [message #63933 is a reply to message #63772] Mon, 24 November 2008 05:38
Allan Whiteford
Reimar Bauer wrote:
> Hmm, this confuses me even more. Let's see if another example helps.
>
> Suppose I write an output file from the IDE, e.g.
>
> openw, 10, 'testfile.txt'
> printf, 10, 'Jülich'
> close, 10
>
> If I run this program with ISO encoding, isn't the result different from UTF-8?
>

Yes, copying and pasting that code into an IDL interpreter using a UTF-8
environment/editor will give a different output file to using one
without such awareness.

> Or how can I write it iso encoded independent from the user setting?

I would have suggested checking whether n_elements(byte("Jülich")) equals
strlen("Jülich") to tell if things are UTF-8 or not, but it seems
the IDL strlen function actually just counts bytes (I don't think it
should do this).
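
One possible workaround, sketched under the assumption that the string holds well-formed UTF-8: continuation bytes always lie in the range 128 to 191, so counting every byte outside that range gives the number of characters. The variable names are illustrative:

```idl
; 'Jülich' spelled out as UTF-8 bytes, so the editor encoding is irrelevant.
s = string([74B, 195B, 188B, 108B, 105B, 99B, 104B])
b = byte(s)

; Continuation bytes have the bit pattern 10xxxxxx, i.e. (b and 192) eq 128.
n_chars = long(total((b and 192B) ne 128B))
n_bytes = n_elements(b)

print, n_chars, n_bytes   ; 6 characters, 7 bytes
if n_chars ne n_bytes then print, 'string contains multi-byte sequences'
```

This is only a heuristic. An ISO-8859-1 string that happens to contain bytes in the range 128 to 191 (the mu at 181, for example) would be miscounted, so it cannot reliably tell the two encodings apart.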

I'm not sure there is an elegant solution to this problem. In any case,
I'm about to lose my free wi-fi.

Thanks,

Allan

> In python I have several methods for that.
> http://effbot.org/zone/unicode-objects.htm
>
> cheers
> Reimar
Re: byte/unicode mismatch [message #63965 is a reply to message #63880] Fri, 21 November 2008 13:45
R.Bauer
Allan Whiteford wrote:
> Reimar Bauer wrote:
>> That is all orthogonal.
>>
>> How can I decode and how can I encode?
>>
>> cheers
>> Reimar
>>
>
> Reimar,
>
> The question (and answer) isn't all that straightforward, byte values
> over 127 aren't well defined without an encoding system or a codepage.
>
> However, the answer you're probably looking for is:
>
> b=byte('ü') ; assumption 2
> print,b[1]+(b[0] eq 195)*64 ; assumption 1
>
> which is assuming:
>
> 1) you want byte values from (two byte) UTF-8 to ISO-8859-1
>
> and
>
> 2) that the u-umlaut character has entered the interpreter from a UTF-8
> environment.
>
> Please don't just cut and paste the above assuming all will be well.
>
> Thanks,
>
> Allan
>

Hmm, this confuses me even more. Let's see if another example helps.

Suppose I write an output file from the IDE, e.g.

openw, 10, 'testfile.txt'
printf, 10, 'Jülich'
close, 10

If I run this program with ISO encoding, isn't the result different from UTF-8?

Or how can I write it ISO-encoded, independent of the user's settings?
In Python I have several methods for that.
http://effbot.org/zone/unicode-objects.htm
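
A minimal sketch of one editor-independent approach, reusing the string(252B) idea from earlier in the thread: keep the source file pure ASCII and build the non-ASCII character from its ISO-8859-1 byte value, so the bytes written should not depend on how the editor encodes a typed ü. The variable name is illustrative, and /get_lun with free_lun is used here simply in place of the hard-coded unit 10:

```idl
; Pure-ASCII source: the u-umlaut is given as its ISO-8859-1 byte value.
name = 'J' + string(252B) + 'lich'

; Let IDL pick a free logical unit instead of hard-coding one.
openw, lun, 'testfile.txt', /get_lun
printf, lun, name
free_lun, lun   ; closes the file and releases the unit
```

This only covers strings that are built in the program. Text that arrives from the command line or from a file written elsewhere still carries whatever encoding its source used.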

cheers
Reimar
Re: byte/unicode mismatch [message #63999 is a reply to message #63933] Tue, 25 November 2008 05:03
R.Bauer
I have forwarded a feature request to Creaso for an encoding/decoding
parameter for open, and had a phone call about it five minutes ago. Let's see.

Reimar


