| Subject: Re: BOINC - characters error |
| From: david@djwhome.demon.co.uk (David Woolley) |
| Date: 05/03/2005, 11:18 |
In article <4228faf8$0$31055$a729d347@news.telepac.pt>,
Jos? Manuel ?lvares Pereira <alv.pereira@mail.telepac.pt> wrote:
I'm portuguese and I use my natural language. My username (the same for all
projects) has some characters like <illegal character> and <illegal
character>. Two of the projects
Welcome to the confusing world of I18n (Internationalisation). Firstly,
your news posting software is not I18n aware or not correctly configured
for non-ASCII operation. The default character set for news is ASCII
and your news posting is lacking the MIME (Multipurpose Internet Mail
Extensions) headers that would be needed to override that.
I assume that the characters that you intended were lower case e-acute
and upper case A-acute. In that case, the character set you were using
was either ISO 8859/1 or Windows Code Page 1252. As you are using a
Windows email program, it is likely that you are really using Code Page
1252, but it is better to use the non-proprietary ISO one if you don't
use the Microsoft-specific characters. As such, your news posting needs
at least the following headers:
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
It is not safe for a newsreader to assume this character set when the
headers are missing (yours does), because that breaks for anyone outside
Western Europe. In practice,
although not legal, it works if everyone that you communicate with has
a Western European language as their primary language, but doesn't work
for a Greek or Chinese person who also reads nominally English articles
with non-US-English characters in them.
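As a rough illustration of the headers listed above, here is a minimal Python sketch that builds an article body labelled as ISO 8859/1 with 8bit transfer encoding, using the standard library email package. The function name build_article and the sample body are made up for the example; this is not what any particular news client does internally.

```python
from email.message import EmailMessage


def build_article(body: str) -> EmailMessage:
    """Build a text/plain message labelled as ISO 8859/1, sent as 8bit."""
    msg = EmailMessage()
    msg["Subject"] = "BOINC - characters error"
    # set_content() adds Content-Type and Content-Transfer-Encoding, and
    # EmailMessage adds MIME-Version: 1.0 if it is not already present.
    msg.set_content(body, charset="iso-8859-1", cte="8bit")
    return msg


# \u00e9 is lower case e-acute, \u00c1 is upper case A-acute.
article = build_article("My username is Jos\u00e9 \u00c1lvares.\n")

for name in ("MIME-Version", "Content-Type", "Content-Transfer-Encoding"):
    print(f"{name}: {article[name]}")
```

With those three headers present, a reader no longer has to guess which character set the 8-bit bytes belong to.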
(Failure to correctly identify the character set of web pages is also a
big problem, especially if one wants to, say, view both Chinese and
mainland European web pages. For that reason it is invalid (but more
often than not the case) for web pages not to explicitly specify the
character set.)
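To show why guessing the character set is unsafe, the following Python sketch decodes the same unlabelled bytes under three different assumptions. The sample bytes and the choice of ISO 8859/7 (Greek) as the "wrong" guess are mine, picked only for illustration.

```python
# The bytes as they might arrive with no charset label at all.
RAW = b"Jos\xe9 \xc1lvares"


def decode_as(raw: bytes, charset: str) -> str:
    """Interpret raw bytes under an assumed character set."""
    return raw.decode(charset)


readings = {
    charset: decode_as(RAW, charset)
    for charset in ("iso-8859-1", "cp1252", "iso-8859-7")
}

for charset, text in readings.items():
    # ascii() keeps the output printable on any terminal.
    print(charset, ascii(text))
```

The two Western European readings agree for these particular bytes, which is why the mistake goes unnoticed in Western Europe, whereas a reader defaulting to the Greek set sees Greek letters in place of the accented ones.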
(ClimatePrediction.net and ProteinPredictorAtHome) retrieve those characters
correctly. But the other two projects retrieve my username with incorrect
It may be retrieving them correctly, or it may be returning them exactly
as you sent them, which only works if your PC is configured for the same
character set as when you saved them. I'd need to look at the source code to be sure
whether or not the database uses ISO 8859/1. However, if it was designed
for non-ASCII characters, it should have used UTF-8 given that it was
recently designed. UTF-8 allows for all world languages, at the expense
of using variable numbers of bytes per character.
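The variable byte count is easy to see in a short Python sketch. The sample characters are my own choice, not anything taken from BOINC.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes UTF-8 needs for the given text."""
    return len(ch.encode("utf-8"))


samples = {
    "ASCII letter A": "A",
    "e-acute": "\u00e9",
    "CJK ideograph (U+4E2D)": "\u4e2d",
    "emoji (U+1F600)": "\U0001f600",
}

byte_counts = {label: utf8_length(ch) for label, ch in samples.items()}
for label, count in byte_counts.items():
    print(f"{label}: {count} byte(s)")

# The same name costs one byte more in UTF-8 than in ISO 8859/1.
name = "Jos\u00e9"
utf8_size = len(name.encode("utf-8"))
latin1_size = len(name.encode("iso-8859-1"))
print(utf8_size, latin1_size)
```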
(Pause for a quick look at the source. There is some UTF-8 awareness in the
source code, but, on a quick inspection, I can't tell for sure whether
the user name is supposed to be UTF-8, only that it is stored as 8-bit
characters. It is possible that what you are seeing is error recovery
for invalid UTF-8 character codes. On the other hand, it is not clear
whether the UTF-8 awareness is in the actual code or just in some browser
testing code.)
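For what "error recovery for invalid UTF-8" can look like, here is a Python sketch in which a name saved as ISO 8859/1 is later read as if it were UTF-8. This shows Python's own behaviour only; I am not claiming BOINC recovers in the same way.

```python
# "Jose" with e-acute, stored as ISO 8859/1: the last byte is 0xE9.
latin1_bytes = "Jos\u00e9".encode("iso-8859-1")

# A strict UTF-8 decoder rejects the lone 0xE9 byte outright.
try:
    latin1_bytes.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A lenient decoder substitutes U+FFFD REPLACEMENT CHARACTER instead.
recovered = latin1_bytes.decode("utf-8", errors="replace")

print(strict_ok, ascii(recovered))
```

Other software recovers differently, for example by printing the numeric code of the offending byte, which would be consistent with the symptom reported here.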
characters. For example Jos<illegal character> is written as Jos#233!
I assume that there was no ampersand. Sometimes using a wrong character
set on a web form will result in ampersand hash character code being
stored on the server. Sometimes the server will output this back so
it appears as the original character, but that is an error recovery
behaviour.
0233 is the number you would type on the numeric keypad whilst holding
down the Alt key to enter the character under Windows, i.e. 233 is the
character code for the character.
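The relationship between the character code and the "ampersand hash" form can be sketched in Python as follows. The variable names are invented for the example, and the escaping shown is Python's xmlcharrefreplace error handler, not necessarily what a given browser or server does.

```python
import html

name = "Jos\u00e9"

# The character code of e-acute in ISO 8859/1, Code Page 1252 and Unicode.
code = ord(name[-1])

# Byte 233 in Code Page 1252 is the same character.
from_cp1252 = bytes([233]).decode("cp1252")

# Forcing the name into ASCII yields a numeric character reference.
stored = name.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

# An HTML-aware consumer turns the reference back into the character...
shown = html.unescape(stored)

# ...but without the ampersand it is just literal text.
without_ampersand = html.unescape(stored.replace("&", ""))

print(code, stored, without_ampersand)
```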
How can I correct this?
If the problem is with the clients that print the character code rather
than the way the character was first submitted, and they have published
source code, you can submit a source code patch to the maintainers.
If the database assumes ASCII, that might be difficult to fix. If the
database is UTF-8, then the problem may lie with what was used
to enter the name in the first place.
Another possible problem is in the way the database has been configured.
The schema declares the name as varchar (variable length character) data,
but, for example, Microsoft's SQL Server makes a distinction between the
application program's character set and the database's character set and
will attempt to translate. Unless both character sets are UTF-8 and the
database has been told this correctly, or has been told not to convert,
some characters will get corrupted in going to and from the database.
Different projects may use different database software.
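The kind of corruption that a mismatched translation produces can be imitated with Python's codecs. This is only an analogy under my own assumptions (UTF-8 on one side, Code Page 1252 or ASCII on the other); it does not model SQL Server or any other particular database.

```python
name = "Jos\u00e9"

# Application writes UTF-8, but the bytes are later read as Code Page 1252:
# each byte of the two-byte e-acute becomes a separate character.
mojibake = name.encode("utf-8").decode("cp1252")

# Translating to a character set that lacks the character loses it entirely.
lossy = name.encode("ascii", errors="replace").decode("ascii")

# When both sides agree on UTF-8, the name survives the round trip.
round_trip = name.encode("utf-8").decode("utf-8")

print(ascii(mojibake), lossy, round_trip == name)
```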
BOINC 4.19 and 4.24 had the same problem. On oldest versions, I don't
remember, but I think all was well.
I used the 2004-12-18 snapshot of the BOINC source code, but the
variations in output format almost certainly depend on the client version,
not the BOINC version.