Bug411

Version 2, changed by jotspot_jim. 02/23/2005.

Bug ID#: Bug411

Summary (short description): Language Issues
OS: WinXP
Browser: FireFox
Bug Description:

It seems JotSpot does not support non-English languages. Problems appear in wiki page titles and when sending email to a page.

We use different languages in our Wiki: Catalan, Spanish and English, and want to use them on JotSpot.

Sometimes you CANNOT choose the language, for instance when sending EMAIL to a page.

You definitely have to fix this.

Best regards from Barcelona.

Steps to Reproduce:

Send an email with a subject containing: à, é, í, ó, ú....

You will get:
Page Not Found: /WikiHome/WikiProduccion/WikiJurisprudencia/PedidosCendoj/Enviando por correo electrónico: VLEX20050214-1-1.zip
You can create it by clicking here: /WikiHome/WikiProduccion/WikiJurisprudencia/PedidosCendoj/Enviando por correo electrónico: VLEX20050214-1-1.zip?

Your name: Lluis Faus
Your wiki domain name (*.jot.com): vlex

Comments (5)

admin@pinene said, 02/18/2005:

A question to follow up,
I tried to send an email to my WikiHome, with subject in Chinese. Then in WikiHome, I saw the email with strange characters. However, when I clicked it, a new page opened and said "Page Not Found: /WikiHome/ʵÑé ".

Now the question is, how to delete this page?

admin@ok said, 02/19/2005:

I have the same problem

jotspot_jim said, 02/23/2005:

Hello Everyone,

As noted in Bug329, certain international and special characters are not properly encoded, creating error pages which can't be deleted.

JotSpot acknowledges the severity of this issue.  We apologize for any inconvenience this bug has caused.  Please know we are working to fix this problem. 

Sincerely, Jim

admin@jung said, 04/20/2005:

As I commented in Bug161: "When you see lots of uppercase A characters with diacritical marks (e.g., 'Ã'), that usually indicates a UTF-8 double conversion. I.e., some code is trying to convert a string that is already in UTF-8 to UTF-8 again."

I also tried emailing with Japanese in the subject and had similar results.
I examined the HTML source of the page and saw:

Page Not Found: /WikiHome/&#xe6;&#x97;&#xa5;&#xe6;&#x9c;&#xac;&#xe8;&#xaa;&#x9e;&#xe3;&#x81;&#xae;e-mail </h3>

These hex values are the correct BYTES of the UTF-8 encoding of my Japanese string. However, the NCRs (numeric character references) should correspond to CHARACTERS:

Page Not Found: /WikiHome/&#x65E5;&#x672C;&#x8A9E;&#x306E;e-mail </h3>

So the HTML is wrong.
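
To make the difference concrete, here is a minimal Python sketch that reproduces both outputs from the same subject string. The function names are made up for illustration and say nothing about how JotSpot actually generates its HTML; the string is the one implied by the NCRs above.

```python
def ncrs_per_byte(text):
    # Wrong: emits one NCR for each non-ASCII BYTE of the UTF-8 encoding.
    return "".join(
        f"&#x{b:x};" if b > 0x7F else chr(b) for b in text.encode("utf-8")
    )


def ncrs_per_character(text):
    # Right: emits one NCR for each non-ASCII CHARACTER (code point).
    return "".join(
        f"&#x{ord(ch):X};" if ord(ch) > 0x7F else ch for ch in text
    )


# The Japanese subject from this comment, written with escapes.
subject = "\u65e5\u672c\u8a9e\u306ee-mail"

print(ncrs_per_byte(subject))       # matches the broken page source
print(ncrs_per_character(subject))  # matches the expected page source
```

A browser reads each NCR as a code point, so the per-byte form renders as twelve Latin-1 range characters instead of four Japanese ones.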

But you also need to look at what bytes are being saved to the filesystem, or whatever underlying storage mechanism is being used. Are you using NCRs as part of the pathname, raw UTF-8, or something else?

(BTW, I sent 2 emails with the same Japanese string. One I sent as ISO-2022-JP (the most common encoding for Japanese internet email) and one as UTF-8. Both generated the same results. So something received the ISO-2022-JP email and correctly converted it to UTF-8. But then ...)
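
The reasoning behind that test can be sketched in Python: the two emails carry different bytes on the wire, but a correct decode of either yields the same characters, so identical garbage for both points at a step after mail decoding. This only illustrates the encodings involved; it is not a model of JotSpot's mail handling.

```python
# Same subject, written with escapes.
subject = "\u65e5\u672c\u8a9e\u306ee-mail"

# What the two test emails would carry as raw subject bytes
# (ignoring MIME header encoding, which is left out of this sketch).
as_iso2022jp = subject.encode("iso2022_jp")
as_utf8 = subject.encode("utf-8")

# Decoding each with its own charset gives back the same string.
decoded_a = as_iso2022jp.decode("iso2022_jp")
decoded_b = as_utf8.decode("utf-8")

print(as_iso2022jp != as_utf8)           # the wire bytes differ
print(decoded_a == decoded_b == subject)  # the decoded text does not
```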

admin@jung said, 04/20/2005:

The following comments apply to bugs: 25, 65, 79, 100, 161, 217, 411, and to Feedback25, and to Question78, and to "WikiHome >> BugTracker >> Latin1Characters":

When you see lots of uppercase A characters with diacritical marks (e.g., 'Ã'), that usually indicates a UTF-8 double conversion. I.e., some code is trying to convert a string that is already in UTF-8 to UTF-8 again.

(It can also be seen when you have a UTF-8 web document mislabeled as iso-8859-1 (or cp1252); the browser will then run a Latin1->UTF-8 conversion on UTF-8 data. But I doubt this is the problem, since the JotSpot pages are in XHTML, and XHTML and XML default to UTF-8 unless there are encoding attributes.)
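
A minimal Python sketch of what such a double conversion does to a string. The helper name is hypothetical, and Latin1 is assumed as the wrongly guessed source encoding; cp1252 would give slightly different garbage for some bytes.

```python
def double_encode(text):
    # First conversion: correct, characters -> UTF-8 bytes.
    utf8_bytes = text.encode("utf-8")
    # The bug: those bytes are treated as Latin1 text...
    misread = utf8_bytes.decode("latin-1")
    # ...and converted to UTF-8 a second time.
    return misread.encode("utf-8")


original = "electr\u00f3nico"  # "electrónico", as in the reported page title
garbled = double_encode(original).decode("utf-8")

print(garbled)  # the o-acute becomes 'Ã' followed by a superscript three
```

Every non-ASCII character in a Latin-script language encodes to a UTF-8 sequence starting with byte 0xC2 or 0xC3, which is why the garbage is full of 'Â' and 'Ã'.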

Suggestions for debugging:

First, convert the code points of the garbage string back to bytes as if it were encoded in Latin1. Then verify that those bytes correspond to the correct UTF-8 string. Look at the XHTML source. It's likely you will see NCRs for each byte instead of NCRs for each character.
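
The first check can be sketched in Python as follows. The function name is made up for illustration, and the sketch assumes the garbage came from a Latin1 misreading; if cp1252 was involved, substitute that codec.

```python
def recover_if_double_encoded(garbage):
    """Return the repaired string if `garbage` looks double-converted, else None."""
    try:
        # Map each code point back to one byte, then read the bytes as UTF-8.
        recovered = garbage.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either a code point is above U+00FF or the bytes are not valid
        # UTF-8, so this was not a Latin1-style double conversion.
        return None
    # Pure ASCII round-trips unchanged and tells us nothing.
    return recovered if recovered != garbage else None


print(recover_if_double_encoded("electr\u00c3\u00b3nico"))  # repaired title
print(recover_if_double_encoded("caf\u00e9"))               # genuine Latin text
print(recover_if_double_encoded("plain ascii"))
```

If the function returns a sensible string for a garbled page title, the double-conversion diagnosis is confirmed for that title.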

Next, if it's happening in a link, determine what byte values are being used in the path for the backend storage. Hopefully it is raw UTF-8 and not escaped as NCRs.
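
For the link check, one way to see the byte values is to percent-encode the path, which exposes the raw UTF-8 bytes one by one. The path below is only an example built from the strings in this report; how JotSpot actually stores page names is unknown here.

```python
from urllib.parse import quote, unquote

# Example page path using the Japanese string from the earlier comment.
path = "/WikiHome/\u65e5\u672c\u8a9e\u306e"

# quote() percent-encodes the UTF-8 bytes of every non-ASCII character.
escaped = quote(path)
print(escaped)

# A path stored as raw UTF-8 round-trips cleanly through unquote().
print(unquote(escaped) == path)

# A path stored with NCRs would instead contain literal "&#x" sequences.
print("&#x" in path)
```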
