Understanding why an ASP.NET page doesn't display special characters like "åäö" correctly

Posted: (EET/GMT+2)

 

Recently, I ran into an issue with an ASP.NET 1.1 web application built with Delphi 2006 and Dreamweaver MX 2004 that had problems showing Finnish national characters like å, ä, and ö. These special characters were always displayed as garbage in the web browser, so I had to investigate.

At first sight, everything appeared to be in order. The .aspx pages were encoded in UTF-8 (Unicode), there was a proper meta tag for older browsers, and ASP.NET was configured via the web.config "globalization" element to return the pages in UTF-8. Still, Internet Explorer displayed the special characters as garbage. Until then, the issue had been worked around with HTML entities, i.e. by using codes like "&auml;" for "ä", and so forth.
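For reference, the relevant part of web.config looked something like the sketch below (attribute values are assumptions on my part). Worth noting: the "globalization" element also supports a fileEncoding attribute, which tells ASP.NET how to interpret source files that carry no BOM, so it offers another way around the problem described later in this post.

```xml
<configuration>
  <system.web>
    <!-- responseEncoding controls the charset sent to the browser;
         fileEncoding tells ASP.NET how to read .aspx source files
         that have no BOM -->
    <globalization
      requestEncoding="utf-8"
      responseEncoding="utf-8"
      fileEncoding="utf-8" />
  </system.web>
</configuration>
```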

To test the issue, I built a very simple ASP.NET page containing just basic HTML and the special characters in question:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>UTF-8 Test</title>
</head>

<body>
Test: åäö
</body>
</html>

Next, I used telnet.exe to see how this page was returned by the server. The session produced output like this:

GET /test-utf8.aspx HTTP/1.0

HTTP/1.1 200 OK
Connection: close
Date: Tue, 11 Jul 2006 09:00:42 GMT
Server: Microsoft-IIS/6.0
X-Developed-With: Delphi
P3P: CP="PHYo ONLo CONo TELo SAMo"
X-Powered-By: ASP.NET
X-AspNet-Version: 1.1.4322
Cache-Control: private
Content-Type: text/html; charset=utf-8
Content-Length: 274

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>UTF-8 Test</title>
</head>

<body>
Test: ????????????
</body>
</html>

Take note of the third-last line: "Test: ????????????". In UTF-8, characters like å, ä, and ö should each be encoded as two bytes, but clearly that was not the case here: something was performing the UTF-8 conversion twice. Initially, I suspected a bug in the ASP.NET engine, but after some research, I found that assumption to be incorrect. Instead, what I found (when viewing the files in a hex editor) was that Dreamweaver had in fact saved the files as UTF-8, but ASP.NET did not detect that the files were already in UTF-8 format and instead treated them as plain ANSI files. This re-encoding caused the garbage characters to appear.
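The double conversion is easy to reproduce. Here is a small Python sketch of the failure mode, assuming ASP.NET read the UTF-8 bytes as Windows-1252 (the usual Western "ANSI" code page):

```python
# The page source as Dreamweaver saved it: UTF-8 bytes, no BOM.
file_bytes = "Test: åäö".encode("utf-8")       # å, ä, ö -> two bytes each

# ASP.NET, seeing no BOM, decodes the bytes as ANSI (Windows-1252 here)...
misread = file_bytes.decode("windows-1252")

# ...and then re-encodes the misread text as UTF-8 for the response.
response_bytes = misread.encode("utf-8")

print(len("åäö".encode("utf-8")))              # 6 bytes: correct encoding
print(len(response_bytes) - len(b"Test: "))    # 12 bytes: double-encoded
print(misread)                                 # mojibake: "Test: Ã¥Ã¤Ã¶"
```

The three special characters end up as twelve bytes instead of six, which matches the twelve question marks telnet printed above.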

To fix the problem, I had to investigate Dreamweaver's settings. Under the Page Properties dialog box, I found the Title/Encoding section:

Here, the developers had forgotten to check the "Include Unicode Signature (BOM)" checkbox, which means that Dreamweaver saves the file with the requested encoding but does not include the so-called "byte order mark" at the beginning of the file. See the Unicode FAQ for more details about UTF-8 encoding and BOMs.

When saved with a BOM, here is how the file looks in a hex editor (Visual Studio 2005, in this case):

As you can see, the file now starts with the byte sequence "EF BB BF", which is the standard BOM for UTF-8 encoded files. With the BOM in place, ASP.NET returns a correctly encoded page to the browser (again viewed via telnet):

GET /test-utf8-bom.aspx HTTP/1.0

HTTP/1.1 200 OK
Connection: close
Date: Tue, 11 Jul 2006 09:01:34 GMT
Server: Microsoft-IIS/6.0
X-Developed-With: Delphi
P3P: CP="PHYo ONLo CONo TELo SAMo"
X-Powered-By: ASP.NET
X-AspNet-Version: 1.1.4322
Cache-Control: private
Content-Type: text/html; charset=utf-8
Content-Length: 268

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>UTF-8 Test</title>
</head>

<body>
Test: ??????
</body>
</html>
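If you cannot rely on an editor to write the signature, the BOM is easy to add and verify programmatically. A minimal Python sketch (the file name mirrors the test page above; Python's "utf-8-sig" codec prepends the EF BB BF signature automatically):

```python
import codecs
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), "test-utf8-bom.aspx")

# Write the test page with a UTF-8 BOM, as Dreamweaver should have done.
with open(path, "w", encoding="utf-8-sig") as f:
    f.write("<html><body>Test: åäö</body></html>")

# Verify: the file must start with the three-byte UTF-8 BOM.
with open(path, "rb") as f:
    raw = f.read()
print(raw[:3] == codecs.BOM_UTF8)   # True -> file begins with EF BB BF
```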

The moral of the story is that even though Unicode is a blessing in itself, it takes more skill to use correctly. The fact of the matter is that plain ASCII was simply easier; still, I don't regret the move, quite the contrary.