Support non utf8 encoding of HTTP headers injected by the SP
Description
Environment
Activity
Scott Cantor, May 10, 2007 at 11:49 AM
Well, I think Chad has noted that it's not that simple. The rule appears to be that the header encoding depends on the content encoding, and that makes it very difficult to fix. The browser can send any encoding, so you can't just pick one and expect it to work.
Technically, I would have to be able to select the encoding based on the request, and that would have bad ripple effects, or I'd have to transcode the cached data on every request, which is really slow.
For the time being, it would help to know if Chad's suggestion of setting the encoding manually to UTF8 "works", at least in general. Not that that's the answer, but it would help us understand the problem.
I fear there's no real way to solve this in general, and we're stuck picking a bad solution and/or relying on the Java SP as an eventual better solution. But if I just started using ISO-8859 as an option, not only would you still lose many Unicode characters, but you still would break if the browser set a different encoding.
Leif Johansson, May 10, 2007 at 3:06 AM (edited)
Look in connectors/util/java/org/apache/tomcat/util/buf/ByteChunk.java (relative to the tomcat 5.5.17 source):
/** Default encoding used to convert to strings. It should be UTF8,
as most standards seem to converge, but the servlet API requires
8859_1, and this object is used mostly for servlets.
*/
public static final String DEFAULT_CHARACTER_ENCODING="iso-8859-1";
When we changed this to utf-8, headers from the SP containing non-7-bit-ASCII characters turned up OK.
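To illustrate the effect described above, here is a minimal sketch, assuming the SP writes UTF-8 bytes into a header and the container decodes them as ISO-8859-1. The class and method names are hypothetical and are not taken from Tomcat or the SP; the second method shows an application-side workaround that only works when the container's decoding really was ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

public class HeaderMojibake {
    // Simulates a container that decodes UTF-8 header bytes as ISO-8859-1.
    public static String decodeAsLatin1(String original) {
        byte[] wire = original.getBytes(StandardCharsets.UTF_8);
        return new String(wire, StandardCharsets.ISO_8859_1);
    }

    // Workaround sketch: ISO-8859-1 maps every byte to exactly one char,
    // so the raw bytes can be recovered and decoded again as UTF-8.
    public static String recoverUtf8(String misdecoded) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String name = "J\u00f6rg";             // "Jörg"
        String seen = decodeAsLatin1(name);    // two chars appear in place of the o-umlaut
        System.out.println(seen.length());     // 5 instead of 4
        System.out.println(recoverUtf8(seen).equals(name));
    }
}
```

Changing the container's default, as done above, avoids the need for this kind of per-application repair, at the cost of patching the container.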
Leif Johansson, May 10, 2007 at 2:32 AM (edited)
Yes, clearly it is a very bad situation. However, currently for strict v2.3 servlet containers (like some recent versions of Tomcat), UTF-8 headers from the SP get encoded into UTF-8 (again), which loses all code points. I'll try to get you a precise reference.
Scott Cantor, May 9, 2007 at 10:06 AM
Can you give us a reference to the specification language? I just want to make sure I understand the situation fully before I change something. ISO-8859 would destroy a lot of Unicode code points if I used it, and you wouldn't get them back inside the Java code, so this doesn't exactly seem like a solution.
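The concern about losing code points can be sketched as follows. This is an illustration only, using the standard Java charset API with hypothetical class and method names; it assumes the value is simply encoded to ISO-8859-1 with the default replacement behaviour of `String.getBytes`.

```java
import java.nio.charset.StandardCharsets;

public class Latin1Loss {
    // True only if every character of the value exists in ISO-8859-1.
    public static boolean fitsLatin1(String value) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(value);
    }

    // Encodes to ISO-8859-1 and back; unmappable characters are replaced
    // with '?' during encoding and cannot be recovered afterwards.
    public static String roundTripLatin1(String value) {
        byte[] bytes = value.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String polish = "\u0141ukasz";                   // starts with U+0141, outside ISO-8859-1
        System.out.println(fitsLatin1(polish));          // false
        System.out.println(roundTripLatin1(polish));     // ?ukasz
    }
}
```

Characters inside the ISO-8859-1 range survive the round trip unchanged; anything outside it is irreversibly replaced, which is why the Java code could not get the original value back.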
The SP needs a control for the encoding to be used (currently UTF-8 by default) when encoding header values injected by the SP. The problem caused by UTF-8 is that certain servlet 2.3 containers implement the (silly) requirement that all headers must be encoded using ISO-8859-1 (!).