Why are we talking about character encoding?
I know what you're thinking: "This has been covered before," or "Why are you dredging up a history lesson?" Over the past several years of my career, it has become clear that an astonishing number of developers are either unaware of or indifferent to character encodings and why they matter.
Unfortunately, this isn't just a history lesson. Today, in a full stack developer's world, the topic of character encoding is more important than ever. Integrating in-house and vendor services, across varying server and client technologies, into a reliable application requires developers to pay close attention to character encoding. Otherwise, you risk some potentially embarrassing production bugs that will cost your team valuable "street cred". The aim of this article is to reach back in the vault and remind everyone why this topic is still important.
So to solve this problem, there are some things you need to know. You have probably heard of US-ASCII or UTF-8. Most developers have a general understanding that they are character encodings and the difference between them. Some developers also understand the difference between a character set that defines code points, and a character encoding that specifies how to encode a code point as one or more bytes. If you are interested in learning more about how encoding works, or the history of character encoding, I would recommend reading the following two articles.
- "What every programmer absolutely, positively needs to know about encodings and character sets to work with text" by David C. Zentgraf
- "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" by Joel Spolsky
In a perfect world, where you're the consumer of an enlightened service that encodes everything to UTF-8 and you're decoding everything using the common default of UTF-8, you'll never need to worry. In practice, this is the assumption being made most of the time. But what happens if you're calling a kludgy service that encodes responses using Windows-1250 (similar to ISO-8859-2) due to some unknown setting buried in a misunderstood framework configuration file on some long-extinct, unsupported platform? Well, the answer is "absolutely nothing", as long as you are only exchanging characters that have the same byte representations in both encodings. As developers, we need to be prepared to handle these situations when we encounter them.
For the purposes of demonstration, I will refer to three of the most commonly used character encodings, US-ASCII, ISO-8859-1, and UTF-8 in this article, but the ideas presented are relevant to all existing character encodings.
Being generally aware of character encodings is not enough to prevent us from making frequent, avoidable mistakes. What makes this interesting is that, a majority of the time, the character encoding used to decode bytes doesn't matter, due to the overlap among the more popular character sets. For example, US-ASCII defines 128 printable and control characters. ISO-8859-1 has 256 characters and is a superset of US-ASCII, with the first 128 code points identical to US-ASCII; it is one of many character sets to share that trait. The Unicode character set incorporates all 256 characters of ISO-8859-1 as the first code page of its 1,112,064 code points. I might be alone in this, but I wish they hadn't made these character sets overlap. Even if they were "off by one", it would have become clear to developers that they were decoding using the wrong format and this would already be common knowledge.
To illustrate what I mean, consider the word "HAT". The US-ASCII encoding is simply the numeric value of each character. Consulting an ASCII chart yields the following encoding.
HAT = 01001000 01000001 01010100
Now consider the fact that both ISO-8859-1 and UTF-8 share the same first 128 code points. As it turns out, all three encodings represent "HAT" with the same byte string 01001000 01000001 01010100. So essentially, I can encode "HAT" with any one of these three encodings, and decode it with another and I'll always get the correct result back out. This is the source of much of the confusion surrounding character encodings.
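This overlap is easy to verify directly. Here's a quick Java sketch (the class and method names are my own) showing that all three charsets produce identical bytes for "HAT":

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class HatEncoding {
    // Encode "HAT" with three different charsets and compare the raw bytes.
    public static boolean sameBytes() {
        byte[] ascii  = "HAT".getBytes(StandardCharsets.US_ASCII);
        byte[] latin1 = "HAT".getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8   = "HAT".getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(ascii, latin1) && Arrays.equals(ascii, utf8);
    }

    public static void main(String[] args) {
        // All three encodings yield 0x48 0x41 0x54 for "HAT".
        System.out.println(sameBytes()); // true
    }
}
```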
Herein lies the challenge. You may think you have everything correctly configured and it may work for years. Yet suddenly you may be faced with garbled text in your beautiful application. You may have been calling this service for ages and never realized that something was incorrectly configured. The day that this service needs to return the customer name "Günther", we're going to have a problem. When we attempt to decode this in UTF-8, it's going to be displayed as "G�nther". So, who broke the application?
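Here's a small Java sketch (names are mine) that reproduces the breakage: encode with ISO-8859-1, decode with UTF-8, and the "ü" is lost. The Unicode escape for "ü" conveniently sidesteps any question about the source file's own encoding:

```java
import java.nio.charset.StandardCharsets;

public class Mojibake {
    // Encode with ISO-8859-1, then (incorrectly) decode as UTF-8.
    public static String garble(String input) {
        byte[] latin1Bytes = input.getBytes(StandardCharsets.ISO_8859_1);
        // 0xFC (the ISO-8859-1 byte for "ü") is not a valid UTF-8 sequence,
        // so the decoder substitutes U+FFFD, the replacement character.
        return new String(latin1Bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "G\u00FCnther" is "Günther"; the output shows � where ü was.
        System.out.println(garble("G\u00FCnther"));
    }
}
```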
Strategies for avoiding encoding problems
Before I dive into some specific examples, my blanket advice is to be explicit about which encoding you are using. Admittedly, this is advice that I'm not always good at taking. However, the fact remains that the vast majority of code I've seen uses the platform-default encoding, which today is usually, but not always, UTF-8. This is fine until we run the code on a platform with a different default or we read a file encoded in some other format.
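In Java, for example, being explicit usually just means passing a Charset to APIs that would otherwise fall back to the platform default. A minimal sketch (helper names are mine):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExplicitCharset {
    // Explicit: the charset is stated in code instead of being inherited
    // from the platform default, which varies between environments.
    public static String readUtf8Line(Path path) {
        try (BufferedReader r = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            return r.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Round-trip helper: write a line as UTF-8, then read it back explicitly.
    public static String roundTrip(String line) {
        try {
            Path p = Files.createTempFile("enc-demo", ".txt");
            Files.write(p, line.getBytes(StandardCharsets.UTF_8));
            return readUtf8Line(p);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Contrast this with `new FileReader(file)`, which (before Java 18) silently uses whatever the platform default happens to be.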
Web Pages / AJAX
When you load a web page, the browser has to know how the HTML was encoded. Generally, this encoding is set via a configuration property on the web server that is hosting the page you loaded. That information is communicated to the browser via a meta tag in the HTML. You've probably seen this:
<meta charset="UTF-8" />
This informs the browser how to interpret the bytes it received from the server. How was it able to read these characters if it didn't know the encoding, you ask? The browser has to "guess": it partially decodes the bytes, assuming an ASCII-compatible encoding, until it finds the charset declaration, then throws that work out and starts over with the correct charset. I guess it's a good thing that US-ASCII is the de facto standard for the first 128 code points, right?
When a web page makes an AJAX request to a server, it seems logical that the Content-Type header would contain the charset used to encode it. In fact, this is not true. According to the XMLHttpRequest specification, the encoding used for an AJAX request is always UTF-8. Attempts to override this behavior are supposed to be ignored by any browser that correctly implements the spec.
When it comes to REST and SOAP service calls made outside of a web application, things get a little trickier. REST isn't really a specification, but a set of practices that make use of the HTTP specification. SOAP, on the other hand, is a network messaging protocol specification that is commonly transferred via HTTP, but can also use many other transports such as JMS or SMTP. For REST, arguably the "most correct" way to specify charsets is to use the Content-Type header to denote the request payload encoding, and the Accept-Charset header to denote the desired response payload encoding.
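As a sketch of that "most correct" approach, here's how those two headers might be set using the java.net.http client that ships with Java 11+ (the endpoint URL is a placeholder):

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class RestHeaders {
    // Build a request that states the request-body charset explicitly and
    // declares which charset we'd like the response encoded in.
    public static HttpRequest build(String json) {
        return HttpRequest.newBuilder(URI.create("https://example.com/api/customers"))
                .header("Content-Type", "application/json; charset=UTF-8")
                .header("Accept-Charset", "UTF-8")
                // BodyPublishers.ofString encodes with UTF-8 by default,
                // matching the Content-Type we advertised above.
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }
}
```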
Apache CXF, a commonly used SOAP framework, allows you to configure the encoding charsets for the HTTP transport globally via configuration file, or by overriding it per endpoint in the WSDL. My advice is to read the documentation thoroughly when configuring one of these frameworks.
Database / JDBC
I've seen some vague tips in the past on dealing with database encoding configurations. My advice is to ignore them entirely. Generally speaking, the encoding used internally by the database is entirely encapsulated within the database driver and the database implementation itself. If you look at the JDBC spec as an example, either the source encoding is documented in the API, or the data is stored as raw bytes in whatever encoding the client used. For example:
- PreparedStatement.setAsciiStream specifies that the encoded bytes in the InputStream must be ASCII
- PreparedStatement.setCharacterStream takes a Reader instance that must contain Unicode characters (i.e. an InputStreamReader would need to properly decode the source stream)
- PreparedStatement.setBinaryStream takes an InputStream, but the bytes are stored as-is in a VARBINARY or LONGVARBINARY column and no encoding steps take place
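To illustrate the second point: if your character data originates as raw bytes, the decoding happens in the Reader you construct, before JDBC ever sees a character. A sketch (helper names are mine; the actual setCharacterStream call is omitted since it needs a live connection):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

public class JdbcReaderPrep {
    // Build the Reader you would hand to PreparedStatement.setCharacterStream.
    // The decode happens right here, so the charset must match the source
    // bytes; guessing UTF-8 for ISO-8859-1 bytes corrupts the data before
    // it ever reaches the database.
    public static Reader toReader(byte[] sourceBytes, Charset sourceCharset) {
        return new InputStreamReader(new ByteArrayInputStream(sourceBytes), sourceCharset);
    }

    // Drain a Reader to a String (for demonstration; a driver does this internally).
    public static String drain(Reader r) {
        try {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = r.read()) != -1) sb.append((char) c);
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```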
The JMS specification describes a messaging system, so it would seem a natural consequence that encoding of those messages becomes an important configuration. In reality, the JMS service providers handle all of the character conversion coming in and out of the JMS implementations. There are obviously going to be some considerations here if you are forwarding from one provider implementation to another using some sort of JMS bridging concept, but those are generally configurable. It is possible to create BytesMessage instances where you have complete control over how those bytes are encoded on the producer side and decoded on the consumer side, but this is an application-level decision and is not a configuration consideration.
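A minimal sketch of that application-level decision, assuming both sides share a class like this (names are mine). The byte arrays stand in for what you'd pass to BytesMessage.writeBytes on the producer and get back from readBytes on the consumer:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MessageCodec {
    // Both producer and consumer agree on this one constant; the charset
    // is part of the message contract, not an environment setting.
    static final Charset WIRE_CHARSET = StandardCharsets.UTF_8;

    public static byte[] toWire(String payload) {
        return payload.getBytes(WIRE_CHARSET);
    }

    public static String fromWire(byte[] wire) {
        return new String(wire, WIRE_CHARSET);
    }
}
```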
How do we repair the damage?
There is no easy canned solution to encoding problems. It can be very difficult to figure out what is causing your issues after the damage has been done. Without knowing exactly what is causing the issue, it is next to impossible to repair the damage upon decoding as mentioned in David Zentgraf's article (see section "My document doesn't make sense in any encoding!"). The approach for finding the cause depends on where the issue manifests itself.
The simplest case is opening a file in a text editor. Many editors offer the ability to read and write using a variety of character encodings. For instance Notepad++, SublimeText, and Vim, just to name a few, all support multiple encodings. Set a default that makes sense for you, though this is generally going to be UTF-8. If possible, when you open a file be aware of the encoding used. Failing that, if it doesn't look right, it's easy enough to click through the usual suspects to find the right encoding. If it doesn't look right in any encoding, the file has likely been irreparably broken by some other process.
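You can automate "clicking through the usual suspects" with a strict decoder that reports errors instead of silently substituting replacement characters. A Java sketch (names are mine); note that ISO-8859-1 assigns a character to every byte value, so it never fails, meaning a clean decode is only a hint, not proof:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class CharsetGuesser {
    // Report which of the "usual suspect" charsets decode the bytes cleanly.
    public static List<String> cleanDecodes(byte[] data) {
        List<String> ok = new ArrayList<>();
        for (Charset cs : List.of(StandardCharsets.US_ASCII,
                                  StandardCharsets.UTF_8,
                                  StandardCharsets.ISO_8859_1)) {
            CharsetDecoder dec = cs.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
            try {
                dec.decode(ByteBuffer.wrap(data));
                ok.add(cs.name());
            } catch (CharacterCodingException e) {
                // This charset cannot cleanly decode the bytes; skip it.
            }
        }
        return ok;
    }
}
```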
An IDE is really just a fancy text editor, so they also allow for default encodings to be set when saving files. Most compilers have similar settings that allow source file encoding properties to be set. If you build a test file in an IDE and read it in during a test (e.g. with JUnit), you'll want to decode that file using the correct character set. If it isn't the platform default, make sure you explicitly set the value.
If you see gobbledygook on a web page, things start to get a little more difficult. If it's static html, check the encoding of the source html and the meta charset tag on the page. Easy, right? It gets a little harder if it's a single-page web application that calls some number of services, which in turn call a set of 3rd party services and so on. Or maybe you're in a microservice environment where you have service composers that call dozens of different focused services. At that point, my advice is to start from the endpoint and trace the data backwards. Use Postman, SoapUI, or a host of other tools to make the service calls directly. Once you've located the culprit that mangles the text, you can start checking configurations and file encodings to figure out what happened.
3rd Party Service
What if, as in the previous example, you find out that the culprit was a 3rd party vendor service? Your best bet is to start by reaching out and asking about their configuration: what encoding do you use for your responses? In a perfect world, they'll say "Oh, that's on page 12 of our documentation. We always use ISO-8859-1", or maybe "Oh, that's configurable… just send us this header". Then you configure your client to match and, voilà, problem solved. Just as commonly, you might get a response that they don't know, that they think they don't have control over it, or that they'll "get back to you". If the answer is at all nebulous, you can try some of the usual suspects and see if the responses are decoded correctly. Ultimately, you may be at the mercy of your vendor.
Creativity For The Win!
I want to share with you a fun experience related to character encodings I recently encountered at a client that wasn't caused by configurations. In this scenario, we noticed that users with non-ASCII characters in their names were not being displayed properly on the website when they logged in. Our first guess was a configuration error, so we started unwinding the application as I mentioned in the previous section. As we worked our way back we found that this data was coming from a cookie. The application called a service that called another, much older service that fetched user profile data and used a Set-Cookie header to add this cookie.
Naturally, we assumed the problem was a configuration on this older service. We started looking at ways to configure the service correctly. However, as we dug deeper, we found the following exception in the logs:
java.lang.IllegalArgumentException: Control character in cookie value or attribute.
Control characters? What is this nonsense? To figure out what was going on, my colleague started reading ancient specs on cookies. I use the term "spec" very loosely here. The original version of the cookie spec was written back in 1997 and left out a lot of important details. The result is that every browser and server was left to its own devices when implementing cookie handling.
In the most recent attempt to clean up this old spec, in 2011 (see RFC 6265), one important historical footnote was carried forward: cookie values may contain only US-ASCII characters. Actually, it is a subset of the US-ASCII charset, excluding control characters, whitespace, double quote, comma, semicolon, and backslash.
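Translated into code, the RFC 6265 cookie-octet rule looks roughly like this (a sketch; the class name is mine):

```java
public class CookieValueCheck {
    // RFC 6265 cookie-octet: %x21 / %x23-2B / %x2D-3A / %x3C-5B / %x5D-7E,
    // i.e. visible US-ASCII minus DQUOTE, comma, semicolon, and backslash.
    public static boolean isValidCookieValue(String value) {
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            boolean ok = c == 0x21
                    || (c >= 0x23 && c <= 0x2B)
                    || (c >= 0x2D && c <= 0x3A)
                    || (c >= 0x3C && c <= 0x5B)
                    || (c >= 0x5D && c <= 0x7E);
            if (!ok) return false;
        }
        return true;
    }
}
```

A name like "Günther" fails this check, which is exactly the situation we had stumbled into.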
Then we started to question, "Why did it break now when it's been working for years?" We looked into the browsers and found that they generally support more than just the specified visible US-ASCII characters, so that didn't seem to be the problem. Then it occurred to us that this old service application had recently been ported from an old version of IBM WebSphere to a recent version of JBoss. As it turns out, most application servers do the same thing the browsers do; they support non-ASCII characters in the Set-Cookie headers sent to the browser. However, JBoss is famous (notorious?) for following specifications very closely and throwing exceptions when invalid values are detected.
So how do we avoid all of this? David Zentgraf said it best:
"It's really simple: Know what encoding a certain piece of text [is in]".
I often hear people say that everyone should just use UTF-8 since it is a superior encoding format. As English speakers, we naturally gravitate toward UTF-8 since, for ASCII-heavy text, it generally encodes using fewer bytes than any competitor that can represent the entire Unicode character set. However, this may not be the correct stance in an increasingly global world. Consider the fact that UTF-8's variable-length encoding comes at a cost in complexity. Most Chinese characters are 3 bytes in UTF-8, due to the control bits the encoding requires, while they are only 2 bytes in UTF-16.
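You can verify those byte counts directly (a sketch; I use UTF-16BE to avoid counting the byte order mark that the generic "UTF-16" charset prepends):

```java
import java.nio.charset.StandardCharsets;

public class CjkByteCounts {
    // U+4E2D ("middle", as in 中文) takes 3 bytes in UTF-8 but only 2 in UTF-16.
    public static int utf8Len(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static int utf16Len(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }
}
```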
My intent is not to preach the benefits of any particular format. Rather, I believe we as developers need to understand how these encodings work and be explicit in our dealings with them. Future developers will thank you for not sending them down this particular rabbit hole when things don't work.