Understanding ASCII, Unicode, and UTF

If you have been programming for some time, you might have heard of ASCII, UTF, and Unicode. They’re so ubiquitous that we rarely stop to think about how they work or how they relate to each other. And even less so why they came into existence in the first place.

For a long time, this was the case for me. I did not spend much time looking into the details since there was never a strong enough reason to do so. It never seemed a particularly exciting area to spend my time in. Oh, sure, I knew some things, I’ve read articles here and there about them. But I never worked on anything that would require me to go beyond a shallow understanding of the concepts.

This changed when I ran into Unicode confusables in the real world. I had to spend a considerable amount of time thinking about how to detect and deal with homoglyphs. As Unicode is what enables homoglyphs, understanding it beyond superficial knowledge can be useful. And if you start poking around Unicode, you’ll unearth UTF soon enough. From there, it’s just a short leap to ASCII. Having made this leap, you may faceplant into the following question: what’s the difference between these concepts?

This is the topic I wish to explore a bit more here.

Why Would Anyone Care?

Why indeed? You could avoid the topic for a long time without any detriment to the code you are writing. On the other hand, intimately understanding encodings, especially UTF-8, comes with certain advantages:

  • You can impress other people with your knowledge. Enjoy seeing rubes scrambling for ChatGPT in a vain attempt to match the excellence of your raw brain power. As if!
  • You understand what others are talking about. Some people might not know the details, but others will. You don’t want to look like a bumpkin in front of other people now, do you? Besides, you’ll run into references to ASCII, UTF, and Unicode in various RFCs, books, articles, and videos. Understanding the concepts will make dealing with those much more pleasant.
  • UTF-8 is the dominant encoding for the World Wide Web [1]. Knowing how it works will help you understand what’s happening under the hood of this little thing that’s called “the Internet.”
  • It’s fun! Knowing a bit of background on how things work might spark some new ideas or novel solutions to gnarly problems. Even without that, knowledge for its own sake should not be sneered at either!

With that out of the way, let’s move on to the meat of things.

Character Encodings

Before diving into details, we need to understand what a character encoding is. Simply put, it’s the assignment of numbers to graphical symbols, such as letters, punctuation marks, and digits [2]. It’s a bit like a very simple cipher where characters are replaced by numbers. Well, except that everyone’s aware of which character each number represents.

To give an example, we could have an encoding table like this:

Letter  Number
a       1
b       2
c       3

Using this table (extended so that every letter maps to its position in the alphabet), we could encode the word cadet as:

c a d e t
3 1 4 5 20

Following this method, we can assign a different number (a code point) to every character. The receiver of the encoded text would then refer to the encoding table and decode the text, as the sketch below illustrates.
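To make the idea concrete, here is a minimal Python sketch of such a scheme. The table (a hypothetical one, mapping each lowercase letter to its position in the alphabet) and the function names are made up purely for illustration:

    import string

    # A toy encoding table: a=1, b=2, ..., z=26.
    ENCODE_TABLE = {letter: index + 1 for index, letter in enumerate(string.ascii_lowercase)}
    DECODE_TABLE = {number: letter for letter, number in ENCODE_TABLE.items()}

    def encode(text):
        return [ENCODE_TABLE[ch] for ch in text]

    def decode(numbers):
        return "".join(DECODE_TABLE[n] for n in numbers)

    print(encode("cadet"))           # [3, 1, 4, 5, 20]
    print(decode([3, 1, 4, 5, 20]))  # cadet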

If our language is compact, a relatively small number of code points may be sufficient, whereas more complicated writing systems need many more of them. For example, English does not need nearly as many code points as Chinese does.

An important decision lurks here. While humans are fine with using numbers such as 1, 8, 55 and so forth to encode characters, computers work on bits, so everything must be encoded in zeroes and ones. What is worse, to a computer a text is just a stream of bits: without some knowledge of its structure, it’s impossible to tell where one letter ends and another begins. We therefore need some “rules” for how the encoding will work. For example, how many bits will we use to encode the alphabet? Will we need 8 bits (1 byte)? Would 16 bits (2 bytes) be more reasonable? What if we used variable lengths? If so, how will we signal the end of a character?

These are all valid questions that we need to answer. Otherwise, the receiver of a text encoded with our character encoding would not know how to decode it, and all our efforts would be in vain. This is where encoding standards come in.

With that, we are ready to explore ASCII.

ASCII

ASCII stands for American Standard Code for Information Interchange. It is one of the simpler character encoding standards you might run into.

It uses 7 bits to encode characters. This means there are 2^7 = 128 code points that can be used to encode various symbols. Out of those 128 available code points, ASCII defines only 95 printable characters (such as letters, digits, and punctuation marks); the rest are control characters.

The good news is that this is (barely) sufficient to cover the modern English alphabet. The bad news is that there are over 7164 languages in use [3]. So 128 code points turn out to be a rather modest set of options to have.

Nowadays, extended ASCII uses all 8 bits, with the extra bit used to cover characters beyond English [4]. This gives us twice the number of code points: 256. The most common such encoding came to be Windows-1252, an extension of ISO-8859-1. Twice the number of code points means twice as useful. But barely useful times two is still not particularly useful.

A short note on converting these code points to binary is warranted. The conversion is trivial: you match a character to a code point and convert that code point to binary; to decode the binary stream, you do the reverse. That’s about it. For example, a has a value of 97 in ASCII. 97 in binary is 1100001. Since extended ASCII uses 8 bits, we pad it with a 0 on the left and get 01100001. This is the character a encoded in ASCII. Simple, elegant, and completely insufficient for modern applications.
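If you want to see this round trip in action, a couple of lines of Python reproduce it (a quick sketch, nothing more):

    # Encode the letter "a" as an 8-bit ASCII value and back.
    code_point = ord("a")             # 97
    print(format(code_point, "08b"))  # 01100001

    # Decoding is just the reverse.
    print(chr(int("01100001", 2)))    # a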

Unicode

We have established that ASCII is not enough to represent all the different languages. Naturally, countries that found ASCII insufficient developed their own encoding standards. While this may work within a country, it quickly breaks down when the Internet (or any software that wishes to take internationalization seriously) is taken into account. A code point defined by one standard may translate to a completely different character in another, assuming that code point is defined at all. The end result would be a huge mess. And as the world became more connected, it did end up being a mess.
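Python still ships with many of these legacy code pages, so the mess is easy to reproduce. In this small sketch the very same byte, 0xE9, decodes to different characters depending on which single-byte encoding you assume:

    # The same byte means different characters under different national code pages.
    raw = "é".encode("cp1252")   # b'\xe9' in the Western European code page
    print(raw.decode("cp1252"))  # é
    print(raw.decode("cp1251"))  # й  (Cyrillic code page)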

This is the problem the Unicode standard aims to solve. It assigns each character a unique number (i.e. a code point). The characters are not constrained to the English language. In fact, they cover all languages, mathematical symbols, emojis, and more. Therefore, there is no need for custom encoding systems anymore. An encoding utopia, of sorts.

When reading about Unicode you’re likely to run into the concept of planes. Planes are groups of related characters. For example, the first plane, also known as plane 0 or the Basic Multilingual Plane (BMP), contains characters for almost all modern languages. The next plane, plane 1 or the Supplementary Multilingual Plane, has characters of ancient scripts, such as Egyptian hieroglyphs [5]. Each plane is a contiguous group of 65 536 code points. There are 17 planes, numbered 0 to 16. Together, these 17 planes can accommodate 1 114 112 code points:

  • 2048 surrogates. Code points used (in pairs, by UTF-16) to encode characters that lie outside the Basic Multilingual Plane.
  • 66 non-characters. Code points reserved for internal use.
  • 137 468 code points reserved for private use. Code points that can be used to represent characters not defined in Unicode.
  • 974 530 code points dedicated for public use.

At the time of writing, Unicode 16.0 defines 155 063 characters [6], which is sufficient to cover all languages, mathematical symbols, music notation, and so forth. Of course, more characters will be added over time, but for the foreseeable future Unicode has enough space to accommodate them. Overall, Unicode is a huge improvement over ASCII.
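A small Python sketch shows where individual characters live: ord gives the code point, the plane is simply the code point divided by 65 536, and unicodedata.name pulls the official character name from the Unicode Character Database bundled with Python:

    import unicodedata

    # One letter and one math symbol (both in plane 0), plus an Egyptian
    # hieroglyph from plane 1, the Supplementary Multilingual Plane.
    for ch in ["a", "≌", chr(0x13000)]:
        cp = ord(ch)
        print(f"U+{cp:04X}  plane {cp // 0x10000}  {unicodedata.name(ch)}")
    # U+0061  plane 0  LATIN SMALL LETTER A
    # U+224C  plane 0  ALL EQUAL TO
    # U+13000  plane 1  EGYPTIAN HIEROGLYPH A001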

One pretty cool feature that further separates Unicode from ASCII is the ability to combine characters. For example, a can be combined with an acute accent to make the character á. This makes Unicode very flexible. It is also one of the features that enable abuses such as Unicode confusables.
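Python’s unicodedata module makes this visible. The two strings below render the same á but are built from different code points; normalization (here NFC, the composed form) is how you compare them reliably. A short sketch:

    import unicodedata

    composed = "\u00e1"    # á as a single code point
    combining = "a\u0301"  # a followed by COMBINING ACUTE ACCENT

    print(composed == combining)                                # False
    print(unicodedata.normalize("NFC", combining) == composed)  # True
    print([unicodedata.name(c) for c in combining])
    # ['LATIN SMALL LETTER A', 'COMBINING ACUTE ACCENT']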

An important thing to know about Unicode is that a code point does not directly translate to bits, unlike in ASCII. In fact, Unicode does not state how code points should be encoded at all. Thus, while ≌ is represented by the hex value 224C in the Unicode standard (written as U+224C), it is not necessarily transmitted as 00100010 01001100 over the wire. This is where UTF-8, UTF-16, and other character encodings come into play.
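To make the distinction concrete: the code point is fixed by Unicode, but the bytes on the wire depend entirely on which encoding you choose. A quick Python sketch:

    text = "\u224c"  # ≌, ALL EQUAL TO

    print(hex(ord(text)))            # 0x224c -- the Unicode code point
    print(text.encode("utf-8"))      # b'\xe2\x89\x8c'   (3 bytes)
    print(text.encode("utf-16-be"))  # b'"L'             (2 bytes: 0x22 0x4c)
    print(text.encode("utf-32-be"))  # b'\x00\x00"L'     (4 bytes)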

Finally, there are many Unicode Technical Reports. These relate to the implementation and development of the Unicode Standard and cover things like Unicode normalization, security considerations, and many more technical details. Therefore, while on the surface Unicode looks quite simple (just a mapping of characters to code points, what could be easier?!), in reality it is a hugely complicated endeavor.

UTF-8

Finally, we come to the Unicode Transformation Format (UTF). It is a way to convert bits to characters and characters to bits. There are several UTF encodings, but the most important one is UTF-8. Nowadays, UTF-8 is implemented in all modern operating systems. On top of that, 98.3% of web pages use it, making it the dominant encoding [1]. Other UTF encodings, such as UTF-16 and UTF-32, have their own niches, but you will run into them far less often.

UTF-8 is rather clever. It covers all Unicode code points using one to four one-byte code units (the 8 in UTF-8 refers to the size of a code unit in bits). It doesn’t use 4 bytes for every character. Instead, UTF-8 uses fewer bytes for code points that lie at the lower end of the Unicode table, which are the most frequently used. For instance, the characters covering the English language use code points 0 to 127, so one byte is enough to encode them. Over time, the savings from this optimization become formidable.
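You can watch the width vary by encoding characters from different ranges and counting the bytes (a small sketch):

    # One character from each UTF-8 "bracket".
    for ch in ["a", "é", "≌", "😀"]:
        encoded = ch.encode("utf-8")
        print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")
    # U+0061 -> 1 byte(s): 61
    # U+00E9 -> 2 byte(s): c3 a9
    # U+224C -> 3 byte(s): e2 89 8c
    # U+1F600 -> 4 byte(s): f0 9f 98 80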

Another neat thing is that the first 128 Unicode code points match ASCII one-to-one, and UTF-8 encodes each of them as the same single byte ASCII uses. As a result, UTF-8 ends up being backward compatible with ASCII: if you send ASCII-encoded text to a machine that expects UTF-8, it will be able to decode the text. Neat!
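In practice this means ASCII bytes decode cleanly as UTF-8, and pure-ASCII text produces the exact same bytes under either encoding. A two-line sketch:

    ascii_bytes = "Hello, world!".encode("ascii")
    print(ascii_bytes.decode("utf-8"))                     # Hello, world!
    print("Hello, world!".encode("utf-8") == ascii_bytes)  # True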

Alas, all is not sunshine and rainbows. When you receive a text, you need to know its encoding. Otherwise you may see a jumble of characters instead of the intended text. Consider a sequence of bits like 11100010 10001001 10001100. How do you know where one character ends and the next begins? For that matter, how do you know how many characters there are? If it’s ASCII, it might be 3 characters. If it’s UTF-8, it might be 1 to 3 characters. Perhaps it’s neither ASCII nor UTF-8? Madness!

To solve this problem, the communicating parties need to agree on which encoding they will use. For example, if you go to Wikipedia’s UTF-8 page and inspect the network requests, you’ll find a GET request to https://en.wikipedia.org/wiki/UTF-8 that returns the HTML of the page. Among its response headers you’ll see content-type: text/html; charset=UTF-8, which specifies the encoding used for the content. Additionally, the HTML page itself has a <meta charset="UTF-8"> tag specifying the charset being used. Even without a content-type header, you could start reading the HTML page assuming it’s in ASCII and, once you came across the charset declaration, switch to decoding the content according to UTF-8 rules.
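Here is a rough sketch of why that declaration matters. The header string below mimics the Wikipedia one; decode the bytes with the declared charset and you get the text back, guess a different encoding and you get mojibake:

    # The declared charset tells the receiver how to turn bytes back into text.
    header = "content-type: text/html; charset=UTF-8"
    charset = header.split("charset=")[1]  # "UTF-8" (a naive parse, for illustration)

    raw = "Naïve café".encode(charset)
    print(raw.decode(charset))    # Naïve café
    print(raw.decode("latin-1"))  # NaÃ¯ve cafÃ© -- wrong guess, garbled text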

Speaking of the rules, encoding content in UTF-8 is quite straightforward. You can find the full details in RFC 3629, which defines UTF-8, but in short, there are 4 “brackets” of code points:

  1. Code points from 0 to 127 (hex 0x7F). Encoded with 1 byte.
  2. Code points from 128 to 2047 (hex 0x07FF). Encoded with 2 bytes.
  3. Code points from 2048 to 65535 (hex 0xFFFF). Encoded with 3 bytes.
  4. Code points above 65535 (hex 0xFFFF). Encoded with 4 bytes.

Each bracket defines which bits of each byte will be set or clear:

1st byte     2nd byte     3rd byte     4th byte     Code points
0xxx xxxx                                           0 – 127
110x xxxx    10xx xxxx                              128 – 2047
1110 xxxx    10xx xxxx    10xx xxxx                 2048 – 65535
1111 0xxx    10xx xxxx    10xx xxxx    10xx xxxx    65536+

For example, ≌ is 224C in hex, which is 8780 in decimal. This falls between 2047 and 65535, meaning it will take 3 bytes to encode. 8780 in raw bits is 00100010 01001100. The encoding proceeds as follows:

  • The first byte has a predefined 4-bit sequence and 4 free bits: 1110 xxxx. We take the first four bits of 00100010 01001100, namely 0010, and put them into the free bits. We get 1110 0010.
  • The second byte has a predefined 2-bit sequence and 6 free bits: 10xx xxxx. We take the next six bits, 001001, from the remaining 0010 01001100 (note we’ve dropped the 0010 already “used”) and fill the free bits. We get 1000 1001.
  • The third byte also has a predefined 2-bit sequence and 6 free bits: 10xx xxxx. We take the remaining six bits, 001100, and fill the free bits. We get 1000 1100.

Putting the bytes together, we get 11100010 10001001 10001100. Head over to one of the many online UTF-8 converters to verify that this encoded sequence actually represents the ≌ character!
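Or let Python do the check for you (a quick sketch):

    encoded = "≌".encode("utf-8")
    print(" ".join(f"{byte:08b}" for byte in encoded))
    # 11100010 10001001 10001100
    print(encoded.decode("utf-8"))  # ≌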

Observe that in every case except the first, the number of leading 1 bits in the first byte indicates how many bytes are used for the encoding, and the run of leading 1s always ends in a 0. Thus, if you see a byte like 1111 0xxx while decoding a bit stream, you know that you need to read 4 bytes to decode the character. It also means that you can easily detect corrupted characters in a byte stream: if a character does not start with 0 / 110 / 1110 / 1111 0, its first byte is missing, so you can skip ahead to the next byte that begins with one of those patterns. This makes UTF-8 self-synchronising.
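Below is a rough sketch of that resynchronization idea: classify each byte by its leading bits and skip continuation bytes (those starting with 10) until a valid lead byte turns up. Real decoders perform more validation than this, and the helper name is made up for illustration.

    def lead_byte_length(byte):
        """Return the sequence length announced by a lead byte, or None for a continuation byte."""
        if byte >> 7 == 0b0:
            return 1   # 0xxx xxxx
        if byte >> 5 == 0b110:
            return 2   # 110x xxxx
        if byte >> 4 == 0b1110:
            return 3   # 1110 xxxx
        if byte >> 3 == 0b11110:
            return 4   # 1111 0xxx
        return None    # 10xx xxxx, a continuation byte

    # A stream that starts mid-character: a stray continuation byte, then "≌a".
    stream = b"\x8c" + "≌a".encode("utf-8")
    i = 0
    while i < len(stream) and lead_byte_length(stream[i]) is None:
        i += 1                             # resynchronize: skip to the next lead byte
    print(stream[i:].decode("utf-8"))      # ≌a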

On the other hand, variable-width encoding also means that string manipulation operations aren’t trivial. For example, you can’t simply count bytes and derive the character count from that (as you could with UTF-32). Thus, while UTF-8 can be more efficient in terms of space, you pay a price in speed and implementation complexity.
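A tiny illustration of the difference between counting bytes and counting characters (Python’s len counts code points on str and bytes on bytes):

    word = "héllo"
    print(len(word))                  # 5 code points
    print(len(word.encode("utf-8")))  # 6 bytes

    emoji = "😀"
    print(len(emoji))                  # 1 code point
    print(len(emoji.encode("utf-8")))  # 4 bytes
    # With a fixed-width encoding, counting becomes trivial again:
    print(len(emoji.encode("utf-32-be")) // 4)  # 1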

Conclusion

We have seen that while ASCII and Unicode are, at their core, character sets, that is, mappings from characters to code points, UTF-8 deals with the implementation side of Unicode: it details how to encode characters to bits and decode them back. We’ve learned that Unicode addresses the limitations of simple systems like ASCII. But we have also seen that it is much more complicated, and that its intricacies can be misused or deliberately exploited.

This blog post is merely a first step in the journey toward a deep understanding of encoding standards and tools. There are multiple complicated issues that you could dive into to become the grandmaster of Unicode. However, for the usual day-to-day tasks this post should serve as a decent stepping stone.

I have to admit, when I started writing this blog post, I didn’t know most of the information detailed in here. I hope reading it will be similarly illuminating for you, too!

Sources

  1. https://w3techs.com/technologies/cross/character_encoding/ranking
  2. https://en.wikipedia.org/wiki/Character_encoding
  3. https://www.ethnologue.com/insights/how-many-languages/
  4. https://stackoverflow.com/questions/643694/what-is-the-difference-between-utf-8-and-unicode
  5. https://en.wikipedia.org/wiki/Plane_(Unicode)
  6. https://en.wikipedia.org/wiki/Unicode

Further Reading