But… what is this Unicode thing?

TL;DR: Unicode gives every character a number that can be encoded as bits, which is what a machine can actually understand.

Have you ever heard the term “Unicode” and just skipped over what it really means? That happened to me many times, until I decided to look into it.

Unicode lets us represent characters as code points, which machines can translate into code units made of bits, the only language machines actually understand.

Before Unicode, turning a text file into something a user could read was a real challenge. Machines cannot understand the text symbols we use, and vice versa. To be processed, saved, and sent, characters need to be converted into bits. The main problem Unicode solves is how characters should be converted into bits.

Without a shared set of rules, each program could do that conversion however it wanted. Any other program that wanted to use that text would then have to understand and replicate the original program's rules, and that, of course, does not scale very well.

Unicode offers a standard way to represent the characters used by the world's written languages. At the time of writing, the latest Unicode version (13.0) defines 143,859 characters out of the 1,111,998 it can support. The defined characters cover scripts, symbols, emojis, and so on.

Unicode is really a simple system: each character is assigned a code point in the range U+0000 to U+10FFFF, where the value is a hexadecimal number.

An example of a character and its Unicode representation: a → \u0061 (code point U+0061).

We can verify this easily by printing the Unicode value of a in a programming language such as JavaScript:

ButWhatIsThisUnicodeThing/a-unicode-representation.png
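
In case the screenshot does not load, here is a minimal sketch of the same check. The variable names are my own and are not taken from the original image.

```javascript
// Read the code point behind "a" and format it the way Unicode charts do.
const letterA = "a";
const letterACodePoint = letterA.codePointAt(0); // 97 in decimal

// Convert to uppercase hexadecimal, padded to at least four digits.
const letterAHex = letterACodePoint.toString(16).toUpperCase().padStart(4, "0");

console.log(`U+${letterAHex}`);     // U+0061
console.log("\u0061" === letterA);  // true
```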

This shows that what we see on our screen, a, is really a graphic representation of a Unicode code point that a machine can understand.

Unicode Planes

The Unicode codespace holds 1,114,112 code points (1,111,998 of them can be assigned to characters) and is divided into 17 planes. Each plane holds 65,536 (2^16) code points.

The first of the seventeen planes is the most commonly used. It is called the Basic Multilingual Plane, also known as the BMP. This plane holds the most frequently used characters, such as those of the Greek and Latin scripts.

The characters that live in the BMP can be stored as 16-bit numbers. The planes after the BMP are called astral planes. The characters that live in those astral planes are the less commonly used and the special ones, like the majority of emojis.
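
As a rough illustration of how planes work, the plane of a character can be worked out from its code point. This is only a sketch: `planeOf` and `isInBmp` are hypothetical helper names I made up for this post.

```javascript
// Each plane spans 0x10000 (65,536) code points, so the plane number is
// the code point divided by 0x10000, rounded down.
const PLANE_SIZE = 0x10000;

function planeOf(character) {
  const codePoint = character.codePointAt(0);
  return Math.floor(codePoint / PLANE_SIZE);
}

function isInBmp(character) {
  return planeOf(character) === 0;
}

console.log(planeOf("a"));          // 0 (BMP)
console.log(planeOf("\u{1F600}"));  // 1 (an astral plane)
console.log(isInBmp("\u03A9"));     // true (Greek capital omega)
console.log(17 * PLANE_SIZE);       // 1114112 code points in total
```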

The particularity of the characters that live in an astral plane is that they cannot be represented with a single 16-bit number. For this reason, encoding formats like UTF-16 need two code units of 16 bits each. This is known as a surrogate pair: a combination of two code units, the high surrogate and the low surrogate.
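
To make the surrogate pair idea concrete, here is a small sketch of the UTF-16 arithmetic. The function name `toSurrogatePair` is my own, and U+1F600 (😀) is just an example code point I picked.

```javascript
// Split an astral code point into its UTF-16 high and low surrogates.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("Only astral code points need a surrogate pair");
  }
  const offset = codePoint - 0x10000;     // leaves a 20-bit value
  const high = 0xD800 + (offset >> 10);   // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high, low];
}

const [highUnit, lowUnit] = toSurrogatePair(0x1F600);
console.log(highUnit.toString(16), lowUnit.toString(16));            // d83d de00
console.log(String.fromCharCode(highUnit, lowUnit) === "\u{1F600}"); // true
```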

To verify this, we can print the length of an emoji in a programming language, for example JavaScript:

ButWhatIsThisUnicodeThing/happy-emoji.png
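
Here is a sketch of that experiment in text form. I am assuming 😀 (U+1F600) as a stand-in, since the exact emoji in the screenshot may differ.

```javascript
// 😀 written as a code point escape so the source stays unambiguous.
const grinningFace = "\u{1F600}";

// .length counts UTF-16 code units, not characters.
console.log(grinningFace.length);                      // 2

// The two code units are the high and low surrogates.
console.log(grinningFace.charCodeAt(0).toString(16));  // d83d
console.log(grinningFace.charCodeAt(1).toString(16));  // de00

// Spreading a string iterates by code point instead.
console.log([...grinningFace].length);                 // 1
```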

As we can see, we get a length of two, because that kind of emoji lives in the astral planes and needs a surrogate pair (two 16-bit code units) to be represented. It is important to mention that some characters are more complex and need more than one surrogate pair. For example, let's see the length of the following emoji:

ButWhatIsThisUnicodeThing/flag-emoji.png
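
A flag emoji is built from two regional indicator symbols, and each of them lives in an astral plane. The sketch below assumes the flag of Japan as an example (I do not know which flag the screenshot shows), and `flagFromCountryCode` is a hypothetical helper of mine.

```javascript
// Regional indicator symbols start at U+1F1E6, which stands for the letter A.
const REGIONAL_INDICATOR_A = 0x1F1E6;

// Build a flag from a two-letter country code such as "JP".
function flagFromCountryCode(code) {
  return [...code.toUpperCase()]
    .map((letter) =>
      String.fromCodePoint(REGIONAL_INDICATOR_A + letter.charCodeAt(0) - 65)
    )
    .join("");
}

const japanFlag = flagFromCountryCode("JP");
console.log(japanFlag.length);       // 4 code units (two surrogate pairs)
console.log([...japanFlag].length);  // 2 code points
```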

And some are so complex to represent that they are built as a combination of other characters, like this emoji 👩🏻‍💻:

ButWhatIsThisUnicodeThing/girl-computer-emoji.png
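
To see what that emoji is made of, we can list its code points. This is a sketch with my own variable names. The sequence is a woman, a skin tone modifier, a zero-width joiner, and a laptop.

```javascript
// 👩🏻‍💻 spelled out as escapes: woman + light skin tone + ZWJ + laptop.
const womanTechnologist = "\u{1F469}\u{1F3FB}\u200D\u{1F4BB}";

// Array.from walks the string by code point, not by code unit.
const technologistParts = Array.from(
  womanTechnologist,
  (part) => "U+" + part.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
);

console.log(technologistParts.join(" "));  // U+1F469 U+1F3FB U+200D U+1F4BB
console.log(womanTechnologist.length);     // 7 code units
```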

Look at how, whenever that emoji gets typed, it gets a special representation in the string!

Now that you know a little bit about Unicode, I think we should be more careful in our code whenever we validate the length of a string… don't you think? 😅
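
One possible way to do that, assuming a runtime that ships `Intl.Segmenter` (recent browsers and Node.js versions), is to count what the user perceives as characters instead of code units. `countGraphemes` is a hypothetical helper name, and this is a sketch rather than a complete validation strategy.

```javascript
// Count user-perceived characters (grapheme clusters) instead of code units.
function countGraphemes(text) {
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  return [...segmenter.segment(text)].length;
}

const technologistEmoji = "\u{1F469}\u{1F3FB}\u200D\u{1F4BB}"; // 👩🏻‍💻

console.log(technologistEmoji.length);          // 7
console.log(countGraphemes(technologistEmoji)); // 1
console.log(countGraphemes("abc"));             // 3
```

With a helper like this, a rule such as "at most 10 characters" matches what the user sees on screen more closely than a check on `.length`.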

Happy Coding!