Last modified: 07/11/2010
I sometimes see Unicode and UTF-8 mentioned as if they were the same thing. They are related, but they are not the same.
This post briefly explains that relationship and gives you some references if you are interested in the subject (and every programmer should be!).
What is Unicode?
Unicode is a repertoire of characters, where every character gets a unique number called a code point.
For example, the inverted exclamation mark “¡” has the code point U+00A1 (hexadecimal 00A1).
Unicode itself says nothing about how this character is represented in your computer's memory or on your hard disk.
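To make the idea of a code point concrete, here is a small sketch in Python (my choice of language for the examples, not something this post depends on). It only uses the built-in functions ord and chr and the standard unicodedata module. The variable names are made up for illustration.

```python
import unicodedata

char = "¡"

# ord() gives the code point of a character as a plain integer.
code_point = ord(char)
print(f"U+{code_point:04X}")    # U+00A1

# The Unicode character database also stores an official name per code point.
print(unicodedata.name(char))   # INVERTED EXCLAMATION MARK

# chr() goes the other way: from code point to character.
print(chr(0x2109))              # ℉
```

Note that nothing here talks about bytes yet. A code point is just a number.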
That is where encoding comes in. If you encode your document in UTF-8, which is the default encoding for XML documents, this inverted exclamation mark is represented as 2 bytes with the hex values C2 A1. In your computer it is actually stored as the bits 11000010 10100001 (16 bits or 2 bytes), because a computer only works with ones and zeros.
You can see here that the Unicode code point (00A1) and the UTF-8 encoded value (C2 A1) for the same character are different.
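You can check these values yourself. The following is a minimal Python sketch that encodes the character with the standard str.encode method and prints the bytes as hex and as bits. The variable names are only for illustration.

```python
char = "¡"

# Encode the character (code point U+00A1) to UTF-8 bytes.
encoded = char.encode("utf-8")

# Show each byte as two hex digits and as eight bits.
hex_bytes = " ".join(f"{b:02X}" for b in encoded)
bits = " ".join(f"{b:08b}" for b in encoded)

print(len(encoded))   # 2
print(hex_bytes)      # C2 A1
print(bits)           # 11000010 10100001
```

Decoding the same bytes with encoded.decode("utf-8") gives back the original character.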
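The degree Fahrenheit sign ℉ (U+2109) is encoded in UTF-8 as E2 84 89, which is 3 bytes. So you see that UTF-8 is a variable-length encoding: it uses one to four bytes per character. That is also one reason why it is so popular. Unicode itself does not prescribe a number of bytes, but other encodings of it do: UTF-16 uses at least 2 bytes per character and UTF-32 always uses 4, while UTF-8 needs only one byte for ASCII characters and therefore often takes less space.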
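To see how the number of bytes varies, here is a rough Python sketch that counts the bytes a few characters need. For comparison it also uses UTF-16 and UTF-32, two other standard encodings of Unicode. The helper byte_lengths and the sample characters "A" and "😀" are my own additions for illustration.

```python
def byte_lengths(ch):
    """Return the number of bytes ch needs in UTF-8, UTF-16 and UTF-32."""
    # The "-be" variants are used so that no byte order mark is added.
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-be")),
        len(ch.encode("utf-32-be")),
    )

# Code points U+0041, U+00A1, U+2109 and U+1F600.
for ch in ["A", "¡", "℉", "😀"]:
    utf8, utf16, utf32 = byte_lengths(ch)
    print(f"U+{ord(ch):04X}  UTF-8: {utf8}  UTF-16: {utf16}  UTF-32: {utf32}")
```

The ASCII letter takes a single byte in UTF-8, which is why UTF-8 is compact for text that is mostly ASCII. For ℉ it is the other way around: 3 bytes in UTF-8 against 2 in UTF-16.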
Because there is more than one encoding, a program must know which encoding a document uses. Otherwise you can get the wrong characters on your screen.
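Here is a small Python sketch of what can go wrong. It assumes, purely as an example, that the reading program guesses Latin-1 (ISO 8859-1) for bytes that were written as UTF-8, and then the reverse case. The variable names are hypothetical.

```python
# The writer stores "¡" as UTF-8: the two bytes C2 A1.
data = "¡".encode("utf-8")

# A reader that knows the encoding gets the character back.
right = data.decode("utf-8")
print(right)   # ¡

# A reader that wrongly assumes Latin-1 sees two characters instead of one.
wrong = data.decode("latin-1")
print(wrong)   # Â¡

# The reverse mistake: Latin-1 bytes read as UTF-8 are not even valid.
latin1_data = "¡".encode("latin-1")   # the single byte A1
try:
    latin1_data.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
print(decode_failed)   # True
```

So a wrong guess either shows garbage characters or fails outright, depending on the bytes and on the encoding that was assumed.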
A more in-depth article on this subject is “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)” by Joel Spolsky.
I hope this explanation cleared things up.
More to learn
Here are some more references on this subject if you are interested.
- The Unicode Consortium
- Unicode Character Search
- Unicode Code Converter
- Characters and encodings