Unicode vs UTF-8 confusion in Python / Django?


I stumbled over this passage in the Django tutorial:

Django models have a default str() method that calls unicode() and converts the result to a UTF-8 bytestring. This means that unicode(p) will return a Unicode string, and str(p) will return a normal string, with characters encoded as UTF-8.

Now, I'm confused because afaik Unicode is not any particular representation, so what is a "Unicode string" in Python? Does that mean UCS-2? Googling turned up this "Python Unicode Tutorial" which boldly states

Unicode is a two-byte encoding which covers all of the world's common writing systems.

which is plain wrong, or is it? I have been confused many times by character set and encoding issues, but here I'm quite sure that the documentation I'm reading is confused. Does anybody know what's going on in Python when it gives me a "Unicode string"?


what is a "Unicode string" in Python? Does that mean UCS-2?

Unicode strings in Python are stored internally either as UCS-2 (fixed-length 16-bit representation, almost the same as UTF-16) or UCS-4/UTF-32 (fixed-length 32-bit representation). It's a compile-time option; on Windows it's always UTF-16 whilst many Linux distributions set UTF-32 (‘wide mode’) for their versions of Python.

You are generally not supposed to care: you will see Unicode code-points as single elements in your strings and you won't know whether they're stored as two or four bytes. If you're in a UTF-16 build and you need to handle characters outside the Basic Multilingual Plane you'll be Doing It Wrong, but that's still very rare, and users who really need the extra characters should be compiling wide builds.

plain wrong, or is it?

Yes, it's quite wrong. To be fair I think that tutorial is rather old; it probably pre-dates wide Unicode strings, if not Unicode 3.1 (the version that introduced characters outside the Basic Multilingual Plane).

There is an additional source of confusion stemming from Windows's habit of using the term “Unicode” to mean, specifically, the UTF-16LE encoding that NT uses internally. People from Microsoftland may often copy this somewhat misleading habit.

By : bobince

Meanwhile, I did a refined research to verify what the internal representation in Python is, and also what its limits are. "The Truth About Unicode In Python" is a very good article which cites directly from the Python developers. Apparently, internal representation is either UCS-2 or UCS-4 depending on a compile-time switch. So Jon, it's not UTF-16, but your answer put me on the right track anyway, thanks.

so what is a "Unicode string" in Python?

Python 'knows' that your string is Unicode. Hence if you do regex on it, it will know which is character and which is not etc, which is really helpful. If you did a strlen it will also give the correct result. As an example if you did string count on Hello, you will get 5 (even if it's Unicode). But if you did a string count of a foreign word and that string was not a Unicode string than you will have much larger result. Pythong uses the information form the Unicode Character Database to identify each character in the Unicode String. Hope that helps.

This video can help you solving your question :)
By: admin