Character Set Issues & Unicode

Web Architecture (INFO 290-03)

Erik Wilde, UC Berkeley School of Information
2008-09-25

Creative Commons License

This work is licensed under a CC Attribution 3.0 Unported License

Abstract

Every character-based document relies on a model of which characters are available and how they are encoded. Unicode is the most widely used character set today and defines several encoding schemes, each of them a Unicode Transformation Format (UTF). Beyond character sets and encodings, two further issues matter when working with characters: transcoding, which converts text between different character encodings, and normalization, which reconciles different encodings of the same character.


Characters

Outline (Characters)

  1. Characters [3]
  2. Character Sets [10]
  3. Unicode Basics [8]
  4. Normalization and Transcoding [3]
  5. Conclusions [1]

Characters and Computers


Characters

Character. (1) The smallest component of written language that has semantic value; refers to the abstract meaning and/or shape […]

The Unicode Standard, Version 4.0, Addison-Wesley, 2003


Glyphs

[A Glyph is] a recognizable abstract graphic symbol which is independent of a specific design.

ISO/IEC 9541:1991, Information Technology – Font Information Interchange


Character Sets

History of Character Sets


ASCII 1963

ASCII 1965

ASCII 1967

EBCDIC

EBCDIC (1964)

Augmented EBCDIC


Beyond ASCII


ISO 8859


ISO 8859-1 (Latin-1) & ISO 8859-2 (Latin-2)

ISO 8859-1 (Latin-1)

Latin-1 (Western European)

ISO 8859-2 (Latin-2)

Latin-2 (Central European)


ISO 8859-4 (Latin-4) & ISO 8859-5 (Cyrillic)

ISO 8859-4 (Latin-4)

Latin-4 (North European)

ISO 8859-5 (Cyrillic)

Cyrillic


ISO 8859-7 (Greek) & ISO 8859-15 (Latin-9)

ISO 8859-7 (Greek)

Greek

ISO 8859-15 (Latin-9)

Latin-9


Unicode Basics

ISO 8859 Problems
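
One problem with the ISO 8859 family is that a byte value has no fixed meaning: the character it stands for depends on which part of the standard the reader assumes. The following is a minimal sketch of that ambiguity, not material from the original slides. The byte values 0xE6 and 0xA4 and the helper name `decode_with_parts` are chosen here for illustration only; the codec names are those of Python's standard library.

```python
# The ISO 8859 parts shown on the preceding slides, by Python codec name.
PARTS = ["iso8859_1", "iso8859_2", "iso8859_4",
         "iso8859_5", "iso8859_7", "iso8859_15"]


def decode_with_parts(data: bytes, parts: list[str]) -> dict[str, str]:
    """Decode the same bytes under each of the given ISO 8859 parts."""
    return {part: data.decode(part) for part in parts}


# Illustrative byte: 0xE6 is a different letter in several parts.
decoded = decode_with_parts(bytes([0xE6]), PARTS)
for part, text in decoded.items():
    print(f"{part:>10}: U+{ord(text):04X} {text}")

# Latin-9 replaced the currency sign of Latin-1 with the euro sign.
currency = decode_with_parts(bytes([0xA4]), ["iso8859_1", "iso8859_15"])
print(currency)
```

Without out-of-band information about the part in use, the receiver of such a byte cannot tell which character was meant, and a single document cannot mix, say, Cyrillic and Greek letters.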


Unicode


Unicode Character Count


BMP Structure

Roadmap of Unicode BMP. Each numbered box represents 256 characters. (Source: Wikipedia)

  • Black = Latin scripts and symbols
  • Light Blue = Linguistic scripts
  • Blue = Other European scripts
  • Orange = Middle Eastern and SW Asian scripts
  • Light Orange = African scripts
  • Green = South Asian scripts
  • Purple = Southeast Asian scripts
  • Red = East Asian scripts
  • Light Red = Unified CJK Han
  • Yellow = Canadian Aboriginal scripts
  • Magenta = Symbols
  • Dark Grey = Diacritics
  • Light Grey = UTF-16 surrogates and private use
  • Cyan = Miscellaneous characters
  • White = Unused

Unicode Encodings

Character    A            א            好           𣎴
Code point   U+0041       U+05D0       U+597D       U+233B4
UTF-8        41           D7 90        E5 A5 BD     F0 A3 8E B4
UTF-16       00 41        05 D0        59 7D        D8 4C DF B4
UTF-32       00 00 00 41  00 00 05 D0  00 00 59 7D  00 02 33 B4
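
The byte sequences in the table can be reproduced with Python's built-in codecs. This is a small sketch rather than part of the original slides; the helper names `hex_bytes` and `encodings` are made up for this example, and the big-endian codec variants are used so that no byte order mark is emitted and the output matches the table's byte order.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


def encodings(ch: str) -> dict[str, str]:
    """Return the code point and the UTF-8/16/32 bytes of one character."""
    return {
        "Code point": f"U+{ord(ch):04X}",
        "UTF-8": hex_bytes(ch.encode("utf-8")),
        "UTF-16": hex_bytes(ch.encode("utf-16-be")),
        "UTF-32": hex_bytes(ch.encode("utf-32-be")),
    }


# The four characters of the table, written as escapes to avoid any
# dependence on the source file's own encoding.
for ch in "A\u05D0\u597D\U000233B4":
    print(encodings(ch))
```

The last character lies outside the BMP, so UTF-16 needs a surrogate pair (D8 4C DF B4) where the other three characters fit into a single 16-bit unit.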

UTF-8


Other UTFs


Character Set Identification


Normalization and Transcoding

Unicode is Complex

Unnormalized Unicode:
<français>français ≠ français</français>

Unnormalized UTF-8 with encoding="ISO-8859-1":
<français>français ≠ français</français>
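
Both effects can be reproduced in a few lines of Python. The sketch below assumes that the two spellings of "français" differ in how the ç is encoded (one precomposed code point versus c plus a combining cedilla); the extracted slide text no longer shows which spelling is which, so the variable names are illustrative.

```python
# Two spellings that render identically but differ as code point sequences.
composed = "fran\u00E7ais"     # ç as a single code point, U+00E7
decomposed = "franc\u0327ais"  # c followed by combining cedilla, U+0327

print(composed == decomposed)          # the strings compare unequal
print(len(composed), len(decomposed))  # 8 versus 9 code points

# UTF-8 bytes read under the wrong label: each byte of the two-byte
# sequence for ç is taken as a separate ISO-8859-1 character.
mislabeled = composed.encode("utf-8").decode("iso-8859-1")
print(mislabeled)
```

A comparison that works on code points, such as matching XML element names, treats the two spellings as different names even though a reader sees the same word.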

Normalization Examples
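
As one possible illustration, assuming Python's standard `unicodedata` module, the canonical normalization forms NFC (composed) and NFD (decomposed) map both spellings of the same word onto a single code point sequence. The sample strings are chosen for this sketch and are not taken from the slide.

```python
import unicodedata

composed = "fran\u00E7ais"     # ç as U+00E7
decomposed = "franc\u0327ais"  # c + combining cedilla U+0327

# NFC composes wherever a precomposed character exists.
nfc = unicodedata.normalize("NFC", decomposed)
# NFD decomposes precomposed characters into base + combining marks.
nfd = unicodedata.normalize("NFD", composed)

print(nfc == composed)    # True: both are now the composed form
print(nfd == decomposed)  # True: both are now the decomposed form
```

Normalizing both operands to the same form before comparing them is what makes the two spellings compare equal.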

Compatibility Composites
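
A brief sketch of how compatibility characters behave under normalization, again assuming Python's `unicodedata` module. The two sample characters (the fi ligature and superscript two) are picked for this example and may differ from the ones on the original slide.

```python
import unicodedata

ligature = "\uFB01"     # LATIN SMALL LIGATURE FI
superscript = "\u00B2"  # SUPERSCRIPT TWO

# Canonical normalization leaves compatibility characters untouched.
nfc_ligature = unicodedata.normalize("NFC", ligature)

# Compatibility normalization replaces them with their plain equivalents.
nfkc_ligature = unicodedata.normalize("NFKC", ligature)
nfkc_superscript = unicodedata.normalize("NFKC", superscript)

print(nfc_ligature == ligature)  # True
print(nfkc_ligature)             # fi
print(nfkc_superscript)          # 2
```

Compatibility normalization discards formatting distinctions (here, the ligature and the superscript), so it suits searching and matching better than storing text that must round-trip unchanged.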

Transcoding
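
Transcoding can be sketched as decoding bytes from the source encoding into characters and re-encoding those characters in the target encoding. The function name `transcode` below is hypothetical, and the sample strings are chosen for illustration only.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert bytes from one character encoding to another."""
    return data.decode(source).encode(target)


latin1 = "fran\u00E7ais".encode("iso-8859-1")    # ç is the single byte E7
utf8 = transcode(latin1, "iso-8859-1", "utf-8")  # ç becomes C3 A7
print(len(latin1), len(utf8))

# Transcoding fails when the target character set lacks a character.
try:
    transcode("\u597D".encode("utf-8"), "utf-8", "iso-8859-1")
    lossless = True
except UnicodeEncodeError:
    lossless = False
print(lossless)
```

Converting into a Unicode encoding works for every character of the source, while the reverse direction only works for characters the target set contains.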


Conclusions

Text is Complex