Bits, bytes, and Unicode: Digital text for linguists
James Crippen
direct link: http://ling.auf.net/lingbuzz/003567
January 2016
A linguistically oriented review of digital text and the representation of text with the Unicode character set and encoding system. Presents basic terminology and concepts of writing systems, and of digital representation of information with binary (and hexadecimal) numbers. Details characters, character encodings, processes of encoding conversion, and file formats. The Unicode character set is discussed in extensive detail, distinguishing code points, character names, meta-structure such as planes and blocks, meta-data such as character types and properties, and basic principles of representation normalization and sorting. Ends with a review of Unicode encoding formats (e.g. UTF-8, UTF-16) and some practical issues for using Unicode in linguistics.
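As a rough illustration of three ideas named in the abstract (code points, normalization, and encoding formats), here is a minimal Python sketch. It is not taken from the paper; the sample character "é" and the variable names are assumptions chosen for illustration, and only Python's standard `unicodedata` module is used.

```python
import unicodedata

# The same written character "é" as two different code point sequences
# (illustrative choice, not an example from the paper).
composed = "\u00e9"      # one precomposed code point
decomposed = "e\u0301"   # base letter "e" followed by a combining accent

# List each code point in hexadecimal alongside its Unicode character name.
for ch in composed + decomposed:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# The raw sequences differ, but normalization maps one onto the other.
same_raw = composed == decomposed
same_nfc = unicodedata.normalize("NFC", decomposed) == composed
same_nfd = unicodedata.normalize("NFD", composed) == decomposed
print(same_raw, same_nfc, same_nfd)

# One code point, two encoding formats: the byte sequences differ.
utf8_bytes = composed.encode("utf-8")
utf16_bytes = composed.encode("utf-16-be")
print(utf8_bytes.hex(), utf16_bytes.hex())
```

Comparing strings only after normalizing them to a single form (NFC or NFD) is one common way to keep visually identical transcriptions from being treated as different in searches and sorts.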
Format: pdf
Reference: lingbuzz/003567 (please use that when you cite this article)
Published in: University of British Columbia
keywords: unicode, text, orthography, writing systems, human computer interaction, phonology