Encodings and Unicode

This page explains in more detail what an encoding system is and what Unicode is. It's mainly for folks who are curious about technical details.

What is an Encoding System?

To represent character data numerically (in a computer file), one needs to decide on an encoding system that matches each character to a specific numerical representation. Each national language has one or more different encodings associated with it. (In the earlier days of the Internet, Japanese alone was represented by three common encoding schemes: EUC, SJIS, and ISO-2022-JP or JIS. The first versions of this site were coded in SJIS.) So a single numerical value could represent any one of several different characters in several different languages: in order to know which language and which character, you need to know the encoding system.
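As a small illustration of that last point, here is a Python sketch (not part of the original page) that encodes one Japanese word with EUC and then reads the same numerical values back under three different encoding systems. The codec names "euc_jp", "shift_jis", and "latin-1" are the names Python's standard library uses; the choice of sample word is arbitrary.

```python
# Encode a Japanese word with EUC-JP, then interpret the same bytes
# under several encodings. Only the matching one recovers the text.
raw = "日本語".encode("euc_jp")

decoded = {
    # errors="replace" substitutes U+FFFD for any byte sequence that is
    # invalid in the assumed encoding, instead of raising an exception.
    name: raw.decode(name, errors="replace")
    for name in ("euc_jp", "shift_jis", "latin-1")
}

for name, text in decoded.items():
    print(f"{name:>9}: {text!r}")
```

Only the "euc_jp" line should show the original word; the other two show whatever characters those encodings happen to assign to the same numbers.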

Therefore, for a computer to display text that has been encoded as numerical data (an email message or a web page, for example), it needs to know what encoding system was used. If the encoding system is not specified explicitly in the email or on the web page, the email program or web browser will try to guess it, but if the guess is wrong, the text will display as nonsense. (In Safari, you as the user can then tell the browser program to try interpreting the page with a different encoding system. You do this by selecting a different encoding choice from the encoding menu, described on the browser page.)
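The guessing described above can be sketched roughly as follows. This is only an assumption about how a program might go about it, not how any particular browser or email program works, and guess_decode is a hypothetical helper name. It tries a list of candidate encodings in order and keeps the first one that decodes without an error.

```python
def guess_decode(data, candidates):
    """Return (encoding, text) for the first candidate that decodes cleanly.

    Hypothetical helper: a clean decode only means the bytes are *valid*
    in that encoding, not that the guess is right.
    """
    for name in candidates:
        try:
            return name, data.decode(name)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit the data")


# ISO-2022-JP uses only 7-bit bytes, so the same data is also valid UTF-8.
data = "日本語".encode("iso2022_jp")

wrong = guess_decode(data, ["utf-8", "iso2022_jp"])
right = guess_decode(data, ["iso2022_jp", "utf-8"])

print(wrong)  # decodes without error, but the text is nonsense
print(right)  # the original Japanese text
```

The first call shows why a wrong guess produces nonsense rather than an error message: the data is perfectly legal in the wrong encoding, it just means something else. Choosing a different encoding by hand amounts to changing the order of the candidates.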

What is Unicode?

Unicode is a standard that solves some of the problems above by using a much larger set of numerical values, and trying to represent all the characters of all major languages in a single encoding set. Each number identifies one unique character. You may also see the abbreviation UTF, for Unicode Transformation Format, in the names of different Unicode encodings like UTF-8 and UTF-16. (If you are wondering why a single universal standard should produce several different encodings, see the note on encoding versus character set below.)

These days, most web pages, email, and other text files your Mac will encounter are encoded in Unicode, which naturally includes encodings for Japanese and other Asian languages. This means you will be able to read and write Japanese once you have enabled Japanese on your system. (See the first page of this site for information about how to do this.)

Unicode has several advantages. One is that it can be used as a more or less single universal standard: everyone can use Unicode to encode data in their own language, without the confusion of having many different encoding systems in use. For example, this makes it possible to open and save the same Japanese text file in different software applications, even on different platforms like Mac, Unix, and Windows. For web development, another immediate advantage is that with Unicode you can represent multiple languages on one web page. Without Unicode, a single web page (or any pure text file) must use the same encoding throughout, so a page using a Japanese encoding scheme cannot contain characters that are not included in that set--Korean characters, for example. (Japanese encodings do contain roman characters, so you can mix English and Japanese on one page even without Unicode.)
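The limitation on mixing languages is easy to check in Python. The following is a minimal sketch, with can_encode as a hypothetical helper and the sample strings chosen only for illustration. It asks whether a piece of text can be written out in a given encoding at all.

```python
def can_encode(text, encoding):
    """Return True if every character in text exists in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


english_japanese = "Hello, 日本語"
japanese_korean = "日本語 한국어"

for sample in (english_japanese, japanese_korean):
    for encoding in ("shift_jis", "utf-8"):
        print(f"{sample!r} in {encoding}: {can_encode(sample, encoding)}")
```

The English and Japanese sample fits in Shift JIS, but the sample containing Korean does not. Both fit in UTF-8.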

Note that an encoding system like Unicode is quite distinct from a font. If software that supports Unicode encounters text in Japanese, it will know that a message is composed of Japanese characters, but the computer might not have the fonts required to display those characters on the screen. For this reason, to work in a language you also need a font for the language. Macs come with fonts for Japanese (and other common languages Unicode supports), so this is not generally a problem unless you are working with much less common languages.

Encoding vs. Character set

If you are wondering why a single unified standard like Unicode should produce several different encodings, the answer is that what Unicode actually defines is a character set that matches every character with a unique integer number. But there are different ways of representing those integer numbers in the binary code (the ones and zeroes) that the computer uses. So UTF-8, UTF-16, and UTF-32 are three different encodings that all use the Unicode character set: each assigns the same integer to a given character, but each represents that integer with a different number and combination of ones and zeroes. (Actually the Japanese encodings EUC, Shift JIS, and ISO-2022-JP are also all based on a single character set. EUC is most common on UNIX, Shift JIS on the web, and ISO-2022-JP in email.) The most common Unicode encoding on the web is UTF-8. That is the encoding these pages use.
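To make the distinction concrete, this short Python sketch takes one character, prints the integer Unicode assigns to it, and then prints the bytes that each of the three encodings uses to store that same integer. The "-be" suffix in the codec names is a detail not discussed on this page: it fixes the byte order (big-endian) so that the output is predictable.

```python
ch = "日"

# The character set part: one integer per character, the same everywhere.
code_point = ord(ch)
print(f"U+{code_point:04X}")  # U+65E5

# The encoding part: three different byte patterns for that one integer.
encoded = {name: ch.encode(name) for name in ("utf-8", "utf-16-be", "utf-32-be")}

for name, data in encoded.items():
    print(f"{name:>9}: " + " ".join(f"{byte:02x}" for byte in data))
#     utf-8: e6 97 a5
# utf-16-be: 65 e5
# utf-32-be: 00 00 65 e5
```

UTF-16 and UTF-32 store the number 65E5 more or less directly, in two and four bytes respectively, while UTF-8 spreads the same bits across three bytes.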

How to Manipulate the Encoding of Text Files on a Mac

Apple's TextEdit application can open text files in a range of different encodings and convert files to different encodings. To do this, you need to save the file as plain text instead of the Rich Text Format (RTF) that is the program's default. From the Format menu select "Make Plain Text", or select Plain Text as the format when first saving the file. You can use the application preferences to specify the encoding to open and save files in; you can also specify the encoding to use when opening a file via a menu in the Open dialog box.
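If you are comfortable with a little scripting, the same kind of conversion can be done outside TextEdit. The following Python sketch is one possible approach, not anything TextEdit itself provides: convert_encoding is a hypothetical helper, and you have to know (or correctly guess) the encoding the file is currently in.

```python
def convert_encoding(src, dst, from_encoding, to_encoding="utf-8"):
    """Re-encode a plain text file and return the number of characters read.

    Hypothetical helper for illustration; it is not a TextEdit feature.
    """
    # newline="" leaves the file's original line endings untouched.
    with open(src, "r", encoding=from_encoding, newline="") as infile:
        text = infile.read()
    with open(dst, "w", encoding=to_encoding, newline="") as outfile:
        outfile.write(text)
    return len(text)
```

For example, convert_encoding("notes-sjis.txt", "notes-utf8.txt", "shift_jis") would read a Shift JIS file and write a UTF-8 copy (both file names are made up). If from_encoding is wrong, the call will either fail with an error or quietly write nonsense, for the reasons described earlier on this page.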