Encodings and Unicode

This page explains in more detail what an encoding system is and what Unicode is. If you just want to use Japanese with your web browser or email and are not curious about the technical details, you can start on one of the earlier pages on this site.

What is an Encoding System?

To represent character data numerically (as in a computer file), one needs to decide on an encoding system that matches each numerical value to a specific character. Each national language has one or more encodings associated with it, so a single numerical value could represent any one of several different characters in several different languages, depending on the encoding system in use. Japanese alone has three common encoding schemes: EUC, SJIS (Shift JIS), and ISO-2022-JP. The last is sometimes referred to as JIS.
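As a rough illustration of the point above, the following Python sketch encodes one Japanese string with each of the three schemes and then reads one result back with the wrong scheme. The codec names euc_jp, shift_jis, and iso2022_jp are the ones used by Python's standard library; the sample string is an arbitrary choice made for this example.

```python
# Encode the same text with three Japanese encodings and compare the bytes.
text = "日本語"  # arbitrary sample string

encoded = {name: text.encode(name) for name in ("euc_jp", "shift_jis", "iso2022_jp")}

# Each encoding produces a different byte sequence for the same characters.
for name, data in encoded.items():
    print(f"{name:12} {data.hex(' ')}")

# Reading EUC bytes as if they were Shift JIS does not give back the original text.
misread = encoded["euc_jp"].decode("shift_jis", errors="replace")
print(misread == text)
```

The bytes only mean something once you know which encoding produced them, which is why the encoding has to be declared or guessed.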

For a computer to display text that has been encoded as numerical data (an email message or a web page, for example), it needs to know what encoding system was used. If the encoding system is not specified explicitly in the email or on the web page, the email program or web browser will try to guess what encoding system was used, but this is not always easy. If the guess is wrong, the text will display as nonsense. You as the user can then tell the browser or email program to try interpreting the page with a different encoding system. You do this by selecting a different encoding choice from the encoding menu. Other pages on this site show how to do this in Apple Mail and Safari, for example.
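The guessing step described above can be sketched very roughly in Python as trial decoding. This is only an assumption about one simple approach, not a description of how any particular browser or email program works; real detectors typically also weigh how plausible the decoded text looks. The function name guess_encoding and the default candidate order are made up for this example.

```python
def guess_encoding(data, candidates=("iso2022_jp", "utf_8", "euc_jp", "shift_jis")):
    """Return the first candidate that decodes data without error, or None.

    A clean decode does not prove the guess is right: bytes written in one
    encoding can sometimes decode without error as another and still come
    out as nonsense.
    """
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate cannot be right, try the next one
        return name
    return None


# Example: Shift JIS bytes are not valid UTF-8, so the second candidate wins.
sample = "日本語".encode("shift_jis")
print(guess_encoding(sample, candidates=("utf_8", "shift_jis")))
```

Because more than one candidate may decode without error, the order of the candidates matters, and a wrong guess is still possible. That is the situation in which you would pick a different encoding from the menu by hand.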

What is Unicode?

Unicode is a standard for several closely related encoding systems that solves some of the problems above by using a much larger set of numerical values and trying to represent all the characters of all major languages in a single set. Each number encodes a unique character. The names of the Unicode encodings begin with UTF, for Unicode Transformation Format, and in the encoding menu of your browser you may see choices like UTF-8 and UTF-16. (If you are wondering why a single universal standard should produce several different encodings, see the note on encoding versus character set below.)

Unicode encoding systems naturally include encodings for Japanese and other Asian languages. Most Mac software is "Unicode compatible," meaning you will be able to use Japanese with it once you have enabled Japanese on your system. (See the first page of this site for information about how to do this.)

Unicode has several advantages. One is that it can serve as a more or less universal standard: everyone can use Unicode to encode data in their own language, without the confusion of having many different encoding systems in use. This makes it possible, for example, to open and save the same Japanese text file in several different software applications, even on different platforms like Mac, Unix, and Windows. For web development, another immediate advantage is that with Unicode you can represent multiple languages on one web page. Without Unicode, a single web page (or any plain text file) must use the same encoding throughout, so a page using a Japanese encoding scheme cannot contain characters that are not included in that set, such as Korean characters. (Japanese encodings do contain roman characters, so you can mix English and Japanese on one page even without Unicode.)
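The limitation described above can be checked with a short Python sketch. The helper name can_encode and the sample strings are made up for this example; the codec names are the ones used by Python's standard library.

```python
def can_encode(text, encoding):
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


japanese_english = "日本語 and English"
japanese_korean = "日本語 and 한국어"

for sample in (japanese_english, japanese_korean):
    for encoding in ("shift_jis", "utf_8"):
        print(f"{sample!r} in {encoding}: {can_encode(sample, encoding)}")
```

The Japanese encoding handles the Japanese and English mix but rejects the Korean characters, while UTF-8 accepts both strings.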

Note that an encoding system like Unicode is quite distinct from a font. If software that supports Unicode encounters text in Japanese, it may know that the message is composed of certain Japanese characters, but the computer may not have the fonts required to display those characters on the screen or printer. For this reason, to work in a language you also need a font for the language.

Encoding vs. Character set

If you are wondering why a single unified standard like Unicode should produce several different encodings, the answer is that what Unicode actually defines is a character set that matches every character with a unique integer. But there are different ways of representing those integers in the binary code (the ones and zeroes) that the computer uses. So UTF-8, UTF-16, and UTF-32 are three different encodings that all use the Unicode character set: each assigns the same integer to a given character, but each represents that integer with a different number and combination of ones and zeroes. (Similarly, the Japanese encodings EUC, Shift JIS, and ISO-2022-JP are all based on the same character set. EUC is most common on UNIX, Shift JIS on the web, and ISO-2022-JP in email.) The most common Unicode encoding on the web is UTF-8. That is the encoding these pages use.
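To make the distinction concrete, here is a small Python sketch that prints the Unicode integer for one character and then the bytes each of the three encodings uses to store it. The sample character is an arbitrary choice, and the big-endian codec variants (utf_16_be, utf_32_be) are used only so that no byte order mark is added to the output.

```python
ch = "日"  # arbitrary sample character

# The character set assigns the character one integer (its code point).
code_point = ord(ch)
print(f"U+{code_point:04X}")

# Each encoding stores that same integer as a different byte sequence.
for name in ("utf_8", "utf_16_be", "utf_32_be"):
    data = ch.encode(name)
    print(f"{name:10} {len(data)} bytes: {data.hex(' ')}")
```

The code point is the same in every case; only the number and pattern of bytes change.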

How to Manipulate the Encoding of Text Files on a Mac

Apple's TextEdit application can open text files in a range of different encodings and convert files from one encoding to another. You need to save the file in plain text format instead of the Rich Text Format (RTF) that is the program's default. From the Format menu select "Make Plain Text," or select Plain Text as the format when first saving the file. You can use the application preferences to specify the encoding used to open and save files; you can also specify the encoding to use when opening a file via a menu in the open dialog box.

If you need to convert a large file like a dictionary file from one encoding system to another, you might try Cyclone, which allows you to access Mac OS X's encoding converters directly.
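If you are comfortable with a little scripting, the same kind of conversion can also be sketched in Python. This is an alternative to the tools mentioned above, not a description of how they work. The function name convert_file, the chunk size, and the file names in the usage comment are all hypothetical.

```python
def convert_file(src_path, dst_path, src_encoding, dst_encoding, chunk_chars=64 * 1024):
    """Re-encode a text file in chunks so a large file need not fit in memory."""
    # newline="" keeps the original line endings exactly as they are.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        while True:
            chunk = src.read(chunk_chars)
            if not chunk:  # empty string means end of file
                break
            dst.write(chunk)


# Hypothetical usage: convert an EUC dictionary file to UTF-8.
# convert_file("dictionary-euc.txt", "dictionary-utf8.txt", "euc_jp", "utf_8")
```

The conversion stops with an error if the source file is not really in the stated encoding, or if it contains a character the target encoding cannot represent.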