What Is a UTF-8 File?
UTF-8 is a character encoding: a scheme for storing the characters of the Unicode character set as bytes. A UTF-8 file contains plain text. That is, the file does not have any of the formatting codes created by word processors. This is the type of file that can be opened and edited in a simple text editor like Notepad. A UTF-8 file is sometimes given the UTF8 file extension, but it can also have a TXT file extension, and the extension alone does not determine the encoding.
-
Definition
-
UTF stands for UCS Transformation Format, and UCS stands for Universal Character Set. UCS can be encoded in several ways, including UTF-16 and UTF-32, but UTF-8 is probably the most widely used. The UCS group of definitions is a joint project between the International Organization for Standardization (ISO) and an industry body called the Unicode Consortium. UTF-8 can represent the characters needed for a wide range of languages.
Method
-
Each character is stored as a sequence of bytes. A byte is a string of eight bits, and a bit is a binary digit, which means it has to be either zero or one. UTF-8 is a variable-length encoding: the number of bytes used to represent a character increases from one, for the simplest characters, up to four for less commonly used ones. (The original design allowed sequences of up to six bytes, but the current standard limits them to four.) The 128 ASCII characters take a single byte each, with the same value they have in ASCII, so no padding byte is inserted in front of them. Each character is also assigned a number, called its code point, which is usually written in hexadecimal. Hexadecimal is a base 16 counting system. Humans usually use a base 10 system, called decimal, which uses the digits 0 – 9. Hexadecimal uses 0 – 9 plus A – F to represent a number. A code point is written as “U+” followed by four to six hexadecimal digits, such as U+0041 for the letter A. A UTF-8 encoded file stores each character as its byte representation, not as the code point number itself.
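To make the relationship between a code point and its stored bytes concrete, here is a minimal Python sketch. The helper name describe and the four sample characters are illustrative choices, not part of any standard; the sketch relies only on Python's built-in str.encode and ord.

```python
# Illustrative samples: A, e with acute accent, the euro sign, and an emoji.
# They are written as escapes so the script does not depend on its own encoding.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

def describe(ch: str) -> str:
    """Return the code point and UTF-8 bytes of one character (hypothetical helper)."""
    encoded = ch.encode("utf-8")                        # the bytes a UTF-8 file would store
    hex_bytes = " ".join(f"{b:02X}" for b in encoded)   # each byte as two hex digits
    return f"U+{ord(ch):04X} -> {hex_bytes} ({len(encoded)} byte(s))"

for ch in samples:
    print(describe(ch))
```

Running it prints one line per character. The letter A appears as the single byte 41, while the other samples take two, three and four bytes respectively.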
-
Background
-
One of the earliest systems for encoding text on computers, and still the most widely known, is the ASCII code table, published by the American Standards Association, the predecessor of the American National Standards Institute (ANSI). The code grew out of the codes used by teletype machines. It was developed during the 1960s, with a first edition published in 1963 and a major revision in 1967, and it assigned a number to each character that US typists were most likely to use. The name ANSI is sometimes used loosely for this character set, although it more often refers to the extended 8-bit code pages used on Windows. During the 1980s software companies realized that ASCII needed to be expanded to account for characters used in other languages. They formed the Unicode project to define a new code table. At the same time, ISO was working on its standard ISO 10646, which has the same aim. The two organizations combined their efforts. This is why UTF-8, which is one encoding of that shared character set, is often loosely called Unicode.
Text Editors
-
Some text editors are able to encode UTF-8, but have trouble reading files created in other editors. One cause is the Byte Order Mark, or BOM, a code placed at the start of a file. In UTF-16, where each unit is two bytes long, the bytes can be stored in a reverse order, called “little-endian,” or in the regular order, called “big-endian.” Both are allowed in the standards, but the file should begin with a code that shows in which order the bytes are stored. Little-endian files start with “FF FE” and big-endian files start with “FE FF”. UTF-8 has only one byte order, so it does not need this mark, but some editors still write the bytes “EF BB BF” at the start of a UTF-8 file as a signature. Not all text editors are programmed to recognize these codes, and those that are not may show the mark as stray characters or misinterpret the text.
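As a rough sketch of how a program might check for a Byte Order Mark, the Python snippet below inspects the first bytes of a file's contents. It checks for the FF FE and FE FF marks described above, along with the UTF-8 signature EF BB BF and the four-byte UTF-32 marks. The function name detect_bom and the table BOMS are hypothetical; the BOM constants come from Python's standard codecs module.

```python
import codecs

# Known BOM byte sequences. The UTF-32 marks come first because the
# UTF-32 little-endian mark (FF FE 00 00) begins with the UTF-16 one (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),        # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]

def detect_bom(data: bytes):
    """Return (encoding name, BOM length), or (None, 0) if no BOM is found."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

# Example: bytes as an editor might write them, with a UTF-8 signature in front.
name, length = detect_bom(b"\xef\xbb\xbfhello")
print(name, length)
```

When reading a file that may or may not carry the UTF-8 signature, Python's built-in "utf-8-sig" codec can be passed as the encoding to open; it removes a leading BOM if one is present.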
-