What is Encoding?
Why are encodings so important?
Computers use text all the time: news feeds, stock exchanges, messages on social media and messengers, banking applications, and more. Today, we cannot imagine life without the exchange of information. It wasn’t always like this, though. Computers learned to work with text thanks to the emergence of encodings. Encodings have come a long way from character tables created specifically for each computer to a unite worldwide encoding.
Today Unicode is the main character encoding standard that includes characters from almost all of the living languages of the world. Unicode is applied wherever there is text. Information on social media, database entries, computer programs and mobile applications — all this works with the use of Unicode.
In this guide, we will cover how Unicode appeared and what problems it resolves today. We will find out how information was stored and transmitted before the introduction of the universal character representation standard, and also look at the examples of Unicode-based encodings.
History of encodings
Originally, computers were intended to speed up and automate calculations. The word “computer” itself could be translated to some languages as “a calculator”, and in the 20th century in the USSR, the term “automatic computing engine” was in use.
All that computers operate on are numbers. And defense enterprises became the main customer and driving force behind the appearance of first models. They used computers to calculate the flight variables of ballistic missiles, airplanes, and satellites. In the 1950s, the computing power of computers was also used for:
- forecast of the weather
- experimental and theoretical physics
- calculations of salaries (for example, the LEO computer was used by a company that owned a chain of tea shops)
- forecasting the results of the US presidential election (UNIVAC in 1952, the first commercially produced digital computer)
Numbers and computers
The initial goals of creating computers resulted in an architecture intended for working with numbers. They are stored on the computer as follows:
- A number from the decimal numeral system is converted to binary, which uses only two symbols: “0” (zero) and “1” (one). For example, 3 in the binary number system can be written as 11, and 9 as 1001. Read more about numeral systems in the corresponding guide.
- As a result, a set of zeros and ones are stored in computer memory cells. For example, the presence of electric current on a memory element means one, its absence means zero.
In the late 1950s, incandescent lamps were replaced with semiconductor elements (transistors and diodes). This made it possible to reduce the size of computers, increase the speed and reliability of calculations, and also affect the final cost. And while the first computers were very expensive and could be afforded only by states or large companies, serial computers, though not personal ones, began to appear with semiconductors.
Computers and symbols
Over time, computers started to be used to solve not only computational or mathematical problems. Then it became necessary to process textual information but the situation was a bit more complicated with letters and other symbols than with numbers. Symbols are visual objects. Even the same letter “a” can be represented by two different characters: “a” and “A” due to case sensitivity.
The number “one” can be represented by various symbols as well. It can be the Arabic numeral “1” or the Roman numeral ‘I”. The value of the number doesn’t change; only the symbols do.
Computers were created to work with numbers, they cannot store symbols. When entering information into a computer, symbols are converted into numbers and stored in the computer’s memory as ordinary numbers, and when displaying information, numbers are converted back into symbols.
The rules for converting symbols and numbers were written as a symbol table (character encoding). In accordance with it, each computer had to have a unique input/output device (for example, a keyboard and a printer).
Computers become available
In the early 1960s, computers were incompatible with each other, even within the same manufacturing company. For example, IBM corporation had about 20 design bureaus, and each of them developed its own model. Such computers weren’t versatile but were supposed to resolve specific tasks. For each task, the necessary symbol tables and input/output devices were designed.
It was during this period that first connected computer networks began to emerge. Thus, in 1958, the Semi-Automatic Ground Environment System (SAGE) was created, it united the radar stations of the U.S. and Canada into the first large-scale computer network. In order to use the results of calculations on computers within the same network, they had to have the same symbol tables.
In 1962, IBM formed two main principles for its production development:
- Computers should be universal. This meant the transition from the production of highly specialized computers to those that could solve different tasks.
- Computers should be compatible with one another; that is, data from one computer should be usable on another.
As a result, in 1965, IBM System/360 computers entered the market. There was a range of six models, all with compatible modules. The models differed in performance and cost, allowing customers to have a sufficient choice. The modularity of the systems resulted in the new industry emergence — the production of System/360 computers compatible with computing modules. Companies didn’t need to produce a whole computer anymore, all they had to do was sell separate compatible modules. The result was an even greater spread of computers.
ASCII as the first information encoding standard
Teletype and terminal
At the same time, they were developing teletypes. It’s a system that enables us to transmit text information at a distance. Two printers and two keyboards (typewriter-like keyboards powered by an electric motor) are connected to each other by wires. The text typed on the keyboard of the first user is printed on the printer of the second user and vice versa. That’s how, for example, a “hotline” was organized between the US president and the leadership of the USSR until the early 1970s.
Teletypes also convert text information into some signals conveyed through wires. Yet binary code is not always in use, for example, Morse code is comprised of two signal units—dots and dashes, plus, of course, a pause. But teletypes require symbol tables, the correspondence in which is built between symbols and signals in wires. At the same time, each teletype (pairs, connected teletypes) could have their own symbol tables, based on the tasks they address. For example, the language could differ, and therefore the character set itself, depending on the device. To optimize the teletype work, the most popular (frequently encountered) symbols were encoded with the shortest set of signals, which means that the set of symbols could differ even within the same language.
Computer terminals were developed on the basis of teletypes. Such a device didn’t send messages to the second user. The information was entered into some remote computer, which, after processing the specified commands, returned the result as a response message. This innovation made it possible to use the computing power of computers that were very expensive at the time, without having physical access to the computer. For example, a computer could be located in a separate computing center or an institute, and employees from other branches or cities could access the computing power of that computer through installed terminals.
The worldwide spread of computers and means of data exchange required the development of a universal character encoding standard for the transmission and storage of information. Such a standard was developed in the USA in 1963. The table of 128 characters was named ASCII (American Standard Code for Information Exchange).
ASCII reserves the first 32 codes for control characters. They intend to control devices (such as teletype printers) and receive some composite characters. For example:
- the symbol Ø could be obtained as follows: print O, then use the BS code (BackSpace) to move back and print the symbol /,
- the symbol à could be obtained by pressing BS `
- the symbol Ç could be obtained by pressing C BS ,
The introduction of control characters made it possible to obtain new symbols as a combination of existing ones without any additional symbol tables.
ASCII standard resolved this issue only in English-speaking countries. However, in countries with different writing (for example, with Cyrillic in the USSR), the problem remained.
Other language encodings
For more than 20 years, the issue had been solved by the introduction of its own local standards. For example, in the USSR, based on the ASCII table, they developed their own encoding variants, KOI7 and KOI8, where 7 and 8 indicate the number of bytes needed to encode one character, and KOI stands for “Kod Obmena Informatsiey” (Code for Information Exchange).
ASCII was soon expanded to an 8-byte system that has 256 code points. It has become quite common to mix the first ASCII 128 characters with 128 local ones (in particular, in the KOI8 encoding).
However, these encodings did not become a single standard. For example, in Russia, there was a cp866 (Code page 866) encoding used under DOS to write Cyrillic script, and then a cp1251 encoding in MS Windows. The encodings cp851 and cp1253 were used for the Greek language. As a result, documents made using the old encoding became unreadable on the new ones.
Other countries also needed their own encodings with a unique set of characters. This resulted in widespread confusion and difficulties in the information exchange. Below is the text written in KOI8-R encoding and read in cp851.
KOI8-R cp851 English text. English text. Это - русский текст :-). ΰΨΣ - ΦΩΧΧ╦╔╩ Ψ┼╦ΧΨ :-).
Both encodings are based on ASCII Code, so punctuation marks and letters of the English alphabet look the same. The Cyrillic text becomes completely unreadable, though.
It was the time when computer memory was expensive and communication between computers was slow. Therefore, it was more advantageous to use encodings in which the bit size of each character was small. The symbol table consists of 256 characters. This means that 8 bytes are enough to encode any of them (2^8 = 256).
Switching to Unicode
The rise of the Internet, the proliferation of computers, and the availability of low-cost memory devices all contributed to encoding confusion. This was especially evident on the Internet, when the text written on one computer had to be displayed correctly on many other devices. This became a huge problem both for programmers, who had to decide which encoding to use, and for end users, who could not access the texts they needed.
As a result, in October 1991, the first version of the universal symbol table, called Unicode, was developed. It contained 7161 different characters from 24 different writings from around the world at the time.
Unicode was gradually replenished with new languages and symbols. For example, more than 20,000 ideograms of Chinese, Japanese and Korean were added to version 1.0.1 in mid-1992. The current version already contains more than 143,000 characters.
Unicode - based encodings
Unicode can be imagined as a huge table of characters. It’s not the symbols that are recorded in the computer’s memory but the corresponding numbers from the table. You can write them in different ways. For this purpose several Unicode-based encodings were developed. They differ in the way the Unicode number is written as a set of bytes. They are called UTF — Unicode Transformation Format. There are fixed- and variable length encodings. For example, UTF-32 uses only 4 bytes per character regardless of what character it is. UTF-8 (variable length encoding) has become the most popular, it uses 1-2 bytes for common characters, and 4 bytes for the rare ones. For example, all the characters in the ASCII use just one byte, so text written in English using UTF-8 encoding will take up the same amount of space as text written using the ASCII Code.
But either way, everyone who works with computers and texts use Unicode. Unicode enables to use thousands of different characters and display them the same way on all devices from mobile phones to computers on space stations.
- Encoding is the correspondence between visual symbols and numbers
- Encodings are necessary because computers were meant to work with numbers and do not understand text
- Until the 1990s there was no single encoding, which often made texts completely unreadable
- Unicode is the universal character representation standard. The Internet development and the need to exchange a large amount of text information resulted in its widespread use
- UTF-8, UTF-16, UTF-32 are Unicode-based encodings. The main difference between them is how many bytes it requires to represent a character in memory
- UTF-8 is the most widely used encoding, it uses 1-2 bytes for the usual characters, and 3-4 bytes for the unusual ones. This saves significant amount of memory, for example, when working with English text