ASCII & UNICODE

Mohammed Riyan Abbas
6 min read · Nov 7, 2022

ASCII :-

  • ASCII, in full American Standard Code for Information Interchange, is a standard data-encoding format for electronic communication between computers. ASCII assigns standard numeric values to letters, numerals, punctuation marks, and other characters used in computers.
  • Before ASCII was developed, different makes and models of computers could not communicate with one another. Each computer manufacturer represented alphabets, numerals, and other characters in its own way. IBM (International Business Machines Corporation) alone used nine different character sets. In 1961 Bob Bemer of IBM submitted a proposal to the American National Standards Institute (ANSI) for a common computer code. The X3.4 committee, with representation from key computer manufacturers of the day, was formed to work on the new code. On June 17, 1963, ASCII was approved as the American standard. However, it did not gain wide acceptance, mainly because IBM chose to use EBCDIC (Extended Binary Coded Decimal Interchange Code) in its OS/360 series of computers released in 1964. Nevertheless, ASCII underwent further development, and revisions were issued in 1965 and 1967. On March 11, 1968, U.S. Pres. Lyndon B. Johnson mandated that ASCII be adopted as a federal standard to minimize incompatibility across federal computer and telecommunications systems, and that all new computers and related equipment purchased by the U.S. government from July 1, 1969, onward be ASCII-compatible. The code was revised again in 1968, 1977, and 1986.
  • ASCII was originally developed for teleprinters, or teletypewriters, but it eventually found wide application in personal computers (PCs), beginning with IBM’s first PC in 1981. ASCII uses seven-bit binary numbers, that is, sequences of seven 0s and 1s. Since there are 128 different possible combinations of seven 0s and 1s, the code can represent 128 different characters. The binary sequence 1010000, for example, represents an uppercase P, while the sequence 1110000 represents a lowercase p.
  • Digital computers use a binary code that is arranged in groups of eight bits rather than seven; each eight-bit group is called a byte. Consequently, ASCII is commonly embedded in an eight-bit field, which consists of the seven information bits and a parity bit that is used for error checking or for representing special symbols. This eight-bit system increases the number of characters ASCII can represent to 256, making room for additional special characters and characters from some other languages. Extended ASCII, as the eight-bit code is known, was introduced by IBM in 1981 for use in its first PC, and it soon became the industry standard for personal computers. In the ASCII layout, the first 32 code combinations are used for machine and control commands, such as “start of text,” “carriage return,” and “form feed.” Control commands do not represent printable information; rather, they help control devices, such as printers, that may use ASCII. For example, the binary sequence 00001000 represents “backspace.” Another group of 32 combinations is used for numerals and various punctuation marks, another for uppercase letters and a few other punctuation marks, and yet another for lowercase letters.
  • However, even extended ASCII does not include enough code combinations to support all written languages. Asian languages, for instance, require thousands of characters. This limitation gave rise to new encoding standards—Unicode and UCS (Universal Coded Character Set)—that can support all the principal written languages. Because it incorporates ASCII as its first 128 code combinations, Unicode (specifically UTF-8) is backward-compatible with ASCII while also representing many characters that ASCII cannot. Unicode, which was introduced in 1991, saw its usage jump sharply in the first decade of the 21st century, and it became the most common character-encoding system on the World Wide Web.
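The points above are easy to check directly. Here is a minimal Python 3 sketch (my own illustration, not part of any standard) that prints the seven-bit patterns for P and p, the eight-bit pattern for the backspace control code, and confirms that UTF-8 reproduces plain ASCII byte for byte:

```python
# Seven-bit ASCII: 128 possible patterns of 0s and 1s.
print(format(ord("P"), "07b"))  # 1010000 -> uppercase P
print(format(ord("p"), "07b"))  # 1110000 -> lowercase p

# Embedded in an eight-bit byte, the backspace control code (8) is 00001000.
backspace = "\b"
print(format(ord(backspace), "08b"))  # 00001000

# UTF-8 is backward-compatible with ASCII: the first 128 code points
# encode to exactly the bytes that plain ASCII assigns them.
text = "Hello"
assert text.encode("ascii") == text.encode("utf-8")
print(text.encode("utf-8"))  # b'Hello'
```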

ASCII art :-

  • ASCII art is computer text art created with the ASCII (American Standard Code for Information Interchange) character set. ASCII art uses ASCII characters to produce images ranging from simple and functional emoticons to elaborate works of art.
  • The ASCII code was established by the American National Standards Institute (ANSI) in the early 1960s as a standardized way of presenting and reading Latin-based alphanumeric keyboard characters. ASCII art uses those characters to mimic pen lines, brushstrokes, benday dots, and so on. Some ASCII art relies on line characters, such as \, |, /, and -, but other pieces use the whole range of keys.
  • ASCII art is most commonly found in online chat environments, in e-mail, and as “signatures” at the end of e-mail or USENET messages. It is also found on dedicated Web sites, where users exhibit their work and provide links to other exhibitors. While ASCII emoticons continue to be a significant element of text-based communication, more complex artistic forms are a specialist or niche interest. The ease with which ASCII art can be developed means that it remains an entertaining staple of computer-mediated communication.
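As a small, purely illustrative example (the drawing is my own, not taken from any gallery), a handful of ordinary keyboard characters is enough to suggest a picture; the snippet below simply prints a toy figure:

```python
# A toy piece of ASCII art built only from ordinary keyboard characters.
art = r"""
   /\_/\
  ( o.o )
   > ^ <
"""
print(art)
```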

UNICODE :-

  • Unicode is a computing standard for the consistent encoding of text symbols. It was created in 1991. At its core it is a table that assigns each glyph a position, i.e., a numeric code point. An encoding takes a symbol from the table and tells the font what should be drawn. But a computer understands only binary, so an encoding represents each character with 1s and 0s, much as Morse code represents letters and digits with dots and dashes. Each unit (a 1 or a 0) is called a bit; eight bits make one byte, and sixteen bits make two bytes. The best-known and most widely used encoding is UTF-8, which needs between 1 and 4 bytes to represent each symbol. Older encodings used only 1 byte per character, so they could not contain enough glyphs to cover more than one language.
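A minimal Python sketch of that variable-width point; the sample characters are my own choices, picked to land in the 1-, 2-, 3-, and 4-byte ranges of UTF-8:

```python
# UTF-8 spends 1 to 4 bytes per character, depending on the code point.
for ch in ("A", "é", "€", "𝄞"):  # ASCII, Latin-1 range, BMP symbol, beyond the BMP
    encoded = ch.encode("utf-8")
    print(f"{ch!r}: U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded}")
```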

UNICODE symbols :-

  • Each Unicode character has its own number and HTML code. For example, the Cyrillic capital letter Э has the number U+042D (042D is a hexadecimal number) and the HTML code &#1069;. In a code chart, the letter Э sits at the intersection of row 0420 and column D. If you want to know the number of a Unicode symbol, you can look it up in such a chart, paste the symbol into the search field of a Unicode reference site, or search by its description (“Cyrillic letter E”). A symbol’s reference page typically shows how it looks in different fonts and operating systems, and you can copy it from there and paste it into Word or Facebook. Such sites also group symbols into character sets for more convenient copying.
  • Different parts of the Unicode table cover characters from many different languages. Almost all writing systems in use today are represented: Latin, Arabic, Cyrillic, hieroglyphic, and pictographic scripts, with their letters, digits, and punctuation. The Unicode standard also covers many dead scripts (abugidas, syllabaries) for historical purposes. Many other symbols that do not belong to a specific writing system are encoded too: arrows, stars, control characters, and so on — everything humanity needs to produce high-quality text.
  • The Unicode standard is not frozen; it continues to evolve. Version 8.0 was released in June 2015, and more than 120,000 characters have been encoded so far. The Consortium does not invent new symbols; it only adds ones that are already in wide use. Faces (emoji), for instance, were included because they were heavily used by Japanese mobile operators. Some items, however, are excluded as a matter of principle: there are no trademarks in the Unicode table, not even the Windows flag or Apple’s registered logo.
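The lookup described above can also be done in code rather than in a chart. A minimal Python sketch using the built-in ord, chr, and the unicodedata module, with the letter Э from the earlier example:

```python
import unicodedata

ch = "Э"
print(f"U+{ord(ch):04X}")    # U+042D, the code point in hex
print(chr(0x042D))           # Э, recovered from the number
print(unicodedata.name(ch))  # CYRILLIC CAPITAL LETTER E
print(f"&#{ord(ch)};")       # &#1069;, the decimal HTML entity
```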

Example :-

  • A character code that defines every character in most of the world’s written languages. Although commonly thought to be only a two-byte coding system, a Unicode character can occupy anywhere from one to four bytes to hold a Unicode "code point" (see below). The code point is a unique number for a character or some symbol such as an accent mark or ligature. Unicode supports more than a million code points, which are written with a "U" followed by a plus sign and the number in hex; for example, the word "Hello" is written U+0048 U+0065 U+006C U+006C U+006F.
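The "Hello" example can be reproduced in one line; a minimal Python sketch:

```python
# Print each character of "Hello" in U+ code point notation.
print(" ".join(f"U+{ord(c):04X}" for c in "Hello"))
# Output: U+0048 U+0065 U+006C U+006C U+006F
```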

Character encoding schemes :-

  • There are several formats for storing Unicode code points. When combined with the byte order of the hardware (big-endian or little-endian), they are known officially as "character encoding schemes." They are also known by their UTF acronyms, which stand for "Unicode Transformation Format" or "Universal Character Set Transformation Format"; the most common are UTF-8, UTF-16, and UTF-32.
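A minimal Python sketch of these schemes; the -le and -be suffixes select little- and big-endian byte order explicitly, and the sample string is my own choice:

```python
# The same text under different Unicode character encoding schemes.
text = "Hi€"
for scheme in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le"):
    data = text.encode(scheme)
    print(f"{scheme:>9}: {len(data):2d} bytes: {data.hex(' ')}")
```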


#code #coding #programming #deeplearning #productivity #startup #developingskills