Western_Latin_character_sets_(computing)

Western Latin character sets (computing)

Western Latin character sets (computing)

Technically obsolete extensions to ASCII


Several 8-bit character sets (encodings) were designed for binary representation of common Western European languages (Italian, Spanish, Portuguese, French, German, Dutch, English, Danish, Swedish, Norwegian, and Icelandic), which use the Latin alphabet, a few additional letters and ones with precomposed diacritics, some punctuation, and various symbols (including some Greek letters). These character sets also happen to support many other languages such as Malay, Swahili, and Classical Latin.

This material is technically obsolete, having been functionally replaced by Unicode. However it continues to have historical interest.

Summary

The ISO-8859 series of 8-bit character sets encodes all Latin character sets used in Europe, albeit that the same code points have multiple uses that caused some difficulty (including mojibake, or garbled characters, and communication issues). The arrival of Unicode, with a unique code point for every glyph, resolved these issues.

History

The earlier seven-bit U.S. American Standard Code for Information Interchange ('ASCII') encoding has characters sufficient to properly represent only a few languages such as English, Latin, Malay and Swahili. It is missing some letters and letter-diacritic combinations used in other Latin-alphabet languages. However, since there was no other choice on most US-supplied computer platforms, use of ASCII was unavoidable except where there was a strong national computing industry. There was the ISO 646 group of encodings which replaced some of the symbols in ASCII with local characters, but space was very limited, and some of the symbols replaced were quite common in things like programming languages.

Most computers internally used eight-bit bytes but communication (seen as inherently unreliable) used seven data bits plus one parity bit. In time, it became common to use all eight bits for data, creating space for another 128 characters. In the early days most of these were system specific, but gradually the ISO/IEC 8859 standards emerged to provide some cross-platform similarity to enable information interchange.

Towards the end of the 20th century, as storage and memory costs fell, the issues associated with multiple meanings of a given eight-bit code (there are seven ISO-Latin code sets alone) have ceased to be justified. All major operating systems have moved to Unicode as their main internal representation. However, as Windows did not support the UTF-8 method of encoding Unicode (preferring UTF-16), many applications continued to be restricted to these legacy character sets.

The euro sign

The introduction of the euro and its associated euro sign () introduced significant pressure on computer systems developers to support this new symbol, and most 8-bit character sets had to be adapted in some way.

  • Apple with MacRoman and Sun Microsystems with Solaris OS simply replaced the generic currency sign (¤). This caused difficulty in some places because organisations had found other uses for its code point, such as the company logo.
  • ISO introduced a further variant of ISO 8859, ISO 8859-15, which replaced the generic currency sign with the euro sign as well as making some other replacements of symbols with letters with diacritics. ISO 8859-15 never received widespread adoption.
  • With Windows-1252, Microsoft placed the euro sign in a gap (position 80hex) in the existing C1 control codes, a decision that other vendors considered counter-architectural.

Whilst these decisions had limited effect for documents that were only used within a single computer (or at least within a single vendor's "digital ecosystem"), it meant that documents containing a euro sign would fail to render as expected when interchanged between ecosystems.

All of these issues have been resolved as operating systems have been upgraded to support Unicode as standard, which encodes the euro sign at U+20AC (decimal 8364).

Comparison table

Code points U+0000 to U+007F are not shown in this table currently, as they are directly mapped in all character sets listed here. The ASCII coding standard defines the original specification for the mapping of the first 0-127 characters.

The table is arranged by Unicode code point. Character sets are referred to here by their IANA names in upper case.

More information Character, Code point ...
  • The mappings for the IBM code pages are from the Unicode site supplied by Microsoft.[citation needed] The Unicode Consortium's document has links to sources giving the differences between IBM's and Microsoft's mappings for these code pages.[4]
  • IBM437 and IBM850 defined printable characters for the control code ranges. While these could not be used when printing text through DOS, as they would be trapped before reaching the screen, they could be used by applications that used screen memory directly.
  • Macintosh has an Apple logo at 0xF0, and translates it to U+F8FF in the Private Use Area for Unicode.

Notes

  1. IBM's PC DOS 2000, released in 1998, changed their definition of code page 850 to what they called modified code page 850 now including the euro sign at code point 213 instead of adding support for the new code page 858. The reason for this might have been down to existing restrictions in the implementation of the codepage switching logic under MS-DOS/PC DOS, which limited .CPI files to 64 KB in size or about six codepages maximum, a limitation, which was circumvented in some OEM versions of MS-DOS, in Windows NT, and also does not exist in DR-DOS. Further, the parser in MS-DOS/PC DOS limits the number of possible country / codepage entries in COUNTRY.SYS files to a maximum of 146 or 438, a limitation non-existent in DR-DOS. So, adding support for codepage 858 might have meant to drop another (e.g. codepage 850) at the same time, which might not have been a viable solution at that time, given that some applications were hard-wired to use codepage 850.

References

  1. "00858". Code pages by CPGID. IBM. Archived from the original on 2016-06-06. Retrieved 2016-06-06.
  2. Paul, Matthias R. (2001-08-15). "Changing codepages in FreeDOS" (Technical design specification based on fd-dev post ). Archived from the original on 2016-06-06. Retrieved 2016-06-06. The new official ID for the Multilingual "codepage 850 with EURO SIGN" is 858, not 850. IBM will switch to use 858 instead of their 850 variant with future issues of their products. [...] I can only guess why they didn't add 858 to their EGAx.CPI, COUNTRY.SYS, and KEYBOARD.SYS files in PC DOS 2000. Many third-party applications are designed to work with 850 and didn't know about 858 at the time PC DOS 2000 was released, so it's easier for everyone, but unfortunately it's not compatible. [...] As explained above, COUNTRY.SYS and KEYBOARD.SYS contain only two codepage entries for a given country in Western issues of DOS. (In Arabic and Hebrew issues there can be up to 8 codepages for one country, in theory there is no limit below the range of allowed codepages 1..65534). [...] The problem is that removing support for 850 might have caused compatibility problems with applications which are hard-wired to use 850. Adding 858 as a third choice to all the files would have increased the file and table sizes significantly. The COUNTRY.SYS file parser in MS-DOS/PC DOS IO.SYS/IBMBIO.COM sets aside a 6 Kb (for DOS 6) scratchpad to load all the info. This allows a maximum of 438 entries in a COUNTRY.SYS file to be accepted, otherwise you will get the message "COUNTRY.SYS too large.". The NLSFUNC parser does not have this limitation, and the file parsers in DR-DOS (kernel and NLSFUNC) also do not know of such a restriction. Older issues of MS-DOS/PC DOS even had a 2 Kb buffer for a maximum of 146 entries. {{cite web}}: External link in |type= (help)
  3. Paul, Matthias R. (2001-08-27). "Changing codepages in FreeDOS (follow-up)". Archived from the original on 2014-10-01. Retrieved 2013-05-08. [...] one could also create custom .CPI files in the traditional FONT style without difficulties, but you could only store up to [...] six codepages in such a file if it should be useable by MS-DOS/PC DOS (some OEM issues and NT can handle files larger than 64 Kb, but MS-DOS/PC DOS can not).

Share this article:

This article uses material from the Wikipedia article Western_Latin_character_sets_(computing), and is written by contributors. Text is available under a CC BY-SA 4.0 International License; additional terms may apply. Images, videos and audio are available under their respective licenses.