Internationalization

Overview

Internationalization support is controlled by the LANG and LC_xxx environment variables. You can set all of them, but Cygwin itself only honors the variables LC_ALL, LC_CTYPE, and LANG, in this order, according to the POSIX standard. The content of these variables should follow the POSIX standard for a locale specifier. The correct form of a locale specifier is

  language[[_TERRITORY][.charset][@modifier]]

"language" is a lowercase two-character string per ISO 639-1, or, if there is no ISO 639-1 code for the language (for instance, "Lower Sorbian"), a three-character string per ISO 639-3.

"TERRITORY" is an uppercase two-character string per ISO 3166, and charset is one of a list of supported character sets. The modifier doesn't matter here (though some are recognized, see below). If you're interested in the exact description, you can find it in the online publication of the POSIX manual pages on the homepage of the Open Group.

Typical locale specifiers are

  "de_CH"          language = German, territory = Switzerland, default charset
  "fr_FR.UTF-8"    language = French, territory = France, charset = UTF-8
  "ko_KR.eucKR"    language = Korean, territory = South Korea, charset = eucKR
  "syr_SY"         language = Syriac, territory = Syria, default charset

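A locale specifier of this form can be taken apart mechanically at the "_", "." and "@" separators. The following C code is only a sketch of that idea: the struct locale_spec type and the parse_locale_spec function are hypothetical names made up for this illustration, the buffer sizes are arbitrary, and nothing is checked against ISO 639, ISO 3166, or the list of supported character sets.

```c
#include <string.h>

/* Hypothetical container for the four parts of a locale specifier. */
struct locale_spec {
    char language[16];
    char territory[16];
    char charset[32];
    char modifier[32];
};

/* Copy the bytes in [begin, end) into dst, truncating if necessary. */
static void copy_span(char *dst, size_t dst_size,
                      const char *begin, const char *end)
{
    size_t len = (size_t)(end - begin);
    if (len >= dst_size)
        len = dst_size - 1;
    memcpy(dst, begin, len);
    dst[len] = '\0';
}

/* Split "language[_TERRITORY][.charset][@modifier]" into its parts.
   Missing parts become empty strings.
   Returns 0 on success, -1 if the language part is empty. */
int parse_locale_spec(const char *spec, struct locale_spec *out)
{
    const char *end = spec + strlen(spec);
    /* The modifier starts at the first '@'. */
    const char *at = strchr(spec, '@');
    const char *main_end = at ? at : end;
    /* The charset starts at the first '.' before the modifier. */
    const char *dot = memchr(spec, '.', (size_t)(main_end - spec));
    const char *lt_end = dot ? dot : main_end;
    /* The territory starts at the first '_' before the charset. */
    const char *us = memchr(spec, '_', (size_t)(lt_end - spec));
    const char *lang_end = us ? us : lt_end;

    memset(out, 0, sizeof *out);
    if (lang_end == spec)
        return -1;
    copy_span(out->language, sizeof out->language, spec, lang_end);
    if (us)
        copy_span(out->territory, sizeof out->territory, us + 1, lt_end);
    if (dot)
        copy_span(out->charset, sizeof out->charset, dot + 1, main_end);
    if (at)
        copy_span(out->modifier, sizeof out->modifier, at + 1, end);
    return 0;
}
```

Called with "fr_FR.UTF-8", the sketch fills in the language "fr", the territory "FR", and the charset "UTF-8", and leaves the modifier empty.
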
If the locale specifier does not follow the above form, Cygwin checks if the locale is one of the locale aliases defined in the file /usr/share/locale/locale.alias. If so, and if the replacement locale name is supported by the underlying Windows, the locale is accepted, too. So, given the default content of the /usr/share/locale/locale.alias file, the examples below would be valid locale specifiers as well.

  "catalan"        defined as "ca_ES.ISO-8859-1" in locale.alias
  "japanese"       defined as "ja_JP.eucJP"      in locale.alias
  "turkish"        defined as "tr_TR.ISO-8859-9" in locale.alias

The file /usr/share/locale/locale.alias is provided by the gettext package under Cygwin.
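
If you want to inspect the alias file from your own code, a line-by-line reader is enough. The sketch below assumes the usual layout of locale.alias (an alias name, whitespace, the replacement locale, with "#" starting a comment line); parse_alias_line is a hypothetical helper, not a function offered by Cygwin or gettext.

```c
#include <stdio.h>
#include <string.h>

/* Parse one line of a locale.alias style file.
   On success, stores the alias name and its replacement locale and
   returns 1. Returns 0 for comments, empty lines, and lines that do
   not contain two whitespace-separated fields. */
int parse_alias_line(const char *line, char alias[64], char target[64])
{
    line += strspn(line, " \t");          /* skip leading blanks */
    if (*line == '#' || *line == '\0' || *line == '\n' || *line == '\r')
        return 0;                          /* comment or empty line */
    return sscanf(line, "%63s %63s", alias, target) == 2;
}
```

Feed it the lines you read with fgets from /usr/share/locale/locale.alias and compare the alias field with the value of the locale variable you're interested in.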

At application startup, the application's locale is set to the default "C" or "POSIX" locale. Under Cygwin 1.7.2 and later, this locale defaults to the ASCII character set on the application level. If you want to stick to the "C" locale and only change to another charset, you can define this by setting one of the locale environment variables to "C.charset". For instance:

  "C.ISO-8859-1"

Note

The default locale in the absence of the aforementioned locale environment variables is "C.UTF-8".
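
The lookup order LC_ALL, LC_CTYPE, LANG and the "C.UTF-8" fallback can be written down in a few lines of C. This is a sketch of the rule as described here, not Cygwin's own code; effective_ctype_locale is a hypothetical name, and treating an empty value like an unset variable follows the POSIX rule for locale variables.

```c
#include <stddef.h>

/* A variable only counts if it is set and not the empty string. */
static int is_set(const char *value)
{
    return value != NULL && value[0] != '\0';
}

/* Pick the locale specifier that decides the character set:
   LC_ALL first, then LC_CTYPE, then LANG, else the default. */
const char *effective_ctype_locale(const char *lc_all,
                                   const char *lc_ctype,
                                   const char *lang)
{
    if (is_set(lc_all))
        return lc_all;
    if (is_set(lc_ctype))
        return lc_ctype;
    if (is_set(lang))
        return lang;
    return "C.UTF-8";   /* default named in the note above */
}
```

In a real program you would pass in getenv ("LC_ALL"), getenv ("LC_CTYPE"), and getenv ("LANG").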

Windows uses the UTF-16 charset exclusively to store the names of any object used by the Operating System. This is especially important with filenames. Cygwin uses the setting of the locale environment variables LC_ALL, LC_CTYPE, and LANG, to determine how to convert Windows filenames from their UTF-16 representation to the singlebyte or multibyte character set used by Cygwin.
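
To give an idea of what such a conversion involves when the target charset is UTF-8, here is a standalone C sketch. It is not Cygwin's implementation; utf16_to_utf8 is a hypothetical function written for this illustration, and it simply rejects unpaired surrogates instead of applying any workaround.

```c
#include <stddef.h>
#include <stdint.h>

/* Convert a NUL-terminated UTF-16 string to NUL-terminated UTF-8.
   Returns the number of bytes written (without the NUL), or
   (size_t)-1 on an unpaired surrogate or a too-small buffer. */
size_t utf16_to_utf8(const uint16_t *in, char *out, size_t out_size)
{
    size_t n = 0;

    if (out_size == 0)
        return (size_t)-1;
    while (*in != 0) {
        uint32_t cp = *in++;
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            /* High surrogate: a low surrogate has to follow. */
            uint32_t lo = *in;
            if (lo < 0xDC00 || lo > 0xDFFF)
                return (size_t)-1;
            in++;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return (size_t)-1;          /* stray low surrogate */
        }
        size_t need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (n + need + 1 > out_size)    /* keep room for the NUL */
            return (size_t)-1;
        switch (need) {
        case 1:
            out[n++] = (char)cp;
            break;
        case 2:
            out[n++] = (char)(0xC0 | (cp >> 6));
            out[n++] = (char)(0x80 | (cp & 0x3F));
            break;
        case 3:
            out[n++] = (char)(0xE0 | (cp >> 12));
            out[n++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[n++] = (char)(0x80 | (cp & 0x3F));
            break;
        default:
            out[n++] = (char)(0xF0 | (cp >> 18));
            out[n++] = (char)(0x80 | ((cp >> 12) & 0x3F));
            out[n++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[n++] = (char)(0x80 | (cp & 0x3F));
            break;
        }
    }
    out[n] = '\0';
    return n;
}
```

Every valid UTF-16 name has a UTF-8 form, so this direction can only fail on malformed input or lack of space. With a singlebyte or doublebyte target charset, a conversion can additionally fail because a character simply doesn't exist in that charset.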

The setting of the locale environment variables at process startup is effective for Cygwin's internal conversions to and from the Windows UTF-16 object names for the entire lifetime of the current process. Changing the environment variables to another value changes the way filenames are converted in subsequently started child processes, but not within the same process.

However, even if one of the locale environment variables is set to some other value than "C", this only affects how Cygwin itself converts filenames. As the POSIX standard requires, it's the application's responsibility to activate that locale for its own purposes, typically by using the call

  setlocale (LC_ALL, "");

early in the application code. Again, so that this doesn't get lost: If the application calls setlocale as above, and none of the important locale variables is set in the environment, the locale is set to the default locale, which is "C.UTF-8".
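
Here is a slightly larger sketch of that call in context. It uses only the standard setlocale and nl_langinfo functions; the wrapper name codeset_after_env_locale and the fallback to the "C" locale on failure are choices made for this example, not something the text above prescribes.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>

/* Activate the locale named by the environment and report the
   character set the C library uses afterwards. If the environment
   names a locale that cannot be activated, stay in the "C" locale. */
const char *codeset_after_env_locale(void)
{
    if (setlocale(LC_ALL, "") == NULL) {
        fprintf(stderr, "locale from environment not supported, using \"C\"\n");
        setlocale(LC_ALL, "C");
    }
    return nl_langinfo(CODESET);
}
```

Call it once, early in main, before any code that depends on the character set. The exact string returned for a given locale depends on the C library.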

But what about applications which are not locale-aware? Per POSIX, they are running in the "C" or "POSIX" locale, which implies the ASCII charset. The Cygwin DLL itself, however, will nevertheless use the locale set in the environment (or the "C.UTF-8" default locale) for converting filenames etc.

When the locale in the environment specifies an ASCII charset, for example "C" or "en_US.ASCII", Cygwin will still use UTF-8 under the hood to translate filenames. This allows for easier interoperability with applications running in the default "C.UTF-8" locale.

Starting with Cygwin 1.7.2, the language and territory are used to fetch locale-dependent information from Windows. If the language and territory are not known to Windows, the setlocale function fails.

The following modifiers are recognized. Any other modifier is simply ignored for now.

For locales which use the Euro (EUR) as currency, the modifier "@euro" can be added to enforce usage of the ISO-8859-15 character set, which includes a character for the "Euro" currency sign.

The default script used for all Serbian language locales (sr_BA, sr_ME, sr_RS, and the deprecated sr_CS and sr_SP) is Cyrillic. With the "@latin" modifier it gets switched to the Latin script with the respective collation behaviour.

The default charset of the "be_BY" locale (Belarusian/Belarus) is CP1251. With the "@latin" modifier it's UTF-8.

The default charset of the "tt_RU" locale (Tatar/Russia) is ISO-8859-5. With the "@iqtelif" modifier it's UTF-8.

The default charset of the "uz_UZ" locale (Uzbek/Uzbekistan) is ISO-8859-1. With the "@cyrillic" modifier it's UTF-8.

There's a class of characters in the Unicode character set, called the "CJK Ambiguous Width" characters. For these characters, the width returned by the wcwidth/wcswidth functions is usually 1. This can be a problem with East-Asian languages, which historically use character sets where these characters have a width of 2. Therefore, wcwidth/wcswidth return 2 as the width of these characters when an East-Asian charset such as GBK or SJIS is selected, or when UTF-8 is selected and the language is specified as "zh" (Chinese), "ja" (Japanese), or "ko" (Korean). This is not correct in all circumstances, hence the locale modifier "@cjknarrow" can be used to force wcwidth/wcswidth to return 1 for the ambiguous width characters.

How to set the locale

Assume that you've set one of the aforementioned environment variables to some valid POSIX locale value, other than "C" and "POSIX". Assume further that you're living in Japan. You might want to use the language code "ja" and the territory "JP", thus setting, say, LANG to "ja_JP". You didn't set a character set, so what will Cygwin use now? Starting with Cygwin 1.7.2, the default character set is determined by the default Windows ANSI codepage for this language and territory. Cygwin uses a character set which is the typical Unix-equivalent to the Windows ANSI codepage. For instance:
```
  "en_US"           ISO-8859-1
  "el_GR"           ISO-8859-7
  "pl_PL"           ISO-8859-2
  "pl_PL@euro"      ISO-8859-15
  "ja_JP"           EUCJP
  "ko_KR"           EUCKR
  "te_IN"           UTF-8
```
You don't want to use the default character set? In that case you have to specify the charset explicitly. For instance, assume you're from Japan and don't want to use the Japanese default charset EUC-JP, but the Windows default charset SJIS. What you can do, for instance, is to set the LANG variable in the mintty Cygwin Terminal in the "Text" section of its "Options" dialog. If you're starting your Cygwin session via a batch file or a shortcut to a batch file, you can also just set LANG there:
```
  @echo off
  C:
  chdir C:\cygwin\bin
  set LANG=ja_JP.SJIS
  bash --login -i
```
Note
For a list of locales supported by your Windows machine, use the new locale -a command, which is part of the Cygwin package. For a description, see the section called “locale”.
Note
For a list of supported character sets, see the section called “List of supported character sets”.
Last, but not least, most singlebyte or doublebyte charsets have a big disadvantage. Windows filesystems use the Unicode character set in the UTF-16 encoding to store filename information. Not all characters from the Unicode character set are available in a singlebyte or doublebyte charset. While Cygwin has a workaround to access files with unusual characters (see the section called “Filenames with unusual (foreign) characters”), a better workaround is to always use the UTF-8 character set.
UTF-8 is the only multibyte character set which can represent every Unicode character.
```
  set LANG=es_MX.UTF-8
```
For a description of the Unicode standard, see the homepage of the Unicode Consortium.

The Windows Console character set

Sometimes the Windows console is used to run Cygwin applications. While terminal emulators like the Cygwin Terminal mintty or xterm have a distinct way to set the character set used for input and output, the Windows console has no such way, since it's not an application in its own right.

This problem is solved in Cygwin as follows. When a Cygwin process is started in a Windows console (either explicitly from cmd.exe, or implicitly by, for instance, running the C:\cygwin\Cygwin.bat batch file), the Console character set is determined by the setting of the aforementioned internationalization environment variables, the same way as described in the section called “How to set the locale”.

What is that good for? Why not switch the console character set according to the application's requirements? After all, the application knows if it uses localization or not. However, what if a non-localized application calls a remote application which itself is localized? This can happen with ssh or rlogin. Neither command has or needs localization, and neither ever calls setlocale. Setting one of the internationalization environment variables to the same charset as the remote machine before starting ssh or rlogin fixes that problem.

Potential Problems when using Locales

You can set the above internationalization variables not only when starting the first Cygwin process, but also in your Cygwin shell on the fly, even switch to yet another character set, and yet another. In bash for instance:

  bash$ export LC_CTYPE="nl_BE.UTF-8"

However, here's a problem. At the start of the first Cygwin process in a session, the Windows environment is converted from UTF-16 to UTF-8. The environment is another of the system objects stored in UTF-16 in Windows.

As long as the environment only contains ASCII characters, this is no problem at all. But if it contains native characters, and you're planning to use, say, GBK, the environment will contain byte sequences which are invalid in the GBK charset. This would be a problem especially in variables like PATH. To circumvent the worst problems, Cygwin converts the PATH environment variable to the charset set in the environment, if it's different from the UTF-8 charset.

Note

Per POSIX, the name of an environment variable should only consist of valid ASCII characters, and only of uppercase letters, digits, and the underscore for maximum portability.

Symbolic links, too, may pose a problem when switching charsets on the fly. A symbolic link contains the filename of the target file the symlink points to. When a symlink was created with older versions of Cygwin, the current ANSI or OEM character set was used to store the target filename, dependent on the old CYGWIN environment variable setting codepage (see the section called “Obsolete options”). If the target filename contains non-ASCII characters and you use another character set than your default ANSI/OEM charset, the target filename of the symlink is now potentially an invalid character sequence in the new character set. This behaviour is not different from the behaviour in other Operating Systems. So, if you suddenly can't access a symlink anymore which worked all these years before, maybe it's because you switched to another character set. This doesn't occur with symlinks created with Cygwin 1.7 or later.

Another problem you might encounter is that older versions of Windows did not install all charsets by default. If you are running Windows XP or older, you can open the "Regional and Language Options" portion of the Control Panel, select the "Advanced" tab, and select entries from the "Code page conversion tables" list. The following entries are useful to Cygwin: 932/SJIS, 936/GBK, 949/EUC-KR, 950/Big5, 20932/EUC-JP.

List of supported character sets

Last but not least, here's the list of currently supported character sets. The left-hand expression is the name of the charset, as you would use it in the internationalization environment variables as outlined above. Note that charset specifiers are case-insensitive. EUCJP is equivalent to eucJP or eUcJp. Writing the charset in the exact case as given in the list below is a good convention, though.

The right-hand side is the number of the equivalent Windows codepage as well as the Windows name of the codepage. They are only noted here for reference. Don't try to use the bare codepage number or the Windows name of the codepage as charset in locale specifiers, unless they happen to be identical with the left-hand side. Especially in case of the "CPxxx" style charsets, always use them with the leading "CP".

This works:

  set LC_ALL=en_US.CP437

This does not work:

  set LC_ALL=en_US.437
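
If you generate locale specifiers in your own code, two small checks catch the mistakes described above: comparing charset names without regard to case, and spotting a bare codepage number. Both functions below are hypothetical helpers written for this illustration; they don't know the list of supported charsets and are not part of Cygwin.

```c
#include <ctype.h>

/* Compare two charset names ignoring ASCII case, so that
   "EUCJP", "eucJP" and "eUcJp" are all treated as equal. */
int charset_names_equal(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;    /* both strings must end at the same point */
}

/* Return 1 if the charset part consists of digits only, as in the
   "en_US.437" example, which is missing the leading "CP". */
int is_bare_codepage_number(const char *charset)
{
    if (*charset == '\0')
        return 0;
    for (; *charset != '\0'; charset++) {
        if (!isdigit((unsigned char)*charset))
            return 0;
    }
    return 1;
}
```

A charset string for which is_bare_codepage_number returns 1 should be rewritten with the "CP" prefix before it goes into LC_ALL, LC_CTYPE, or LANG.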

You can find a full list of Windows codepages on the Microsoft MSDN page "Code Page Identifiers".

    Charset               Codepage
    -------------------   -------------------------------------------
    ASCII                 20127 (US_ASCII)
    CP437                   437 (OEM United States)
    CP720                   720 (DOS Arabic)
    CP737                   737 (OEM Greek)
    CP775                   775 (OEM Baltic)
    CP850                   850 (OEM Latin 1, Western European)
    CP852                   852 (OEM Latin 2, Central European)
    CP855                   855 (OEM Cyrillic)
    CP857                   857 (OEM Turkish)
    CP858                   858 (OEM Latin 1 + Euro Symbol)
    CP862                   862 (OEM Hebrew)
    CP866                   866 (OEM Russian)
    CP874                   874 (ANSI/OEM Thai)
    CP932                   932 (Shift_JIS, not exactly identical to SJIS)
    CP1125                 1125 (OEM Ukraine)
    CP1250                 1250 (ANSI Central European)
    CP1251                 1251 (ANSI Cyrillic)
    CP1252                 1252 (ANSI Latin 1, Western European)
    CP1253                 1253 (ANSI Greek)
    CP1254                 1254 (ANSI Turkish)
    CP1255                 1255 (ANSI Hebrew)
    CP1256                 1256 (ANSI Arabic)
    CP1257                 1257 (ANSI Baltic)
    CP1258                 1258 (ANSI/OEM Vietnamese)
    ISO-8859-1            28591 (ISO-8859-1)
    ISO-8859-2            28592 (ISO-8859-2)
    ISO-8859-3            28593 (ISO-8859-3)
    ISO-8859-4            28594 (ISO-8859-4)
    ISO-8859-5            28595 (ISO-8859-5)
    ISO-8859-6            28596 (ISO-8859-6)
    ISO-8859-7            28597 (ISO-8859-7)
    ISO-8859-8            28598 (ISO-8859-8)
    ISO-8859-9            28599 (ISO-8859-9)
    ISO-8859-10             -   (not available)
    ISO-8859-11             -   (not available)
    ISO-8859-13           28603 (ISO-8859-13)
    ISO-8859-14             -   (not available)
    ISO-8859-15           28605 (ISO-8859-15)
    ISO-8859-16             -   (not available)
    Big5                    950 (ANSI/OEM Traditional Chinese)
    EUCCN or euc-CN         936 (ANSI/OEM Simplified Chinese)
    EUCJP or euc-JP       20932 (EUC Japanese)
    EUCKR or euc-KR         949 (EUC Korean)
    GB2312                  936 (ANSI/OEM Simplified Chinese)
    GBK                     936 (ANSI/OEM Simplified Chinese)
    GEORGIAN-PS             -   (not available)
    KOI8-R                20866 (KOI8-R Russian Cyrillic)
    KOI8-U                21866 (KOI8-U Ukrainian Cyrillic)
    PT154                   -   (not available)
    SJIS                    -   (not available, almost, but not exactly CP932)
    TIS620 or TIS-620       874 (ANSI/OEM Thai)
    UTF-8 or utf8         65001 (UTF-8)
