
Does Lua support Unicode?

In short, yes and no. Lua gives you bare-bones support, enough rope, and not much else. Unicode is a large and complex standard, and questions like "does Lua support Unicode" are extremely vague.

Some of the issues are:

Can I store and retrieve Unicode strings?
Can my Lua programs be written in Unicode?
Can I compare Unicode strings for equality?
Sorting strings.
Pattern matching.
Can I determine the length of a Unicode string?
Support for bracket matching, bidirectional printing, arbitrary composition of characters, and other issues that arise in high quality typesetting.

Lua strings are fully 8-bit clean, so simple uses (like storing and retrieving) are supported, but there's no built-in support for more sophisticated uses. For a fuller story, see below.

Unicode strings and Lua strings

A Lua string is an arbitrary sequence of values which have at least 8 bits (octets); they map directly into the char type of the C compiler. (This may be wider than eight bits, but eight bits are guaranteed.) Lua does not reserve any value, including NUL. That means that you can store a UTF-8 string in Lua without problems.

Note that UTF-8 is just one option for storing Unicode strings. There are many other encoding schemes, including UTF-16 and UTF-32 and their various big-endian/little-endian variants. However, all of these are simply sequences of octets and can be stored in a Lua string without problems.
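As a small illustration of the point above, the following sketch stores a UTF-8 sequence and a UTF-16LE sequence in ordinary Lua strings. The variable names are made up for this example; only the standard string functions are used.

```lua
-- U+20B4 encoded as UTF-8 is the three octets E2 82 B4,
-- written here with decimal \ddd escapes.
local hryvnia = "\226\130\180"
print(#hryvnia)                     -- number of octets, not characters
print(string.byte(hryvnia, 1, -1))  -- the raw octet values

-- Zero octets are ordinary data: "A" in UTF-16LE is the octets 41 00.
local utf16le_a = "A\0"
print(#utf16le_a)
```

Lua treats both values as opaque octet sequences; it neither knows nor cares which encoding they use.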

Input and output of strings in Lua (using the io library) uses C's stdio library. ANSI C does not require the stdio library to handle arbitrary octet sequences unless the file is opened in binary mode; furthermore, in non-binary mode, some octet sequences are converted into other ones (in order to deal with varying end-of-line markers on different platforms).

This may affect your ability to do non-binary file input and output of Unicode strings in formats other than UTF-8. UTF-8 strings will probably be safe because UTF-8 does not use control characters such as \n and \r as part of multi-octet encodings. However, there are no guarantees; if you need to be certain, you must use binary mode input and output. (If you do so, line-endings will not be converted.)
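A minimal sketch of the binary-mode approach follows. The helper names write_bytes and read_bytes are invented for this example, and it assumes os.tmpname returns a writable path on your platform.

```lua
-- Write a string to a file exactly as stored, with no newline translation.
function write_bytes(path, data)
	local f = assert(io.open(path, "wb"))  -- "b" requests binary mode
	f:write(data)
	f:close()
end

-- Read a whole file back as raw octets.
function read_bytes(path)
	local f = assert(io.open(path, "rb"))
	local data = f:read("*a")              -- "*a" reads the entire file
	f:close()
	return data
end

-- "A", newline, "B" in UTF-16LE: the newline is the octet pair 0A 00,
-- which text mode could rewrite on some platforms.
local utf16le = "A\0\n\0B\0"
local path = os.tmpname()
write_bytes(path, utf16le)
print(read_bytes(path) == utf16le)
os.remove(path)
```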

Unix file IO has been 8-bit clean for a long while. If you are not concerned with portability and are only using Unix and Unix-like operating systems, you can almost certainly not worry about the above.

If your use of Unicode is restricted to passing the strings to external libraries which support Unicode, you should be OK. For example, you should be able to extract a Unicode string from a database and pass it to a Unicode-aware graphics library. But see the sections below on pattern matching and string equality.

Unicode Lua programs

Literal Unicode strings can appear in your Lua programs. Either a UTF-8 encoded string can appear directly with 8-bit characters, or you can use the \ddd syntax (note that ddd is a decimal number, unlike in some other languages). However, there is no facility for encoding multi-octet sequences (such as \U+20B4); you would need to either manually encode them to UTF-8, or insert individual octets in the correct big-endian/little-endian order (for UTF-16 or UTF-32). (Lua 5.3 and later add the \u{XXX} escape, which inserts the UTF-8 encoding of the given code point.)
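Encoding a code point to UTF-8 by hand can be sketched with plain arithmetic, which keeps it independent of the Lua version. The function name utf8_encode is made up for this example, and the sketch does not reject surrogate code points (U+D800 to U+DFFF), which a strict encoder would.

```lua
-- Return the UTF-8 encoding of a single code point as a Lua string.
function utf8_encode(cp)
	assert(cp >= 0 and cp <= 0x10FFFF, "code point out of range")
	local floor = math.floor
	if cp < 0x80 then
		-- 1 octet: 0xxxxxxx
		return string.char(cp)
	elseif cp < 0x800 then
		-- 2 octets: 110xxxxx 10xxxxxx
		return string.char(0xC0 + floor(cp / 0x40),
		                   0x80 + cp % 0x40)
	elseif cp < 0x10000 then
		-- 3 octets: 1110xxxx 10xxxxxx 10xxxxxx
		return string.char(0xE0 + floor(cp / 0x1000),
		                   0x80 + floor(cp / 0x40) % 0x40,
		                   0x80 + cp % 0x40)
	else
		-- 4 octets: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
		return string.char(0xF0 + floor(cp / 0x40000),
		                   0x80 + floor(cp / 0x1000) % 0x40,
		                   0x80 + floor(cp / 0x40) % 0x40,
		                   0x80 + cp % 0x40)
	end
end

-- U+20B4 becomes the same three octets as the literal "\226\130\180".
print(utf8_encode(0x20B4) == "\226\130\180")
```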

Unless you are using an operating system in which a char is more than eight bits wide, you will not be able to use arbitrary Unicode characters in Lua identifiers (for the names of variables and so on). You may be able to use eight-bit characters outside of the ASCII range. Lua uses the C functions isalpha and isalnum to identify valid characters in identifiers, so it will depend on the current locale. To be honest, using characters outside of the ASCII range in Lua identifiers is not a good idea, since your programs will not compile in the standard C locale.

Comparison and Sorting

Lua string comparison (using the == operator) is done byte-by-byte. That means that == can only be used to compare Unicode strings for equality if the strings have been normalized in one of the four Unicode normalizations. (See the [Unicode FAQ on normalization] for details.) The standard Lua library does not provide any facility for normalizing Unicode strings. Consequently, non-normalized Unicode strings cannot be reliably used as table keys.

If you want to use the Unicode notion of string equality, or use Unicode strings as table keys, and you cannot guarantee that your strings are normalized, then you'll have to write or find a normalization function and use that; this is a non-trivial exercise!
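To make the problem concrete, here is a small sketch with two canonically equivalent spellings of the same letter. The variable names are made up for this example.

```lua
local precomposed = "\195\169"    -- U+00E9, e-acute as a single code point
local decomposed  = "e\204\129"   -- U+0065 U+0301, "e" plus combining acute

-- The octet sequences differ, so Lua considers the strings different.
print(precomposed == decomposed)  -- false

-- For the same reason they are two separate table keys.
local t = {}
t[precomposed] = "first"
t[decomposed]  = "second"         -- does not overwrite "first"
print(t[precomposed], t[decomposed])
```

Both strings normally display as the same character, which is exactly why unnormalized input makes == and table lookups unreliable.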

The Lua comparison operators on strings (< and <=) use the C function strcoll, which is locale dependent. This means that two strings can compare in different ways according to what the current locale is. For example, strings will compare differently under Spanish Traditional sorting than under Welsh sorting.

It may be that your operating system has a locale that implements the sorting algorithm that you want, in which case you can just use that; otherwise you will have to write a function to sort Unicode strings. This is an even more non-trivial exercise.

UTF-8 was designed so that a naive octet-by-octet comparison of two UTF-8 strings produces the same result as a naive comparison of the corresponding sequences of code points. This is also true of UTF-32BE, but I do not know of any system which uses that encoding. Unfortunately, naive octet-by-octet comparison is not the collation order used by any language.
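If code point order is good enough for your purposes, one way to get it regardless of the current locale is to compare octets explicitly instead of relying on <. The following is only a sketch; the name bytewise_less is invented for this example.

```lua
-- Strict "less than" on raw octets, independent of the C locale.
function bytewise_less(a, b)
	local n = math.min(#a, #b)
	for i = 1, n do
		local x, y = string.byte(a, i), string.byte(b, i)
		if x ~= y then
			return x < y
		end
	end
	return #a < #b  -- a proper prefix sorts before the longer string
end

local words = { "z", "\195\169", "a", "ab" }
table.sort(words, bytewise_less)
-- words is now { "a", "ab", "z", "\195\169" }: the UTF-8 e-acute sorts
-- after every ASCII letter, which is rarely what a human reader expects.
```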

(Note: sometimes people use the terms UCS-2 and UCS-4 for "two-byte" and "four-byte" encodings. These are not Unicode standards; they come from the closely corresponding ISO standard ISO/IEC 10646-1:2000 and currently differ in that they allow codes outside of the Unicode range, which runs from 0x0 to 0x10FFFF.)

Pattern Matching

Lua's pattern matching facilities work character by character, where a character is a single octet. In general, this will not work for Unicode pattern matching; for example, "%u" will not match all Unicode upper case letters. Some things will work as you want, however: you can match individual Unicode characters in a normalized Unicode string, but you might want to worry about combining character sequences. If there are no following combining characters, "a" will match only the letter a in a UTF-8 string. In UTF-16LE you could match "a%z". (Remember that you cannot use \0 in a Lua 5.1 pattern.)
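The following sketch shows what does and does not work on a UTF-8 string with the standard string functions. The variable name is made up for this example.

```lua
local s = "caf\195\169"  -- "cafe" with e-acute; the last letter is octets C3 A9

-- A plain (non-pattern) search for the whole octet sequence works.
print(string.find(s, "\195\169", 1, true))  -- 4 and 5

-- A character class treats each octet as a separate character, so both
-- octets of the e-acute are replaced individually.
print((string.gsub(s, "[\195\169]", "?")))  -- caf??

-- "." matches a single octet, so it captures only half of the character.
print(#string.match(s, "caf(.)"))           -- 1
```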

Length and string indexing

If you want to know the length of a Unicode string, there are different answers you might want according to the circumstances.

If you just want to know how many bytes the string occupies, so that you can make space for copying it into a buffer for example, then the existing Lua function string.len will work.

You might want to know how many Unicode characters are in a string. Depending on the encoding used, a single Unicode character may occupy up to four bytes. Only UTF-32LE and UTF-32BE are constant length encodings (four bytes per character); UTF-32 is mostly a constant length encoding but the first element in a UTF-32 sequence should be a "Byte Order Mark", which does not count as a character. (UTF-32 and variants are part of Unicode with the latest version, Unicode 4.0.)

Some implementations of UTF-16 assume that all characters are two bytes long, but this has not been true since Unicode version 3.0.

Happily, UTF-8 is designed so that it is relatively easy to count the number of Unicode symbols in a string: simply count the number of octets that are in the ranges 0x00 to 0x7F (inclusive) or 0xC2 to 0xF4 (inclusive). (In decimal, 0-127 and 194-244.) These are the codes which can start a UTF-8 character code. Octets 0xC0, 0xC1 and 0xF5 to 0xFF (192, 193 and 245-255) cannot appear in a conforming UTF-8 sequence; octets in the range 0x80 to 0xBF (128-191) can only appear in the second and subsequent octets of a multi-octet encoding. Remember that you cannot use \0 in a Lua 5.1 pattern.

For example, you could use the following code snippet to count UTF-8 characters in a string you knew to be conforming (it will incorrectly count some invalid characters):

        local _, count = string.gsub(unicode_string, "[^\128-\193]", "")
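Wrapped in a function, the same counting trick might be used as follows. The name utf8_count is made up for this example, and the input is assumed to be conforming UTF-8.

```lua
-- Count the octets that can start a character, i.e. everything outside
-- the range 0x80-0xC1 (128-193).
function utf8_count(s)
	local _, count = string.gsub(s, "[^\128-\193]", "")
	return count
end

print(#"h\195\169llo")             -- 6 octets
print(utf8_count("h\195\169llo"))  -- 5 characters
print(utf8_count("e\204\129"))     -- 2: the combining accent counts as well
```

The last line is a reminder that this counts code points, not what a reader would perceive as letters.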

If you want to know how many printing columns a Unicode string will occupy when you print it out using a fixed-width font (imagine you are writing something like the Unix ls program that formats its output into several columns), then that is a different answer again. That's because some Unicode characters do not have a printing width, while others are double-width characters. Combining characters are used to add accents to other letters, and generally they do not take up any extra space when printed.

So that's at least 3 different notions of length that you might want at different times. Lua provides one of them (string.len); for the others you'll need to write functions.

There's a similar issue with indexing the characters of a string by position. string.sub(s, -3) will return the last 3 bytes of the string, which is not necessarily the same as the last three characters of the string, and may or may not be a complete code.

You could use the following code snippet to iterate over UTF-8 sequences (this will simply skip over most invalid codes):

        -- string.gmatch was called string.gfind in Lua 5.0
        for uchar in string.gmatch(ustring, "([%z\1-\127\194-\244][\128-\191]*)") do
          -- something
        end
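The same kind of iteration can serve as the basis for a character-oriented counterpart to string.sub. The following is a sketch only: the name utf8_sub is made up, the pattern is a simplified one that avoids mentioning the zero octet, and stray continuation octets at the start of the string are silently skipped.

```lua
-- Return characters i..j of a UTF-8 string, with string.sub-style
-- negative indices counting from the end.
function utf8_sub(s, i, j)
	-- One non-continuation octet followed by any continuation octets.
	local chars = {}
	for ch in string.gmatch(s, "[^\128-\191][\128-\191]*") do
		chars[#chars + 1] = ch
	end
	local n = #chars
	j = j or -1
	if i < 0 then i = n + i + 1 end
	if j < 0 then j = n + j + 1 end
	if i < 1 then i = 1 end
	if j > n then j = n end
	return table.concat(chars, "", i, j)
end

local s = "na\195\175ve"             -- "naive" with i-diaeresis, 6 octets
print(string.sub(s, -3) == "\175ve") -- true: starts mid-character
print(utf8_sub(s, -3) == "\195\175ve") -- true: three whole characters
```

Splitting the whole string on every call is wasteful for long strings; it is done here only to keep the sketch short.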

UTF-8 decoding function

--[[
| bits | U+first   | U+last     | bytes | Byte_1   | Byte_2   | Byte_3   | Byte_4   | Byte_5   | Byte_6   |
+------+-----------+------------+-------+----------+----------+----------+----------+----------+----------+
|   7  | U+0000    | U+007F     |   1   | 0xxxxxxx |          |          |          |          |          |
|  11  | U+0080    | U+07FF     |   2   | 110xxxxx | 10xxxxxx |          |          |          |          |
|  16  | U+0800    | U+FFFF     |   3   | 1110xxxx | 10xxxxxx | 10xxxxxx |          |          |          |
|  21  | U+10000   | U+1FFFFF   |   4   | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |          |          |
|  26  | U+200000  | U+3FFFFFF  |   5   | 111110xx | 10xxxxxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |          |
|  31  | U+4000000 | U+7FFFFFFF |   6   | 1111110x | 10xxxxxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
--]]

This function converts a Lua string containing UTF-8 encoded characters into a Lua table holding the corresponding Unicode code points (UTF-32):

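-- Assumes a bit-operations library is available as the global `bit`
-- (for example LuaJIT's built-in module or LuaBitOp).
function Utf8to32(utf8str)
	assert(type(utf8str) == "string")
	local res, seq, val = {}, 0, nil
	for i = 1, #utf8str do
		local c = string.byte(utf8str, i)
		if seq == 0 then
			-- Start of a new sequence: store the previous code point, if any.
			table.insert(res, val)
			-- The lead octet determines how many octets the sequence has.
			seq = c < 0x80 and 1 or c < 0xE0 and 2 or c < 0xF0 and 3 or
			      c < 0xF8 and 4 or c < 0xFC and 5 or c < 0xFE and 6 or
				  error("invalid UTF-8 character sequence")
			val = bit.band(c, 2^(8-seq) - 1)
		else
			-- Continuation octet: append its low six bits.
			val = bit.bor(bit.lshift(val, 6), bit.band(c, 0x3F))
		end
		seq = seq - 1
	end
	table.insert(res, val)
	table.insert(res, 0)	-- trailing 0 terminator
	return res
end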
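Where no bit library is available, the same decoding can be sketched with plain arithmetic. This variant is an illustration, not a replacement for the function above: the name utf8_codepoints is made up, it handles only sequences of one to four octets, it does no validation, and it does not append a trailing 0.

```lua
-- Decode a UTF-8 string into a table of code points using arithmetic only.
function utf8_codepoints(s)
	local res = {}
	-- One non-continuation octet followed by any continuation octets.
	for ch in string.gmatch(s, "[^\128-\191][\128-\191]*") do
		local lead = string.byte(ch, 1)
		local cp
		if lead < 0x80 then
			cp = lead          -- 0xxxxxxx
		elseif lead < 0xE0 then
			cp = lead - 0xC0   -- 110xxxxx
		elseif lead < 0xF0 then
			cp = lead - 0xE0   -- 1110xxxx
		else
			cp = lead - 0xF0   -- 11110xxx
		end
		-- Each continuation octet contributes its low six bits.
		for k = 2, #ch do
			cp = cp * 64 + (string.byte(ch, k) - 0x80)
		end
		res[#res + 1] = cp
	end
	return res
end

local cps = utf8_codepoints("A\226\130\180")
print(cps[1] == 0x41, cps[2] == 0x20B4)
```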

More sophisticated issues

As you might have guessed by now, Lua provides no support for things like bidirectional printing or the proper formatting of Thai accents. Normally such things will be taken care of by a graphics or typography library. It would of course be possible to interface to such a library that did these things if you had access to one.

There is a little string-like package [slnunicode] with upper/lower, len/sub and pattern matching for UTF-8.
See ValidateUnicodeString for a smaller library.
[utf-8.lua] provides functions string.utf8len, string.utf8sub, string.utf8reverse, string.utf8upper, and string.utf8lower. Has upper and lowercase mappings as a separate file. Tested on Lua 5.1 and 5.2.
Another [utf-8.lua] with functions utf8charbytes, utf8len, utf8sub, and utf8replace. The latter replaces characters based on a mapping table (not provided).
[ICU4Lua] is a Lua binding to ICU (International Components for Unicode [1]), an open-source library originally developed by IBM.

See UnicodeIdentifers for platform independent Unicode Lua programs.
