Converting buffers between UTF32, UTF16 and UTF8 | |
void | sUTF32ToUTF8 (const UTF32 *iSource, size_t iSourceCount, size_t *oSourceCount, UTF8 *iDest, size_t iDestCU, size_t *oDestCU, size_t *oCountCP) |
bool | sUTF8ToUTF32 (const UTF8 *iSource, size_t iSourceCU, size_t *oSourceCU, UTF32 *iDest, size_t iDestCount, size_t *oDestCount) |
void | sUTF32ToUTF16 (const UTF32 *iSource, size_t iSourceCount, size_t *oSourceCount, UTF16 *iDest, size_t iDestCU, size_t *oDestCU, size_t *oCountCP) |
bool | sUTF16ToUTF32 (const UTF16 *iSource, size_t iSourceCU, size_t *oSourceCU, UTF32 *iDest, size_t iDestCount, size_t *oDestCount) |
bool | sUTF16ToUTF8 (const UTF16 *iSource, size_t iSourceCU, size_t *oSourceCU, UTF8 *iDest, size_t iDestCU, size_t *oDestCU, size_t iMaxCP, size_t *oCountCP) |
bool | sUTF8ToUTF16 (const UTF8 *iSource, size_t iSourceCU, size_t *oSourceCU, UTF16 *iDest, size_t iDestCU, size_t *oDestCU, size_t iMaxCP, size_t *oCountCP) |
Counting code units or code points, looking for zero terminator. | |
template<class I> | |
size_t | sCountCU (I iSource) |
template<class I> | |
size_t | sCountCP (I iSource) |
template<class I> | |
void | sCount (I iSource, size_t *oCountCU, size_t *oCountCP) |
Mapping offsets between code points and code units. | |
template<class I> | |
size_t | sCPToCU (I iSource, size_t iCountCP) |
template<class I> | |
size_t | sCPToCU (I iSource, size_t iCountCU, size_t iCountCP, size_t *oCountCP) |
template<class I> | |
size_t | sCPToCU (I iSource, I iEnd, size_t iCountCP, size_t *oCountCP) |
template<class I> | |
size_t | sCUToCP (I iSource, size_t iCountCU) |
template<class I> | |
size_t | sCUToCP (I iSource, I iEnd) |
Ensure a pointer is aligned with the first code unit of a valid code point. | |
template<class I> | |
void | sAlign (I &ioCurrent) |
template<class I> | |
void | sAlign (I &ioCurrent, I iEnd) |
Iterating, reading and writing individual code points. | |
template<class I> | |
void | sInc (I &ioCurrent) |
template<class I> | |
bool | sInc (I &ioCurrent, I iEnd) |
template<class I> | |
void | sDec (I &ioCurrent) |
template<class I> | |
bool | sDec (I iStart, I &ioCurrent, I iEnd) |
template<class I> | |
UTF32 | sRead (I iCurrent) |
template<class I> | |
bool | sRead (I iCurrent, I iEnd, UTF32 &oCP) |
template<class I> | |
UTF32 | sReadInc (I &ioCurrent) |
template<class I> | |
bool | sReadInc (I &ioCurrent, I iEnd, UTF32 &oCP) |
template<class I> | |
bool | sReadInc (I &ioCurrent, I iEnd, UTF32 &oCP, size_t &ioCountSkipped) |
template<class I> | |
UTF32 | sDecRead (I &ioCurrent) |
template<class I> | |
bool | sDecRead (I iStart, I &ioCurrent, I iEnd, UTF32 &oCP) |
template<class I> | |
bool | sWrite (I iDest, I iEnd, UTF32 iCP) |
template<class I> | |
bool | sWriteInc (I &ioDest, I iEnd, UTF32 iCP) |
ZUnicode defines three integer types used to hold code units:
    UTF32. A 32 bit integer. On platforms where wchar_t is 32 bits in size, UTF32 is a wchar_t, otherwise it is a uint32.
    UTF16. A 16 bit integer. On platforms where wchar_t is 16 bits in size, UTF16 is a wchar_t, otherwise it is a uint16.
    UTF8. An 8 bit integer. It is a typedef of char.
and three corresponding string types:
    string32.
    string16.
    string8. This will be equivalent to std::string.
If you need to work with individual code points use UTF32. In general UTF16 and UTF8 can hold only individual code units of their type. It may be that the UTF-16 or UTF-8 representation of a particular code point fits in a single code unit, but that should never be relied upon in your code.
ISO 10646 formally defines a 31 bit character set, so the range of legal code points is from 0 to 0x7FFFFFFF. This range is considered to be divided into 32768 planes of 65536 code points. As of this writing characters have only been assigned to plane 0, the so-called Basic Multilingual Plane (BMP). There is a commitment to never assign characters beyond plane 16, so the range of legal code points is restricted to 0 to 0x10FFFF (1,114,112 distinct code points). UCS-4 and UTF-8 can represent the entire 31 bit range, but UTF-16 is unable to represent code points beyond plane 16.
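The plane 16 limit of UTF-16 follows from how surrogate pairs carry their payload: 20 bits on top of a base of 0x10000. The sketch below illustrates that arithmetic. `encodeUTF16` is a hypothetical helper written for this example, not a ZUnicode function, and it uses `std::uint16_t` and `std::uint32_t` as stand-ins for UTF16 and UTF32.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not part of ZUnicode: encode one code point as UTF-16.
// Returns the number of code units stored in oUnits (0, 1 or 2).
inline std::size_t encodeUTF16(std::uint32_t iCP, std::uint16_t oUnits[2])
{
    if (iCP < 0x10000u) {
        // Surrogate values are reserved for pairs and are not code points.
        if (iCP >= 0xD800u && iCP <= 0xDFFFu)
            return 0;
        oUnits[0] = static_cast<std::uint16_t>(iCP);
        return 1;
    }
    if (iCP > 0x10FFFFu)
        return 0; // beyond plane 16: a surrogate pair cannot express it

    // 20 payload bits: the top 10 go in the high surrogate, the rest in the low.
    const std::uint32_t offset = iCP - 0x10000u;
    oUnits[0] = static_cast<std::uint16_t>(0xD800u + (offset >> 10));
    oUnits[1] = static_cast<std::uint16_t>(0xDC00u + (offset & 0x3FFu));
    return 2;
}
```

The largest pair, 0xDBFF followed by 0xDFFF, decodes to 0x10000 + 0xFFFFF = 0x10FFFF, which is the last code point of plane 16.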
There are two illegal code units in UTF-8, 0xFE and 0xFF. It's also possible to have sequences of UTF-8 code units which are illegal, that is, which do not map to a code point. Examples are a continuation byte that is not preceded by a start byte, and a start byte that is not immediately followed by enough continuation bytes.
UTF-16 also has illegal code unit sequences, in particular a high surrogate that is not followed by a low surrogate, or a low surrogate without a preceding high surrogate.
UTF32 is a 32 bit integer, and thus can represent values outside the 31 bit ISO 10646 range, so it's clearly possible to have UTF32 code units that do not map to a code point. Given that Unicode and ISO 10646 will never assign characters beyond plane 16, we further restrict valid UTF-32 code units to be in the range 0 to 0x10FFFF. Finally, code units from the high and low surrogate blocks (0xD800 to 0xDFFF) are also illegal. Strictly speaking U+FFFF and all code points of the form U+xxFFFF are illegal, but we don't filter them out.
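The validity rule for UTF-32 code units can be written as a single predicate. This is a minimal sketch: `isValidCodePoint` is a hypothetical name chosen for the example, and `std::uint32_t` stands in for UTF32.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical predicate, not part of ZUnicode: true for values in
// 0..0xD7FF or 0xE000..0x10FFFF, i.e. everything up to plane 16 except
// the surrogate blocks. U+FFFF and U+xxFFFF are deliberately not rejected.
inline bool isValidCodePoint(std::uint32_t iCP)
{
    return iCP <= 0xD7FFu || (iCP >= 0xE000u && iCP <= 0x10FFFFu);
}
```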
The Unicode standard recommends that illegal code units are each decoded as U+FFFD, the replacement character. That's a good strategy when decoding a body of material in one hit, but it's ambiguous when randomly accessing data in memory. ZUnicode's convention is that illegal code units and code unit sequences are skipped. They contribute to any count of code units, but do not generate code points and do not contribute to any count of code points. So a pointer into a sequence of code units is considered to reference the first valid code point starting at or subsequent to the pointer. Conversely, illegal code points, those outside the ranges 0-0xD7FF and 0xE000-0x10FFFF, are treated as being of zero length. They will not cause generation of code units nor contribute to counts of code points.
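To make the skipping convention concrete, here is a sketch of counting code points in a UTF-16 buffer where unpaired surrogates are passed over without producing a code point. `countCodePointsUTF16` is a hypothetical function written for this example, and it only approximates what a call such as sCUToCP is documented to do.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch, not ZUnicode code: count the valid code points in
// iCountCU UTF-16 code units. Illegal units are traversed but not counted.
inline std::size_t countCodePointsUTF16(const std::uint16_t* iSource, std::size_t iCountCU)
{
    std::size_t countCP = 0;
    std::size_t i = 0;
    while (i < iCountCU) {
        const std::uint16_t cu = iSource[i];
        if (cu >= 0xD800u && cu <= 0xDBFFu) {
            // High surrogate: it counts only if a low surrogate follows.
            if (i + 1 < iCountCU && iSource[i + 1] >= 0xDC00u && iSource[i + 1] <= 0xDFFFu) {
                ++countCP;
                i += 2;
            } else {
                ++i; // unpaired high surrogate: skipped
            }
        } else if (cu >= 0xDC00u && cu <= 0xDFFFu) {
            ++i; // low surrogate with no preceding high surrogate: skipped
        } else {
            ++countCP; // ordinary BMP code unit
            ++i;
        }
    }
    return countCP;
}
```

For the six code units 0x0041, 0xD83D, 0xDE00, 0xDC00, 0xD800, 0x0042 the function reports three code points: the stray low surrogate and the unpaired high surrogate occupy code units but add nothing to the code point count.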
An example of mapping between offsets and code points is perhaps in order. The following represents ten UTF-8 code units, with offsets of contained bytes from zero to nine and offset ten being the end of the buffer:
    Offset  Description
    ------  -----------
    0       Start of 3 byte sequence, with two continuation bytes following
    3       Single byte character
    4       Single byte character
    5       An out of order continuation byte (illegal)
    6       Start of 2 byte sequence, with single continuation byte following
    8       Single byte character
    9       Start byte of two byte sequence without continuation byte (illegal)
    10      End of the buffer

    Offset   01234567890
    Value    3CCNNC2CN2
    Illegal  -----X---X

The table below shows for each offset which offset will be returned/used when decremented, accessed or incremented. You should note that illegal byte sequences are effectively transparent to those operations.

    Offset  Dec  Acc  Inc
    0       -    0    3
    1       0    3    4
    2       0    3    4
    3       0    3    4
    4       3    4    6
    5       4    6    8
    6       4    6    8
    7       6    8    -
    8       6    8    -
    9       8    -    -
    10      8    -    -
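The Acc column can be reproduced with a short scan. The sketch below uses hypothetical helpers (`sequenceLength`, `isContinuation`, `alignOffset`) that model the documented convention and are not ZUnicode's implementation. It assumes start bytes announce at most four bytes and it does not check for overlong encodings.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helpers modelling the documented convention; not ZUnicode code.

// Number of bytes announced by a UTF-8 start byte, or 0 if iByte cannot
// start a sequence. Assumption: sequences longer than 4 bytes are rejected.
inline std::size_t sequenceLength(unsigned char iByte)
{
    if (iByte < 0x80) return 1;
    if (iByte >= 0xC0 && iByte < 0xE0) return 2;
    if (iByte >= 0xE0 && iByte < 0xF0) return 3;
    if (iByte >= 0xF0 && iByte < 0xF8) return 4;
    return 0; // continuation byte (0x80..0xBF) or 0xF8..0xFF
}

inline bool isContinuation(unsigned char iByte)
{
    return (iByte & 0xC0) == 0x80;
}

// Offset of the first complete sequence starting at or after iOffset,
// or iCount if no such sequence exists before the end of the buffer.
inline std::size_t alignOffset(const unsigned char* iBuf, std::size_t iCount, std::size_t iOffset)
{
    for (std::size_t start = iOffset; start < iCount; ++start) {
        const std::size_t length = sequenceLength(iBuf[start]);
        if (length == 0 || length > iCount - start)
            continue; // not a start byte, or cut off by the end of the buffer
        bool complete = true;
        for (std::size_t i = 1; i < length; ++i) {
            if (!isContinuation(iBuf[start + i])) {
                complete = false;
                break;
            }
        }
        if (complete)
            return start;
    }
    return iCount;
}
```

With the ten bytes E2 82 AC 61 62 80 C3 A9 63 C3, which have the same shape as the buffer in the table, `alignOffset` maps offsets 0 through 8 to 0, 3, 3, 3, 4, 6, 6, 8, 8 and returns the buffer length for offsets 9 and 10, where the table shows no accessible code point.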
void ZUnicode::sUTF32ToUTF8 (const UTF32 *iSource, size_t iSourceCount, size_t *oSourceCount, UTF8 *iDest, size_t iDestCU, size_t *oDestCU, size_t *oCountCP)
Read UTF32 code units from iSource, convert them into valid code points and store them as UTF8 code units starting at iDest. Do not read more than iSourceCount UTF32 code units, and do not store more than iDestCU UTF8 code units. Report the code units read and written in oSourceCount and oDestCU, and the code points written in oCountCP.
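The following sketch shows one way a conversion with this contract could behave. `convertUTF32ToUTF8` is a hypothetical stand-in that mirrors the documented parameter names and is not the library's code. It assumes that invalid source values are consumed without output and that conversion stops before a code point that no longer fits in the destination.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the documented contract; not ZUnicode's code.
inline void convertUTF32ToUTF8(const std::uint32_t* iSource, std::size_t iSourceCount,
    std::size_t* oSourceCount, unsigned char* iDest, std::size_t iDestCU,
    std::size_t* oDestCU, std::size_t* oCountCP)
{
    std::size_t read = 0, written = 0, countCP = 0;
    while (read < iSourceCount) {
        const std::uint32_t cp = iSource[read];
        const bool valid = cp <= 0xD7FFu || (cp >= 0xE000u && cp <= 0x10FFFFu);
        if (valid) {
            const std::size_t needed = cp < 0x80u ? 1 : cp < 0x800u ? 2 : cp < 0x10000u ? 3 : 4;
            if (needed > iDestCU - written)
                break; // destination full: leave this code point unread
            unsigned char* d = iDest + written;
            if (needed == 1) {
                d[0] = static_cast<unsigned char>(cp);
            } else if (needed == 2) {
                d[0] = static_cast<unsigned char>(0xC0u | (cp >> 6));
                d[1] = static_cast<unsigned char>(0x80u | (cp & 0x3Fu));
            } else if (needed == 3) {
                d[0] = static_cast<unsigned char>(0xE0u | (cp >> 12));
                d[1] = static_cast<unsigned char>(0x80u | ((cp >> 6) & 0x3Fu));
                d[2] = static_cast<unsigned char>(0x80u | (cp & 0x3Fu));
            } else {
                d[0] = static_cast<unsigned char>(0xF0u | (cp >> 18));
                d[1] = static_cast<unsigned char>(0x80u | ((cp >> 12) & 0x3Fu));
                d[2] = static_cast<unsigned char>(0x80u | ((cp >> 6) & 0x3Fu));
                d[3] = static_cast<unsigned char>(0x80u | (cp & 0x3Fu));
            }
            written += needed;
            ++countCP;
        }
        ++read; // invalid values are consumed but generate no code units
    }
    if (oSourceCount) *oSourceCount = read;
    if (oDestCU) *oDestCU = written;
    if (oCountCP) *oCountCP = countCP;
}
```

Converting the four values 0x41, 0xD800, 0x20AC, 0x1F600 into an eight byte buffer reads all four code units, writes eight bytes and reports three code points, because the surrogate value 0xD800 is of zero length. With only five bytes of destination the sketch stops after 0x20AC, having read three code units and written four.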
bool ZUnicode::sUTF8ToUTF32 (const UTF8 *iSource, size_t iSourceCU, size_t *oSourceCU, UTF32 *iDest, size_t iDestCount, size_t *oDestCount)
Read UTF8 code units from iSource, convert them into valid code points and store them as UTF32 code units starting at iDest. Do not read more than iSourceCU UTF8 code units, and do not store more than iDestCount UTF32 code units. Report the counts read and written in oSourceCU and oDestCount. Return false if fewer than iSourceCU UTF8 code units were read because there was a valid prefix at the end of the buffer that could represent a valid code point if more data were provided.
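The false return value matters when data arrives in chunks: a buffer may end partway through a multi-byte sequence. The sketch below shows how such a trailing prefix might be detected. `endsWithIncompletePrefix` is a hypothetical helper, not a ZUnicode function, and it assumes start bytes announce at most four bytes.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not part of ZUnicode. Returns true if the buffer ends
// with a start byte followed by fewer continuation bytes than it announces,
// i.e. a prefix that more data could turn into a complete sequence.
inline bool endsWithIncompletePrefix(const unsigned char* iBuf, std::size_t iCount)
{
    // Step back over at most three trailing continuation bytes (10xxxxxx).
    std::size_t trailing = 0;
    while (trailing < iCount && trailing < 3
        && (iBuf[iCount - 1 - trailing] & 0xC0) == 0x80)
        ++trailing;
    if (trailing == iCount)
        return false; // no byte left that could be a start byte

    const unsigned char lead = iBuf[iCount - 1 - trailing];
    std::size_t announced = 0;
    if (lead >= 0xC0 && lead < 0xE0) announced = 2;
    else if (lead >= 0xE0 && lead < 0xF0) announced = 3;
    else if (lead >= 0xF0 && lead < 0xF8) announced = 4;

    // Incomplete only if the start byte promises more bytes than are present.
    return announced > trailing + 1;
}
```

A caller streaming data would typically keep the unread tail, append the next chunk and call the conversion again, rather than treating the tail as illegal input.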
void ZUnicode::sUTF32ToUTF16 (const UTF32 *iSource, size_t iSourceCount, size_t *oSourceCount, UTF16 *iDest, size_t iDestCU, size_t *oDestCU, size_t *oCountCP)
Read UTF32 code units from iSource, convert them into valid code points and store them as UTF16 code units starting at iDest. Do not read more than iSourceCount UTF32 code units, and do not store more than iDestCU UTF16 code units. Report the code units read and written in oSourceCount and oDestCU, and the code points written in oCountCP.
bool ZUnicode::sUTF16ToUTF32 (const UTF16 *iSource, size_t iSourceCU, size_t *oSourceCU, UTF32 *iDest, size_t iDestCount, size_t *oDestCount)
Read UTF16 code units from iSource, convert them into valid code points and store them as UTF32 code units starting at iDest. Do not read more than iSourceCU UTF16 code units, and do not store more than iDestCount UTF32 code units. Report the counts read and written in oSourceCU and oDestCount. Return false if fewer than iSourceCU UTF16 code units were read because there was a valid prefix at the end of the buffer that could represent a valid code point if more data were provided.
bool ZUnicode::sUTF16ToUTF8 (const UTF16 *iSource, size_t iSourceCU, size_t *oSourceCU, UTF8 *iDest, size_t iDestCU, size_t *oDestCU, size_t iMaxCP, size_t *oCountCP)
Read UTF16 code units from iSource, convert them into valid code points and store them as UTF8 code units starting at iDest. Do not read more than iSourceCU UTF16 code units, and do not store more than iDestCU UTF8 code units. Do not consume/generate more than iMaxCP code points. Report the counts read and written in oSourceCU and oDestCU, and the number of code points in oCountCP. Return false if fewer than iSourceCU UTF16 code units were read because there was a valid prefix at the end of the buffer that could represent a valid code point if more data were provided.
bool ZUnicode::sUTF8ToUTF16 (const UTF8 *iSource, size_t iSourceCU, size_t *oSourceCU, UTF16 *iDest, size_t iDestCU, size_t *oDestCU, size_t iMaxCP, size_t *oCountCP)
Read UTF8 code units from iSource, convert them into valid code points and store them as UTF16 code units starting at iDest. Do not read more than iSourceCU UTF8 code units, and do not store more than iDestCU UTF16 code units. Do not consume/generate more than iMaxCP code points. Report the counts read and written in oSourceCU and oDestCU, and the number of code points in oCountCP. Return false if fewer than iSourceCU UTF8 code units were read because there was a valid prefix at the end of the buffer that could represent a valid code point if more data were provided.
size_t ZUnicode::sCountCU (I iSource) [inline]
Return the number of code units between iSource and the first occurrence of a zero code unit.
size_t ZUnicode::sCountCP (I iSource) [inline]
Return the number of correctly encoded code points between iSource and the first occurrence of a zero code unit.
void ZUnicode::sCount (I iSource, size_t *oCountCU, size_t *oCountCP) [inline]
Return both the number of code units and the number of correctly encoded code points between iSource and the first occurrence of a zero code unit.
size_t ZUnicode::sCPToCU (I iSource, size_t iCountCP) [inline]
Return the number of code units that must be traversed to generate iCountCP valid code points in the string buffer starting at iSource.
size_t ZUnicode::sCPToCU (I iSource, size_t iCountCU, size_t iCountCP, size_t *oCountCP) [inline]
Return the number of code units that must be traversed to generate iCountCP valid code points in the string buffer starting at iSource and extending to iSource + iCountCU. Return the number of code points actually traversed in oCountCP.
size_t ZUnicode::sCPToCU (I iSource, I iEnd, size_t iCountCP, size_t *oCountCP) [inline]
Return the number of code units that must be traversed to generate iCountCP valid code points in the string buffer starting at iSource and extending to iEnd. Return the number of code points actually traversed in oCountCP.
size_t ZUnicode::sCUToCP (I iSource, size_t iCountCU) [inline]
Return the number of valid code points represented by the code units between iSource and iSource + iCountCU.
size_t ZUnicode::sCUToCP (I iSource, I iEnd) [inline]
Return the number of valid code points represented by the code units between iSource and iEnd.
void ZUnicode::sAlign (I &ioCurrent) [inline]
If ioCurrent references the first code unit of a valid code point then leave it unchanged. Otherwise advance it until it does.
void ZUnicode::sAlign (I &ioCurrent, I iEnd) [inline]
If ioCurrent references the first code unit of a valid code point then leave it unchanged. Otherwise advance it until it does or until it equals iEnd.
void ZUnicode::sInc (I &ioCurrent) [inline]
Update ioCurrent to take it past the current valid code point.
bool ZUnicode::sInc (I &ioCurrent, I iEnd) [inline]
Update ioCurrent to take it past the current valid code point. If that would move ioCurrent past iEnd then return false, otherwise return true.
void ZUnicode::sDec (I &ioCurrent) [inline]
Decrement ioCurrent until it references a valid code point.
bool ZUnicode::sDec (I iStart, I &ioCurrent, I iEnd) [inline]
Decrement ioCurrent until it references a valid code point. If that would move ioCurrent before iStart then return false, otherwise return true. iEnd is passed to ensure that the function does not attempt to read beyond the end of the buffer (only actually an issue for UTF-8).
UTF32 ZUnicode::sRead (I iCurrent) [inline]
Return the first valid code point at or after iCurrent.
bool ZUnicode::sRead (I iCurrent, I iEnd, UTF32 &oCP) [inline]
Return in oCP the first valid code point at or after iCurrent. If there is no valid code point between iCurrent and iEnd then return false.
UTF32 ZUnicode::sReadInc (I &ioCurrent) [inline]
Return the first valid code point at or after ioCurrent, and update ioCurrent to point just past its final code unit (not necessarily at the first code unit of the next valid code point).
bool ZUnicode::sReadInc (I &ioCurrent, I iEnd, UTF32 &oCP) [inline]
Put in oCP the first valid code point at or after ioCurrent, and update ioCurrent to point just past its final code unit (not necessarily at the first code unit of the next valid code point). If there is no valid code point between ioCurrent and iEnd then return false.
bool ZUnicode::sReadInc (I &ioCurrent, I iEnd, UTF32 &oCP, size_t &ioCountSkipped) [inline]
Put in oCP the first valid code point at or after ioCurrent, and update ioCurrent to point just past its final code unit (not necessarily at the first code unit of the next valid code point). If there is no valid code point between ioCurrent and iEnd then return false. Additionally, add to ioCountSkipped the number of code units that were skipped.
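A sketch of this read-and-skip behaviour for UTF-16 input follows. `readIncUTF16` is a hypothetical function written for illustration and is not the library's template. It assumes that each unpaired surrogate counts as one skipped code unit, and that the pointer ends at iEnd when no valid code point is found.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch, not ZUnicode code: read the next valid code point from
// UTF-16 input, adding the number of illegal code units passed over to
// ioCountSkipped. Returns false if no valid code point precedes iEnd.
inline bool readIncUTF16(const std::uint16_t*& ioCurrent, const std::uint16_t* iEnd,
    std::uint32_t& oCP, std::size_t& ioCountSkipped)
{
    while (ioCurrent < iEnd) {
        const std::uint16_t cu = *ioCurrent++;
        if (cu < 0xD800u || cu > 0xDFFFu) {
            oCP = cu; // ordinary BMP code unit
            return true;
        }
        if (cu <= 0xDBFFu && ioCurrent < iEnd
            && *ioCurrent >= 0xDC00u && *ioCurrent <= 0xDFFFu) {
            // High surrogate followed by a low surrogate: combine the pair.
            oCP = 0x10000u + ((static_cast<std::uint32_t>(cu) - 0xD800u) << 10)
                + (static_cast<std::uint32_t>(*ioCurrent) - 0xDC00u);
            ++ioCurrent;
            return true;
        }
        ++ioCountSkipped; // unpaired surrogate: skipped
    }
    return false;
}
```

Reading the five code units 0xDC00, 0x0041, 0xD83D, 0xDE00, 0xD800 yields U+0041 and then U+1F600. A third call returns false, and the skip count ends at two: one for the leading low surrogate and one for the trailing high surrogate.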
UTF32 ZUnicode::sDecRead (I &ioCurrent) [inline]
Return the first valid code point starting prior to ioCurrent.
bool ZUnicode::sDecRead (I iStart, I &ioCurrent, I iEnd, UTF32 &oCP) [inline]
Put in oCP the first valid code point starting prior to ioCurrent. If there is no valid code point between iStart and ioCurrent then return false.
bool ZUnicode::sWrite (I iDest, I iEnd, UTF32 iCP) [inline]
If iCP is a valid code point then write it to iDest. If there is insufficient space to hold the code units then return false. Writing an invalid code point will return true, as invalid code points require zero code units to represent them.
bool ZUnicode::sWriteInc (I &ioDest, I iEnd, UTF32 iCP) [inline]
If iCP is a valid code point then write it to ioDest and advance ioDest appropriately. If there is insufficient space to hold the code units then return false. Writing an invalid code point will return true, as invalid code points require zero code units to represent them.