ZUnicode Namespace Reference
[Unicode]

The ZUnicode namespace defines a collection of data types and functions for working with Unicode data.


Converting buffers between UTF32, UTF16 and UTF8

void sUTF32ToUTF8 (const UTF32 *iSource, size_t iSourceCount, size_t *oSourceCount, UTF8 *iDest, size_t iDestCU, size_t *oDestCU, size_t *oCountCP)
bool sUTF8ToUTF32 (const UTF8 *iSource, size_t iSourceCU, size_t *oSourceCU, UTF32 *iDest, size_t iDestCount, size_t *oDestCount)
void sUTF32ToUTF16 (const UTF32 *iSource, size_t iSourceCount, size_t *oSourceCount, UTF16 *iDest, size_t iDestCU, size_t *oDestCU, size_t *oCountCP)
bool sUTF16ToUTF32 (const UTF16 *iSource, size_t iSourceCU, size_t *oSourceCU, UTF32 *iDest, size_t iDestCount, size_t *oDestCount)
bool sUTF16ToUTF8 (const UTF16 *iSource, size_t iSourceCU, size_t *oSourceCU, UTF8 *iDest, size_t iDestCU, size_t *oDestCU, size_t iMaxCP, size_t *oCountCP)
bool sUTF8ToUTF16 (const UTF8 *iSource, size_t iSourceCU, size_t *oSourceCU, UTF16 *iDest, size_t iDestCU, size_t *oDestCU, size_t iMaxCP, size_t *oCountCP)

Counting code units or code points, looking for zero terminator.

template<class I>
size_t sCountCU (I iSource)
template<class I>
size_t sCountCP (I iSource)
template<class I>
void sCount (I iSource, size_t *oCountCU, size_t *oCountCP)

Mapping offsets between code points and code units.

template<class I>
size_t sCPToCU (I iSource, size_t iCountCP)
template<class I>
size_t sCPToCU (I iSource, size_t iCountCU, size_t iCountCP, size_t *oCountCP)
template<class I>
size_t sCPToCU (I iSource, I iEnd, size_t iCountCP, size_t *oCountCP)
template<class I>
size_t sCUToCP (I iSource, size_t iCountCU)
template<class I>
size_t sCUToCP (I iSource, I iEnd)

Ensure a pointer is aligned with the first code unit of a valid code point.

template<class I>
void sAlign (I &ioCurrent)
template<class I>
void sAlign (I &ioCurrent, I iEnd)

Iterating, reading and writing individual code points.

template<class I>
void sInc (I &ioCurrent)
template<class I>
bool sInc (I &ioCurrent, I iEnd)
template<class I>
void sDec (I &ioCurrent)
template<class I>
bool sDec (I iStart, I &ioCurrent, I iEnd)
template<class I>
UTF32 sRead (I iCurrent)
template<class I>
bool sRead (I iCurrent, I iEnd, UTF32 &oCP)
template<class I>
UTF32 sReadInc (I &ioCurrent)
template<class I>
bool sReadInc (I &ioCurrent, I iEnd, UTF32 &oCP)
template<class I>
bool sReadInc (I &ioCurrent, I iEnd, UTF32 &oCP, size_t &ioCountSkipped)
template<class I>
UTF32 sDecRead (I &ioCurrent)
template<class I>
bool sDecRead (I iStart, I &ioCurrent, I iEnd, UTF32 &oCP)
template<class I>
bool sWrite (I iDest, I iEnd, UTF32 iCP)
template<class I>
bool sWriteInc (I &ioDest, I iEnd, UTF32 iCP)


Detailed Description

The ZUnicode namespace defines a collection of data types and functions for working with Unicode data.

See also:
Unicode

ZUnicode does not directly address internationalization and localization issues, but it is an important building block in support of such work.

ZUnicode defines three integer types used to hold code units, UTF32, UTF16 and UTF8:

and three corresponding string types:

If you need to work with individual code points, use UTF32. In general, UTF16 and UTF8 values hold only individual code units of their type. The UTF-16 or UTF-8 representation of a particular code point may fit in a single code unit, but your code should never rely on that.

Legal code points

ISO 10646 formally defines a 31 bit character set, so the range of legal code points is from 0 to 0x7FFFFFFF. This range is considered to be divided into 32768 planes of 65536 code points each. As of this writing characters have only been assigned to plane 0, the so-called Basic Multilingual Plane (BMP). There is a commitment never to assign characters beyond plane 16, so in practice the range of legal code points is restricted to 0 to 0x10FFFF (1,114,112 distinct code points). UCS-4 and UTF-8 can represent the entire 31 bit range, but UTF-16 is unable to represent code points beyond plane 16.
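The plane arithmetic above can be checked at compile time. This is only a sketch: the constants and the helpers planeCount and planeOf are illustrative names and are not part of ZUnicode.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative constants only; none of these names come from ZUnicode.
constexpr std::uint32_t kPlaneSize = 0x10000;       // 65536 code points per plane
constexpr std::uint32_t kMaxISO10646 = 0x7FFFFFFF;  // top of the 31 bit range
constexpr std::uint32_t kMaxUnicode = 0x10FFFF;     // last code point of plane 16

// Number of planes needed to cover code points 0..iMaxCP inclusive.
constexpr std::uint32_t planeCount(std::uint32_t iMaxCP) {
    return iMaxCP / kPlaneSize + 1;
}

// Plane that a given code point falls in.
constexpr std::uint32_t planeOf(std::uint32_t iCP) {
    return iCP / kPlaneSize;
}

static_assert(planeCount(kMaxISO10646) == 32768, "31 bit range spans 32768 planes");
static_assert(planeCount(kMaxUnicode) == 17, "planes 0 through 16");
static_assert(planeCount(kMaxUnicode) * kPlaneSize == 1114112, "distinct code points");
```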

Illegal code units and code unit sequences

There are two illegal code units in UTF-8, 0xFE and 0xFF. It's also possible to have sequences of UTF-8 code units that are illegal, that is, that do not map to a code point: for example, a continuation byte that is not preceded by a start byte, or a start byte that is not immediately followed by enough continuation bytes.
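As a rough illustration of the byte categories just described, the following sketch classifies a single UTF-8 code unit by its high bits. ByteKind, classify and continuationCount are hypothetical names, not ZUnicode API. The sketch follows the 31 bit scheme described here, so it treats only 0xFE and 0xFF as illegal on their own and does not reject overlong forms.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical classification of one UTF-8 code unit; not a ZUnicode type.
enum class ByteKind { Single, Continuation, Start, Illegal };

// Classify a byte by its high bits.
inline ByteKind classify(std::uint8_t iByte) {
    if (iByte < 0x80) return ByteKind::Single;        // 0xxxxxxx
    if (iByte < 0xC0) return ByteKind::Continuation;  // 10xxxxxx
    if (iByte < 0xFE) return ByteKind::Start;         // 110xxxxx up to 1111110x
    return ByteKind::Illegal;                         // 0xFE and 0xFF
}

// Number of continuation bytes a start byte calls for, or 0 for any other byte.
inline int continuationCount(std::uint8_t iByte) {
    if (classify(iByte) != ByteKind::Start) return 0;
    int count = 0;
    // Count the leading one bits after the first one.
    for (std::uint8_t mask = 0x40; iByte & mask; mask >>= 1)
        ++count;
    return count;
}
```

A sequence is then illegal when a Continuation byte appears where a new code point should start, or when a Start byte is followed by fewer than continuationCount bytes of kind Continuation.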

UTF-16 also has illegal code unit sequences, in particular a high surrogate that is not followed by a low surrogate, or a low surrogate without a preceding high surrogate.
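The pairing rule can be sketched as follows. decodeAt and the two predicates are hypothetical helpers written for this example; they are not the ZUnicode implementation, and plain std::uint16_t stands in for the UTF16 type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

inline bool isHighSurrogate(std::uint16_t iCU) { return iCU >= 0xD800 && iCU <= 0xDBFF; }
inline bool isLowSurrogate(std::uint16_t iCU) { return iCU >= 0xDC00 && iCU <= 0xDFFF; }

// Try to decode one code point starting at iSource[iIndex], where iIndex < iCount.
// Returns the number of code units consumed (1 or 2), or 0 if the unit at
// iIndex cannot start a code point.
inline std::size_t decodeAt(const std::uint16_t* iSource, std::size_t iCount,
                            std::size_t iIndex, std::uint32_t& oCP) {
    const std::uint16_t first = iSource[iIndex];
    if (isLowSurrogate(first))
        return 0;  // a low surrogate cannot start a code point
    if (!isHighSurrogate(first)) {
        oCP = first;  // ordinary single code unit
        return 1;
    }
    if (iIndex + 1 >= iCount || !isLowSurrogate(iSource[iIndex + 1]))
        return 0;  // high surrogate not followed by a low surrogate
    const std::uint32_t high = std::uint32_t(first) - 0xD800;
    const std::uint32_t low = std::uint32_t(iSource[iIndex + 1]) - 0xDC00;
    oCP = 0x10000 + (high << 10) + low;
    return 2;
}
```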

UTF32 is a 32 bit integer, and thus can represent values outside the 31 bit ISO 10646 range. So it's clearly possible to have UTF32 code units that do not map to a code point. Given that Unicode and ISO 10646 will never assign characters beyond plane 16, we further restrict valid UTF-32 code units to be in the range 0 to 0x10FFFF. Finally, code units from the high and low surrogate blocks (0xD800 to 0xDFFF) are also illegal. Strictly speaking U+FFFF and all code points of the form U+xxFFFF are illegal, but we don't filter them out.

The Unicode standard recommends that illegal code units are each decoded as U+FFFD, the replacement character. That's a good strategy when decoding a body of material in one hit, but it's ambiguous when randomly accessing data in memory. ZUnicode's convention is that illegal code units and code unit sequences are skipped. They contribute to any count of code units, but do not generate code points and do not contribute to any count of code points. So a pointer into a sequence of code units is considered to reference the first valid code point starting at or subsequent to the pointer. Conversely, illegal code points, those outside the ranges 0-0xD7FF and 0xE000-0x10FFFF, are treated as being of zero length. They will not cause generation of code units nor contribute to counts of code points.
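The "zero length" convention for illegal code points might be expressed like this. It is a sketch under the ranges stated above; isValidCP, utf8Length and utf16Length are names invented for the example, not ZUnicode functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Valid code points are 0-0xD7FF and 0xE000-0x10FFFF, as stated in the text.
inline bool isValidCP(std::uint32_t iCP) {
    return iCP <= 0xD7FF || (iCP >= 0xE000 && iCP <= 0x10FFFF);
}

// UTF-8 code units needed for iCP; zero for an illegal code point.
inline std::size_t utf8Length(std::uint32_t iCP) {
    if (!isValidCP(iCP)) return 0;
    if (iCP < 0x80) return 1;
    if (iCP < 0x800) return 2;
    if (iCP < 0x10000) return 3;
    return 4;
}

// UTF-16 code units needed for iCP; zero for an illegal code point.
inline std::size_t utf16Length(std::uint32_t iCP) {
    if (!isValidCP(iCP)) return 0;
    return iCP < 0x10000 ? 1 : 2;
}
```

Because an illegal code point needs no code units, encoding it can never run out of space, and it adds nothing to a count of code points.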

An example of mapping between offsets and code points is perhaps in order. The following represents ten UTF-8 code units, with offsets of contained bytes from zero to nine and offset ten being the end of the buffer:

Offset  Description
------  -----------
 0      Start of 3 byte sequence, with two continuation bytes following
 3      Single byte character
 4      Single byte character
 5      An out of order continuation byte (illegal)
 6      Start of 2 byte sequence, with single continuation byte following
 8      Single byte character
 9      Start byte of two byte sequence without continuation byte (illegal)
10      End of the buffer

Offset    01234567890
Value     3CCNNC2CN2
Illegal   -----X---X

The table below shows for each offset which offset will be returned/used when
decremented, accessed or incremented. You should note that illegal byte sequences
are effectively transparent to those operations.

Offset  Dec  Acc  Inc
 0       -    0    3
 1       0    3    4
 2       0    3    4
 3       0    3    4
 4       3    4    6
 5       4    6    8
 6       4    6    8
 7       6    8    -
 8       6    8    -
 9       8    -    -
10       8    -    -
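The table can be reproduced with a small stand-alone model of the skipping convention. This is a sketch rather than the ZUnicode code: sequenceLengthAt, accessOffset, incOffset, decOffset and kNone are hypothetical names, only UTF-8 forms of up to four bytes are handled, and overlong forms are not rejected. kNone plays the role of the "-" entries.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stands for the "-" entries in the table. Hypothetical, not a ZUnicode constant.
constexpr std::size_t kNone = static_cast<std::size_t>(-1);

// Length of the well-formed sequence starting at iOffset (iOffset < iSize),
// or 0 if no well-formed sequence starts there.
inline std::size_t sequenceLengthAt(const std::uint8_t* iBuf, std::size_t iSize,
                                    std::size_t iOffset) {
    const std::uint8_t b = iBuf[iOffset];
    std::size_t length = 0;
    if (b < 0x80) length = 1;
    else if (b >= 0xC0 && b < 0xE0) length = 2;
    else if (b >= 0xE0 && b < 0xF0) length = 3;
    else if (b >= 0xF0 && b < 0xF8) length = 4;
    else return 0;  // continuation byte, or a byte this sketch does not handle
    if (iOffset + length > iSize) return 0;  // truncated by the end of the buffer
    for (std::size_t i = 1; i < length; ++i)
        if ((iBuf[iOffset + i] & 0xC0) != 0x80) return 0;  // missing continuation
    return length;
}

// "Acc" column: first valid code point starting at or after iOffset.
inline std::size_t accessOffset(const std::uint8_t* iBuf, std::size_t iSize,
                                std::size_t iOffset) {
    while (iOffset < iSize && sequenceLengthAt(iBuf, iSize, iOffset) == 0)
        ++iOffset;
    return iOffset < iSize ? iOffset : kNone;
}

// "Inc" column: the code point after the one that would be accessed.
inline std::size_t incOffset(const std::uint8_t* iBuf, std::size_t iSize,
                             std::size_t iOffset) {
    const std::size_t current = accessOffset(iBuf, iSize, iOffset);
    if (current == kNone) return kNone;
    return accessOffset(iBuf, iSize, current + sequenceLengthAt(iBuf, iSize, current));
}

// "Dec" column: nearest valid code point starting before iOffset.
inline std::size_t decOffset(const std::uint8_t* iBuf, std::size_t iSize,
                             std::size_t iOffset) {
    while (iOffset > 0) {
        --iOffset;
        if (sequenceLengthAt(iBuf, iSize, iOffset) != 0) return iOffset;
    }
    return kNone;
}
```

One buffer matching the layout above is E2 82 AC 61 62 80 C3 A9 63 C3: a three byte sequence, two single bytes, a stray continuation byte, a two byte sequence, a single byte and a start byte with nothing after it.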

Bug:
CodeWarrior building for Mach-o defines wchar_t to be 32 bits in size. The CW debugger, as of 8.3 at least, always treats wchar_t as being 16 bits in size, which can make things very confusing. You can have the compiler define wchar_t to be 16 bits in size by using #pragma ushort_wchar_t on.


Function Documentation

void ZUnicode::sUTF32ToUTF8 ( const UTF32 *  iSource,
size_t  iSourceCount,
size_t *  oSourceCount,
UTF8 *  iDest,
size_t  iDestCU,
size_t *  oDestCU,
size_t *  oCountCP 
)

Read UTF32 code units from iSource, convert them into valid code points and store them as UTF8 code units starting at iDest. Do not read more than iSourceCount UTF32 code units, and do not store more than iDestCU UTF8 code units. Report the code units read and written in oSourceCount and oDestCU, and the code points written in oCountCP.

bool ZUnicode::sUTF8ToUTF32 ( const UTF8 *  iSource,
size_t  iSourceCU,
size_t *  oSourceCU,
UTF32 *  iDest,
size_t  iDestCount,
size_t *  oDestCount 
)

Read UTF8 code units from iSource, convert them into valid code points and store them as UTF32 code units starting at iDest. Do not read more than iSourceCU UTF8 code units, and do not store more than iDestCount UTF32 code units. Report the counts read and written in oSourceCU and oDestCount. Return false if fewer than iSourceCU UTF8 code units were read because there was a valid prefix at the end of the buffer that could represent a valid code point if more data were provided.
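The false return matters when data arrives in chunks: a chunk can end partway through a multi-byte sequence. The following sketch shows one way a caller could recognize that situation for itself. incompleteTailLength is a hypothetical helper, not part of ZUnicode, and it only considers UTF-8 forms of up to four bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Return the number of trailing bytes in iBuf that form the beginning of a
// multi-byte sequence whose remaining bytes have not arrived yet. Returns 0
// when the buffer ends on a sequence boundary or with bytes that are simply illegal.
inline std::size_t incompleteTailLength(const std::uint8_t* iBuf, std::size_t iSize) {
    // An incomplete prefix of a four byte form is at most three bytes long.
    for (std::size_t back = 1; back <= 3 && back <= iSize; ++back) {
        const std::uint8_t b = iBuf[iSize - back];
        if ((b & 0xC0) == 0x80)
            continue;  // continuation byte, keep looking for its start byte
        std::size_t needed = 0;
        if (b >= 0xC0 && b < 0xE0) needed = 2;
        else if (b >= 0xE0 && b < 0xF0) needed = 3;
        else if (b >= 0xF0 && b < 0xF8) needed = 4;
        // Incomplete only if the start byte wants more bytes than are present.
        return needed > back ? back : 0;
    }
    return 0;
}
```

A caller reading from a stream could keep that many bytes back and prepend them to the next chunk before converting again.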

void ZUnicode::sUTF32ToUTF16 ( const UTF32 *  iSource,
size_t  iSourceCount,
size_t *  oSourceCount,
UTF16 *  iDest,
size_t  iDestCU,
size_t *  oDestCU,
size_t *  oCountCP 
)

Read UTF32 code units from iSource, convert them into valid code points and store them as UTF16 code units starting at iDest. Do not read more than iSourceCount UTF32 code units, and do not store more than iDestCU UTF16 code units. Report the code units read and written in oSourceCount and oDestCU, and the code points written in oCountCP.

bool ZUnicode::sUTF16ToUTF32 ( const UTF16 *  iSource,
size_t  iSourceCU,
size_t *  oSourceCU,
UTF32 *  iDest,
size_t  iDestCount,
size_t *  oDestCount 
)

Read UTF16 code units from iSource, convert them into valid code points and store them as UTF32 code units starting at iDest. Do not read more than iSourceCU UTF16 code units, and do not store more than iDestCount UTF32 code units. Report the counts read and written in oSourceCU and oDestCount. Return false if fewer than iSourceCU UTF16 code units were read because there was a valid prefix at the end of the buffer that could represent a valid code point if more data were provided.

bool ZUnicode::sUTF16ToUTF8 ( const UTF16 *  iSource,
size_t  iSourceCU,
size_t *  oSourceCU,
UTF8 *  iDest,
size_t  iDestCU,
size_t *  oDestCU,
size_t  iMaxCP,
size_t *  oCountCP 
)

Read UTF16 code units from iSource, convert them into valid code points and store them as UTF8 code units starting at iDest. Do not read more than iSourceCU UTF16 code units, and do not store more than iDestCU UTF8 code units. Do not consume/generate more than iMaxCP code points. Report the counts read and written in oSourceCU and oDestCU, and the number of code points in oCountCP. Return false if fewer than iSourceCU UTF16 code units were read because there was a valid prefix at the end of the buffer that could represent a valid code point if more data were provided.

bool ZUnicode::sUTF8ToUTF16 ( const UTF8 *  iSource,
size_t  iSourceCU,
size_t *  oSourceCU,
UTF16 *  iDest,
size_t  iDestCU,
size_t *  oDestCU,
size_t  iMaxCP,
size_t *  oCountCP 
)

Read UTF8 code units from iSource, convert them into valid code points and store them as UTF16 code units starting at iDest. Do not read more than iSourceCU UTF8 code units, and do not store more than iDestCU UTF16 code units. Do not consume/generate more than iMaxCP code points. Report the counts read and written in oSourceCU and oDestCU, and the number of code points in oCountCP. Return false if fewer than iSourceCU UTF8 code units were read because there was a valid prefix at the end of the buffer that could represent a valid code point if more data were provided.

template<class I>
size_t ZUnicode::sCountCU ( I  iSource  )  [inline]

Return the number of code units between iSource and the first occurrence of a zero code unit.

template<class I>
size_t ZUnicode::sCountCP ( I  iSource  )  [inline]

Return the number of correctly encoded code points between iSource and the first occurrence of a zero code unit.

template<class I>
void ZUnicode::sCount ( I  iSource,
size_t *  oCountCU,
size_t *  oCountCP 
) [inline]

Return in oCountCU and oCountCP, respectively, the number of code units and the number of correctly encoded code points between iSource and the first occurrence of a zero code unit.
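To make the difference between the two counts concrete, here is a sketch for zero-terminated UTF-16 data. countUTF16 is a hypothetical stand-in written for this example with char16_t in place of the UTF16 type; it is not the ZUnicode implementation.

```cpp
#include <cassert>
#include <cstddef>

// Count code units and correctly encoded code points up to the zero terminator.
// Unpaired surrogates add to the code unit count but not to the code point count.
inline void countUTF16(const char16_t* iSource, std::size_t* oCountCU,
                       std::size_t* oCountCP) {
    std::size_t cu = 0;
    std::size_t cp = 0;
    while (iSource[cu] != 0) {
        const char16_t unit = iSource[cu];
        if (unit >= 0xD800 && unit <= 0xDBFF) {
            // Reading one unit ahead is safe: at worst it is the terminator.
            const char16_t next = iSource[cu + 1];
            if (next >= 0xDC00 && next <= 0xDFFF) {
                ++cp;     // a well-formed surrogate pair is one code point
                cu += 2;
                continue;
            }
            // Unpaired high surrogate: falls through and is counted as a unit only.
        } else if (unit < 0xDC00 || unit > 0xDFFF) {
            ++cp;  // ordinary code unit outside the surrogate blocks
        }
        ++cu;
    }
    if (oCountCU) *oCountCU = cu;
    if (oCountCP) *oCountCP = cp;
}
```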

template<class I>
size_t ZUnicode::sCPToCU ( I  iSource,
size_t  iCountCP 
) [inline]

Return the number of code units that must be traversed to generate iCountCP valid code points in the string buffer starting at iSource.

template<class I>
size_t ZUnicode::sCPToCU ( I  iSource,
size_t  iCountCU,
size_t  iCountCP,
size_t *  oCountCP 
) [inline]

Return the number of code units that must be traversed to generate iCountCP valid code points in the string buffer starting at iSource and extending to iSource + iCountCU. Return the number of code points actually traversed in oCountCP.

template<class I>
size_t ZUnicode::sCPToCU ( I  iSource,
I  iEnd,
size_t  iCountCP,
size_t *  oCountCP 
) [inline]

Return the number of code units that must be traversed to generate iCountCP valid code points in the string buffer starting at iSource and extending to iEnd. Return the number of code points actually traversed in oCountCP.
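A bounded mapping from a code point count to a code unit count could look like the sketch below for UTF-16 data. cpToCU is a hypothetical function that mirrors the parameter order described here; how the real function treats illegal units at the edges of the range is not specified in this text, so that part is an assumption.

```cpp
#include <cassert>
#include <cstddef>

// Walk from iSource towards iEnd until iCountCP valid code points have been
// passed or the buffer is exhausted. Returns the code units traversed and
// reports the code points actually passed in oCountCP.
inline std::size_t cpToCU(const char16_t* iSource, const char16_t* iEnd,
                          std::size_t iCountCP, std::size_t* oCountCP) {
    const char16_t* current = iSource;
    std::size_t countCP = 0;
    while (countCP < iCountCP && current < iEnd) {
        const char16_t unit = *current;
        if (unit >= 0xD800 && unit <= 0xDBFF) {
            if (current + 1 < iEnd && current[1] >= 0xDC00 && current[1] <= 0xDFFF) {
                current += 2;  // well-formed pair: two units, one code point
                ++countCP;
            } else {
                ++current;     // unpaired high surrogate: skipped, not counted
            }
        } else {
            if (unit < 0xDC00 || unit > 0xDFFF)
                ++countCP;     // ordinary code unit
            ++current;         // a stray low surrogate is skipped, not counted
        }
    }
    if (oCountCP) *oCountCP = countCP;
    return static_cast<std::size_t>(current - iSource);
}
```

When the returned oCountCP is smaller than iCountCP, the buffer ran out before the requested number of code points was reached.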

template<class I>
size_t ZUnicode::sCUToCP ( I  iSource,
size_t  iCountCU 
) [inline]

Return the number of valid code points represented by the code units between iSource and iSource + iCountCU.

template<class I>
size_t ZUnicode::sCUToCP ( I  iSource,
I  iEnd 
) [inline]

Return the number of valid code points represented by the code units between iSource and iEnd.

template<class I>
void ZUnicode::sAlign ( I &  ioCurrent  )  [inline]

If ioCurrent references the first code unit of a valid code point then leave it unchanged. Otherwise advance it until it does.

template<class I>
void ZUnicode::sAlign ( I &  ioCurrent,
I  iEnd 
) [inline]

If ioCurrent references the first code unit of a valid code point then leave it unchanged. Otherwise advance it until it does or until it equals iEnd.

template<class I>
void ZUnicode::sInc ( I &  ioCurrent  )  [inline]

Update ioCurrent to take it past the current valid code point.

template<class I>
bool ZUnicode::sInc ( I &  ioCurrent,
I  iEnd 
) [inline]

Update ioCurrent to take it past the current valid code point. If that would move ioCurrent past iEnd then return false, otherwise return true.

template<class I>
void ZUnicode::sDec ( I &  ioCurrent  )  [inline]

Decrement ioCurrent until it references a valid code point.

template<class I>
bool ZUnicode::sDec ( I  iStart,
I &  ioCurrent,
I  iEnd 
) [inline]

Decrement ioCurrent until it references a valid code point. If that would move ioCurrent past iStart then return false, otherwise return true. iEnd is passed to ensure that the function does not attempt to read beyond the end of the buffer (only actually an issue for UTF-8).

template<class I>
UTF32 ZUnicode::sRead ( I  iCurrent  )  [inline]

Return the first valid code point at or after iCurrent.

template<class I>
bool ZUnicode::sRead ( I  iCurrent,
I  iEnd,
UTF32 &  oCP 
) [inline]

Return in oCP the first valid code point at or after iCurrent. If there is no valid code point between iCurrent and iEnd then return false.

template<class I>
UTF32 ZUnicode::sReadInc ( I &  ioCurrent  )  [inline]

Return the first valid code point at or after ioCurrent, and update ioCurrent to point just past its final code unit (not necessarily at the first code unit of the next valid code point).

template<class I>
bool ZUnicode::sReadInc ( I &  ioCurrent,
I  iEnd,
UTF32 &  oCP 
) [inline]

Put in oCP the first valid code point at or after ioCurrent, and update ioCurrent to point just past its final code unit (not necessarily at the first code unit of the next valid code point). If there is no valid code point between ioCurrent and iEnd then return false.

template<class I>
bool ZUnicode::sReadInc ( I &  ioCurrent,
I  iEnd,
UTF32 &  oCP,
size_t &  ioCountSkipped 
) [inline]

Put in oCP the first valid code point at or after ioCurrent, and update ioCurrent to point just past its final code unit (not necessarily at the first code unit of the next valid code point). If there is no valid code point between ioCurrent and iEnd then return false. Additionally, add to ioCountSkipped the number of code units that were skipped.

template<class I>
UTF32 ZUnicode::sDecRead ( I &  ioCurrent  )  [inline]

Return the first valid code point starting prior to ioCurrent.

template<class I>
bool ZUnicode::sDecRead ( I  iStart,
I &  ioCurrent,
I  iEnd,
UTF32 &  oCP 
) [inline]

Put in oCP the first valid code point starting prior to ioCurrent. If there is no valid code point between iStart and ioCurrent then return false.

template<class I>
bool ZUnicode::sWrite ( I  iDest,
I  iEnd,
UTF32  iCP 
) [inline]

If iCP is a valid code point then write it to iDest. If there is insufficient space to hold the code units then return false. Writing an invalid code point will return true, as invalid code points require zero code units to represent them.

template<class I>
bool ZUnicode::sWriteInc ( I &  ioDest,
I  iEnd,
UTF32  iCP 
) [inline]

If iCP is a valid code point then write it to ioDest and advance ioDest appropriately. If there is insufficient space to hold the code units then return false. Writing an invalid code point will return true, as invalid code points require zero code units to represent them.
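The three outcomes described (written, no room, invalid and therefore trivially successful) can be seen in a small UTF-8 sketch. writeInc below is a hypothetical stand-alone function with the same parameter order, using std::uint8_t in place of UTF8; it is not the ZUnicode implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Encode iCP as UTF-8 at ioDest, never writing at or beyond iEnd.
inline bool writeInc(std::uint8_t*& ioDest, std::uint8_t* iEnd, std::uint32_t iCP) {
    const bool valid = iCP <= 0xD7FF || (iCP >= 0xE000 && iCP <= 0x10FFFF);
    if (!valid)
        return true;  // zero code units are needed, so this cannot fail
    const std::size_t length = iCP < 0x80 ? 1 : iCP < 0x800 ? 2 : iCP < 0x10000 ? 3 : 4;
    if (static_cast<std::size_t>(iEnd - ioDest) < length)
        return false;  // insufficient space, ioDest is left untouched
    if (length == 1) {
        *ioDest++ = static_cast<std::uint8_t>(iCP);
    } else if (length == 2) {
        *ioDest++ = static_cast<std::uint8_t>(0xC0 | (iCP >> 6));
        *ioDest++ = static_cast<std::uint8_t>(0x80 | (iCP & 0x3F));
    } else if (length == 3) {
        *ioDest++ = static_cast<std::uint8_t>(0xE0 | (iCP >> 12));
        *ioDest++ = static_cast<std::uint8_t>(0x80 | ((iCP >> 6) & 0x3F));
        *ioDest++ = static_cast<std::uint8_t>(0x80 | (iCP & 0x3F));
    } else {
        *ioDest++ = static_cast<std::uint8_t>(0xF0 | (iCP >> 18));
        *ioDest++ = static_cast<std::uint8_t>(0x80 | ((iCP >> 12) & 0x3F));
        *ioDest++ = static_cast<std::uint8_t>(0x80 | ((iCP >> 6) & 0x3F));
        *ioDest++ = static_cast<std::uint8_t>(0x80 | (iCP & 0x3F));
    }
    return true;
}
```

Because an invalid code point returns true without writing, a caller that needs to know whether anything was emitted has to compare ioDest before and after the call.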


Generated on Thu Jul 26 11:22:09 2007 for ZooLib by  doxygen 1.4.7