ZooLib: ZUnicode Namespace Reference


Converting buffers between UTF32, UTF16 and UTF8
void	sUTF32ToUTF8 (const UTF32 *iSource, size_t iSourceCount, size_t *oSourceCount, UTF8 *iDest, size_t iDestCU, size_t *oDestCU, size_t *oCountCP)
bool	sUTF8ToUTF32 (const UTF8 *iSource, size_t iSourceCU, size_t *oSourceCU, UTF32 *iDest, size_t iDestCount, size_t *oDestCount)
void	sUTF32ToUTF16 (const UTF32 *iSource, size_t iSourceCount, size_t *oSourceCount, UTF16 *iDest, size_t iDestCU, size_t *oDestCU, size_t *oCountCP)
bool	sUTF16ToUTF32 (const UTF16 *iSource, size_t iSourceCU, size_t *oSourceCU, UTF32 *iDest, size_t iDestCount, size_t *oDestCount)
bool	sUTF16ToUTF8 (const UTF16 *iSource, size_t iSourceCU, size_t *oSourceCU, UTF8 *iDest, size_t iDestCU, size_t *oDestCU, size_t iMaxCP, size_t *oCountCP)
bool	sUTF8ToUTF16 (const UTF8 *iSource, size_t iSourceCU, size_t *oSourceCU, UTF16 *iDest, size_t iDestCU, size_t *oDestCU, size_t iMaxCP, size_t *oCountCP)
Counting code units or code points, looking for zero terminator.
template<class I>
size_t	sCountCU (I iSource)
template<class I>
size_t	sCountCP (I iSource)
template<class I>
void	sCount (I iSource, size_t *oCountCU, size_t *oCountCP)
Mapping offsets between code points and code units.
template<class I>
size_t	sCPToCU (I iSource, size_t iCountCP)
template<class I>
size_t	sCPToCU (I iSource, size_t iCountCU, size_t iCountCP, size_t *oCountCP)
template<class I>
size_t	sCPToCU (I iSource, I iEnd, size_t iCountCP, size_t *oCountCP)
template<class I>
size_t	sCUToCP (I iSource, size_t iCountCU)
template<class I>
size_t	sCUToCP (I iSource, I iEnd)
Ensure a pointer is aligned with the first code unit of a valid code point.
template<class I>
void	sAlign (I &ioCurrent)
template<class I>
void	sAlign (I &ioCurrent, I iEnd)
Iterating, reading and writing individual code points.
template<class I>
void	sInc (I &ioCurrent)
template<class I>
bool	sInc (I &ioCurrent, I iEnd)
template<class I>
void	sDec (I &ioCurrent)
template<class I>
bool	sDec (I iStart, I &ioCurrent, I iEnd)
template<class I>
UTF32	sRead (I iCurrent)
template<class I>
bool	sRead (I iCurrent, I iEnd, UTF32 &oCP)
template<class I>
UTF32	sReadInc (I &ioCurrent)
template<class I>
bool	sReadInc (I &ioCurrent, I iEnd, UTF32 &oCP)
template<class I>
bool	sReadInc (I &ioCurrent, I iEnd, UTF32 &oCP, size_t &ioCountSkipped)
template<class I>
UTF32	sDecRead (I &ioCurrent)
template<class I>
bool	sDecRead (I iStart, I &ioCurrent, I iEnd, UTF32 &oCP)
template<class I>
bool	sWrite (I iDest, I iEnd, UTF32 iCP)
template<class I>
bool	sWriteInc (I &ioDest, I iEnd, UTF32 iCP)

Detailed Description

If you need to work with individual code points, use UTF32. In general, a UTF16 or UTF8 value can hold only a single code unit of its type. The UTF-16 or UTF-8 representation of a particular code point may happen to fit in a single code unit, but your code should never rely on that.
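To make the distinction between code points and code units concrete, here is a minimal sketch of how many code units a single code point occupies in each encoding. The helper names utf8UnitsFor and utf16UnitsFor are hypothetical and are not part of ZUnicode; the sketch assumes its argument is already a legal code point in the range 0 to 0x10FFFF.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helpers for illustration only, not ZUnicode functions.
// Both assume cp is a legal code point (0..0x10FFFF, not a surrogate).

// Number of UTF-8 code units needed to encode cp.
inline std::size_t utf8UnitsFor(std::uint32_t cp) {
    if (cp < 0x80) return 1;      // ASCII range
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;   // rest of the BMP
    return 4;                     // planes 1 to 16
}

// Number of UTF-16 code units needed to encode cp.
inline std::size_t utf16UnitsFor(std::uint32_t cp) {
    return cp < 0x10000 ? 1 : 2;  // beyond the BMP needs a surrogate pair
}
```

For instance, U+0041 needs one code unit in either encoding, while U+20AC needs three UTF-8 code units but only one UTF-16 code unit, which is why a single UTF8 or UTF16 value cannot be assumed to hold a whole code point.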

Legal code points

ISO 10646 formally defines a 31 bit character set, so the range of legal code points is nominally 0 to 0x7FFFFFFF. This range is considered to be divided into 32768 planes of 65536 code points each. As of this writing characters have only been assigned to plane 0, the so-called Basic Multilingual Plane (BMP). There is a commitment never to assign characters beyond plane 16, so the range of legal code points is restricted to 0 to 0x10FFFF (1,114,112 distinct code points). UCS-4 and UTF-8 can represent the entire 31 bit range, but UTF-16 is unable to represent code points beyond plane 16.

Illegal code units and code unit sequences

There are two illegal code units in UTF-8, 0xFE and 0xFF. It's also possible to have sequences of UTF-8 code units which are illegal, that is, which do not map to a code point: for example, a continuation byte that is not preceded by a start byte, or a start byte that is not immediately followed by enough continuation bytes.
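The byte categories mentioned above can be sketched as follows. This is an illustration only: Utf8Kind, classifyUtf8 and continuationsAfter are hypothetical names, not ZUnicode API, and the sketch follows the 31 bit view taken in this description, in which every byte from 0xC0 to 0xFD is structurally a start byte.

```cpp
#include <cstdint>

// Hypothetical classification for illustration, not part of ZUnicode.
enum class Utf8Kind { Single, Continuation, Start, Illegal };

inline Utf8Kind classifyUtf8(std::uint8_t b) {
    if (b < 0x80) return Utf8Kind::Single;        // 0xxxxxxx
    if (b < 0xC0) return Utf8Kind::Continuation;  // 10xxxxxx
    if (b < 0xFE) return Utf8Kind::Start;         // 110xxxxx up to 1111110x
    return Utf8Kind::Illegal;                     // 0xFE and 0xFF
}

// Number of continuation bytes a start byte promises (0 for anything else).
inline int continuationsAfter(std::uint8_t b) {
    if (classifyUtf8(b) != Utf8Kind::Start) return 0;
    int n = 0;
    // Count the one-bits that follow the leading one-bit:
    // 110xxxxx has one, 1110xxxx has two, and so on.
    for (std::uint8_t mask = 0x40; b & mask; mask >>= 1) ++n;
    return n;
}
```

A sequence is then illegal if a Continuation byte appears where a start was expected, or if a Start byte is followed by fewer Continuation bytes than continuationsAfter reports.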

UTF-16 also has illegal code unit sequences, in particular a high surrogate that is not followed by a low surrogate, or a low surrogate without a preceding high surrogate.
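As an illustration of the surrogate rules, the following sketch counts the code points in a UTF-16 buffer while passing over unpaired surrogates. The function names are hypothetical and char16_t stands in for the UTF16 type; it is not the ZUnicode implementation.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helpers for illustration, not part of ZUnicode.
inline bool isHighSurrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool isLowSurrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Count code points in [src, src + count), ignoring unpaired surrogates.
inline std::size_t countUtf16CodePoints(const char16_t* src, std::size_t count) {
    std::size_t cps = 0;
    for (std::size_t i = 0; i < count; ++i) {
        if (isHighSurrogate(src[i])) {
            if (i + 1 < count && isLowSurrogate(src[i + 1])) {
                ++cps;  // well-formed pair: one code point, two code units
                ++i;
            }
            // Otherwise the high surrogate is unpaired and contributes nothing.
        } else if (!isLowSurrogate(src[i])) {
            ++cps;      // ordinary BMP code unit
        }
        // A low surrogate reached here had no preceding high surrogate.
    }
    return cps;
}
```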

UTF32 is a 32 bit integer, and thus can represent values outside the 31 bit ISO 10646 range. So it's clearly possible to have UTF32 code units that do not map to a code point. Given that Unicode and ISO 10646 will never assign characters beyond plane 16, we further restrict valid UTF-32 code units to be in the range 0 to 0x10FFFF. Finally, code units from the high and low surrogate blocks (0xD800 to 0xDFFF) are also illegal. Strictly speaking U+FFFF and all code points of the form U+xxFFFF are illegal, but we don't filter them out.

The Unicode standard recommends that illegal code units are each decoded as U+FFFD, the replacement character. That's a good strategy when decoding a body of material in one hit, but it's ambiguous when randomly accessing data in memory. ZUnicode's convention is that illegal code units and code unit sequences are skipped. They contribute to any count of code units, but do not generate code points and do not contribute to any count of code points. So a pointer into a sequence of code units is considered to reference the first valid code point starting at or subsequent to the pointer. Conversely, illegal code points, those outside the ranges 0-0xD7FF and 0xE000-0x10FFFF, are treated as being of zero length. They will not cause generation of code units nor contribute to counts of code points.
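The validity ranges stated above translate directly into a predicate. The sketch below uses hypothetical names (isValidCodePoint, countValidCodePoints) and std::uint32_t in place of UTF32; it only illustrates the convention that illegal values count as code units but not as code points.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helpers for illustration, not part of ZUnicode.

// True for values in 0..0xD7FF or 0xE000..0x10FFFF.
// Values such as U+FFFF are deliberately not filtered out.
inline bool isValidCodePoint(std::uint32_t cp) {
    return cp <= 0xD7FF || (cp >= 0xE000 && cp <= 0x10FFFF);
}

// Every element is one code unit; only valid ones count as code points.
inline std::size_t countValidCodePoints(const std::uint32_t* src, std::size_t count) {
    std::size_t cps = 0;
    for (std::size_t i = 0; i < count; ++i) {
        if (isValidCodePoint(src[i])) ++cps;
    }
    return cps;
}
```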

An example of mapping between offsets and code points is perhaps in order. The following represents ten UTF-8 code units, with offsets of contained bytes from zero to nine and offset ten being the end of the buffer:

Offset  Description
------  -----------
 0      Start of 3 byte sequence, with two continuation bytes following
 3      Single byte character
 4      Single byte character
 5      An out of order continuation byte (illegal)
 6      Start of 2 byte sequence, with single continuation byte following
 8      Single byte character
 9      Start byte of two byte sequence without continuation byte (illegal)
10      End of the buffer

Offset    01234567890
Value     3CCNNC2CN2
Illegal   -----X---X

The table below shows for each offset which offset will be returned/used when
decremented, accessed or incremented. You should note that illegal byte sequences
are effectively transparent to those operations.

Offset  Dec  Acc  Inc
 0       -    0    3
 1       0    3    4
 2       0    3    4
 3       0    3    4
 4       3    4    6
 5       4    6    8
 6       4    6    8
 7       6    8    -
 8       6    8    -
 9       8    -    -
10       8    -    -
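The table can be reproduced with a small standalone sketch. Everything here is an assumption made for illustration: the byte values chosen for kExample, the names seqLength, accOffset, incOffset and decOffset, and the restriction to sequences of one to four bytes with no overlong checking. It works on offsets rather than pointers and is not the ZUnicode implementation.

```cpp
#include <cstddef>
#include <cstdint>

// Returned where the table shows "-".
constexpr std::size_t kNone = static_cast<std::size_t>(-1);

// One possible buffer matching the layout 3CCNNC2CN2 (hypothetical values).
const std::uint8_t kExample[10] = {
    0xE2, 0x82, 0xAC,  // offset 0: three byte sequence
    'a', 'b',          // offsets 3 and 4: single byte characters
    0x80,              // offset 5: stray continuation byte (illegal)
    0xC3, 0xA9,        // offset 6: two byte sequence
    'c',               // offset 8: single byte character
    0xC3               // offset 9: start byte with nothing after it (illegal)
};

// Length of the well-formed sequence starting at offset i, or 0 if none.
inline std::size_t seqLength(const std::uint8_t* b, std::size_t len, std::size_t i) {
    if (i >= len) return 0;
    std::size_t need;
    if (b[i] < 0x80) return 1;
    else if ((b[i] & 0xE0) == 0xC0) need = 1;
    else if ((b[i] & 0xF0) == 0xE0) need = 2;
    else if ((b[i] & 0xF8) == 0xF0) need = 3;
    else return 0;                      // continuation or unsupported byte
    if (i + need >= len) return 0;      // sequence would run off the end
    for (std::size_t k = 1; k <= need; ++k)
        if ((b[i + k] & 0xC0) != 0x80) return 0;
    return need + 1;
}

// "Acc": offset of the first valid code point at or after off.
inline std::size_t accOffset(const std::uint8_t* b, std::size_t len, std::size_t off) {
    for (std::size_t i = off; i < len; ++i)
        if (seqLength(b, len, i) > 0) return i;
    return kNone;
}

// "Inc": offset of the valid code point following the one accessed at off.
inline std::size_t incOffset(const std::uint8_t* b, std::size_t len, std::size_t off) {
    const std::size_t a = accOffset(b, len, off);
    if (a == kNone) return kNone;
    return accOffset(b, len, a + seqLength(b, len, a));
}

// "Dec": offset of the nearest valid code point starting before off.
inline std::size_t decOffset(const std::uint8_t* b, std::size_t len, std::size_t off) {
    for (std::size_t i = off; i-- > 0;)
        if (seqLength(b, len, i) > 0) return i;
    return kNone;
}
```

Calling accOffset(kExample, 10, 5), for example, steps over the stray continuation byte and yields 6, matching the Acc column for offset 5.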

Function Documentation

void ZUnicode::sUTF32ToUTF8	(	const UTF32 *	iSource,
		size_t	iSourceCount,
		size_t *	oSourceCount,
		UTF8 *	iDest,
		size_t	iDestCU,
		size_t *	oDestCU,
		size_t *	oCountCP
	)

Read UTF32 code units from iSource, convert them into valid code points and store them as UTF8 code units starting at iDest. Do not read more than iSourceCount UTF32 code units, and do not store more than iDestCU UTF8 code units. Report the code units read and written in oSourceCount and oDestCU, and the code points written in oCountCP.

bool ZUnicode::sUTF8ToUTF32	(	const UTF8 *	iSource,
		size_t	iSourceCU,
		size_t *	oSourceCU,
		UTF32 *	iDest,
		size_t	iDestCount,
		size_t *	oDestCount
	)

Read UTF8 code units from iSource, convert them into valid code points and store them as UTF32 code units starting at iDest. Do not read more than iSourceCU UTF8 code units, and do not store more than iDestCount UTF32 code units. Report the counts read and written in oSourceCU and oDestCount. Return false if fewer than iSourceCU UTF8 code units were read because there was a valid prefix at the end of the buffer that could represent a valid code point if more data were provided.
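The contract described above, in particular the false return for a truncated sequence at the end of the input, can be sketched with a standalone function of the same parameter shape. utf8ToUtf32Sketch is a hypothetical name; the sketch assumes sequences of at most four bytes, does not reject overlong encodings, and is not the ZooLib implementation.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in mirroring the parameter shape of sUTF8ToUTF32.
inline bool utf8ToUtf32Sketch(const std::uint8_t* iSource, std::size_t iSourceCU,
                              std::size_t* oSourceCU,
                              std::uint32_t* iDest, std::size_t iDestCount,
                              std::size_t* oDestCount) {
    std::size_t in = 0, out = 0;
    bool complete = true;
    while (in < iSourceCU && out < iDestCount) {
        const std::uint8_t b = iSource[in];
        std::size_t need;
        std::uint32_t cp;
        if (b < 0x80) { need = 0; cp = b; }
        else if ((b & 0xE0) == 0xC0) { need = 1; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { need = 2; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { need = 3; cp = b & 0x07; }
        else { ++in; continue; }  // stray continuation or unsupported byte: skip
        std::size_t k = 0;
        while (k < need && in + 1 + k < iSourceCU &&
               (iSource[in + 1 + k] & 0xC0) == 0x80) {
            cp = (cp << 6) | (iSource[in + 1 + k] & 0x3F);
            ++k;
        }
        if (k < need) {
            if (in + 1 + k == iSourceCU) {
                complete = false;  // valid prefix cut off by the end of the input
                break;             // leave it unread for the caller to resupply
            }
            ++in;                  // broken sequence: skip its start byte
            continue;
        }
        in += need + 1;
        // Illegal code points are consumed but generate no output.
        if (cp <= 0xD7FF || (cp >= 0xE000 && cp <= 0x10FFFF)) iDest[out++] = cp;
    }
    if (oSourceCU) *oSourceCU = in;
    if (oDestCount) *oDestCount = out;
    return complete;
}
```

A caller that receives false would typically keep the unread tail (iSourceCU minus the reported source count), append more data and call again.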

void ZUnicode::sUTF32ToUTF16	(	const UTF32 *	iSource,
		size_t	iSourceCount,
		size_t *	oSourceCount,
		UTF16 *	iDest,
		size_t	iDestCU,
		size_t *	oDestCU,
		size_t *	oCountCP
	)

Read UTF32 code units from iSource, convert them into valid code points and store them as UTF16 code units starting at iDest. Do not read more than iSourceCount UTF32 code units, and do not store more than iDestCU UTF16 code units. Report the code units read and written in oSourceCount and oDestCU, and the code points written in oCountCP.

bool ZUnicode::sUTF16ToUTF32	(	const UTF16 *	iSource,
		size_t	iSourceCU,
		size_t *	oSourceCU,
		UTF32 *	iDest,
		size_t	iDestCount,
		size_t *	oDestCount
	)

Read UTF16 code units from iSource, convert them into valid code points and store them as UTF32 code units starting at iDest. Do not read more than iSourceCU UTF16 code units, and do not store more than iDestCount UTF32 code units. Report the counts read and written in oSourceCU and oDestCount. Return false if fewer than iSourceCU UTF16 code units were read because there was a valid prefix at the end of the buffer that could represent a valid code point if more data were provided.

bool ZUnicode::sUTF16ToUTF8	(	const UTF16 *	iSource,
		size_t	iSourceCU,
		size_t *	oSourceCU,
		UTF8 *	iDest,
		size_t	iDestCU,
		size_t *	oDestCU,
		size_t	iMaxCP,
		size_t *	oCountCP
	)

Read UTF16 code units from iSource, convert them into valid code points and store them as UTF8 code units starting at iDest. Do not read more than iSourceCU UTF16 code units, and do not store more than iDestCU UTF8 code units. Do not consume/generate more than iMaxCP code points. Report the counts read and written in oSourceCU and oDestCU, and the number of code points in oCountCP. Return false if fewer than iSourceCU UTF16 code units were read because there was a valid prefix at the end of the buffer that could represent a valid code point if more data were provided.

bool ZUnicode::sUTF8ToUTF16	(	const UTF8 *	iSource,
		size_t	iSourceCU,
		size_t *	oSourceCU,
		UTF16 *	iDest,
		size_t	iDestCU,
		size_t *	oDestCU,
		size_t	iMaxCP,
		size_t *	oCountCP
	)

Read UTF8 code units from iSource, convert them into valid code points and store them as UTF16 code units starting at iDest. Do not read more than iSourceCU UTF8 code units, and do not store more than iDestCU UTF16 code units. Do not consume/generate more than iMaxCP code points. Report the counts read and written in oSourceCU and oDestCU, and the number of code points in oCountCP. Return false if fewer than iSourceCU UTF8 code units were read because there was a valid prefix at the end of the buffer that could represent a valid code point if more data were provided.

template<class I>

size_t ZUnicode::sCountCU ( I iSource ) [inline]

Return the number of code units between iSource and the first occurrence of a zero code unit.

template<class I>

size_t ZUnicode::sCountCP ( I iSource ) [inline]

Return the number of correctly encoded code points between iSource and the first occurrence of a zero code unit.

template<class I>

void ZUnicode::sCount	(	I	iSource,
		size_t *	oCountCU,
		size_t *	oCountCP
	)			[inline]

Return both the number of code units and the number of correctly encoded code points between iSource and the first occurrence of a zero code unit.

template<class I>

size_t ZUnicode::sCPToCU	(	I	iSource,
		size_t	iCountCP
	)			[inline]

Return the number of code units that must be traversed to generate iCountCP valid code points in the string buffer starting at iSource.

template<class I>

size_t ZUnicode::sCPToCU	(	I	iSource,
		size_t	iCountCU,
		size_t	iCountCP,
		size_t *	oCountCP
	)			[inline]

Return the number of code units that must be traversed to generate iCountCP valid code points in the string buffer starting at iSource and extending to iSource + iCountCU. Return the number of code points actually traversed in oCountCP.

template<class I>

size_t ZUnicode::sCPToCU	(	I	iSource,
		I	iEnd,
		size_t	iCountCP,
		size_t *	oCountCP
	)			[inline]

Return the number of code units that must be traversed to generate iCountCP valid code points in the string buffer starting at iSource and extending to iEnd. Return the number of code points actually traversed in oCountCP.
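For UTF-16, the mapping from a code point count to a code unit count might look like the sketch below. cpToCuUtf16 is a hypothetical name with the same parameter shape as the iSource/iEnd overload, char16_t stands in for UTF16, and unpaired surrogates are traversed without being counted, following the convention in the detailed description.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical UTF-16 analogue of the iSource/iEnd overload, for illustration.
inline std::size_t cpToCuUtf16(const char16_t* iSource, const char16_t* iEnd,
                               std::size_t iCountCP, std::size_t* oCountCP) {
    const char16_t* p = iSource;
    std::size_t cps = 0;
    while (cps < iCountCP && p < iEnd) {
        const char16_t u = *p;
        if (u >= 0xD800 && u <= 0xDBFF) {
            if (p + 1 < iEnd && p[1] >= 0xDC00 && p[1] <= 0xDFFF) {
                p += 2;  // surrogate pair: two code units, one code point
                ++cps;
            } else {
                ++p;     // unpaired high surrogate: traversed, not counted
            }
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            ++p;         // unpaired low surrogate: traversed, not counted
        } else {
            ++p;         // BMP code unit: one code point
            ++cps;
        }
    }
    if (oCountCP) *oCountCP = cps;
    return static_cast<std::size_t>(p - iSource);
}
```

If the buffer ends before iCountCP code points have been found, the returned code unit count covers the whole buffer and the value stored through oCountCP is smaller than iCountCP.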

template<class I>

size_t ZUnicode::sCUToCP	(	I	iSource,
		size_t	iCountCU
	)			[inline]

Return the number of valid code points represented by the code units between iSource and iSource + iCountCU.

template<class I>

size_t ZUnicode::sCUToCP	(	I	iSource,
		I	iEnd
	)			[inline]

Return the number of valid code points represented by the code units between iSource and iEnd.

template<class I>

void ZUnicode::sAlign ( I & ioCurrent ) [inline]

If ioCurrent references the first code unit of a valid code point then leave it unchanged. Otherwise advance it until it does.

template<class I>

void ZUnicode::sAlign	(	I &	ioCurrent,
		I	iEnd
	)			[inline]

If ioCurrent references the first code unit of a valid code point then leave it unchanged. Otherwise advance it until it does or until it equals iEnd.

template<class I>

void ZUnicode::sInc ( I & ioCurrent ) [inline]

Update ioCurrent to take it past the current valid code point.

template<class I>

bool ZUnicode::sInc	(	I &	ioCurrent,
		I	iEnd
	)			[inline]

Update ioCurrent to take it past the current valid code point. If that would move ioCurrent past iEnd then return false, otherwise return true.

template<class I>

void ZUnicode::sDec ( I & ioCurrent ) [inline]

Decrement ioCurrent until it references a valid code point.

template<class I>

bool ZUnicode::sDec	(	I	iStart,
		I &	ioCurrent,
		I	iEnd
	)			[inline]

Decrement ioCurrent until it references a valid code point. If that would move ioCurrent past iStart then return false, otherwise return true. iEnd is passed to ensure that the function does not attempt to read beyond the end of the buffer (only actually an issue for UTF-8).

template<class I>

UTF32 ZUnicode::sRead ( I iCurrent ) [inline]

Return the first valid code point at or after iCurrent.

template<class I>

bool ZUnicode::sRead	(	I	iCurrent,
		I	iEnd,
		UTF32 &	oCP
	)			[inline]

Return in oCP the first valid code point at or after iCurrent. If there is no valid code point between iCurrent and iEnd then return false.

template<class I>

UTF32 ZUnicode::sReadInc ( I & ioCurrent ) [inline]

Return the first valid code point at or after ioCurrent, and update ioCurrent to point just past its final code unit (not necessarily at the first code unit of the next valid code point).

template<class I>

bool ZUnicode::sReadInc	(	I &	ioCurrent,
		I	iEnd,
		UTF32 &	oCP
	)			[inline]

Put in oCP the first valid code point at or after ioCurrent, and update ioCurrent to point just past its final code unit (not necessarily at the first code unit of the next valid code point). If there is no valid code point between ioCurrent and iEnd then return false.

template<class I>

bool ZUnicode::sReadInc	(	I &	ioCurrent,
		I	iEnd,
		UTF32 &	oCP,
		size_t &	ioCountSkipped
	)			[inline]

template<class I>

UTF32 ZUnicode::sDecRead ( I & ioCurrent ) [inline]

Return the first valid code point starting prior to ioCurrent.

template<class I>

bool ZUnicode::sDecRead	(	I	iStart,
		I &	ioCurrent,
		I	iEnd,
		UTF32 &	oCP
	)			[inline]

Put in oCP the first valid code point starting prior to ioCurrent. If there is no valid code point between iStart and ioCurrent then return false.

template<class I>

bool ZUnicode::sWrite	(	I	iDest,
		I	iEnd,
		UTF32	iCP
	)			[inline]

If iCP is a valid code point then write it to iDest. If there is insufficient space to hold the code units then return false. Writing an invalid code point will return true, as invalid code points require zero code units to represent them.

template<class I>

bool ZUnicode::sWriteInc	(	I &	ioDest,
		I	iEnd,
		UTF32	iCP
	)			[inline]

If iCP is a valid code point then write it to ioDest and advance ioDest appropriately. If there is insufficient space to hold the code units then return false. Writing an invalid code point will return true, as invalid code points require zero code units to represent them.
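The three outcomes described above (written, no room, invalid and therefore zero length) can be sketched for a UTF-8 destination as follows. writeIncUtf8 is a hypothetical name and std::uint8_t stands in for UTF8; this is an illustration of the contract, not the library's code.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical UTF-8 analogue of sWriteInc, for illustration only.
inline bool writeIncUtf8(std::uint8_t*& ioDest, std::uint8_t* iEnd, std::uint32_t iCP) {
    // Invalid code points need zero code units, so writing them succeeds.
    if (iCP > 0x10FFFF || (iCP >= 0xD800 && iCP <= 0xDFFF))
        return true;
    const std::size_t n = iCP < 0x80 ? 1 : iCP < 0x800 ? 2 : iCP < 0x10000 ? 3 : 4;
    if (static_cast<std::size_t>(iEnd - ioDest) < n)
        return false;  // not enough room; ioDest is left unchanged
    if (n == 1) {
        ioDest[0] = static_cast<std::uint8_t>(iCP);
    } else {
        // Start byte: marker bits for the length, then the top payload bits.
        static const std::uint8_t kLead[5] = {0, 0, 0xC0, 0xE0, 0xF0};
        ioDest[0] = static_cast<std::uint8_t>(kLead[n] | (iCP >> (6 * (n - 1))));
        // Continuation bytes: 10xxxxxx, six payload bits each.
        for (std::size_t k = 1; k < n; ++k)
            ioDest[k] = static_cast<std::uint8_t>(
                0x80 | ((iCP >> (6 * (n - 1 - k))) & 0x3F));
    }
    ioDest += n;
    return true;
}
```

Because an invalid code point returns true without advancing ioDest, a caller looping over a UTF32 buffer only has to stop on false, which always means the destination is full.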
