Functions
substr	c4::skip_bom (substr s)
	skip the Byte Order Mark, or get the full string if there is no Byte Order Mark.
csubstr	c4::skip_bom (csubstr s)
	skip the Byte Order Mark, or get the full string if there is no Byte Order Mark
substr	c4::get_bom (substr s)
	get the Byte Order Mark, or an empty string if there is no Byte Order Mark
csubstr	c4::get_bom (csubstr s)
	get the Byte Order Mark, or an empty string if there is no Byte Order Mark
size_t	c4::first_non_bom (csubstr s)
	return the position of the first character not belonging to the Byte Order Mark, or 0 if there is no Byte Order Mark.
substr	c4::decode_code_point (substr out, csubstr code_point)
	decode the given `code_point` (a hexadecimal string), writing its UTF-8 bytes into the output string `out`.
size_t	c4::decode_code_point (uint8_t *buf, size_t buflen, uint32_t code)
	decode the given code point `code`, writing its UTF-8 bytes into the output buffer `buf`, of size `buflen`

Detailed Description

Function Documentation

◆ skip_bom() [1/2]

substr c4::skip_bom ( substr s )

skip the Byte Order Mark, or get the full string if there is no Byte Order Mark.

See also: Implements the Byte Order Marks as described in https://en.wikipedia.org/wiki/Byte_order_mark#Byte-order_marks_by_encoding

Definition at line 103 of file utf.cpp.

{
    return s.sub(first_non_bom(s));
}

◆ skip_bom() [2/2]

csubstr c4::skip_bom ( csubstr s )

skip the Byte Order Mark, or get the full string if there is no Byte Order Mark

See also: Implements the Byte Order Marks as described in https://en.wikipedia.org/wiki/Byte_order_mark#Byte-order_marks_by_encoding

Definition at line 107 of file utf.cpp.

{
    return s.sub(first_non_bom(s));
}

◆ get_bom() [1/2]

substr c4::get_bom ( substr s )

get the Byte Order Mark, or an empty string if there is no Byte Order Mark

See also: Implements the Byte Order Marks as described in https://en.wikipedia.org/wiki/Byte_order_mark#Byte-order_marks_by_encoding

Definition at line 95 of file utf.cpp.

{
    return s.first(first_non_bom(s));
}

◆ get_bom() [2/2]

csubstr c4::get_bom ( csubstr s )

get the Byte Order Mark, or an empty string if there is no Byte Order Mark

See also: Implements the Byte Order Marks as described in https://en.wikipedia.org/wiki/Byte_order_mark#Byte-order_marks_by_encoding

Definition at line 99 of file utf.cpp.

{
    return s.first(first_non_bom(s));
}

◆ first_non_bom()

size_t c4::first_non_bom ( csubstr s )

return the position of the first character not belonging to the Byte Order Mark, or 0 if there is no Byte Order Mark.

See also: Implements the Byte Order Marks as described in https://en.wikipedia.org/wiki/Byte_order_mark#Byte-order_marks_by_encoding

Definition at line 59 of file utf.cpp.

{
    #define c4check2_(s, c0, c1)         ((s).len >= 2) && (((s).str[0] == (c0)) && ((s).str[1] == (c1)))
    #define c4check3_(s, c0, c1, c2)     ((s).len >= 3) && (((s).str[0] == (c0)) && ((s).str[1] == (c1)) && ((s).str[2] == (c2)))
    #define c4check4_(s, c0, c1, c2, c3) ((s).len >= 4) && (((s).str[0] == (c0)) && ((s).str[1] == (c1)) && ((s).str[2] == (c2)) && ((s).str[3] == (c3)))
    // see https://en.wikipedia.org/wiki/Byte_order_mark#Byte-order_marks_by_encoding
    if(s.len < 2u)
        return 0u; // too short to hold any Byte Order Mark
    else if(c4check3_(s, '\xef', '\xbb', '\xbf')) // UTF-8
        return 3u;
    else if(c4check4_(s, '\x00', '\x00', '\xfe', '\xff')) // UTF-32BE
        return 4u;
    else if(c4check4_(s, '\xff', '\xfe', '\x00', '\x00')) // UTF-32LE
        return 4u;
    else if(c4check2_(s, '\xfe', '\xff')) // UTF-16BE
        return 2u;
    else if(c4check2_(s, '\xff', '\xfe')) // UTF-16LE
        return 2u;
    else if(c4check3_(s, '\x2b', '\x2f', '\x76')) // UTF-7
        return 3u;
    else if(c4check3_(s, '\xf7', '\x64', '\x4c')) // UTF-1
        return 3u;
    else if(c4check4_(s, '\xdd', '\x73', '\x66', '\x73')) // UTF-EBCDIC
        return 4u;
    else if(c4check3_(s, '\x0e', '\xfe', '\xff')) // SCSU
        return 3u;
    else if(c4check3_(s, '\xfb', '\xee', '\x28')) // BOCU-1
        return 3u;
    else if(c4check4_(s, '\x84', '\x31', '\x95', '\x33')) // GB18030
        return 4u;
    return 0u;
    #undef c4check2_
    #undef c4check3_
    #undef c4check4_
}

Referenced by get_bom() and skip_bom().
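The prefix matching above can be tried outside the library. The following is a minimal standalone sketch, not the c4 implementation: it uses `std::string_view` in place of `c4::csubstr`, covers only the UTF-8, UTF-16 and UTF-32 marks, and the names `bom_length`, `get_bom_sv`, `skip_bom_sv` and `bom_examples` are hypothetical. It keeps the ordering used in the listing: a longer mark is tested before a shorter mark that shares its prefix.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical analogue of c4::first_non_bom() for a subset of encodings.
// Returns the number of leading bytes that form a Byte Order Mark, or 0.
inline std::size_t bom_length(std::string_view s)
{
    using namespace std::string_view_literals; // "..."sv keeps embedded NUL bytes
    // The UTF-32LE mark FF FE 00 00 begins with the UTF-16LE mark FF FE,
    // so the 4-byte marks must be tested before the 2-byte marks.
    static constexpr std::string_view marks[] = {
        "\xEF\xBB\xBF"sv,     // UTF-8
        "\x00\x00\xFE\xFF"sv, // UTF-32BE
        "\xFF\xFE\x00\x00"sv, // UTF-32LE
        "\xFE\xFF"sv,         // UTF-16BE
        "\xFF\xFE"sv,         // UTF-16LE
    };
    for(std::string_view m : marks)
        if(s.substr(0, m.size()) == m) // substr clamps, so short inputs are safe
            return m.size();
    return 0;
}

// Hypothetical analogues of get_bom() and skip_bom(): together they
// partition the input into the mark and the remaining text.
inline std::string_view get_bom_sv(std::string_view s)  { return s.substr(0, bom_length(s)); }
inline std::string_view skip_bom_sv(std::string_view s) { return s.substr(bom_length(s)); }

// Usage examples, written as assertions.
inline void bom_examples()
{
    using namespace std::string_view_literals;
    assert(bom_length("\xEF\xBB\xBFkey: val"sv) == 3);          // UTF-8
    assert(bom_length("\xFF\xFEh\x00i\x00"sv) == 2);            // UTF-16LE
    assert(bom_length("\xFF\xFE\x00\x00h\x00\x00\x00"sv) == 4); // UTF-32LE
    assert(skip_bom_sv("\xEF\xBB\xBFkey: val"sv) == "key: val"sv);
    assert(get_bom_sv("key: val"sv).empty());                   // no mark present
}
```

Because `std::string_view::substr` clamps the requested length, the sketch needs no separate early return for inputs shorter than two bytes.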

◆ decode_code_point() [1/2]

substr c4::decode_code_point	(	substr	out,
		csubstr	code_point )

decode the given code_point (a hexadecimal string), writing its UTF-8 bytes into the output string out.

Parameters

out	the output string. must have at least 4 bytes (this is asserted), and must not be a null string.
code_point	a hexadecimal string. must have length in ]0,8] (1 to 8 characters), and must not begin with any of `U+`,`\x`,`\u`,`\U`,`0` (asserted)

Returns: the part of out that was written, which will always be at most 4 bytes.

Definition at line 42 of file utf.cpp.

{
    C4_ASSERT(out.len >= 4);
    C4_ASSERT(!code_point.begins_with("U+"));
    C4_ASSERT(!code_point.begins_with("\\x"));
    C4_ASSERT(!code_point.begins_with("\\u"));
    C4_ASSERT(!code_point.begins_with("\\U"));
    C4_ASSERT(!code_point.begins_with('0'));
    C4_ASSERT(code_point.len <= 8);
    C4_ASSERT(code_point.len > 0);
    uint32_t code_point_val;
    C4_CHECK(read_hex(code_point, &code_point_val));
    size_t ret = decode_code_point((uint8_t*)out.str, out.len, code_point_val);
    C4_ASSERT(ret <= 4);
    return out.first(ret);
}

Referenced by decode_code_point().
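The preconditions on `code_point` can be illustrated with a standalone sketch. It is not the c4 implementation: it uses `std::from_chars` where the listing uses `read_hex`, it reports a violation by returning `std::nullopt` where the library asserts, and the names `parse_code_point` and `parse_examples` are hypothetical.

```cpp
#include <cassert>
#include <charconv>
#include <cstdint>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical helper: parse a bare hexadecimal code point such as "20AC".
// Returns std::nullopt when the input breaks one of the documented
// preconditions or is not valid hexadecimal.
inline std::optional<std::uint32_t> parse_code_point(std::string_view hex)
{
    if(hex.empty() || hex.size() > 8)
        return std::nullopt; // length must be in ]0,8]
    if(hex.front() == '0')
        return std::nullopt; // a leading zero is rejected
    std::uint32_t value = 0;
    const char *first = hex.data();
    const char *last = hex.data() + hex.size();
    // Base-16 from_chars accepts no prefix, so "U+", "\x", "\u" and "\U"
    // fail here because their first character is not a hex digit.
    auto [ptr, ec] = std::from_chars(first, last, value, 16);
    if(ec != std::errc() || ptr != last)
        return std::nullopt;
    return value;
}

// Usage examples, written as assertions.
inline void parse_examples()
{
    assert(parse_code_point("20AC") == 0x20ACu);   // EURO SIGN
    assert(parse_code_point("1F600") == 0x1F600u); // outside the BMP
    assert(!parse_code_point("U+20AC"));           // prefix not allowed
    assert(!parse_code_point("0041"));             // leading zero not allowed
    assert(!parse_code_point(""));                 // empty input not allowed
}
```

Eight hexadecimal digits are exactly what a `uint32_t` can hold, which matches the upper bound on the length of `code_point`.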

◆ decode_code_point() [2/2]

size_t c4::decode_code_point	(	uint8_t *	buf,
		size_t	buflen,
		uint32_t	code )

decode the given code point, writing its UTF-8 bytes into the output buffer buf, of size buflen
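The body of this overload is not shown on this page. As an illustration of what writing a code point into a byte buffer of at most 4 bytes involves, here is a standalone sketch of standard UTF-8 encoding. It is not the c4 implementation: the names `encode_utf8` and `encode_examples` are hypothetical, returning 0 for values above U+10FFFF is a choice of this sketch, and surrogate values are not rejected.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch: write the UTF-8 encoding of `code` into `buf` and
// return the number of bytes written (0 if `code` is above U+10FFFF).
inline std::size_t encode_utf8(std::uint8_t *buf, std::size_t buflen, std::uint32_t code)
{
    assert(buf != nullptr && buflen >= 4); // 4 bytes is the longest sequence
    (void)buflen;                          // unused when assertions are disabled
    if(code <= 0x7Fu) // 1 byte: 0xxxxxxx
    {
        buf[0] = static_cast<std::uint8_t>(code);
        return 1;
    }
    if(code <= 0x7FFu) // 2 bytes: 110xxxxx 10xxxxxx
    {
        buf[0] = static_cast<std::uint8_t>(0xC0u | (code >> 6));
        buf[1] = static_cast<std::uint8_t>(0x80u | (code & 0x3Fu));
        return 2;
    }
    if(code <= 0xFFFFu) // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    {
        buf[0] = static_cast<std::uint8_t>(0xE0u | (code >> 12));
        buf[1] = static_cast<std::uint8_t>(0x80u | ((code >> 6) & 0x3Fu));
        buf[2] = static_cast<std::uint8_t>(0x80u | (code & 0x3Fu));
        return 3;
    }
    if(code <= 0x10FFFFu) // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    {
        buf[0] = static_cast<std::uint8_t>(0xF0u | (code >> 18));
        buf[1] = static_cast<std::uint8_t>(0x80u | ((code >> 12) & 0x3Fu));
        buf[2] = static_cast<std::uint8_t>(0x80u | ((code >> 6) & 0x3Fu));
        buf[3] = static_cast<std::uint8_t>(0x80u | (code & 0x3Fu));
        return 4;
    }
    return 0;
}

// Usage examples, written as assertions.
inline void encode_examples()
{
    std::uint8_t buf[4] = {};
    assert(encode_utf8(buf, sizeof(buf), 0x41u) == 1 && buf[0] == 0x41);   // 'A'
    assert(encode_utf8(buf, sizeof(buf), 0x20ACu) == 3);                   // EURO SIGN
    assert(buf[0] == 0xE2 && buf[1] == 0x82 && buf[2] == 0xAC);
    assert(encode_utf8(buf, sizeof(buf), 0x110000u) == 0);                 // out of range
}
```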