Text Helpers

The text helper API lives in the erbsland::unittest::th namespace and provides small utilities for working with encoded text in tests. This includes counting Unicode code-points, converting between string encodings, generating malformed UTF-8 for validation tests, splitting text into lines, and validating or comparing text output.

The namespace reference below lists all currently available helpers, including the Utf8Error enum and the allUtf8Errors convenience constant.

namespace th

Namespace for text helpers.

These text helper functions are mainly provided for handling diagnostic output and validating text in unit tests. They are designed to be robust but can be inefficient, as they internally convert the text to UTF-8 for processing, which can be costly for large strings.

For performance-critical hot paths in your tests, use a custom implementation that avoids the UTF-8 conversion.

Enums

enum class Utf8Error : uint8_t

All supported malformed UTF-8 sequence categories for invalidUtf8().

Values:

enumerator UnexpectedContinuationByte: A standalone continuation byte (0x80-0xBF) without a valid start byte.

enumerator Overlong2ByteSequence: A two-byte sequence that encodes a value representable in one byte.

enumerator Truncated2ByteSequence: A two-byte sequence whose continuation byte is missing.

enumerator InvalidContinuationByteIn2ByteSequence: A two-byte sequence followed by a non-continuation byte.

enumerator Overlong3ByteSequence: A three-byte sequence that uses more bytes than required for the code-point.

enumerator Truncated3ByteSequence: A three-byte sequence with one or more missing continuation bytes.

enumerator InvalidContinuationByteIn3ByteSequence: A three-byte sequence containing an invalid continuation byte.

enumerator SurrogateCodePoint: A UTF-8 sequence that decodes to a UTF-16 surrogate code-point.

enumerator Overlong4ByteSequence: A four-byte sequence that over-encodes a smaller code-point.

enumerator Truncated4ByteSequence: A four-byte sequence with one or more missing continuation bytes.

enumerator InvalidContinuationByteIn4ByteSequence: A four-byte sequence containing an invalid continuation byte.

enumerator CodePointBeyondUnicodeRange: A UTF-8 sequence that decodes to a code-point above U+10FFFF.

enumerator InvalidStartByte: A byte that cannot start a valid UTF-8 sequence.

enumerator _count

Functions

auto characterCount(const std::string_view text) noexcept -> std::size_t

Get the code-point count for UTF-8 encoded text.

Parameters:: text – The UTF-8 encoded text to count the code-points for.
Returns:: The number of code-points in the UTF-8 encoded text.

auto characterCount(std::u8string_view text) noexcept -> std::size_t: Overload of the function above for std::u8string_view input.

auto characterCount(std::u16string_view text) noexcept -> std::size_t: Overload of the function above for std::u16string_view input.

auto characterCount(std::u32string_view text) noexcept -> std::size_t: Overload of the function above for std::u32string_view input.

auto characterCount(std::wstring_view text) noexcept -> std::size_t: Overload of the function above for std::wstring_view input.
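To illustrate why a code-point count differs from a byte count, here is a minimal standalone sketch. The name countCodePoints is hypothetical and this is not the library's implementation; it assumes well-formed UTF-8 input, and how characterCount() treats malformed input is not stated in this reference.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper, not part of the library: counts code-points in
// well-formed UTF-8 by counting every byte that is not a continuation byte.
constexpr auto countCodePoints(std::string_view text) noexcept -> std::size_t {
    std::size_t count = 0;
    for (const char c : text) {
        // Continuation bytes have the bit pattern 10xxxxxx.
        if ((static_cast<unsigned char>(c) & 0xC0u) != 0x80u) {
            ++count;
        }
    }
    return count;
}

// Usage sketch: 'a', U+00E4 and U+20AC occupy six bytes but are three code-points.
inline void exampleCount() {
    assert(countCodePoints("a\xC3\xA4\xE2\x82\xAC") == 3);
}
```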

auto toConsoleSafeString(const std::string_view text, const std::size_t maxLength) noexcept -> std::string

Create text that is safe for console output from UTF-8 encoded input.

This call scans the UTF-8 encoded text for invalid UTF-8 sequences and replaces each invalid byte with an escape sequence “\x<XX>”. Valid UTF-8 sequences are preserved as code-points, but all code-points >= 0x80 are escaped as “\u{<XXXX>}”, and control codes are escaped as “\x<XX>” or as “\n”, “\r”, “\t”. Backslashes and double quotes are escaped as “\\” and “\"”.

If the string contains spaces, double quotes, or escape sequences, it is automatically enclosed in quotes.

Parameters:

text – The UTF-8 encoded text to convert to console safe text.
maxLength – The maximum length of the returned string. If the text is longer, it is truncated and the text “(… +<count> more)” is added at the end. This maximum applies to the processed text, including the escape sequences.

Returns:: The console safe text.

auto toConsoleSafeString(std::u8string_view text, std::size_t maxLength) noexcept -> std::string: Overload of the function above for std::u8string_view input.

auto toConsoleSafeString(std::u16string_view text, std::size_t maxLength) noexcept -> std::string: Overload of the function above for std::u16string_view input.

auto toConsoleSafeString(std::u32string_view text, std::size_t maxLength) noexcept -> std::string: Overload of the function above for std::u32string_view input.

auto toConsoleSafeString(std::wstring_view text, std::size_t maxLength) noexcept -> std::string: Overload of the function above for std::wstring_view input.
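The escaping and quoting rules can be sketched for plain ASCII input as follows. The function escapeAsciiForConsole is hypothetical and is not the library's implementation. It assumes that “\x<XX>” means two uppercase hexadecimal digits, and it leaves out the handling of bytes >= 0x80 and the maxLength truncation.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <string_view>

// Hypothetical sketch, not the library's implementation.
// Covers ASCII input only; non-ASCII bytes and truncation are omitted.
auto escapeAsciiForConsole(std::string_view text) -> std::string {
    std::string body;
    bool needsQuotes = false;
    for (const char c : text) {
        const auto byte = static_cast<unsigned char>(c);
        assert(byte < 0x80); // this sketch only accepts ASCII input
        switch (c) {
        case '\n': body += "\\n"; needsQuotes = true; break;
        case '\r': body += "\\r"; needsQuotes = true; break;
        case '\t': body += "\\t"; needsQuotes = true; break;
        case '\\': body += "\\\\"; needsQuotes = true; break;
        case '"': body += "\\\""; needsQuotes = true; break;
        case ' ': body += c; needsQuotes = true; break;
        default:
            if (byte < 0x20) {
                // Remaining control codes: assumed format is \x plus two hex digits.
                char buffer[8];
                std::snprintf(buffer, sizeof buffer, "\\x%02X", static_cast<unsigned>(byte));
                body += buffer;
                needsQuotes = true;
            } else {
                body += c;
            }
            break;
        }
    }
    if (needsQuotes) {
        return '"' + body + '"';
    }
    return body;
}
```

In this sketch a plain word such as abc is returned unchanged, while any input containing a space, a quote, or an escaped character comes back enclosed in double quotes.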

auto toStdString(std::string_view text) -> std::string

Convert a string into a UTF-8 encoded std::string.

std::string_view and std::u8string_view are copied byte-for-byte. std::u16string_view, std::u32string_view and std::wstring_view are transcoded to UTF-8.

auto toStdString(std::u8string_view text) -> std::string: Overload of the function above for std::u8string_view input.

auto toStdString(std::u16string_view text) -> std::string: Overload of the function above for std::u16string_view input.

auto toStdString(std::u32string_view text) -> std::string: Overload of the function above for std::u32string_view input.

auto toStdString(std::wstring_view text) -> std::string: Overload of the function above for std::wstring_view input.
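As an illustration of what transcoding to UTF-8 involves, the following standalone sketch encodes UTF-32 input. The name encodeUtf8 is hypothetical and this is not the library's implementation; it assumes every input value is a valid Unicode scalar value.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical helper, not part of the library: encodes UTF-32 as UTF-8.
auto encodeUtf8(std::u32string_view text) -> std::string {
    std::string result;
    for (const char32_t cp : text) {
        // Precondition of this sketch: a valid Unicode scalar value.
        assert(cp <= 0x10FFFF && (cp < 0xD800 || cp > 0xDFFF));
        if (cp < 0x80) {
            result += static_cast<char>(cp);
        } else if (cp < 0x800) {
            result += static_cast<char>(0xC0 | (cp >> 6));
            result += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            result += static_cast<char>(0xE0 | (cp >> 12));
            result += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            result += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            result += static_cast<char>(0xF0 | (cp >> 18));
            result += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            result += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            result += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return result;
}
```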

auto toStdU32String(const std::string_view text) -> std::u32string

Convert a string into a UTF-32 encoded std::u32string.

std::u32string_view is copied unmodified. All other string views are decoded according to their encoding and converted to UTF-32 code-points. Invalid UTF-8 input is replaced with the Unicode replacement character (U+FFFD).

auto toStdU32String(std::u8string_view text) -> std::u32string: Overload of the function above for std::u8string_view input.

auto toStdU32String(std::u16string_view text) -> std::u32string: Overload of the function above for std::u16string_view input.

auto toStdU32String(std::u32string_view text) -> std::u32string: Overload of the function above for std::u32string_view input.

auto toStdU32String(std::wstring_view text) -> std::u32string: Overload of the function above for std::wstring_view input.

auto stdStringFromHex(const std::string_view hexString) -> std::string

Parse a hexadecimal byte string into a std::string.

Spaces are allowed only between complete byte groups, not inside them. Examples: "0102c8" or "01 02 A1".

Parameters:: hexString – The hexadecimal string to parse.
Throws:: std::logic_error – if the format is invalid.
Returns:: The decoded string.

auto stdU8StringFromHex(const std::string_view hexString) -> std::u8string: Same input format as stdStringFromHex(), but parses two hexadecimal digits per element into a std::u8string.

auto stdU16StringFromHex(const std::string_view hexString) -> std::u16string: Same input format as stdStringFromHex(), but parses four hexadecimal digits per element into a std::u16string.

auto stdU32StringFromHex(const std::string_view hexString) -> std::u32string: Same input format as stdStringFromHex(), but parses eight hexadecimal digits per element into a std::u32string.

auto stdWStringFromHex(const std::string_view hexString) -> std::wstring: Same input format as stdStringFromHex(), but parses one wchar_t value per element using sizeof(wchar_t) * 2 hexadecimal digits.
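The byte-group rule ("01 02" is accepted, "0 1" is not) can be sketched with a small standalone parser. The names hexDigitValue and bytesFromHex are hypothetical and this is not the library's implementation. The sketch also accepts leading and trailing spaces, which is an assumption; the reference does not say how the library treats them.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <string_view>

// Convert one hexadecimal digit to its value, or throw for any other character.
auto hexDigitValue(const char c) -> unsigned {
    if (c >= '0' && c <= '9') { return static_cast<unsigned>(c - '0'); }
    if (c >= 'a' && c <= 'f') { return static_cast<unsigned>(c - 'a' + 10); }
    if (c >= 'A' && c <= 'F') { return static_cast<unsigned>(c - 'A' + 10); }
    throw std::logic_error("invalid hexadecimal digit");
}

// Hypothetical sketch, not the library's implementation.
auto bytesFromHex(std::string_view hex) -> std::string {
    std::string result;
    std::size_t i = 0;
    while (i < hex.size()) {
        if (hex[i] == ' ') { // spaces are skipped only between byte groups
            ++i;
            continue;
        }
        if (i + 1 >= hex.size()) {
            throw std::logic_error("incomplete byte group");
        }
        const unsigned high = hexDigitValue(hex[i]);
        const unsigned low = hexDigitValue(hex[i + 1]); // a space here throws
        result += static_cast<char>((high << 4) | low);
        i += 2;
    }
    return result;
}

// Usage sketch with the two examples from the description above.
inline void exampleHex() {
    assert(bytesFromHex("0102c8") == "\x01\x02\xC8");
    assert(bytesFromHex("01 02 A1") == "\x01\x02\xA1");
}
```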

auto appendInvalidUtf8Sequence(std::string &result, const Utf8Error error) -> void

auto invalidUtf8(const Utf8Error error, const std::optional<std::string_view> &prefix, const std::optional<std::string_view> &suffix) -> std::string

Generate a malformed UTF-8 byte sequence, optionally wrapped with valid prefix and suffix text.

This helper is intended for UTF-8 decoder and validator tests. The generated malformed sequence is inserted between the optional prefix and suffix without modifying those surrounding bytes.

Parameters:

error – The malformed UTF-8 error kind to generate.
prefix – Optional valid UTF-8 bytes to prepend.
suffix – Optional valid UTF-8 bytes to append.

Returns:

A string containing the malformed UTF-8 sequence.
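For orientation, the sketch below shows one well-known malformed byte sequence for a few of the categories and how such a sequence sits between a prefix and a suffix. The enum SampleError and the function sampleInvalidUtf8 are hypothetical stand-ins; the exact bytes that invalidUtf8() generates for each Utf8Error value are not specified in this reference.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical subset of categories, named after the Utf8Error enumerators.
enum class SampleError {
    UnexpectedContinuationByte,
    Overlong2ByteSequence,
    Truncated3ByteSequence,
    SurrogateCodePoint,
    CodePointBeyondUnicodeRange,
    InvalidStartByte,
};

// Hypothetical sketch: wraps one malformed sequence with the given prefix and suffix.
auto sampleInvalidUtf8(
    const SampleError error,
    const std::string_view prefix = {},
    const std::string_view suffix = {}) -> std::string {
    std::string result{prefix};
    switch (error) {
    case SampleError::UnexpectedContinuationByte: result += "\x80"; break;
    case SampleError::Overlong2ByteSequence: result += "\xC0\xAF"; break; // '/' in two bytes
    case SampleError::Truncated3ByteSequence: result += "\xE2\x82"; break; // U+20AC, last byte missing
    case SampleError::SurrogateCodePoint: result += "\xED\xA0\x80"; break; // U+D800
    case SampleError::CodePointBeyondUnicodeRange: result += "\xF4\x90\x80\x80"; break; // U+110000
    case SampleError::InvalidStartByte: result += "\xFF"; break;
    }
    result += suffix;
    return result;
}

// Usage sketch: the surrounding bytes are left untouched.
inline void exampleInvalid() {
    assert(sampleInvalidUtf8(SampleError::InvalidStartByte, "<", ">") == "<\xFF>");
}
```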

auto splitLineViews(const std::string_view text) -> std::vector<std::string_view>

Split a string into individual lines without copying the line contents.

This function returns string views into the original input and therefore does not allocate storage for the individual line contents. The original text must stay alive and unchanged while the returned views are used. Lines are split at newline characters (\n), and the resulting views do not include the newline characters. A trailing newline does not produce an additional empty line at the end.

Parameters:: text – The text to split into lines.
Returns:: A vector of string views, each representing a line from the input text.

auto splitLineViews(std::u8string_view text) -> std::vector<std::u8string_view>: Overload of the function above for std::u8string_view input; the returned views use the same character type.

auto splitLineViews(std::u16string_view text) -> std::vector<std::u16string_view>: Overload of the function above for std::u16string_view input; the returned views use the same character type.

auto splitLineViews(std::u32string_view text) -> std::vector<std::u32string_view>: Overload of the function above for std::u32string_view input; the returned views use the same character type.

auto splitLineViews(std::wstring_view text) -> std::vector<std::wstring_view>: Overload of the function above for std::wstring_view input; the returned views use the same character type.
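The splitting rules described for splitLineViews() (split at \n, no newline in the result, no extra empty line after a trailing newline) can be sketched as below. The name splitViews is hypothetical and this is not the library's implementation. Returning an empty vector for empty input is an assumption of the sketch.

```cpp
#include <cassert>
#include <string_view>
#include <vector>

// Hypothetical sketch, not the library's implementation.
// The returned views point into the caller's buffer, so that buffer
// must outlive the returned vector.
auto splitViews(std::string_view text) -> std::vector<std::string_view> {
    std::vector<std::string_view> lines;
    while (!text.empty()) {
        const auto pos = text.find('\n');
        if (pos == std::string_view::npos) {
            lines.push_back(text); // last line without a trailing newline
            break;
        }
        lines.push_back(text.substr(0, pos)); // the newline itself is excluded
        text.remove_prefix(pos + 1);
    }
    return lines;
}

// Usage sketch: a trailing newline does not add an empty line.
inline void exampleSplit() {
    const auto lines = splitViews("a\nb\n");
    assert(lines.size() == 2);
    assert(lines[0] == "a");
    assert(lines[1] == "b");
}
```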

auto splitLines(const std::string_view text) -> std::vector<std::string>

Split a string into individual lines and return owned strings for each line.

Lines are split at newline characters (\n), and the resulting strings do not include the newline characters. A trailing newline does not produce an additional empty line at the end.

Parameters:: text – The text to split into lines.
Returns:: A vector containing one string per line from the input text.

auto splitLines(std::u8string_view text) -> std::vector<std::u8string>: Overload of the function above for std::u8string_view input; the returned strings use the same character type.

auto splitLines(std::u16string_view text) -> std::vector<std::u16string>: Overload of the function above for std::u16string_view input; the returned strings use the same character type.

auto splitLines(std::u32string_view text) -> std::vector<std::u32string>: Overload of the function above for std::u32string_view input; the returned strings use the same character type.

auto splitLines(std::wstring_view text) -> std::vector<std::wstring>: Overload of the function above for std::wstring_view input; the returned strings use the same character type.

void requireValidUtf8(UnitTest &test, const std::string_view text)

Test if the given string contains valid UTF-8 text.

This helper validates both the byte-level UTF-8 encoding and the decoded Unicode scalar values. Inputs containing malformed byte sequences, surrogate code-points, or code-points outside the Unicode range fail the current test with a diagnostic that includes the offending byte position.

Parameters:

test – The active test instance.
text – The text to validate.

void requireValidUtf8(UnitTest &test, const std::u8string_view text)

This is an overloaded function, provided for convenience. It differs from the above function only in the argument type it accepts.

Parameters:

test – The active test instance.
text – The UTF-8 text to validate.
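The checks described above (byte-level structure, overlong forms, surrogates, and the Unicode range) can be sketched as a standalone validator. The name firstInvalidUtf8Position is hypothetical and this is not the library's implementation. The sketch reports the position where the offending sequence starts; which position requireValidUtf8() reports in its diagnostic is an assumption here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical sketch: returns the start position of the first invalid
// sequence, or std::nullopt if the text is valid UTF-8.
auto firstInvalidUtf8Position(std::string_view text) -> std::optional<std::size_t> {
    std::size_t i = 0;
    while (i < text.size()) {
        const auto b0 = static_cast<unsigned char>(text[i]);
        std::size_t length = 0;
        std::uint32_t cp = 0;
        std::uint32_t minimum = 0; // smallest code-point allowed for this length
        if (b0 < 0x80) { length = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { length = 2; cp = b0 & 0x1F; minimum = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { length = 3; cp = b0 & 0x0F; minimum = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { length = 4; cp = b0 & 0x07; minimum = 0x10000; }
        else { return i; } // stray continuation byte or invalid start byte
        if (i + length > text.size()) { return i; } // truncated sequence
        for (std::size_t k = 1; k < length; ++k) {
            const auto b = static_cast<unsigned char>(text[i + k]);
            if ((b & 0xC0) != 0x80) { return i; } // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < minimum) { return i; } // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) { return i; } // UTF-16 surrogate
        if (cp > 0x10FFFF) { return i; } // beyond the Unicode range
        i += length;
    }
    return std::nullopt;
}

// Usage sketch: the stray continuation byte is found at byte position 1.
inline void exampleValidate() {
    assert(firstInvalidUtf8Position("a\x80").value() == 1u);
}
```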

template<typename tActual, typename tExpected> void requireEqualLines(UnitTest &test, const tActual &actual, const tExpected &expected)

Compare a container of lines against a container of line patterns.

Common usage:

    void testLines() {
        // ...
        std::vector<std::string> actual = // ... from test
        auto expected = std::vector<std::string>{
            "the first line",
            "a second line"};
        REQUIRE_EQUAL_LINES(actual, expected);
    }

Each line on the expected side can contain at most one asterisk (*), which matches any number of characters at that position. The expected line can also contain any number of question marks (?), each of which matches exactly one character at its position.

If the number of lines differs, or if at least one line does not match, the test fails and reports a side-by-side comparison of the actual and expected lines.

Template Parameters:

tActual – A range type that provides size(), begin(), end(), and string-like line values.
tExpected – A range type that provides size(), begin(), end(), and pattern line values.

Parameters:

test – The active test instance.
actual – The produced lines from the test.
expected – The expected lines or line patterns.
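The pattern rules (at most one asterisk, any number of question marks) can be sketched with the standalone matcher below. The names matchesFixed and matchesLinePattern are hypothetical and this is not the library's implementation. The sketch compares char units rather than code-points and treats a second asterisk as a literal character; both are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Match two views of equal length; '?' in the pattern accepts any single char.
auto matchesFixed(std::string_view actual, std::string_view pattern) -> bool {
    if (actual.size() != pattern.size()) {
        return false;
    }
    for (std::size_t i = 0; i < pattern.size(); ++i) {
        if (pattern[i] != '?' && pattern[i] != actual[i]) {
            return false;
        }
    }
    return true;
}

// Hypothetical sketch: split the pattern at its first '*' and match the
// part before it against the start and the part after it against the end.
auto matchesLinePattern(std::string_view actual, std::string_view pattern) -> bool {
    const auto star = pattern.find('*');
    if (star == std::string_view::npos) {
        return matchesFixed(actual, pattern);
    }
    const auto head = pattern.substr(0, star);
    const auto tail = pattern.substr(star + 1);
    if (actual.size() < head.size() + tail.size()) {
        return false; // head and tail must not overlap in the actual line
    }
    return matchesFixed(actual.substr(0, head.size()), head)
        && matchesFixed(actual.substr(actual.size() - tail.size()), tail);
}

// Usage sketch: a varying duration is accepted by the asterisk.
inline void examplePattern() {
    assert(matchesLinePattern("elapsed: 123 ms", "elapsed: * ms"));
}
```

Because a single asterisk splits the pattern into a fixed head and a fixed tail, no backtracking is needed, which keeps the matching linear in the line length.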

Variables

allUtf8Errors

A list of all malformed UTF-8 categories supported by invalidUtf8().

This constant is convenient for parameterized tests that verify UTF-8 validation or error reporting against every malformed sequence generator provided by this API.
