Text Helpers

The text helper API in erbsland::unittest::th provides practical utilities for working with text in unit tests.

It helps you:

Count Unicode code-points instead of raw code units.
Convert between UTF-8, UTF-16, UTF-32, and wchar_t based strings.
Produce readable diagnostics for text containing control characters or malformed UTF-8.
Generate malformed UTF-8 input for validation tests.
Split text into lines for comparisons.
Compare multi-line output with simple wildcard support.
Verify that a byte string contains valid UTF-8.

If you use these helpers frequently, introducing a namespace alias keeps your test code concise:

#include <erbsland/unittest/TextHelper.hpp>

namespace th = erbsland::unittest::th;

Warning

These helpers are designed for reliability, diagnostics, and convenience — not for raw speed.

Several functions internally normalize text to UTF-8 or UTF-32 before processing. This makes them robust and predictable, but also unsuitable for performance-critical hot paths.

If your test repeatedly processes large volumes of text in tight loops, prefer a specialized implementation that operates directly on your native data representation.

Supported String Types

Most helpers provide overloads for the following string view types:

std::string_view
std::u8string_view
std::u16string_view
std::u32string_view
std::wstring_view

The overloads are intentionally parallel. In most cases, you can call the same helper regardless of the text type your test already uses.

The `Utf8Error` Enum

The Utf8Error enum specifies which malformed UTF-8 sequence invalidUtf8() should generate.

Available values:

UnexpectedContinuationByte: A continuation byte without a valid lead byte.
Overlong2ByteSequence: A two-byte overlong encoding.
Truncated2ByteSequence: A two-byte sequence missing its continuation byte.
InvalidContinuationByteIn2ByteSequence: A two-byte lead byte followed by an invalid continuation byte.
Overlong3ByteSequence: A three-byte overlong encoding.
Truncated3ByteSequence: A three-byte sequence with missing trailing bytes.
InvalidContinuationByteIn3ByteSequence: A three-byte sequence containing an invalid continuation byte.
SurrogateCodePoint: Bytes that decode to a UTF-16 surrogate code-point (invalid in UTF-8).
Overlong4ByteSequence: A four-byte overlong encoding.
Truncated4ByteSequence: A four-byte sequence with missing trailing bytes.
InvalidContinuationByteIn4ByteSequence: A four-byte sequence containing an invalid continuation byte.
CodePointBeyondUnicodeRange: A sequence that encodes a value above U+10FFFF.
InvalidStartByte: A byte that is not valid as a UTF-8 start byte.
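
To make the categories concrete, the following sketch pairs each enum value name with one textbook byte sequence of that kind. The struct MalformedSample and the table malformedSamples are hypothetical names used only for illustration, and the bytes are common examples chosen for this sketch. The sequences that invalidUtf8() actually generates may differ.

```cpp
#include <array>
#include <string_view>

using namespace std::string_view_literals;

// Illustration only: one well-known malformed sequence per category.
// These bytes are assumptions of this sketch, not the library's output.
struct MalformedSample {
    std::string_view category;
    std::string_view bytes;
};

inline constexpr std::array<MalformedSample, 13> malformedSamples{{
    {"UnexpectedContinuationByte"sv, "\x80"sv},                        // lone 10xxxxxx byte
    {"Overlong2ByteSequence"sv, "\xC0\xAF"sv},                         // '/' encoded in 2 bytes
    {"Truncated2ByteSequence"sv, "\xC3"sv},                            // lead byte only
    {"InvalidContinuationByteIn2ByteSequence"sv, "\xC3\x28"sv},        // '(' after lead byte
    {"Overlong3ByteSequence"sv, "\xE0\x80\xAF"sv},                     // '/' encoded in 3 bytes
    {"Truncated3ByteSequence"sv, "\xE2\x82"sv},                        // third byte missing
    {"InvalidContinuationByteIn3ByteSequence"sv, "\xE2\x28\xA1"sv},    // '(' as second byte
    {"SurrogateCodePoint"sv, "\xED\xA0\x80"sv},                        // U+D800
    {"Overlong4ByteSequence"sv, "\xF0\x80\x80\xAF"sv},                 // '/' encoded in 4 bytes
    {"Truncated4ByteSequence"sv, "\xF0\x9F\x98"sv},                    // fourth byte missing
    {"InvalidContinuationByteIn4ByteSequence"sv, "\xF0\x9F\x28\x80"sv},// '(' as third byte
    {"CodePointBeyondUnicodeRange"sv, "\xF4\x90\x80\x80"sv},           // U+110000
    {"InvalidStartByte"sv, "\xFF"sv},                                  // never valid in UTF-8
}};
```

A table like this can be handy when a failing validator test needs to be reproduced by hand.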

The `allUtf8Errors` Constant

The allUtf8Errors constant contains all enum values in a single iterable array.

This makes it easy to write exhaustive tests:

for (const auto error : th::allUtf8Errors) {
    auto malformed = th::invalidUtf8(error, "pre", "suf");
    CHECK_FALSE(myUtf8Validator(malformed));
}

Use this pattern when you want to ensure your UTF-8 handling rejects all supported malformed cases.

The `characterCount()` Function

Use characterCount() to count Unicode code-points instead of bytes or UTF code units.

auto count1 = th::characterCount("abc");
auto count2 = th::characterCount("abc→😀");

REQUIRE_EQUAL(count1, 3);
REQUIRE_EQUAL(count2, 5);

This is especially useful when your test data contains non-ASCII characters and std::string::size() would give misleading results.

Note

The result counts Unicode code-points, not grapheme clusters. A single visible character may still consist of multiple code-points.
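
The difference between bytes and code-points can be sketched in a few lines. The function countCodePoints below is a hypothetical stand-in, not the library's implementation. It assumes well-formed UTF-8 and simply counts every byte that is not a continuation byte.

```cpp
#include <cstddef>
#include <string_view>

// Sketch: counts code-points in well-formed UTF-8 by skipping
// continuation bytes (bit pattern 10xxxxxx). Malformed input is not detected.
constexpr std::size_t countCodePoints(std::string_view utf8) noexcept {
    std::size_t count = 0;
    for (const char ch : utf8) {
        if ((static_cast<unsigned char>(ch) & 0xC0u) != 0x80u) {
            ++count;
        }
    }
    return count;
}
```

For the text "abc→😀", size() reports 10 bytes, while this sketch reports 5 code-points. An "e" followed by the combining acute accent U+0301 renders as one visible character but counts as 2 code-points.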

The `toConsoleSafeString()` Function

This helper converts text into a representation that is safe and readable in console output.

It escapes:

control characters (newline, carriage return, tab),
invalid UTF-8 bytes,
non-ASCII code-points,
backslashes and double quotes.

If the result contains spaces or escape sequences, it is automatically wrapped in double quotes.

auto text = th::toConsoleSafeString("abc\n\r\t\\\"\u1234", 100);
REQUIRE_EQUAL(text, R"("abc\n\r\t\\\"\u{1234}")");

Use this helper whenever raw strings would make diagnostics hard to interpret.

The maxLength parameter limits the length of the escaped result. If the output is truncated, a suffix like (... +<count> more) is appended.
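
The escaping and quoting rules can be approximated with the sketch below. The name consoleSafeSketch is hypothetical, the function takes UTF-32 input to keep the example short, and it omits the maxLength truncation entirely. How the library escapes control characters other than newline, carriage return, and tab is not documented here, so the \u{...} form used for them is an assumption.

```cpp
#include <cstdio>
#include <string>
#include <string_view>

// Sketch only: escapes a UTF-32 string for console output and wraps it in
// double quotes if it contains spaces or escape sequences. No truncation.
inline std::string consoleSafeSketch(std::u32string_view text) {
    std::string body;
    bool needsQuotes = false;
    // Appends a sequence that forces the result to be quoted.
    auto quoted = [&](std::string_view sequence) {
        body += sequence;
        needsQuotes = true;
    };
    for (const char32_t cp : text) {
        switch (cp) {
        case U'\n': quoted("\\n"); break;
        case U'\r': quoted("\\r"); break;
        case U'\t': quoted("\\t"); break;
        case U'\\': quoted("\\\\"); break;
        case U'"':  quoted("\\\""); break;
        case U' ':  quoted(" "); break; // stays literal, but forces quoting
        default:
            if (cp < 0x20 || cp >= 0x7F) {
                // Assumption: everything outside printable ASCII becomes \u{hex}.
                char buffer[16];
                std::snprintf(buffer, sizeof(buffer), "\\u{%x}", static_cast<unsigned>(cp));
                quoted(buffer);
            } else {
                body += static_cast<char>(cp);
            }
        }
    }
    return needsQuotes ? '"' + body + '"' : body;
}
```

With the input from the example above, this sketch produces the same quoted result. Plain ASCII words such as abc pass through unquoted.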

The `toStdString()` Function

This function converts supported string types into a UTF-8 encoded std::string.

std::string_view and std::u8string_view are copied as-is.
std::u16string_view, std::u32string_view, and std::wstring_view are transcoded to UTF-8.

Use this to normalize text across platforms before comparing or printing it.

Important

For std::string_view, the function does not validate UTF-8. It simply copies the bytes.

If you need validation, call requireValidUtf8() explicitly.
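
As a rough illustration of what transcoding to UTF-8 involves, here is a minimal UTF-32 to UTF-8 encoder. The names appendUtf8 and utf32ToUtf8 are hypothetical and this is not the library's code. Replacing surrogates and out-of-range values with U+FFFD is a choice made for this sketch.

```cpp
#include <string>
#include <string_view>

// Sketch: appends the UTF-8 encoding of one code-point to `out`.
// Assumption: invalid scalar values are replaced with U+FFFD.
inline void appendUtf8(std::string &out, char32_t cp) {
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) {
        cp = 0xFFFD;
    }
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Sketch: converts a whole UTF-32 string to a UTF-8 encoded std::string.
inline std::string utf32ToUtf8(std::u32string_view text) {
    std::string result;
    result.reserve(text.size());
    for (const char32_t cp : text) {
        appendUtf8(result, cp);
    }
    return result;
}
```

Normalizing both sides of a comparison to one encoding in this way avoids platform differences such as the width of wchar_t.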

The `toStdU32String()` Function

This function converts supported input into a UTF-32 std::u32string.

Use it when you want one Unicode code-point per element.

For UTF-8 input, malformed sequences are replaced with the Unicode replacement character U+FFFD instead of causing errors.

This makes the function safe for diagnostics and normalization, even when the input is not fully trustworthy.
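
The replacement behavior can be sketched with a small lossy decoder. The name decodeUtf8Lossy is hypothetical. This sketch emits one U+FFFD per offending byte and then resumes at the next byte. How many replacement characters the library emits for a multi-byte malformed sequence is not stated here, so that granularity is an assumption.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Sketch: decodes UTF-8 to UTF-32, substituting U+FFFD for malformed input.
inline std::u32string decodeUtf8Lossy(std::string_view in) {
    constexpr char32_t replacement = 0xFFFD;
    // Smallest code-point that legitimately needs a sequence of each length.
    constexpr char32_t minimum[] = {0, 0, 0x80, 0x800, 0x10000};
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const auto b0 = static_cast<unsigned char>(in[i]);
        std::size_t length = 0; // stays 0 for an invalid start byte
        char32_t cp = 0;
        if (b0 < 0x80) { length = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { length = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { length = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { length = 4; cp = b0 & 0x07; }
        bool ok = length != 0 && i + length <= in.size();
        for (std::size_t k = 1; ok && k < length; ++k) {
            const auto b = static_cast<unsigned char>(in[i + k]);
            ok = (b & 0xC0) == 0x80;
            cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong encodings, surrogates, and values above U+10FFFF.
        if (ok && (cp < minimum[length] || (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)) {
            ok = false;
        }
        if (ok) {
            out += cp;
            i += length;
        } else {
            out += replacement;
            i += 1; // resynchronize at the next byte
        }
    }
    return out;
}
```

Because the decoder never throws, it can be used inside failure messages without risking a second error while reporting the first.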

The Hex Parsing Functions

These helpers parse hexadecimal text into strings with fixed-width element types.

Use them when expressing test data as hex is clearer than using source code string literals.

Format Rules for Hex Parsing

All hex parsing helpers follow the same rules:

Spaces are optional.
Spaces may only appear between complete groups.
Leading and trailing spaces are not allowed.
Invalid hex digits throw std::logic_error.
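
The rules above can be sketched for the simplest case, two hex digits per byte. The names hexDigitValue and parseHexBytes are hypothetical and do not refer to the library's functions. Accepting both upper-case and lower-case digits, and tolerating more than one space between groups, are assumptions of this sketch.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <string_view>

// Sketch: returns the value of one hex digit or throws std::logic_error.
inline int hexDigitValue(char ch) {
    if (ch >= '0' && ch <= '9') return ch - '0';
    if (ch >= 'a' && ch <= 'f') return ch - 'a' + 10;
    if (ch >= 'A' && ch <= 'F') return ch - 'A' + 10;
    throw std::logic_error("invalid hex digit");
}

// Sketch: parses groups of two hex digits into a byte string.
inline std::string parseHexBytes(std::string_view hex) {
    std::string result;
    std::size_t i = 0;
    while (i < hex.size()) {
        if (hex[i] == ' ') {
            // Groups are consumed whole, so a space seen here always sits on
            // a group boundary. Only the first and last position are invalid.
            if (i == 0 || i + 1 == hex.size()) {
                throw std::logic_error("leading or trailing space");
            }
            ++i;
            continue;
        }
        if (i + 1 >= hex.size()) {
            throw std::logic_error("incomplete group");
        }
        // A space inside a group reaches hexDigitValue() and is rejected there.
        const int high = hexDigitValue(hex[i]);
        const int low = hexDigitValue(hex[i + 1]);
        result += static_cast<char>(high * 16 + low);
        i += 2;
    }
    return result;
}
```

With this sketch, "48 65 6c6C 6f" yields "Hello", while " 41", "41 ", "4 1", and "4g" all throw std::logic_error.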

The `invalidUtf8()` Function

This helper intentionally creates malformed UTF-8 sequences.

Use it to test validators, decoders, or error handling:

auto sample = th::invalidUtf8(th::Utf8Error::InvalidStartByte, "before ", " after");
CHECK_FALSE(myUtf8Validator(sample));

Combining it with allUtf8Errors allows you to build exhaustive negative tests with very little code.

The `splitLineViews()` Function

Splits text at \n and returns views into the original string.

Use this when you want to inspect lines without copying data.

Important

The returned views reference the original input. Do not modify or destroy the source string while they are in use.
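
The lifetime rule follows directly from how view-based splitting works, as the sketch below shows. The name splitViews is hypothetical. Returning a final empty line for text that ends in a newline, and a single empty line for empty input, are assumptions of this sketch that the library may handle differently.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Sketch: splits at '\n' without copying. Every element points into `text`.
inline std::vector<std::string_view> splitViews(std::string_view text) {
    std::vector<std::string_view> lines;
    std::size_t start = 0;
    while (true) {
        const auto end = text.find('\n', start);
        if (end == std::string_view::npos) {
            lines.push_back(text.substr(start)); // last (possibly empty) line
            break;
        }
        lines.push_back(text.substr(start, end - start));
        start = end + 1;
    }
    return lines;
}
```

Each returned view is only a pointer and a length into the caller's buffer. If the source std::string is reassigned or goes out of scope, the views dangle.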

The `splitLines()` Function

Same behavior as splitLineViews(), but returns owned strings instead of views.

Use this when you need independence from the original input lifetime.

The `requireEqualLines()` Function

Compares two sequences of lines and produces a side-by-side diff on failure.

The expected lines support simple wildcards:

* matches any number of characters.
? matches exactly one character.

Use this for stable comparisons of output that may contain variable fragments.

void testRequireEqualLines() {
    const auto expected = std::vector<std::string_view>{
        "hello one two three",
        "anything*",
        "*anything",
        "anything*anything",
        "two??wildcards",
        "*any??",
    };
    const auto actual = std::vector<std::string_view>{
        "hello one two three",
        "anything→goes.here",
        "another line with anything",
        "anything can be in the middle, anything",
        "two+-wildcards",
        "is there any::",
    };
    // manual use
    WITH_CONTEXT(th::requireEqualLines(*this, actual, expected));
    // macro use
    REQUIRE_EQUAL_LINES(actual, expected);
}

If the two line sequences don’t match, you will get diagnostic output like this:

| Actual (2)                  |     | Expected (3)                |
|-----------------------------|-----|-----------------------------|
| one two three four five six | === | one two three four five six |
| different                   |  X  | another line                |
|                             |  X  | last line                   |
|-----------------------------|-----|-----------------------------|
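
The wildcard rules for a single expected line can be sketched with a classic backtracking matcher. The name wildcardMatch is hypothetical and this is not the library's implementation. The sketch works on UTF-32 so that ? consumes one code-point rather than one byte, which is an assumption about how the library defines "character".

```cpp
#include <cstddef>
#include <string_view>

// Sketch: returns true if `text` matches `pattern`, where '*' matches any
// number of characters (including none) and '?' matches exactly one.
inline bool wildcardMatch(std::u32string_view text, std::u32string_view pattern) {
    constexpr auto none = std::u32string_view::npos;
    std::size_t t = 0;
    std::size_t p = 0;
    std::size_t starPattern = none; // index of the most recent '*' in pattern
    std::size_t starText = 0;       // text index that '*' currently extends to
    while (t < text.size()) {
        if (p < pattern.size() && pattern[p] == U'*') {
            // Remember the star and first try to let it match nothing.
            starPattern = p++;
            starText = t;
        } else if (p < pattern.size() && (pattern[p] == U'?' || pattern[p] == text[t])) {
            ++p;
            ++t;
        } else if (starPattern != none) {
            // Mismatch: let the last star swallow one more character and retry.
            p = starPattern + 1;
            t = ++starText;
        } else {
            return false;
        }
    }
    // Any remaining pattern must consist of stars only.
    while (p < pattern.size() && pattern[p] == U'*') {
        ++p;
    }
    return p == pattern.size();
}
```

Applied to the pairs from the example above, every actual line matches its expected pattern. A line such as "two+wildcards" would not match "two??wildcards", because each ? must consume exactly one character.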

The `REQUIRE_EQUAL_LINES` Macro

Convenience wrapper around requireEqualLines() that automatically integrates with the test context:

REQUIRE_EQUAL_LINES(actualLines, expectedLines);

The `requireValidUtf8()` Function

Validates that a string contains well-formed UTF-8 and only valid Unicode scalar values.

On failure, the test reports:

the byte position of the error,
a readable preview of the input.

Rejected cases include:

malformed byte sequences,
truncated sequences,
surrogate code-points,
code-points above U+10FFFF.

Use this when correctness of UTF-8 encoding is part of what your test verifies.
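
The rejected cases listed above map onto a handful of checks, sketched below. The name firstUtf8Error is hypothetical and the function is not the library's implementation. It returns the byte offset at which the first malformed sequence starts, which mirrors the byte position reported on failure. Overlong encodings are treated as malformed byte sequences here.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Sketch: returns the byte offset of the first invalid sequence,
// or std::nullopt if the whole input is well-formed UTF-8.
inline std::optional<std::size_t> firstUtf8Error(std::string_view in) {
    std::size_t i = 0;
    while (i < in.size()) {
        const auto b0 = static_cast<unsigned char>(in[i]);
        std::size_t length = 0;
        char32_t cp = 0;
        char32_t minimum = 0; // smallest value that needs this sequence length
        if (b0 < 0x80) { length = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { length = 2; cp = b0 & 0x1F; minimum = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { length = 3; cp = b0 & 0x0F; minimum = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { length = 4; cp = b0 & 0x07; minimum = 0x10000; }
        else { return i; } // continuation byte or invalid start byte
        if (i + length > in.size()) {
            return i; // truncated sequence
        }
        for (std::size_t k = 1; k < length; ++k) {
            const auto b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) {
                return i; // invalid continuation byte
            }
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < minimum || (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) {
            return i; // overlong, surrogate, or beyond the Unicode range
        }
        i += length;
    }
    return std::nullopt;
}
```

For the input "ab" followed by the byte 0xFF, this sketch reports offset 2.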

The `REQUIRE_VALID_UTF8` Macro

Convenience macro for requireValidUtf8():

REQUIRE_VALID_UTF8(resultText);

For the full API reference, see Text Helpers.