Text Helpers
The text helper API in erbsland::unittest::th provides practical utilities for working with text in unit tests.
It helps you:
Count Unicode code-points instead of raw code units.
Convert between UTF-8, UTF-16, UTF-32, and wchar_t-based strings.
Produce readable diagnostics for text containing control characters or malformed UTF-8.
Generate malformed UTF-8 input for validation tests.
Split text into lines for comparisons.
Compare multi-line output with simple wildcard support.
Verify that a byte string contains valid UTF-8.
If you use these helpers frequently, introducing a namespace alias keeps your test code concise:
#include <erbsland/unittest/TextHelper.hpp>
namespace th = erbsland::unittest::th;
Warning
These helpers are designed for reliability, diagnostics, and convenience — not for raw speed.
Several functions internally normalize text to UTF-8 or UTF-32 before processing. This makes them robust and predictable, but also unsuitable for performance-critical hot paths.
If your test repeatedly processes large volumes of text in tight loops, prefer a specialized implementation that operates directly on your native data representation.
Supported String Types
Most helpers provide overloads for the following string view types:
std::string_view
std::u8string_view
std::u16string_view
std::u32string_view
std::wstring_view
The overloads are intentionally parallel. In most cases, you can call the same helper regardless of the text type your test already uses.
The Utf8Error Enum
The Utf8Error enum specifies which malformed UTF-8 sequence invalidUtf8() should generate.
Available values:
UnexpectedContinuationByte: A continuation byte without a valid lead byte.
Overlong2ByteSequence: A two-byte overlong encoding.
Truncated2ByteSequence: A two-byte sequence missing its continuation byte.
InvalidContinuationByteIn2ByteSequence: A two-byte sequence followed by an invalid continuation byte.
Overlong3ByteSequence: A three-byte overlong encoding.
Truncated3ByteSequence: A three-byte sequence with missing trailing bytes.
InvalidContinuationByteIn3ByteSequence: A three-byte sequence containing an invalid continuation byte.
SurrogateCodePoint: Bytes that decode to a UTF-16 surrogate code-point (invalid in UTF-8).
Overlong4ByteSequence: A four-byte overlong encoding.
Truncated4ByteSequence: A four-byte sequence with missing trailing bytes.
InvalidContinuationByteIn4ByteSequence: A four-byte sequence containing an invalid continuation byte.
CodePointBeyondUnicodeRange: A sequence above U+10FFFF.
InvalidStartByte: A byte that is not valid as a UTF-8 start byte.
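To make a few of these categories concrete, the following sketch maps them to well-known malformed byte sequences. The enum SampleError and the function sampleBytes() are hypothetical names used only for illustration; the bytes that invalidUtf8() actually emits may differ.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a few of the categories listed above.
enum class SampleError {
    UnexpectedContinuationByte,
    Overlong2ByteSequence,
    Truncated3ByteSequence,
    SurrogateCodePoint,
    CodePointBeyondUnicodeRange,
    InvalidStartByte,
};

// Returns one well-known malformed byte sequence per category.
// Illustrative choices only, not necessarily what the library generates.
std::string sampleBytes(SampleError error) {
    switch (error) {
    case SampleError::UnexpectedContinuationByte:
        return "\x80";             // continuation byte with no lead byte
    case SampleError::Overlong2ByteSequence:
        return "\xC0\x80";         // U+0000 encoded in two bytes
    case SampleError::Truncated3ByteSequence:
        return "\xE2\x82";         // first two bytes of U+20AC, third byte missing
    case SampleError::SurrogateCodePoint:
        return "\xED\xA0\x80";     // would decode to U+D800
    case SampleError::CodePointBeyondUnicodeRange:
        return "\xF4\x90\x80\x80"; // would decode to U+110000
    case SampleError::InvalidStartByte:
        return "\xFF";             // 0xFF never appears in UTF-8
    }
    return {};
}
```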
The allUtf8Errors Constant
The allUtf8Errors constant contains all enum values in a single iterable array.
This makes it easy to write exhaustive tests:
for (const auto error : th::allUtf8Errors) {
    auto malformed = th::invalidUtf8(error, "pre", "suf");
    CHECK_FALSE(myUtf8Validator(malformed));
}
Use this pattern when you want to ensure your UTF-8 handling rejects all supported malformed cases.
The characterCount() Function
Use characterCount() to count Unicode code-points instead of bytes or UTF code units.
auto count1 = th::characterCount("abc");
auto count2 = th::characterCount("abc→😀");
REQUIRE_EQUAL(count1, 3);
REQUIRE_EQUAL(count2, 5);
This is especially useful when your test data contains non-ASCII characters and std::string::size() would give misleading results.
Note
The result counts Unicode code-points, not grapheme clusters. A single visible character may still consist of multiple code-points.
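The difference between bytes and code-points can be sketched with the standard library alone. The function countCodePoints() below is a hypothetical helper, not part of the library; it assumes well-formed UTF-8 and counts every byte that is not a continuation byte.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper: counts code-points in well-formed UTF-8 by counting
// every byte that is not a continuation byte (bit pattern 10xxxxxx).
std::size_t countCodePoints(std::string_view utf8) {
    std::size_t count = 0;
    for (const char c : utf8) {
        if ((static_cast<unsigned char>(c) & 0xC0u) != 0x80u) {
            ++count;
        }
    }
    return count;
}
```

For the text "abc→😀", written as "abc\xE2\x86\x92\xF0\x9F\x98\x80", size() reports 10 bytes while this sketch reports 5 code-points.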
The toConsoleSafeString() Function
This helper converts text into a representation that is safe and readable in console output.
It escapes:
control characters (newline, carriage return, tab),
invalid UTF-8 bytes,
non-ASCII code-points,
backslashes and double quotes.
If the result contains spaces or escape sequences, it is automatically wrapped in double quotes.
auto text = th::toConsoleSafeString("abc\n\r\t\\\"\u1234", 100);
REQUIRE_EQUAL(text, R"("abc\n\r\t\\\"\u{1234}")");
Use this helper whenever raw strings would make diagnostics hard to interpret.
The maxLength parameter limits the length of the escaped result. If the output is truncated, a suffix like (... +<count> more) is appended.
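The escaping and quoting rules can be approximated as follows. This is a simplified sketch under stated assumptions: escapeForConsole() is a hypothetical name, it handles only the ASCII cases from the list above, and it ignores non-ASCII code-points, invalid UTF-8, and the maxLength truncation.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical helper: escapes control characters, backslashes, and double
// quotes, then wraps the result in quotes if it contains spaces or escapes.
// Non-ASCII input and length limits are deliberately not handled here.
std::string escapeForConsole(std::string_view text) {
    std::string out;
    bool needsQuotes = false;
    for (const char c : text) {
        switch (c) {
        case '\n': out += "\\n"; needsQuotes = true; break;
        case '\r': out += "\\r"; needsQuotes = true; break;
        case '\t': out += "\\t"; needsQuotes = true; break;
        case '\\': out += "\\\\"; needsQuotes = true; break;
        case '"': out += "\\\""; needsQuotes = true; break;
        case ' ': out += ' '; needsQuotes = true; break;
        default: out += c; break;
        }
    }
    if (needsQuotes) {
        return "\"" + out + "\"";
    }
    return out;
}
```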
The toStdString() Function
This function converts supported string types into a UTF-8 encoded std::string.
std::string_view and std::u8string_view are copied as-is.
std::u16string_view, std::u32string_view, and std::wstring_view are transcoded to UTF-8.
Use this to normalize text across platforms before comparing or printing it.
Important
For std::string_view, the function does not validate UTF-8. It simply copies the bytes.
If you need validation, call requireValidUtf8() explicitly.
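To illustrate what transcoding to UTF-8 involves, here is a minimal sketch for UTF-32 input. The function encodeUtf8() is a hypothetical name, and the sketch assumes every element is a valid Unicode scalar value; the library's own conversion may handle invalid input differently.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical helper: encodes UTF-32 text as UTF-8.
// Assumes every element is a valid Unicode scalar value.
std::string encodeUtf8(std::u32string_view text) {
    std::string out;
    for (const char32_t cp : text) {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```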
The toStdU32String() Function
This function converts supported input into a UTF-32 std::u32string.
Use it when you want one Unicode code-point per element.
For UTF-8 input, malformed sequences are replaced with the Unicode replacement character U+FFFD instead of causing errors.
This makes the function safe for diagnostics and normalization, even when the input is not fully trustworthy.
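The replacement behavior can be sketched as a lenient decoder. The function decodeLenient() is a hypothetical name, and the choice to emit one U+FFFD per offending byte is an assumption of this sketch; the library may group malformed bytes differently.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: decodes UTF-8 into UTF-32 and substitutes U+FFFD
// for every byte that does not start a well-formed sequence.
std::u32string decodeLenient(std::string_view utf8) {
    // Smallest code-point that legitimately needs 1, 2, 3, or 4 bytes.
    static constexpr char32_t minimum[] = {0, 0, 0x80, 0x800, 0x10000};
    std::u32string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const auto lead = static_cast<unsigned char>(utf8[i]);
        std::size_t length = 0;
        char32_t cp = 0;
        if (lead < 0x80) { length = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { length = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { length = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { length = 4; cp = lead & 0x07; }
        bool ok = length != 0 && i + length <= utf8.size();
        for (std::size_t k = 1; ok && k < length; ++k) {
            const auto next = static_cast<unsigned char>(utf8[i + k]);
            ok = (next & 0xC0) == 0x80;
            cp = (cp << 6) | (next & 0x3F);
        }
        // Reject overlong forms, surrogates, and values above U+10FFFF.
        if (ok) {
            ok = cp >= minimum[length] && cp <= 0x10FFFF && (cp < 0xD800 || cp > 0xDFFF);
        }
        if (ok) {
            out += cp;
            i += length;
        } else {
            out += U'\uFFFD';
            i += 1; // resynchronize at the next byte
        }
    }
    return out;
}
```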
The Hex Parsing Functions
These helpers parse hexadecimal text into strings with fixed-width element types:
Use them when expressing test data as hex is clearer than using source code string literals.
Format Rules for Hex Parsing
All hex parsing helpers follow the same rules:
Spaces are optional.
Spaces may only appear between complete groups.
Leading and trailing spaces are not allowed.
Invalid hex digits throw std::logic_error.
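The rules above can be sketched for the byte-string case, where each group is two hex digits. The names hexDigitValue() and parseHexBytes() are hypothetical and do not refer to the library's functions. As additional assumptions, this sketch rejects consecutive spaces and incomplete groups with std::logic_error.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <string_view>

// Converts one hex digit to its value, or throws std::logic_error.
int hexDigitValue(char c) {
    if (c >= '0' && c <= '9') { return c - '0'; }
    if (c >= 'a' && c <= 'f') { return c - 'a' + 10; }
    if (c >= 'A' && c <= 'F') { return c - 'A' + 10; }
    throw std::logic_error("invalid hex digit");
}

// Hypothetical parser for byte strings: each group is exactly two hex digits.
std::string parseHexBytes(std::string_view hex) {
    std::string out;
    std::size_t i = 0;
    while (i < hex.size()) {
        if (hex[i] == ' ') {
            // A space is only accepted between two complete groups.
            if (out.empty() || i + 1 == hex.size() || hex[i + 1] == ' ') {
                throw std::logic_error("misplaced space");
            }
            ++i;
            continue;
        }
        if (i + 1 >= hex.size()) {
            throw std::logic_error("incomplete group");
        }
        const int high = hexDigitValue(hex[i]);
        const int low = hexDigitValue(hex[i + 1]);
        out += static_cast<char>(high * 16 + low);
        i += 2;
    }
    return out;
}
```

With this sketch, "48656c6c6f" and "48 65 6C 6C 6F" both yield "Hello", while " 48", "48 ", "4 8", and "4g" throw.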
The invalidUtf8() Function
This helper intentionally creates malformed UTF-8 sequences.
Use it to test validators, decoders, or error handling:
auto sample = th::invalidUtf8(th::Utf8Error::InvalidStartByte, "before ", " after");
CHECK_FALSE(myUtf8Validator(sample));
Combining it with allUtf8Errors allows you to build exhaustive negative tests with very little code.
The splitLineViews() Function
Splits text at \n and returns views into the original string.
Use this when you want to inspect lines without copying data.
Important
The returned views reference the original input. Do not modify or destroy the source string while they are in use.
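The lifetime constraint follows from how view-based splitting works, as this sketch shows. The function splitViews() is a hypothetical name; how the library treats empty input or a trailing newline is not specified here, so the behavior noted in the comment is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: splits at '\n' and returns views into `text`.
// In this sketch, a trailing newline yields a final empty line.
std::vector<std::string_view> splitViews(std::string_view text) {
    std::vector<std::string_view> lines;
    std::size_t start = 0;
    while (true) {
        const std::size_t end = text.find('\n', start);
        if (end == std::string_view::npos) {
            lines.push_back(text.substr(start));
            break;
        }
        lines.push_back(text.substr(start, end - start));
        start = end + 1;
    }
    return lines;
}
```

Each returned view points directly into the source buffer, so no characters are copied and the views dangle as soon as the source string is modified or destroyed.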
The splitLines() Function
Behaves like splitLineViews(), but returns owned strings instead of views.
Use this when you need independence from the original input lifetime.
The requireEqualLines() Function
Compares two sequences of lines and produces a side-by-side diff on failure.
The expected lines support simple wildcards:
* matches any number of characters.
? matches exactly one character.
Use this for stable comparisons of output that may contain variable fragments.
void testRequireEqualLines() {
    const auto expected = std::vector<std::string_view>{
        "hello one two three",
        "anything*",
        "*anything",
        "anything*anything",
        "two??wildcards",
        "*any??",
    };
    const auto actual = std::vector<std::string_view>{
        "hello one two three",
        "anything→goes.here",
        "another line with anything",
        "anything can be in the middle, anything",
        "two+-wildcards",
        "is there any::",
    };
    // manual use
    WITH_CONTEXT(th::requireEqualLines(*this, actual, expected));
    // macro use
    REQUIRE_EQUAL_LINES(actual, expected);
}
If the two line sequences don’t match, you will get diagnostic output like this:
| Actual (2)                  |     | Expected (3)                |
|-----------------------------|-----|-----------------------------|
| one two three four five six | === | one two three four five six |
| different                   | X   | another line                |
|                             | X   | last line                   |
|-----------------------------|-----|-----------------------------|
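The wildcard rules for expected lines can be sketched with a small backtracking matcher. The function matchesWildcard() is a hypothetical name and not the library's implementation. It operates on bytes, so in this sketch a ? matches one byte rather than one code-point.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical matcher: '*' matches any run of bytes (including none),
// '?' matches exactly one byte. Operates on bytes, not code-points.
bool matchesWildcard(std::string_view pattern, std::string_view text) {
    constexpr auto none = std::string_view::npos;
    std::size_t p = 0;
    std::size_t t = 0;
    std::size_t starP = none; // position of the last '*' seen in the pattern
    std::size_t starT = 0;    // text position that '*' currently extends to
    while (t < text.size()) {
        if (p < pattern.size() && pattern[p] == '*') {
            starP = p++;
            starT = t;
        } else if (p < pattern.size() && (pattern[p] == '?' || pattern[p] == text[t])) {
            ++p;
            ++t;
        } else if (starP != none) {
            // Backtrack: let the last '*' absorb one more byte.
            p = starP + 1;
            t = ++starT;
        } else {
            return false;
        }
    }
    // Any remaining pattern characters must all be '*'.
    while (p < pattern.size() && pattern[p] == '*') {
        ++p;
    }
    return p == pattern.size();
}
```

With this sketch, "two??wildcards" matches "two+-wildcards" but not "two+wildcards", and "*any??" matches "is there any::".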
The REQUIRE_EQUAL_LINES Macro
Convenience wrapper around requireEqualLines() that automatically integrates with the test context:
REQUIRE_EQUAL_LINES(actualLines, expectedLines);
The requireValidUtf8() Function
Validates that a string contains well-formed UTF-8 and only valid Unicode scalar values.
On failure, the test reports:
the byte position of the error,
a readable preview of the input.
Rejected cases include:
malformed byte sequences,
truncated sequences,
surrogate code-points,
code-points above U+10FFFF.
Use this when correctness of UTF-8 encoding is part of what your test verifies.
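The rejected cases listed above can be sketched as a standalone validator that reports a byte position. The function firstUtf8Error() is a hypothetical name; reporting the offset of the lead byte of the offending sequence is an assumption of this sketch, and the library may report a different position.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical validator: returns the byte offset of the first malformed
// sequence, or std::nullopt if `utf8` is well-formed.
std::optional<std::size_t> firstUtf8Error(std::string_view utf8) {
    std::size_t i = 0;
    while (i < utf8.size()) {
        const auto lead = static_cast<unsigned char>(utf8[i]);
        std::size_t length = 0;
        char32_t cp = 0;
        char32_t minimum = 0;
        if (lead < 0x80) { length = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { length = 2; cp = lead & 0x1F; minimum = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { length = 3; cp = lead & 0x0F; minimum = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { length = 4; cp = lead & 0x07; minimum = 0x10000; }
        else { return i; } // stray continuation byte or invalid start byte
        if (i + length > utf8.size()) { return i; } // truncated sequence
        for (std::size_t k = 1; k < length; ++k) {
            const auto next = static_cast<unsigned char>(utf8[i + k]);
            if ((next & 0xC0) != 0x80) { return i; } // invalid continuation byte
            cp = (cp << 6) | (next & 0x3F);
        }
        if (cp < minimum) { return i; }                 // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) { return i; } // surrogate code-point
        if (cp > 0x10FFFF) { return i; }                // beyond the Unicode range
        i += length;
    }
    return std::nullopt;
}
```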
The REQUIRE_VALID_UTF8 Macro
Convenience macro for requireValidUtf8():
REQUIRE_VALID_UTF8(resultText);
For the full API reference, see Text Helpers.