src: allow simdutf::convert_* functions to return zero (!47471)

Rodrigo Muino Tomonari requested to merge github/fork/lemire/node-string-invalid-unicode into main Apr 07, 2023

When transcoding with the simdutf library, you first scan the input to determine the size of the output (e.g., you scan the UTF-8 input to determine the size of the UTF-16 output). In a second step, you call a transcoding function. This transcoding function normally returns how many words were written. This number of words should match the size of the output computed during the first scan.

So you get three-step routines like the following (scan, allocate, transcode):

size_t expected_utf16_length =
        simdutf::utf16_length_from_utf8(string.data(), string.length());
MaybeStackBuffer<char16_t> buffer(expected_utf16_length);
size_t utf16_length = simdutf::convert_utf8_to_utf16(
        string.data(), string.length(), buffer.out());

The scan that determines the size of the output does not validate the Unicode input: validation occurs during the transcoding. For performance reasons, the scan only tells you how much memory you need to allocate and counts on the transcoding step to do the validation.
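To illustrate why the length scan can be cheap, here is a minimal scalar sketch of a non-validating scan. The function name `utf16_length_from_utf8_sketch` is hypothetical and this is not simdutf's implementation; it only assumes the usual UTF-8 byte layout (continuation bytes have the form 10xxxxxx, and 4-byte lead bytes are 0xF0 or above).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a length scan (NOT simdutf's code).
// It counts the UTF-16 code units implied by the UTF-8 lead bytes and
// never checks that the input is well-formed.
inline std::size_t utf16_length_from_utf8_sketch(const char* input,
                                                 std::size_t length) {
  std::size_t count = 0;
  for (std::size_t i = 0; i < length; ++i) {
    const std::uint8_t byte = static_cast<std::uint8_t>(input[i]);
    // Every byte that is not a continuation byte starts a code point.
    if ((byte & 0xC0) != 0x80) ++count;
    // A 4-byte lead maps to a surrogate pair, so it needs one extra unit.
    if (byte >= 0xF0) ++count;
  }
  return count;
}
```

Because the loop never rejects anything, an invalid byte such as 0xFF still produces a count rather than an error, which is why validation has to happen in the transcoding step.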

When the transcoding fails, the simdutf::convert_utf8_to_utf16 and simdutf::convert_utf16_to_utf8 functions return zero by convention, indicating an error. So you either have a successful transcoding (from valid Unicode to valid Unicode) in which case the transcoding function returns the number of written words, which matches exactly the expected number of output words, or you get zero, indicating that the input is invalid Unicode.

Currently, the simdutf library is used within src/inspector/node_string.cc with checks such as CHECK_EQ(expected_utf16_length, utf16_length). In effect, these checks pass if and only if the inputs are valid Unicode. That should almost always be the case within Node. However, @danpeixoto reports that the check fails in their case; see https://github.com/nodejs/node/issues/47457

I cannot reproduce @danpeixoto's issue. See my comments on the issue. Nevertheless, it seems warranted to make the code more robust in case we do have bad Unicode inputs.

This is what this PR does: it checks whether the transcoding function returns 0, and if it does, it assumes that the input was invalid.

By convention, the routines return the empty string or null when the input is invalid. This could be changed to some other convention.
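The pattern described above can be sketched in a self-contained way as follows. This is not the code in this PR: `scan_utf16_length`, `convert_utf8_to_utf16_sketch` and `utf8_to_utf16_or_empty` are hypothetical names, and the scalar converter is only a stand-in that follows the same return-zero-on-invalid convention as simdutf::convert_utf8_to_utf16. A std::u16string stands in for Node's MaybeStackBuffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical non-validating length scan (stand-in, not simdutf).
inline std::size_t scan_utf16_length(const std::string& in) {
  std::size_t count = 0;
  for (unsigned char byte : in) {
    if ((byte & 0xC0) != 0x80) ++count;  // start of a code point
    if (byte >= 0xF0) ++count;           // surrogate pair needs two units
  }
  return count;
}

// Hypothetical validating transcoder (stand-in, not simdutf).
// Returns the number of UTF-16 units written, or 0 if the input is invalid.
inline std::size_t convert_utf8_to_utf16_sketch(const std::string& in,
                                                char16_t* out) {
  static constexpr std::uint32_t kMin[] = {0, 0, 0x80, 0x800, 0x10000};
  std::size_t written = 0;
  for (std::size_t i = 0; i < in.size();) {
    const std::uint8_t b0 = static_cast<std::uint8_t>(in[i]);
    std::uint32_t cp = 0;
    std::size_t len = 0;
    if (b0 < 0x80) { cp = b0; len = 1; }
    else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; }
    else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; }
    else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; }
    else return 0;                      // stray continuation or bad lead byte
    if (i + len > in.size()) return 0;  // truncated sequence
    for (std::size_t k = 1; k < len; ++k) {
      const std::uint8_t b = static_cast<std::uint8_t>(in[i + k]);
      if ((b & 0xC0) != 0x80) return 0;  // expected a continuation byte
      cp = (cp << 6) | (b & 0x3F);
    }
    // Reject overlong encodings, surrogates, and out-of-range code points.
    if (cp < kMin[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
      return 0;
    if (cp < 0x10000) {
      out[written++] = static_cast<char16_t>(cp);
    } else {
      cp -= 0x10000;
      out[written++] = static_cast<char16_t>(0xD800 + (cp >> 10));
      out[written++] = static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
    }
    i += len;
  }
  return written;
}

// Scan, allocate, transcode. Returns the empty string on invalid input
// instead of asserting that the two lengths match.
inline std::u16string utf8_to_utf16_or_empty(const std::string& in) {
  std::u16string buffer(scan_utf16_length(in), u'\0');
  const std::size_t written = convert_utf8_to_utf16_sketch(in, &buffer[0]);
  if (written == 0) return std::u16string();  // invalid (or empty) input
  buffer.resize(written);  // on success this equals the scanned length
  return buffer;
}
```

With this sketch, `utf8_to_utf16_or_empty("abc")` yields the three-unit string `u"abc"`, while an invalid input such as `"\xFF"` yields an empty string rather than tripping an assertion. Note that a valid empty input and an invalid input are indistinguishable under this convention, which is one reason a different convention might be preferred.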
