string_decoder: fix handling of malformed utf8

Checklist
  • make -j4 test (UNIX) or vcbuild test nosign (Windows) passes
  • a test and/or benchmark is included
  • the commit message follows commit guidelines
Affected core subsystem(s)
  • string_decoder
Description of change

There have been problems with utf8 decoding in cases where the input was invalid. Some cases would give different results depending on chunking, while others even led to exceptions. This commit simplifies the code in several ways, reducing the risk of breakage.
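As a rough way to observe the chunking dependence described above, the sketch below decodes the same bytes once as a single chunk and once byte by byte, then compares the results. The helper `decodeInChunks` and the sample inputs are made up for illustration and are not part of this PR; the printed result depends on the Node.js version in use.

```javascript
'use strict';
const { StringDecoder } = require('string_decoder');

// Hypothetical helper (not from this PR): feed `buf` to a StringDecoder
// in slices of `size` bytes, then flush whatever is left with end().
function decodeInChunks(buf, size) {
  const decoder = new StringDecoder('utf8');
  let out = '';
  for (let i = 0; i < buf.length; i += size) {
    out += decoder.write(buf.subarray(i, i + size));
  }
  return out + decoder.end();
}

// Illustrative inputs: one valid, two malformed.
const samples = [
  Buffer.from('a€b'),               // valid: '€' is the three bytes E2 82 AC
  Buffer.from([0x61, 0xE2, 0x62]),  // lead byte followed by an ASCII byte
  Buffer.from([0xE2, 0x82]),        // sequence truncated at end of input
];

for (const buf of samples) {
  const whole = decodeInChunks(buf, buf.length); // one single chunk
  const bytewise = decodeInChunks(buf, 1);       // one byte per chunk
  console.log(buf.toString('hex'),
              whole === bytewise ? 'consistent' : 'differs');
}
```

A decoder that behaves as intended should report `consistent` for every input, whether valid or not.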

Most importantly, the text method is now used not only for bulk conversion of a single chunk, but also for converting the mini buffer lastChar when handling characters that span chunks. Routing both paths through the same code should make most of the problems caused by incorrect handling of special cases disappear.

Secondly, that text method is now independent of encoding. The encoding-dependent method complete now has a single well-defined task: determine the buffer position up to which the input consists of complete characters. The actual conversion is handled in a central location, leading to a cleaner and leaner internal interface.
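To illustrate the division of labour described above, here is a minimal sketch of what a utf8 version of such a boundary function could look like. The name `completeUtf8` and its exact behaviour are assumptions made for this example; this is not the code in the commit.

```javascript
'use strict';

// Hypothetical sketch, not the PR's implementation: return the index up to
// which `buf` consists of complete utf8 sequences. Malformed bytes count as
// "complete" so that the conversion step can replace them right away.
function completeUtf8(buf) {
  const end = buf.length;
  // A utf8 sequence has at most 4 bytes, so an unfinished one can start
  // no earlier than 3 bytes before the end of the buffer.
  for (let i = end - 1; i >= 0 && i >= end - 3; i--) {
    const byte = buf[i];
    if ((byte & 0xC0) === 0x80) continue;         // continuation byte
    let needed = 1;                               // ASCII or invalid lead byte
    if ((byte & 0xE0) === 0xC0) needed = 2;       // 110xxxxx
    else if ((byte & 0xF0) === 0xE0) needed = 3;  // 1110xxxx
    else if ((byte & 0xF8) === 0xF0) needed = 4;  // 11110xxx
    // Cut before the lead byte only if its sequence is still unfinished.
    return end - i < needed ? i : end;
  }
  return end;
}

// Encoding-independent part: convert the complete prefix, keep the rest.
const chunk = Buffer.from([0x61, 0xE2, 0x82]);  // 'a' plus two bytes of '€'
const pos = completeUtf8(chunk);
const text = chunk.toString('utf8', 0, pos);
const rest = chunk.subarray(pos);
console.log(JSON.stringify(text), rest.length);
```

With this split, only the boundary function needs to know about the encoding; the `toString` call and the bookkeeping for the leftover bytes stay the same for every encoding.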

Thirdly, we no longer try to remember just how many bytes we'll need for the next complete character. We simply try to fill up the nextChar buffer and perform a conversion on that. This reduces the number of internal counter variables from two to one, namely partial, which indicates the number of bytes currently stored in nextChar. A possible drawback of this approach is that there is a chance of degraded performance if input arrives one byte at a time and is from a script using long utf8 sequences. As a benefit, though, this allows us to emit a U+FFFD replacement character sooner in cases where the first byte of a utf8 sequence is not followed by the expected number of continuation bytes.
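The case in question can be reproduced with a lead byte that is followed by a non-continuation byte. The snippet below is only an illustration with made-up input; it prints what each call returns rather than asserting it, since the timing of the replacement character in the streaming case depends on the Node.js version.

```javascript
'use strict';
const { StringDecoder } = require('string_decoder');

// 0xE2 announces a three-byte sequence, but 0x41 ('A') is not a
// continuation byte, so the sequence is malformed.
const bulk = Buffer.from([0xE2, 0x41]).toString('utf8');
console.log('bulk:', JSON.stringify(bulk));

// Same bytes, delivered one at a time. Each entry shows what that
// particular call returned, i.e. when the decoder emitted its output.
const decoder = new StringDecoder('utf8');
const pieces = [
  decoder.write(Buffer.from([0xE2])),
  decoder.write(Buffer.from([0x41])),
  decoder.end(),
];
console.log('streamed:', pieces.map((s) => JSON.stringify(s)));
```

Bulk conversion yields U+FFFD followed by `A`. The streamed pieces, joined together, should match that, and the replacement character should show up as soon as the decoder can tell that the sequence will not be completed.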

Fixes: https://github.com/nodejs/node/issues/7308

This is an alternative to #7310; merging this makes that obsolete.
