79778937

Date: 2025-09-30 09:37:35
Score: 0.5

UTF-8 and UTF-16 are both variable-length encodings of Unicode code points, so you cannot assume that s[i] contains a whole character: a code point takes one to four bytes in UTF-8 and one or two 16-bit code units in UTF-16. If you want to process each code point individually, you need to decode the UTF-8 (or UTF-16) string into a sequence of 32-bit code points (e.g. std::u32string). For decoding UTF-8 you can, for example, use the ICU library (see also "Decoding UTF-8 std::string to std::u32string?", "How do I decode UTF-8?" and "C++ & Boost: encode/decode UTF-8").
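If pulling in ICU is not an option, the decoding step can be sketched by hand. This is a minimal illustration, not a replacement for a tested library: decode_utf8 is a hypothetical helper name, and throwing std::invalid_argument on malformed input is an assumption (a real application might prefer to substitute U+FFFD instead).

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of any library): decodes a UTF-8 byte
// string into UTF-32 code points. Throws on malformed input.
std::u32string decode_utf8(const std::string& s) {
    // Smallest code point that legitimately needs a sequence of length 1..4;
    // anything below it is an overlong encoding.
    static const char32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};

    std::u32string out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        char32_t cp = 0;
        std::size_t len = 0;
        // The lead byte determines the sequence length and its payload bits.
        if (b < 0x80)                { cp = b;        len = 1; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; len = 2; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; len = 3; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (len > s.size() - i)
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte has the form 10xxxxxx and adds 6 bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }

        // Reject overlong forms, surrogates, and values above U+10FFFF.
        if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point in UTF-8 input");

        out.push_back(cp);
        i += len;
    }
    return out;
}
```

After decoding, indexing the result gives one code point per element, so for the 10-byte input "a\xC3\xA9\xE2\x82\xAC\xF0\x9F\x98\x80" the returned string has four elements.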

If you know that your UTF-8 string contains only ASCII characters, you don't need to decode it, because every ASCII character is encoded as exactly one byte.
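When the ASCII-only property is not known in advance, it can be checked cheaply, since every byte of a multi-byte UTF-8 sequence has its high bit set. A minimal sketch, where is_ascii is a hypothetical helper name:

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: returns true if every byte is below 0x80, in which
// case s[i] is a whole character for every valid index i.
bool is_ascii(const std::string& s) {
    return std::all_of(s.begin(), s.end(),
                       [](unsigned char c) { return c < 0x80; });
}
```

If the check fails, byte-wise indexing may land in the middle of a character, and decoding is the safer route.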

If you know that your UTF-16 string contains only code points from U+0000 to U+D7FF or from U+E000 to U+FFFF (the "Basic Multilingual Plane", which covers Latin, Greek, Cyrillic, etc.), you don't need to decode it either, because each of those code points is stored as a single 16-bit code unit. But if the string may contain arbitrary Unicode characters (such as emoji, most of which lie outside that plane), you should decode it before processing individual characters.
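The UTF-16 case can be sketched in the same spirit: a string is safe to index directly exactly when it contains no surrogate code units, and otherwise surrogate pairs have to be combined. The names is_bmp_only and decode_utf16 are hypothetical, and replacing unpaired surrogates with U+FFFD is an assumed error policy.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Code units in D800..DFFF are surrogates, i.e. halves of a pair.
bool is_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDFFF; }

// Hypothetical helper: true if every code unit is a complete code point.
bool is_bmp_only(const std::u16string& s) {
    return std::none_of(s.begin(), s.end(), is_surrogate);
}

// Hypothetical helper: decodes UTF-16 into UTF-32 code points.
// Unpaired surrogates become U+FFFD (an assumed policy).
std::u32string decode_utf16(const std::u16string& s) {
    std::u32string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        char32_t u = s[i];
        const bool high = u >= 0xD800 && u <= 0xDBFF;
        if (high && i + 1 < s.size() &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            // Combine a high and a low surrogate into one code point.
            const char32_t lo = s[i + 1];
            u = 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
            ++i;  // the low surrogate has been consumed
        } else if (u >= 0xD800 && u <= 0xDFFF) {
            u = 0xFFFD;  // lone surrogate
        }
        out.push_back(u);
    }
    return out;
}
```

For example, the emoji U+1F600 is stored as the two code units D83D DE00, so is_bmp_only reports false for it and decode_utf16 yields a single element.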

Note that if you process tokens instead (for example, words separated by spaces), you don't always need to decode.
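Splitting on spaces is one such case for UTF-8: the byte 0x20 never occurs inside a multi-byte sequence, so searching for it on the raw bytes cannot cut a character in half. A minimal sketch, where split_on_spaces is a hypothetical helper and only the ASCII space (not other Unicode whitespace) is treated as a separator:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: splits a UTF-8 string on ASCII spaces without
// decoding it. Empty tokens from consecutive spaces are skipped.
std::vector<std::string> split_on_spaces(const std::string& s) {
    std::vector<std::string> tokens;
    std::size_t pos = 0;
    while (pos < s.size()) {
        std::size_t end = s.find(' ', pos);
        if (end == std::string::npos) end = s.size();
        if (end > pos) tokens.push_back(s.substr(pos, end - pos));
        pos = end + 1;
    }
    return tokens;
}
```

Each token is still valid UTF-8 and can be decoded later if its individual characters matter.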

Reasons:
  • Blacklisted phrase (1): How do I
  • Long answer (-0.5):
  • Has code block (-0.5):
  • Low reputation (0.5):
Posted by: Jens