79342701

Date: 2025-01-09 13:08:29
Score: 1

I do it by reading the first byte, which tells you whether to return right away or read more, and how many more bytes to read. That makes at most two read operations on the stream; mine is buffered, so I assume the cost is acceptable. I prefer to consume the bytes sequentially rather than move the stream pointer back and forth on every call.

It returns an int since I want to know if a (valid) EOF was reached. An incomplete multi-byte character, invalid formatting, or a read error throws an exception instead. It supports the whole variable length that the encoding scheme can express (a lead byte followed by up to 5 continuation bytes), not just the 4-byte maximum.
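Because the longer 5- and 6-byte forms are accepted, a caller that needs strict modern UTF-8 (code points up to U+10FFFF, no surrogates) may want to check the returned value afterwards. A minimal sketch, assuming a hypothetical helper named isUnicodeScalarValue that is not part of the original function:

```cpp
// Hypothetical helper (name is an assumption, not from the original post).
// Returns true if cp is a Unicode scalar value: within U+0000..U+10FFFF
// and outside the surrogate range U+D800..U+DFFF.
bool isUnicodeScalarValue(int cp) {
    if (cp < 0 || cp > 0x10FFFF) {
        return false;  // negative (e.g. EOF) or beyond the Unicode range
    }
    if (cp >= 0xD800 && cp <= 0xDFFF) {
        return false;  // surrogates are not valid in UTF-8
    }
    return true;
}
```

This only checks the decoded value. It cannot detect overlong encodings, because the number of bytes consumed is not returned.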

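#include <cstdio>     // EOF
#include <istream>
#include <stdexcept>  // std::runtime_error
#include <string>

// Reads one UTF-8 encoded character and returns its code point.
// Returns EOF only if the stream ends cleanly before the first byte.
// Throws std::runtime_error on a read error, a truncated sequence, or an invalid byte.
int readUTF8Char(std::istream* is) {
    int v = is->get();
    if (v == EOF) {
        // get() also returns EOF on failure; badbit marks a real read error.
        if (is->bad()) {
            throw std::runtime_error("Error reading next character: read error");
        }
        return EOF;
    }
    // unsigned char keeps bytes >= 0x80 from being sign-extended.
    unsigned char ch = static_cast<unsigned char>(v);
    int res = ch;
    if (ch & 0b10000000) {
        int rem = 0;  // number of continuation bytes still to read
        char buf[5] = { 0 };
        // The count of leading 1 bits in the first byte gives the sequence length.
        if ((ch & 0b11100000) == 0b11000000) {
            rem = 1;
            res = ch & 0b00011111;
        } else if ((ch & 0b11110000) == 0b11100000) {
            rem = 2;
            res = ch & 0b00001111;
        } else if ((ch & 0b11111000) == 0b11110000) {
            rem = 3;
            res = ch & 0b00000111;
        } else if ((ch & 0b11111100) == 0b11111000) {
            rem = 4;
            res = ch & 0b00000011;
        } else if ((ch & 0b11111110) == 0b11111100) {
            rem = 5;
            res = ch & 0b00000001;
        } else {
            throw std::runtime_error("Invalid UTF8 formatting: " + std::to_string(static_cast<int>(ch)) + " is not a valid starting byte");
        }
        is->read(buf, rem);
        if ((is->rdstate() & std::ios_base::failbit) && (is->rdstate() & std::ios_base::eofbit)) {
            throw std::runtime_error("Error reading composite character: end of stream");
        } else if (!is->good()) {
            throw std::runtime_error("Error reading next character: read error");
        }
        for (int i = 0; i < rem; i++) {
            ch = static_cast<unsigned char>(buf[i]);
            // Every continuation byte must have the form 10xxxxxx.
            if ((ch & 0b11000000) != 0b10000000) {
                throw std::runtime_error("Invalid UTF8 formatting: " + std::to_string(static_cast<int>(ch)) + " is not a valid follow-up byte");
            }
            res <<= 6;
            res |= ch & 0b00111111;
        }
    }
    return res;
}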
Reasons:
  • Blacklisted phrase (1): I want to know
  • RegEx Blacklisted phrase (1): I want
  • Long answer (-1):
  • Has code block (-0.5):
  • Low reputation (0.5):
Posted by: Sagan