I think I figured out why it's not working the way everyone expects it to.
Barmar's suggestion was to compare the stdout of each example, to see how the outputs differ. But I did something simpler: printf returns the number of characters (which, for a byte-oriented stream, means bytes) it writes out. wprintf does the same, except it counts wide characters instead of bytes.
So I decided to test both of them, and I also added a few more lines to get a better picture of the problem.
// Example 1
int n = printf("\U0001F625");
unsigned char *s = (unsigned char *) "\U0001F625";
printf("%d bytes %d %d %d %d\n", n, s[0], s[1], s[2], s[3]);
// This prints: 😥4 bytes 240 159 152 165
// Example 2
int n = wprintf(L"\U0001F625");
unsigned char *s = (unsigned char *) L"\U0001F625";
printf("%d bytes %d %d %d %d\n", n, s[0], s[1]);
// This prints: 2 bytes 61 216
// Note that the emoji doesn't appear. That's the output everyone is getting.
As a side note: I know I reused the variable names. I tested each example separately, commenting out the other one, so there were no name conflicts.
Okay. So, why did I do all of that?
First, it starts with how the UTF-8 encoding works at the binary level. You can read more about it on Wikipedia; the table in the Description section is a great resource for understanding how the encoding works at a low level.
From example 1, I got the output 😥4 bytes 240 159 152 165. I printed those extra values because I wanted to see the byte-level representation of \U0001F625, which is 128549 in decimal. The return value tells us that printf wrote a string of 4 bytes.
According to the table, a 4-byte sequence means the codepoint must be in the U+010000 to U+10FFFF range.
Converting everything to decimal, we can easily check that 65536 <= 128549 <= 1114111 holds. So yes, we really got a 4-byte UTF-8 character from that printf. Next, I wanted to check the order of those bytes. That is, should we assemble our byte string as s[0], s[1], s[2], s[3]? Or in the reverse order, s[3], s[2], s[1], s[0]?
I started with the 0-to-3 order.
To make things easier, I used Python to convert the s[n] sequence to binary strings:
'{:08b} {:08b} {:08b} {:08b}'.format(240, 159, 152, 165)
# '11110000 10011111 10011000 10100101'
In the UTF-8 table, we see that a 4-byte character must have the binary form:
11110uvv 10vvwwww 10xxxxyy 10yyzzzz
11110000 10011111 10011000 10100101
So, that matches. Now, by concatenating the bits in the u, v, w, x, y, z positions, we get: 000011111011000100101. In Python, executing int('000011111011000100101', 2), we get: 128549.
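To double-check this in C itself, here is a minimal sketch of the same decoding (my addition, assuming a valid 4-byte sequence; it drops into main with <stdio.h> included). It strips the UTF-8 marker bits with masks and reassembles the payload:

unsigned char b[] = {240, 159, 152, 165};
unsigned int cp = ((unsigned int)(b[0] & 0x07) << 18)  // 11110uvv: keep the low 3 bits
                | ((unsigned int)(b[1] & 0x3F) << 12)  // 10vvwwww: keep the low 6 bits
                | ((unsigned int)(b[2] & 0x3F) << 6)   // 10xxxxyy: keep the low 6 bits
                |  (unsigned int)(b[3] & 0x3F);        // 10yyzzzz: keep the low 6 bits
printf("U+%X = %u\n", cp, cp);                         // prints: U+1F625 = 128549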
So printf really is writing the UTF-8 encoding of codepoint 128549, i.e. \U0001F625, and I just showed that we can read each byte of that string in order, from s[0] to s[3]. At least on my PC with the gcc compiler.
Now, on to the second example: let's see what's happening there. We got the output This prints: 2 bytes 61 216. The binary representation of the bytes 61 and 216 is: 00111101 11011000.
What's the problem with this string?
First, if we try to read the pair as one number in print order, we get int('0011110111011000', 2) -> 15832, or 0x3dd8 (keep this value in mind, and also keep in mind that my machine is little-endian; we'll come back to that). But some loss is expected: our codepoint, 128549, needs 17 bits, i.e. at least 3 bytes, and here we have just 2 bytes. There's no way it can fit inside them.
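Just to rule out a naive truncation: if wprintf had simply chopped the codepoint down to its low 16 bits, we would see 0xF625, which matches neither reading of the two bytes we got. A quick sketch (my addition):

unsigned short chopped = (unsigned short) 0x1F625;  // conversion is modulo 2^16: keeps only the low 16 bits
printf("%X\n", chopped);                            // prints: F625, which is not what we observed

So whatever wprintf did, it was not a plain truncation.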
Second, the problem also shows up in the UTF-8 encoding itself. A 2-byte character must have the form:
110xxxyy 10yyzzzz
00111101 11011000
It doesn't match. So our output from wprintf is not UTF-8 encoded.
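You can run the same check mechanically in C. This is just an illustrative sketch (my addition): a 2-byte UTF-8 lead byte must match the pattern 110xxxxx (mask 0xE0, expected value 0xC0), and a continuation byte must match 10xxxxxx (mask 0xC0, expected value 0x80):

unsigned char b0 = 61, b1 = 216;                           // the two bytes we got back
int lead_ok = (b0 & 0xE0) == 0xC0;                         // is 00111101 a 110xxxyy lead byte? no
int cont_ok = (b1 & 0xC0) == 0x80;                         // is 11011000 a 10yyzzzz continuation? no
printf("lead: %d, continuation: %d\n", lead_ok, cont_ok);  // prints: lead: 0, continuation: 0

Both checks fail, and swapping the byte order doesn't help either (0xD8 would be a valid lead byte, but 0x3D is not a valid continuation byte), so this is definitely not UTF-8.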
So the only remaining explanation is that it must be UTF-16 encoded. Many resources, especially this one from Microsoft (relevant, since the question at hand seems to be about Windows), state that wchar_t is there to support UTF-16 encoding.
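In UTF-16, a codepoint above U+FFFF doesn't fit in one 16-bit unit, so it is split into a surrogate pair of two units. Here is a minimal sketch of that standard computation (my addition) applied to our codepoint:

unsigned int cp = 0x1F625;              // 128549
unsigned int v  = cp - 0x10000;         // the 20-bit value that gets split in half
unsigned int hi = 0xD800 + (v >> 10);   // high surrogate, from the top 10 bits: 0xD83D
unsigned int lo = 0xDC00 + (v & 0x3FF); // low surrogate, from the bottom 10 bits: 0xDE25
printf("%X %X\n", hi, lo);              // prints: D83D DE25

Notice the high surrogate: 0xD83D.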
At first I tried to look up what character the codepoint 0x3dd8 represents, but that was the wrong value to check: since my machine is little-endian, the bytes 61 216 (0x3d 0xd8) are the in-memory layout of the 16-bit value 0xd83d, which is exactly the high surrogate computed above. And a lone high surrogate is not a valid character at all; it only means something when paired with the low surrogate that follows it.
That's how deep I could go on this matter. By calling wprintf with L"\U0001F625", the codepoint gets stored as the UTF-16 surrogate pair 0xd83d 0xde25 rather than as one number, my test only looked at the first two bytes of it, and whatever wprintf then sends to the screen for that pair seems to be invisible.
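To confirm the surrogate-pair picture, a small follow-up test (my addition; it assumes a 16-bit wchar_t, as on Windows, and a little-endian machine) is to print all four bytes of the wide string instead of just two:

unsigned char *s = (unsigned char *) L"\U0001F625";
printf("%d %d %d %d\n", s[0], s[1], s[2], s[3]);
// expected: 61 216 37 222, i.e. 0xD83D 0xDE25 laid out byte by byte in little-endian order

The first two numbers are exactly the 61 216 we saw in example 2.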