Setup environment

call "C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\VC\Auxiliary\Build\vcvars64.bat"

source-charset and execution-charset

The source-charset is the encoding Visual Studio uses to interpret the source files into its internal representation. Specifically, for narrow string literals in the source files, the compiler uses UTF-8 (why not UTF-16?) encoded strings as the internal representation; these strings are then converted to the execution-charset and stored in the compiled object files.

To sum up, the compiler converts narrow string literals in source files from the source-charset to Unicode and then to the execution-charset, and finally stores the result in the compiled binaries. The source-charset must match the encoding in which the source files are stored on disk. The execution-charset is the encoding of const char[] data in memory when the program runs. The two settings are independent of each other. If a character in the source file cannot be represented in the execution character set, the Unicode conversion substitutes a question mark '?' character; see the /validate-charset option.
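The two-stage conversion can be illustrated with a toy model. This is only a sketch under loud assumptions: the function names are hypothetical, the source-charset is limited to the part of Windows-1252 where the byte value equals the code point (ASCII and 0xA0..0xFF), and plain ASCII stands in for the execution-charset. It is not how the compiler is implemented; it only shows where the '?' comes from.

```c
#include <stddef.h>
#include <stdint.h>

/* Stage 1: source-charset -> Unicode (hypothetical helper).
   Only ASCII and 0xA0..0xFF are modelled, where byte == code point. */
static uint32_t source_to_unicode(unsigned char byte)
{
    if (byte < 0x80 || byte >= 0xA0)
        return byte;
    return 0xFFFD; /* 0x80..0x9F needs a real mapping table; not modelled */
}

/* Stage 2: Unicode -> execution-charset (hypothetical helper).
   ASCII is the stand-in execution-charset, so anything else becomes '?'. */
static char unicode_to_execution(uint32_t cp)
{
    return cp < 0x80 ? (char)cp : '?';
}

/* Converts n source bytes into out (which must hold n + 1 bytes).
   Returns how many characters were replaced by '?'. */
size_t convert_literal(const unsigned char *src, size_t n, char *out)
{
    size_t substituted = 0;
    for (size_t i = 0; i < n; ++i) {
        uint32_t cp = source_to_unicode(src[i]);
        out[i] = unicode_to_execution(cp);
        if (cp >= 0x80)
            ++substituted;
    }
    out[n] = '\0';
    return substituted;
}
```

Feeding the bytes 'a', 0xFF, 'b' through convert_literal yields "a?b" with one substitution, which mirrors what happens to U+00FF when the execution-charset cannot represent it.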

By default, the execution-charset is the Windows code page, a.k.a. the ANSI code page (ACP), unless you specify a character set name or code page with the /execution-charset option. For the source-charset, if no /source-charset option is specified, Visual Studio looks for a byte-order mark (BOM) to determine whether a source file is in an encoded Unicode format, for example UTF-16 or UTF-8. If no BOM is found, it assumes the source file is encoded using the ACP.
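The default source-charset rule can be sketched as a small BOM check. The enum and function names are hypothetical, and the exact set of BOMs the compiler recognizes is an assumption here (only UTF-8 and UTF-16 are checked, since those are the formats named above).

```c
#include <stddef.h>

typedef enum {
    ENC_ASSUME_ACP, /* no BOM found: fall back to the ANSI code page */
    ENC_UTF8,
    ENC_UTF16_LE,
    ENC_UTF16_BE
} source_encoding;

/* Inspects the first bytes of a source file and guesses its encoding
   the way the paragraph above describes: BOM first, ACP otherwise. */
source_encoding detect_bom(const unsigned char *buf, size_t len)
{
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return ENC_UTF8;
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return ENC_UTF16_LE;
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return ENC_UTF16_BE;
    return ENC_ASSUME_ACP;
}
```

A Windows-1252 file has no BOM, so a check like this falls through to ENC_ASSUME_ACP, which is why such a file is misread unless /source-charset is given.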

The test source file test\execution_charset.c is encoded as Windows-1252, which cannot be auto-detected, and it contains bytes that are invalid in the ACP. Without /source-charset, the compiler attempts an ACP-to-Unicode conversion on the Windows-1252 strings and reports C4819 for the bytes that are invalid in the ACP.

cl /c test\execution_charset.c

warning C4819: The file contains a character that cannot be represented in the current code page (936). Save the file in
Unicode format to prevent data loss.

When the compiler is told the real encoding of the source file, the source-to-Unicode step succeeds and the Unicode-to-ACP conversion is finally performed. The compiler then reports C4566 for Unicode characters that the ACP cannot represent and substitutes a question mark '?' for each of them.

cl /c /source-charset:.1252 test\execution_charset.c

warning C4566: character represented by universal-character-name '\u00FF' cannot be represented in the current code page (936).
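One way to see what ended up in the object file is to dump the bytes of a narrow string literal at run time. The helper below is a hypothetical sketch, not part of the test sources; it formats the bytes of a string as hex so that a substituted character shows up as 3F.

```c
#include <stddef.h>
#include <stdio.h>

/* Writes the bytes of s into out as space-separated uppercase hex.
   Returns the number of characters written, or -1 if out is too small. */
int hex_dump(const char *s, char *out, size_t out_size)
{
    size_t used = 0;
    if (out_size == 0)
        return -1;
    out[0] = '\0';
    for (size_t i = 0; s[i] != '\0'; ++i) {
        int n = snprintf(out + used, out_size - used,
                         i ? " %02X" : "%02X",
                         (unsigned)(unsigned char)s[i]);
        if (n < 0 || (size_t)n >= out_size - used)
            return -1;
        used += (size_t)n;
    }
    return (int)used;
}
```

Calling hex_dump on a literal that contains a non-ASCII character and rebuilding with different /execution-charset values should show the stored bytes change; the exact bytes depend on the code page in use.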

Encoding of Windows Console

Windows Console (conhost.exe) is a Win32 GUI app that consists of:

InputBuffer: Stores keyboard and mouse event records generated by user input.
OutputBuffer: Stores the text rendered on the Console's window client area.

OutputBuffer was essentially a 2D array of CHAR_INFO structs, each holding one cell's character data and attributes. That means only UCS-2 text was supported. Since the Windows 10 October 2018 Update (Version 1809, Build Number 10.0.17763), a new OutputBuffer has been in place that fully supports all Unicode characters.
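The UCS-2 limitation follows from the cell layout: a CHAR_INFO holds a single 16-bit character, while characters outside the Basic Multilingual Plane need two UTF-16 code units. The function below is a minimal sketch (its name is hypothetical, and it assumes the input is a valid Unicode scalar value) that shows how many 16-bit units a code point occupies.

```c
#include <stddef.h>
#include <stdint.h>

/* Encodes one code point as UTF-16 and returns the number of 16-bit
   units used (1 or 2). Assumes cp is a valid Unicode scalar value. */
size_t encode_utf16(uint32_t cp, uint16_t units[2])
{
    if (cp < 0x10000) {
        units[0] = (uint16_t)cp; /* fits in one cell */
        return 1;
    }
    cp -= 0x10000;
    units[0] = (uint16_t)(0xD800 + (cp >> 10));   /* high surrogate */
    units[1] = (uint16_t)(0xDC00 + (cp & 0x3FF)); /* low surrogate */
    return 2;
}
```

U+00E9 fits in one unit, whereas U+1F600 becomes the pair D83D DE00, which a one-unit-per-cell buffer cannot store as a single character.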

Another issue is that Console uses GDI for text rendering, which doesn't support font-fallback, so some complex glyphs can't be displayed even if the OutputBuffer can store them. ConPTY was introduced together with the new OutputBuffer. With it, Console becomes a true "Console Host": it is windowless and not responsible for user input or rendering, and it serves all Command-Line apps and/or GUI apps that communicate with Command-Line apps through Console Virtual Terminal Sequences. A Terminal (TTY) is a typical example of such a GUI app, responsible for user input and rendering. On top of the ConPTY infrastructure, Windows Terminal uses a new rendering engine that supports font-fallback and displays all test characters correctly.

Command-Line apps use WriteConsoleW to write Unicode text to OutputBuffer and ReadConsoleW to read Unicode text from InputBuffer. WriteConsoleA/WriteFile can also be used for output, but that involves an encoding conversion from ConsoleOutputCP (defaults to OEMCP) to Unicode before the text is stored in OutputBuffer. Likewise, using ReadConsoleA/ReadFile for input performs a conversion from Unicode to ConsoleInputCP (also defaults to OEMCP). Note that ConsoleInputCP only supports DBCS, see ms-terminal/src/host/dbcs.cpp#TranslateUnicodeToOem.

The builtin command type of the "Command Prompt" shell (cmd.exe) checks the start of a file for a UTF-16LE BOM. If it finds one, it displays the file content using WriteConsoleW; otherwise it uses WriteConsoleA/WriteFile. So type displays files correctly only if they are UTF-16LE with a BOM or encoded in the current ConsoleOutputCP. In PowerShell, type detects the BOM for both UTF-16 and UTF-8. To verify this, run type words\word-*.txt in both Cmd and PowerShell.

UCRT and UTF-8

UCRT is the Windows equivalent of the GNU C Library (glibc); it has provided C99 and POSIX functionality plus some extensions since Visual Studio 2015. Some POSIX functions have historically used the ACP for narrow->wide conversions. To support UTF-8, a utf8 locale was implemented in ucrt/locale/get_qualified_locale.cpp as of UCRT 10.0.17134.0, and those functions were modified to use CP_UTF8 when the current locale is utf8 and the ACP otherwise, which preserves backwards compatibility. These POSIX functions call ucrt/inc/corecrt_internal_win32_buffer.h#__acrt_get_utf8_acp_compatibility_codepage to obtain the code page they should use for their conversions. An example is fopen: it converts the narrow path to a wide path using that code page and then delegates to the wide version of ucrt/lowio/open.cpp#_sopen_nolock. The narrow string representation of std::filesystem::path is encoded in the same code page.
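The selection rule described above can be modelled in a few lines. This is a hypothetical sketch, not the UCRT source: conversion_codepage and try_utf8_locale are illustrative names, and the only facts relied on are that CP_UTF8 has the value 65001 and that the UCRT accepts ".UTF8" as a locale name. On other C runtimes the setlocale call may simply fail.

```c
#include <locale.h>
#include <stddef.h>

#define CODEPAGE_UTF8 65001u /* value of CP_UTF8 in the Windows SDK */

/* Model of the rule: UTF-8 when the locale is utf8, the ACP otherwise. */
unsigned conversion_codepage(int locale_is_utf8, unsigned acp)
{
    return locale_is_utf8 ? CODEPAGE_UTF8 : acp;
}

/* Opts the program in to the utf8 locale. Returns 1 if the runtime
   accepted the locale name, 0 if it did not. */
int try_utf8_locale(void)
{
    return setlocale(LC_ALL, ".UTF8") != NULL;
}
```

With an ACP of 936, conversion_codepage(0, 936) stays at 936, while conversion_codepage(1, 936) yields 65001, so a narrow path passed to fopen would be interpreted as UTF-8 only after the locale switch.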

The I/O flow path in the UCRT is

C++ I/O -> C I/O -> POSIX I/O  -> Win32 File/Console I/O
filebuf -> FILE* -> read/write -> ReadFile/WriteFile/ReadConsoleW/WriteConsoleW

[w]cin/f[w]scanf/fget[w]s -> fget[w]c
[w]cout/f[w]printf/fput[w]s -> fput[w]c

fgetwc -> fgetc (*2, compose for _O_U16TEXT and _O_BINARY, mbtowc(DBCS) for _O_TEXT) -> fread -> read
fputwc -> (wctomb -> fputc, for _O_TEXT) -> fwrite -> write

How read behaves in each mode:

_O_BINARY or _O_TEXT: ReadFile
_O_U8TEXT: ReadFile -> UTF-8 -> UTF-16
File _O_U16TEXT: ReadFile
Console _O_U16TEXT: ReadConsoleW

How write behaves in each mode:

_O_BINARY: WriteFile
File _O_U8TEXT: UTF-16 -> UTF-8 -> WriteFile
File otherwise: WriteFile
Console Unicode: WriteConsoleW for each wchar, so only supports UCS-2
Console _O_TEXT with LC_CTYPE:
- C: WriteFile
- utf8: UTF-8 -> UTF-16 -> ~~ConsoleInputCP~~ ConsoleOutputCP -> WriteFile
- otherwise: DBCS (mbtowc) -> UTF-16 -> ~~ConsoleInputCP~~ ConsoleOutputCP -> WriteFile

Win32 Direct Console I/O and C Wide I/O are always available for Unicode Console I/O. Since UCRT 10.0.17763.0, print functions treat the text data as UTF-8 encoded if the locale is set to utf8. The changes are in ucrt/lowio/write.cpp#write_double_translated_ansi_nolock. ~~The translation to ConsoleInputCP is strange, I think it should be ConsoleOutputCP and~~ (this bug is fixed in UCRT 10.0.19041.0) the double translation is unnecessary. UCRT should be reworked to call WriteConsoleW after translating to UTF-16, so that no code page is involved: ANSI (including UTF-8) -> UTF-16 -> WriteConsoleW.
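The proposed path (narrow text -> UTF-16 -> WriteConsoleW) can be sketched in user code. This is a minimal illustration, not a patch for the UCRT: write_utf8_to_console is a hypothetical helper, the input is assumed to be UTF-8, and on non-Windows builds or when stdout is redirected it simply falls back to writing the bytes unchanged.

```c
#ifdef _WIN32
#include <windows.h>
#endif
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Writes UTF-8 text to the console without going through a console
   code page. Returns 0 on success, -1 on failure. */
int write_utf8_to_console(const char *utf8)
{
    size_t n = strlen(utf8);
#ifdef _WIN32
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD mode;
    /* GetConsoleMode succeeds only when the handle is a real console. */
    if (n > 0 && GetConsoleMode(out, &mode)) {
        int wlen = MultiByteToWideChar(CP_UTF8, 0, utf8, (int)n, NULL, 0);
        if (wlen <= 0)
            return -1;
        wchar_t *wide = malloc((size_t)wlen * sizeof *wide);
        if (wide == NULL)
            return -1;
        MultiByteToWideChar(CP_UTF8, 0, utf8, (int)n, wide, wlen);
        DWORD written = 0;
        BOOL ok = WriteConsoleW(out, wide, (DWORD)wlen, &written, NULL);
        free(wide);
        return ok ? 0 : -1;
    }
#endif
    /* Redirected output or another platform: pass the bytes through. */
    return fwrite(utf8, 1, n, stdout) == n ? 0 : -1;
}
```

Because the text reaches the console as UTF-16, neither ConsoleOutputCP nor ConsoleInputCP takes part in the conversion.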

ReadConsoleA/ReadFile get ANSI characters from ConsoleInputCP, but SetConsoleCP(CP_UTF8) doesn
