L"abc"s.length() == L"abcd"s.length() => true / String encodings in C++

I was looking into how C++ encodes strings.

Since C++14, adding the s suffix to a string literal gives you an instance of std::string (or one of its siblings) directly.

string str_literal_to_str = "str"s;
wstring wide_literal_to_wstr = L"str"s;
u16string u16_literal_to_u16str = u"str"s;
u32string u32_literal_to_u32str = U"str"s;
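
For reference, here is a minimal self-contained sketch of the same thing. The s suffix is declared in std::literals::string_literals, so it has to be brought into scope first. The four helper functions (narrow_units and friends) are names made up for this illustration.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>

// operator""s is declared in the inline namespace std::literals::string_literals.
using namespace std::string_literals;

// The literal prefix picks the character type, and the s suffix wraps the
// literal in the matching std::basic_string specialization.
static_assert(std::is_same<decltype("str"s), std::string>::value, "narrow");
static_assert(std::is_same<decltype(L"str"s), std::wstring>::value, "wide");
static_assert(std::is_same<decltype(u"str"s), std::u16string>::value, "UTF-16");
static_assert(std::is_same<decltype(U"str"s), std::u32string>::value, "UTF-32");

// Hypothetical helpers: number of code units in "str" for each string type.
// For plain ASCII text under default compiler settings this is 3 for all of them.
inline std::size_t narrow_units() { return "str"s.size(); }
inline std::size_t wide_units()   { return L"str"s.size(); }
inline std::size_t u16_units()    { return u"str"s.size(); }
inline std::size_t u32_units()    { return U"str"s.size(); }
```

The static_assert lines only check the types; what each string actually contains is a separate question, which is what the rest of this post is about.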

u8string is apparently only available from C++20.
en.cppreference.com

The page for C (https://en.cppreference.com/w/c/language/string_literal) also mentions encodings:

16-bit wide string literal: The type of the literal is char16_t[N], where N is the size of the string in code units of implementation-defined 16-bit encoding **(typically UTF-16)**, including the null terminator. 

The encoding of narrow multibyte string literals (1) and wide string literals (2) is implementation-defined. For example, gcc selects them with the commandline options -fexec-charset and -fwide-exec-charset.

So u16string and u32string are apparently implementation-defined, and wstring depends on the compiler options. To check, I wrote this code: https://gist.github.com/Isa-rentacs/d975e723478121d9eeab58cd4957b426
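
The gist is not reproduced here, but the idea of the check can be sketched roughly as follows. hex_bytes is a hypothetical helper written for this illustration, not the gist's own code. It returns the in-memory bytes of a string's code units as hex, so the result depends on the byte order of the machine it runs on.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: render the raw bytes of a string's code units as
// space-separated two-digit hex values, in memory order.
template <class CharT>
std::string hex_bytes(const std::basic_string<CharT>& s) {
    const unsigned char* p = reinterpret_cast<const unsigned char*>(s.data());
    const std::size_t n = s.size() * sizeof(CharT);  // size() counts code units, not bytes
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < n; ++i) {
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(p[i]));
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}
```

Calling it on a std::string holding the four bytes f0 a9 b8 bd gives "f0 a9 b8 bd"; for the wide, UTF-16, and UTF-32 variants the byte order in the result follows the host.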

The output:

=== literal_to_str ===
Size: 4
f2ec7160, f0
f2ec7161, a9
f2ec7162, b8
f2ec7163, bd
=== wide_literal_to_wstr ===
Size: 1
f2ec7180, 3d
f2ec7181, 9e
f2ec7182, 02
f2ec7183, 00
=== u16literal_to_u16string ===
Size: 2
f2ec71c0, 67
f2ec71c1, d8
f2ec71c2, 3d
f2ec71c3, de
=== u32literal_to_u32string ===
Size: 1
f2ec71e0, 3d
f2ec71e1, 9e
f2ec71e2, 02
f2ec71e3, 00

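So wstring, u16string, and u32string look like UTF-32LE, UTF-16LE, and UTF-32LE respectively: the wstring dump has Size 1 and is byte-for-byte identical to the u32string one. (Note that UTF-16 without a BOM is supposed to be interpreted as BE.)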

Where a text data stream is known to be plain Unicode text (but not which endian), then BOM can be used as a signature. If there is no BOM, the text should be interpreted as big-endian. http://unicode.org/faq/utf_bom.html#bom5
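
To double-check the reading of the dumps above, the bytes can be reassembled by hand. The sketch below assumes little-endian byte order, takes the four u16string bytes (67 d8 3d de) as a surrogate pair and the four u32string bytes (3d 9e 02 00) as a single code unit, and confirms both give the same code point. The function names are made up for this illustration, and from_surrogates does no validation: it assumes a well-formed high/low pair.

```cpp
#include <cstdint>

// Combine two bytes stored in little-endian order into one 16-bit code unit.
constexpr std::uint16_t le16(std::uint8_t b0, std::uint8_t b1) {
    return static_cast<std::uint16_t>(b0 | (b1 << 8));
}

// Combine four bytes stored in little-endian order into one 32-bit code unit.
constexpr std::uint32_t le32(std::uint8_t b0, std::uint8_t b1,
                             std::uint8_t b2, std::uint8_t b3) {
    return static_cast<std::uint32_t>(b0) |
           (static_cast<std::uint32_t>(b1) << 8) |
           (static_cast<std::uint32_t>(b2) << 16) |
           (static_cast<std::uint32_t>(b3) << 24);
}

// Decode a UTF-16 surrogate pair (high in D800..DBFF, low in DC00..DFFF)
// into the code point it represents.
constexpr std::uint32_t from_surrogates(std::uint16_t high, std::uint16_t low) {
    return 0x10000u + ((static_cast<std::uint32_t>(high) - 0xD800u) << 10) +
           (static_cast<std::uint32_t>(low) - 0xDC00u);
}

// Bytes taken from the u16string and u32string dumps in this post.
static_assert(le16(0x67, 0xd8) == 0xD867, "high surrogate");
static_assert(le16(0x3d, 0xde) == 0xDE3D, "low surrogate");
static_assert(from_surrogates(0xD867, 0xDE3D) == le32(0x3d, 0x9e, 0x02, 0x00),
              "UTF-16 and UTF-32 dumps encode the same code point");
```

Both come out as U+29E3D, which is consistent with the Size of 2 for u16string (one surrogate pair) and 1 for u32string.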

Here is where the trouble starts. I set the compile option -fwide-exec-charset=utf-16 and then added

wstring wide_literal_to_wstr2= L"abcd"s;

cout << "=== wide_literal_to_wstr2 ===" << endl;
cout << "Size: " << wide_literal_to_wstr2.size() << endl;
cout << "length same?:" << (L"abc"s.length() == L"abcd"s.length() ? "Yes" : "No") << endl;
print_bytes((unsigned char*)wide_literal_to_wstr2.data(),(unsigned char*)&*(wide_literal_to_wstr2.begin() + wide_literal_to_wstr2.size()));

Running something like this gives

=== wide_literal_to_wstr2 ===
Size: 2
length same?:Yes
6e305bc0, ff
6e305bc1, fe
6e305bc2, 61
6e305bc3, 00
6e305bc4, 62
6e305bc5, 00
6e305bc6, 63
6e305bc7, 00

You can see that a BOM has been prepended and the text is encoded as LE. Concatenating two wstrings with the + operator does not get rid of this BOM.
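
The concatenation behavior is not surprising once you remember that operator+ only joins code units and knows nothing about BOMs. The sketch below uses std::u16string with an explicit U+FEFF at the front as a stand-in for the BOM-carrying literals above (an assumption made so that it behaves the same with any compiler flags). count_bom and concat_dropping_inner_bom are hypothetical helper names.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// U+FEFF acts as the byte order mark when it is the first code unit.
constexpr char16_t kBom = u'\uFEFF';

// Count how many U+FEFF code units a string contains.
inline std::size_t count_bom(const std::u16string& s) {
    return static_cast<std::size_t>(std::count(s.begin(), s.end(), kBom));
}

// Concatenate two strings, dropping a leading BOM from the second operand
// so that it does not end up in the middle of the result.
inline std::u16string concat_dropping_inner_bom(const std::u16string& a,
                                                const std::u16string& b) {
    const bool b_has_bom = !b.empty() && b.front() == kBom;
    return a + (b_has_bom ? b.substr(1) : b);
}
```

With a = u"\uFEFFab" and b = u"\uFEFFcd", a + b has 6 code units and contains two BOMs, while concat_dropping_inner_bom(a, b) has 5 code units and keeps only the leading one.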

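On top of that, L"abc"s.length() == L"abcd"s.length() ? "Yes" : "No" comes out as Yes. This is probably because wchar_t is 4 bytes in this environment while the specified encoding, UTF-16, works in 2-byte units: the dump above shows two UTF-16 code units packed into each wchar_t, with Size reported as 2. In this case it looks like you have to pass -fshort-wchar at compile time to make wchar_t 2 bytes.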

https://gcc.gnu.org/onlinedocs/gcc/Code-Gen-Options.html#index-fshort-wchar
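
The arithmetic behind the matching lengths can be sketched like this. It is only a model inferred from the dump above (a 2-byte BOM, 2 bytes per ASCII character, and a length equal to the byte count divided by sizeof(wchar_t), rounded down), not something taken from the gcc documentation. Both function names are made up for this illustration.

```cpp
#include <cstddef>

// Bytes needed for an ASCII string of n characters in UTF-16 with a BOM:
// one 2-byte BOM plus one 2-byte code unit per character.
constexpr std::size_t utf16_bytes_with_bom(std::size_t n_ascii_chars) {
    return 2 * (1 + n_ascii_chars);
}

// Assumed model: the reported length is the byte count divided by the
// size of wchar_t, rounded down.
constexpr std::size_t apparent_length(std::size_t n_ascii_chars,
                                      std::size_t wchar_bytes) {
    return utf16_bytes_with_bom(n_ascii_chars) / wchar_bytes;
}

// 4-byte wchar_t: "abc" is 8 bytes and "abcd" is 10 bytes, both give 2.
static_assert(apparent_length(3, 4) == 2, "abc with 4-byte wchar_t");
static_assert(apparent_length(4, 4) == 2, "abcd with 4-byte wchar_t");
```

Under the same model, a 2-byte wchar_t would give 4 and 5, so the two lengths would differ again, although the BOM would still be counted as a character.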

Summary

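gcc's default encodings: UTF-8 for string, UTF-16LE for u16string, and UTF-32LE for wstring/u32string (in this environment, where wchar_t is 4 bytes)
The encodings of string/wstring can be changed with compiler options (-fexec-charset / -fwide-exec-charset)
- Specifying UTF-16/UTF-32 puts a BOM in the string. It is not removed even when strings are concatenated.
If the size of char/wchar_t does not match the unit of the encoding, things like character counts behave strangely.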
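
One possible way to guard against the last point is a compile-time check on the length of a known wide literal. This is only a sketch: literal_units is a hypothetical helper, and whether the static_assert actually fires under a given mismatched combination of -fwide-exec-charset and wchar_t size is an assumption that needs checking with the real compiler flags.

```cpp
#include <cstddef>

// Number of wchar_t elements the compiler allotted to a wide literal,
// excluding the terminating null.
template <std::size_t N>
constexpr std::size_t literal_units(const wchar_t (&)[N]) {
    return N - 1;
}

// When wchar_t and the encoding unit match (and no BOM is added),
// four ASCII characters should take exactly four elements.
static_assert(literal_units(L"abcd") == 4,
              "unexpected wide literal length; check -fwide-exec-charset and -fshort-wchar");
```

With default settings this compiles silently, and the check costs nothing at run time.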

Isa@Diary

