Zsh History File Encoding Issue with UTF-8 Characters
Issue Description
When opening the .zsh_history file in vim, some UTF-8 encoded characters (e.g. Japanese or Chinese characters) are not displayed correctly, and vim reports the fileencoding as latin1 instead of utf-8.
- For example, Japanese and Chinese characters are displayed incorrectly in vim:

      2466 : 1762759121:0;locale
      2467 : 1762759501:0;echo "ä½<83><80>好"
      2468 : 1762759519:0;echo "ã<81><83>³ã<82><83>³ã<81>«ã<81><83><81>ã<81>¯"

- The actual history entries (listed with fc) are as follows:

      → fc -l -3
       2410  locale
       2411  echo "你好"
       2412  echo "こんにちは"

- If you check the fileencoding in vim:

      :set fileencoding?

  it shows:

      fileencoding=latin1
Workaround Attempts
Initially, I thought it was due to incorrect locale settings in my terminal, zsh, or vim. However, after verifying that the terminal emulator, .zshrc, .zshenv, and .vimrc were all set to UTF-8 correctly, the issue persisted.
- .vimrc settings:

      " ---------- UTF-8 config ----------
      set encoding=utf-8
      set termencoding=utf-8
      set fileencoding=utf-8
      set fileencodings=utf-8,ucs-bom,latin1
      set fileformats=unix,dos,mac

- .zshenv/.zshrc settings:

      export LANG=en_US.UTF-8
      export LC_CTYPE=en_US.UTF-8
      setopt MULTIBYTE
      export HISTFILE="$HOME/.zsh_history"
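With `fileencodings=utf-8,ucs-bom,latin1`, vim tries each listed encoding in order and settles on the first one that reads the file without a conversion error. The sketch below is only a rough model of that fallback, not vim's actual implementation: the helper name `guess_encoding` is made up, the `ucs-bom` step is left out, and the "broken" sample simply replaces the byte 0x93 with 0x83 0xB3, as seen in the vim output above.

```python
def guess_encoding(data: bytes, candidates=("utf-8", "latin-1")):
    """Return the first candidate encoding that decodes data without error."""
    for name in candidates:
        try:
            data.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    return None


clean = "echo こんにちは".encode("utf-8")
# Assumption for illustration: mimic the history file by turning 0x93 into 0x83 0xB3.
broken = clean.replace(b"\x93", b"\x83\xb3")

print(guess_encoding(clean))   # utf-8
print(guess_encoding(broken))  # latin-1
```

Because latin-1 maps every byte to some character, it never fails, which would explain why vim ends up reporting `fileencoding=latin1` even though every setting asks for UTF-8.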
Next, I suspected that the zsh plugins I had installed might be causing the problem, so I disabled all plugins and themes, but the issue remained.
I tried converting the file from latin1 to utf-8 using iconv, but as soon as new entries were added to .zsh_history, the same issue occurred again.
- Checking the file encoding with iconv:

      → iconv -f utf-8 -t utf-8 $HISTFILE > /dev/null && echo "OK"
      iconv: iconv(): Illegal byte sequence
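The same check can be sketched in Python, which also reports where decoding first fails. This is a minimal illustration: the helper name `first_invalid_utf8` is hypothetical, and the "bad" sample is the UTF-8 encoding of 你好 with the byte 0xA0 replaced by 0x83 0x80, matching the `<83><80>` that vim displayed above.

```python
def first_invalid_utf8(data: bytes):
    """Return the offset of the first byte that breaks UTF-8 decoding, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


good = "你好".encode("utf-8")                           # E4 BD A0 E5 A5 BD
bad = bytes([0xE4, 0xBD, 0x83, 0x80, 0xE5, 0xA5, 0xBD])  # 0xA0 became 0x83 0x80

print(first_invalid_utf8(good))  # None
print(first_invalid_utf8(bad))   # offset of the stray 0x80
```

To check a real history file, the function could be called on the bytes read from the file opened in binary mode.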
Finally, I found the answer on Stack Overflow: Zsh history file not in UTF-8 encoding even though locale is set to UTF-8, which points to an archive of a zsh mailing list post. The question quoted in that post reads:
When I try to use UTF-8 file name in shell command, ZHS history file seems to save it with “meta code”.
For example, executing

    $ ls \346\226\207\345\255\227 (octal expression of “ls 文字”)

results in histfile

    $ ls \346\203\266\203\247\2\345\255\203\267

That is, when 0x80-0x9F characters are used, then always 0x83 Meta character is inserted and following character is bit shifted, resulting garbage history.
Any way to avoid this situation?
Any help is really appreciated,
The reply on the list:
This isn’t a bug, the history file is saved in metafied format. If you want to print it outside zsh you can use this simple program.
#define Meta ((char) 0x83)
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>

/* from zsh utils.c */
char *unmetafy(char *s, int *len)
{
    char *p, *t;

    for (p = s; *p && *p != Meta; p++);
    for (t = p; (*t = *p++);)
        if (*t++ == Meta)
            t[-1] = *p++ ^ 32;
    if (len)
        *len = t - s;
    return s;
}

int main(int argc, char *argv[])
{
    char *line = NULL;
    size_t size;

    while (getline(&line, &size, stdin) != -1) {
        unmetafy(line, NULL);
        printf("%s", line);
    }

    if (line) free(line);
    return EXIT_SUCCESS;
}
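For readers who would rather not compile C, the same decoding can be sketched in Python. This is an illustrative port, not part of zsh: the name `unmetafy` is borrowed from the C code above, the handling of a dangling Meta byte at the end of input is my own choice, and the sample line is "ls " followed by the metafied bytes of 文字.

```python
META = 0x83  # zsh's Meta byte


def unmetafy(data: bytes) -> bytes:
    """Undo zsh metafication: drop each Meta byte and XOR the next byte with 0x20."""
    out = bytearray()
    it = iter(data)
    for b in it:
        if b == META:
            nxt = next(it, None)
            if nxt is None:
                break  # dangling Meta at end of input: nothing left to decode
            out.append(nxt ^ 0x20)
        else:
            out.append(b)
    return bytes(out)


# "ls " followed by the metafied bytes of 文字
line = b"ls " + bytes([0xE6, 0x83, 0xB6, 0x83, 0xA7, 0xE5, 0xAD, 0x83, 0xB7])
print(unmetafy(line).decode("utf-8"))  # ls 文字
```

To read a real history file this way, it would need to be opened in binary mode so that Python does not try to decode the metafied bytes first.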
Also see zsh.h in the zsh source: https://github.com/zsh-users/zsh/blob/master/Src/zsh.h
It contains the following comment about the Meta character. The key sentence is:
These are the characters for which the imeta() test is true: the null character, and the characters from Meta to Marker.
/* Meta together with the character following Meta denotes the character *
* which is the exclusive or of 32 and the character following Meta. *
* This is used to represent characters which otherwise has special *
* meaning for zsh. These are the characters for which the imeta() test *
* is true: the null character, and the characters from Meta to Marker. */
#define Meta ((char) 0x83)
/*
* Character tokens.
* These should match the characters in ztokens, defined in lex.c
*/
#define Pound ((char) 0x84)
#define String ((char) 0x85)
#define Hat ((char) 0x86)
#define Star ((char) 0x87)
#define Inpar ((char) 0x88)
#define Inparmath ((char) 0x89)
#define Outpar ((char) 0x8a)
#define Outparmath ((char) 0x8b)
#define Qstring ((char) 0x8c)
#define Equals ((char) 0x8d)
#define Bar ((char) 0x8e)
#define Inbrace ((char) 0x8f)
#define Outbrace ((char) 0x90)
#define Inbrack ((char) 0x91)
#define Outbrack ((char) 0x92)
#define Tick ((char) 0x93)
#define Inang ((char) 0x94)
#define Outang ((char) 0x95)
#define OutangProc ((char) 0x96)
#define Quest ((char) 0x97)
#define Tilde ((char) 0x98)
#define Qtick ((char) 0x99)
#define Comma ((char) 0x9a)
#define Dash ((char) 0x9b) /* Only in patterns */
#define Bang ((char) 0x9c) /* Only in patterns */
/*
* Marks the last of the group above.
* Remaining tokens are even more special.
*/
#define LAST_NORMAL_TOK Bang
/*
* Null arguments: placeholders for single and double quotes
* and backslashes.
*/
#define Snull ((char) 0x9d)
#define Dnull ((char) 0x9e)
#define Bnull ((char) 0x9f)
/*
* Backslash which will be returned to "\" instead of being stripped
* when we turn the string into a printable format.
*/
#define Bnullkeep ((char) 0xa0)
/*
* Null argument that does not correspond to any character.
* This should be last as it does not appear in ztokens and
* is used to initialise the IMETA type in inittyptab().
*/
#define Nularg ((char) 0xa1)
/*
* Take care to update the use of IMETA appropriately when adding
* tokens here.
*/
/*
* Marker is used in the following special circumstances:
* - In paramsubst for rc_expand_param.
* - In pattern character arrays as guaranteed not to mark a character in
* a string.
* - In assignments with the ASSPM_KEY_VALUE flag set in order to
* mark that there is a key / value pair following. If this
* comes from [key]=value the Marker is followed by a null;
* if from [key]+=value the Marker is followed by a '+' then a null.
* All the above are local uses --- any case where the Marker has
* escaped beyond the context in question is an error.
*/
#define Marker ((char) 0xa2)
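Read literally, the comment says that the bytes needing the Meta escape are the null byte plus everything from Meta (0x83) to Marker (0xA2). A small Python predicate makes that range concrete. It is only an approximation of zsh's `imeta()` based on the comment above, not a copy of zsh's type table.

```python
META = 0x83    # first byte of the special range (zsh.h: Meta)
MARKER = 0xA2  # last byte of the special range (zsh.h: Marker)


def imeta(byte: int) -> bool:
    """Approximate zsh's imeta() test as described in the zsh.h comment."""
    return byte == 0x00 or META <= byte <= MARKER


special = [b for b in range(256) if imeta(b)]
print(len(special))                               # 33
print(f"{special[1]:#04x}..{special[-1]:#04x}")  # 0x83..0xa2
```

This agrees with the vim output at the top of the post, where 0x81 and 0x82 appear unescaped while the bytes that came from 0x93 and 0xA1 are preceded by 0x83.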
Explanation
Zsh uses the byte 0x83 (called the Meta character) to indicate that the following byte has been metafied. According to the zsh.h comment, the bytes from Meta (0x83) to Marker (0xA2), and 0x00, are encoded with this Meta marker rather than stored directly. This matches the vim output above, where 0x81 and 0x82 appear unescaped.
- 0x00: ASCII null character (stored using Meta encoding)
- 0x01–0x7F: ASCII characters (stored as-is)
- 0x80–0x82: Non-ASCII bytes (stored as-is)
- 0x83–0xA2: Non-ASCII bytes from Meta to Marker (stored using Meta encoding)
- 0xA3–0xFF: Non-ASCII bytes (stored as-is)
For example, a byte with value 0x96 is stored as two bytes: 0x83 0xB6.
bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0
          ↑
1    0    0    1    0    1    1    0    -> 0x96
0    0    1    0    0    0    0    0    -> 0x20 (XOR mask)
1    0    1    1    0    1    1    0    -> 0xB6
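The bit diagram can be reproduced with a few lines of Python. The sketch also shows that applying the same XOR mask a second time restores the original byte, which is why decoding needs no separate table. The constant names are mine.

```python
META = 0x83
XOR_MASK = 0x20  # flips bit 5

original = 0x96
stored = original ^ XOR_MASK   # second byte written after Meta
restored = stored ^ XOR_MASK   # XOR with the same mask undoes the flip

print(f"{original:08b} -> {original:#04x}")
print(f"{XOR_MASK:08b} -> {XOR_MASK:#04x} (XOR mask)")
print(f"{stored:08b} -> {stored:#04x}")
print(f"stored as: {META:#04x} {stored:#04x}")
```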
The example of “文字” in the mail will be stored as:
| Stage | Text | Bytes (Hex) | Octal |
|---|---|---|---|
| UTF-8 | “文字” | E6 96 87 E5 AD 97 | \346\226\207\345\255\227 |
| Zsh Metafied | - | E6 83 B6 83 A7 E5 AD 83 B7 | \346\203\266\203\247\345\255\203\267 |
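Both rows of the table can be checked with a short Python sketch. It assumes the escape range given in the zsh.h comment (0x00 and 0x83 to 0xA2); the helper names `metafy`, `as_hex`, and `as_octal` are only for illustration.

```python
def metafy(data: bytes) -> bytes:
    """Simulate zsh metafication: escape 0x00 and 0x83-0xA2 as Meta + (byte XOR 0x20)."""
    out = bytearray()
    for b in data:
        if b == 0x00 or 0x83 <= b <= 0xA2:
            out += bytes([0x83, b ^ 0x20])
        else:
            out.append(b)
    return bytes(out)


def as_hex(data: bytes) -> str:
    return " ".join(f"{b:02X}" for b in data)


def as_octal(data: bytes) -> str:
    return "".join(f"\\{b:03o}" for b in data)


utf8 = "文字".encode("utf-8")
meta = metafy(utf8)
print(as_hex(utf8), as_octal(utf8))
print(as_hex(meta), as_octal(meta))
```

Note that 0xAD is left alone because it lies above 0xA2, so only three of the six bytes are escaped.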
Metafy Function for Demonstration
# Written in Python for demonstration
# Metafy function to simulate zsh metafication
def metafy(byte):
    # zsh.h: imeta() is true for the null byte and for Meta (0x83) through Marker (0xA2)
    if byte == 0x00 or 0x83 <= byte <= 0xA2:
        # Meta character + flipped byte (XOR by 0x20)
        print(f"Output byte: {0x83:#04x} {byte ^ 0x20:#04x}")
        return [0x83, byte ^ 0x20]
    else:
        # No change for other bytes
        print(f"Output byte: {byte:#04x}")
        return [byte]
Why 0x80-0x9F (and 0x00 0xA0 0xA1 0xA2)?
My understanding is that the mail specifically mentions the 0x80-0x9F range because these bytes correspond to the C1 control set in character encoding standards. The range that zsh.h actually describes is 0x00 plus 0x83–0xA2 (Meta to Marker), which covers most of C1 but is defined by the internal token values of zsh rather than by the C1 set itself.
The C1 control set (or C1 control characters) is a group of non-printable control codes defined in the ISO 6429 / ECMA-48 standard. They occupy the byte range 0x80–0x9F in 8-bit character sets such as ISO-8859-1 (Latin-1). These bytes are not printable characters; instead, they tell a terminal or device to perform operations (e.g. move the cursor, start a new line, or change text color).
| Control set | Byte range | Era | Example codes | Purpose |
|---|---|---|---|---|
| C0 | 0x00–0x1F | 7-bit ASCII | NUL (0x00), LF (0x0A), ESC (0x1B) | Basic text control |
| C1 | 0x80–0x9F | 8-bit extension (ISO 6429) | NEL (0x85), CSI (0x9B), SS2 (0x8E) | Extended terminal control |
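To see how closely the C1 range matches the bytes zsh escapes, the two sets can be compared directly. This sketch takes the C1 range from the table above and the escaped range from the zsh.h comment (null, and Meta 0x83 through Marker 0xA2); the names `C1`, `IMETA`, and `fmt` are mine.

```python
C1 = set(range(0x80, 0xA0))              # C1 control bytes 0x80-0x9F
IMETA = {0x00} | set(range(0x83, 0xA3))  # per the zsh.h comment: null, Meta..Marker


def fmt(byte_set):
    return " ".join(f"{b:#04x}" for b in sorted(byte_set))


print("C1 but not metafied:", fmt(C1 - IMETA))  # 0x80 0x81 0x82
print("metafied but not C1:", fmt(IMETA - C1))  # 0x00 0xa0 0xa1 0xa2
print("in both:", len(C1 & IMETA))              # 29
```

The overlap is large but not exact, which suggests the C1 set is a useful way to remember the range rather than its definition.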
References
- Stack Overflow: Zsh history file not in UTF-8 encoding even though locale is set to UTF-8 https://stackoverflow.com/a/75090605
- Zsh Mailing List Archive “Re: Fw: ZSH history file VS. UTF-8 data” https://www.zsh.org/mla/users/2011/msg00154.html