How to Fix Garbled Characters in Pandas DataFrame to CSV with UTF-8 Encoding
Even with encoding="utf-8" passed to DataFrame.to_csv(), international characters can appear garbled. pandas delegates file handling to get_handle in pandas/io/common.py, which cannot pass the encoding to open() in binary mode, and a correctly written file still looks corrupted when it is later read with a different encoding.
When exporting dataframes containing international text using pandas.DataFrame.to_csv(), developers expect UTF-8 encoding to preserve all characters correctly. However, the pandas source code reveals that the actual Unicode handling occurs deep in the IO stack, where certain parameter combinations can silently bypass your encoding specification, resulting in mojibake or replacement characters.
How Pandas Handles Encoding in to_csv
DataFrame.to_csv does not directly write to files. Instead, it delegates CSV creation to pandas.io.formats.csvs.CSVFormatter, which manages column conversion and row formatting. The critical encoding logic resides in the save method, where CSVFormatter calls pandas.io.common.get_handle to open the target file:
with get_handle(
    self.filepath_or_buffer,
    self.mode,
    encoding=self.encoding,
    errors=self.errors,
    compression=self.compression,
    storage_options=self.storage_options,
) as handles:
    # Writing occurs here
    ...
Source: CSVFormatter.__init__ stores the encoding parameter, and CSVFormatter.save passes it to get_handle.
Because CSVFormatter only converts data to strings via _get_values_for_csv before passing rows to Python's standard csv.writer, it never modifies the characters themselves. All encoding enforcement happens inside get_handle when opening the file stream. An unrecognized encoding name raises LookupError rather than writing bad bytes, so garbled output usually traces back to how the handle was opened or to how the finished file is decoded afterwards.
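A minimal sketch of that division of labor, using only the standard library rather than pandas internals: csv.writer emits str rows, and the text wrapper around the byte stream does the encoding. The buffer and variable names are illustrative assumptions, not pandas code.

```python
import csv
import io

# Illustrative stand-in for the file that get_handle would open.
raw = io.BytesIO()

# The text layer owns the encoding; csv.writer only ever sees str objects.
text = io.TextIOWrapper(raw, encoding="utf-8", errors="strict", newline="")
writer = csv.writer(text)
writer.writerow(["city", "value"])
writer.writerow(["München", 1])
text.flush()  # push buffered characters through the encoder

written = raw.getvalue()
print(written)
```

Changing only the encoding argument of the wrapper changes the bytes in the buffer, while the csv.writer calls stay identical.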
Common Causes of UTF-8 Encoding Errors
Binary Mode Discards Encoding Information
When you pass mode='wb' (write binary) to to_csv, get_handle opens the underlying file as a binary stream, and open() accepts no encoding argument in binary mode. Any encoding then has to be applied by a text layer placed on top of that stream, and how that is handled depends on the pandas version. If the bytes on disk are not the UTF-8 you expected, or a reader decodes them with its default code page (such as Windows-1252 or GBK) rather than UTF-8, characters like é, ü, or 中文 come out garbled.
Fix: Use text mode (mode='w') or allow pandas to choose the default mode automatically. When using compression, maintain encoding="utf-8" and ensure your reading application explicitly uses UTF-8.
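One way to confirm the fix is to inspect the bytes on disk instead of trusting how a viewer renders them. The sketch below assumes pandas is installed and writes to a temporary directory; the file name is hypothetical.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"city": ["München", "北京"], "value": [1, 2]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "check_utf8.csv")  # hypothetical file name
    df.to_csv(path, index=False, encoding="utf-8")  # default text mode

    # Read the raw bytes back, bypassing every text layer.
    with open(path, "rb") as fh:
        raw = fh.read()

# A strict decode raises UnicodeDecodeError if the bytes are not valid UTF-8.
decoded = raw.decode("utf-8", errors="strict")
has_expected_bytes = "München".encode("utf-8") in raw
print(has_expected_bytes)
```

If the strict decode succeeds and the expected byte sequence is present, the file is fine and any remaining garbling comes from the reader.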
Incorrect Encoding Spelling
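pandas hands the encoding string to Python's codec machinery, which normalizes case and treats hyphens and underscores alike, so "utf-8", "UTF-8", "utf8", and "utf_8" all select the same codec. A genuine typo (for example "utf-88") raises LookupError instead of falling back to the platform default. Other tools that consume your scripts or files may be stricter about spelling, which makes the canonical lower-case "utf-8" the safest choice.
Fix: Prefer the canonical spelling "utf-8" (or the alias "utf_8"), and treat a LookupError as a sign of a misspelled encoding name.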
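The spellings Python accepts can be checked directly against its codec registry. This sketch uses only the standard codecs module; the misspelled name "utf-88" is a made-up example.

```python
import codecs

# Look up several common spellings and record the codec each resolves to.
names = ["utf-8", "UTF-8", "utf8", "UTF8", "utf_8"]
resolved = {name: codecs.lookup(name).name for name in names}
print(resolved)

# A name that is not a registered alias fails loudly with LookupError.
try:
    codecs.lookup("utf-88")
    unknown_ok = True
except LookupError:
    unknown_ok = False
print(unknown_ok)
```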
The errors Parameter Silently Corrupts Data
Setting errors="ignore" or errors="replace" tells Python to drop unencodable characters or substitute a replacement ("?" when encoding, U+FFFD when decoding) rather than raising an exception. This can make it appear that the export succeeded when characters were actually lost or mutated.
Fix: Keep errors="strict" (the default) during debugging to surface encoding mismatches immediately.
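The effect of each error handler is easy to see on a plain string. The sketch below deliberately targets ASCII, an assumption chosen only to force failures, since UTF-8 itself can encode almost any text.

```python
text = "München 北京"

# "ignore" silently drops every character the codec cannot represent.
ignored = text.encode("ascii", errors="ignore")

# "replace" substitutes "?" for each unencodable character.
replaced = text.encode("ascii", errors="replace")

print(ignored)
print(replaced)

# "strict" (the default) refuses to write lossy output.
try:
    text.encode("ascii", errors="strict")
    strict_raised = False
except UnicodeEncodeError:
    strict_raised = True
print(strict_raised)
```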
System Locale and Console Misconfiguration
Even a correctly encoded UTF-8 file may display as garbage in terminals or editors configured for different code pages. This is a display issue, not an encoding issue, but it leads developers to believe the export failed.
Fix: Configure your terminal or text editor to use UTF-8, or explicitly open the file with open(path, encoding="utf-8") to verify contents.
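The display problem can be reproduced without any file at all: the same UTF-8 bytes look fine or garbled depending only on the codec used to decode them. Windows-1252 (cp1252) is used here as an example of a mismatched code page.

```python
# Correct UTF-8 bytes for a string with one non-ASCII character.
data = "München".encode("utf-8")

# Decoding with the matching codec recovers the original text.
as_utf8 = data.decode("utf-8")

# Decoding with a mismatched code page yields mojibake, yet no bytes changed.
as_cp1252 = data.decode("cp1252")

# The damage is reversible as long as the bytes themselves were never altered.
recovered = as_cp1252.encode("cp1252").decode("utf-8")

print(ascii(as_cp1252))
print(recovered == as_utf8)
```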
Reading Without Specifying Encoding
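pd.read_csv itself defaults to UTF-8, but many other readers do not: Python's built-in open() without an encoding argument, Excel, and legacy tools typically fall back to the system locale. Reading a UTF-8 file that way on a non-UTF-8 locale (common on Windows) produces mojibake even though the file itself is correctly encoded. The same happens when read_csv is given an encoding that does not match the one used for writing.
Fix: State the encoding explicitly on the reading side so it matches the writer: pd.read_csv(path, encoding="utf-8").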
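The locale default that unqualified readers may fall back to can be inspected from Python. This standard-library sketch prints it and then reads a small UTF-8 CSV with the encoding stated explicitly; the file name and contents are hypothetical.

```python
import csv
import locale
import os
import tempfile

# The encoding a locale-dependent reader would assume on this machine.
platform_default = locale.getpreferredencoding(False)
print(platform_default)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cities.csv")  # hypothetical file name

    # Write a tiny UTF-8 file to read back.
    with open(path, "w", encoding="utf-8", newline="") as fh:
        fh.write("city\n北京\n")

    # Naming the encoding makes the result independent of the locale.
    with open(path, encoding="utf-8", newline="") as fh:
        rows = list(csv.reader(fh))
```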
Solutions and Code Examples
Export a dataframe with international characters correctly:
import pandas as pd
df = pd.DataFrame({
    "city": ["München", "São Paulo", "北京"],
    "value": [1, 2, 3],
})
# Correct: Explicit UTF-8 text mode
df.to_csv("data_utf8.csv", index=False, encoding="utf-8")
# Correct: UTF-8 with compression
df.to_csv("data_utf8.zip", index=False, compression="zip", encoding="utf-8")
Read the file back safely:
# Always match the encoding when reading
df2 = pd.read_csv("data_utf8.csv", encoding="utf-8")
print(df2)
Avoid the binary mode pitfall:
# Risky: open() takes no encoding argument in binary mode, so the text layer must be re-applied on top
df.to_csv("bad.csv", mode="wb", encoding="utf-8")
# Preferred: text mode passes the encoding straight to open()
df.to_csv("good.csv", mode="w", encoding="utf-8")
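The write and read steps can be combined into a quick round-trip check. The helper name roundtrips_cleanly is hypothetical, and the sketch assumes pandas is installed; it compares values as strings so that dtype differences do not mask an encoding problem.

```python
import os
import tempfile

import pandas as pd


def roundtrips_cleanly(frame, encoding="utf-8"):
    """Hypothetical helper: write frame to CSV and compare what reads back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "roundtrip.csv")
        frame.to_csv(path, index=False, encoding=encoding)
        restored = pd.read_csv(path, encoding=encoding)
    # Compare cell values as strings, row by row.
    return frame.astype(str).values.tolist() == restored.astype(str).values.tolist()


df = pd.DataFrame({
    "city": ["München", "São Paulo", "北京"],
    "value": [1, 2, 3],
})
ok = roundtrips_cleanly(df)
print(ok)
```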
Key Source Files
Understanding these files clarifies why encoding issues occur:
- pandas/io/formats/csvs.py: Implements CSVFormatter, which handles column conversion and delegates file operations to get_handle. The __init__ method stores encoding, and save applies it.
- pandas/io/common.py: Provides get_handle, the centralized utility that opens file handles with the specified encoding, mode, and compression. This is where binary mode overrides encoding settings.
- pandas/tests/io/formats/test_to_csv.py: Contains test cases for encoding parameters and error handling during CSV export.
- pandas/tests/io/parser/test_encoding.py: Validates round-trip read/write operations with various encodings, demonstrating the necessity of matching encoding parameters on both export and import.
Summary
- DataFrame.to_csv delegates encoding to get_handle in pandas/io/common.py, which passes it to open() only in text modes.
- Binary mode (mode='wb') prevents get_handle from passing the encoding parameter to open(), so prefer text mode for CSV exports.
- Always use encoding="utf-8" (lower-case with hyphen) when writing and reading CSV files to ensure cross-platform compatibility.
- Keep errors="strict" during development to catch encoding mismatches before they silently corrupt data.
- Specify encoding on both sides: Export with to_csv(encoding="utf-8") and import with read_csv(encoding="utf-8") to prevent locale-based misinterpretation.
Frequently Asked Questions
Why does my CSV look correct in Python but shows garbage in Excel?
Excel uses your system's default code page to open CSV files unless you use the import data wizard. Save the file with a UTF-8 BOM (Byte Order Mark) by specifying encoding="utf-8-sig" in to_csv, or import the file through Excel's Data > From Text/CSV menu and manually select UTF-8 encoding.
Does compression affect UTF-8 encoding in pandas?
No, compression algorithms handle bytes, not characters. However, when using compression="zip" or similar, still specify encoding="utf-8" so the text is encoded with the correct codec before the resulting bytes are compressed.
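The order of operations can be illustrated with the standard gzip module, used here as a stand-in for whichever compression pandas applies: text is encoded to bytes first, and only those bytes are compressed.

```python
import gzip

text = "city,value\nSão Paulo,2\n"

# Step 1: characters become bytes under the chosen codec.
encoded = text.encode("utf-8")

# Step 2: compression operates on those bytes and knows nothing about text.
compressed = gzip.compress(encoded)

# Reading reverses the steps: decompress to bytes, then decode with the same codec.
restored = gzip.decompress(compressed).decode("utf-8")
print(restored == text)
```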
What is the difference between utf-8 and utf-8-sig in pandas?
utf-8 writes the raw UTF-8 byte sequence, while utf-8-sig prepends a BOM (Byte Order Mark) to the file. The BOM helps some applications (like Excel) recognize the file as UTF-8, but it can interfere with Unix tools that expect plain text. Use utf-8-sig only when targeting applications that require BOM detection.
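The difference between the two codecs is exactly three leading bytes, which this standard-library sketch makes visible.

```python
import codecs

plain = "city".encode("utf-8")
with_bom = "city".encode("utf-8-sig")

# utf-8-sig prepends the UTF-8 byte order mark (EF BB BF).
starts_with_bom = with_bom.startswith(codecs.BOM_UTF8)
extra_bytes = len(with_bom) - len(plain)

# Decoding with plain utf-8 leaves the BOM behind as U+FEFF.
left_over = with_bom.decode("utf-8")

# Decoding with utf-8-sig strips it again.
stripped = with_bom.decode("utf-8-sig")

print(starts_with_bom, extra_bytes)
```

The leftover U+FEFF is what tends to show up glued to the first column name when a BOM-prefixed file is read as plain utf-8.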
Why do I get a UnicodeEncodeError even with encoding="utf-8"?
UTF-8 can represent every valid Unicode code point, so with encoding="utf-8" this error almost always means the data contains lone surrogates (for example, text that was decoded earlier with errors="surrogateescape"). With a narrower encoding it occurs whenever a character has no representation in that encoding (for example, emoji in an ASCII file). Verify your data contains only valid Unicode code points, or use errors="replace" only as a last resort for lossy export.
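Offending values can be located before export by attempting a strict encode on each one. The helper name find_unencodable and the sample strings are hypothetical.

```python
def find_unencodable(values, encoding="utf-8"):
    """Hypothetical helper: return the values that fail a strict encode."""
    bad = []
    for value in values:
        try:
            value.encode(encoding, errors="strict")
        except UnicodeEncodeError:
            bad.append(value)
    return bad


# An emoji is valid Unicode; a lone surrogate such as U+DC80 is not encodable.
samples = ["München", "😀", "broken\udc80"]
bad = find_unencodable(samples)
print(len(bad))
```

Applied to a DataFrame, the same check could be run over each string column to find the rows that need cleaning.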