Fix UTF-8 Issues in Python: Stop Garbled Text

1. Introduction

Python has strong built-in support for text processing and is used worldwide. However, when handling Japanese or other multilingual text in Python, you need to choose the correct character encoding. UTF-8 can represent every Unicode character, including Japanese, so using it consistently reduces the risk of garbled text (mojibake). This guide explains how to handle UTF-8 encoding in Python and provides practical methods to prevent garbled text. It covers the basics of encoding and decoding, encoding settings for file operations, Windows-specific considerations, and solutions to common errors, so you can apply it in practice.

2. Basics of Character Encoding in Python

Fundamentals of Character Encoding

Character encoding is the rule for converting characters into bytes that a computer can store and transmit. For example, the character ‘あ’ is encoded as three bytes in UTF-8. In Python, text is held in the str type and encoded data in the bytes type: encoding converts str to bytes, and decoding converts bytes back to str.
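The three-byte claim for ‘あ’ can be checked directly. This is a minimal sketch; the variable names are illustrative only.

```python
# A str holds Unicode characters; bytes holds their encoded form.
char = "あ"
encoded = char.encode("utf-8")

print(len(char))      # 1 (one character)
print(len(encoded))   # 3 (three bytes in UTF-8)
print(encoded)        # b'\xe3\x81\x82'
print(list(encoded))  # [227, 129, 130]
```

The length of a str counts characters, while the length of bytes counts encoded bytes, which is why the two numbers differ for non-ASCII text.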

Encoding and Decoding in Python

In Python, use the encode() method to encode strings and the decode() method to decode bytes. This allows conversion between text data and byte data.

Encoding Example

The following example encodes a string in UTF-8 and displays it as a byte sequence.
text = "PythonでUTF-8を使う"  # "Using UTF-8 in Python"
encoded_text = text.encode("utf-8")
print(encoded_text)
# Output: b'Python\xe3\x81\xa7UTF-8\xe3\x82\x92\xe4\xbd\xbf\xe3\x81\x86'

Decoding Example

Next, here’s how to convert a UTF-8 encoded byte sequence back to the original string.
decoded_text = encoded_text.decode("utf-8")
print(decoded_text)
# Output: PythonでUTF-8を使う
By understanding how to convert between strings and bytes, you’ll be able to handle encodings correctly.

3. Handling UTF-8 in Python

Specifying UTF-8 for file operations

When working with files in Python, it is recommended to explicitly specify UTF-8 encoding. If you do not specify an encoding, the platform-dependent default encoding will be used, which can cause garbled text.
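To see which default applies on a given machine, something like the following sketch may help. It assumes CPython 3.7 or later (where sys.flags.utf8_mode exists); the variable names are illustrative.

```python
import locale
import sys

# Encoding that open() falls back to when no encoding argument is given.
default_file_encoding = locale.getpreferredencoding(False)

# 1 when UTF-8 mode is active, 0 otherwise (attribute added in Python 3.7).
utf8_mode = sys.flags.utf8_mode

print("Default for open():", default_file_encoding)
print("UTF-8 mode:", utf8_mode)
```

The printed values depend on the operating system and locale, which is exactly why relying on the default is risky.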

Example: Writing to a file

with open("sample.txt", "w", encoding="utf-8") as f:
    f.write("Hello, Python!")

Example: Reading from a file

with open("sample.txt", "r", encoding="utf-8") as f:
    content = f.read()
    print(content)
# Output: Hello, Python!
Specifying UTF-8 for file operations helps prevent garbled text in multilingual content, including Japanese.

Risks of forgetting to specify the encoding

If no encoding is specified, open() uses the system’s default encoding. On Japanese-language Windows this is typically cp932 (Microsoft’s variant of Shift_JIS), so a UTF-8 file read this way can come out garbled or raise an error. When performing file operations, make it a habit to always specify encoding="utf-8".
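The following sketch simulates that risk in a platform-independent way. Instead of relying on the system default, it spells out both encodings explicitly: a file is saved as cp932 and then read as UTF-8. The file name legacy.txt and the sample text are assumptions made for illustration.

```python
import os
import tempfile

text = "こんにちは"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "legacy.txt")

    # Save the file in cp932, as a legacy Japanese Windows tool might.
    with open(path, "w", encoding="cp932") as f:
        f.write(text)

    # Reading it with a different encoding fails for this data.
    try:
        with open(path, "r", encoding="utf-8") as f:
            f.read()
        wrong_read_failed = False
    except UnicodeDecodeError as exc:
        wrong_read_failed = True
        print("Decoding as UTF-8 failed:", exc.reason)

    # Reading with the encoding the file was actually saved in works.
    with open(path, "r", encoding="cp932") as f:
        restored = f.read()

print(restored == text)  # True
```

The same mismatch happens implicitly when one side of the read/write pair omits the encoding argument and the platform default differs from the other side.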

4. Considerations for Windows environments

On Japanese-language Windows, the system default encoding is usually cp932 (a Shift_JIS variant), so failing to specify UTF-8 when handling data that includes Japanese can result in garbled text. This section introduces countermeasures using UTF-8 mode (PEP 540) and environment variables.

Setting the PYTHONUTF8 environment variable

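To make Python default to UTF-8 on Windows, set the PYTHONUTF8 environment variable to “1”. This enables UTF-8 mode, in which open() and similar file operations use UTF-8 whenever no encoding is given explicitly. An explicit encoding argument still takes precedence.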

How to set the environment variable

  1. Open the Environment Variables dialog. From the “Edit environment variables” dialog, add a new variable.
  2. Add the variable. Set the variable name to “PYTHONUTF8” and the value to “1”.
With this setting, UTF-8 becomes the default encoding, reducing the risk of garbled text in file operations.
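One way to confirm that the variable takes effect is to start a child Python process with PYTHONUTF8 set and inspect sys.flags.utf8_mode. This is a sketch assuming Python 3.7 or later; it sets the variable only for the child process rather than system-wide.

```python
import os
import subprocess
import sys

# Copy the current environment and enable UTF-8 mode for the child only.
env = dict(os.environ, PYTHONUTF8="1")

result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.flags.utf8_mode)"],
    env=env,
    capture_output=True,
    text=True,
    check=True,
)

utf8_mode_in_child = result.stdout.strip()
print(utf8_mode_in_child)  # 1
```

After setting the variable through the Windows dialog, running the same one-line check in a newly opened terminal should also print 1.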

5. Changing the Default Encoding in Python 3

Starting with Python 3.7, UTF-8 mode can be enabled using the -X utf8 option or the PYTHONUTF8 environment variable. When enabled, Python will use UTF-8 as the default encoding regardless of the system encoding.

Enable UTF-8 Mode Using a Command-Line Argument

python -X utf8 my_script.py
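This command runs the script in UTF-8 mode, so Python uses UTF-8 as its default encoding regardless of the environment. It does not convert files that were saved in another encoding. Those files still need the correct encoding argument when opened.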
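The effect of the option on open() can be observed with a short sketch. It launches a child interpreter with -X utf8 and prints the encoding that open() picks when none is given. The child code and the use of os.devnull as a harmless file to open are choices made for this illustration.

```python
import subprocess
import sys

# Code run in the child: open a file without an encoding argument
# and report which encoding Python selected.
child_code = (
    "import os\n"
    "with open(os.devnull) as f:\n"
    "    print(f.encoding)\n"
)

result = subprocess.run(
    [sys.executable, "-X", "utf8", "-c", child_code],
    capture_output=True,
    text=True,
    check=True,
)

# Normalize case, since the spelling of the name varies by Python version.
child_default = result.stdout.strip().lower()
print(child_default)
```

On Python 3.7 or later the printed name should be a spelling of UTF-8, whatever the system locale is.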

6. Causes of Garbled Text and How to Fix Them

Common Causes of Garbled Text

  1. Encoding mismatch
  • This happens when the file’s actual encoding differs from the encoding specified (or assumed by default) in Python.
  2. Encoding/decoding errors
  • A UnicodeDecodeError occurs when you try to decode data that was encoded with a non-UTF-8 encoding as UTF-8.
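The two causes above show up differently in practice: a mismatch can silently produce garbled characters, or it can raise an exception. The sketch below illustrates both at the byte level. The choice of latin-1 and cp932 as the "wrong" encodings is an assumption made for demonstration.

```python
text = "日本語"

# Case 1: UTF-8 bytes decoded with a single-byte encoding "succeed"
# but produce mojibake, because every byte maps to some character.
utf8_bytes = text.encode("utf-8")
mojibake = utf8_bytes.decode("latin-1")
print(len(text), len(mojibake))  # 3 9

# The damage is reversible as long as the garbled string is intact.
repaired = mojibake.encode("latin-1").decode("utf-8")
print(repaired == text)  # True

# Case 2: cp932 bytes are not valid UTF-8, so decoding raises.
cp932_bytes = text.encode("cp932")
try:
    cp932_bytes.decode("utf-8")
    raised = False
except UnicodeDecodeError:
    raised = True
print(raised)  # True
```

Silent mojibake is the more dangerous case, because no error points to the line where the wrong encoding was used.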

How to Handle Encoding Errors

Error handling using errors="ignore" and errors="replace"

# Ignore decoding errors by dropping undecodable bytes
decoded_text = encoded_text.decode("utf-8", errors="ignore")

# Handle decoding errors by replacing undecodable bytes with U+FFFD
decoded_text = encoded_text.decode("utf-8", errors="replace")
The ignore option silently drops undecodable bytes, and the replace option substitutes the replacement character U+FFFD (�) for them. Both prevent a UnicodeDecodeError but lose the original data, so use them only when you cannot determine the correct encoding.
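To make the difference between the handlers concrete, here is a small sketch using deliberately damaged input. The byte 0xff never appears in valid UTF-8, so it is inserted between two valid sequences. The sample text is illustrative.

```python
# Valid UTF-8 for "日本" and "語" with one invalid byte (0xff) in between.
damaged = "日本".encode("utf-8") + b"\xff" + "語".encode("utf-8")

ignored = damaged.decode("utf-8", errors="ignore")
replaced = damaged.decode("utf-8", errors="replace")

print(ignored == "日本語")         # True: the bad byte vanished silently
print(replaced == "日本\ufffd語")  # True: U+FFFD marks where data was lost

# The default, errors="strict", raises instead of hiding the problem.
try:
    damaged.decode("utf-8")
    strict_raised = False
except UnicodeDecodeError:
    strict_raised = True
print(strict_raised)  # True
```

Because replace leaves a visible marker, it is usually easier to audit afterwards than ignore.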

7. Summary

Properly handling UTF-8 in Python is important to prevent garbled text and to ensure consistent data handling across different platforms. This article provided practical guidance on the basics of encoding and decoding in Python, precautions when working with files, and how to enable UTF-8 mode. Use this knowledge to correctly configure character encoding in Python and support global application development.