Python Character Encoding
1. Setting Up the Environment
First, import the libraries you need and set a seed for reproducibility. Note that `charset_normalizer` is a third-party package (`pip install charset-normalizer`):
```python
import pandas as pd
import numpy as np
import charset_normalizer

np.random.seed(0)
```
2. What Are Encodings?
Character encodings map raw bytes to human-readable text. Using the wrong encoding can produce:
- Garbled text (mojibake): 文å—化ã??
- Unknown characters: ����������
The most common and recommended encoding is UTF-8, which is also the standard text encoding in Python 3. Converting non-UTF-8 input into UTF-8 early prevents errors and data loss.
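To see mojibake happen, you can decode UTF-8 bytes with the wrong codec. A minimal sketch (the accented example string is just an illustration):

```python
# Mojibake demo: UTF-8 bytes misread as Latin-1
text = "café"
utf8_bytes = text.encode("utf-8")        # b'caf\xc3\xa9'

# Latin-1 maps every byte to exactly one character, so the two bytes
# of the UTF-8 "é" become two separate garbled characters
garbled = utf8_bytes.decode("latin-1")
print(garbled)  # cafÃ©
```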
3. Strings vs Bytes in Python
Python 3 has two distinct types for working with text:

- `str`: human-readable Unicode text.
- `bytes`: the raw binary representation of that text in some encoding.
```python
# String to bytes
text = "This is the euro symbol: €"
bytes_text = text.encode("utf-8")

# Bytes back to string
text_back = bytes_text.decode("utf-8")
```
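The round trip above can be checked directly. This sketch verifies that encoding and decoding with the same codec is lossless, and shows that `str` length (characters) and `bytes` length (bytes) can differ:

```python
text = "This is the euro symbol: €"
bytes_text = text.encode("utf-8")

assert isinstance(text, str)
assert isinstance(bytes_text, bytes)
assert bytes_text.decode("utf-8") == text  # round trip is lossless

# str length counts characters, bytes length counts bytes:
# the euro sign is 1 character but 3 bytes in UTF-8
print(len(text), len(bytes_text))  # 26 28
```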
4. Problems With Wrong Encodings
Using the wrong encoding can:
- Produce garbled characters
- Replace characters with `?` (loss of data)
- Raise `UnicodeDecodeError`
```python
# Example of losing characters
before = "This is the euro symbol: €"
after = before.encode("ascii", errors="replace")
print(after.decode("ascii"))  # Output: This is the euro symbol: ?
```
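Besides `errors="replace"`, Python's codecs accept other error handlers. A quick sketch of the main options:

```python
before = "This is the euro symbol: €"

# "ignore" silently drops characters that cannot be encoded
print(before.encode("ascii", errors="ignore").decode("ascii"))

# The default handler, "strict", raises an exception instead of
# silently corrupting the data
try:
    before.encode("ascii")
except UnicodeEncodeError as exc:
    print("strict mode raised:", exc)
```

Note that encoding a string raises `UnicodeEncodeError`, while decoding bad bytes raises `UnicodeDecodeError`.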
5. Detecting File Encodings
Use `charset_normalizer` to guess a file's encoding from a sample of its bytes (the result is a best guess with a confidence score, not a guarantee):
```python
import charset_normalizer

with open("file.csv", "rb") as f:
    result = charset_normalizer.detect(f.read(10000))

print(result)  # Example: {'encoding': 'Windows-1252', 'confidence': 0.73}
```
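If you cannot install a detector, a crude fallback is to try a list of candidate encodings in order of strictness. `sniff_encoding` here is a hypothetical helper written for this sketch, not part of any library:

```python
def sniff_encoding(raw: bytes, candidates=("utf-8", "windows-1252", "latin-1")):
    """Return the first candidate encoding that decodes raw without error."""
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None

# UTF-8 goes first because it is strict: non-UTF-8 bytes rarely decode
# as valid UTF-8, so a success is a strong signal. latin-1 decodes any
# byte sequence at all, so it only makes sense as a last resort.
print(sniff_encoding("café".encode("utf-8")))         # utf-8
print(sniff_encoding("café".encode("windows-1252")))  # windows-1252
```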
6. Reading Files With Correct Encoding
```python
import pandas as pd

df = pd.read_csv("file.csv", encoding="Windows-1252")
df.head()
```
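A common pattern is to try UTF-8 first and fall back to the detected encoding only when it fails. This sketch writes a small Windows-1252 file (`legacy.csv` is a hypothetical name used just for the demo) and then reads it back:

```python
import pandas as pd

# Create a small Windows-1252 CSV for the demo
with open("legacy.csv", "wb") as f:
    f.write("city,price\nMálaga,10€\n".encode("windows-1252"))

try:
    df = pd.read_csv("legacy.csv", encoding="utf-8")
except UnicodeDecodeError:
    # Fall back to the encoding the detector suggested
    df = pd.read_csv("legacy.csv", encoding="windows-1252")

print(df.iloc[0, 0])  # Málaga
```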
7. Saving Files in UTF-8
Keep your files in UTF-8 to avoid future encoding issues:
```python
df.to_csv("file_utf8.csv", encoding="utf-8")
```
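To confirm the conversion is lossless, you can round-trip a small frame through disk. This sketch writes and re-reads a UTF-8 CSV (the filename matches the example above; the sample data is made up for the demo):

```python
import pandas as pd

df = pd.DataFrame({"symbol": ["€", "é"]})
df.to_csv("file_utf8.csv", index=False, encoding="utf-8")

# Re-read with an explicit UTF-8 encoding to confirm nothing was lost
df2 = pd.read_csv("file_utf8.csv", encoding="utf-8")
assert df2["symbol"].tolist() == ["€", "é"]
```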
Summary
- Always be aware of text encodings.
- Use UTF-8 consistently when processing and saving files.
- Use tools like `charset_normalizer` to detect unknown encodings.
- Convert non-UTF-8 files to UTF-8 as soon as possible to avoid data loss.