Python Character Encoding
1. Setting Up the Environment
First, import the libraries you need and set a seed for reproducibility. Note that `charset_normalizer` is a third-party package (`pip install charset-normalizer`):
```python
import pandas as pd
import numpy as np
import charset_normalizer

np.random.seed(0)
```
2. What Are Encodings?
Character encodings map raw bytes to human-readable text. Using the wrong encoding can produce:
- Garbled text (mojibake): 文å—化ã??
- Unknown characters: ����������
The most common and recommended encoding is UTF-8, which is also the standard text encoding in Python 3. Converting non-UTF-8 input into UTF-8 early prevents errors and data loss.
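To see mojibake happen, you can decode UTF-8 bytes with the wrong codec. A minimal sketch (the accented example string is just an illustration):

```python
# Mojibake demo: UTF-8 bytes misread as Latin-1
text = "café"
utf8_bytes = text.encode("utf-8")        # b'caf\xc3\xa9'

# Latin-1 maps every byte to exactly one character, so the two bytes
# of the UTF-8 "é" become two separate garbled characters
garbled = utf8_bytes.decode("latin-1")
print(garbled)  # cafÃ©
```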
3. Strings vs Bytes in Python
Python 3 has two distinct types for working with text:

- `str`: human-readable Unicode text.
- `bytes`: the raw binary representation of that text in some encoding.
```python
# String to bytes
text = "This is the euro symbol: €"
bytes_text = text.encode("utf-8")

# Bytes back to string
text_back = bytes_text.decode("utf-8")
```
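The round trip above can be checked directly. This sketch verifies that encoding and decoding with the same codec is lossless, and shows that `str` length (characters) and `bytes` length (bytes) can differ:

```python
text = "This is the euro symbol: €"
bytes_text = text.encode("utf-8")

assert isinstance(text, str)
assert isinstance(bytes_text, bytes)
assert bytes_text.decode("utf-8") == text  # round trip is lossless

# str length counts characters, bytes length counts bytes:
# the euro sign is 1 character but 3 bytes in UTF-8
print(len(text), len(bytes_text))  # 26 28
```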
4. Problems With Wrong Encodings
Using the wrong encoding can:
- Produce garbled characters
- Replace characters with `?` (loss of data)
- Raise `UnicodeDecodeError`
```python
# Example of losing characters
before = "This is the euro symbol: €"
after = before.encode("ascii", errors="replace")
print(after.decode("ascii"))  # Output: This is the euro symbol: ?
```
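Besides `errors="replace"`, Python's codecs accept other error handlers. A quick sketch of the main options:

```python
before = "This is the euro symbol: €"

# "ignore" silently drops characters that cannot be encoded
print(before.encode("ascii", errors="ignore").decode("ascii"))

# The default handler, "strict", raises an exception instead of
# silently corrupting the data
try:
    before.encode("ascii")
except UnicodeEncodeError as exc:
    print("strict mode raised:", exc)
```

Note that encoding a string raises `UnicodeEncodeError`, while decoding bad bytes raises `UnicodeDecodeError`.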
5. Detecting File Encodings
Use `charset_normalizer` to guess a file's encoding from a sample of its bytes (the result is a best guess with a confidence score, not a guarantee):
```python
import charset_normalizer

with open("file.csv", "rb") as f:
    result = charset_normalizer.detect(f.read(10000))

print(result)  # Example: {'encoding': 'Windows-1252', 'confidence': 0.73}
```
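If you cannot install a detector, a crude fallback is to try a list of candidate encodings in order of strictness. `sniff_encoding` here is a hypothetical helper written for this sketch, not part of any library:

```python
def sniff_encoding(raw: bytes, candidates=("utf-8", "windows-1252", "latin-1")):
    """Return the first candidate encoding that decodes raw without error."""
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None

# UTF-8 goes first because it is strict: non-UTF-8 bytes rarely decode
# as valid UTF-8, so a success is a strong signal. latin-1 decodes any
# byte sequence at all, so it only makes sense as a last resort.
print(sniff_encoding("café".encode("utf-8")))         # utf-8
print(sniff_encoding("café".encode("windows-1252")))  # windows-1252
```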
6. Reading Files With Correct Encoding
```python
import pandas as pd

df = pd.read_csv("file.csv", encoding="Windows-1252")
df.head()
```
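A common pattern is to try UTF-8 first and fall back to the detected encoding only when it fails. This sketch writes a small Windows-1252 file (`legacy.csv` is a hypothetical name used just for the demo) and then reads it back:

```python
import pandas as pd

# Create a small Windows-1252 CSV for the demo
with open("legacy.csv", "wb") as f:
    f.write("city,price\nMálaga,10€\n".encode("windows-1252"))

try:
    df = pd.read_csv("legacy.csv", encoding="utf-8")
except UnicodeDecodeError:
    # Fall back to the encoding the detector suggested
    df = pd.read_csv("legacy.csv", encoding="windows-1252")

print(df.iloc[0, 0])  # Málaga
```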
7. Saving Files in UTF-8
Keep your files in UTF-8 to avoid future encoding issues:
```python
df.to_csv("file_utf8.csv", encoding="utf-8")
```
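To confirm the conversion is lossless, you can round-trip a small frame through disk. This sketch writes and re-reads a UTF-8 CSV (the filename matches the example above; the sample data is made up for the demo):

```python
import pandas as pd

df = pd.DataFrame({"symbol": ["€", "é"]})
df.to_csv("file_utf8.csv", index=False, encoding="utf-8")

# Re-read with an explicit UTF-8 encoding to confirm nothing was lost
df2 = pd.read_csv("file_utf8.csv", encoding="utf-8")
assert df2["symbol"].tolist() == ["€", "é"]
```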
Summary
- Always be aware of text encodings.
- Use UTF-8 consistently when processing and saving files.
- Use tools like `charset_normalizer` to detect unknown encodings.
- Convert non-UTF-8 files to UTF-8 as soon as possible to avoid data loss.