How to Fix Encoding Errors in a CSV File (Accents, Special Characters)

You open a CSV file and the first name column looks like this: AntoÃ¯ne, JosÃ©, FranÃ§ois. Or the accents are gone entirely, replaced by question marks or empty boxes.

This is a CSV encoding error. It’s one of the most common data quality problems in CSV files, and it’s almost always fixable once you know which mismatch caused it.

What encoding actually is

Every character in a text file is stored as one or more numbers (bytes). Encoding is the system that maps those numbers to characters. In Windows-1252, A is stored as 65 and é as 233. In UTF-8, A is also 65, but é takes two bytes: 195 and 169.

The problem is that there are multiple encoding systems, and they use different numbers for the same character. When a file is written in one encoding and read in another, the numbers get interpreted incorrectly. What was é becomes garbage because the reader is using a different map.

For CSV files, the two encodings you’ll encounter most often are UTF-8 and Windows-1252.

UTF-8 vs Windows-1252

UTF-8 is the modern universal standard. It can represent every character in Unicode, which covers virtually every written language. It’s the default for web content and for most software built in the last 10–15 years.

Windows-1252 (also called CP-1252 or “Western European”) is a legacy encoding from the Windows era. It represents Western European characters (accents, ñ, ç, ü, etc.) but nothing beyond that. Many older Windows applications (and some that still default to system locale settings) write CSV files in Windows-1252.

The characters in the ASCII range (A–Z, 0–9, common punctuation) are identical in both encodings. A CSV file with no accented characters looks the same whether it’s UTF-8 or Windows-1252. The mismatch only surfaces when accented or special characters appear.

UTF-8 file read as Windows-1252: accented characters become multi-character garbage. é (stored as bytes 0xC3 0xA9 in UTF-8) is read as two characters: Ã and ©.

Windows-1252 file read as UTF-8: accented characters may become question marks (?), replacement characters (<EFBFBD>), or be dropped entirely. Some Windows-1252 byte sequences are invalid UTF-8 and can’t be decoded at all.
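Both directions of the mismatch can be reproduced in a few lines of Python. This is a minimal sketch using only built-in codecs (cp1252 is Python’s name for Windows-1252); the variable names are illustrative.

```python
# "é" encoded under each scheme: two bytes in UTF-8, one in Windows-1252.
utf8_bytes = "é".encode("utf-8")       # b'\xc3\xa9'
cp1252_bytes = "é".encode("cp1252")    # b'\xe9'

# UTF-8 bytes decoded with the wrong map: each byte becomes its own character.
misread = utf8_bytes.decode("cp1252")  # 'Ã©'

# Windows-1252 bytes decoded as UTF-8: 0xE9 announces a multi-byte sequence
# that never completes, so strict decoding fails.
try:
    cp1252_bytes.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# With errors="replace", the undecodable byte becomes U+FFFD instead.
replaced = cp1252_bytes.decode("utf-8", errors="replace")

# Plain ASCII text is byte-for-byte identical in both encodings.
same_ascii = "Name,Email".encode("utf-8") == "Name,Email".encode("cp1252")
```

Whether a real tool shows a question mark, a replacement character, or nothing at all depends on how that tool handles the decoding error.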

The BOM problem

The BOM (Byte Order Mark) is a hidden character some tools add to the beginning of UTF-8 files, stored as the three bytes EF BB BF. It’s invisible in most text editors but present in the raw bytes. Excel writes it when you save as “UTF-8 CSV”. Some tools read it correctly; others display it as visible garbage, usually ï»¿, prepended to the first value in the file.

This is why you sometimes open a CSV and the first column header has junk at the start: ï»¿Name instead of Name.

The BOM doesn’t break the encoding of the rest of the file. But it corrupts the first field name, which is enough to break any automated process that matches on column names.
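The effect on the first column header is easy to see in Python. The sketch below uses a made-up two-column file held in memory; the utf-8-sig codec is part of Python’s standard library and drops a leading BOM if one is present.

```python
import csv
import io

# Hypothetical file contents: a UTF-8 BOM followed by a two-column CSV.
raw = b"\xef\xbb\xbfName,City\r\nJos\xc3\xa9,Lyon\r\n"

has_bom = raw.startswith(b"\xef\xbb\xbf")

# Decoding as plain "utf-8" keeps the BOM as U+FEFF glued to the first header.
with_bom = next(csv.reader(io.StringIO(raw.decode("utf-8"), newline="")))

# "utf-8-sig" drops a leading BOM if present and is harmless if there is none.
without_bom = next(csv.reader(io.StringIO(raw.decode("utf-8-sig"), newline="")))

# Misreading the BOM bytes as Windows-1252 gives the familiar visible junk.
junk = raw[:3].decode("cp1252")
```

Here with_bom[0] is not equal to "Name", which is exactly the kind of mismatch that breaks a lookup by column name.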

How to detect which encoding your file uses

The most reliable approach is to use a tool that reports the encoding directly.

Notepad++ (Windows): open the file. The bottom-right status bar shows the encoding. “UTF-8” means standard UTF-8. “UTF-8-BOM” means a BOM is present. “ANSI” means Windows-1252 or your system’s default locale encoding.

VS Code: open the file. Click the encoding label in the bottom-right status bar (it shows “UTF-8” by default). Note that VS Code assumes UTF-8 unless it finds a BOM or the files.autoGuessEncoding setting is enabled, so the label alone doesn’t prove the file’s real encoding. Use “Reopen with Encoding” to test how the file looks under a different encoding without modifying it.

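file command (Mac/Linux terminal):

file -i yourfile.csv    # Linux
file -I yourfile.csv    # macOS (capital I)

Returns the charset as guessed by the file utility from the file’s contents. Windows-1252 files are often reported as iso-8859-1 or unknown-8bit.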

Python (chardet library):

import chardet
with open("yourfile.csv", "rb") as f:
    result = chardet.detect(f.read())
print(result)

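Returns a dictionary with the detected encoding and a confidence score. Detection is statistical: it is reliable for common cases, but treat the result with caution on short files or files with very few non-ASCII characters.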
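If installing a library isn’t an option, a rough check is possible with the standard library alone. The function below is a hypothetical helper, not part of any package, and it assumes the only candidates are the ones this article covers: UTF-8, UTF-8 with BOM, and Windows-1252.

```python
def guess_encoding(data: bytes) -> str:
    """Heuristic guess among UTF-8, UTF-8 with BOM, and Windows-1252."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"          # UTF-8 with BOM
    try:
        data.decode("utf-8")        # strict: any invalid sequence raises
        return "utf-8"
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is Windows-1252.
        return "cp1252"
```

To use it, read the file in binary mode (open("yourfile.csv", "rb")) and pass the bytes in. This works because accented Windows-1252 text is very rarely valid UTF-8 by accident, but it can’t tell Windows-1252 apart from other single-byte encodings. A pure-ASCII file is reported as utf-8, which is harmless.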

If none of these tools are available: if the file came from a modern web app or API, it’s almost certainly UTF-8. If it came from an older Windows tool or was sent by someone on a French, German, or Spanish Windows machine, it’s probably Windows-1252.

How to convert the encoding

Once you’ve identified the encoding, conversion is straightforward.

In Notepad++

File is open → Encoding menu → “Convert to UTF-8” → Save. If your destination specifically requires a BOM, use “Convert to UTF-8-BOM” instead. For most purposes, UTF-8 without BOM is the right choice.

In VS Code

Bottom-right encoding label → “Reopen with Encoding” → select the source encoding (e.g. “Western (Windows 1252)”) → verify the file now shows characters correctly → encoding label again → “Save with Encoding” → UTF-8.

In Python

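# Read with the encoding you detected. newline="" keeps line endings unchanged.
with open("yourfile.csv", "r", encoding="windows-1252", newline="") as f:
    content = f.read()

# Write the same text back out as UTF-8 (no BOM).
with open("yourfile_utf8.csv", "w", encoding="utf-8", newline="") as f:
    f.write(content)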

Replace windows-1252 with whatever encoding you detected. Common values: cp1252 (another name for windows-1252), latin-1 (another name for iso-8859-1), and utf-8-sig (UTF-8 with BOM; Python strips the BOM on read).

In Google Sheets

File → Import → Upload → “File encoding” dropdown → select the detected encoding. Google Sheets decodes it correctly on import. Then File → Download → CSV (Google Sheets exports UTF-8 automatically).

What Excel does to encoding

Excel is the main source of encoding corruption in CSV workflows.

When you open a CSV by double-clicking it on Windows, Excel reads it using your system locale rather than UTF-8, unless the file starts with a BOM. On a French or German machine, the locale encoding is usually Windows-1252. If the file was UTF-8 without a BOM, accented characters are corrupted on open. If you then save, the corrupted version is written to disk. The original is gone.

The safe way to open a CSV in Excel without corrupting it:

Open Excel first. Don’t double-click the file.
Data → Get External Data → From Text (in recent versions: Data → Get Data → From File → From Text/CSV)
In the import wizard, set the file origin to “65001: Unicode (UTF-8)”
Complete the import wizard normally

Or use Google Sheets, which handles UTF-8 by default and never silently re-encodes.

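When saving from Excel: “CSV UTF-8 (comma delimited)” writes UTF-8 with a BOM. “CSV (comma delimited)” writes in your system locale encoding, usually Windows-1252 on Western European machines, and any character that encoding can’t represent is replaced with a question mark. The locale format avoids the BOM, but it is only safe if your system locale is already set to UTF-8.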
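When a script rather than Excel produces the file, the same BOM can be written from Python. This is a sketch with made-up rows and a temporary output path; utf-8-sig is the standard-library codec that writes the BOM, and newline="" is what the csv module’s documentation recommends when opening files.

```python
import csv
import os
import tempfile

rows = [["Name", "City"], ["José", "Besançon"]]

# Hypothetical output path in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "for_excel.csv")

# "utf-8-sig" writes the BOM first; newline="" lets csv control line endings.
with open(path, "w", encoding="utf-8-sig", newline="") as f:
    csv.writer(f).writerows(rows)

# Confirm the file starts with the three BOM bytes.
with open(path, "rb") as f:
    first_bytes = f.read(3)

# Reading with "utf-8-sig" drops the BOM again.
with open(path, "r", encoding="utf-8-sig", newline="") as f:
    read_back = list(csv.reader(f))
```

For destinations that choke on a BOM, the same code with encoding="utf-8" writes the file without one.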

Fixing already-corrupted data

If the file has already been saved with the wrong encoding and you can’t go back to the source, recovery depends on the direction of the mismatch.

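UTF-8 file read and saved as Windows-1252: the original byte sequences are likely still intact. Re-read the file as UTF-8 (as described above) and the original characters will usually be restored. If the misread text was instead re-saved as UTF-8, the garbage characters themselves (Ã©) are now stored in the file. Encoding that text back to Windows-1252 bytes and decoding those bytes as UTF-8 usually reverses it.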

Windows-1252 file read and saved as UTF-8: this is harder. Some Windows-1252 byte sequences produce valid UTF-8 characters (just wrong ones), making the corruption look like normal text. Recovery usually requires a custom script that reverses the specific byte-level corruption, or manual correction of affected values.
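For the first case, where UTF-8 text was misread as Windows-1252 and the file now literally contains sequences like Ã©, one round of the mistake can often be undone in Python. fix_mojibake is a hypothetical helper name, and the sketch assumes exactly one misread happened.

```python
def fix_mojibake(text: str) -> str:
    """Undo one round of 'UTF-8 bytes decoded as Windows-1252'."""
    try:
        # Turn the garbage characters back into the original bytes,
        # then decode those bytes the way they should have been read.
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not reversible this way (e.g. already-correct text); leave unchanged.
        return text
```

Apply it to each cell rather than to the whole file, so that one value it can’t reverse doesn’t block the rest. Characters whose UTF-8 bytes have no Windows-1252 equivalent (for example Á, whose second byte is 0x81) may not survive the round trip and are returned unchanged here.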

This is why converting and verifying encoding before doing anything else to a file matters. Once corrupted data gets processed and re-saved multiple times, recovery becomes progressively harder.

Checklist before using your CSV

Open the file in a text editor (not Excel) and check for visible encoding issues in the first few rows
Check the encoding label in Notepad++ or VS Code
If the encoding is Windows-1252 or ANSI, convert to UTF-8 before any further processing
Check for a BOM and strip it if your destination doesn’t handle it
Never open a CSV by double-clicking in Excel if it contains accented characters or non-ASCII text
After conversion, spot-check rows with accented characters to confirm they render correctly
If the file was merged from multiple source files, check encoding consistency: different sources may have had different encodings, and the merge may have mixed them
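The last check, consistency across merged sources, can be roughly automated by testing each line on its own. find_non_utf8_lines is a hypothetical helper and the sample data is made up for illustration.

```python
def find_non_utf8_lines(data: bytes) -> list:
    """Return 1-based numbers of lines that are not valid UTF-8."""
    bad = []
    # Splitting on the newline byte is safe: 0x0A never appears inside
    # a multi-byte UTF-8 sequence.
    for number, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            bad.append(number)
    return bad


# Hypothetical merged file: line 2 is UTF-8, line 3 is Windows-1252.
merged = b"Name\n" + "José\n".encode("utf-8") + "François\n".encode("cp1252")
mixed_lines = find_non_utf8_lines(merged)
```

If the result is empty, the file is consistently valid UTF-8. If only some lines are listed, the file probably mixes encodings, and those lines need converting separately.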

If encoding is one of several issues in your file (alongside column naming, value normalisation, or format inconsistencies), Asphorem’s CSV Normalizer processes files entirely in the browser, so your raw file never leaves your machine during cleanup.

CSV Encoding Errors: Frequently Asked Questions

Why do accented characters show up as Ã© in my CSV file?

A UTF-8 file is being read as Windows-1252. The byte sequence for é in UTF-8 (0xC3 0xA9) decodes as two characters (Ã and ©) when read with Western European encoding. Reopen the file with UTF-8 encoding and the characters restore correctly.

How do you check the encoding of a CSV file?

Open the file in Notepad++ (Windows) or VS Code and look at the encoding label in the bottom-right status bar. On Linux, run file -i yourfile.csv in a terminal (on macOS, file -I yourfile.csv). For automated detection in scripts, use Python’s chardet library.

What’s the difference between UTF-8 and UTF-8 with BOM?

A BOM (Byte Order Mark) is a hidden three-byte marker some tools add to the beginning of UTF-8 files. Excel adds it when saving as “CSV UTF-8”. Some tools handle the BOM correctly, others display it as garbage (ï»¿) prepended to the first column header. Save without BOM unless your destination specifically needs one.

How do you convert a Windows-1252 CSV to UTF-8?

In Notepad++: Encoding menu → Convert to UTF-8 → Save. In VS Code: bottom-right encoding label → “Reopen with Encoding” → Western (Windows 1252) → confirm characters look right → “Save with Encoding” → UTF-8. In Python: read with encoding="windows-1252", write with encoding="utf-8".

Why does my European CSV have encoding issues when I open it in Excel?

Excel reads CSV files using your system locale, not UTF-8. On a French or German Windows machine, that’s usually Windows-1252, which corrupts UTF-8 files on open. See “Why European CSV files break in CRM imports” for the broader set of European-specific issues.