Encoding Detector & Converter
Detect UTF-8, ASCII, Latin-1, GBK, Shift-JIS and more — paste text or upload a file, view confidence scores and hex dump, convert between encodings — all in your browser
Drop a file here or click to browse
Any text or binary file — bytes are read directly
Paste text or upload a file to detect its encoding
A vendor sends you a CSV export and half the product names show Ã© instead of é. Your code reads the file as UTF-8 but it is actually Latin-1. You try iconv with five different encoding guesses before one works. Or worse: a legacy database dump has mixed encodings, with some rows in GBK Chinese, others in Shift-JIS Japanese, and an ASCII header. You need to detect the encoding before you can convert it.
Why This Tool (Not iconv or chardet)
iconv converts but doesn’t detect — you have to guess the source encoding. Python’s chardet requires a script. This tool detects the encoding with confidence scores, shows a hex dump so you can see the raw bytes, and lets you convert between encodings — all in your browser. Drop a file or paste text and get the answer instantly. No data is sent anywhere.
What Is Text Encoding?
Every file on a computer is ultimately a sequence of bytes — raw numbers from 0 to 255. A character encoding is the mapping that tells software how to interpret those bytes as human-readable text. The letter “A” is byte 0x41 in ASCII, UTF-8, and Latin-1, but the same byte can represent a completely different character in another encoding like GBK or Shift-JIS.
When a program reads a text file with the wrong encoding, the result is a corruption pattern developers know as mojibake (文字化け): strings of garbled symbols like Ã© instead of é, or Î±Î²Î³ instead of Greek αβγ.
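The ambiguity is easy to demonstrate. This Python sketch decodes the same two bytes under different encodings (GBK and Latin-1 chosen for illustration):

```python
# The same bytes, decoded two ways: the bytes never change, only the mapping does.
raw = bytes([0xD6, 0xD0])

print(raw.decode("gbk"))      # 中  (one Chinese character)
print(raw.decode("latin-1"))  # ÖÐ  (two accented Latin letters)

# ASCII is the shared core: byte 0x41 is "A" in ASCII, UTF-8, and Latin-1 alike.
assert b"\x41".decode("ascii") == b"\x41".decode("utf-8") == b"\x41".decode("latin-1") == "A"
```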
Common Text Encodings
ASCII (US-ASCII)
The oldest and most universal encoding. It maps 128 characters (0–127): the English alphabet, digits, punctuation, and 33 non-printable control characters. Any text that uses only standard English characters without accents is valid ASCII.
- Range: 7-bit (0x00–0x7F)
- Characters: English letters, digits, basic punctuation
- Limitation: Cannot represent accented characters, non-Latin scripts, or emoji
UTF-8
The dominant encoding on the modern web. UTF-8 can represent every one of Unicode's 1,114,112 code points using a variable-length scheme: ASCII characters (U+0000–U+007F) use 1 byte; extended Latin and common scripts use 2 bytes; most CJK characters use 3 bytes; supplementary characters (emoji, rare CJK) use 4 bytes.
- Range: 1–4 bytes per character
- Compatibility: 100% backward-compatible with ASCII
- Adoption: Over 98% of web pages as of 2024
- BOM: An optional EF BB BF byte sequence can mark UTF-8 files (rarely needed)
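The variable-length scheme is easy to verify with Python's built-in codecs:

```python
# Byte length per character under UTF-8's variable-length scheme.
for ch in ["A", "é", "中", "🙂"]:
    encoded = ch.encode("utf-8")
    print(f"{ch!r}: {len(encoded)} byte(s) -> {encoded.hex(' ')}")
# 'A' is 1 byte, 'é' is 2, '中' is 3, and the emoji is 4.
```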
Latin-1 (ISO-8859-1)
A single-byte encoding covering Western European languages. Bytes 0–127 are identical to ASCII; bytes 128–255 add accented letters, symbols, and special characters used in French, German, Spanish, and other Western European languages.
- Range: 1 byte per character (0x00–0xFF)
- Characters: ASCII + 96 additional Latin characters
- Common mistake: Reading a UTF-8 file as Latin-1 produces the classic Ã©-for-é mojibake pattern
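That mistake can be reproduced in a few lines of Python, and as long as no bytes were lost along the way, it is reversible:

```python
text = "café"
utf8_bytes = text.encode("utf-8")        # 63 61 66 c3 a9
garbled = utf8_bytes.decode("latin-1")   # decoded with the wrong encoding
print(garbled)                           # cafÃ©

# Reversible while no information has been lost:
assert garbled.encode("latin-1").decode("utf-8") == "café"
```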
Windows-1252
Microsoft’s extension of Latin-1 that assigns printable characters to the control-code range 0x80–0x9F (which Latin-1 leaves as control characters). This makes Windows-1252 a strict superset of Latin-1 for printable text. HTTP servers often mislabel Windows-1252 content as Latin-1 because early browsers treated the two as equivalent.
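A quick Python illustration of where the two encodings diverge, using bytes 0x93 and 0x94 (the curly-quote slots in Windows-1252):

```python
data = b"\x93smart quotes\x94"

print(data.decode("cp1252"))          # “smart quotes” (curly quotation marks)
print(repr(data.decode("latin-1")))  # Latin-1 maps 0x93/0x94 to invisible control characters
```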
GBK / GB2312 (Chinese Simplified)
The dominant encoding for Simplified Chinese before UTF-8. GB2312 covers 6,763 Chinese characters plus ASCII; GBK extends it to over 20,000 characters. GB18030 is the current national standard and adds full Unicode coverage.
- Used in: Mainland China Windows systems (pre-UTF-8), legacy databases, email
- BOM: No standard BOM
Big5 (Chinese Traditional)
The standard encoding for Traditional Chinese characters, widely used in Taiwan and Hong Kong. Big5 uses two bytes per Chinese character and is not compatible with GBK.
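Python's codecs make the incompatibility easy to see. The point of this sketch is only that the byte sequences differ, not their specific values:

```python
text = "中文"

gbk = text.encode("gbk")
big5 = text.encode("big5")
print("GBK: ", gbk.hex(" "))
print("Big5:", big5.hex(" "))
assert gbk != big5  # same characters, entirely different byte sequences

# Decoding GBK bytes as Big5 yields valid-looking but wrong characters:
print(gbk.decode("big5", errors="replace"))
```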
Shift-JIS (Japanese)
A variable-length encoding for Japanese text. ASCII characters use 1 byte; Japanese characters (Hiragana, Katakana, Kanji) use 2 bytes. Shift-JIS was the dominant encoding for Japanese on Windows and macOS before UTF-8.
EUC-JP (Japanese)
Extended Unix Code for Japanese. Preferred in Unix and older Linux environments. Like Shift-JIS, it uses 1 byte for ASCII and 2 bytes for Japanese characters, but the byte ranges differ.
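One distinguishing signal, sketched in Python: EUC-JP keeps every byte of a double-byte character in the 0xA1–0xFE range, while Shift-JIS trail bytes can dip below 0x80:

```python
text = "日本語"
sjis = text.encode("shift_jis")
euc = text.encode("euc_jp")

print("Shift-JIS:", sjis.hex(" "))
print("EUC-JP:  ", euc.hex(" "))

# EUC-JP double-byte characters use only bytes in 0xA1-0xFE;
# Shift-JIS trail bytes can fall below 0x80.
assert all(b >= 0xA1 for b in euc)
assert any(b < 0x80 for b in sjis)
```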
Cyrillic Encodings
- Windows-1251: The dominant Cyrillic encoding on Windows systems
- KOI8-R: Designed for Russian; common in Unix/email environments from the 1990s
- ISO-8859-5: The ISO standard Cyrillic encoding, less common in practice
How Encoding Detection Works
Automatic encoding detection (the job tools like chardet perform) is a probabilistic problem. Unlike binary formats such as ZIP, which begin with known magic bytes, plain text carries no mandatory signature. Detection tools use one or more of these strategies:
Byte Order Mark (BOM)
The most reliable indicator. Specific byte sequences at the start of a file unambiguously identify the encoding:
| BOM Bytes | Encoding |
|---|---|
| EF BB BF | UTF-8 |
| FF FE | UTF-16 LE |
| FE FF | UTF-16 BE |
| FF FE 00 00 | UTF-32 LE |
| 00 00 FE FF | UTF-32 BE |
BOM detection works with 100% confidence when the BOM is present. However, most UTF-8 files do not have a BOM — it is optional and often omitted.
Byte Sequence Analysis
For files without a BOM, the detector analyses raw bytes:
- All bytes < 0x80: Almost certainly ASCII or UTF-8 (indistinguishable at byte level)
- Valid UTF-8 multi-byte sequences: Strong indicator of UTF-8
- Invalid UTF-8 sequences with high bytes: Suggests a legacy single-byte encoding
- Many null bytes (0x00): Characteristic of UTF-16
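Those heuristics can be sketched in a few lines of Python (a deliberately minimal version; real detectors such as chardet weigh many more signals):

```python
def guess_encoding(data: bytes) -> str:
    """A minimal heuristic sketch, not a production detector."""
    # Check for NULs first: ASCII-range text encoded as UTF-16 is full of them.
    if data and data.count(0) > len(data) // 4:
        return "utf-16 (probable)"
    if not any(b >= 0x80 for b in data):
        return "ascii"                  # pure 7-bit (also valid UTF-8)
    try:
        data.decode("utf-8")
        return "utf-8"                  # valid multi-byte sequences are a strong signal
    except UnicodeDecodeError:
        return "legacy single-byte (e.g. latin-1, cp1252, or a CJK encoding)"

print(guess_encoding(b"hello"))                   # ascii
print(guess_encoding("café".encode("utf-8")))     # utf-8
print(guess_encoding("café".encode("latin-1")))   # legacy single-byte ...
print(guess_encoding("hi".encode("utf-16-le")))   # utf-16 (probable)
```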
Character Frequency Analysis
Once the encoding is guessed, statistical analysis of character distributions can refine the result. For example, Russian text decoded as Windows-1251 should show character frequencies matching Russian letter usage; unusual frequency patterns suggest the wrong encoding was applied.
Unicode Script Analysis (for already-decoded text)
When text is already in Unicode (as when you paste into this tool), detection is based on which Unicode blocks the characters come from: Cyrillic characters suggest the text was originally in a Cyrillic encoding; CJK characters suggest Chinese, Japanese, or Korean; and so on.
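A rough version of this script bucketing can be built on Python's unicodedata module; the first word of a character's Unicode name is a good-enough proxy for its script in a sketch like this:

```python
import unicodedata
from collections import Counter

def dominant_scripts(text: str) -> Counter:
    """Bucket characters by the first word of their Unicode name
    (a rough proxy for script; unnamed characters count as UNKNOWN)."""
    scripts = Counter()
    for ch in text:
        name = unicodedata.name(ch, "UNKNOWN")
        scripts[name.split(" ")[0]] += 1
    return scripts

print(dominant_scripts("Привет"))     # dominated by CYRILLIC
print(dominant_scripts("こんにちは"))   # dominated by HIRAGANA
```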
How to Use the Encoding Detector
Text detection:
- Paste your text into the input area
- The tool analyses the Unicode codepoints and shows likely encodings with confidence scores
- The hex dump shows the UTF-8 byte representation of your text
File detection:
- Click the file drop zone or drag and drop a text file
- The tool reads the raw bytes, checks for BOM, and analyses byte patterns
- The detected encoding is used to decode the file; the decoded text is shown
Encoding conversion:
- Enter or upload text
- Select a target encoding in the conversion panel
- The tool shows how the text would be represented in that encoding and generates a hex dump
Hex Dump View
The hex dump shows the raw byte representation of your text in the selected encoding:
- Offset column: Hexadecimal byte offset from the start of the data
- Hex column: Byte values in hexadecimal (two digits per byte, grouped in 16 per row)
- ASCII column: Printable ASCII characters shown as-is; non-printable bytes shown as ·
This view is invaluable for debugging encoding problems: if you see multi-byte sequences where you expected single bytes, or unexpected bytes where you expected ASCII, the hex dump makes the mismatch visible.
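The layout described above can be reproduced in a short Python function (a simplified sketch; the tool's own renderer may differ in details):

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Printable ASCII as-is, everything else as a middle dot.
        ascii_part = "".join(chr(b) if 0x20 <= b < 0x7F else "·" for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{width * 3 - 1}}  {ascii_part}")
    return "\n".join(lines)

print(hex_dump("café".encode("utf-8")))
```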
Diagnosing Mojibake
Common mojibake patterns and their causes:
| Symptom | Cause | Fix |
|---|---|---|
| Ã© instead of é | UTF-8 file read as Latin-1 | Re-open the file as UTF-8 |
| � or dropped characters instead of é | Latin-1 file read as UTF-8 | Re-open the file as Latin-1 / Windows-1252 |
| 锟斤拷 (repeated) | UTF-8 replacement characters (EF BF BD) decoded as GBK | Recover from the original source; the double conversion is lossy |
| Random symbols in a Chinese doc | File encoded as GBK, opened as UTF-8 | Re-open as GBK/GB18030 |
| ÏÐá instead of ΟΠα | ISO-8859-7 Greek file read as Latin-1 | Re-open as ISO-8859-7 |
| Black diamonds with ? (�) | Undecodable bytes replaced with the U+FFFD replacement character | Find and set the correct legacy encoding |
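When the damage is a single wrong decode and no bytes were lost, the first two rows of the table can be undone programmatically. fix_mojibake below is a hypothetical helper name, sketched in Python:

```python
def fix_mojibake(garbled: str, wrong: str = "latin-1", right: str = "utf-8") -> str:
    """Undo one wrong decode: re-encode with the decoder that was actually
    (incorrectly) applied, then decode with the intended encoding.
    Raises if information was lost (e.g. bytes already replaced with U+FFFD)."""
    return garbled.encode(wrong).decode(right)

print(fix_mojibake("cafÃ©"))  # café
```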
Privacy
This tool runs entirely in your browser. Text you paste and files you upload are never transmitted to any server and are processed exclusively by your device’s JavaScript engine. No data leaves your computer.