Encoding Detector & Converter
Detect UTF-8, ASCII, Latin-1, GBK, Shift-JIS and more — paste text or upload a file, view confidence scores and hex dump, convert between encodings — all in your browser
Drop a file here or click to browse
Any text or binary file — bytes are read directly
Paste text or upload a file to detect its encoding
A vendor sends you a CSV export and half the product names show Ã© instead of é. Your code reads the file as UTF-8 but it is actually Latin-1. You try iconv with five different encoding guesses before one works. Or worse: a legacy database dump has mixed encodings, with some rows in GBK Chinese, others in Shift-JIS Japanese, and an ASCII header. You need to detect the encoding before you can convert it.
Why This Tool (Not iconv or chardet)
iconv converts but doesn’t detect — you have to guess the source encoding. Python’s chardet requires a script. This tool detects the encoding with confidence scores, shows a hex dump so you can see the raw bytes, and lets you convert between encodings — all in your browser. Drop a file or paste text and get the answer instantly. No data is sent anywhere.
What Is Text Encoding?
Every file on a computer is ultimately a sequence of bytes — raw numbers from 0 to 255. A character encoding is the mapping that tells software how to interpret those bytes as human-readable text. The letter “A” is byte 0x41 in ASCII, UTF-8, and Latin-1, but the same byte can represent a completely different character in another encoding like GBK or Shift-JIS.
When a program reads a text file with the wrong encoding, the result is a corruption pattern developers know as mojibake (文字化け): strings of garbled symbols like Ã© instead of é, or Î±Î²Î³ instead of Greek αβγ.
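The ambiguity is easy to demonstrate. This Python sketch decodes the same two bytes under different encodings (GBK and Latin-1 chosen for illustration):

```python
# The same bytes, decoded two ways: the bytes never change, only the mapping does.
raw = bytes([0xD6, 0xD0])

print(raw.decode("gbk"))      # 中  (one Chinese character)
print(raw.decode("latin-1"))  # ÖÐ  (two accented Latin letters)

# ASCII is the shared core: byte 0x41 is "A" in ASCII, UTF-8, and Latin-1 alike.
assert b"\x41".decode("ascii") == b"\x41".decode("utf-8") == b"\x41".decode("latin-1") == "A"
```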
Common Text Encodings
ASCII (US-ASCII)
The oldest and most universal encoding. It maps 128 characters (0–127): the English alphabet, digits, punctuation, and 33 non-printable control characters. Any text that uses only standard English characters without accents is valid ASCII.
- Range: 7-bit (0x00–0x7F)
- Characters: English letters, digits, basic punctuation
- Limitation: Cannot represent accented characters, non-Latin scripts, or emoji
UTF-8
The dominant encoding on the modern web. UTF-8 can represent every one of Unicode's 1,114,112 code points using a variable-length scheme: ASCII characters (U+0000–U+007F) use 1 byte; extended Latin and common scripts use 2 bytes; most CJK characters use 3 bytes; supplementary characters (emoji, rare CJK) use 4 bytes.
- Range: 1–4 bytes per character
- Compatibility: 100% backward-compatible with ASCII
- Adoption: Over 98% of web pages as of 2024
- BOM: An optional EF BB BF byte sequence can mark UTF-8 files (rarely needed)
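The variable-length scheme is easy to verify with Python's built-in codecs:

```python
# Byte length per character under UTF-8's variable-length scheme.
for ch in ["A", "é", "中", "🙂"]:
    encoded = ch.encode("utf-8")
    print(f"{ch!r}: {len(encoded)} byte(s) -> {encoded.hex(' ')}")
# 'A' is 1 byte, 'é' is 2, '中' is 3, and the emoji is 4.
```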
Latin-1 (ISO-8859-1)
A single-byte encoding covering Western European languages. Bytes 0–127 are identical to ASCII; bytes 128–255 add accented letters, symbols, and special characters used in French, German, Spanish, and other Western European languages.
- Range: 1 byte per character (0x00–0xFF)
- Characters: ASCII + 96 additional Latin characters
- Common mistake: Reading a UTF-8 file as Latin-1 produces the classic Ã©-for-é mojibake pattern
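That mistake can be reproduced in a few lines of Python, and as long as no bytes were lost along the way, it is reversible:

```python
text = "café"
utf8_bytes = text.encode("utf-8")        # 63 61 66 c3 a9
garbled = utf8_bytes.decode("latin-1")   # decoded with the wrong encoding
print(garbled)                           # cafÃ©

# Reversible while no information has been lost:
assert garbled.encode("latin-1").decode("utf-8") == "café"
```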
Windows-1252
Microsoft’s extension of Latin-1 that assigns printable characters to the control-code range 0x80–0x9F (which Latin-1 leaves as control characters). This makes Windows-1252 a strict superset of Latin-1 for printable text. HTTP servers often mislabel Windows-1252 content as Latin-1 because early browsers treated the two as equivalent.
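A quick Python illustration of where the two encodings diverge, using bytes 0x93 and 0x94 (the curly-quote slots in Windows-1252):

```python
data = b"\x93smart quotes\x94"

print(data.decode("cp1252"))          # “smart quotes” (curly quotation marks)
print(repr(data.decode("latin-1")))  # Latin-1 maps 0x93/0x94 to invisible control characters
```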
GBK / GB2312 (Chinese Simplified)
The dominant encoding for Simplified Chinese before UTF-8. GB2312 covers 6,763 Chinese characters plus ASCII; GBK extends it to over 20,000 characters. GB18030 is the current national standard and adds full Unicode coverage.
- Used in: Mainland China Windows systems (pre-UTF-8), legacy databases, email
- BOM: No standard BOM
Big5 (Chinese Traditional)
The standard encoding for Traditional Chinese characters, widely used in Taiwan and Hong Kong. Big5 uses two bytes per Chinese character and is not compatible with GBK.
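Python's codecs make the incompatibility easy to see. The point of this sketch is only that the byte sequences differ, not their specific values:

```python
text = "中文"

gbk = text.encode("gbk")
big5 = text.encode("big5")
print("GBK: ", gbk.hex(" "))
print("Big5:", big5.hex(" "))
assert gbk != big5  # same characters, entirely different byte sequences

# Decoding GBK bytes as Big5 yields valid-looking but wrong characters:
print(gbk.decode("big5", errors="replace"))
```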
Shift-JIS (Japanese)
A variable-length encoding for Japanese text. ASCII characters use 1 byte; Japanese characters (Hiragana, Katakana, Kanji) use 2 bytes. Shift-JIS was the dominant encoding for Japanese on Windows and macOS before UTF-8.
EUC-JP (Japanese)
Extended Unix Code for Japanese. Preferred in Unix and older Linux environments. Like Shift-JIS, it uses 1 byte for ASCII and 2 bytes for Japanese characters, but the byte ranges differ.
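One distinguishing signal, sketched in Python: EUC-JP keeps every byte of a double-byte character in the 0xA1–0xFE range, while Shift-JIS trail bytes can dip below 0x80:

```python
text = "日本語"
sjis = text.encode("shift_jis")
euc = text.encode("euc_jp")

print("Shift-JIS:", sjis.hex(" "))
print("EUC-JP:  ", euc.hex(" "))

# EUC-JP double-byte characters use only bytes in 0xA1-0xFE;
# Shift-JIS trail bytes can fall below 0x80.
assert all(b >= 0xA1 for b in euc)
assert any(b < 0x80 for b in sjis)
```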
Cyrillic Encodings
- Windows-1251: The dominant Cyrillic encoding on Windows systems
- KOI8-R: Designed for Russian; common in Unix/email environments from the 1990s
- ISO-8859-5: The ISO standard Cyrillic encoding, less common in practice
How Encoding Detection Works
Automatic encoding detection (the job tools like chardet perform) is a probabilistic problem. Unlike binary formats such as ZIP, which begin with known magic bytes, plain text carries no mandatory signature. Detection tools use one or more of these strategies:
Byte Order Mark (BOM)
The most reliable indicator. Specific byte sequences at the start of a file unambiguously identify the encoding:
| BOM Bytes | Encoding |
|---|---|
| EF BB BF | UTF-8 |
| FF FE | UTF-16 LE |
| FE FF | UTF-16 BE |
| FF FE 00 00 | UTF-32 LE |
| 00 00 FE FF | UTF-32 BE |
BOM detection works with 100% confidence when the BOM is present. However, most UTF-8 files do not have a BOM — it is optional and often omitted.
Byte Sequence Analysis
For files without a BOM, the detector analyses raw bytes:
- All bytes < 0x80: Almost certainly ASCII or UTF-8 (indistinguishable at byte level)
- Valid UTF-8 multi-byte sequences: Strong indicator of UTF-8
- Invalid UTF-8 sequences with high bytes: Suggests a legacy single-byte encoding
- Many null bytes (0x00): Characteristic of UTF-16
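Those heuristics can be sketched in a few lines of Python (a deliberately minimal version; real detectors such as chardet weigh many more signals):

```python
def guess_encoding(data: bytes) -> str:
    """A minimal heuristic sketch, not a production detector."""
    # Check for NULs first: ASCII-range text encoded as UTF-16 is full of them.
    if data and data.count(0) > len(data) // 4:
        return "utf-16 (probable)"
    if not any(b >= 0x80 for b in data):
        return "ascii"                  # pure 7-bit (also valid UTF-8)
    try:
        data.decode("utf-8")
        return "utf-8"                  # valid multi-byte sequences are a strong signal
    except UnicodeDecodeError:
        return "legacy single-byte (e.g. latin-1, cp1252, or a CJK encoding)"

print(guess_encoding(b"hello"))                   # ascii
print(guess_encoding("café".encode("utf-8")))     # utf-8
print(guess_encoding("café".encode("latin-1")))   # legacy single-byte ...
print(guess_encoding("hi".encode("utf-16-le")))   # utf-16 (probable)
```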
Character Frequency Analysis
Once the encoding is guessed, statistical analysis of character distributions can refine the result. For example, Russian text decoded as Windows-1251 should show character frequencies matching Russian letter usage; unusual frequency patterns suggest the wrong encoding was applied.
Unicode Script Analysis (for already-decoded text)
When text is already in Unicode (as when you paste into this tool), detection is based on which Unicode blocks the characters come from: Cyrillic characters suggest the text was originally in a Cyrillic encoding; CJK characters suggest Chinese, Japanese, or Korean; and so on.
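A rough version of this script bucketing can be built on Python's unicodedata module; the first word of a character's Unicode name is a good-enough proxy for its script in a sketch like this:

```python
import unicodedata
from collections import Counter

def dominant_scripts(text: str) -> Counter:
    """Bucket characters by the first word of their Unicode name
    (a rough proxy for script; unnamed characters count as UNKNOWN)."""
    scripts = Counter()
    for ch in text:
        name = unicodedata.name(ch, "UNKNOWN")
        scripts[name.split(" ")[0]] += 1
    return scripts

print(dominant_scripts("Привет"))     # dominated by CYRILLIC
print(dominant_scripts("こんにちは"))   # dominated by HIRAGANA
```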
How to Use the Encoding Detector
Text detection:
- Paste your text into the input area
- The tool analyses the Unicode codepoints and shows likely encodings with confidence scores
- The hex dump shows the UTF-8 byte representation of your text
File detection:
- Click the file drop zone or drag and drop a text file
- The tool reads the raw bytes, checks for BOM, and analyses byte patterns
- The detected encoding is used to decode the file; the decoded text is shown
Encoding conversion:
- Enter or upload text
- Select a target encoding in the conversion panel
- The tool shows how the text would be represented in that encoding and generates a hex dump
Hex Dump View
The hex dump shows the raw byte representation of your text in the selected encoding:
- Offset column: Hexadecimal byte offset from the start of the data
- Hex column: Byte values in hexadecimal (two digits per byte, grouped in 16 per row)
- ASCII column: Printable ASCII characters shown as-is; non-printable bytes shown as ·
This view is invaluable for debugging encoding problems: if you see multi-byte sequences where you expected single bytes, or unexpected bytes where you expected ASCII, the hex dump makes the mismatch visible.
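The layout described above can be reproduced in a short Python function (a simplified sketch; the tool's own renderer may differ in details):

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Printable ASCII as-is, everything else as a middle dot.
        ascii_part = "".join(chr(b) if 0x20 <= b < 0x7F else "·" for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{width * 3 - 1}}  {ascii_part}")
    return "\n".join(lines)

print(hex_dump("café".encode("utf-8")))
```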
Diagnosing Mojibake
Common mojibake patterns and their causes:
| Symptom | Cause | Fix |
|---|---|---|
| Ã© instead of é | UTF-8 file read as Latin-1 | Re-open the file as UTF-8 |
| � or dropped characters instead of é | Latin-1 file read as UTF-8 | Re-open the file as Latin-1 / Windows-1252 |
| 锟斤拷 (repeated) | UTF-8 replacement characters (EF BF BD) decoded as GBK | Recover from the original source; the double conversion is lossy |
| Random symbols in a Chinese doc | File encoded as GBK, opened as UTF-8 | Re-open as GBK/GB18030 |
| ÏÐá instead of ΟΠα | ISO-8859-7 Greek file read as Latin-1 | Re-open as ISO-8859-7 |
| Black diamonds with ? (�) | Undecodable bytes replaced with the U+FFFD replacement character | Find and set the correct legacy encoding |
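When the damage is a single wrong decode and no bytes were lost, the first two rows of the table can be undone programmatically. fix_mojibake below is a hypothetical helper name, sketched in Python:

```python
def fix_mojibake(garbled: str, wrong: str = "latin-1", right: str = "utf-8") -> str:
    """Undo one wrong decode: re-encode with the decoder that was actually
    (incorrectly) applied, then decode with the intended encoding.
    Raises if information was lost (e.g. bytes already replaced with U+FFFD)."""
    return garbled.encode(wrong).decode(right)

print(fix_mojibake("cafÃ©"))  # café
```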
Privacy
This tool runs entirely in your browser. Text you paste and files you upload are never transmitted to any server and are processed exclusively by your device’s JavaScript engine. No data leaves your computer.