PureDevTools

Encoding Detector & Converter

Detect UTF-8, ASCII, Latin-1, GBK, Shift-JIS and more — paste text or upload a file, view confidence scores and hex dump, convert between encodings — all in your browser

All processing happens in your browser. No data is sent to any server.

A vendor sends you a CSV export and half the product names show � instead of é. Your code reads it as UTF-8 but the file is actually Latin-1. You try iconv with five different encoding guesses before one works. Or worse: a legacy database dump has mixed encodings — some rows are GBK Chinese, others are Shift-JIS Japanese, and the header is ASCII. You need to detect the encoding before you can convert it.

Why This Tool (Not iconv or chardet)

iconv converts but doesn’t detect — you have to guess the source encoding. Python’s chardet requires a script. This tool detects the encoding with confidence scores, shows a hex dump so you can see the raw bytes, and lets you convert between encodings — all in your browser. Drop a file or paste text and get the answer instantly. No data is sent anywhere.

What Is Text Encoding?

Every file on a computer is ultimately a sequence of bytes — raw numbers from 0 to 255. A character encoding is the mapping that tells software how to interpret those bytes as human-readable text. The letter “A” is byte 0x41 in ASCII, UTF-8, and Latin-1, but the same byte can represent a completely different character in another encoding like GBK or Shift-JIS.

When a program reads a text file with the wrong encoding, the result is a corruption pattern developers know as mojibake (文字化け): strings of garbled symbols like Ã© instead of é, or Î± instead of Greek α.
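The mismatch is easy to reproduce. This short sketch (plain Python, not this tool's code) interprets the same bytes under two different encodings:

```python
# The same bytes mean different things under different encodings.
utf8_bytes = "é".encode("utf-8")          # b'\xc3\xa9'

print(utf8_bytes.decode("utf-8"))         # é   (correct interpretation)
print(utf8_bytes.decode("latin-1"))       # Ã©  (classic mojibake)

# A Chinese greeting in GBK looks like Western accented letters in Latin-1:
gbk_bytes = "你好".encode("gbk")
print(gbk_bytes.decode("latin-1"))        # ÄãºÃ
```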

Common Text Encodings

ASCII (US-ASCII)

The oldest and most universal encoding. It maps 128 characters (0–127): the English alphabet, digits, punctuation, and 33 non-printable control characters. Any text that uses only standard English characters without accents is valid ASCII.

UTF-8

The dominant encoding on the modern web. UTF-8 encodes all 1.1 million Unicode codepoints using a variable-length scheme: ASCII characters (U+0000–U+007F) use 1 byte; extended Latin and common scripts use 2 bytes; most CJK characters use 3 bytes; supplementary characters (emoji, rare CJK) use 4 bytes.
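The four length classes are easy to verify in any language with a UTF-8 encoder; here in plain Python:

```python
# One character from each UTF-8 length class:
for ch in "Aé中😀":
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")
# U+0041 -> 1 byte(s): 41
# U+00E9 -> 2 byte(s): c3 a9
# U+4E2D -> 3 byte(s): e4 b8 ad
# U+1F600 -> 4 byte(s): f0 9f 98 80
```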

Latin-1 (ISO-8859-1)

A single-byte encoding covering Western European languages. Bytes 0–127 are identical to ASCII; bytes 128–255 add accented letters, symbols, and special characters used in French, German, Spanish, and other Western European languages.

Windows-1252

Microsoft’s extension of Latin-1 that assigns printable characters to the control-code range 0x80–0x9F (which Latin-1 leaves as control characters). This makes Windows-1252 a strict superset of Latin-1 for printable text. HTTP servers often mislabel Windows-1252 content as Latin-1 because early browsers treated the two as equivalent.
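The difference is visible in the 0x80–0x9F range. A quick check in Python (illustrative, not this tool's code):

```python
data = b"\x93smart quotes\x94"            # curly quotes in Windows-1252

print(data.decode("windows-1252"))        # “smart quotes”
print(repr(data.decode("latin-1")))       # '\x93smart quotes\x94' -- invisible C1 controls
```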

GBK / GB2312 (Chinese Simplified)

The dominant encoding for Simplified Chinese before UTF-8. GB2312 covers 6,763 Chinese characters plus ASCII; GBK extends it to over 20,000 characters. GB18030 is the current national standard and adds full Unicode coverage.
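A quick way to see the density difference between GBK and UTF-8 for Chinese text (plain Python, illustrative only):

```python
text = "中文编码"                    # four common Simplified Chinese characters
gbk, utf8 = text.encode("gbk"), text.encode("utf-8")
print(len(gbk), len(utf8))          # 8 12 -- GBK: 2 bytes/char, UTF-8: 3 bytes/char
assert gbk.decode("gbk") == utf8.decode("utf-8") == text
```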

Big5 (Chinese Traditional)

The standard encoding for Traditional Chinese characters, widely used in Taiwan and Hong Kong. Big5 uses two bytes per Chinese character and is not compatible with GBK.

Shift-JIS (Japanese)

A variable-length encoding for Japanese text. ASCII characters use 1 byte; Japanese characters (Hiragana, Katakana, Kanji) use 2 bytes. Shift-JIS was the dominant encoding for Japanese on Windows and macOS before UTF-8.

EUC-JP (Japanese)

Extended Unix Code for Japanese. Preferred in Unix and older Linux environments. Like Shift-JIS, it uses 1 byte for ASCII and 2 bytes for Japanese characters, but the byte ranges differ.
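The differing byte ranges are why a Japanese file can decode as garbage under the "other" Japanese encoding. A small comparison in Python:

```python
text = "日本語"
sjis, eucjp = text.encode("shift_jis"), text.encode("euc_jp")
print("Shift-JIS:", sjis.hex(" "))   # 93 fa 96 7b 8c ea
print("EUC-JP:   ", eucjp.hex(" "))  # c6 fc cb dc b8 ec

# Same character count, same byte count, completely different byte values:
assert len(sjis) == len(eucjp) == 6 and sjis != eucjp
```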

Cyrillic Encodings

Several incompatible single-byte encodings cover Cyrillic script: KOI8-R (traditional on Unix systems), Windows-1251 (dominant on Windows and still common on the Russian-language web), and ISO-8859-5 (the ISO standard, less used in practice). All three map bytes 128–255 to Cyrillic letters, but in different orders, so text decoded with the wrong one is still Cyrillic — just scrambled.

How Encoding Detection Works

Automatic encoding detection (the problem tools like chardet solve) is probabilistic. Unlike structured formats such as ZIP files, which start with magic bytes, plain text has no mandatory signature. Detection tools use one or more of these strategies:

Byte Order Mark (BOM)

The most reliable indicator. Specific byte sequences at the start of a file unambiguously identify the encoding:

| BOM Bytes | Encoding |
|---|---|
| EF BB BF | UTF-8 |
| FF FE | UTF-16 LE |
| FE FF | UTF-16 BE |
| FF FE 00 00 | UTF-32 LE |
| 00 00 FE FF | UTF-32 BE |

BOM detection works with 100% confidence when the BOM is present. However, most UTF-8 files do not have a BOM — it is optional and often omitted.
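BOM sniffing is a few lines of code. One subtlety: the UTF-32 LE signature (FF FE 00 00) begins with the UTF-16 LE signature (FF FE), so the longer one must be tested first. A minimal sketch in Python:

```python
import codecs

# Order matters: test the longer UTF-32 signatures before UTF-16.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),        # EF BB BF
    (codecs.BOM_UTF32_LE, "utf-32-le"),    # FF FE 00 00
    (codecs.BOM_UTF32_BE, "utf-32-be"),    # 00 00 FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),    # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),    # FE FF
]

def sniff_bom(data: bytes):
    """Return the encoding implied by a BOM, or None if no BOM is present."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

print(sniff_bom(b"\xef\xbb\xbfhello"))   # utf-8-sig
print(sniff_bom(b"plain ascii"))         # None
```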

Byte Sequence Analysis

For files without a BOM, the detector analyses raw bytes: if every byte is below 0x80, the data is plain ASCII; if the data decodes cleanly as UTF-8, it almost certainly is UTF-8, because UTF-8's strict lead-byte/continuation-byte structure is rarely satisfied by accident; otherwise, the characteristic lead-byte ranges of legacy double-byte encodings (roughly 0x81–0xFE for GBK, or 0x81–0x9F and 0xE0–0xEF for Shift-JIS) narrow down the candidates.

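The first two checks are cheap to implement. A rough sketch in Python (the decision labels are my own, not this tool's output):

```python
def classify_no_bom(data: bytes) -> str:
    """Very rough byte-pattern classification for BOM-less data (sketch only)."""
    if all(b < 0x80 for b in data):
        return "ascii"
    try:
        data.decode("utf-8")       # strict structure: rarely passes by accident
        return "utf-8"
    except UnicodeDecodeError:
        return "legacy encoding (needs heuristics)"

print(classify_no_bom(b"hello"))                    # ascii
print(classify_no_bom("héllo".encode("utf-8")))     # utf-8
print(classify_no_bom("héllo".encode("latin-1")))   # legacy encoding (needs heuristics)
```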
Character Frequency Analysis

Once the encoding is guessed, statistical analysis of character distributions can refine the result. For example, Russian text decoded as Windows-1251 should show character frequencies matching Russian letter usage; unusual frequency patterns suggest the wrong encoding was applied.
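A toy version of frequency scoring for Cyrillic candidates might look like this (plain Python; the frequency values are rough illustrative approximations of Russian letter usage, not a calibrated model):

```python
# Approximate frequencies of common Russian letters (illustrative values).
RU_FREQ = {
    "о": 0.11, "е": 0.085, "а": 0.080, "и": 0.074, "н": 0.067,
    "т": 0.063, "с": 0.055, "р": 0.047, "в": 0.045, "л": 0.044,
    "к": 0.035, "м": 0.032, "д": 0.030, "п": 0.028, "у": 0.026,
}

def best_cyrillic_encoding(data, candidates=("koi8_r", "cp1251", "iso8859_5")):
    """Pick the candidate whose decoded text looks most like Russian."""
    def score(enc):
        text = data.decode(enc, errors="replace").lower()
        return sum(RU_FREQ.get(ch, 0.0) for ch in text)
    return max(candidates, key=score)

data = "привет как дела".encode("koi8_r")
print(best_cyrillic_encoding(data))   # koi8_r
```

All three candidates decode the bytes to *some* Cyrillic letters, so counting Cyrillic characters alone cannot distinguish them; only the wrong decodings produce statistically unlikely letters.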

Unicode Script Analysis (for already-decoded text)

When text is already in Unicode (as when you paste into this tool), detection is based on which Unicode blocks the characters come from: Cyrillic characters suggest the text was originally in a Cyrillic encoding; CJK characters suggest Chinese, Japanese, or Korean; and so on.
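Script detection on decoded text reduces to counting which blocks the code points fall in. A rough sketch (the block ranges are a simplified subset of Unicode's actual block list):

```python
def dominant_script(text: str) -> str:
    """Classify text by which Unicode block most characters fall in (sketch)."""
    ranges = {
        "Latin":    [(0x0041, 0x024F)],
        "Greek":    [(0x0370, 0x03FF)],
        "Cyrillic": [(0x0400, 0x04FF)],
        "CJK":      [(0x3040, 0x30FF), (0x4E00, 0x9FFF)],  # kana + unified ideographs
    }
    counts = {name: 0 for name in ranges}
    for ch in text:
        cp = ord(ch)
        for name, blocks in ranges.items():
            if any(lo <= cp <= hi for lo, hi in blocks):
                counts[name] += 1
    return max(counts, key=counts.get)

print(dominant_script("Привет, мир"))     # Cyrillic
print(dominant_script("こんにちは世界"))    # CJK
```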

How to Use the Encoding Detector

Text detection:

  1. Paste your text into the input area
  2. The tool analyses the Unicode codepoints and shows likely encodings with confidence scores
  3. The hex dump shows the UTF-8 byte representation of your text

File detection:

  1. Click the file drop zone or drag and drop a text file
  2. The tool reads the raw bytes, checks for BOM, and analyses byte patterns
  3. The detected encoding is used to decode the file; the decoded text is shown

Encoding conversion:

  1. Enter or upload text
  2. Select a target encoding in the conversion panel
  3. The tool shows how the text would be represented in that encoding and generates a hex dump
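Conversion itself is a two-step pipeline: decode bytes with the detected source encoding, then encode the resulting text with the target. In Python terms (illustrative, not this tool's code):

```python
# Converting Latin-1 data to UTF-8: decode with the *detected* source
# encoding, then encode with the target.
raw = b"caf\xe9 cr\xeape"               # Latin-1 bytes for "café crêpe"
text = raw.decode("latin-1")            # bytes -> str
utf8 = text.encode("utf-8")             # str -> bytes in the target encoding
print(utf8)                             # b'caf\xc3\xa9 cr\xc3\xaape'
```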

Hex Dump View

The hex dump shows the raw byte representation of your text in the selected encoding.

This view is invaluable for debugging encoding problems: if you see multi-byte sequences where you expected single bytes, or unexpected bytes where you expected ASCII, the hex dump makes the mismatch visible.
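The classic offset / hex / ASCII layout is a few lines of code. A minimal sketch in Python of the kind of dump this view displays:

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Classic offset / hex / ASCII dump."""
    lines = []
    for off in range(0, len(data), width):
        chunk = data[off:off + width]
        hexes = " ".join(f"{b:02x}" for b in chunk)
        # Printable ASCII is shown as-is; everything else becomes a dot.
        chars = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{off:08x}  {hexes:<{width * 3}} |{chars}|")
    return "\n".join(lines)

print(hex_dump("héllo".encode("utf-8")))
```

Note how the two-byte UTF-8 sequence for é (c3 a9) shows up as two dots in the ASCII column — exactly the "multi-byte where you expected single bytes" signal described above.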

Diagnosing Mojibake

Common mojibake patterns and their causes:

| Symptom | Cause | Fix |
|---|---|---|
| Ã© instead of é | UTF-8 file read as Latin-1 | Re-open the file as UTF-8 |
| � instead of é | Latin-1 file read as UTF-8 | Re-open the file as Latin-1 / Windows-1252 |
| 锟斤拷 (repeated) | Decode errors produced U+FFFD replacement characters (UTF-8 bytes EF BF BD), which were then read as GBK | Recover from the original source; the double conversion is irreversible |
| Random symbols in a Chinese document | File encoded as GBK, opened as UTF-8 | Re-open as GBK/GB18030 |
| ÏÐá instead of ΟΠα | Greek ISO-8859-7 file read as Latin-1 | Re-open as ISO-8859-7 |
| Black diamonds with ? (�) | Undecodable bytes replaced with U+FFFD | Find and set the correct legacy encoding |
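When mojibake came from a single wrong decode, it can often be reversed by undoing that step. A Python sketch of the round-trip repair:

```python
# UTF-8 bytes were wrongly decoded as Latin-1; re-encode as Latin-1 to
# recover the original bytes, then decode them correctly as UTF-8.
garbled = "CafÃ© crÃªpe"
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)   # Café crêpe

# 锟斤拷-style damage, by contrast, is permanent: the original bytes were
# replaced with U+FFFD before re-encoding, so the information is gone.
```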

Privacy

This tool runs entirely in your browser. Text you paste and files you upload are never transmitted to any server and are processed exclusively by your device’s JavaScript engine. No data leaves your computer.
