UTF-8 Encoder / Decoder
Inspect UTF-8 byte sequences — hex, decimal, binary per character
Byte Format
Type or paste text above to see its UTF-8 byte representation. Supports multibyte characters including emoji, CJK, and special symbols.
You’re debugging a web scraper and the response body contains Ã© instead of é. Something went wrong with the character encoding — but where? Is the server sending UTF-8 bytes? Is your code decoding them as Latin-1? To diagnose the problem, you need to see the actual byte values behind each character. Or you’re implementing a protocol parser and need to know that € is three bytes (E2 82 AC) in UTF-8 but two bytes (20 AC) in UTF-16. Or you’re writing a length validator and realized that string.length in JavaScript counts UTF-16 code units, not bytes — and you need the actual byte count for a database VARCHAR column.
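The character-count vs byte-count distinction is easy to see directly. A minimal Python sketch (the string here is just an illustrative example):

```python
# Character count vs UTF-8 byte count: a byte-measured VARCHAR(n)
# needs the encoded length, not the character count.
text = "café €"
print(len(text))                   # 6 characters (code points)
print(len(text.encode("utf-8")))  # 9 bytes: é is 2 bytes, € is 3
```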
Why This Tool (Not a Hex Editor or Python REPL)
Hex editors show raw bytes but don’t map them back to individual characters. Running "café".encode('utf-8') in Python gives you the full byte string but doesn’t break it down character by character. This tool shows the UTF-8 byte representation of each character — hex, decimal, and binary — side by side. You can instantly see which characters are single-byte ASCII and which expand to two, three, or four bytes.
Everything runs in your browser using the TextEncoder API. No data leaves your machine.
What Is UTF-8?
UTF-8 (Unicode Transformation Format — 8-bit) is a variable-width character encoding that can represent every character in the Unicode standard. Designed by Ken Thompson and Rob Pike in 1993, it has become the dominant encoding on the web — over 98% of all web pages use UTF-8 as of 2024.
UTF-8 encodes each Unicode code point using one to four bytes:
| Byte Count | Code Point Range | Bit Pattern | Examples |
|---|---|---|---|
| 1 byte | U+0000 – U+007F | 0xxxxxxx | ASCII: A (41), z (7A), 0 (30) |
| 2 bytes | U+0080 – U+07FF | 110xxxxx 10xxxxxx | é (C3 A9), ñ (C3 B1), ü (C3 BC) |
| 3 bytes | U+0800 – U+FFFF | 1110xxxx 10xxxxxx 10xxxxxx | 中 (E4 B8 AD), € (E2 82 AC) |
| 4 bytes | U+10000 – U+10FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | 😀 (F0 9F 98 80), 𝕏 (F0 9D 95 8F) |
The key design insight of UTF-8 is backward compatibility with ASCII: any valid ASCII text is also valid UTF-8, byte for byte. This made adoption seamless across existing systems.
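The table and the ASCII-compatibility claim can both be checked in a few lines of Python (the sample characters are taken from the table above):

```python
# One character from each row of the table: byte count and hex bytes.
for ch in "Aé中😀":
    b = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(b)} byte(s): {b.hex(' ').upper()}")

# ASCII compatibility: pure-ASCII text encodes byte-for-byte to its
# ASCII code values.
ascii_text = "hello"
assert list(ascii_text.encode("utf-8")) == [ord(c) for c in ascii_text]
```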
How UTF-8 Encoding Works
The encoding algorithm converts a Unicode code point to a specific byte sequence based on the code point’s numeric value:
- Determine how many bytes are needed based on the code point range
- Fill the leading byte’s prefix bits (0, 110, 1110, or 11110) to indicate the byte count
- Fill continuation bytes with the 10 prefix
- Distribute the code point’s bits across the remaining positions
# Example: encoding '€' (U+20AC)
# U+20AC = 0010 0000 1010 1100 (binary)
# Needs 3 bytes (range U+0800–U+FFFF)
# Pattern: 1110xxxx 10xxxxxx 10xxxxxx
# Fill: 11100010 10000010 10101100
# Hex: E2 82 AC
text = "€"
encoded = text.encode('utf-8')
print(' '.join(f'{b:02X}' for b in encoded))
# E2 82 AC
// In the browser
const bytes = new TextEncoder().encode("€");
console.log([...bytes].map(b => b.toString(16).toUpperCase().padStart(2, "0")));
// ["E2", "82", "AC"]
Common Encoding Problems and How to Diagnose Them
Mojibake (Ã© instead of é): UTF-8 bytes are being decoded as Latin-1 (ISO-8859-1). The two UTF-8 bytes for é — C3 A9 — are interpreted as two separate Latin-1 characters: Ã (C3) and © (A9). Fix: ensure both the sender and receiver agree on UTF-8.
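This failure is easy to reproduce deliberately:

```python
# Mojibake: valid UTF-8 bytes decoded with the wrong codec.
utf8_bytes = "é".encode("utf-8")       # b'\xc3\xa9'
print(utf8_bytes.decode("latin-1"))    # Ã© — two Latin-1 characters
```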
Replacement characters (�): The decoder encountered an invalid byte sequence. Common causes: the data is not actually UTF-8 (it might be GBK or Shift-JIS), or the byte stream was truncated mid-character.
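Truncation mid-character can also be simulated, using the three-byte € from earlier as the example:

```python
# Cutting a multibyte sequence short yields U+FFFD on lenient decode.
truncated = "€".encode("utf-8")[:2]   # E2 82, missing the final AC
print(truncated.decode("utf-8", errors="replace"))  # �
```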
Double encoding (é becoming Ã©): the text was UTF-8 encoded, mistakenly decoded as Latin-1, then UTF-8 encoded again, producing four bytes (C3 83 C2 A9) where two belong. The fix is to encode once and track the encoding through every layer of your pipeline.
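A sketch of the round trip, including the manual repair (which assumes the intermediate mis-decode was Latin-1 — other codecs need other repairs):

```python
# Reproduce double encoding, then undo it.
mangled = "é".encode("utf-8").decode("latin-1")  # 'Ã©' (mojibake)
double = mangled.encode("utf-8")                 # b'\xc3\x83\xc2\xa9'
print(double.decode("utf-8"))                    # still shows 'Ã©'

repaired = mangled.encode("latin-1").decode("utf-8")
print(repaired)                                  # 'é'
```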
Use this tool to inspect the exact bytes: if é shows C3 A9, it is correctly UTF-8 encoded. If it shows E9 alone, it is Latin-1 encoded.
UTF-8 vs Other Encodings
UTF-16 uses 2 or 4 bytes per character. It is the internal encoding of JavaScript strings, Java char, and Windows APIs. For mostly-ASCII text, UTF-16 wastes space (2 bytes per character vs 1).
Latin-1 (ISO-8859-1) uses exactly 1 byte per character but can only represent 256 characters — no Chinese, Japanese, Arabic, or emoji. UTF-8 replaced it as the web standard.
UTF-32 uses exactly 4 bytes per character. Simple but wasteful — rarely used for storage or transmission.
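The size trade-offs are concrete. A quick comparison in Python, using the little-endian, BOM-free variants so the byte counts are directly comparable:

```python
# Byte counts for the same text in UTF-8, UTF-16, and UTF-32.
for s in ["hello", "€"]:
    print(s,
          len(s.encode("utf-8")),
          len(s.encode("utf-16-le")),
          len(s.encode("utf-32-le")))
# hello: 5 / 10 / 20  — UTF-8 wins for ASCII
# €:     3 /  2 /  4  — UTF-16 is smaller for this one character
```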
Byte Order Mark (BOM)
The UTF-8 BOM is the three-byte sequence EF BB BF at the start of a file. Unlike UTF-16, UTF-8 does not need a BOM because its byte order is unambiguous. However, some Windows applications (Notepad, Excel) add a UTF-8 BOM to signal the encoding. This can cause problems in Unix tools, shell scripts, and HTTP responses where the BOM is treated as content. This tool will show the BOM bytes if present, so you can detect and remove them.
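In Python, the stdlib already handles both detection and stripping; a short sketch:

```python
# Detecting and stripping a UTF-8 BOM (EF BB BF).
import codecs

data = codecs.BOM_UTF8 + "hello".encode("utf-8")
print(data[:3].hex(" ").upper())    # EF BB BF

# 'utf-8-sig' strips a leading BOM if present:
print(data.decode("utf-8-sig"))     # hello
# Plain 'utf-8' keeps it as U+FEFF at the start:
print(repr(data.decode("utf-8")))   # '\ufeffhello'
```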
Frequently Asked Questions
How many bytes does an emoji use in UTF-8? Most emoji are encoded as 4 bytes in UTF-8 because they fall in the supplementary planes (U+1F600 and above). Some emoji sequences — like flags, skin-tone modifiers, and family compositions — consist of multiple code points joined by zero-width joiners (ZWJ), and can be 20+ bytes total.
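Both cases can be measured directly. The family emoji below (man + ZWJ + woman + ZWJ + girl) is one illustrative ZWJ sequence among many:

```python
# A single emoji vs a ZWJ sequence.
print(len("😀".encode("utf-8")))    # 4 bytes (U+1F600)

# man(4) + ZWJ(3) + woman(4) + ZWJ(3) + girl(4) = 18 bytes
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
print(len(family.encode("utf-8")))  # 18 bytes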
Why does "😀".length return 2 in JavaScript?
JavaScript strings are UTF-16 encoded internally. Characters above U+FFFF (like emoji) require a surrogate pair — two 16-bit code units. The .length property counts code units, not characters. Use [..."😀"].length, which counts code points, for the correct count; "😀".codePointAt(0) returns the full code point (0x1F600) instead of a lone surrogate.
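The same surrogate-pair effect can be observed from Python, which counts code points natively, by encoding to UTF-16:

```python
# One code point, two UTF-16 code units.
s = "😀"
print(len(s))                             # 1 code point (Python strings)
print(len(s.encode("utf-16-le")) // 2)    # 2 — what JavaScript's .length reports
```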
Is UTF-8 the same as Unicode?
No. Unicode is the character set — a mapping from code points (like U+0041) to characters (like A). UTF-8 is one of several encodings that convert those code points into bytes for storage and transmission. UTF-16 and UTF-32 are other encodings of the same Unicode character set.
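The distinction is visible in code: one code point, several byte forms depending on the encoding chosen.

```python
# Code point vs encoded bytes for the same character.
ch = "é"
print(f"U+{ord(ch):04X}")                       # U+00E9 — the code point
print(ch.encode("utf-8").hex(" ").upper())      # C3 A9
print(ch.encode("utf-16-le").hex(" ").upper())  # E9 00
print(ch.encode("utf-32-le").hex(" ").upper())  # E9 00 00 00
```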