Text Cleaner

You paste text from a PDF and every line has a hard line break in the middle of sentences. Or you copy from a website and get invisible Unicode characters — zero-width spaces, soft hyphens, non-breaking spaces — that break string comparisons. Or a CSV export has tabs mixed with spaces. This tool strips all of that in one click.

Why This Tool

Text from different sources carries different invisible baggage. PDFs insert line breaks at column edges. Word processors add smart quotes and em dashes. Websites embed zero-width joiners and non-breaking spaces. Each of these causes subtle bugs in code, data processing, and content management. This tool gives you toggleable cleaning operations so you can strip exactly what you need.

Cleaning Operations

Remove Extra Whitespace

Collapses multiple consecutive spaces into a single space and trims leading/trailing whitespace from each line. Turns "Hello    world" (four spaces between the words) into "Hello world".
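
As a rough illustration, the operation could be sketched in Python as below. The function name `remove_extra_whitespace` is hypothetical, and treating tabs like spaces is an assumption; the tool's actual implementation is not shown here.

```python
import re


def remove_extra_whitespace(text: str) -> str:
    """Collapse runs of spaces/tabs into one space and trim every line."""
    cleaned_lines = []
    for line in text.split("\n"):
        # Assumption: tabs are collapsed together with spaces.
        collapsed = re.sub(r"[ \t]+", " ", line)
        # Trim leading and trailing whitespace on this line.
        cleaned_lines.append(collapsed.strip())
    return "\n".join(cleaned_lines)


print(remove_extra_whitespace("  Hello    world  "))  # Hello world
```

Working line by line keeps the line structure intact, so this step can be combined with line-break removal or used on its own.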

Remove Line Breaks

Joins lines that were artificially split (common in PDF copy-paste). Preserves paragraph breaks (double newlines) while removing single line breaks within paragraphs.
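
One way to get this behavior, sketched in Python: split the text on blank lines, join the lines inside each paragraph with a space, then put the paragraphs back together. The name `remove_line_breaks` is hypothetical, and the sketch assumes line endings are already `\n`.

```python
import re


def remove_line_breaks(text: str) -> str:
    """Join single line breaks within paragraphs; keep paragraph breaks."""
    # A paragraph break is a newline, optional whitespace, and another newline.
    paragraphs = re.split(r"\n\s*\n", text)
    joined = []
    for paragraph in paragraphs:
        # Drop empty fragments and join the remaining lines with one space.
        lines = [line.strip() for line in paragraph.split("\n") if line.strip()]
        if lines:
            joined.append(" ".join(lines))
    return "\n\n".join(joined)


pdf_text = "The quick\nbrown fox.\n\nNew\nparagraph."
print(repr(remove_line_breaks(pdf_text)))
# 'The quick brown fox.\n\nNew paragraph.'
```

This sketch does not rejoin words that were hyphenated across a line break; that would need a separate rule.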

Remove HTML Tags

Strips all HTML tags, leaving only the text content. <p>Hello <strong>world</strong></p> becomes Hello world.
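
A minimal sketch of tag stripping using Python's standard `html.parser` module, which avoids the pitfalls of regex-based approaches. The class and function names are hypothetical, and decoding entities such as `&amp;` is an assumption about the desired behavior rather than something the tool documents.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self) -> None:
        # convert_charrefs=True decodes entities such as &amp; (assumption).
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data: str) -> None:
        # Called for text between tags; the tags themselves are ignored.
        self.parts.append(data)


def remove_html_tags(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


print(remove_html_tags("<p>Hello <strong>world</strong></p>"))  # Hello world
```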

Remove Special Characters

Strips non-alphanumeric characters (punctuation, symbols) while preserving spaces and basic structure.
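
A sketch of this operation in Python, assuming that "alphanumeric" includes non-ASCII letters and digits and that underscores count as symbols. Both are assumptions; the function name `remove_special_characters` is hypothetical.

```python
import re


def remove_special_characters(text: str) -> str:
    """Remove punctuation and symbols; keep letters, digits, and whitespace."""
    # [^\w\s] matches anything that is not a word character or whitespace.
    # \w also matches "_", so the underscore is removed explicitly.
    return re.sub(r"[^\w\s]|_", "", text)


print(remove_special_characters("Hello, world! (2024)"))  # Hello world 2024
```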

Normalize Unicode

Replaces fancy Unicode characters with ASCII equivalents:

Smart quotes (“ ” ‘ ’) → straight quotes (" ')
Em dash (—) → double hyphen (--)
Ellipsis (…) → three dots (...)
Non-breaking space → regular space
Zero-width characters → removed entirely
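
The replacements listed above map naturally onto a translation table. The sketch below uses Python's `str.maketrans` and `str.translate`; the name `normalize_unicode` is hypothetical, and the exact set of zero-width characters (U+200B, U+200C, U+200D, U+FEFF) is an assumption.

```python
# Mapping from "fancy" characters to ASCII replacements.
REPLACEMENTS = {
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u2014": "--",  # em dash
    "\u2026": "...", # horizontal ellipsis
    "\u00a0": " ",   # non-breaking space
    # Assumed set of zero-width characters, removed entirely:
    "\u200b": "",    # zero-width space
    "\u200c": "",    # zero-width non-joiner
    "\u200d": "",    # zero-width joiner
    "\ufeff": "",    # zero-width no-break space / BOM
}
TABLE = str.maketrans(REPLACEMENTS)


def normalize_unicode(text: str) -> str:
    """Apply every replacement in a single pass over the text."""
    return text.translate(TABLE)


sample = "\u201cWait\u2026\u201d she said\u00a0\u2014 it\u2019s fine."
print(normalize_unicode(sample))  # "Wait..." she said -- it's fine.
```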

Trim Lines

Removes leading and trailing whitespace from every line independently.

Common Use Cases

PDF to clean text: Remove artificial line breaks and extra spaces
Web scraping cleanup: Strip HTML tags and normalize whitespace
Data normalization: Clean CSV/TSV fields before importing
Code comments: Remove fancy Unicode from copy-pasted text
Email formatting: Fix text copied from rich-text email clients

Invisible Unicode Characters

These characters are invisible but cause real problems:

Character	Unicode	Problem
Zero-width space	U+200B	Breaks string equality checks
Zero-width joiner	U+200D	Comes along in text copied from the web and breaks string matching
Non-breaking space	U+00A0	Looks like a space but isn’t
Soft hyphen	U+00AD	Invisible except at line breaks
BOM (Byte Order Mark)	U+FEFF	Causes “unexpected token” errors in parsers

This tool detects all of these. Non-breaking spaces are replaced with regular spaces; the others are removed.
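
To see why these characters cause trouble, and how detection might work, here is a small Python sketch that reports each invisible character with its position and code point. The name `find_invisible` is hypothetical, and the lookup table simply mirrors the table above.

```python
# Characters from the table above, keyed by the character itself.
INVISIBLE_CHARS = {
    "\u200b": "Zero-width space",
    "\u200d": "Zero-width joiner",
    "\u00a0": "Non-breaking space",
    "\u00ad": "Soft hyphen",
    "\ufeff": "BOM (Byte Order Mark)",
}


def find_invisible(text: str) -> list:
    """Return (index, code point, name) for each invisible character found."""
    return [
        (index, f"U+{ord(char):04X}", INVISIBLE_CHARS[char])
        for index, char in enumerate(text)
        if char in INVISIBLE_CHARS
    ]


# The two strings look identical on screen but do not compare equal.
print("ab" == "a\u200bb")  # False

print(find_invisible("a\u200bb\u00a0c"))
# [(1, 'U+200B', 'Zero-width space'), (3, 'U+00A0', 'Non-breaking space')]
```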

Frequently Asked Questions

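Will this tool change my text content? The whitespace and Unicode operations affect only whitespace, formatting characters, and invisible Unicode; the actual words and numbers in your text remain unchanged. Remove HTML Tags and Remove Special Characters do delete visible characters (markup and punctuation), so enable them only when that is what you want. You can toggle each cleaning operation independently to control exactly what gets removed.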

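Can I clean code with this tool? Be careful: removing special characters will strip operators and syntax. Use only the whitespace normalization and Unicode cleaning options for code, and keep in mind that trimming leading whitespace removes indentation, which matters in languages such as Python. The HTML tag removal option can strip markup from code snippets, but it may also remove anything else that looks like a tag, such as a generic type parameter like <T>.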

How does it handle different line ending formats? The tool normalizes all line endings (Windows CRLF, classic Mac OS CR, Unix LF) to Unix LF format. This prevents line ending mismatches when moving text between operating systems.
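
As a sketch of what such normalization could look like in Python (the function name `normalize_line_endings` is hypothetical), the order of the two replacements matters:

```python
def normalize_line_endings(text: str) -> str:
    """Convert CRLF and lone CR line endings to LF."""
    # Replace CRLF first; otherwise the lone-CR pass would turn
    # each "\r\n" into two newlines.
    return text.replace("\r\n", "\n").replace("\r", "\n")


print(repr(normalize_line_endings("a\r\nb\rc\nd")))  # 'a\nb\nc\nd'
```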

Does this tool preserve paragraph breaks? Yes, when using “Remove Line Breaks” mode. Single newlines (mid-paragraph breaks) are removed, but double newlines (paragraph separators) are preserved.