Guide
How to extract text from any PDF, including scanned documents
Extracting text from a PDF should be simple, but it often is not. Whether you are dealing with a text-based PDF where copy-paste produces garbled results, or a scanned document where text is trapped inside images, here are reliable solutions.
Why extracting text from PDFs can be difficult
PDFs come in two fundamentally different varieties. Text-based PDFs contain actual character data with font and positioning information. These are created by applications like Word, Pages, or InDesign. Image-based PDFs contain scanned images of documents with no actual text data. These come from scanners, cameras, and some fax machines.
Even with text-based PDFs, copying and pasting can produce unexpected results. Some PDFs use custom character encodings in which the stored character codes do not correspond to the Unicode values of the characters displayed, so copying produces gibberish. Others store text in an internal order that does not match the visual reading order, causing columns to merge or text to appear jumbled.
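To make the encoding problem concrete, here is a minimal Python sketch. The stored codes and the `to_unicode` table are invented for illustration. In a real PDF this mapping normally lives in the font's ToUnicode CMap, which can be missing or incorrect, and a real extraction tool does considerably more than this.

```python
# Hypothetical subset font: the PDF stores these byte codes, and the font
# draws the glyphs "H", "e", "l", "o" for codes 1, 2, 3, 4.
stored_codes = bytes([1, 2, 3, 3, 4])

# Illustrative code-to-Unicode table (invented for this example).
to_unicode = {1: "H", 2: "e", 3: "l", 4: "o"}


def naive_copy(codes: bytes) -> str:
    """Treat each stored code as if it were already a Unicode code point."""
    return "".join(chr(c) for c in codes)


def mapped_copy(codes: bytes, table: dict[int, str]) -> str:
    """Translate each stored code through the table; mark unknown codes."""
    return "".join(table.get(c, "\ufffd") for c in codes)


print(repr(naive_copy(stored_codes)))         # control characters, not readable text
print(mapped_copy(stored_codes, to_unicode))  # readable once the table is applied
```

The page looks correct on screen in both cases because the font draws the right shapes. Only the second function recovers text that can be pasted.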
Multi-column layouts, tables, headers and footers, and footnotes all create additional challenges for text extraction. A simple copy-paste often interleaves text from different columns or mixes body text with header content.
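The interleaving problem can be sketched in a few lines of Python. This assumes an extractor has already produced words with page coordinates, and that the position of the gap between columns (`gutter_x`) is known. Both are assumptions made for the example, and real column detection is more involved.

```python
from typing import NamedTuple


class Word(NamedTuple):
    x: float  # left edge, in points
    y: float  # distance from the top of the page, in points
    text: str


def naive_order(words: list[Word]) -> str:
    """Read strictly top-to-bottom, then left-to-right (interleaves columns)."""
    return " ".join(w.text for w in sorted(words, key=lambda w: (w.y, w.x)))


def column_order(words: list[Word], gutter_x: float) -> str:
    """Read everything left of gutter_x first, then everything right of it."""
    left = [w for w in words if w.x < gutter_x]
    right = [w for w in words if w.x >= gutter_x]
    ordered = sorted(left, key=lambda w: (w.y, w.x)) + sorted(
        right, key=lambda w: (w.y, w.x)
    )
    return " ".join(w.text for w in ordered)


# A tiny two-column page: two lines in each column.
page = [
    Word(50, 100, "Left"), Word(90, 100, "one."),
    Word(320, 100, "Right"), Word(365, 100, "one."),
    Word(50, 115, "Left"), Word(90, 115, "two."),
    Word(320, 115, "Right"), Word(365, 115, "two."),
]

print(naive_order(page))
print(column_order(page, gutter_x=300))
```

The naive ordering yields "Left one. Right one. Left two. Right two.", while the column-aware ordering keeps each column together.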
Method 1: Copy and paste (for simple text PDFs)
Open the PDF in Preview or any PDF viewer, select the text with your cursor, and paste it into your target application. This works well for single-column, text-based PDFs without complex formatting.
If the pasted text has extra line breaks, missing spaces, or garbled characters, the PDF's internal text encoding may be non-standard. In that case, using a dedicated extraction tool will produce better results.
Method 2: Convert PDF to a text-friendly format
Converting a PDF to a Word document, plain text file, or RTF preserves the text content while making it fully editable. File Studio can convert PDF to text-based formats, intelligently handling multi-column layouts, tables, and lists.
For scanned PDFs, File Studio applies OCR (optical character recognition) during conversion to recognize text within the scanned images. Modern OCR engines typically achieve 95-99% character accuracy on cleanly scanned documents in common languages.
The conversion approach is especially useful when you need to extract text from many pages at once or when you need to preserve some level of formatting (headings, lists, tables) alongside the text content.
Getting the best results from text extraction
For text-based PDFs, converting to plain text (.txt) gives the cleanest output with no formatting artifacts. If you need to preserve structure, convert to Word (.docx) or RTF format.
For scanned documents, the quality of the scan directly affects OCR accuracy. Clean, high-resolution scans (300 DPI or above) with good contrast produce the best results. Skewed pages, low resolution, and poor contrast all reduce accuracy. File Studio's preprocessing can improve results by straightening pages and enhancing contrast before OCR.
How text is stored in PDFs and why extraction varies
PDFs store text in two fundamentally different ways: as vector text (selectable characters with font information) and as raster images of text (scanned pages). Vector text can be extracted directly by reading the content stream, which contains the actual character codes and their positions. This is how text ends up in a PDF when you export from Word, save from a browser, or create a PDF from any application.
Scanned documents, by contrast, contain no text data at all. Each page is a photograph of the physical page, and what appears to be text is actually an array of pixels. Extracting text from scanned PDFs requires Optical Character Recognition (OCR), which analyzes the image to identify letter shapes and convert them to actual characters. OCR accuracy depends on scan quality, font clarity, and language complexity.
Some PDFs are hybrids: they contain scanned page images with an invisible text layer generated by OCR during the scanning process. This invisible text enables searching and copying while the visible image provides the faithful visual reproduction. When extracting text from such PDFs, you can pull from the hidden text layer without running OCR again, though the accuracy depends on the quality of the original OCR pass.
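The three cases above (vector text, scanned images, and scanned images with a hidden text layer) can be told apart with a simple heuristic. The Python sketch below is hypothetical: the `PageStats` fields, the thresholds, and the labels are all assumptions for illustration, the numbers would have to come from whatever PDF library you use, and this is not a description of how File Studio detects PDF types.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    extractable_chars: int  # characters found in the page's text content
    image_coverage: float   # fraction of the page area covered by images, 0..1


def classify_page(page: PageStats, min_chars: int = 20,
                  full_page_image: float = 0.9) -> str:
    """Label one page. The thresholds are illustrative guesses."""
    has_text = page.extractable_chars >= min_chars
    is_page_image = page.image_coverage >= full_page_image
    if has_text and is_page_image:
        return "scanned with text layer"
    if has_text:
        return "text-based"
    if is_page_image:
        return "scanned"
    return "empty or unknown"


def classify_document(pages: list[PageStats]) -> str:
    """Label the whole document, or 'mixed' when its pages disagree."""
    if not pages:
        return "empty or unknown"
    kinds = {classify_page(p) for p in pages}
    return kinds.pop() if len(kinds) == 1 else "mixed"


print(classify_document([PageStats(1800, 0.0), PageStats(2100, 0.1)]))
print(classify_document([PageStats(0, 1.0)]))
print(classify_document([PageStats(1500, 1.0)]))
print(classify_document([PageStats(1800, 0.0), PageStats(0, 1.0)]))
```

Only the pages labeled "scanned" need OCR. Pages with a text layer can be read directly, subject to the quality of the earlier OCR pass.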
OCR accuracy and how to improve it
Modern OCR engines (based on machine learning and neural networks) achieve 95-99% accuracy on clean, well-scanned documents with standard fonts. Accuracy drops with poor scan quality, skewed pages, unusual fonts, handwritten text, low contrast, and complex layouts with multiple columns or tables.
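It helps to translate those percentages into error counts. The Python sketch below assumes a page of about 3,000 characters (an illustrative figure) and measures accuracy as one minus the edit distance divided by the length of the correct text. That is one common way to score OCR output, not necessarily the metric behind any particular engine's published numbers.

```python
def expected_errors(char_count: int, accuracy: float) -> float:
    """Expected number of wrong characters at a given per-character accuracy."""
    return char_count * (1.0 - accuracy)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def character_accuracy(truth: str, ocr_output: str) -> float:
    """One minus edit distance divided by the length of the reference text."""
    if not truth:
        return 1.0 if not ocr_output else 0.0
    return max(0.0, 1.0 - edit_distance(truth, ocr_output) / len(truth))


for accuracy in (0.95, 0.99):
    print(accuracy, round(expected_errors(3000, accuracy)))

# Typical OCR confusions: I/l, l/1, 0/O.
print(character_accuracy("Invoice total: 100", "lnvoice tota1: 1O0"))
```

On a 3,000-character page, 95% accuracy means roughly 150 wrong characters and 99% still leaves about 30, which is why reviewing the output matters for critical documents.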
To maximize OCR accuracy, scan at 300 DPI or higher in grayscale or black-and-white mode. Color scanning is rarely necessary for black text on a light background and increases file size, usually without improving recognition. Ensure pages are straight (not skewed), clean (no coffee stains or fold marks), and evenly lit (no shadows from book spines).
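To check whether an existing scan meets the 300 DPI guideline, compare the image's pixel size with the page size. The Python sketch below relies on the fact that PDF page dimensions are expressed in points, with 72 points to the inch, and assumes the scanned image fills the whole page. The function names are made up for the example.

```python
POINTS_PER_INCH = 72.0  # PDF page sizes are given in points, 72 per inch


def pixels_needed(width_pt: float, height_pt: float,
                  dpi: int = 300) -> tuple[int, int]:
    """Pixel dimensions of a page scanned at the given DPI."""
    return (round(width_pt / POINTS_PER_INCH * dpi),
            round(height_pt / POINTS_PER_INCH * dpi))


def effective_dpi(image_width_px: int, page_width_pt: float) -> float:
    """Resolution implied by a full-page image of the given pixel width."""
    return image_width_px / (page_width_pt / POINTS_PER_INCH)


# US Letter is 612 x 792 points (8.5 x 11 inches).
print(pixels_needed(612, 792))
print(effective_dpi(1275, 612))
```

A US Letter page needs 2550 x 3300 pixels at 300 DPI. A scan that is only 1275 pixels wide works out to 150 DPI, half the recommended resolution.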
For documents with tables, OCR often extracts the text but loses the tabular structure. The resulting text reads across rows but the column alignment is gone. File Studio's table-aware extraction mode attempts to preserve the grid structure, outputting tab-separated or comma-separated values that you can paste into a spreadsheet.
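The general idea of keeping the grid can be sketched in Python. This is not File Studio's table-aware mode. It is a simplified illustration that assumes the OCR step returns each word with its position, and that the left edge of every column (`column_starts`) is already known.

```python
from bisect import bisect_right


def words_to_tsv(words, column_starts, row_tolerance=3.0):
    """Rebuild a grid from (x, y, text) words and emit tab-separated text.

    column_starts must be sorted left to right. Words whose y values differ
    by no more than row_tolerance from a row's first word join that row.
    """
    rows = []  # each entry: (row_y, [words in that row])
    for word in sorted(words, key=lambda w: w[1]):
        if rows and abs(word[1] - rows[-1][0]) <= row_tolerance:
            rows[-1][1].append(word)
        else:
            rows.append((word[1], [word]))

    lines = []
    for _, row_words in rows:
        cells = [""] * len(column_starts)
        for x, _, text in sorted(row_words, key=lambda w: w[0]):
            # Pick the last column whose left edge is at or before x.
            col = max(0, bisect_right(column_starts, x) - 1)
            cells[col] = f"{cells[col]} {text}".strip()
        lines.append("\t".join(cells))
    return "\n".join(lines)


words = [
    (50, 100, "Item"), (200, 100, "Qty"), (300, 100, "Price"),
    (50, 120, "Blue"), (80, 121, "widget"), (300, 120, "4.50"),
    (50, 140, "Bolt"), (200, 140, "12"), (300, 140, "0.30"),
]

print(words_to_tsv(words, column_starts=[50, 200, 300]))
```

The second row has no quantity, and the output keeps an empty cell there instead of shifting the price into the wrong column.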
Text extraction output formats and use cases
Plain text extraction strips all formatting and produces a simple text file. This is useful for indexing, searching, or feeding content into other applications. However, you lose all layout information: headings, bullet points, columns, and paragraphs are flattened into a continuous stream of text.
Rich text extraction preserves some formatting (bold, italic, font sizes, paragraph breaks) and produces output compatible with word processors. This is more useful when you need to repurpose the content while maintaining its structure, but it requires the PDF to contain explicit font style information.
For data extraction from structured PDFs (invoices, forms, reports with consistent layouts), the most effective approach is template-based extraction where you define regions of interest on the page. File Studio supports region selection, allowing you to draw a box around the data you need and extract only that area, which is far more practical than extracting the entire page and manually finding the relevant information.
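Region-based extraction comes down to keeping only the words that fall inside a box. The Python sketch below shows the idea with made-up coordinates. It assumes words arrive with bounding boxes measured from the top-left corner of the page, and it says nothing about how File Studio implements region selection.

```python
from typing import NamedTuple


class Box(NamedTuple):
    x0: float  # left
    y0: float  # top
    x1: float  # right
    y1: float  # bottom


def inside(word_box: Box, region: Box) -> bool:
    """True when the word's centre point falls inside the region."""
    cx = (word_box.x0 + word_box.x1) / 2
    cy = (word_box.y0 + word_box.y1) / 2
    return region.x0 <= cx <= region.x1 and region.y0 <= cy <= region.y1


def extract_region(words: list[tuple[Box, str]], region: Box) -> str:
    """Return the text of the words inside the region, in reading order."""
    hits = [(box, text) for box, text in words if inside(box, region)]
    hits.sort(key=lambda item: (item[0].y0, item[0].x0))
    return " ".join(text for _, text in hits)


# Hypothetical invoice page: a header, a stray label, and the total.
invoice_words = [
    (Box(400, 50, 450, 62), "Invoice"), (Box(455, 50, 500, 62), "#1042"),
    (Box(50, 300, 90, 312), "Total:"),
    (Box(400, 700, 440, 712), "Total:"), (Box(445, 700, 490, 712), "$98.00"),
]

total_region = Box(390, 690, 500, 720)
print(extract_region(invoice_words, total_region))
```

Because the same region can be reused on every invoice with the same layout, one box definition can serve a whole batch.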
Pro tips
- Before running OCR on a scanned PDF, check whether it already has a hidden text layer. Try selecting text with your cursor in Preview or Acrobat Reader. If text highlights, OCR has already been applied and you can copy directly.
- For scanned documents with poor OCR accuracy, try preprocessing the scan: increase contrast, convert to grayscale, and deskew the pages before running OCR. These steps can significantly improve character recognition.
- When extracting text from a PDF with columns (like a newspaper or academic paper), use a tool that supports column-aware extraction. Otherwise, text from adjacent columns gets interleaved, producing nonsensical output.
- For extracting tabular data, paste the extracted text into a single spreadsheet column and use the text-to-columns feature to split it, rather than expecting a direct paste to land in the right cells. The tab and space characters from PDF extraction rarely align with spreadsheet column boundaries.
- If you need text from a specific page range rather than the entire document, specify the range in File Studio to avoid processing unnecessary pages and reduce extraction time.
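The text-to-columns tip can also be applied outside a spreadsheet. The Python sketch below splits each extracted line on tabs or on runs of two or more spaces and writes CSV. The separator rule is an assumption that suits many extracted tables but not all of them, for example a table whose cells are separated by single spaces.

```python
import csv
import io
import re

# A column break is a tab (with any surrounding whitespace) or a run of
# two or more whitespace characters. Single spaces stay inside a cell.
SEPARATOR = re.compile(r"\s*\t\s*|\s{2,}")


def text_to_rows(text: str) -> list[list[str]]:
    """Split extracted text into rows of cells, skipping blank lines."""
    return [SEPARATOR.split(line.strip())
            for line in text.splitlines() if line.strip()]


def rows_to_csv(rows: list[list[str]]) -> str:
    """Serialize rows as CSV text that a spreadsheet can open."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


extracted = (
    "Item          Qty   Price\n"
    "Blue widget   2     4.50\n"
    "Bolt\t12\t0.30\n"
)

rows = text_to_rows(extracted)
print(rows)
print(rows_to_csv(rows))
```

Note that "Blue widget" stays in one cell because its words are separated by a single space.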
How to do it with File Studio
Open your PDF in File Studio
Drag your PDF into File Studio. The app identifies whether it is text-based, image-based (scanned), or a mix of both.
Choose your extraction method
For text-based PDFs, select your target format (plain text, Word, or RTF). For scanned PDFs, enable OCR processing to recognize text in the images.
Extract and review
File Studio extracts the text and saves it in your chosen format. Review the output for accuracy, especially if OCR was involved. The text is now ready to edit, search, and reuse.
Try File Studio free
All tools work 100% offline. No sign-ups, no uploads, no subscriptions. Download and start converting right away.
Frequently asked questions
Why does copy-paste from a PDF produce garbled text?
Some PDFs use custom character encodings or font subsets that do not map cleanly to Unicode. The text looks correct on screen because the font displays the right glyphs, but the underlying character codes do not match standard text encoding. A dedicated extraction tool resolves this by interpreting the PDF's encoding table correctly.
Can I extract text from a scanned PDF?
Yes, using OCR (optical character recognition). OCR analyzes the scanned images and recognizes text characters. File Studio includes OCR capabilities that work offline on your device, so your scanned documents remain private.
How accurate is OCR for scanned documents?
Modern OCR engines typically achieve 95-99% character accuracy on clean, high-resolution scans in common languages. Accuracy decreases with poor scan quality, unusual fonts, handwriting, or low contrast. Always review OCR output for critical documents.
Can I extract text from specific pages only?
Yes. File Studio allows you to specify a page range for text extraction, so you can extract text from just the pages you need rather than the entire document.
Will the extracted text preserve formatting like bold and headings?
If you extract to plain text, formatting is lost. If you convert to Word or RTF format, File Studio preserves headings, bold, italic, and basic layout structure. Table extraction accuracy depends on the complexity of the table layout.
@ayysoni · March 2, 2026