PDF Text Extractor
Extracts text from PDF files using PAD's built-in Pdf.ExtractTextFromPDF action. Covers full extraction, page range selection, table extraction, and empty-text detection.
Provided as-is, without warranty of any kind. Review and test each pattern in a non-production environment before deploying it to live automations. See our Terms.
Problem this solves
Users receive invoices, POs, and reports as PDFs and need to pull data out for entry into other systems.
Cross-references: Scanned PDF OCR Pipeline for handling PDFs that return blank text. Invoice Data Extractor for parsing the extracted text into structured fields. Multi-Line Text Block Parser for key-value extraction from the text output.
1. Extract All Text from PDF
Extracts all text from every page of a PDF. DetectLayout: True preserves spacing and column alignment — use for invoices and tabular documents. DetectLayout: False produces a continuous text stream — use for narrative documents.
Variables: varPdfPath (full path to PDF file), varCurrentMessage (logging), varExtractedText (output — all extracted text)
2. Extract Text from Specific Pages
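Extracts text from a selected set of pages. The page selection string supports single pages, ranges, and comma-separated combinations. Because PageSelection is verified only on Pdf.ExtractPages, this pattern works in two steps: Pdf.ExtractPages writes the selected pages to a temporary PDF, then ExtractText runs on that file.
Variables: varPdfPath (PDF file path), varPageRange (page selection string), varTempPagesPdf (temporary PDF containing the selected pages), varCurrentMessage (logging), varPageText (output: extracted text from selected pages)
Unverified parameter: PageSelection on ExtractText is not in the verified reference for this action (it appears only on ExtractPages), which is why this pattern avoids it. If you try PageSelection directly on ExtractText and the action doesn't paste into the designer, use the two-step approach: extract the pages first with Pdf.ExtractPages, then run ExtractText on the resulting PDF.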
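Because the page selection string is free text, it can be worth validating before it reaches the PDF action. The Python sketch below expands a selection such as "1-3,5" into a sorted list of page numbers. The function name parse_page_selection is hypothetical, and the accepted syntax (comma-separated pages and hyphenated ranges, 1-based) is an assumption drawn from the description above, not a confirmed PAD format.

```python
def parse_page_selection(selection, page_count=None):
    """Expand a selection like "1-3,5" into sorted, de-duplicated page numbers.

    Assumes 1-based pages, commas between items, and hyphens for ranges.
    """
    pages = set()
    for part in selection.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1,,3"
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"Range start exceeds end: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    if any(p < 1 for p in pages):
        raise ValueError("Page numbers are 1-based")
    if page_count is not None and any(p > page_count for p in pages):
        raise ValueError(f"Selection exceeds page count {page_count}")
    return sorted(pages)
```

Rejecting a malformed selection early gives a clearer log message than a failure inside the extraction step.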
3. Extract Tables from PDF
Extracts tabular data from a PDF into a DataTable. MultiPageTables: True merges tables that span multiple pages. SetFirstRowAsHeader: True uses the first row as column names.
Variables: varPdfPath (PDF file path), varCurrentMessage (logging), varExtractedTables (output — DataTable of table data)
4. Extract Images from PDF
Extracts all embedded images from a PDF and saves them to a folder. Useful for scanned PDFs where you need the images for OCR processing.
Variables: varPdfPath (PDF file path), varImageOutputFolder (folder for extracted images), varImagePrefix (naming prefix for images), varCurrentMessage (logging)
5. Batch Extract Text from Folder of PDFs
Scans a folder for all PDF files and extracts text from each, storing results in a DataTable with filename and text content.
Variables: varPdfFolder (folder containing PDFs), varCurrentMessage (logging), varPdfFiles (list of PDF files), varResultsTable (output — DataTable with filename and text), CurrentItem (loop iterator), varItemText (text from current PDF), varProcessedCount (counter)
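The control flow of this pattern can be sketched outside PAD as follows. This is a minimal Python illustration, not the PAD actions themselves: batch_extract and its extract_text argument are hypothetical names, and the caller supplies the real extraction step. The "Blank - needs OCR" marker comes from the notes in this document. Recording an "Error: ..." row for a failed file is an assumption about how failures might be handled.

```python
from pathlib import Path

BLANK_MARKER = "Blank - needs OCR"

def batch_extract(pdf_folder, extract_text):
    """Return (rows, processed_count) for every *.pdf file in pdf_folder.

    extract_text is a caller-supplied function taking a Path and returning
    the extracted text. Each row is a dict with 'filename' and 'text'.
    """
    rows = []
    processed = 0
    # Note: glob("*.pdf") is case-sensitive on some platforms.
    for pdf_path in sorted(Path(pdf_folder).glob("*.pdf")):
        try:
            text = extract_text(pdf_path)
        except Exception as exc:
            # Assumed behaviour: record the failure and move on to the next file.
            rows.append({"filename": pdf_path.name, "text": f"Error: {exc}"})
            continue
        if not text or not text.strip():
            # Whitespace-only output is treated as a scanned PDF.
            text = BLANK_MARKER
        rows.append({"filename": pdf_path.name, "text": text})
        processed += 1
    return rows, processed
```

Treating whitespace-only output as blank matters because an extractor may return line breaks or spaces for a scanned page rather than an empty string.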
Variable Reference Summary
| Variable | Type | Used In | Purpose |
| --- | --- | --- | --- |
| `varPdfPath` | Text | Patterns 1–4 | Full path to PDF file |
| `varCurrentMessage` | Text | All patterns | Logging message for Subflow_Logging |
| `varExtractedText` | Text | Pattern 1 | Full extracted text output |
| `varPageRange` | Text | Pattern 2 | Page selection string (e.g., "1-3") |
| `varTempPagesPdf` | File | Pattern 2 | Temp extracted pages PDF |
| `varPageText` | Text | Pattern 2 | Text from selected pages |
| `varExtractedTables` | DataTable | Pattern 3 | Table data from PDF |
| `varImageOutputFolder` | Text | Pattern 4 | Folder for extracted images |
| `varImagePrefix` | Text | Pattern 4 | Naming prefix for images |
| `varPdfFolder` | Text | Pattern 5 | Folder of PDFs for batch processing |
| `varPdfFiles` | List of Files | Pattern 5 | PDF files found in folder |
| `varResultsTable` | DataTable | Pattern 5 | Batch results with filename and text |
| `CurrentItem` | File | Pattern 5 | Current PDF in batch loop |
| `varItemText` | Text | Pattern 5 | Text from current PDF |
| `varProcessedCount` | Numeric | Pattern 5 | PDFs processed counter |
Notes
- DetectLayout: True vs False. True preserves the visual layout including column spacing, which is essential for invoices, forms, and tabular documents. False produces a flat text stream, which is better for contracts, letters, and narrative content. Default to True unless you know the content is non-tabular.
- Empty text = scanned PDF. If ExtractText returns blank, the PDF is likely a scanned image. Use the Scanned PDF OCR Pipeline (extract images → OCR each image → stitch text) as a fallback.
- ExtractTables (Pattern 3) works best with PDFs that have clearly defined table borders. For borderless tables or loosely structured tabular data, ExtractText with DetectLayout: True + text parsing (Multi-Line Text Block Parser) is more reliable.
- Page range via ExtractPages (Pattern 2). Since PageSelection may not be a valid parameter on ExtractText, Pattern 2 uses the two-step approach: ExtractPages to split, then ExtractText on the result. This is verified.
- Batch processing (Pattern 5) logs "Blank - needs OCR" for scanned PDFs so you can route them to the OCR pipeline separately.
- Simple ON ERROR throughout. No ON ERROR on SET or IF.
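To illustrate the note on borderless tables, the Python sketch below splits layout-preserved text into columns wherever two or more spaces occur in a row. It assumes columns are separated by runs of spaces (not tabs) and that cell values contain at most single spaces. The function names are hypothetical, and this is not the Multi-Line Text Block Parser itself.

```python
import re

def split_layout_line(line, min_gap=2):
    """Split one layout-preserved line into cells on runs of >= min_gap spaces."""
    pattern = r" {%d,}" % min_gap
    return [cell for cell in re.split(pattern, line.strip()) if cell]

def parse_layout_table(text, min_gap=2):
    """Turn layout-preserved text into a list of rows, skipping blank lines."""
    rows = []
    for line in text.splitlines():
        if line.strip():
            rows.append(split_layout_line(line, min_gap))
    return rows
```

An empty cell collapses into the neighbouring gap under this approach, so rows with missing values come back shorter than the header and need separate handling.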
Dependencies
- Scanned PDF OCR Pipeline
- Invoice Data Extractor
- Multi-Line Text Block Parser