PDF Text Extractor

Extracts text from PDF files using the built-in Pdf.ExtractTextFromPDF action in Power Automate Desktop (PAD). Covers full extraction, page range selection, table extraction, image extraction, batch processing of a folder, and empty-text detection.


Provided as-is, without warranty of any kind. Review and test each pattern in a non-production environment before deploying it to live automations. See our Terms.

Problem this solves

Users receive invoices, POs, and reports as PDFs and need to pull data out for entry into other systems.

Cross-references: Scanned PDF OCR Pipeline for handling PDFs that return blank text. Invoice Data Extractor for parsing the extracted text into structured fields. Multi-Line Text Block Parser for key-value extraction from the text output.

1. Extract All Text from PDF

Extracts all text from every page of a PDF. DetectLayout: True preserves spacing and column alignment; use it for invoices and tabular documents. DetectLayout: False produces a continuous text stream; use it for narrative documents.

Variables: varPdfPath (full path to PDF file), varCurrentMessage (logging), varExtractedText (output: all extracted text)

PAD Script

**REGION Extract All Text from PDF
SET varPdfPath TO $'''C:\\Data\\Invoice.pdf'''
SET varCurrentMessage TO $fx'Extracting text from: ${varPdfPath}'

# DetectLayout: True keeps spacing and column alignment
Pdf.ExtractTextFromPDF.ExtractText PDFFile: $fx'=${varPdfPath}' DetectLayout: True ExtractedText=> varExtractedText
ON ERROR REPEAT 1 TIMES WAIT 2
END

# Blank output usually means the PDF has no text layer
IF IsNotBlank(varExtractedText) THEN
    SET varCurrentMessage TO $'''PDF text extracted successfully'''
ELSE
    SET varCurrentMessage TO $'''PDF text is empty; this may be a scanned document (use Scanned PDF OCR Pipeline)'''
END
**ENDREGION
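
The empty-text check above can be mirrored outside PAD when testing the routing logic. The following Python sketch is only an illustration: the function names are hypothetical, and treating whitespace-only output as blank is an assumption, since this document does not state how IsNotBlank handles whitespace.

```python
def needs_ocr(extracted_text):
    """Return True when extraction produced no usable text."""
    # Assumption: None, empty, and whitespace-only output all count as blank.
    return extracted_text is None or extracted_text.strip() == ""


def status_message(extracted_text):
    """Build the same kind of log message the PAD script sets."""
    if needs_ocr(extracted_text):
        return "PDF text is empty; route to the Scanned PDF OCR Pipeline"
    return "PDF text extracted successfully"
```

A call such as status_message(varExtractedText-equivalent text) keeps the decision in one place, so the same rule can be reused for single files and batches.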

2. Extract Text from Specific Pages

Extracts text from a page range in two steps: Pdf.ExtractPages writes the selected pages to a temporary PDF, then ExtractText runs on that file. The PageSelection string supports single pages, ranges, and comma-separated combinations.

Variables: varPdfPath (PDF file path), varPageRange (page selection string), varTempPagesPdf (temporary PDF holding the selected pages), varCurrentMessage (logging), varPageText (output: extracted text from selected pages)

Unverified parameter: PageSelection is in the verified reference only for ExtractPages, not for ExtractText. For that reason the script below does not pass PageSelection to ExtractText; it extracts the pages first with Pdf.ExtractPages, then runs ExtractText on the resulting PDF.

PAD Script

**REGION Extract Text from Page Range
SET varPdfPath TO $'''C:\\Data\\LongReport.pdf'''
SET varPageRange TO $'''1-3'''
SET varCurrentMessage TO $fx'Extracting text from pages ${varPageRange}: ${varPdfPath}'

# Strategy: Extract pages first, then extract text from the result
# ExtractedPDFPath is assumed to be the output file path; the file name below is a placeholder
Pdf.ExtractPages PDFFile: $fx'=${varPdfPath}' PageSelection: $fx'=${varPageRange}' ExtractedPDFPath: $fx'C:\\Data\\Temp\\LongReport_pages.pdf' IfFileExists: Pdf.IfFileExists.AddSequentialSuffix ExtractedPDFFile=> varTempPagesPdf
ON ERROR REPEAT 1 TIMES WAIT 2
END

Pdf.ExtractTextFromPDF.ExtractText PDFFile: $fx'=${varTempPagesPdf}' DetectLayout: True ExtractedText=> varPageText
ON ERROR REPEAT 1 TIMES WAIT 2
END

# Clean up temp file
File.Delete Files: $fx'=${varTempPagesPdf}'
ON ERROR REPEAT 1 TIMES WAIT 2
END

SET varCurrentMessage TO $fx'Extracted text from pages ${varPageRange}'
**ENDREGION
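
To make the page selection format concrete, here is a minimal Python sketch of how a string such as "1-3" or "1,3,5-7" expands into page numbers. It is not PAD's own parser: the function name and the validation rules (1-based pages, no descending ranges) are assumptions chosen for illustration, and it can be handy for validating varPageRange before handing it to the flow.

```python
def parse_page_selection(selection):
    """Expand a selection string such as '1-3,7' into sorted page numbers."""
    pages = set()
    for part in selection.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            # A range such as '5-7' covers both endpoints.
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"Descending range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    # Assumption: pages are numbered from 1.
    if any(page < 1 for page in pages):
        raise ValueError("Page numbers start at 1")
    return sorted(pages)
```

Duplicates collapse because the pages are collected in a set, so "2,2" yields a single page.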

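3. Extract Tables from PDF

Extracts tabular data from a PDF. MultiPageTables: True merges tables that span multiple pages. SetFirstRowAsHeader: True uses the first row as column names. PAD's action reference describes the output as a list of PDF table info, where each entry holds one extracted DataTable, so Count on the output reports the number of tables found rather than a row count.

Variables: varPdfPath (PDF file path), varCurrentMessage (logging), varExtractedTables (output: the extracted tables)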

PAD Script

**REGION Extract Tables from PDF
SET varPdfPath TO $'''C:\\Data\\Report.pdf'''
SET varCurrentMessage TO $fx'Extracting tables from: ${varPdfPath}'

Pdf.ExtractTablesFromPDF.ExtractTables PDFFile: $fx'=${varPdfPath}' MultiPageTables: True SetFirstRowAsHeader: True ExtractedPDFTables=> varExtractedTables
ON ERROR REPEAT 1 TIMES WAIT 2
END

# Count is the number of tables found, not the number of rows
IF $fx'= Count(varExtractedTables) > 0' THEN
    SET varCurrentMessage TO $fx'Extracted ${varExtractedTables.Count} table(s)'
ELSE
    SET varCurrentMessage TO $'''No tables found in PDF; try text extraction instead'''
END
**ENDREGION
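
When no tables are found, the fallback is text extraction with DetectLayout: True followed by parsing. As a rough illustration of that parsing step outside PAD, the Python sketch below splits layout-preserved lines into cells. It assumes columns are separated by at least two spaces and that cell values contain only single spaces; the function names and the sample invoice lines are made up for the example.

```python
import re


def split_layout_line(line, min_gap=2):
    """Split one layout-preserved line into cells on runs of spaces."""
    gap = re.compile(r" {%d,}" % min_gap)
    return [cell for cell in gap.split(line.strip()) if cell]


def parse_layout_table(text, min_gap=2):
    """Turn layout-preserved text into a list of rows, skipping blank lines."""
    rows = []
    for line in text.splitlines():
        if line.strip():
            rows.append(split_layout_line(line, min_gap))
    return rows


# Made-up sample of what layout-preserved output might look like.
sample = (
    "Item          Qty    Unit Price\n"
    "Widget A      2      10.00\n"
    "\n"
    "Gear Box      1      45.50\n"
)
rows = parse_layout_table(sample)
```

With DetectLayout: False the column gaps would not be preserved, which is why this kind of split only makes sense on layout-preserved output.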

4. Extract Images from PDF

Extracts all embedded images from a PDF and saves them to a folder. Useful for scanned PDFs where you need the images for OCR processing.

Variables: varPdfPath (PDF file path), varImageOutputFolder (folder for extracted images), varImagePrefix (naming prefix for images), varCurrentMessage (logging)

PAD Script

**REGION Extract Images from PDF
SET varPdfPath TO $'''C:\\Data\\ScannedDoc.pdf'''
SET varImageOutputFolder TO $'''C:\\Data\\ExtractedImages\\'''
SET varImagePrefix TO $'''page'''
SET varCurrentMessage TO $fx'Extracting images from: ${varPdfPath}'

# Saves every embedded image to the output folder using the naming prefix
Pdf.ExtractImagesFromPDF.ExtractImages PDFFile: $fx'=${varPdfPath}' ImagesName: $fx'=${varImagePrefix}' ImagesFolder: $fx'=${varImageOutputFolder}'
ON ERROR REPEAT 1 TIMES WAIT 2
END

SET varCurrentMessage TO $fx'Images extracted to: ${varImageOutputFolder}'
**ENDREGION

5. Batch Extract Text from Folder of PDFs

Scans a folder for all PDF files and extracts text from each, storing results in a DataTable with filename, text content, and status.

Variables: varPdfFolder (folder containing PDFs), varCurrentMessage (logging), varPdfFiles (list of PDF files), varResultsTable (output: DataTable with filename, text, and status), CurrentItem (loop iterator), varItemText (text from current PDF), varProcessedCount (counter)

PAD Script

**REGION Batch Extract from PDF Folder
SET varPdfFolder TO $'''C:\\Data\\InvoiceBatch\\'''
SET varCurrentMessage TO $fx'Batch extracting text from PDFs in: ${varPdfFolder}'

Folder.GetFiles Folder: $fx'=${varPdfFolder}' FileFilter: $fx'*.pdf' IncludeSubfolders: False FailOnAccessDenied: True SortBy1: Folder.SortBy.Name SortDescending1: False SortBy2: Folder.SortBy.NoSort SortDescending2: False SortBy3: Folder.SortBy.NoSort SortDescending3: False Files=> varPdfFiles
ON ERROR REPEAT 1 TIMES WAIT 2
END

# One result row per PDF: file name, extracted text, status
Variables.CreateNewDatatable InputTable: { ^['FileName', 'ExtractedText', 'Status'] } DataTable=> varResultsTable
SET varProcessedCount TO $fx'=0'

LOOP FOREACH CurrentItem IN $fx'=varPdfFiles'
    Variables.IncreaseVariable Value: $fx'=varProcessedCount' IncrementValue: $fx'=1' Result=> varProcessedCount
    SET varCurrentMessage TO $fx'Extracting ${varProcessedCount} of ${varPdfFiles.Count}: ${CurrentItem.Name}'

    Pdf.ExtractTextFromPDF.ExtractText PDFFile: $fx'=${CurrentItem.FullName}' DetectLayout: True ExtractedText=> varItemText
    ON ERROR REPEAT 1 TIMES WAIT 2
    END

    # Blank text is recorded so the file can be routed to OCR later
    IF IsNotBlank(varItemText) THEN
        Variables.AddRowToDataTable.AppendRowToDataTable DataTable: $fx'=varResultsTable' RowToAdd: $fx'[${CurrentItem.Name}, ${varItemText}, Success]'
    ELSE
        Variables.AddRowToDataTable.AppendRowToDataTable DataTable: $fx'=varResultsTable' RowToAdd: $fx'[${CurrentItem.Name}, , Blank - needs OCR]'
    END
END

SET varCurrentMessage TO $fx'Batch complete: ${varProcessedCount} PDF(s) processed'
**ENDREGION
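
Once the batch finishes, the Status column is what drives the follow-up routing. The Python sketch below shows that partitioning step on rows shaped like varResultsTable (FileName, ExtractedText, Status). It is an illustration only: the function name and the sample rows are hypothetical, and in a real flow the same split would be done in PAD over the DataTable.

```python
def split_batch_results(rows):
    """Partition result rows into extracted files and files that need OCR."""
    extracted, needs_ocr = [], []
    for row in rows:
        # Anything not marked 'Success' is treated as needing OCR.
        if row["Status"] == "Success":
            extracted.append(row["FileName"])
        else:
            needs_ocr.append(row["FileName"])
    return extracted, needs_ocr


# Made-up rows mirroring the three columns of the results table.
results = [
    {"FileName": "inv_001.pdf", "ExtractedText": "Invoice 001 Total 120.00", "Status": "Success"},
    {"FileName": "scan_002.pdf", "ExtractedText": "", "Status": "Blank - needs OCR"},
    {"FileName": "inv_003.pdf", "ExtractedText": "Invoice 003 Total 75.10", "Status": "Success"},
]
extracted_files, ocr_files = split_batch_results(results)
```

Keeping the status text identical to the value written by the batch script ("Blank - needs OCR") avoids a silent mismatch when filtering.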

Variable Reference Summary

Variable	Type	Used In	Purpose
`varPdfPath`	Text	Patterns 1–4	Full path to PDF file
`varCurrentMessage`	Text	All patterns	Logging message for Subflow_Logging
`varExtractedText`	Text	Pattern 1	Full extracted text output
`varPageRange`	Text	Pattern 2	Page selection string (e.g., "1-3")
`varTempPagesPdf`	File	Pattern 2	Temp extracted pages PDF
`varPageText`	Text	Pattern 2	Text from selected pages
`varExtractedTables`	List of PDF table info	Pattern 3	Tables extracted from the PDF
`varImageOutputFolder`	Text	Pattern 4	Folder for extracted images
`varImagePrefix`	Text	Pattern 4	Naming prefix for images
`varPdfFolder`	Text	Pattern 5	Folder of PDFs for batch processing
`varPdfFiles`	List of Files	Pattern 5	PDF files found in folder
`varResultsTable`	DataTable	Pattern 5	Batch results with filename and text
`CurrentItem`	File	Pattern 5	Current PDF in batch loop
`varItemText`	Text	Pattern 5	Text from current PDF
`varProcessedCount`	Numeric	Pattern 5	PDFs processed counter

Notes

DetectLayout: True vs False. True preserves the visual layout, including column spacing, which is essential for invoices, forms, and tabular documents. False produces a flat text stream, which suits contracts, letters, and narrative content. Default to True unless you know the content is non-tabular.
Empty text = scanned PDF. If ExtractText returns blank, the PDF most likely contains scanned page images with no text layer. Use the Scanned PDF OCR Pipeline (extract images → OCR each image → stitch text) as a fallback.
ExtractTables (Pattern 3) works best with PDFs that have clearly defined table borders. For borderless tables or loosely structured tabular data, ExtractText with DetectLayout: True plus text parsing (Multi-Line Text Block Parser) is more reliable.
Page range via ExtractPages (Pattern 2). Since PageSelection may not be a valid parameter on ExtractText, Pattern 2 uses the two-step approach: ExtractPages to split, then ExtractText on the result. Both actions, with the parameters used there, are in the verified reference.
Batch processing (Pattern 5) logs "Blank - needs OCR" for scanned PDFs so you can route them to the OCR pipeline separately.
Error handling is deliberately simple. Each action is followed by a single ON ERROR retry block (REPEAT 1 TIMES WAIT 2); SET and IF statements have no ON ERROR block.

Dependencies

Scanned PDF OCR Pipeline
Invoice Data Extractor
Multi-Line Text Block Parser
