PDF Parser
This Engine parses information from PDF documents—including tables, charts, and images—into our predefined structure.
PDF Parser Engine Inputs
PDF Parser Configuration
The PDF Parser Engine configuration takes three parameters:
- pdf_file: required. The PDF document to parse.
- page_filter: optional. A string describing which PDF pages to consider during parsing. If omitted, all pages are considered.
- features: required. Defines what to parse from your PDFs. Each option and its output format:
  - pages -> Markdown
  - tables -> CSV-like format
  - charts -> CSV-like format
  - logos -> array of strings identifying the logos
  - photos -> text description
See Template Strings for dynamic parameter configuration.
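As a rough illustration of the parameters above, the sketch below models a configuration as a plain Python dict and checks it against the rules stated in this section (pdf_file and features are required, page_filter is optional). The dict layout, the `ALLOWED_FEATURES` set, and the `validate_config` helper are assumptions made for this example, not part of the platform's actual schema or API.

```python
# Illustrative only: the dict layout and validate_config are assumptions,
# not the platform's actual schema or API.
ALLOWED_FEATURES = {"pages", "tables", "charts", "logos", "photos"}


def validate_config(config: dict) -> list[str]:
    """Return a list of problems found in a PDF Parser configuration sketch."""
    problems = []
    if not config.get("pdf_file"):
        problems.append("pdf_file is required")
    features = config.get("features") or []
    if not features:
        problems.append("features is required")
    unknown = sorted(set(features) - ALLOWED_FEATURES)
    if unknown:
        problems.append(f"unknown features: {', '.join(unknown)}")
    return problems


config = {
    "pdf_file": "$pdf_file",  # template string, filled in when a job is created
    "page_filter": "",        # optional; left empty so all pages are considered
    "features": ["pages", "tables", "logos"],
}

print(validate_config(config))  # prints []
```

An empty list means the sketch found no problems; a missing pdf_file or an unrecognized feature name would each add one message.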
PDF Parser Output
The output is always a JSON value containing the features selected in the Engine configuration.
PDF Parser Example
Let’s run through an example using this engine together.
Create an Agent
Click on the “Add Agent” button in the top right corner of the Agents page.
Enter a name and an optional description of your Agent.
Select the PDF Parser Engine
Configure the engine as follows:
- pdf_file: $pdf_file
- page_filter: Remove $page_filter and leave this empty
- features: Check all features
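To make the role of the `$pdf_file` placeholder concrete, here is a minimal sketch of the general idea behind template strings: a placeholder in the configuration is replaced by the matching Agent input when a job runs. It uses Python's standard `string.Template`, which happens to share the `$name` syntax. This is only an analogy under that assumption, not the platform's implementation, and the file name is a made-up placeholder.

```python
from string import Template

# Configuration values as entered in the walkthrough (dict layout is hypothetical).
config_templates = {
    "pdf_file": "$pdf_file",  # placeholder resolved from the Agent inputs
    "page_filter": "",        # left empty, so there is nothing to substitute
}

# Hypothetical job inputs; the file name is a placeholder.
job_inputs = {"pdf_file": "example.pdf"}

# Substitute each $name placeholder with the matching job input.
resolved = {
    name: Template(value).substitute(job_inputs)
    for name, value in config_templates.items()
}

print(resolved)  # {'pdf_file': 'example.pdf', 'page_filter': ''}
```

Because page_filter no longer contains a placeholder, the Agent only asks for pdf_file as an input.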
Your configuration should look like the image below.
Create the Agent
Hit the Create button. Now, let’s run it on a PDF file through the UI.
View the Agent you just created
Create a new Agent job
Download the PDF from the default Roe Datasets
Click on Resources in the sidebar and then Roe Datasets.
Click on View dataset for the pdf-parsing-example dataset.
Click on Download to download the file.
Fill in the Agent inputs
pdf_file: Upload the file you just downloaded.
Run the job
Hit the Create button at the bottom to start the PDF parsing job.
View the Results
Click View on the corresponding job to see its status and results.
Scroll down the Agent Job Details page and you’ll see the job outputs.
For each page, the output contains every feature selected in the Engine configuration. For example, because logos was selected, the JSON output for the first page includes the “Pinterest” logo, which appears on the first page of the PDF.
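If you want to post-process the job output in code, the per-page results can be walked with ordinary JSON tooling. The sketch below assumes a hypothetical layout (a list of page objects with `page`, `logos`, `tables`, and `photos` keys); the real field names and nesting may differ, so check them against the output shown on the Agent Job Details page.

```python
import json

# Hypothetical output: the real field names and nesting may differ.
raw_output = """
[
  {"page": 1, "logos": ["Pinterest"], "tables": [], "photos": []},
  {"page": 2, "logos": [], "tables": [], "photos": []}
]
"""

pages = json.loads(raw_output)


def logos_by_page(pages: list[dict]) -> dict[int, list[str]]:
    """Map each page number to its reported logos, skipping pages with none."""
    return {p["page"]: p["logos"] for p in pages if p.get("logos")}


print(logos_by_page(pages))  # {1: ['Pinterest']}
```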