Document Parser Engine Inputs

Document Parser Configuration
- pdf_file: required. The PDF document to parse.
- page_filter: optional. A string describing which PDF pages to consider during parsing. If omitted, all pages are considered.
- features: required. Defines what you want to parse from your PDFs. Options include:
  - pages -> Markdown
  - tables -> CSV-like format
  - charts -> CSV-like format
  - logos -> array of strings identifying the logos
  - photos -> text description
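
As a rough illustration of the rules above, the sketch below checks a configuration for the two required fields and for unknown feature names. The dictionary shape, the `validate_parser_config` helper, and the `report.pdf` file name are assumptions made for this example; in this walkthrough the engine is configured through the UI, not through code.

```python
# Feature names taken from the option list above.
ALLOWED_FEATURES = {"pages", "tables", "charts", "logos", "photos"}


def validate_parser_config(config: dict) -> list:
    """Return a list of problems found in a hypothetical parser config."""
    problems = []
    # pdf_file is required.
    if not config.get("pdf_file"):
        problems.append("pdf_file is required")
    # features is required and must only contain known options.
    features = config.get("features") or []
    if not features:
        problems.append("features is required")
    unknown = sorted(set(features) - ALLOWED_FEATURES)
    if unknown:
        problems.append("unknown features: " + ", ".join(unknown))
    # page_filter is optional, so it is not checked here.
    return problems


# Hypothetical configuration: parse pages and logos from every page.
example = {"pdf_file": "report.pdf", "features": ["pages", "logos"]}
print(validate_parser_config(example))
```

An empty list means the configuration satisfies the rules listed above; it says nothing about whether the platform itself would accept it.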
Document Parser Output
The output is always a JSON value containing the features selected in the Engine configuration.

Document Parser Example
Let’s run through an example using this engine together.

1. Create an Agent
Click the “Add Agent” button in the top right corner of the Agents page.
Enter a name and an optional description for your Agent.

2. Select the Document Parser Engine

3. Configure the engine as follows
Note that a leading $ starts a template string.
- pdf_file: $pdf_file
- page_filter: Remove $page_filter and leave this empty
- features: Check all features
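
To give a feel for what a `$`-prefixed template string does, here is a small analogy using Python's standard `string.Template`, which happens to use the same `$name` syntax. This is not the platform's implementation. The `resolve` helper and the `pdf-parsing-example.pdf` file name are hypothetical, and the sketch assumes the placeholder is filled from an Agent input of the same name when a job runs.

```python
from string import Template

# String fields from the engine configuration above.
# page_filter is left empty, as in the walkthrough.
engine_config = {"pdf_file": "$pdf_file", "page_filter": ""}

# Hypothetical job inputs supplied when the Agent job is created.
job_inputs = {"pdf_file": "pdf-parsing-example.pdf"}


def resolve(config: dict, inputs: dict) -> dict:
    """Substitute $name placeholders in each string field with job inputs."""
    return {key: Template(value).substitute(inputs) for key, value in config.items()}


resolved = resolve(engine_config, job_inputs)
print(resolved)
```

`substitute` raises a `KeyError` when a placeholder has no matching input, which mirrors the idea that every template variable needs a value before the job can run.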

4. Create the Agent
Hit the Create button. Now, let’s run it on a PDF file through the UI.

5. View the Agent you just created

6. Create a new Agent job

7. Download the PDF from the default Roe Datasets
Click on Resources in the sidebar and then Roe Datasets.
Click on View dataset for the pdf-parsing-example dataset.
Click on Download to download the file.

8. Fill in the Agent inputs
pdf_file: Upload the file you just downloaded.

9. Run the job
Hit the Create button at the bottom to start the PDF parsing job.

10. View the Results
Click View on the respective job to see its status and results.
Scroll down the Agent Job Details page to find the job outputs.
For each page, the output contains every feature selected during the Engine configuration. For example, because logos was selected, the JSON output for the first page includes the “Pinterest” logo, which appears on the first page of the PDF.
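
If you copy the job output out of the UI, a few lines of Python are enough to pull one feature out of it. The sketch below assumes the output is a JSON array with one object per page, keyed by feature name and carrying a `page` number. That schema and the `logos_by_page` helper are illustrative guesses, not the documented format, so adjust the keys to match what your job returns. The sample contains only the first page.

```python
import json

# Illustrative sample only: the real output schema may differ.
sample_output = """
[
  {"page": 1, "logos": ["Pinterest"]}
]
"""


def logos_by_page(raw: str) -> dict:
    """Map each page number to the list of logos reported for that page."""
    pages = json.loads(raw)
    # Pages without a "logos" key are treated as having no logos.
    return {entry["page"]: entry.get("logos", []) for entry in pages}


result = logos_by_page(sample_output)
print(result)
```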

