High-Level Structure
Parsing a document returns a Task object. When processing is successful, the output field contains the HTML/Markdown representation of your document. The core of this output is a list of chunks, which are composed of individual segments.
Chunks and Segments
The document is first broken down into segments, which represent individual semantic elements like a paragraph, table, or title. These segments are then grouped into chunks based on your chunking configuration.
- Segments: The smallest building blocks. Each segment corresponds to a single, identified element from the source document.
- Chunks: A logical grouping of one or more segments. Each chunk includes chunk_length, content, embed, and segments[]. For RAG applications, chunks are the units of information that are typically embedded and retrieved.
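The chunk and segment hierarchy can be walked with a few lines of code. The sketch below assumes the task output has already been loaded into a plain Python dict. The "chunks" key and the "segment_type" key are assumptions made for illustration; chunk_length, content, embed, and segments are the chunk fields listed above.

```python
from typing import Any


def summarize_chunks(output: dict[str, Any]) -> list[dict[str, Any]]:
    """Return one summary row per chunk: its length and its segment types."""
    rows = []
    # "chunks" is an assumed key name for the list of chunks in the output.
    for index, chunk in enumerate(output.get("chunks", [])):
        segments = chunk.get("segments", [])
        rows.append({
            "chunk": index,
            "chunk_length": chunk.get("chunk_length"),
            # "segment_type" is a hypothetical key holding the segment's type.
            "segment_types": [s.get("segment_type", "Unknown") for s in segments],
        })
    return rows


# Hand-written sample data, not real API output.
sample_output = {
    "chunks": [
        {
            "chunk_length": 12,
            "content": "# Quarterly Report\n\nRevenue grew.",
            "embed": "# Quarterly Report\n\nRevenue grew.",
            "segments": [
                {"segment_type": "Title", "content": "# Quarterly Report"},
                {"segment_type": "Text", "content": "Revenue grew."},
            ],
        }
    ]
}

for row in summarize_chunks(sample_output):
    print(row)
```

Check the key names against an actual response before relying on them.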
Each segment has one of the following types: Caption, Footnote, Formula, FormRegion, GraphicalItem, Legend, LineNumber, ListItem, Page, PageFooter, PageHeader, PageNumber, Picture, Table, Text, Title, Unknown, and SectionHeader.
By default, no segments are ignored. To change this behaviour you can
adjust your configuration to ignore specific segments like headers and footers.
Key Output Fields
Each segment object contains rich information. At the chunk level, corresponding fields are concatenated from all of the segments within that chunk. Here are the most important fields:
1. content
The content field holds the primary, structured representation of the segment. Each segment is formatted based on its type.
- Tables: Converted to HTML to maintain complex col/row-span structure.
- Images: Converted to a robust markdown description, with charts/graphs including a tabular representation.
- Forms and Legends: Converted into a key-value markdown table.
- Formulas: Converted to LaTeX strings for perfect mathematical representation. They can even be embedded within an HTML table if a formula appears inside a cell.
- Text-type: Text-heavy segments like title, section headers, list-items and text blocks are converted into markdown.
2. embed
The embed field provides the clean, RAG-optimized text that should be used for generating embeddings.
- It includes the content and, if present, the description. This helps optimize the table segments without contaminating the content field.
- This is the field used for calculating token counts when chunking, ensuring chunks fit your target length.
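As a rough illustration of using embed for retrieval, the sketch below collects the embed text of each chunk into records ready for an embedding model. It assumes chunks are plain dicts with the embed and chunk_length fields described above. The function name and the target_length parameter are hypothetical, and the call to an embedding model is deliberately left out.

```python
from typing import Any


def embedding_records(chunks: list[dict[str, Any]], target_length: int) -> list[dict[str, Any]]:
    """Build one record per chunk from its embed text, skipping empty chunks."""
    records = []
    for index, chunk in enumerate(chunks):
        text = chunk.get("embed") or ""
        if not text.strip():
            continue  # nothing to embed for this chunk
        length = chunk.get("chunk_length") or 0
        records.append({
            "id": index,
            "text": text,
            # Flag chunks whose reported length exceeds the assumed target.
            "over_target": length > target_length,
        })
    return records


# Hand-written sample data, not real API output.
sample_chunks = [
    {"embed": "Revenue by region (table description and rows)", "chunk_length": 600},
    {"embed": "", "chunk_length": 0},
    {"embed": "Introduction paragraph.", "chunk_length": 40},
]

for record in embedding_records(sample_chunks, target_length=512):
    print(record["id"], record["over_target"], record["text"])
```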
3. bbox (Bounding Box)
Every segment includes a precise bounding box (bbox) that pinpoints its exact location on the original page. This is essential for building applications that require citations or highlighting.
- The coordinates (left, top, width, height) are pixel (px) values in the page coordinate space. For resolution-independent rendering, normalize them using the page dimensions: for example, left_pct = left / page_width, top_pct = top / page_height, and likewise for width/height. Use page_width/page_height on the segment (or the page's pg_width/pg_height).
- The dpi in the pages array describes the pixel resolution of the rendered page image. You do not need dpi when using normalized percentages; it is helpful only when mapping directly to a specific raster image in pixels or when generating images at a different scale.
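The normalization described above can be sketched as follows. The bbox keys (left, top, width, height) and the page dimensions come from the text; the function names and the output key names (left_pct and so on) are illustrative choices, not part of the API.

```python
def normalize_bbox(bbox: dict, page_width: float, page_height: float) -> dict:
    """Convert a pixel bbox into fractions of the page size."""
    if page_width <= 0 or page_height <= 0:
        raise ValueError("page dimensions must be positive")
    return {
        "left_pct": bbox["left"] / page_width,
        "top_pct": bbox["top"] / page_height,
        "width_pct": bbox["width"] / page_width,
        "height_pct": bbox["height"] / page_height,
    }


def to_image_pixels(norm: dict, image_width: int, image_height: int) -> dict:
    """Map a normalized bbox onto an image rendered at any size."""
    return {
        "left": round(norm["left_pct"] * image_width),
        "top": round(norm["top_pct"] * image_height),
        "width": round(norm["width_pct"] * image_width),
        "height": round(norm["height_pct"] * image_height),
    }


# Hand-written sample values: a 1000 x 2000 px page shown at half size.
segment_bbox = {"left": 100, "top": 200, "width": 300, "height": 50}
normalized = normalize_bbox(segment_bbox, page_width=1000, page_height=2000)
print(normalized)
print(to_image_pixels(normalized, image_width=500, image_height=1000))
```

Because the normalized values are fractions of the page, the same bbox can be drawn on a thumbnail or a full-resolution render without consulting dpi.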
Spreadsheet-Specific Outputs (ss_*) (Preview)
This feature is in preview. Occasionally, extremely large spreadsheets can
fail. In that case, we still return HTML, but layout analysis is not
performed.

A financial spreadsheet showing intelligent segmentation of tables and charts.
For spreadsheet files (.xlsx, .xls), the output includes additional ss_*-prefixed fields that provide native Excel context. These fields exist alongside the standard content, embed, and bbox fields, enriching each segment with its precise location and native data from the original spreadsheet.
Key Fields
- ss_range: The cell range for the segment in A1 notation (e.g., A1:D10).
- ss_cells: A detailed array of each cell in the segment, including its original formula, value, text and styling. This lets you see both the raw formula (=SUM(B2:B10)) and its calculated result ($55,000).
- ss_header_*: Fields identifying the detected header for a table, such as ss_header_range, ss_header_text, ss_header_bbox, and ss_header_ocr. Headers are associated with their table even if they are not directly adjacent to it.
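To work with ss_range programmatically, the A1 notation usually needs to be turned into numeric row and column bounds. The following is a minimal sketch that assumes the range is a plain reference such as A1:D10 or a single cell, without a sheet-name prefix. The function names are illustrative.

```python
import re

# One cell reference: optional "$", column letters, optional "$", row digits.
_CELL = re.compile(r"^\$?([A-Za-z]+)\$?([0-9]+)$")


def column_index(letters: str) -> int:
    """Convert column letters to a 1-based index (A=1, Z=26, AA=27)."""
    index = 0
    for ch in letters.upper():
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


def parse_cell(ref: str) -> tuple[int, int]:
    """Return (row, column) as 1-based integers for a single cell reference."""
    match = _CELL.match(ref.strip())
    if match is None:
        raise ValueError(f"not an A1 cell reference: {ref!r}")
    return int(match.group(2)), column_index(match.group(1))


def parse_range(ss_range: str) -> tuple[int, int, int, int]:
    """Return (first_row, first_col, last_row, last_col) for an A1 range."""
    start, _, end = ss_range.partition(":")
    first = parse_cell(start)
    last = parse_cell(end) if end else first  # single-cell range
    return first[0], first[1], last[0], last[1]


print(parse_range("A1:D10"))
print(parse_range("AA3"))
```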
- Create Interactive Experiences: Use ss_range to build native citation experiences that let users click data and jump to the precise source cell in a viewer.
- Get Cleaner LLM Context: Combine layout analysis with precise cell data to identify tables, associate headers, and filter out irrelevant cells. This provides cleaner, more meaningful context for LLM processing.
- Build Powerful Spreadsheet Agents: Use the ss_* fields to build AI agents that can read, analyze, and even write back to spreadsheets. Understanding cell formulas and values enables agents to automate tasks like updating financial models, correcting entries, or adding new rows.
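As one possible way to prepare ss_cells for an LLM or agent, the sketch below keeps only the cells that carry a formula and pairs each formula with its displayed result. The per-cell key names ("formula", "value", "text") are assumptions based on the field description above, so verify them against a real response.

```python
from typing import Any


def formula_lines(ss_cells: list[dict[str, Any]]) -> list[str]:
    """Return 'formula -> displayed result' lines for cells that have a formula."""
    lines = []
    for cell in ss_cells:
        formula = cell.get("formula")  # hypothetical key name
        if not formula:
            continue  # plain value cell, not relevant here
        # Prefer the formatted text; fall back to the raw value.
        shown = cell.get("text") or str(cell.get("value", ""))
        lines.append(f"{formula} -> {shown}")
    return lines


# Hand-written sample data, not real API output.
sample_cells = [
    {"value": "Total", "text": "Total"},
    {"formula": "=SUM(B2:B10)", "value": 55000, "text": "$55,000"},
]

print("\n".join(formula_lines(sample_cells)))
```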
Looking for additional output fields? See the Advanced Outputs section below
for more metadata options.
Advanced Outputs
Beyond the key fields discussed above, the Parse output is enriched with a variety of other useful metadata at the file, page, and segment levels. Here are some of the most valuable advanced fields:
- Word-Level Bounding Boxes: Included in the ocr array for each page, this provides the precise coordinates for every single word detected by the OCR process. This is ideal for building applications that require highlighting specific words or phrases in a document viewer.
- Cropped Segment Images: Each segment object contains an image field with a URL to a cropped image of just that segment. This is useful for providing visual context to an LLM or displaying the source of a specific chunk of text.
- File & Page Metadata: The top-level output object contains file-level metadata like the original file_name, mime_type, and page_count. Additionally, the pages array contains detailed information for each page, including a full-page image URL, dimensions (page_width, page_height), and DPI.