Supported segment types for Excels

The Excel parser maps identified elements to Chunkr’s existing segment types, ensuring seamless integration with your current workflow. The following segment types are supported:
Segment TypeDescription
TableRectangular data structures with headers and rows
SectionHeaderHeaders that divide content into sections
TitleMain document or sheet titles
TextRegular text content
PictureCharts, graphs, images, logos
FootnoteReferences or additional information at bottom
PageHeaderContent at the top of each page
PageFooterContent at the bottom of each page

Excel Segment Output Structure

When processing Excel files, Chunkr returns segments with both standard properties (common to all document types) and Excel-specific metadata. Understanding this structure is crucial for working with Excel parsing results effectively.

Standard Segment Properties

Each Excel segment contains the core properties found in all Chunkr outputs:
{
  "content": "Generated content in HTML/Markdown format",
  "text": "OCR-extracted raw text", 
  "segment_type": "Table|Text|Title|SectionHeader|Picture|etc",
  "bbox": {"left": 100, "top": 150, "width": 300, "height": 150},
  "page_number": 1,
  "page_width": 1024,
  "page_height": 768,
  "image": "https://presigned-url-to-cropped-segment-image"
}

Excel-Specific Properties: The ss_ Keys

The key difference between Excel outputs and regular document outputs are the ss_ (spreadsheet-specific) keys. These properties provide essential Excel context that doesn’t exist in PDF or other document formats:
PropertyTypeDescription
ss_sheet_nameStringName of the worksheet containing this segment
ss_rangeStringExcel cell range (e.g., “A1:D10”)
ss_cellsArrayArray of Cell objects with detailed cell data
ss_header_rangeStringRange of the header cells (if applicable)
ss_header_textStringText content of the header (if applicable)
ss_header_bboxObjectBounding box of the header (if applicable)
ss_header_ocrArrayOCR results for the header (if applicable)

Why Excel Needs Special Keys

The ss_ fields exist to work with native Excel values and maintain the mapping between identified content and the original Excel structure. These fields only exist for spreadsheet files. The ss_ keys hold state for what was identified and map it back to Excel cells and ranges, enabling you to:
  1. Map to original Excel cells: Access the exact cells (ss_cells) that were identified as part of each segment
  2. Preserve Excel ranges: Know the exact cell range (ss_range) each segment corresponds to
  3. Track sheet context: Identify which worksheet (ss_sheet_name) the content came from
  4. Maintain Excel structure: Work with native Excel data like formulas, cell values, and styling

Complete Excel Segment Example

Here’s what a complete Excel Table segment looks like:
{
  "content": "<table><tr><th>Product</th><th>Q1 Revenue</th><th>Q2 Revenue</th></tr><tr><td>Widget A</td><td>$50,000</td><td>$75,000</td></tr><tr><td>Widget B</td><td>$30,000</td><td>$45,000</td></tr></table>",
  "text": "Product Q1 Revenue Q2 Revenue Widget A $50,000 $75,000 Widget B $30,000 $45,000",
  "html": "<table><tr><th>Product</th><th>Q1 Revenue</th><th>Q2 Revenue</th></tr><tr><td>Widget A</td><td>$50,000</td><td>$75,000</td></tr><tr><td>Widget B</td><td>$30,000</td><td>$45,000</td></tr></table>",
  "markdown": "| Product | Q1 Revenue | Q2 Revenue |\n|---------|------------|------------|\n| Widget A | $50,000 | $75,000 |\n| Widget B | $30,000 | $45,000 |",
  "segment_type": "Table",
  "bbox": {"left": 100, "top": 150, "width": 400, "height": 100},
  "page_number": 1,
  "page_width": 1024,
  "page_height": 768,
  "image": "https://api.chunkr.ai/presigned-url...",
  "segment_id": "uuid-segment-id",
  "confidence": 0.95,
  
  // Excel-specific properties
  "ss_sheet_name": "Q2 Sales Data",
  "ss_range": "A3:C5",
  "ss_cells": [
    {
      "cell_id": "uuid-cell-1",
      "text": "Product",
      "range": "A3",
      "formula": null,
      "value": "Product",
      "hyperlink": null,
      "style": {"is_bold": true}
    },
    {
      "cell_id": "uuid-cell-2", 
      "text": "Widget A",
      "range": "A4",
      "formula": null,
      "value": "Widget A",
      "hyperlink": null,
      "style": null
    }
  ],
  "ss_header_range": "A3:C3",
  "ss_header_text": "Product Q1 Revenue Q2 Revenue",
  "ss_header_bbox": {"left": 100, "top": 150, "width": 400, "height": 25}
}

Excel Chart Example

Charts and graphs in Excel are identified as Picture segments with spreadsheet context:
{
  "content": "Sales trend chart showing quarterly growth from Q1 to Q4",
  "text": "",
  "segment_type": "Picture", 
  "bbox": {"left": 450, "top": 100, "width": 350, "height": 300},
  "page_number": 1,
  "image": "https://api.chunkr.ai/chart-image...",
  
  // Excel-specific properties  
  "ss_sheet_name": "Dashboard",
  "ss_range": "E1:K20",
  "ss_cells": []
}

Accessing Excel Metadata in Code

for chunk in task.output.chunks:
    for segment in chunk.segments:
        # Standard properties
        print(f"Content: {segment.content}")
        print(f"Type: {segment.segment_type}")
        
        # Excel-specific properties (only populated for Excel files)
        if segment.ss_sheet_name:
            print(f"Sheet: {segment.ss_sheet_name}")
            print(f"Range: {segment.ss_range}")
            
        # Access individual cells
        if segment.ss_cells:
            for cell in segment.ss_cells:
                print(f"Cell {cell.range}: {cell.text}")
                if cell.formula:
                    print(f"Formula: {cell.formula}")
        
        # Access header information if available
        if segment.ss_header_text:
            print(f"Header: {segment.ss_header_text}")
This Excel-specific metadata makes Chunkr’s Excel parser uniquely powerful for AI applications that need to understand and work with spreadsheet structure and context.