In addition to Chunkr’s default models, we provide a pipeline interface that lets you use Azure Document Intelligence as a provider. With Azure, your files are processed through the Azure layout analysis model, the Azure OCR model, and the Azure table OCR model instead of the default models.

You can still leverage Chunkr’s intelligent chunking and segment processing. The output will be mapped to the Chunkr output format.

When to use Azure

Pros

  • If our queue is full, you can use Azure to process your files.
  • If you don’t need VLMs on your tables, you can use the Azure table OCR model, which returns results much faster.
  • Better OCR than our default models (we are working on it!)

Cons

  • The Azure layout model is not as good as the one we use in Chunkr.
  • The Azure layout model does not support the segment types ListItem and Formula.
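
The trade-offs above can be turned into a small decision helper. The sketch below is illustrative only: `missing_under_azure` and `suggest_pipeline` are hypothetical names that are not part of `chunkr_ai`, and the returned strings `"azure"` and `"default"` are placeholder labels that you would map to `Pipeline.AZURE` or to the default configuration yourself. It assumes that needing a segment type Azure cannot produce outweighs the speed and availability benefits.

```python
# Hypothetical helpers, not part of chunkr_ai. They only encode the
# pros and cons listed above.

# Segment types the Azure layout model does not support (from the Cons list).
AZURE_UNSUPPORTED_SEGMENT_TYPES = frozenset({"ListItem", "Formula"})


def missing_under_azure(required_segment_types):
    """Return the required segment types that Azure cannot produce, sorted."""
    return sorted(set(required_segment_types) & AZURE_UNSUPPORTED_SEGMENT_TYPES)


def suggest_pipeline(required_segment_types, queue_full=False, needs_vlm_tables=True):
    """Suggest "azure" or "default" (illustrative labels, not library values)."""
    # Assumption: a missing segment type rules Azure out entirely.
    if missing_under_azure(required_segment_types):
        return "default"
    # Azure helps when the queue is full or when tables do not need VLMs.
    if queue_full or not needs_vlm_tables:
        return "azure"
    return "default"
```

For instance, `suggest_pipeline(["Table"], queue_full=True)` suggests Azure, while any request that includes `"Formula"` or `"ListItem"` falls back to the default models regardless of queue state.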

Example

  1. Use default segment processing and chunking with the Azure layout analysis model and OCR model.
from chunkr_ai import Chunkr
from chunkr_ai.models import (
    Configuration,
    Pipeline
)

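# Create the Chunkr client
chunkr = Chunkr()

# Upload the file and process it with the Azure pipeline
# instead of the default models
chunkr.upload("path/to/file", Configuration(
    pipeline=Pipeline.AZURE
))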
  2. Use default chunking with the Azure layout analysis model, OCR model, and table OCR model. In this case, the HTML and Markdown for the Table segment are generated by the Azure table OCR model.
from chunkr_ai import Chunkr
from chunkr_ai.models import (
    Configuration,
    GenerationConfig,
    GenerationStrategy,
    SegmentProcessing,
    Pipeline
)

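# Create the Chunkr client
chunkr = Chunkr()

chunkr.upload("path/to/file", Configuration(
    segment_processing=SegmentProcessing(
        # With AUTO, the Table HTML and Markdown are generated
        # by the Azure table OCR model
        Table=GenerationConfig(
            html=GenerationStrategy.AUTO,
            markdown=GenerationStrategy.AUTO
        ),
    ),
    # Process the file with the Azure pipeline
    pipeline=Pipeline.AZURE,
))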