Chunking is the process of grouping segments into logical chunks. Our segmentation models can produce a hierarchy of segments, and we can use that to create chunks.

The exact strategy depends on the end application and use case, but the general goal is to group segments so that each chunk preserves the context of the information it contains.

We offer an intelligent chunking algorithm, but you can also turn it off to receive individual segments and handle chunking yourself.

Configuration

You can configure intelligent chunking by setting the target_chunk_length parameter. This is the approximate number of words a chunk can contain.
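
As a rough illustration, the setting could be expressed as a small configuration fragment. This is only a sketch: target_chunk_length is the one name taken from this page, the value 512 is an arbitrary example, and how the fragment is passed to the service (request body, SDK argument, and so on) is not specified here.

```python
import json

# Only the key name target_chunk_length comes from this page.
# The value 512 is an arbitrary example, not a documented default.
config = {"target_chunk_length": 512}

# Serialize the fragment, for instance to embed it in a request body.
print(json.dumps(config))
```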

Intelligent Chunking

The chunking algorithm works as follows:

  1. Remove headers and footers
  2. Add segments to a chunk until we hit a breaking condition or the chunk length reaches target_chunk_length (chunk length >= target_chunk_length).

Breaking Conditions

We walk down the segment hierarchy (Title -> Section header -> Other). When the next segment has a segment_type that is higher in the hierarchy than the current segment's type, we break the chunk. For example, a Section header that follows body text ends the current chunk.
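
The steps and the breaking condition described above can be sketched in a few lines of Python. This is one possible reading of the description, not the service's actual implementation. The Segment class, the type names "Page header" and "Page footer", the helper names, and counting words by splitting on whitespace are all assumptions made for illustration. The sketch also assumes that a target of 0 returns every segment untouched, one per chunk.

```python
from dataclasses import dataclass

# Assumed ranks: a lower number means higher in the hierarchy.
HIERARCHY = {"Title": 0, "Section header": 1}
OTHER_RANK = 2  # every other segment type
DROPPED_TYPES = {"Page header", "Page footer"}  # assumed type names


@dataclass
class Segment:  # hypothetical stand-in for a real segment
    segment_type: str
    content: str


def rank(segment):
    return HIERARCHY.get(segment.segment_type, OTHER_RANK)


def word_count(chunk):
    return sum(len(s.content.split()) for s in chunk)


def chunk_segments(segments, target_chunk_length):
    # Assumption: 0 disables chunking and yields one segment per chunk.
    if target_chunk_length == 0:
        return [[s] for s in segments]

    chunks, current = [], []
    for seg in segments:
        # Step 1: remove headers and footers.
        if seg.segment_type in DROPPED_TYPES:
            continue
        # Breaking condition: the next segment is higher in the
        # hierarchy than the last segment of the current chunk.
        if current and rank(seg) < rank(current[-1]):
            chunks.append(current)
            current = []
        # Step 2: add the segment, then close the chunk once it is long enough.
        current.append(seg)
        if word_count(current) >= target_chunk_length:
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return chunks


segments = [
    Segment("Page header", "Acme"),
    Segment("Title", "Report"),
    Segment("Section header", "Intro"),
    Segment("Text", "one two three"),
    Segment("Section header", "Methods"),
    Segment("Text", "four five"),
]
for chunk in chunk_segments(segments, target_chunk_length=100):
    print([s.segment_type for s in chunk])
# ['Title', 'Section header', 'Text']
# ['Section header', 'Text']
```

With a large target, only the hierarchy decides where chunks end: the second Section header outranks the Text segment before it, so it begins a new chunk. With a small target, the length check closes chunks earlier.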

Turning it off

Setting target_chunk_length to 0 turns off intelligent chunking, so each chunk contains exactly one segment. See the chunk model documentation to learn more about the chunk model.