Chunking

Chunking is configured with two parameters:

- `target_length`: the target chunk length, measured in tokens (e.g. `4096`)
- `tokenizer`: the tokenizer used to measure that length (e.g. `Cl100kBase`)

The available tokenizers are listed below, followed by a short configuration sketch:

- `Word`: Split by words
- `Cl100kBase`: For OpenAI models (e.g. GPT-3.5, GPT-4, text-embedding-ada-002)
- `XlmRobertaBase`: For RoBERTa-based multilingual models
- `BertBaseUncased`: BERT base uncased tokenizer
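As a rough illustration, the two settings might be combined like this. Only `target_length` and `tokenizer` come from the options above; the `chunk_processing` name and the surrounding dictionary shape are assumptions made for the sketch:

```python
# Hypothetical configuration payload: only target_length and tokenizer are
# documented above; the dictionary name and shape are illustrative.
chunk_processing = {
    "target_length": 4096,       # target chunk size in tokens
    "tokenizer": "Cl100kBase",   # tokenizer used to measure chunk length
}
```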
Each chunk and segment object includes an `embed` field. This field contains the text that is compared against the target length.

The content of the `embed` field is calculated automatically based on whether a description is generated for a segment type (a short sketch of this rule follows the list):

- If `description` is set to `true` for a segment type, the `embed` field includes both the `content` and the generated `description`.
- If `description` is set to `false` or not specified, the `embed` field contains only the `content`.
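A minimal sketch of this rule, assuming hypothetical `content` and `description` values for a segment (the function name and the way the two strings are joined are illustrative, not part of the API):

```python
def build_embed(content: str, description: str | None, generate_description: bool) -> str:
    """Return the text stored in a segment's embed field.

    When a description is generated for this segment type, the embed text
    combines the content and the description; otherwise it is the content alone.
    """
    if generate_description and description:
        return f"{content}\n\n{description}"  # content + generated description
    return content                            # content only
```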
Putting this together (a tokenization sketch follows the list):

- Segments with `description: true`, whether set explicitly or as the default for that segment type, contribute both their content and the generated description to their `embed` field; segments where `description` is not specified contribute only their content.
- The tokenizer is `CL100K_BASE` by default.
- Each segment's `embed` field therefore holds either content only, or content plus description.
- The `embed` fields of all segments in a group are concatenated to form the chunk's `embed` field.
- `chunk_length` represents the tokenized length of this concatenated content.
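The following sketch shows how a chunk's `embed` field and `chunk_length` relate under these rules, using `tiktoken`'s `cl100k_base` encoding to stand in for the `CL100K_BASE` tokenizer. The helper name and the newline separator used for concatenation are assumptions for illustration:

```python
import tiktoken

def assemble_chunk(segment_embeds: list[str]) -> dict:
    """Concatenate segment embed fields and measure the result in tokens."""
    encoding = tiktoken.get_encoding("cl100k_base")    # CL100K_BASE tokenizer
    chunk_embed = "\n".join(segment_embeds)            # concatenated embed fields (separator assumed)
    return {
        "embed": chunk_embed,                              # chunk-level embed field
        "chunk_length": len(encoding.encode(chunk_embed)), # tokenized length of that text
    }

# Example: two segment embed fields combined into one chunk.
chunk = assemble_chunk(["First paragraph of text.", "A table\n\nDescription of the table."])
print(chunk["chunk_length"])
```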