vrk chunk
vrk chunk splits text into token-counted pieces that fit within an LLM context window.
The problem
A 38,000-token document needs to go through a model with an 8,192-token window. split -l 100 divides by line count, which has no relationship to tokens. One chunk is 800 tokens, the next is 4,900. An obligation clause that spans a split boundary gets lost because neither chunk has the full sentence.
The solution
vrk chunk splits text into token-counted pieces that fit within an LLM context window. Each chunk is emitted as a JSONL record with its index, text, and exact token count. Configurable overlap preserves context at boundaries so entities that span a split point appear in both adjacent chunks.
Before and after
Before
split -l 100 document.txt chunk_
# chunk_aa: 312 tokens (wasted context space)
# chunk_ab: 4,891 tokens (exceeds your budget)
# chunk_ac: split mid-sentence, entity lost at boundary
After
cat document.txt | vrk chunk --size 4000 --overlap 200
Example
cat contract.pdf.txt | vrk chunk --size 4000 --overlap 200
Exit codes
| Code | Meaning |
|---|---|
| 0 | Success, including empty input |
| 1 | I/O error |
| 2 | No input, --size missing or < 1, --overlap >= --size, unknown flag |
Flags
| Flag | Short | Type | Description |
|---|---|---|---|
| --size | | int | Max tokens per chunk (required) |
| --overlap | | int | Token overlap between adjacent chunks |
| --by | | string | Chunking strategy: paragraph |
| --quiet | -q | bool | Suppress stderr output |
How the output works
Every record contains three fields:
- index: zero-based chunk number, sequential
- text: the chunk content as a string
- tokens: exact token count for this chunk (always <= --size)
$ cat technical-spec.md | vrk chunk --size 30
{"index":0,"text":"Section one covers the system architecture. The platform uses a microservices pattern with...","tokens":30}
{"index":1,"text":"event-driven communication between services. Each service maintains its own database...","tokens":28}
Flag details
--size (required)
Sets the maximum tokens per chunk. Choose a value that leaves room for your system prompt:
# Model has 8,192-token context. System prompt is ~1,000 tokens.
# Leave 7,000 for content chunks.
cat document.txt | vrk chunk --size 7000
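The arithmetic behind that choice is simple enough to script. A sketch; the 8,192 and ~1,000 figures come from the comment above:

```shell
# Context window minus system prompt = room left for a content chunk.
context=8192
system_prompt=1000
size=$((context - system_prompt))
echo "$size"   # 7192; round down to 7000 for headroom
```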
--overlap
Repeats tokens from the end of each chunk at the start of the next one. This is how you prevent entities from being lost at boundaries.
Without overlap, a sentence that spans a chunk boundary gets split:
$ printf 'The vendor shall deliver all components within 30 calendar days of the signed agreement.' | vrk chunk --size 10
{"index":0,"text":"The vendor shall deliver all components within 30 calendar days","tokens":10}
{"index":1,"text":" of the signed agreement.","tokens":5}
With overlap, the boundary region appears in both chunks:
$ printf 'The vendor shall deliver all components within 30 calendar days of the signed agreement.' | vrk chunk --size 10 --overlap 4
{"index":0,"text":"The vendor shall deliver all components within 30 calendar days","tokens":10}
{"index":1,"text":" within 30 calendar days of the signed agreement.","tokens":9}
Now “within 30 calendar days” appears in both chunks. An LLM processing either chunk sees the complete obligation.
Rule of thumb: set --overlap to 5-10% of --size. For --size 4000, use --overlap 200.
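With those numbers you can also estimate chunk count before running anything: every chunk after the first advances by --size minus --overlap tokens. A back-of-the-envelope sketch; exact counts depend on the tokenizer and boundary handling:

```shell
# Each chunk covers `size` tokens but the window advances only
# (size - overlap), so a `total`-token document needs roughly
# ceil((total - overlap) / stride) chunks.
total=38000 size=4000 overlap=200
stride=$((size - overlap))
chunks=$(( (total - overlap + stride - 1) / stride ))
echo "$chunks"   # 10 chunks for the 38,000-token document from the intro
```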
--by paragraph
Splits at paragraph breaks (double newlines) and never breaks a paragraph across chunks:
cat article.md | vrk chunk --size 4000 --by paragraph
Use --by paragraph when your content has natural paragraph structure and you want each chunk to contain complete paragraphs. If a single paragraph exceeds --size, it falls back to token-level splitting for that paragraph.
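Since paragraph breaks are defined as double newlines, you can preview how many paragraph units the splitter will see using awk's paragraph mode. A sketch; the sample text is illustrative:

```shell
# RS='' puts awk in paragraph mode: records are blank-line-separated
# blocks, the same boundaries --by paragraph splits on.
printf 'First para.\n\nSecond para.\n\nThird para.\n' |
awk -v RS='' 'END { print NR " paragraphs" }'
# 3 paragraphs
```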
Processing chunks through an LLM
Use vrk prompt --field text to process each chunk. One API call per record, no loop needed:
cat long-document.md | vrk chunk --size 4000 --overlap 200 | \
vrk prompt --field text --system 'Extract all named entities as a JSON array' --json
The --field text flag reads each JSONL line, extracts the text field, and sends it as the prompt. With --json, the output merges input fields (index, tokens) with response metadata.
Advanced: shell loop pattern
If you need per-record shell logic beyond what --field provides (e.g. conditional processing, writing to different files), use a while loop:
cat long-document.md | vrk chunk --size 4000 --overlap 200 | \
while IFS= read -r record; do
  echo "$record" | jq -r '.text' | \
    vrk prompt --system 'Extract all named entities as a JSON array'
done
IFS= prevents word splitting. -r prevents backslash interpretation. Prefer --field for straightforward pipelines.
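One payoff of the index field: per-chunk results can be stitched back into document order even if processing finishes out of order. A jq sketch over illustrative records:

```shell
# Illustrative per-chunk results arriving out of order.
printf '%s\n' \
  '{"index":2,"summary":"Third section."}' \
  '{"index":0,"summary":"First section."}' \
  '{"index":1,"summary":"Second section."}' |
jq -sr 'sort_by(.index) | map(.summary) | join(" ")'
# First section. Second section. Third section.
```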
Pipeline integration
Extract entities from a large document
# Convert a PDF to text, chunk it, extract entities from each piece
pdftotext contract.pdf - | \
vrk chunk --size 4000 --overlap 200 | \
vrk prompt --field text \
--schema '{"entities":"array","summary":"string"}' \
--retry 2 \
--system 'Extract named entities and a one-sentence summary' \
--json | \
vrk validate --schema '{"entities":"array","summary":"string"}' --strict
Measure before chunking
# Only chunk if the document actually exceeds the budget
TOKENS=$(cat report.md | vrk tok --json | jq -r '.tokens')
if [ "$TOKENS" -le 8000 ]; then
cat report.md | vrk prompt --system 'Summarize this report'
else
cat report.md | vrk chunk --size 4000 --overlap 200 | \
vrk prompt --field text --system 'Summarize this section'
fi
Web page to chunked summaries
# Grab a long article, chunk it, summarize each section
vrk grab https://example.com/long-article | \
vrk chunk --size 4000 --by paragraph | \
vrk prompt --field text --system 'Summarize this section in one paragraph'
When it fails
Missing --size:
$ cat document.txt | vrk chunk
usage error: chunk: --size is required (>= 1)
$ echo $?
2
Overlap >= size:
$ cat document.txt | vrk chunk --size 100 --overlap 100
usage error: chunk: --overlap (100) must be less than --size (100)
$ echo $?
2
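In scripts, branching on the exit code lets you distinguish a flag mistake (2) from an I/O failure (1). A sketch using a stand-in function so it runs without vrk installed; substitute the real pipeline in practice:

```shell
# Stand-in for a failing `vrk chunk` run; returns 2 like a usage error.
# Replace with e.g.: cat document.txt | vrk chunk --size 4000
chunk_stub() { echo 'usage error: chunk: --size is required (>= 1)' >&2; return 2; }

status=0
chunk_stub 2>/dev/null || status=$?
case $status in
  0) echo "chunked successfully" ;;
  1) echo "I/O error: check the input" ;;
  2) echo "usage error: fix the flags" ;;
esac
# usage error: fix the flags
```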