vrk plain

vrk plain strips markdown formatting and keeps the prose, producing clean plain text from any markdown input.

The problem

Markdown syntax (headers, link URLs, code fences, bullet markers) consumes tokens but adds no information for an LLM. A 2,000-token markdown document can come to 1,400 tokens as plain text. Across hundreds of documents a night, that is 30% of the context wasted, and real money. A sed regex strips some formatting but misses code blocks, tables, and nested structures.

The solution

vrk plain strips markdown formatting and keeps the prose. Headers become text, links keep their label but drop the URL, code blocks keep their content but lose the fences. The output is clean plain text ready for token-efficient LLM processing.

Before and after

Before

cat README.md | sed 's/^#* //' | sed -E 's/\[([^]]*)\]\([^)]*\)/\1/g'
# Fragile regex, misses code blocks, tables, nested formatting
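
To make the fragility concrete, here is a small sketch that runs the same two sed steps on inputs a regex handles badly. The helper name naive_strip is hypothetical and only wraps the sed approach for illustration; it is not part of vrk.

```shell
# Hypothetical helper: the two-step sed approach wrapped in a function
# so it can be tried on small inputs.
naive_strip() {
  sed 's/^#* //' | sed -E 's/\[([^]]*)\]\([^)]*\)/\1/g'
}

# A shell comment inside a code fence is mistaken for a header:
# the fences survive and the comment marker is lost.
printf '```sh\n# install deps\n```\n' | naive_strip

# Bold nested inside a link label: the URL is dropped,
# but the ** markers are left in place.
printf 'See [**the docs**](http://x.com)\n' | naive_strip
```

The first command leaves both fence lines intact and turns the comment into a bare "install deps" line. The second prints "See **the docs**", with the bold markers still present.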

After

cat README.md | vrk plain

Example

vrk grab https://example.com/docs | vrk plain | vrk tok

Exit codes

Code	Meaning
0	Success
1	Could not read stdin or write stdout
2	Interactive TTY with no piped input and no positional arg
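
In a script, the exit codes above can be turned into readable messages. The following is a minimal sketch; the function name describe_plain_exit and the message wording are illustrative assumptions, and only the code-to-meaning mapping comes from the table.

```shell
# Hypothetical helper: map a vrk plain exit status to a short message.
# The meanings mirror the exit code table; the wording is illustrative.
describe_plain_exit() {
  case "$1" in
    0) echo "success" ;;
    1) echo "I/O error: could not read stdin or write stdout" ;;
    2) echo "usage error: no piped input and no positional argument" ;;
    *) echo "unexpected exit code: $1" ;;
  esac
}

describe_plain_exit 2
```

A typical use would be to run vrk plain, capture $? immediately afterwards, and pass it to the helper before deciding whether the pipeline should continue.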

Flags

Flag	Short	Type	Description
`--json`	-j	bool	Emit JSON with text, input_bytes, output_bytes
`--quiet`	-q	bool	Suppress stderr output

What gets stripped

Markdown syntax	Plain text result
`**bold**`	bold
`_italic_`	italic
`[link text](url)`	link text
`inline code`	inline code
Code fences	Content only, no fences
`# Headers`	Header text
`- bullet`	bullet

URLs in links are dropped. Code content is kept. Everything that carries meaning stays; everything that’s just formatting goes.

How it works

$ printf '**Bold text** and _italic_ and [link](http://x.com)\n\n- bullet one\n- bullet two\n\n```python\nprint("hello")\n```\n' | vrk plain
Bold text and italic and link

bullet one
bullet two

print("hello")

JSON output

echo '**hello** world' | vrk plain --json

Wraps the plain text in a JSON envelope with byte counts.
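
The byte counts make it easy to measure how much was stripped. The sketch below assumes that text, input_bytes, and output_bytes are top-level keys and that the two counts are numbers; the helper name bytes_saved is hypothetical, and the sample envelope uses made-up values rather than real vrk output.

```shell
# Hypothetical helper: read a JSON envelope on stdin and print how many
# bytes were removed. Assumes numeric top-level input_bytes/output_bytes.
bytes_saved() {
  jq -r '.input_bytes - .output_bytes'
}

# Made-up envelope for illustration only, not real vrk output.
printf '{"text":"sample","input_bytes":100,"output_bytes":70}\n' | bytes_saved
```

With a real document, the same helper would sit at the end of a pipeline such as cat README.md | vrk plain --json | bytes_saved.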

Pipeline integration

Save tokens before an LLM call

# Strip markdown formatting to reduce token count, then summarize
vrk grab https://example.com/docs | vrk plain | vrk tok --check 8000 | \
  vrk prompt --system 'Summarize this document'

Compare token counts before and after

RAW=$(vrk grab https://example.com/docs | vrk tok --json | jq -r '.tokens')
PLAIN=$(vrk grab https://example.com/docs | vrk plain | vrk tok --json | jq -r '.tokens')
echo "Markdown: $RAW tokens, Plain: $PLAIN tokens, Saved: $((RAW - PLAIN))"
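
The absolute saving is often easier to read as a percentage. This is a small sketch using integer shell arithmetic; the function name pct_saved is hypothetical, and the sample figures are the 2,000 and 1,400 token counts from "The problem".

```shell
# Hypothetical helper: percentage of tokens saved, rounded down.
# $1 = token count of the raw markdown, $2 = token count of the plain text.
pct_saved() {
  if [ "$1" -le 0 ]; then
    echo 0   # avoid dividing by zero on empty input
    return
  fi
  echo $(( ($1 - $2) * 100 / $1 ))
}

pct_saved 2000 1400
```

For the figures above this prints 30. In the comparison script, the same call would be pct_saved "$RAW" "$PLAIN".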

Batch processing markdown files

# Strip formatting from all docs before chunking and summarizing
for f in docs/*.md; do
  cat "$f" | vrk plain | vrk chunk --size 4000 | \
    while IFS= read -r record; do
      echo "$record" | jq -r '.text' | vrk prompt --system 'Summarize'
    done
done

When it fails

No input:

$ vrk plain
usage error: plain: no input: pipe text to stdin
$ echo $?
2