vrk grab

vrk grab fetches a URL and returns clean, readable markdown with automatic content extraction.

The problem

curl returns the full HTML of a page. A 52KB response might contain 3KB of article content and 49KB of navigation, JavaScript, cookie banners, and ads, so feeding the raw HTML to an LLM wastes most of the context window on boilerplate. BeautifulSoup can extract the content, but the selectors differ for every site.

The solution

vrk grab fetches the page and applies Readability-style content extraction, which strips navigation, ads, scripts, and boilerplate and returns clean, readable markdown. Use --text for plain prose or --raw for the unprocessed HTML. One command replaces curl + BeautifulSoup + custom extraction logic.

Before and after

Before

curl -s https://example.com/article | \
  python3 -c "
from bs4 import BeautifulSoup
import sys
soup = BeautifulSoup(sys.stdin.read(), 'html.parser')
print(soup.get_text())"

After

vrk grab https://example.com/article

Example

vrk grab https://blog.example.com/llm-best-practices | vrk tok

Exit codes

Code	Meaning
0	Success
1	HTTP error, fetch timeout, response too large, blocked internal address, or I/O error
2	Usage error - invalid URL, no input, mutually exclusive flags
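
Because runtime failures (1) and usage errors (2) are reported separately, a script can branch on the exit status. The sketch below is one way to do that. grab_status is a hypothetical helper name, not part of vrk, and the messages are illustrative.

```shell
# grab_status is a hypothetical wrapper, not a vrk command.
# It runs `vrk grab` with the given arguments, passes the page content
# through on stdout, and names the failure class on stderr.
grab_status() {
  local rc=0
  vrk grab "$@" || rc=$?
  case "$rc" in
    0) ;;  # success: content is already on stdout
    1) echo "fetch failed (HTTP, timeout, size, blocked address, or I/O): $*" >&2 ;;
    2) echo "bad invocation (check URL and flags): $*" >&2 ;;
    *) echo "unexpected exit code $rc: $*" >&2 ;;
  esac
  return "$rc"
}
```

Called as `grab_status https://example.com/article > article.md`, the wrapper keeps the original exit code, so callers can still test it.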

Flags

Flag	Short	Type	Description
`--text`	-t	bool	Plain prose output, no markdown syntax
`--raw`		bool	Raw HTML, no processing
`--json`	-j	bool	Emit JSON envelope with metadata
`--quiet`	-q	bool	Suppress stderr output
`--max-size`		int	Max response body size in bytes (default 10MB)
`--allow-internal`		bool	Allow requests to private, loopback, and link-local addresses (blocked by default for SSRF safety)
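
--max-size takes a byte count, which is easy to mistype. A small wrapper can do the arithmetic. This is a sketch: grab_capped is a hypothetical name, and it assumes 1 MB means 1,000,000 bytes, which the table above does not specify.

```shell
# grab_capped is a hypothetical helper, not a vrk command.
# Usage: grab_capped <megabytes> <url>
# Assumption: 1 MB = 1,000,000 bytes (the docs only say "10MB").
grab_capped() {
  local mb="$1" url="$2"
  vrk grab --max-size "$((mb * 1000 * 1000))" "$url"
}
```

For example, `grab_capped 2 https://example.com/article` runs vrk grab with --max-size 2000000.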

Output formats

Markdown (default)

$ vrk grab https://example.com/blog/llm-pipelines
# Getting Started with LLM Pipelines

Building reliable LLM pipelines requires careful attention to...

## Token Management

Before sending any document to an LLM, measure its token count...

The output is clean markdown with headers, paragraphs, links, and code blocks preserved, ready to pipe to vrk tok, vrk chunk, or vrk prompt.

Plain text (--text)

$ vrk grab --text https://example.com/blog/llm-pipelines
Getting Started with LLM Pipelines

Building reliable LLM pipelines requires careful attention to...

No markdown syntax. Just prose. Useful when you want to minimize tokens or feed text to a tool that doesn’t understand markdown.

Raw HTML (--raw)

The full, unprocessed HTML. Use when you need to extract specific elements yourself, or when the content extraction strips something you need.

vrk grab --raw https://example.com/blog/llm-pipelines

JSON envelope (--json)

Wraps the content in a JSON object with metadata including the title, token estimate, and fetch timestamp:

vrk grab --json https://example.com/blog/llm-pipelines
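
The exact field names in the envelope are not listed here, so rather than guess them, one way to see what a given version emits is to list the top-level keys with jq. The helper name envelope_keys is hypothetical. The sketch only assumes that the output is a single JSON object, as described above.

```shell
# envelope_keys is a hypothetical helper, not a vrk command.
# Prints the top-level field names of the JSON envelope, one per line.
# jq's `keys` returns the names in sorted order.
envelope_keys() {
  vrk grab --json "$1" | jq -r 'keys[]'
}
```

Once the field names are known, the same pattern with `jq -r '.<field>'` pulls out a single value.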

Pipeline integration

Fetch, check budget, and summarize

# Grab an article, make sure it fits in context, then summarize
vrk grab https://blog.example.com/quarterly-review | \
  vrk tok --check 12000 | \
  vrk prompt --system 'Summarize the key findings in 5 bullet points'
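
In a plain pipeline the shell reports only the exit status of the last command unless pipefail is enabled, so a failed fetch can go unnoticed. A minimal sketch that avoids this by capturing the page before piping it on is shown below. summarize_url is a hypothetical helper name, and the default budget of 12000 is taken from the example above.

```shell
# summarize_url is a hypothetical helper, not a vrk command.
# Usage: summarize_url <url> [token-budget]
summarize_url() {
  local url="$1" budget="${2:-12000}" content
  # Capture first: if the fetch fails, stop before anything reaches the LLM
  # and hand grab's exit code back to the caller.
  content=$(vrk grab "$url") || return $?
  printf '%s\n' "$content" |
    vrk tok --check "$budget" |
    vrk prompt --system 'Summarize the key findings in 5 bullet points'
}
```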

Extract and validate links from a web page

# Grab a page, extract all links, and list the unique hosts they point to
vrk grab https://docs.example.com/api-reference | \
  vrk links --bare | \
  while IFS= read -r url; do
    vrk urlinfo --field host "$url"
  done | sort -u

Fetch, chunk, and process a long article

# Grab a long page, chunk it if needed, summarize each section
CONTENT=$(vrk grab https://example.com/whitepaper)
TOKENS=$(echo "$CONTENT" | vrk tok --json | jq -r '.tokens')
if [ "$TOKENS" -le 8000 ]; then
  echo "$CONTENT" | vrk prompt --system 'Summarize this'
else
  echo "$CONTENT" | vrk chunk --size 4000 --overlap 200 | \
    while IFS= read -r chunk; do
      echo "$chunk" | jq -r '.text' | vrk prompt --system 'Summarize this section'
    done
fi

Nightly batch with state tracking

# Fetch each URL, redact secrets from content, summarize, store result
# Read urls.txt on fd 3 so commands inside the loop cannot consume the list via stdin
while IFS= read -r url <&3; do
  CONTENT=$(vrk grab "$url" | vrk mask)
  if SUMMARY=$(echo "$CONTENT" | vrk prompt --system @prompts/summarize.txt --retry 2); then
    vrk kv set --ns summaries "$(echo "$url" | vrk slug)" "$SUMMARY" --ttl 168h
    vrk kv incr --ns summaries processed
  fi
done 3< urls.txt

When it fails

Unreachable URL:

$ vrk grab https://nonexistent.example.com/page
error: grab: Get "https://nonexistent.example.com/page": dial tcp: lookup nonexistent.example.com: no such host
$ echo $?
1

HTTP error:

$ vrk grab https://example.com/this-page-does-not-exist
error: grab: HTTP 404
$ echo $?
1

No URL provided:

$ vrk grab
usage error: grab: no URL provided
$ echo $?
2
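
Retrying on failure:

Exit code 1 covers both transient problems (timeouts) and permanent ones (HTTP 404), while exit code 2 means the command itself was wrong. A retry loop can therefore give up immediately on 2 and retry a small number of times on 1. The following is a sketch, not a built-in feature. grab_retry is a hypothetical helper name.

```shell
# grab_retry is a hypothetical helper, not a vrk command.
# Usage: grab_retry <max-attempts> <delay-seconds> <url>
grab_retry() {
  local attempts="$1" delay="$2" url="$3" n=1 rc
  while :; do
    rc=0
    vrk grab "$url" || rc=$?
    # Success, or a usage error that retrying cannot fix: stop now.
    if [ "$rc" -eq 0 ] || [ "$rc" -eq 2 ]; then
      return "$rc"
    fi
    # Out of attempts: report the last failure code.
    if [ "$n" -ge "$attempts" ]; then
      return "$rc"
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}
```

Because exit code 1 does not distinguish a timeout from a 404, keep the attempt count low. A missing page fails every time.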