vrk grab
vrk grab fetches a URL and returns clean, readable markdown with automatic content extraction.
The problem
curl returns the full HTML of a page. A 52KB response might contain 3KB of article content and 49KB of navigation, JavaScript, cookie banners, and ads. Feeding the raw HTML to an LLM wastes most of the context window on boilerplate. BeautifulSoup can extract the content, but the selectors differ for every site.
The solution
vrk grab fetches the page and applies Readability-style content extraction, which strips navigation, ads, scripts, and boilerplate, then returns the result as clean markdown. Use --text for plain prose or --raw for the unprocessed HTML. One command replaces curl + BeautifulSoup + custom extraction logic.
Before and after
Before
curl -s https://example.com/article | \
python3 -c "
from bs4 import BeautifulSoup
import sys
soup = BeautifulSoup(sys.stdin.read(), 'html.parser')
print(soup.get_text())"
After
vrk grab https://example.com/article
Example
vrk grab https://blog.example.com/llm-best-practices | vrk tok
Exit codes
| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | HTTP error, fetch timeout, response too large, blocked internal address, or I/O error |
| 2 | Usage error - invalid URL, no input, mutually exclusive flags |
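The exit codes above are enough to branch on in a script. The following is a minimal sketch for a POSIX shell; `grab_checked` is a hypothetical helper name, not something vrk ships.

```shell
# grab_checked URL
# Hypothetical wrapper: runs vrk grab and maps the documented exit codes
# to short messages on stderr. Page content still goes to stdout.
grab_checked() {
  status=0
  vrk grab "$1" || status=$?
  case "$status" in
    0) ;;  # success: content was already written to stdout
    1) echo "grab_checked: fetch failed for $1 (HTTP, timeout, size, blocked address, or I/O)" >&2 ;;
    2) echo "grab_checked: usage error, check the URL and flags" >&2 ;;
    *) echo "grab_checked: unexpected exit code $status" >&2 ;;
  esac
  return "$status"
}
```

Because the wrapper returns the original code, it can be dropped into a pipeline or an `if` without changing how callers detect failure.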
Flags
| Flag | Short | Type | Description |
|---|---|---|---|
| --text | -t | bool | Plain prose output, no markdown syntax |
| --raw | | bool | Raw HTML, no processing |
| --json | -j | bool | Emit JSON envelope with metadata |
| --quiet | -q | bool | Suppress stderr output |
| --max-size | | int | Max response body size in bytes (default 10MB) |
| --allow-internal | | bool | Allow requests to private, loopback, and link-local addresses (blocked by default for SSRF safety) |
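Since --max-size is given in bytes, a small helper can make a tighter cap easier to read. This is a sketch only: `grab_capped` is a hypothetical name, and it assumes the flag accepts its value as a separate argument (`--max-size 2097152`) and can be combined with --text.

```shell
# grab_capped URL [MAX_MB]
# Hypothetical helper: fetch as plain text with a smaller size cap than
# the 10MB default. --max-size takes bytes, so convert from megabytes.
grab_capped() {
  url=$1
  max_mb=${2:-2}                        # default cap: 2 MB
  max_bytes=$((max_mb * 1024 * 1024))   # megabytes -> bytes
  vrk grab --text --max-size "$max_bytes" "$url"
}
```

For example, `grab_capped https://example.com/article 5` would request a 5 MB cap; a response over the cap should surface as exit code 1, per the exit code table.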
Output formats
Markdown (default)
$ vrk grab https://example.com/blog/llm-pipelines
# Getting Started with LLM Pipelines
Building reliable LLM pipelines requires careful attention to...
## Token Management
Before sending any document to an LLM, measure its token count...
Clean markdown - headers, paragraphs, links, and code blocks preserved. Ready to pipe to vrk tok, vrk chunk, or vrk prompt.
Plain text (--text)
$ vrk grab --text https://example.com/blog/llm-pipelines
Getting Started with LLM Pipelines
Building reliable LLM pipelines requires careful attention to...
No markdown syntax. Just prose. Useful when you want to minimize tokens or feed text to a tool that doesn’t understand markdown.
Raw HTML (--raw)
The full, unprocessed HTML. Use when you need to extract specific elements yourself, or when the content extraction strips something you need.
vrk grab --raw https://example.com/blog/llm-pipelines
JSON envelope (--json)
Wraps the content in a JSON object with metadata including the title, token estimate, and fetch timestamp:
vrk grab --json https://example.com/blog/llm-pipelines
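The envelope can be picked apart with jq. The exact field names are not shown here, so the sketch below assumes a top-level `title` key purely for illustration; run `vrk grab --json URL | jq keys` to see the real names before relying on it. `grab_title` is a hypothetical helper.

```shell
# grab_title URL
# Prints the page title from the JSON envelope, or nothing if absent.
# ASSUMPTION: the envelope exposes a top-level "title" key.
grab_title() {
  vrk grab --json "$1" | jq -r '.title // empty'
}
```

The `// empty` fallback keeps the output blank instead of printing the literal string `null` when the key is missing.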
Pipeline integration
Fetch, check budget, and summarize
# Grab an article, make sure it fits in context, then summarize
vrk grab https://blog.example.com/quarterly-review | \
vrk tok --check 12000 | \
vrk prompt --system 'Summarize the key findings in 5 bullet points'
Extract and validate links from a web page
# Grab a page, extract all links, and list the unique hosts they point to
vrk grab https://docs.example.com/api-reference | \
vrk links --bare | \
while IFS= read -r url; do
vrk urlinfo --field host "$url"
done | sort -u
Fetch, chunk, and process a long article
# Grab a long page, chunk it if needed, summarize each section
CONTENT=$(vrk grab https://example.com/whitepaper)
TOKENS=$(echo "$CONTENT" | vrk tok --json | jq -r '.tokens')
if [ "$TOKENS" -le 8000 ]; then
echo "$CONTENT" | vrk prompt --system 'Summarize this'
else
echo "$CONTENT" | vrk chunk --size 4000 --overlap 200 | \
while IFS= read -r chunk; do
echo "$chunk" | jq -r '.text' | vrk prompt --system 'Summarize this section'
done
fi
Nightly batch with state tracking
# Fetch each URL, redact secrets from content, summarize, store result
for url in $(cat urls.txt); do
CONTENT=$(vrk grab "$url" | vrk mask)
SUMMARY=$(echo "$CONTENT" | vrk prompt --system @prompts/summarize.txt --retry 2)
if [ $? -eq 0 ]; then
vrk kv set --ns summaries "$(echo "$url" | vrk slug)" "$SUMMARY" --ttl 168h
vrk kv incr --ns summaries processed
fi
done
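One caveat with `for url in $(cat urls.txt)`: the unquoted substitution is split on any whitespace and is subject to glob expansion, and URLs often contain `?` and `*`. A line-by-line reader avoids both. The sketch below is one way to do it; `each_url` is a hypothetical helper, and it reads the file on descriptor 3 so the commands it runs can still use stdin.

```shell
# each_url FILE COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND once per non-empty line of FILE,
# passing the line as the final argument. A failure on one URL is
# reported on stderr and the loop moves on to the next line.
each_url() {
  file=$1
  shift
  # "|| [ -n "$url" ]" also handles a last line with no trailing newline.
  while IFS= read -r url <&3 || [ -n "$url" ]; do
    case "$url" in
      '') continue ;;   # skip blank lines
    esac
    "$@" "$url" || echo "each_url: failed: $url" >&2
  done 3< "$file"
}
```

With the loop body above moved into a shell function (say, a hypothetical `summarize_one`), the batch becomes `each_url urls.txt summarize_one`.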
When it fails
Unreachable URL:
$ vrk grab https://nonexistent.example.com/page
error: grab: Get "https://nonexistent.example.com/page": dial tcp: lookup nonexistent.example.com: no such host
$ echo $?
1
HTTP error:
$ vrk grab https://example.com/this-page-does-not-exist
error: grab: HTTP 404
$ echo $?
1
No URL provided:
$ vrk grab
usage error: grab: no URL provided
$ echo $?
2
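The flags table lists no retry option for grab, so transient failures have to be retried by the caller. A minimal sketch with exponential backoff follows; `grab_retry` is a hypothetical helper. It retries only on exit code 1, since exit code 2 is a usage error that a retry cannot fix.

```shell
# grab_retry URL [ATTEMPTS] [DELAY_SECONDS]
# Hypothetical helper: retry vrk grab on exit code 1, doubling the
# delay between attempts. Any other exit code is returned immediately.
grab_retry() {
  url=$1
  attempts=${2:-3}
  delay=${3:-2}
  n=1
  while :; do
    status=0
    vrk grab "$url" || status=$?
    [ "$status" -ne 1 ] && return "$status"   # success or usage error
    [ "$n" -ge "$attempts" ] && return 1      # out of attempts
    echo "grab_retry: attempt $n failed, retrying in ${delay}s" >&2
    sleep "$delay"
    n=$((n + 1))
    delay=$((delay * 2))
  done
}
```

Exit code 1 also covers permanent failures such as HTTP 404, so this will spend a few attempts on pages that will never load; keep the attempt count small.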