vrk sip

vrk sip samples lines from stdin using first-N, reservoir sampling, every-Nth, or percentage strategies.

The problem

head -1000 gives you the first 1,000 lines, not a representative sample. shuf -n 1000 gives a random sample but loads the entire file into memory. On a 10GB log file, shuf gets OOM-killed. Reservoir sampling in Python is 20 lines for something that should be one command.

The solution

vrk sip samples lines from stdin using four strategies: --first N (first N lines), --count N (reservoir sampling, O(N) memory), --every N (every Nth line), or --sample N (each line with N% probability). --seed makes samples reproducible.

Before and after

Before

shuf -n 1000 large-file.jsonl
# Loads entire file into memory. OOM on large files.

After

cat large-file.jsonl | vrk sip --count 1000 --seed 42

Example

cat access.log | vrk sip --count 1000 --seed 42

Exit codes

Code	Meaning
0	Success
1	I/O failure reading stdin
2	No strategy specified, multiple strategies, –sample outside 1-100, interactive TTY

Flags

Flag	Short	Type	Description
`--first`		int	Take first N lines
`--count`	-n	int	Reservoir sample of exactly N lines
`--every`		int	Emit every Nth line
`--sample`		int	Include each line with N% probability (1-100)
`--seed`		int64	Random seed for reproducibility
`--json`	-j	bool	Append metadata record after all output
`--quiet`	-q	bool	Suppress stderr output

Sampling strategies

Exactly one must be specified.

–first N (deterministic, take first N)

$ seq 100 | vrk sip --first 5
1
2
3
4
5

Like head -n but integrated with sip’s --json metadata.

–count N (random reservoir sample)

$ seq 100 | vrk sip --count 3 --seed 42
7
55
58

Uniform random sample using reservoir sampling. Uses O(N) memory regardless of input size. A 10-million-line file sampled to 1,000 lines uses the same memory as sampling 100 lines.

–every N (deterministic, every Nth line)

$ seq 20 | vrk sip --every 5
5
10
15
20

Useful for systematic sampling at regular intervals.

–sample N (probabilistic, N% chance per line)

seq 1000 | vrk sip --sample 10

Each line has a 10% chance of being included. The output count is approximate.

Deterministic output (–seed)

$ seq 100 | vrk sip --count 5 --seed 42
# Same output every time

Use --seed for reproducible samples in tests and CI pipelines.

JSON metadata (–json)

$ seq 100 | vrk sip --count 3 --seed 42 --json
7
55
58
{"_vrk":"sip","strategy":"count","requested":3,"emitted":3,"seed":42}

Pipeline integration

Sample from a large dataset for testing

# Take a 1,000-line sample from a 10M-line dataset
cat production-logs.jsonl | vrk sip --count 1000 --seed 42 | \
  vrk validate --schema '{"level":"string","msg":"string"}'

Sample before expensive LLM processing

# Process a random 10% of records through an LLM
cat records.jsonl | vrk sip --sample 10 | \
  while IFS= read -r record; do
    echo "$record" | vrk prompt --system 'Classify this record'
  done

Rate-limit combined with sampling

# Sample 100 records, then throttle to 5/s for API calls
cat data.jsonl | vrk sip --count 100 | vrk throttle --rate 5/s | process-each

When it fails

No strategy specified:

$ seq 10 | vrk sip
usage error: sip: specify exactly one of --first, --count, --every, --sample
$ echo $?
2

Multiple strategies:

$ seq 10 | vrk sip --first 5 --count 3
usage error: sip: specify exactly one of --first, --count, --every, --sample
$ echo $?
2