Build track · Platform & eval engineer · Developer tools

Projects — the learn-by-doing ladder

Seven projects, all cloud-delivered and in the dev-tools domain. Each one is an artifact that handles real inputs, wrapped in an eval, with a short writeup. The eval is the part that gets you hired.

The rule for every project

Artifact, not a notebook. A deployed service, CLI, or package someone could actually run.
An eval with numbers. Before you "improve" anything, you must be able to measure it. No eyeballing — that's the whole job you're pivoting into.
A "decisions + numbers" writeup. One short post per project: what you tried, what the metric did, what you'd change. Communicating tradeoffs is half of what an AI-platform hire is judged on.

Scope (your call): everything here is cloud / backend / developer-tools. No on-device, mobile, or product-UX work — you ruled those out, so they're absent on purpose. Sequencing maps to the curriculum: P0 with L1–L2, P1 after L2, P2 after L3 + eval intro, P3 after L4, P4 alongside L5, P5 after L6, P6 is the capstone.

P0 · warm-up

Instrument one LLM call

Start measuring from line one — the platform/eval reflex.

L1 · L6 · ~half a day

Build

A thin wrapper around a single chat completion that records, per call: input/output tokens, time-to-first-token and total latency, $ cost, and the top-k logprobs of the first few tokens. Run one fixed prompt across three models.
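One way the wrapper could look is sketched below. It is a minimal sketch, not a reference implementation: `stream_fn` is a hypothetical adapter around whatever provider's streaming API you use, the whitespace token counter is a crude stand-in for a real tokenizer, and the prices are passed in by the caller rather than looked up. Logprob capture is left out because its shape depends on the provider.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Tuple


@dataclass
class CallTrace:
    model: str
    input_tokens: int
    output_tokens: int
    ttft_s: Optional[float]  # time to first streamed piece; None if nothing streamed
    total_s: float
    cost_usd: float
    text: str


def traced_call(
    model: str,
    prompt: str,
    stream_fn: Callable[[str, str], Iterable[str]],  # hypothetical provider adapter
    price_per_1k: Tuple[float, float],  # assumed (input, output) USD per 1,000 tokens
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),  # crude stand-in
    clock: Callable[[], float] = time.perf_counter,
) -> CallTrace:
    """Stream one completion and record tokens, latency, and cost."""
    start = clock()
    ttft = None
    pieces = []
    for piece in stream_fn(model, prompt):
        if ttft is None:
            ttft = clock() - start  # first piece arrived
        pieces.append(piece)
    total = clock() - start
    text = "".join(pieces)
    n_in, n_out = count_tokens(prompt), count_tokens(text)
    in_price, out_price = price_per_1k
    cost = (n_in * in_price + n_out * out_price) / 1000
    return CallTrace(model, n_in, n_out, ttft, total, cost, text)


def fake_stream(model: str, prompt: str) -> Iterable[str]:
    # Placeholder for a real streaming completion.
    yield "hello "
    yield "world"
```

Injecting `stream_fn` and `clock` keeps the helper testable without network access, and the same function can then be pointed at each of the three models in turn.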

The eval — your hire signal

A small table: model × {tokens, latency, cost, "confidence" from logprobs}. You can't optimize what you don't measure; this makes that instinct muscle memory.

Artifact

A reusable traced_call() helper + the comparison table. You'll fold this into every later project's observability.

Stretch: emit OpenTelemetry-style spans so it drops into a real tracing backend later.

P1

Dev-artifact → validated JSON

Make an unreliable model produce reliable structured output — and prove it.

milestone B1 · L2 · ~2–3 days

Build

Pick a messy dev input: CI logs, stack traces, or API-doc snippets. Extract structured records — e.g. {error_type, file, line, root_cause, fix} or {endpoint, method, params, auth} — using function-calling / JSON mode / schema-constrained decoding, with a validate-and-retry loop.
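The validate-and-retry loop can be sketched as follows. This is an illustration under assumptions: `generate` is a hypothetical callable standing in for the model call, the field types are guesses layered on the example record above, and the validator is hand-rolled rather than a schema library.

```python
import json
from typing import Callable, Optional, Tuple

# Field names follow the example record in the text; the types are assumed.
REQUIRED = {"error_type": str, "file": str, "line": int, "root_cause": str, "fix": str}


def validate(raw: str) -> Tuple[Optional[dict], str]:
    """Return (record, "") when raw is schema-valid, else (None, reason)."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(record, dict):
        return None, "top-level value must be an object"
    for name, expected in REQUIRED.items():
        value = record.get(name)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            return None, f"field {name!r} missing or not {expected.__name__}"
    return record, ""


def extract_with_retry(
    text: str,
    generate: Callable[[str], str],  # hypothetical model call: prompt in, raw text out
    max_attempts: int = 3,
) -> Tuple[dict, int]:
    """Call the model until its output validates; return (record, attempts used)."""
    prompt = f"Extract a JSON object with fields {list(REQUIRED)} from:\n{text}"
    error = ""
    for attempt in range(1, max_attempts + 1):
        # On a retry, feed the validation error back so the model can correct it.
        retry_note = f"\n\nPrevious output was rejected: {error}" if error else ""
        record, error = validate(generate(prompt + retry_note))
        if record is not None:
            return record, attempt
    raise ValueError(f"no valid record after {max_attempts} attempts: {error}")
```

Returning the attempt count alongside the record makes it easy to report how often a retry was needed, which is itself a useful number for the writeup.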

The eval — your hire signal

Validity rate (% parseable and schema-valid) over a held-out set of ~50 inputs, tracked before/after each change. Push toward ~100%. Bonus: field-level accuracy on a labeled subset.
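Both numbers are simple to compute once the outputs are collected. The sketch below assumes the API-doc record shape from the Build step and uses exact match for field-level accuracy, which is a simplification: free-text fields would likely need a looser comparison.

```python
import json
from typing import Dict, List, Sequence

FIELDS = ("endpoint", "method", "params", "auth")  # from the example record in the text


def is_schema_valid(raw: str) -> bool:
    """True when raw parses as a JSON object containing every expected field."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(record, dict) and all(name in record for name in FIELDS)


def validity_rate(outputs: Sequence[str]) -> float:
    """Fraction of raw model outputs that are parseable and schema-valid."""
    if not outputs:
        raise ValueError("need at least one output")
    return sum(is_schema_valid(o) for o in outputs) / len(outputs)


def field_accuracy(predicted: List[dict], gold: List[dict]) -> Dict[str, float]:
    """Exact-match accuracy per field over a labeled subset."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("predicted and gold must be non-empty and the same length")
    return {
        name: sum(p.get(name) == g.get(name) for p, g in zip(predicted, gold)) / len(gold)
        for name in FIELDS
    }
```

Running `validity_rate` on the same held-out inputs before and after each change gives the before/after series the eval asks for.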

Artifact

A small extraction service/function + a validity report showing the climb to ~100%.

Stretch: compare schema-constrained decoding vs. prompt-only; quantify the reliability gap.

P2

Docs/code RAG + a retrieval eval harness

The centerpiece. Most candidates show a RAG demo; almost none prove a retrieval improvement with numbers. That gap is your edge.

milestone B2 · L3 · L5 · ~1–2 weeks

Build

Index a real corpus you know — a framework's docs, the Python stdlib, or your own repo. Ingest & parse it properly (the unglamorous part that decides quality), chunk, embed, retrieve. Then improve it: add reranking, query transformation (HyDE / multi-query), or contextual retrieval.

The eval — your hire signal

A golden set of ~30 questions, each mapped to the chunk(s) that should be retrieved. Measure recall@k and MRR. Show one specific change move the number (e.g. "reranking: recall@5 0.62 → 0.81"). Commit the eval set to the repo.
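The two metrics fit in a few lines. In this sketch the golden set is assumed to be a list of `(question, relevant_chunk_ids)` pairs and `retrieve` is a hypothetical function returning chunk ids ranked best first; both shapes are illustrative choices, not a required format.

```python
from typing import Callable, Dict, List, Sequence, Set, Tuple


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k results."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(
    golden: List[Tuple[str, Set[str]]],
    retrieve: Callable[[str], List[str]],  # hypothetical retriever: question -> ranked ids
    k: int = 5,
) -> Dict[str, float]:
    """Average recall@k and MRR over the golden set."""
    recalls, rrs = [], []
    for question, relevant in golden:
        ranked = retrieve(question)
        recalls.append(recall_at_k(ranked, relevant, k))
        rrs.append(reciprocal_rank(ranked, relevant))
    n = len(golden)
    return {f"recall@{k}": sum(recalls) / n, "mrr": sum(rrs) / n}
```

Calling `evaluate` once with the baseline retriever and once with the changed one, on the same committed golden set, produces the before/after pair for the table.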

Artifact

The RAG service + the committed retrieval-eval harness + a before/after table.

Stretch: add an end-to-end answer eval (LLM-judge, faithfulness/citations) on top of the retrieval eval.

P3

Dev-tools agent: HITL + failure recovery

Safe, resilient, and evaluated. Demos that loop and burn tokens are common; this isn't one.

milestone B3 · L4 · L8 · ~1–2 weeks

Build

An agent that does a real dev task: review a PR (read the diff, run linters/tests as tools, comment), or triage a failing CI log (read logs, grep code, propose a fix). Multi-tool, with a human approval gate before any mutating action, graceful recovery when a tool errors, and injection defense (it's reading untrusted code/diffs).
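The approval gate and the tool-error recovery can both live in the layer that executes tools, independent of the model. The sketch below is one possible shape: `Tool`, `ToolRunner`, and the `approve` callback are hypothetical names, and injection defense is not covered here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Tool:
    name: str
    run: Callable[[str], str]
    mutating: bool = False  # True for anything that writes, pushes, or comments


@dataclass
class ToolRunner:
    tools: Dict[str, Tool]
    approve: Callable[[str, str], bool]  # human-in-the-loop hook: (tool name, arg) -> bool
    max_retries: int = 1
    log: List[str] = field(default_factory=list)

    def call(self, name: str, arg: str) -> dict:
        """Run a tool; always return a structured result instead of raising."""
        tool = self.tools.get(name)
        if tool is None:
            return {"ok": False, "error": f"unknown tool: {name}"}
        # Gate: a mutating tool never runs without explicit approval.
        if tool.mutating and not self.approve(name, arg):
            self.log.append(f"denied:{name}")
            return {"ok": False, "error": "approval denied"}
        last_error = ""
        for attempt in range(self.max_retries + 1):  # bounded retries, so no endless loop
            try:
                output = tool.run(arg)
                self.log.append(f"ok:{name}")
                return {"ok": True, "output": output}
            except Exception as exc:
                last_error = str(exc)
                self.log.append(f"error:{name}:{attempt}")
        # Report the failure explicitly so the agent cannot mistake it for success.
        return {"ok": False, "error": last_error}
```

Because failures come back as data with `ok: False`, the agent loop can decide to stop or ask for help, and the log gives the eval something concrete to assert on.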

The eval — your hire signal

Task success rate over ~20 scenarios; a safety check (never mutates without approval); and a recovery test (inject a tool failure → verify it degrades gracefully, doesn't loop or hallucinate success).

Artifact

The agent + a scenario eval suite covering success, safety, and recovery.

Stretch: add a cost ceiling per run and trace token+cost per step.

P4 · signature piece

Eval & regression-gate harness as a product

This is the platform/eval-engineer portfolio piece. "I built the eval infra that gates our prompt changes in CI" is literally the job.

L5 · ~1–2 weeks

Build

Lift P2/P3's ad-hoc eval into a reusable harness: golden sets, an LLM-as-judge with bias controls (position-swap, reference-based rubric scoring), task metrics, and a CI gate (e.g. a GitHub Action) that fails the build when a prompt/chain change regresses the metric. Capture traces (token + cost per step).
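The gate itself can be a small comparison that a CI step turns into an exit code. This is a sketch under assumptions: every metric is treated as higher-is-better, the tolerance value is arbitrary, and how the baseline and current metrics are stored and loaded is left to the harness.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.01,  # assumed noise margin; tune per metric
) -> List[str]:
    """List metrics that fell more than `tolerance` below baseline (higher is better)."""
    failures = []
    for name, base in sorted(baseline.items()):
        now = current.get(name)
        if now is None:
            failures.append(f"{name}: missing from current run")
        elif now < base - tolerance:
            failures.append(f"{name}: {base:.3f} -> {now:.3f}")
    return failures


def exit_code(failures: List[str]) -> int:
    """Print each regression and return a process exit code for CI."""
    for line in failures:
        print(f"REGRESSION {line}")
    return 1 if failures else 0
```

A CI job would run the eval, call `find_regressions` against the stored baseline, and pass the result of `exit_code` to `sys.exit`, so that any non-empty list fails the build.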

The eval — your hire signal

The harness is the deliverable. Demonstrate it catching a deliberately-introduced regression in CI — a red build from a worse prompt. Show the judge's bias controls actually change a verdict.
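A position-swap control is one way to show a bias control changing a verdict. In the sketch below the `judge` signature is hypothetical (it returns "1", "2", or "tie" for the answer it prefers by position), and the two toy judges exist only to demonstrate the effect.

```python
from typing import Callable

# Hypothetical judge signature: judge(question, first, second) -> "1", "2", or "tie".
Judge = Callable[[str, str, str], str]


def judge_with_swap(question: str, answer_a: str, answer_b: str, judge: Judge) -> str:
    """Judge both orderings; return "A", "B", "tie", or "inconsistent"."""
    forward = {"1": "A", "2": "B"}.get(judge(question, answer_a, answer_b), "tie")
    backward = {"1": "B", "2": "A"}.get(judge(question, answer_b, answer_a), "tie")
    # A verdict only counts if it survives swapping the answer positions.
    return forward if forward == backward else "inconsistent"


def position_biased_judge(question: str, first: str, second: str) -> str:
    # Toy judge that always prefers whichever answer is shown first.
    return "1"


def longer_wins_judge(question: str, first: str, second: str) -> str:
    # Toy judge whose preference depends on content (length), not position.
    if len(first) == len(second):
        return "tie"
    return "1" if len(first) > len(second) else "2"
```

Without the swap, the position-biased judge would report a win for whichever answer came first; with it, the same pair is flagged as inconsistent, which is the kind of changed verdict worth showing in the writeup.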

Artifact

A reusable repo/package + a CI run screenshot/log showing a blocked regression.

Stretch: add an eval-flywheel step — mine real failure traces into new test cases automatically.

P5

Production-shape + model routing

"Cut cost X% at equal quality, measured" is a sentence that gets platform engineers hired.

L6 · ~1 week

Build

Take P2 or P3 and harden it: a model cascade (cheap model first, escalate to a stronger one on low confidence / hard inputs), prompt + semantic caching, retries/timeouts/fallbacks, rate-limit handling, and full tracing with token + cost per span. Set a latency and cost budget and hold to it.
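The cascade and the cache can be sketched together. Several things here are assumptions: `cheap` is a hypothetical callable returning an answer plus a confidence score in [0, 1], `strong` returns just an answer, the threshold is arbitrary, and the cache is exact-match on a normalized prompt (a semantic cache would compare embeddings instead). Retries, timeouts, and rate-limit handling are omitted.

```python
from typing import Callable, Dict, Tuple


def make_cascade(
    cheap: Callable[[str], Tuple[str, float]],  # hypothetical: prompt -> (answer, confidence)
    strong: Callable[[str], str],  # hypothetical: prompt -> answer
    threshold: float = 0.7,  # assumed escalation cutoff
):
    """Build an answer function that tries the cheap model first and caches results."""
    cache: Dict[str, Tuple[str, str]] = {}
    stats = {"cheap": 0, "strong": 0, "cache_hit": 0}

    def answer(prompt: str) -> Tuple[str, str]:
        """Return (answer text, tier that produced it)."""
        key = " ".join(prompt.lower().split())  # normalize case and whitespace
        if key in cache:
            stats["cache_hit"] += 1
            return cache[key]
        text, confidence = cheap(prompt)
        stats["cheap"] += 1
        tier = "cheap"
        if confidence < threshold:
            # Low confidence: escalate to the stronger, more expensive model.
            text = strong(prompt)
            stats["strong"] += 1
            tier = "strong"
        cache[key] = (text, tier)
        return cache[key]

    return answer, stats
```

The `stats` counters give the routing mix directly, which feeds the cost side of the before/after comparison.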

The eval — your hire signal

Cost and latency before → after routing + caching, at fixed quality — use P4's harness to prove quality didn't drop. Report $ and ms saved.

Artifact

The hardened service + a cost/latency/quality dashboard.

Stretch: swap one cascade tier to a self-hosted open-weight model (vLLM) and compare economics.

P6 · capstone

Ship it — and prove a change moved the number

The claim-the-title bar: a shipped, evaluated, instrumented dev-tools AI service with a measured improvement.

milestone B4 · all layers · ~2 weeks

Build

Combine. Ship P2-or-P3 as a real cloud service (API + minimal UI or CLI), with P4 (eval gating CI) and P5 (routing / caching / tracing / budget) wired in. Then pick one real improvement, implement it, and document the before/after with numbers.

The eval — your hire signal

A documented before/after on one metric (quality, cost, or latency), with the eval set and traces as evidence. This — not coverage, not a quiz — is "done."

Artifact

Deployed service + repo + the writeup. This is the top of your portfolio.

What the finished ladder proves to a platform/eval hiring manager

Eval rigor — golden sets, judges with bias controls, CI gates (P2, P4, P6). The rarest, most valued skill.
Observability & cost discipline — tracing, token/cost accounting, routing that saved measured money (P0, P5).
RAG done right — ingestion, retrieval eval, advanced patterns, not a toy demo (P2).
Safe, resilient agents — HITL, recovery, injection defense, all evaluated (P3).
You ship and measure — a deployed service with a documented, numeric improvement (P6).