Build track · Platform & eval engineer · Developer tools
Seven projects, cloud-delivered, dev-tools domain. Each one is an artifact that real inputs hit, wrapped in an eval and accompanied by a short writeup. The eval is the part that gets you hired.
The rule for every project
Scope (your call): everything here is cloud / backend / developer-tools. No on-device, mobile, or product-UX work — you ruled those out, so they're absent on purpose. Sequencing maps to the curriculum: P0 with L1–L2, P1 after L2, P2 after L3 + eval intro, P3 after L4, P4 alongside L5, P5 after L6, P6 is the capstone.
Start measuring from line one — the platform/eval reflex.
A thin wrapper around a single chat completion that records, per call: input/output tokens, time-to-first-token and total latency, $ cost, and the top-k logprobs of the first few tokens. Run one fixed prompt across three models.
A small table: model × {tokens, latency, cost, "confidence" from logprobs}. You can't optimize what you don't measure; this makes that instinct muscle memory.
A reusable traced_call() helper + the comparison table. You'll fold this into every later project's observability.
Stretch: emit OpenTelemetry-style spans so it drops into a real tracing backend later.
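One possible shape for traced_call(), as a minimal sketch rather than a reference implementation. It assumes a hypothetical streaming interface (stream_fn yields (token, logprob) pairs), made-up per-1K-token prices in PRICES, and a crude whitespace token count for the prompt; swap in your provider's client, tokenizer, and real pricing. "Confidence" here is the geometric-mean probability of the first few tokens, which is one reasonable reading of the logprobs, not the only one.

```python
import math
import time
from dataclasses import dataclass
from typing import Callable, Iterator, Tuple

# Hypothetical USD prices per 1K tokens (input, output); not real provider pricing.
PRICES = {"small-model": (0.0005, 0.0015), "large-model": (0.005, 0.015)}


@dataclass
class Trace:
    model: str
    input_tokens: int
    output_tokens: int
    ttft_s: float       # time to first token, in seconds
    total_s: float      # total latency, in seconds
    cost_usd: float
    confidence: float   # geometric-mean probability of the first few tokens
    text: str


def traced_call(
    model: str,
    prompt: str,
    stream_fn: Callable[[str, str], Iterator[Tuple[str, float]]],
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
    first_n: int = 3,
) -> Trace:
    """Run one streamed completion and record tokens, latency, cost, confidence."""
    start = time.perf_counter()
    ttft = None
    pieces, logprobs = [], []
    for token, logprob in stream_fn(model, prompt):
        if ttft is None:
            ttft = time.perf_counter() - start
        pieces.append(token)
        if len(logprobs) < first_n:
            logprobs.append(logprob)
    total = time.perf_counter() - start

    input_tokens = count_tokens(prompt)
    output_tokens = len(pieces)
    price_in, price_out = PRICES[model]
    cost = input_tokens / 1000 * price_in + output_tokens / 1000 * price_out
    confidence = math.exp(sum(logprobs) / len(logprobs)) if logprobs else 0.0
    return Trace(model, input_tokens, output_tokens,
                 ttft if ttft is not None else total, total,
                 cost, confidence, "".join(pieces))


def fake_stream(model: str, prompt: str) -> Iterator[Tuple[str, float]]:
    """Stand-in for a real streaming client, so the sketch runs offline."""
    for token, logprob in [("Hello", -0.1), (" world", -0.3), ("!", -0.2)]:
        yield token, logprob


trace = traced_call("small-model", "Say hello", fake_stream)
print(trace.model, trace.input_tokens, trace.output_tokens, f"{trace.cost_usd:.7f}")
```

Running the same prompt through traced_call() once per model and collecting the Trace objects gives you the rows of the comparison table.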
Make an unreliable model produce reliable structured output — and prove it.
Pick a messy dev input: CI logs, stack traces, or API-doc snippets. Extract structured records — e.g. {error_type, file, line, root_cause, fix} or {endpoint, method, params, auth} — using function-calling / JSON mode / schema-constrained decoding, with a validate-and-retry loop.
Validity rate (% parseable and schema-valid) over a held-out set of ~50 inputs, tracked before/after each change. Push toward ~100%. Bonus: field-level accuracy on a labeled subset.
A small extraction service/function + a validity report showing the climb to ~100%.
Stretch: compare schema-constrained decoding vs. prompt-only; quantify the reliability gap.
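The validate-and-retry loop and the validity-rate metric could look roughly like this. It is a sketch under stated assumptions: the model is any callable from prompt to raw string (hypothetical; your real call goes there), the schema is the {error_type, file, line, root_cause, fix} example checked by hand with the standard library, and the retry prompt wording is illustrative only. A real service would more likely use a schema library and the provider's structured-output mode.

```python
import json
from typing import Callable, List, Optional

# Required fields and their types, taken from the stack-trace example schema.
REQUIRED = {"error_type": str, "file": str, "line": int, "root_cause": str, "fix": str}


def validate(raw: str) -> Optional[dict]:
    """Return the parsed record if it is valid JSON matching REQUIRED, else None."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict):
        return None
    for key, expected_type in REQUIRED.items():
        # Exact type check, so that e.g. True is not accepted for an int field.
        if type(record.get(key)) is not expected_type:
            return None
    return record


def extract(text: str, model: Callable[[str], str], max_attempts: int = 3) -> Optional[dict]:
    """Call the model, validate, and retry with a corrective prompt on failure."""
    prompt = text
    for _ in range(max_attempts):
        record = validate(model(prompt))
        if record is not None:
            return record
        prompt = (f"{text}\n\nThe previous output was invalid. "
                  f"Return only a JSON object with keys {sorted(REQUIRED)}.")
    return None


def validity_rate(inputs: List[str], model: Callable[[str], str], max_attempts: int = 3) -> float:
    """Fraction of inputs that end in a parseable, schema-valid record."""
    valid = sum(extract(text, model, max_attempts) is not None for text in inputs)
    return valid / len(inputs)


def make_flaky_model() -> Callable[[str], str]:
    """Stub model: every odd-numbered call returns chatter instead of JSON."""
    calls = {"n": 0}

    def model(prompt: str) -> str:
        calls["n"] += 1
        if calls["n"] % 2 == 1:
            return "Sure! Here is the JSON you asked for:"
        return json.dumps({"error_type": "KeyError", "file": "app.py", "line": 42,
                           "root_cause": "missing config key", "fix": "add a default"})

    return model


held_out = ["traceback one ...", "traceback two ..."]
print("no retry:", validity_rate(held_out, make_flaky_model(), max_attempts=1))
print("one retry:", validity_rate(held_out, make_flaky_model(), max_attempts=2))
```

Tracking validity_rate() on the same held-out inputs before and after each change is what produces the validity report.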
The centerpiece. Most candidates show a RAG demo; almost none prove a retrieval improvement with numbers. That gap is your edge.
Index a real corpus you know — a framework's docs, the Python stdlib, or your own repo. Ingest & parse it properly (the unglamorous part that decides quality), chunk, embed, retrieve. Then improve it: add reranking, query transformation (HyDE / multi-query), or contextual retrieval.
A golden set of ~30 questions, each mapped to the chunk(s) that should be retrieved. Measure recall@k and MRR. Show one specific change move the number (e.g. "reranking: recall@5 0.62 → 0.81"). Commit the eval set to the repo.
The RAG service + the committed retrieval-eval harness + a before/after table.
Stretch: add an end-to-end answer eval (LLM-judge, faithfulness/citations) on top of the retrieval eval.
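For reference, recall@k and MRR over a golden set can be computed in a few lines. This sketch assumes the golden set is a dict from question to the set of chunk IDs that should be retrieved, and that retrieve is any callable returning a ranked list of chunk IDs (both are illustrative shapes, not a fixed format). Recall@k is taken here as the fraction of a question's relevant chunks found in the top k, averaged over questions; some teams report hit rate instead, so say which definition you use.

```python
from typing import Callable, Dict, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(golden: Dict[str, Set[str]],
             retrieve: Callable[[str], List[str]],
             k: int = 5) -> Dict[str, float]:
    """Average recall@k and MRR over every question in the golden set."""
    recalls, ranks = [], []
    for question, relevant in golden.items():
        retrieved = retrieve(question)
        recalls.append(recall_at_k(retrieved, relevant, k))
        ranks.append(reciprocal_rank(retrieved, relevant))
    n = len(golden)
    return {f"recall@{k}": sum(recalls) / n, "mrr": sum(ranks) / n}


# Tiny illustrative golden set and a canned retriever standing in for the real one.
golden = {"how do I open a file?": {"io-open"},
          "how do I parse JSON?": {"json-loads", "json-load"}}
canned = {"how do I open a file?": ["os-path", "io-open", "pathlib"],
          "how do I parse JSON?": ["json-loads", "pickle", "csv"]}

print(evaluate(golden, lambda q: canned[q], k=2))
```

Run evaluate() once on the baseline retriever and once after the change, and the two dicts are your before/after table.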
Safe and resilient and evaluated. Demos that loop and burn tokens are common; this isn't one.
An agent that does a real dev task: review a PR (read the diff, run linters/tests as tools, comment), or triage a failing CI log (read logs, grep code, propose a fix). Multi-tool, with a human approval gate before any mutating action, graceful recovery when a tool errors, and injection defense (it's reading untrusted code/diffs).
Task success rate over ~20 scenarios; a safety check (never mutates without approval); and a recovery test (inject a tool failure → verify it degrades gracefully, doesn't loop or hallucinate success).
The agent + a scenario eval suite covering success, safety, and recovery.
Stretch: add a cost ceiling per run and trace token+cost per step.
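The approval gate and the graceful-recovery behavior can be prototyped without any model in the loop. The sketch below is an assumption-heavy simplification: Tool, StepResult, run_tool, and run_plan are hypothetical names, the "plan" is a fixed list of tool calls instead of model-chosen actions, and recovery simply means recording the error and stopping instead of retrying forever or reporting success. The point is that the safety and recovery checks become plain assertions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Tool:
    name: str
    fn: Callable[[str], str]
    mutating: bool = False   # mutating tools require human approval


@dataclass
class StepResult:
    tool: str
    status: str   # "ok", "denied", or "error"
    output: str


def run_tool(tool: Tool, arg: str, approve: Callable[[str, str], bool]) -> StepResult:
    """Run one tool call behind the approval gate, converting exceptions to results."""
    if tool.mutating and not approve(tool.name, arg):
        return StepResult(tool.name, "denied", "human approval not granted")
    try:
        return StepResult(tool.name, "ok", tool.fn(arg))
    except Exception as exc:  # a tool failure must never crash or fake success
        return StepResult(tool.name, "error", f"{type(exc).__name__}: {exc}")


def run_plan(plan: List[Tuple[str, str]], tools: Dict[str, Tool],
             approve: Callable[[str, str], bool], max_steps: int = 10) -> List[StepResult]:
    """Execute at most max_steps calls; stop at the first tool error."""
    results = []
    for name, arg in plan[:max_steps]:
        result = run_tool(tools[name], arg, approve)
        results.append(result)
        if result.status == "error":
            break
    return results


posted: List[str] = []   # records every mutation that actually happened


def failing_tests(arg: str) -> str:
    raise TimeoutError("test runner timed out")   # injected tool failure


tools = {
    "lint": Tool("lint", lambda arg: "0 issues"),
    "post_comment": Tool("post_comment", lambda arg: posted.append(arg) or "posted", mutating=True),
    "run_tests": Tool("run_tests", failing_tests),
}
plan = [("lint", "diff.patch"), ("post_comment", "LGTM"), ("run_tests", "."), ("lint", "diff.patch")]

results = run_plan(plan, tools, approve=lambda name, arg: False)
print([r.status for r in results], posted)
```

In a scenario suite, each scenario would supply its own plan (or model), approval policy, and injected failure, then assert on the resulting statuses.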
This is the platform/eval-engineer portfolio piece. "I built the eval infra that gates our prompt changes in CI" is literally the job.
Lift P2/P3's ad-hoc eval into a reusable harness: golden sets, an LLM-as-judge with bias controls (position-swap, reference-based rubric scoring), task metrics, and a CI gate (e.g. a GitHub Action) that fails the build when a prompt/chain change regresses the metric. Capture traces (token + cost per step).
The harness is the deliverable. Demonstrate it catching a deliberately introduced regression in CI: a red build from a worse prompt. Show that the judge's bias controls actually change a verdict.
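Position-swap is the simplest of the bias controls to demonstrate. In this sketch the judge is any callable that sees a question plus two answers and returns "first" or "second" (a hypothetical interface; a real judge would be an LLM call with a rubric). The pair is judged in both orders, and a preference counts only if it survives the swap. The two stub judges are stand-ins to make the effect visible.

```python
from typing import Callable

Judge = Callable[[str, str, str], str]   # (question, first, second) -> "first" | "second"


def debiased_verdict(judge: Judge, question: str, answer_a: str, answer_b: str) -> str:
    """Judge the pair in both orders; return "A", "B", or "tie" if the orders disagree."""
    forward = judge(question, answer_a, answer_b)    # A shown first
    backward = judge(question, answer_b, answer_a)   # B shown first
    winner_forward = "A" if forward == "first" else "B"
    winner_backward = "B" if backward == "first" else "A"
    return winner_forward if winner_forward == winner_backward else "tie"


def position_biased_judge(question: str, first: str, second: str) -> str:
    """Stub judge with pure position bias: always prefers whatever is shown first."""
    return "first"


def length_judge(question: str, first: str, second: str) -> str:
    """Stub judge with a content-based (if naive) preference for the longer answer."""
    return "first" if len(first) >= len(second) else "second"


question = "What does recall@k measure?"
answer_a = "Coverage."
answer_b = "The share of relevant chunks that appear in the top k results."

single_pass = "A" if position_biased_judge(question, answer_a, answer_b) == "first" else "B"
print("single pass:", single_pass)
print("with swap:", debiased_verdict(position_biased_judge, question, answer_a, answer_b))
print("length judge:", debiased_verdict(length_judge, question, answer_a, answer_b))
```

A single pass with the position-biased judge picks A, while the swapped pair collapses to a tie. That is the kind of changed verdict worth showing in the writeup.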
A reusable repo/package + a CI run screenshot/log showing a blocked regression.
Stretch: add an eval-flywheel step — mine real failure traces into new test cases automatically.
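The CI gate itself can be a very small script: compare the metrics from the current eval run against a committed baseline and exit non-zero on regression, which is what turns the build red. This sketch assumes two JSON files of metric name to score, that higher is better for every metric, and a tolerance of 0.02 to absorb noise; the file layout, tolerance, and function names are illustrative choices, not a standard.

```python
import json
import sys
from typing import Dict, List


def gate(baseline: Dict[str, float], current: Dict[str, float],
         tolerance: float = 0.02) -> List[str]:
    """Return one message per regressed or missing metric (empty list means pass)."""
    failures = []
    for metric, base_value in baseline.items():
        value = current.get(metric)
        if value is None:
            failures.append(f"{metric}: missing from current run")
        elif value < base_value - tolerance:
            failures.append(f"{metric}: {value:.3f} < baseline {base_value:.3f} - {tolerance}")
    return failures


def main(baseline_path: str, current_path: str) -> int:
    """Load both metric files and return a process exit code (0 pass, 1 fail)."""
    with open(baseline_path) as f:
        baseline = json.load(f)
    with open(current_path) as f:
        current = json.load(f)
    failures = gate(baseline, current)
    for message in failures:
        print("REGRESSION", message)
    return 1 if failures else 0


# Only acts as a CLI when called with exactly two paths, e.g.
#   python eval_gate.py baseline.json current.json
if __name__ == "__main__" and len(sys.argv) == 3:
    sys.exit(main(sys.argv[1], sys.argv[2]))
```

A CI step that runs the eval, writes the current metrics file, and then runs this script fails whenever a prompt or chain change pushes a metric below the baseline.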
"Cut cost X% at equal quality, measured" is a sentence that gets platform engineers hired.
Take P2 or P3 and harden it: a model cascade (cheap model first, escalate to a stronger one on low confidence / hard inputs), prompt + semantic caching, retries/timeouts/fallbacks, rate-limit handling, and full tracing with token + cost per span. Set a latency and cost budget and hold to it.
Cost and latency before → after routing + caching, at fixed quality — use P4's harness to prove quality didn't drop. Report $ and ms saved.
The hardened service + a cost/latency/quality dashboard.
Stretch: swap one cascade tier to a self-hosted open-weight model (vLLM) and compare economics.
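A model cascade with a prompt cache fits in one small class. The sketch assumes each model is a callable returning (answer, confidence), where confidence might come from the logprob measure in P0; the 0.8 threshold and the CachedCascade name are arbitrary illustrations. The cache is exact-match on the prompt hash only. Semantic caching (matching on embedding similarity) is a different technique and is not shown here.

```python
import hashlib
from typing import Callable, Dict, Tuple

Model = Callable[[str], Tuple[str, float]]   # prompt -> (answer, confidence in [0, 1])


class CachedCascade:
    """Try the cheap model first; escalate to the strong one on low confidence."""

    def __init__(self, cheap: Model, strong: Model, threshold: float = 0.8):
        self.cheap = cheap
        self.strong = strong
        self.threshold = threshold
        self.cache: Dict[str, str] = {}
        # "cheap" and "strong" count which tier produced the final answer.
        self.stats = {"cache_hits": 0, "cheap": 0, "strong": 0}

    def __call__(self, prompt: str) -> str:
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key in self.cache:
            self.stats["cache_hits"] += 1
            return self.cache[key]

        answer, confidence = self.cheap(prompt)
        if confidence >= self.threshold:
            self.stats["cheap"] += 1
        else:
            # Escalation still paid for the cheap call; count that in your cost traces.
            answer, _ = self.strong(prompt)
            self.stats["strong"] += 1

        self.cache[key] = answer
        return answer


def cheap_model(prompt: str) -> Tuple[str, float]:
    """Stub: confident only on prompts it considers easy."""
    return ("cheap answer", 0.9) if "easy" in prompt else ("unsure", 0.3)


def strong_model(prompt: str) -> Tuple[str, float]:
    """Stub for the more expensive tier."""
    return ("strong answer", 0.95)


router = CachedCascade(cheap_model, strong_model)
answers = [router("easy question"), router("hard question"), router("easy question")]
print(answers, router.stats)
```

The stats dict gives you the routing mix; combined with per-call cost traces, it supplies the "before vs. after" numbers at whatever quality level the eval harness confirms.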
The claim-the-title bar: a shipped, evaluated, instrumented dev-tools AI service with a measured improvement.
Combine the earlier projects. Ship P2 or P3 as a real cloud service (API + minimal UI or CLI), with P4 (eval gating CI) and P5 (routing / caching / tracing / budget) wired in. Then pick one real improvement, implement it, and document the before/after with numbers.
A documented before/after on one metric (quality, cost, or latency), with the eval set and traces as evidence. This — not coverage, not a quiz — is "done."
Deployed service + repo + the writeup. This is the top of your portfolio.
What the finished ladder proves to a platform/eval hiring manager