The Whiteboard Interview Is Dead: Fair, AI-Proof Hiring

AI broke whiteboards and take-homes in 2026. Here's the decision framework for fair, AI-proof work-sample assessments, grounded in how Anthropic, Stripe, and Linear hire.

Ernest Bursa

Founder · 12 min read

As a standalone signal, the whiteboard interview is dead. A controlled NC State and Microsoft study found candidates in watched technical interviews performed about half as well as those solving the same problem privately, and generative AI now solves both whiteboard puzzles and take-homes in minutes. The durable replacement is a work-sample assessment: a job-relevant task that ends in a live defense, where the candidate explains and changes real decisions out loud.

That last move is the whole game. AI overlays can write code during a screen-share and finish a “3-hour” take-home in minutes, but they cannot defend a tradeoff in real time. The strategic response from the best-run engineering teams in 2026 is not surveillance software. It is a format shift toward assessments that test judgment and communication, the two things AI still cannot fake live. This guide gives you the decision framework: which format to use now, how to make it both fair and cheat-resistant without spyware, and how Anthropic, Stripe, Vercel, and Linear actually choose.

Are Whiteboard Interviews Dead? (Yes, and AI Is Only Half the Reason)

Yes, as a standalone signal. The whiteboard interview was broken before AI ever touched it, and AI removed whatever signal was left.

The first problem is that it never measured the right thing. In a controlled experiment, Behroozi and colleagues at NC State and Microsoft (2020) had candidates solve the same problem in two conditions: alone, and watched by an interviewer in a traditional whiteboard setup. Candidates in the watched condition performed roughly half as well. The format largely measures performance anxiety and working-memory load under observation, not engineering competence. It also penalizes exactly the people you want to hire fairly: introverts, neurodivergent candidates, and anyone whose communication style does not match a high-pressure verbal performance.

The second problem arrived in 2025. Overlay tools like Cluely, Interview Coder, and Leetcode Wizard now feed answers invisibly during a screen-share. A standard LeetCode-style problem is solved silently in the background while the candidate types. If your technical screen still relies on competitive-programming puzzles, you are no longer measuring the candidate. You are measuring their tooling.

This does not mean live coding is worthless. It means the watched-puzzle format is. The version that survives is collaborative live coding: pairing in a real IDE on a realistic problem, where the interviewer is a partner thinking alongside the candidate, not a proctor waiting for the right answer. That tests how someone reasons, asks questions, and works in unfamiliar code, which is both closer to the job and far harder to fake with an overlay.

Why AI Broke the Take-Home Too (and Why Surveillance Is the Wrong Fix)

The unsupervised take-home is now the most AI-exposed format of all. The fix is not detection software. It is design.

Take-homes always had the best real-world-validity story, and they remain valuable. But an unsupervised, undefended async task is the easiest thing in your loop for AI to complete. Assessment vendor Fabric reports that a take-home designed to take three hours can be finished by AI tools in roughly eight minutes, and that cheating adoption in its candidate pool more than doubled across 2025, from about 15% to 35%. Treat those exact numbers as directional rather than gospel; they are self-reported by the vendor and not backed by a cited source. The direction, though, is not in dispute, and any engineering leader who has reviewed a take-home that “felt too clean” already knows it.

The tempting response is to buy your way out with proctoring: eye-tracking, keystroke logging, screen lockdown, browser spyware. Resist it, for three reasons.

  • It is adversarial and brand-damaging. Engineering is a small, talkative community. Candidates share surveillance horror stories, and your best applicants self-select out before they apply.
  • It creates its own bias and accessibility problems. Lockdown and eye-tracking tools penalize neurodivergent candidates, disabled candidates, and anyone with a nonstandard setup. The EEOC and DOJ have made clear that employers remain liable when an automated assessment tool causes adverse impact, regardless of who built it.
  • It does not even work. Surveillance fights the symptom. A second monitor or a phone defeats most of it. You spend trust and budget to lose anyway.

The durable answer is design-based resistance: build formats where the signal lives in live reasoning, so there is nothing for an overlay to rescue. In some roles you can go further and explicitly allow AI during the task, then score how well the candidate directs and critiques it, because that mirrors the actual job.

Which Assessment Format Should You Use Now? A Decision Framework

Match the format to the daily reality of the role, and make sure at least one round forces real-time judgment. There is no single best format; there is a best format for this role.

Format | Best for | Why it resists AI
Pair programming on a realistic problem | Roles where collaboration and working in unfamiliar code are the job | Thinking is observed live and collaboratively; an overlay can’t narrate reasoning for you
Take-home + live defense | Roles where deep, independent async work is the job | The defense round tests decisions the candidate must own out loud
System design | Senior and infrastructure roles | It’s about tradeoffs and communication, not retrievable answers
Async code review of real code | Remote-first, async-heavy cultures | Tests comprehension and critique, not generation

The through-line across all four is the same: the most AI-proof signal is a candidate defending real decisions in real time. Pick the format that looks most like a normal Tuesday in the role, then make sure the candidate has to explain their thinking to a human at least once.
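As a rough illustration, the matching logic in the table can be written down as a small lookup. This is a minimal sketch, not a scoring tool: the `RoleProfile` fields, the `plan_loop` function, and the "live discussion" fallback label are all hypothetical names invented for this example, and the only rule it encodes is the one stated above (match the format to the role, and guarantee at least one live round).

```python
from dataclasses import dataclass

# Formats from the table above. The flag records whether the format
# already includes a real-time conversation with a human.
FORMATS = {
    "pair programming on a realistic problem": True,
    "take-home + live defense": True,
    "system design": True,
    "async code review of real code": False,
}


@dataclass(frozen=True)
class RoleProfile:
    """Hypothetical summary of what a normal week in the role looks like."""
    collaborative: bool = False         # pairing in unfamiliar code is the job
    deep_async_work: bool = False       # long stretches of independent work
    senior_or_infra: bool = False       # senior or infrastructure role
    remote_async_culture: bool = False  # remote-first, async-heavy team


def plan_loop(role: RoleProfile) -> list[str]:
    """Pick formats whose 'best for' column matches the role."""
    chosen = []
    if role.collaborative:
        chosen.append("pair programming on a realistic problem")
    if role.deep_async_work:
        chosen.append("take-home + live defense")
    if role.senior_or_infra:
        chosen.append("system design")
    if role.remote_async_culture:
        chosen.append("async code review of real code")
    # Nothing matched: fall back to a take-home that ends in a live defense.
    if not chosen:
        chosen.append("take-home + live defense")
    # Enforce the through-line: at least one round of live judgment.
    if not any(FORMATS[name] for name in chosen):
        chosen.append("live discussion of the async work")
    return chosen
```

For example, `plan_loop(RoleProfile(remote_async_culture=True))` returns the async code review plus an added live discussion, because an async-only loop would leave no round where the candidate explains their thinking to a human.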

A practical default for most startup engineering roles is the second row: a short, paid, realistic take-home that becomes the agenda for a live conversation. You get the ecological validity of real work plus the cheat-resistance of a live defense. If you want the tactical mechanics of designing that take-home itself (scope, time budget, and grading), see our deep-dive on how to structure code assignments candidates don’t hate.

The One Move That Makes Any Format AI-Proof: The Live Defense

The single most durable anti-cheating mechanism is to end every async artifact with a live defense: “Walk me through this. Now change requirement X. Why did you choose this over the alternative?”

Here is why it works. An overlay LLM can produce the code. It cannot, in real time, explain why one data model beat another for this constraint, adapt when you change the spec mid-conversation, or debug the thing it supposedly wrote. The artifact stops being the final signal and becomes the agenda for a 20- to 30-minute conversation about judgment. Someone who genuinely built it sails through. Someone who pasted it from a tool stalls on the first “why.”

The live defense also quietly fixes the fairness problem. You are no longer scoring problem-solving speed under observation, which the NC State study suggests mostly reflects anxiety. You are scoring reasoning about work the candidate already did at their own pace, which is both fairer and a far better predictor of on-the-job performance.

Concretely, the move looks like this in any loop:

  1. Candidate completes a small, realistic, paid work sample async.
  2. A 25-minute live session opens with “walk me through your approach.”
  3. You change one requirement live and watch them adapt.
  4. You ask them to debug or extend one piece on the spot.
  5. Reviewers score the reasoning, on a rubric, before anyone debriefs.

No spyware. No accusations. Just a conversation that an AI cannot have on the candidate’s behalf.
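Step 5 is the easiest one to skip in practice, so here is a minimal sketch of what "score before anyone debriefs" could look like as a rule in code. Everything in it is an assumption made for illustration: the `DefenseReview` class, the three rubric dimensions (taken loosely from steps 2 to 4), and the 1-to-4 scale are hypothetical, not a prescribed rubric.

```python
from statistics import mean

# Hypothetical rubric dimensions, loosely based on steps 2-4 above.
RUBRIC = (
    "explains approach",
    "adapts to changed requirement",
    "debugs or extends live",
)


class DefenseReview:
    """Collects independent scores and hides them until everyone has submitted."""

    def __init__(self, reviewers):
        self._expected = set(reviewers)
        self._scores = {}

    def submit(self, reviewer, scores):
        if reviewer not in self._expected:
            raise ValueError(f"{reviewer} is not on this panel")
        if reviewer in self._scores:
            # Scores are locked once submitted, so the debrief cannot revise them.
            raise ValueError(f"{reviewer} has already submitted")
        if set(scores) != set(RUBRIC):
            raise ValueError("score every rubric dimension, and only those")
        if not all(1 <= value <= 4 for value in scores.values()):
            raise ValueError("scores must be on the 1-4 scale")
        self._scores[reviewer] = dict(scores)

    def ready_for_debrief(self):
        return set(self._scores) == self._expected

    def summary(self):
        """Average score per dimension, available only after all submissions."""
        if not self.ready_for_debrief():
            raise RuntimeError("debrief is blocked until every reviewer has scored")
        return {
            dim: mean(card[dim] for card in self._scores.values())
            for dim in RUBRIC
        }
```

The design choice worth noting is that `summary` refuses to run until every reviewer has submitted, and `submit` refuses a second submission. That keeps each scorecard independent of what the other reviewers said.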

How Anthropic, Stripe, Vercel, and Linear Actually Hire

The best-run engineering teams have already made this shift. None of them rely on watched whiteboard puzzles, and none of them rely on surveillance. They rely on realistic work plus live judgment.

Anthropic runs a recruiter screen, a technical phone screen, then either a take-home or a roughly 60-minute live assessment (role-dependent, in CodeSignal, and explicitly not LeetCode-style), followed by four to six onsite rounds including system design and a heavily weighted values round. Most notably, the company that makes Claude publishes an explicit candidate AI policy. As of a July 2025 reversal, candidates may use AI to polish application materials, but it is prohibited in live interviews and take-homes: “Complete these without Claude unless we indicate otherwise. We’d like to assess your unique skills.” That is AI-proofing by design plus honesty with candidates, from the team with the most reason to think hard about it.

Stripe runs a deliberately practical loop: debug an unfamiliar codebase, build a small integration from scratch, work multi-part problems while narrating your thinking. Some rounds run as pairing. It is closer to real engineering than competitive programming on purpose.

Vercel uses a collaborative, build-style coding session plus system design, weighted toward frontend product judgment and communication.

Linear uses a short (around three-hour), paid, work-trial-style project followed by a code-review discussion, and requires a near-unanimous “strong yes” from the panel to extend an offer. Structure, a high bar, and work-relevance, in one loop.

A useful contrast is the GitLab-style pattern: an async code review of a real merge request as the basis for a live discussion. It tests reading and critiquing real code rather than generating it, which fits a remote-async culture. The point of listing five different approaches is not that one is correct. It is that each company matched the format to how it actually works, and every one of them ends in a moment of live, defensible judgment.

Is the New Format Actually Fairer? What the Evidence Says

Job-relevant work samples are among the most valid and lowest-bias selection methods, but only when they are structured. Fairness comes from structure, job-relevance, and consistency, not from the format label.

Be careful with the numbers, because the canon was recently corrected. Sackett, Zhang, Berry, and Lievens (2022) re-analyzed decades of personnel-selection research and lowered several long-cited validity estimates:

  • Structured interviews are now the single best predictor at an operational validity of about .42 (revised down from .51).
  • Work-sample tests sit at about .33 (revised sharply down from the long-quoted .54).
  • General cognitive ability sits at about .31 (down from .51).

The ranking is the headline: a well-structured interview, the live defense done rigorously, now predicts performance better than a raw work sample or a cognitive test. That is a strong argument for the live-defense move on its own merits, not just as an anti-cheating tactic.

On bias specifically, use peer-reviewed effect sizes rather than the recycled marketing stats. Aamodt’s meta-analysis found unstructured interviews far more susceptible to bias (d = .59) than structured ones (d = .23), and racial score gaps shrink as structure increases. Add one more lever: pay candidates for substantial work-sample stages. Campion and colleagues (2025) found that practice and paid work-sample testing reduce subgroup score differences, and paying for real work also raises completion and helps caregivers and lower-income candidates who cannot donate unpaid hours.
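To make the effect sizes concrete: assuming the d values quoted above are standardized mean differences (Cohen's d, the usual convention, though that is an assumption about how the cited figures were computed), d is the gap between two groups' average scores divided by their pooled standard deviation. The sketch below computes it for two lists of scores. The function name and the sample numbers are illustrative only and are not data from any study.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference between two groups of scores."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two scores")
    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("scores have no spread, so d is undefined")
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Made-up rubric scores for two groups, for illustration only.
example = cohens_d([3, 4, 5], [2, 3, 4])
```

Read this way, d = .59 means the average gap between groups is a bit more than half a standard deviation, and d = .23 means it is under a quarter of one.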

Why the “42% / 81% Bias Reduction” Stats You’ve Seen Are Unreliable

You will find dozens of vendor blogs claiming structured interviews “reduce gender bias 42%, racial bias 35%, and improve accuracy 81%.” Those three numbers have no traceable primary study; they are copied from one source to the next. Use the peer-reviewed figures above instead. The credibility of your fairness argument depends on citing research that actually exists, especially in a regulatory environment where the EEOC and DOJ expect you to defend your process.

Design Fair, AI-Proof Assessments by Default with Kit

AI broke the whiteboard and the unsupervised take-home in the same year. The fix is not surveillance. It is designing the right format: job-relevant work samples, paid and structured, that always end in a live defense. The problem with doing this by hand is that the pieces (the realistic task, the payment, the scheduled defense, the independent scoring) live in five different tools and tend to drift apart. Kit makes them one composable pipeline.

  • Composable process templates let you encode the thesis directly: an application form flows into a code-assignment stage, then into a live-interview round, then into team review and an offer. The take-home is built to be the agenda for the defense, not the final signal.
  • The code-assignment stage is a realistic work sample, not LeetCode. It uses a GitHub-based private repo cloned from a template, with a real branch-and-PR workflow and a configurable deadline. It is job-relevant by construction.
  • Per-stage payouts let you pay candidates for substantial work-sample stages, which is both the fairness move backed by Campion (2025) and a clear respect signal.
  • Team review with per-stage reviewers gives you structured, independent scorecards before the debrief, the highest-leverage anti-bias mechanism in the research and the auditable artifact the EEOC and DOJ expect.
  • Live-interview scheduling productizes the defense round, so the “walk me through this” conversation is a built-in stage rather than an afterthought.

If you want the validity case in depth, read structured interview scorecards and predictive validity, and for the broader shift away from puzzle screens, see why LeetCode is obsolete in a post-AI interview.

The whiteboard is gone and the unsupervised take-home went with it. What replaces them is not a new gadget. It is a format choice: paid, structured, job-relevant work that a candidate defends out loud. Build that once, and your loop is fair and AI-proof by design. Start a free trial and compose your first AI-proof pipeline, or browse the role templates to start from a pre-built one.
