AI Interview Cheating Is Undetectable. Redesign the Test.

Invisible AI overlays like Cluely beat live coding and proctoring. The fix is not more surveillance. It is redesigning assessments to measure reasoning AI cannot fake.

Ernest Bursa

Founder · 13 min read

AI interview cheating uses invisible screen overlays, such as Cluely and Interview Coder, to feed candidates AI-generated answers during live technical interviews. The overlays hook the graphics layer, so they are invisible to screen sharing and proctoring. Because the tools are engineered to be undetectable, catching them is an arms race you lose. The durable fix is redesigning your assessment to measure reasoning and judgment, which AI cannot fake.

That is the uncomfortable shift in technical hiring right now. Your live coding round and your take-home assignment were the trusted signals. A class of consumer tools broke both of them invisibly, and most teams have not noticed yet. This piece walks through how the cheating works, which numbers you can actually trust, why detection is a dead end, and what a cheat-resistant assessment looks like in practice.

What is AI interview cheating, and how do Cluely and Interview Coder work?

AI interview cheating tools capture the interviewer’s audio and the on-screen problem text, run them through a large language model, and render the answer in an overlay that the candidate sees but screen sharing does not. They achieve invisibility by hooking the graphics layer directly (DirectX on Windows, Metal on macOS), so the answer never appears in the shared window. Reported response latency is roughly one to two seconds.

Interview Coder was built by two Columbia students, Chungin “Roy” Lee and Neel Shanmugam, to beat LeetCode-style interviews. Lee filmed himself using the invisible overlay to pass an Amazon technical interview, posted it publicly, and the pair were suspended from Columbia. In April 2025 the project was rebranded and expanded into Cluely, whose pitch was, bluntly, “cheat on everything.”

This is not a fringe hack from a hobbyist. Cluely raised a $15M Series A led by Andreessen Horowitz in June 2025, about two months after a $5.3M seed round. There is real capital and real engineering behind making interview cheating frictionless and invisible.

There is a fitting irony worth sitting with. In a March 2026 interview with TechCrunch, Roy Lee admitted the “$7M ARR” figure he had publicly claimed the prior July was fabricated; his actual Stripe data showed around $5.2M. He called it “the only blatantly dishonest thing i’ve said publicly.” A company whose entire product is undetectable dishonesty got caught being dishonest. The lesson for hiring teams is direct: you cannot out-detect a tool, or a culture, built to deceive. You have to change what you measure.

How big is the problem, and which numbers can you trust?

The honest answer is that the cheating is widespread, but the most-quoted statistic is the least reliable one. Start with the independent evidence, then treat the vendor numbers with appropriate suspicion.

The strongest independent signal comes from interviewing.io, which surveyed 67 interviewers at FAANG and FAANG-adjacent companies in October 2025. The results:

  • 81% suspect that candidates have used AI to cheat in their interviews.
  • About 33% have actually caught someone doing it.
  • 75% believe AI assistance lets weaker candidates pass interviews they would otherwise fail.

That triangulates with Karat, whose co-founder reported that roughly 80% of candidates use LLMs on coding tests even when explicitly told not to. Two independent practitioner sources, both pointing the same direction.

Now the headline number you have probably seen. Fabric, an AI interview platform, reported that 38.5% of candidates were flagged for cheating across 19,368 interviews on its own platform between July 2025 and January 2026. Treat that with caution, and here is exactly why. The number comes from a single vendor that sells the detection product, measuring “cheating” as a probability score above a threshold from its own model, on interviews run on its own AI-led platform. “Flagged by a detection model” is not the same as “confirmed cheating.” Fabric’s own publications give conflicting figures elsewhere (38.5%, 35%, and 15% all appear). If you cite it, attribute it to Fabric and label it for what it is: a vendor self-measure, not an established fact about reality.

The takeaway does not depend on Fabric’s number. Independent interviewers tell you most of them suspect cheating and a third have caught it. That alone should change how you run your process.

Why detection and proctoring lose the arms race

Detection loses because the tools are engineered specifically to defeat it. An overlay that hooks the graphics layer is invisible to screen sharing by design. A second device sitting off-camera is invisible to any browser lockdown. Eye-tracking and keystroke analysis produce false positives that punish nervous-but-honest candidates while a calm cheater sails through. Every detection layer you add is one the next release routes around, and it degrades the experience for everyone who is not cheating.

The cleanest proof that detection fails is interviewing.io’s controlled experiment. They ran 37 mock interviews (32 usable) with experienced engineers (four or more years of experience) who were instructed to secretly use ChatGPT, while the interviewers were blind to the setup. The results are stark:

Question type                    Pass rate (secret ChatGPT use, except control)
Verbatim LeetCode questions      73%
Lightly modified LeetCode        67%
Custom, novel questions          25%
Control (no cheating)            53%

Two findings from that experiment matter most. First, not a single interviewer flagged the cheating, even at a 73% pass rate. Second, 81% of the cheating candidates felt confident they had gotten away with it. Human detection failed completely against verbatim questions. No proctoring product would have done better, because the AI output never touched the shared screen.

But look at the custom-question row of that table. The same cheating that worked 73% of the time on verbatim questions worked only 25% of the time on custom ones. Detection did nothing. Redesign cut the cheat’s success by roughly two-thirds. That is the whole argument in one row.
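The arithmetic behind that claim is easy to check. The sketch below uses only the pass rates from the interviewing.io table; the function and variable names are illustrative and are not part of interviewing.io's own analysis.

```python
# Pass rates (percent) copied from the interviewing.io table in this article.
pass_rates = {
    "verbatim": 73,
    "lightly_modified": 67,
    "custom": 25,
    "control_no_cheating": 53,
}


def relative_drop(before: float, after: float) -> float:
    """Fraction by which `after` is lower than `before`."""
    return (before - after) / before


# How much does each kind of redesign reduce the cheat's pass rate?
verbatim_to_custom = relative_drop(pass_rates["verbatim"], pass_rates["custom"])
verbatim_to_modified = relative_drop(
    pass_rates["verbatim"], pass_rates["lightly_modified"]
)

print(f"verbatim -> custom:           {verbatim_to_custom:.0%} lower")
print(f"verbatim -> lightly modified: {verbatim_to_modified:.0%} lower")
```

Run as written, the first figure comes out at about 66% and the second at about 8%: a genuinely custom question removes roughly two-thirds of the cheat's success, while a light rewording removes almost none. The 25% cheat-pass rate on custom questions is also below the 53% control rate in the same table.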

The fix: redesign assessments so AI assistance is irrelevant or expected

The durable response is not to catch the AI at the gate. It is to redesign the gate so AI assistance does not help, or so you assume it is present and evaluate how well the candidate wields it. The interviewing.io data already showed the direction: custom, novel problems collapse the cheating advantage because the model has no public answer to pattern-match against.

One caveat from the same research is important. Merely rewording an existing LeetCode problem is not enough. Lightly modified questions still had a 67% cheat-pass rate, barely below verbatim. An effective custom question needs genuinely unique inputs and outputs, ideally tied to your own domain, so the model cannot recognize it. The recurring principles across independent sources look like this:

  1. Validate reasoning and process, not final-answer syntax. The answer is the cheap part now. How a candidate frames the problem, weighs tradeoffs, and recovers from a wrong turn is the signal.
  2. Use custom problems with novel inputs and outputs. Not public, not published, not a reskin of a known puzzle.
  3. Probe understanding with line-by-line follow-ups. “Why did you choose this data structure?” “Now extend it to handle this case.” A candidate who leaned on an overlay cannot defend or modify code they did not reason through.
  4. Use realistic, multi-step, multi-file tasks. State-of-the-art models still degrade on long, multi-step reasoning chains, and real work is not a single function with a single correct output.
  5. Where it fits the role, treat AI as expected. Assess how well the candidate directs, critiques, and corrects the AI, because that is the actual job now.

That last point is where the industry frontier is heading. CodeSignal launched AI-assisted assessments that let candidates use AI and grade how well they use it. The mature stance is not “lock AI out.” It is “assume AI is present, and measure the human judgment around it.”

This is not a fringe view, and it does not mean burning down your process. Among the 52 FAANG respondents in the same interviewing.io survey, none said their company had abandoned algorithmic questions, but 58% said they had changed the types of questions they ask, and only about 11% had adopted cheating-detection software. More than half predicted that algorithmic interviews will decline in prominence within two to five years. Meta interviewers reported shifting to “more open-ended questions which probe thinking.” The realistic path is redesign, not surveillance, and not abandonment.

What a cheat-resistant technical assessment looks like in practice

A cheat-resistant assessment is one where AI assistance does not change the outcome because you are measuring things AI cannot fake on someone else’s behalf: domain framing, defensible decisions, and the ability to extend the work live. Here is the concrete shape.

Give a custom, multi-file, company-specific task

Replace the public algorithm puzzle with a small slice of your real problem. A bug in a realistic codebase, a feature on top of starter code you wrote, a data-modeling task with inputs no model has seen. Because it is yours, no LLM has a memorized answer, which is exactly the condition that dropped the cheat-pass rate from 73% to 25%. For more on building tasks candidates respect, see how to structure code assignments.

Put a “walk us through and extend it” round right after the take-home

This is the single highest-leverage change. Schedule a live round immediately after the assignment whose only job is to have the candidate explain their solution line by line and then extend it on the spot. “Add this edge case.” “Refactor this for readability.” A candidate who genuinely solved the task does this easily. A candidate who pasted in an overlay’s output cannot, because they never built the mental model. This operationalizes interviewing.io’s line-by-line follow-up finding directly inside your pipeline.

Score with structured, weighted, blind reviews

Have multiple reviewers evaluate the same submission against the same named criteria, weighted by what matters for the role, before they see each other’s votes. Blind voting removes anchoring. Weighted scorecards force everyone to evaluate the same competencies instead of vibes. This is where you capture the reasoning signal that a pass/fail checkmark throws away.
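As a rough illustration of the mechanics, here is a minimal Python sketch of weighted, blind scoring. Everything named in it is an assumption made for the example: the four criteria, their weights, the 1-to-5 scale, and the reviewer names. "Blind" is modeled in the simplest possible way, by refusing to release any scores until every expected reviewer has submitted; a real review tool may enforce this differently.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical criteria and weights for illustration; the weights sum to 1.0.
WEIGHTS = {
    "problem_framing": 0.25,
    "tradeoff_reasoning": 0.30,
    "defends_own_code": 0.25,
    "extends_solution_live": 0.20,
}


@dataclass
class Review:
    reviewer: str
    scores: dict  # criterion name -> score on an assumed 1-5 scale


def weighted_score(review, weights=WEIGHTS):
    """Weighted sum of one reviewer's scores over the named criteria."""
    if set(review.scores) != set(weights):
        raise ValueError(f"{review.reviewer} must score exactly the named criteria")
    return sum(weights[name] * review.scores[name] for name in weights)


def reveal(reviews, expected_reviewers):
    """Blind step: release scores only after every expected reviewer has submitted."""
    if {r.reviewer for r in reviews} != set(expected_reviewers):
        raise RuntimeError("scores stay hidden until all reviewers have submitted")
    per_reviewer = {r.reviewer: weighted_score(r) for r in reviews}
    return per_reviewer, mean(per_reviewer.values())


# Example panel with two hypothetical reviewers.
reviews = [
    Review("reviewer_a", {"problem_framing": 4, "tradeoff_reasoning": 5,
                          "defends_own_code": 4, "extends_solution_live": 3}),
    Review("reviewer_b", {"problem_framing": 3, "tradeoff_reasoning": 4,
                          "defends_own_code": 2, "extends_solution_live": 2}),
]
per_reviewer, panel_mean = reveal(reviews, {"reviewer_a", "reviewer_b"})
print(per_reviewer, round(panel_mean, 3))
```

The design choice worth noticing is that a reviewer cannot skip a criterion or invent a new one: the scorecard rejects any review that does not cover exactly the named competencies, which is what keeps two reviewers comparable.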

Change the question type, do not ban the algorithm

FAANG did not abandon algorithmic interviews; they changed the kind of question and added open-ended probes. You can keep a screening filter while making the deciding rounds resistant to one-shot AI answers. The goal is signal, not purity.

Why structured, reasoning-first scoring is the real upgrade

Structured scoring is the best-established idea in this entire piece, and it predates the AI era. Structured interviews, where every candidate faces the same questions scored against the same behaviorally-anchored rubric, are roughly twice as predictive of job performance as unstructured ones. Standardized scorecards reduce noise and bias because they hold everyone to the same criteria instead of the interviewer’s mood. The common recommendation is 5 to 7 weighted competencies.

AI cheating did not create the case for structured scoring; it made it urgent. When the final answer is commodity, the only durable signal is how the candidate got there and whether they can defend it. A rubric that scores “explained tradeoffs clearly” and “extended the solution correctly under pressure” measures exactly what an overlay cannot supply. If you want the deeper argument, read structured interview scorecards and predictive validity.

The shift in mindset is from catching to measuring. Stop asking “is this person cheating?” and start asking “can this person reason about this problem in front of me?” The second question is harder to game and far more predictive.

How Kit bakes cheat-resistant evaluation into the pipeline

Most of the market splits into two camps. Detection vendors fight an arms race against tools engineered at the graphics layer. Assessment platforms build great problems but live in a separate silo from your pipeline. Kit takes a third path: it makes structured, reasoning-first evaluation the default shape of the pipeline itself, so the redesign is built once and reused, not improvised per role.

Here is how that maps to everything above:

  • Code assignments backed by real GitHub repos. Each candidate gets a private repo generated from your own template repository, with your README, your starter code, even your CI. That is what lets you ship a custom, multi-file, company-specific task rather than a public puzzle, which is the design choice that collapses the AI cheat advantage.
  • A defend-and-extend live round, sequenced right after. Kit’s process templates let you order stages freely, so you can drop a live interview round immediately after the code assignment whose purpose is “walk us through and extend your solution.” The candidate who relied on an overlay cannot authentically defend or modify the code.
  • Structured team review with blind voting and weighted scorecards. Reviewers score the same submission against named, weighted criteria with recommendations from Strong No to Strong Yes, and can vote blind so no one anchors on the lead. This is the structured rubric the research says doubles predictive validity, applied to reasoning rather than to a green checkmark.
  • Deliberate panel decisions, not rubber stamps. Voting supports a positive-vote threshold, an all-reviewers requirement, and veto auto-rejects; ambiguous rounds are routed to a human as “needs a decision.” A panel decides on signal quality instead of an algorithm waving through output a bot may have produced.
  • Reusable process templates. Build the cheat-resistant pipeline once as a process template and reuse it across roles, so reasoning-first hiring is the default, not a heroic one-off.
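To make the panel-decision rules in that list concrete, here is a minimal Python sketch of one way such logic could be written. It is not Kit's implementation. The intermediate vote values between Strong No and Strong Yes, the choice to treat a Strong No as the veto, the reading of the all-reviewers rule as "everyone must have voted", and all function, parameter, and outcome names are assumptions made for the example.

```python
from enum import IntEnum


class Vote(IntEnum):
    # Only the two endpoints come from the article; NO and YES are assumed.
    STRONG_NO = -2
    NO = -1
    YES = 1
    STRONG_YES = 2


def panel_decision(votes, expected_reviewers, min_positive, veto_on_strong_no=True):
    """Return 'pending', 'reject', 'advance', or 'needs_decision'.

    votes: dict mapping reviewer name -> Vote
    expected_reviewers: everyone who must vote before a decision is made
    min_positive: positive-vote threshold needed to advance
    """
    # Assumed all-reviewers rule: no outcome until the whole panel has voted.
    if set(votes) != set(expected_reviewers):
        return "pending"
    # Assumed veto rule: a single Strong No rejects automatically.
    if veto_on_strong_no and Vote.STRONG_NO in votes.values():
        return "reject"
    positives = sum(1 for vote in votes.values() if vote > 0)
    if positives >= min_positive:
        return "advance"
    if positives == 0:
        return "reject"
    # Mixed signal below the threshold: hand it to a human.
    return "needs_decision"


panel = {"reviewer_a", "reviewer_b", "reviewer_c"}
split_votes = {
    "reviewer_a": Vote.YES,
    "reviewer_b": Vote.NO,
    "reviewer_c": Vote.NO,
}
print(panel_decision(split_votes, panel, min_positive=2))
```

The point of the last branch is that a split panel never resolves automatically. In this sketch only a clear pass, a clear fail, or a veto is decided by rule; everything else goes to a person.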

To be clear about what Kit does not do: there is no AI-cheating detection, no proctoring, no eye-tracking, and no autograder. That is deliberate. You cannot reliably detect a tool built to be invisible. So Kit does not try to catch the cheat. It helps you measure what the cheat cannot fake, which is the honest and stronger position.

The threat is real and the surveillance response is a trap. Invisible overlays beat live coding and they beat proctoring, and the data shows zero interviewers noticing. The same data shows custom questions cutting the cheating advantage by two-thirds and structured scoring roughly doubling predictive validity. Stop trying to catch AI at the gate. Redesign the gate so AI assistance is irrelevant, and make that redesign the default shape of your pipeline.

If you are rethinking technical assessment for the AI era, start a free trial and build a code assignment plus structured review pipeline that measures reasoning, not syntax. For the adjacent identity threat, where the candidate themselves may be fake, see deepfake candidates and AI hiring fraud.
