Investigation · Issue 01 · June 2026

How LLM résumé tools actually lie.

A categorized walk-through · Klepify Engineering · 12-min read

Every AI résumé tool tells you it won't make things up. Most of them do anyway — predictably, in specific places. Here's the taxonomy of those lies, what causes each one, and which tools have actual defenses vs. just polite prompt instructions.

Search "Teal hallucination" or "Rezi made up" on Trustpilot, Reddit, or Hacker News and you'll find the same complaint patterns across vendors: invented metrics, drifted job titles, fabricated skills lifted from the JD, name typos, claimed certifications the candidate never held. These aren't bugs in the conventional sense. They're the predictable consequence of an architecture that asks an LLM to creatively rewrite text under JD-keyword pressure, then trusts the LLM not to drift.

We've spent the last six months building a résumé tailoring pipeline that doesn't trust the LLM on the things it lies about most. In the process we mapped the failure modes pretty carefully. This is what we found.

"Don't invent" is a polite request. Polite requests fail under pressure.

LLM résumé hallucinations cluster into five categories. Each has a different root cause, and each requires a different kind of defense. The defenses aren't optional add-ons — they're architectural decisions made before the LLM is ever called.

Structural drift

The lie: The model changes a job title, company name, date range, or your full name to "better match" the JD.

What it looks like:

Source résumé

Senior Software Engineer
@ Acme Corp
2022 — 2024

LLM rewrite (JD asked for "Lead")

Lead Software Engineer
@ Acme Corp
2022 — 2024

Root cause: When the JD demands "Lead" or "Senior" or "Staff" and the candidate's title is one level off, the model under JD-keyword pressure will quietly shift the title. The system prompt says "don't change job titles" but the JD says "Lead Frontend Engineer" and the LLM is trying to satisfy both. Sometimes the JD wins.

Why it's the worst category: The user can't catch this by reading their tailored résumé — it looks like their résumé. They sent it. Recruiter calls. Asks about their experience as a Lead. They have no idea.

The defense that works: deterministic server-side overwrite. After generation, walk every role block in document order and replace whatever <h3> the model emitted with the exact title/company/date from the parsed résumé. The model's output becomes irrelevant for those fields. This is what Klepify does; no other major tool we surveyed publishes anything equivalent.
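A minimal sketch of what such an overwrite could look like, not Klepify's production code. It assumes the model returns HTML in which every `<h3>` is a role heading, and that the parsed résumé yields roles in the same document order. The `ParsedRole` shape, the `lockRoleHeadings` name, and the "title @ company · dates" heading format are all hypothetical.

```typescript
// Hypothetical shape of one role taken from the parsed source résumé.
interface ParsedRole {
  title: string;
  company: string;
  dates: string;
}

// Escape text before placing it inside an HTML element.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Replace the i-th <h3> in the model's HTML with the i-th parsed role.
// The heading text the model wrote is never read, only discarded.
function lockRoleHeadings(modelHtml: string, roles: ParsedRole[]): string {
  let index = 0;
  return modelHtml.replace(/<h3\b[^>]*>[\s\S]*?<\/h3>/gi, (original) => {
    const role = roles[index++];
    // More headings than parsed roles: this sketch leaves the extra heading
    // untouched; a stricter pipeline might reject the whole output instead.
    if (!role) return original;
    const text = `${role.title} @ ${role.company} · ${role.dates}`;
    return `<h3>${escapeHtml(text)}</h3>`;
  });
}
```

Because the replacement never reads the model's heading, a drifted "Lead Software Engineer" comes back as the parsed "Senior Software Engineer" no matter what the JD asked for.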

Invented metrics

The lie: The model adds a specific number (a percentage, a dollar amount, a team size) that wasn't in your source.

What it looks like:

You wrote

Maintained the team Slackbot. Reduced support tickets in Q2 by tightening the lookup flow.

LLM rewrote it as

Spearheaded the team Slackbot used by 200+ engineers daily, leveraging TypeScript to drive a 47% reduction in support tickets in Q2.

Root cause: "Achievement-focused" prompting (the standard pattern across the industry) instructs the model to "highlight results with metrics." If your source bullet has no metric, the model invents one to satisfy the instruction. The number always sounds specific (47%, not 50%; 200+, not "a lot"). That specificity is what makes it land — and what makes it dangerous.

The defense that's possible but partial: regex over the LLM output for digits + percent/$ patterns, compare against the source. If a number appears in the rewrite that didn't appear in the source bullet, flag it. Klepify doesn't currently do this; it's on our roadmap. Today, the diff-highlighter in the viewer surfaces the new text so the user can audit it manually.
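As a rough illustration of that check (a sketch under stated assumptions, not a shipped Klepify feature), the code below extracts number-like tokens from the source bullet and the rewrite and reports any that appear only in the rewrite. The function names and the exact pattern are hypothetical. The pattern only covers digits with an optional leading `$` or trailing `%`, so spelled-out numbers such as "twelve" slip through.

```typescript
// Matches number-like tokens such as "47%", "$2.3", "200", or "1,200".
const NUMBER_PATTERN = /\$?\d[\d,]*(?:\.\d+)?%?/g;

// Collect the distinct numeric tokens in a piece of text.
// Commas are stripped so that "1,200" and "1200" compare as equal.
function extractNumbers(text: string): Set<string> {
  const matches = text.match(NUMBER_PATTERN) ?? [];
  return new Set(matches.map((m) => m.replace(/,/g, "")));
}

// Return every numeric token that appears in the rewrite but not in the source.
function findUnsupportedNumbers(source: string, rewrite: string): string[] {
  const known = extractNumbers(source);
  return Array.from(extractNumbers(rewrite)).filter((n) => !known.has(n));
}
```

Run against the Slackbot example above, this would flag "200" and "47%" while leaving the "2" in "Q2" alone, since that one also appears in the source. The comparison is deliberately strict: a source "47" and a rewritten "47%" count as different tokens, so the check errs toward flagging.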

The defense the industry uses: prompt instructions ("don't invent metrics") and the user's eyes. That's it.

Skill fabrication

The lie: The model adds skills to your "Skills" section or "Stack" line that you didn't list anywhere in your source.

What it looks like:

Source tech_stack

TypeScript, Node.js, Postgres, Slack API

LLM output (JD wanted React + TanStack)

React, TypeScript, TanStack Query, gRPC, Node.js, Cypress

Root cause: Identical to the structural-drift case — JD demands a skill, model adds it. The Trustpilot complaint pattern we documented earlier ("AI generated bullet points for skills I never listed") is exactly this category. It's the most embarrassing failure mode because once you're in an interview the recruiter expects you to know the thing.

The defense that's possible: keep a structured skills array on the parsed résumé. After LLM generation, walk the Skills section, parse each skill the model emitted, drop any that don't appear in the parsed array. Klepify doesn't currently do this either — the system prompt has the rule, but enforcement is on our roadmap. Today, the diff-highlighter surfaces them.
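A sketch of that filter, again as an illustration and not something Klepify ships today. It assumes the model emits the Skills line as a comma-separated string and that matching is a case-insensitive exact comparison. The `filterSkills` name and its return shape are hypothetical.

```typescript
// Normalize a skill name for comparison: trim and lowercase.
function normalizeSkill(skill: string): string {
  return skill.trim().toLowerCase();
}

// Split the model's comma-separated Skills line and keep only entries that
// also appear in the skills array parsed from the source résumé.
function filterSkills(
  emitted: string,
  parsedSkills: string[],
): { kept: string[]; dropped: string[] } {
  const allowed = new Set(parsedSkills.map(normalizeSkill));
  const kept: string[] = [];
  const dropped: string[] = [];
  for (const raw of emitted.split(",")) {
    const skill = raw.trim();
    if (skill === "") continue; // ignore empty entries from stray commas
    (allowed.has(normalizeSkill(skill)) ? kept : dropped).push(skill);
  }
  return { kept, dropped };
}
```

With the example above, only TypeScript and Node.js survive; React, TanStack Query, gRPC, and Cypress are dropped. Exact matching is conservative: it would also drop a legitimate alias such as "Postgres" rewritten as "PostgreSQL", so a real version would likely need an alias table.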

The defense the industry uses: again, prompt + user-review.

Scope inflation

The lie: The model expands the scope of what you actually did — "managed a team" becomes "managed a team of 12 across 3 time zones," "shipped a feature" becomes "led a cross-functional initiative."

What it looks like:

You wrote

Designed the empty-state screens for the analytics dashboard.

LLM rewrote it as

Pioneered next-generation UX for the flagship analytics product, resulting in a 24% engagement uplift and $2.3M in retained ARR through user-centered design at scale.

Root cause: This is a subspecies of "invented metrics" but worse — it inflates qualitative scope (pioneered, flagship, at scale) on top of quantitative invention (24%, $2.3M). Models do this when they sense the source bullet is "weak" and the JD wants strength.

The defense that works: none that is fully automated. Catching scope inflation requires the LLM to be calibrated about what it does and doesn't know, something current models aren't great at under achievement pressure. Klepify's mitigation is the buzzword scrub ("pioneered" → null) plus user-facing diff highlighting. The scrub removes the inflated verbs; the user catches the rest.

Buzzword bloom

The lie: Not a factual lie — but a stylistic one. The model returns text dripping with "leveraged," "spearheaded," "results-driven," "passionate about," "cutting-edge," "best-in-class." It doesn't sound like the candidate; it sounds like a 2014 LinkedIn post.

This category isn't a hallucination in the technical sense. But it's a signal hallucination — recruiters reading 200 résumés a week have learned that buzzword-density correlates with AI-generated content. So even if the underlying facts are correct, the style flags the résumé as AI-written, which is increasingly a negative signal in itself.

The defense that works: a deterministic regex post-pass that swaps each buzzword for a plain-English equivalent. Klepify ships this — about 15 phrases — modeled on the open-source Resume-Matcher project's AI_PHRASE_REPLACEMENTS list. It's belt-and-suspenders to the system prompt's "no buzzwords" rule, because LLMs under JD-pressure reach for "spearheaded" anyway.

Here's a representative slice of the substitution table:

REGEX POST-PASS

"leveraged" → "used"
"leveraging" → "using"
"spearheaded" → "led"
"orchestrated" → "led"
"results-driven" → ""
"passionate about" → "focused on"
"cutting-edge" → "modern"
"best-in-class" → "leading"
"world-class" → "strong"
"moving the needle" → "shifting outcomes"
"synergies" → "alignment"
"proven track record of" → ""
"game-changing" → "meaningful"

It's not subtle. It's not supposed to be. The point is that after generation, regardless of what the LLM produced, certain phrases simply do not appear in the output. The model gets no creativity vote on its own buzzword density.
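A post-pass of this kind could be sketched as follows. This is an illustrative version, not the contents of Klepify's `_shared/refinement.ts`: the `scrubBuzzwords` name, the capital-preserving behavior, and the whitespace cleanup are assumptions, and the table holds only a few of the pairs listed above.

```typescript
// A few of the substitution pairs from the table above.
const REPLACEMENTS: ReadonlyArray<[string, string]> = [
  ["leveraged", "used"],
  ["leveraging", "using"],
  ["spearheaded", "led"],
  ["results-driven", ""],
  ["passionate about", "focused on"],
  ["cutting-edge", "modern"],
];

// Escape characters that have special meaning inside a regular expression.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Replace each buzzword (case-insensitive, whole phrase) with its plain form.
function scrubBuzzwords(text: string): string {
  let out = text;
  for (const [phrase, plain] of REPLACEMENTS) {
    const pattern = new RegExp(`\\b${escapeRegExp(phrase)}\\b`, "gi");
    out = out.replace(pattern, (match) => {
      // Keep a leading capital so sentence-initial words still read correctly.
      const first = match.charAt(0);
      const wasCapitalized = first !== first.toLowerCase();
      return wasCapitalized && plain !== ""
        ? plain.charAt(0).toUpperCase() + plain.slice(1)
        : plain;
    });
  }
  // Deleting a phrase can leave doubled spaces or a space before punctuation.
  return out.replace(/ {2,}/g, " ").replace(/ +([.,;:])/g, "$1").trim();
}
```

For example, "Spearheaded the team Slackbot, leveraging TypeScript." becomes "Led the team Slackbot, using TypeScript." One known gap in this sketch: when a deleted phrase opened a sentence, the next word is not re-capitalized.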

What defenses each major tool actually has

We surveyed the publicly documented architecture (and lack thereof) for the five major players plus the open-source reference. Here's the map for the five commercial tools alongside Klepify.

Category of lie	Teal	Rezi	Huntr	Jobscan	ResumAI	Klepify
01 · Structural drift	Prompt only	Prompt only	User-review	N/A — no rewrite	Prompt only	Server overwrite
02 · Invented metrics	Prompt only	Prompt only	User-review	N/A	Prompt only	Diff highlighter (server check roadmapped)
03 · Skill fabrication	Prompt only	Prompt only	User-review	N/A	Prompt only	Diff highlighter (server check roadmapped)
04 · Scope inflation	Prompt only	Prompt only	User-review	N/A	Prompt only	Partial — buzzword scrub catches verb tier
05 · Buzzword bloom	Prompt only	Prompt only	Prompt only	N/A	Prompt only	Regex post-pass

The pattern is clear. Most of the industry's defenses are prompt-only. Huntr's side-by-side rewrite UX is the strongest user-review-as-defense in the category — but user-review only helps if the user is reading carefully, and it doesn't help at all for structural drift (which doesn't look wrong on its face).

Klepify's defenses are stronger on categories 01 and 05 (where deterministic checks are tractable) and weaker on 02–04 (where they require harder content verification). We think the honest answer is that 02–04 require a research-grade, source-grounded natural language inference (NLI) model we don't have today. The diff-highlighter is the user-facing compensation for that gap.

Where Klepify's own defenses are weaker than this article might suggest:

The buzzword scrub catches about 15 of the 50 most common AI-flavor phrases. It does not catch domain-specific puffery ("at scale" in software, "transformative" in marketing, "high-impact" in finance) without a per-domain extension.

The structural overwrite covers name, contact, role title, role company, and role date range. It does not currently verify bullet content or the Skills/Stack list — those rely on the system prompt + diff-highlighter + user review. We've roadmapped a "claim verification" pass that would compare each numeric/skill claim in the rewrite against the source résumé and flag mismatches; it's not shipped yet.

The match score is computed from semantic similarity, so if our underlying matcher is biased against certain phrasing styles, the score inherits that bias. We surface the before/after delta partly because it controls for this: a bias that shifts both scores by a similar amount largely cancels in the difference, which makes the delta more reliable than the absolute score.

The picture if you zoom out

The industry's hallucination defenses live on a spectrum:

Prompt-only: Teal, Rezi, Kickresume, Enhancv. The LLM is asked nicely not to lie. It mostly complies. Sometimes it doesn't.
User-review-driven: Huntr's side-by-side UX puts the original and the rewrite next to each other so the user can spot drift. Better than prompt-only, but only as good as the user's vigilance.
Deterministic structural lock: Klepify locks the structural fields (name, contact details, title, company, dates) at the architecture level. Other lie categories remain the user's job to catch, but the most damaging category is foreclosed entirely.
No generation: Jobscan doesn't really generate text. It scores and advises. Almost no hallucination risk, but the user does all the rewriting themselves.

Which is right for you depends on how much you trust the LLM and how much patience you have for catching its mistakes. The honest position is that none of these tools — including Klepify — is a "submit and forget" solution today. The differences are about which classes of failure each one forecloses.

The class we think matters most is structural drift, because it's invisible to the candidate. If your title gets quietly changed and you don't notice, you've sent a lie. Klepify exists partly because that specific bug felt unacceptable.

Try the structural lock on your own résumé.

14-day free Pro trial — no card. Unlimited tailoring plus the full diff highlighter, so you see exactly what got rewritten and what got locked.


Sources. Code-level analysis of Klepify's own /tailor Edge Function plus _shared/refinement.ts. Industry pipeline behavior inferred from Teal, Rezi, Huntr, Jobscan, Kickresume, and the open-source Resume-Matcher. Complaint patterns drawn from Qwyse 2026 review, Enhancv's Rezi review, ResumeHog's Teal review. Composed June 2026.