Why Keyword Matching Fails
Keyword matching — the foundation of most ATS systems built before 2020 — has three fatal flaws:
False positives and false negatives at scale
A resume that says "worked with Redis for caching" and one that says "designed Redis cluster topology for 10M requests/day" both contain the keyword "Redis". The first candidate has surface familiarity; the second has deep expertise. Keyword matching gives them the same score. You've lost your signal entirely.
The inverse problem is just as bad. A candidate who writes "built a distributed key-value store" might be describing Redis-equivalent work without ever using the word. They fail the filter. You've now rejected someone who can do the job because they described their work in plain English instead of buzzword-optimized prose.
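To make the failure concrete, here is a minimal sketch of the kind of filter a keyword-based ATS applies. The keyword list and scoring function are illustrative, not any particular vendor's implementation:

```javascript
// Minimal sketch of a keyword-based ATS filter. Keyword list is illustrative.
const REQUIRED_KEYWORDS = ['redis'];

function keywordScore(resumeText) {
  const text = resumeText.toLowerCase();
  // One point per keyword present: context, scale, and ownership are invisible
  return REQUIRED_KEYWORDS.filter(kw => text.includes(kw)).length;
}

keywordScore('Worked with Redis for caching');                        // 1 — passes
keywordScore('Designed Redis cluster topology for 10M requests/day'); // 1 — identical score
keywordScore('Built a distributed key-value store');                  // 0 — rejected
```

Surface familiarity and deep expertise collapse to the same score, and the plain-English equivalent scores zero.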
It's trivially gameable
Candidates know how ATS filters work. Resume optimization guides explicitly teach people to mirror the job description's exact phrasing. This means keyword-heavy resumes aren't a proxy for relevant skills — they're a proxy for knowing how to game keyword-heavy filters. You're selecting for a skill that has nothing to do with the job.
Studies consistently find that identical candidates with "ATS-optimized" resumes get 3–5× more callbacks than those with naturally written resumes. Your keyword filter isn't screening for job fit — it's screening for people who read resume-writing guides.
It's a bias amplification machine
Keyword matching doesn't just miss good candidates — it systematically misses certain types of good candidates:
- Career changers who have the underlying skills but used different terminology in their previous industry
- International candidates who describe equivalent work using different vocabulary
- Self-taught engineers who didn't attend university programs that use standardized CS terminology
- Candidates from smaller companies who didn't use enterprise tool branding (AWS vs "cloud infrastructure")
The net effect: keyword matching systematically favors candidates who came from environments that taught them to describe work in a specific way. That's correlated with pedigree, not performance.
| Approach | Signal Captured | Bias Risk | Gaming Resistance |
|---|---|---|---|
| Keyword Matching | Terminology familiarity | High — favors credential-heavy resumes | None — trivially gameable |
| Contextual AI Scoring | Technical depth + evidence | Moderate — requires monitoring | High — evaluates substance, not phrasing |
Contextual Scoring: Evaluating Actual Job Fit
Contextual scoring uses a large language model to read the resume the way a senior engineer would: understanding what the candidate did, at what scale, with what ownership, and comparing that to what the role actually requires. The model isn't counting keywords — it's extracting signal from the full text.
What "contextual" actually means
Here's the difference in practice. Take this resume fragment:
Senior Engineer — Payments Platform (2021–2024)
Led architecture of idempotent payment processing pipeline handling $40M/day.
Reduced p99 latency from 800ms to 95ms by replacing synchronous webhook delivery
with a durable async queue backed by PostgreSQL advisory locks.
Owned on-call rotation for 18 months; drove MTTR from 47min to 8min.

A keyword matcher looking for "Redis", "Kafka", or "SQS" scores this low or zero — none of those words appear. A contextual model understands:
- The candidate built async queue infrastructure (the concept behind Kafka/SQS) at production scale
- They owned the system end-to-end for 18 months, not just built it
- The 800ms → 95ms improvement demonstrates real performance engineering ability
- Payment processing domain knowledge is explicit and demonstrated, not claimed
Scoring on dimensions, not keywords
The Stackwright scoring model evaluates five dimensions, each 0–100, with specific evidence anchors rather than vague rubrics:

- Technical depth: how deeply the candidate understands the systems they built, beyond tool familiarity
- Relevant experience: overlap between their demonstrated work and what the role requires
- Problem complexity: the scale and difficulty of the problems they've solved
- Leadership signal: evidence of ownership, mentorship, and technical decision-making
- Growth trajectory: increasing scope and responsibility over time

Applied to the payments example above, the keyword system rejects a strong candidate; the contextual system surfaces them correctly. This isn't a contrived example — it's representative of what happens in practice when experienced engineers write about their work rather than optimizing for ATS parsers.
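To make the relationship between dimension scores and the overall score concrete, here is a minimal sketch of a weighted roll-up. The weights and example scores are illustrative assumptions, not Stackwright's actual aggregation:

```javascript
// Illustrative roll-up of dimension scores into a single fit_score.
// These weights are assumptions for the sketch, not Stackwright's actual model.
const DIMENSION_WEIGHTS = {
  technical_depth: 0.3,
  relevant_experience: 0.25,
  problem_complexity: 0.2,
  leadership_signal: 0.15,
  growth_trajectory: 0.1,
};

function aggregateFitScore(dimensionScores) {
  return Math.round(
    Object.entries(DIMENSION_WEIGHTS)
      .reduce((sum, [dim, w]) => sum + w * (dimensionScores[dim] ?? 0), 0)
  );
}

// Example with hypothetical dimension scores
aggregateFitScore({
  technical_depth: 92,
  relevant_experience: 95,
  problem_complexity: 88,
  leadership_signal: 80,
  growth_trajectory: 85,
}); // → 89
```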
Calibration: Consistent Scores Across Resume Formats
The biggest reliability problem with LLM-based scoring isn't accuracy — it's consistency. Different resume formats, different writing styles, and different levels of detail can cause the same underlying candidate to score 72 or 84 depending on how they formatted their bullet points. You need calibration.
The calibration problem in practice
Three factors cause score drift across resume formats:
- Information density — A two-page resume with detailed bullet points gives the model more signal than a one-page minimalist resume, even when both describe equivalent work.
- Quantification patterns — "Improved performance" and "reduced latency by 89%" describe the same work. The quantified version appears stronger even if the underlying achievement is identical.
- Format noise — PDF-extracted text from a heavily formatted resume may contain garbled sections. The model scores what it sees, not what the candidate meant to communicate.
Using anchor candidates for calibration
The most reliable calibration technique is defining score anchors in your scoring prompt: synthetic resume fragments that define what a 40, 60, 80, and 95 score look like for a given role. The model uses these as a reference scale rather than floating its scores arbitrarily.
function buildCalibratedPrompt(resume, jobDescription, requiredSkills) {
const anchors = `
## Calibration Reference
Use these anchors to calibrate your scores. Do not deviate significantly
without strong evidence in the resume.
Score 95 (Exceptional): "Designed and owned distributed payment processing
system handling $40M/day, p99 latency 95ms, 18-month on-call ownership,
drove MTTR 47min→8min, co-author of 2 internal RFCs adopted company-wide."
Score 80 (Strong): "4 years production Node.js at fintech startup, led
migration of core API to TypeScript, built webhook delivery system for
500 merchant integrations, mentored 2 junior engineers."
Score 60 (Competent): "2 years full-stack development, comfortable with
Node.js and React, shipped 3 features end-to-end, some experience with
PostgreSQL, no production incidents owned."
Score 40 (Entry-level signal): "Bootcamp graduate, personal projects in
Node.js, 1 internship at agency, strong learning trajectory but limited
production experience."
`;
return `You are a senior engineering hiring manager evaluating a candidate.
Score the following resume against the job description.
${anchors}
## Job Description
${jobDescription}
## Required Skills
${requiredSkills.join(', ')}
## Resume
${resume}
Return ONLY valid JSON. No prose before or after.
{
"fit_score": <0-100, calibrated against reference above>,
"strengths": [<2-4 evidence-backed strengths>],
"gaps": [<1-3 specific gaps, or empty array>],
"skill_matches": {: <"strong"|"partial"|"missing">},
"dimension_scores": {
"technical_depth": <0-100>,
"relevant_experience": <0-100>,
"problem_complexity": <0-100>,
"leadership_signal": <0-100>,
"growth_trajectory": <0-100>
},
"hire_signal": <"strong_yes"|"yes"|"maybe"|"no">,
"calibration_note":
}` ;
}Generic anchors produce generic calibration. Write your anchors against the specific role: what does a "strong" backend engineer look like for your payment platform role, specifically? The more role-specific your anchors, the tighter your score distribution will be.
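One way to put that into practice is keying anchor sets by role family, so each role gets its own calibration scale. A sketch with hypothetical role names — the data-platform anchors are invented for illustration, and the payments anchors abbreviate the ones above:

```javascript
// Hypothetical role-family → anchor-set mapping. Each anchor set should be
// written against the actual bar for that role, not a generic rubric.
const ANCHORS_BY_ROLE = {
  'backend-payments': `
Score 95 (Exceptional): "Designed and owned distributed payment processing
system handling $40M/day, p99 latency 95ms, 18-month on-call ownership..."
Score 80 (Strong): "4 years production Node.js at fintech startup, led
migration of core API to TypeScript..."`,
  'data-platform': `
Score 95 (Exceptional): "Owned petabyte-scale ingestion pipeline, cut
processing cost 60%, designed schema evolution strategy adopted org-wide..."`,
};

function getAnchorsForRole(roleFamily) {
  const anchors = ANCHORS_BY_ROLE[roleFamily];
  if (!anchors) throw new Error(`No calibration anchors defined for: ${roleFamily}`);
  return anchors;
}
```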
Format normalization before scoring
Before sending a resume to the model, normalize the text to remove format noise. PDF extraction commonly introduces extra whitespace, garbled unicode, and broken line breaks that confuse the model.
function normalizeResumeText(rawText) {
return rawText
// Collapse multiple blank lines → single blank line
.replace(/\n{3,}/g, '\n\n')
// Remove soft hyphens and zero-width spaces (PDF artifacts)
.replace(/[\u00AD\u200B\u200C\u200D\uFEFF]/g, '')
// Normalize smart quotes and dashes
.replace(/[\u2018\u2019]/g, "'")
.replace(/[\u201C\u201D]/g, '"')
.replace(/[\u2013\u2014]/g, '-')
// Strip headers/footers often repeated on every page
.replace(/Page \d+ of \d+/gi, '')
// Collapse excessive whitespace within lines
.replace(/[ \t]{2,}/g, ' ')
.trim();
}
// Apply before scoring
const cleanResume = normalizeResumeText(req.body.resume);
const prompt = buildCalibratedPrompt(cleanResume, jobDescription, skills);

Testing for score consistency
The same resume, lightly reformatted, should score within ±5 points. Build a consistency test suite using known-good candidates:
// scripts/calibration-test.js
const testCases = [
{
id: 'senior-payments-engineer',
resume_variants: [
RESUME_VERBOSE_FORMAT, // 2-page, bullet-heavy
RESUME_MINIMAL_FORMAT, // 1-page, concise
RESUME_PDF_EXTRACTED, // From PDF, some garbling
],
expected_score_range: [82, 92], // Acceptable window
expected_hire_signal: ['strong_yes', 'yes']
}
];
async function runCalibrationTests() {
const results = [];
for (const tc of testCases) {
const scores = await Promise.all(
tc.resume_variants.map(v => scoreResume(v, JOB_DESCRIPTION, SKILLS))
);
const fitScores = scores.map(s => s.fit_score);
const min = Math.min(...fitScores);
const max = Math.max(...fitScores);
const spread = max - min;
const inRange = fitScores.every(s =>
s >= tc.expected_score_range[0] && s <= tc.expected_score_range[1]
);
results.push({
id: tc.id, fitScores, spread,
pass: spread <= 10 && inRange // Max 10pt spread, in expected range
});
}
console.table(results);
process.exit(results.every(r => r.pass) ? 0 : 1);
}

Run this test suite whenever you modify the scoring prompt. A spread greater than 10 points means your prompt isn't giving the model enough anchors to calibrate against format differences.
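The test suite above assumes a scoreResume helper. Here is a minimal sketch of one, assuming an OpenAI-compatible chat completions endpoint; the endpoint, model name, and environment variable are assumptions, not part of the Stackwright API:

```javascript
// Minimal scoreResume sketch. Assumes an OpenAI-compatible endpoint;
// model name and OPENAI_API_KEY env var are illustrative assumptions.
async function scoreResume(resume, jobDescription, requiredSkills) {
  const prompt = buildCalibratedPrompt(
    normalizeResumeText(resume), jobDescription, requiredSkills
  );
  const res = await fetch('https://api.openai.com/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({
      model: 'gpt-4o',
      temperature: 0,                            // minimize run-to-run drift
      response_format: { type: 'json_object' },  // force valid JSON output
      messages: [{ role: 'user', content: prompt }],
    }),
  });
  if (!res.ok) throw new Error(`Scoring call failed: ${res.status}`);
  const data = await res.json();
  return JSON.parse(data.choices[0].message.content);
}
```

Temperature 0 and a forced JSON response format matter here: the calibration tests measure spread, and sampling noise would show up as false failures.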
Bias Detection: What to Watch for in AI Hiring
Contextual AI scoring eliminates keyword matching bias, but introduces different risks. The model was trained on human-generated text that reflects human biases. Those biases can surface in scoring behavior in ways that are subtle and difficult to detect without deliberate monitoring.
The three bias vectors to monitor

- Credential anchoring: prestige employer or university names inflate scores independently of the work described
- Verbosity bias: longer, more detailed resumes score higher than terse ones describing equivalent work
- Demographic proxies: details such as names, locations, or graduation years that correlate with protected characteristics can shift scores
Implementing a bias audit routine
The most practical approach to bias monitoring is periodic differential testing: score the same resume with and without specific signals, then measure the score delta. If removing an employer name from a resume changes the score significantly, credential anchoring is active.
// scripts/bias-audit.js
// Run monthly against a sample of real candidate resumes
async function auditCredentialBias(resume, jobDescription, skills) {
// Redact employer names (replace with generic placeholders)
const redactedResume = resume
.replace(/(?:Google|Meta|Amazon|Apple|Netflix|Stripe|Plaid|Airbnb)/gi,
'[Tech Company]')
.replace(/(?:MIT|Stanford|Carnegie Mellon|Caltech|Harvard)/gi,
'[University]');
const [originalScore, redactedScore] = await Promise.all([
scoreResume(resume, jobDescription, skills),
scoreResume(redactedResume, jobDescription, skills)
]);
const credentialDelta = originalScore.fit_score - redactedScore.fit_score;
return {
original_score: originalScore.fit_score,
redacted_score: redactedScore.fit_score,
credential_delta: credentialDelta,
// Flag if employer/school names move score more than 8 points
credential_bias_flag: Math.abs(credentialDelta) > 8
};
}
async function auditVerbosityBias(resume, jobDescription, skills) {
// Create a truncated version that keeps roughly one bullet in three
const lines = resume.split('\n');
let bulletCount = 0;
const truncatedLines = lines.filter(line => {
const isBullet = line.trim().match(/^[-•·*]/);
if (isBullet) { bulletCount++; return bulletCount % 3 === 1; } // Keep 1 in 3
return true;
});
const [fullScore, truncScore] = await Promise.all([
scoreResume(resume, jobDescription, skills),
scoreResume(truncatedLines.join('\n'), jobDescription, skills)
]);
return {
full_score: fullScore.fit_score,
truncated_score: truncScore.fit_score,
verbosity_delta: fullScore.fit_score - truncScore.fit_score,
// Flag if reducing detail moves score more than 12 points
verbosity_bias_flag: (fullScore.fit_score - truncScore.fit_score) > 12
};
}

Run these checks monthly against a sample of real candidate resumes (with PII removed). Prompt changes, model updates, and shifts in your candidate pool can all reintroduce bias patterns that were previously clean. Treat this like any other monitoring system.
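To wire both audits into that monthly cadence, a small runner like the sketch below works; loadSampleResumes is a hypothetical helper for fetching the PII-scrubbed sample:

```javascript
// scripts/monthly-bias-audit.js
// Runs both differential audits over a PII-scrubbed sample and flags drift.
// loadSampleResumes is a hypothetical helper; swap in your own data source.
async function runMonthlyBiasAudit(jobDescription, skills) {
  const sample = await loadSampleResumes();
  const report = [];
  for (const resume of sample) {
    const credential = await auditCredentialBias(resume, jobDescription, skills);
    const verbosity = await auditVerbosityBias(resume, jobDescription, skills);
    report.push({
      credential_delta: credential.credential_delta,
      credential_flag: credential.credential_bias_flag,
      verbosity_delta: verbosity.verbosity_delta,
      verbosity_flag: verbosity.verbosity_bias_flag,
    });
  }
  console.table(report);
  const flagged = report.filter(r => r.credential_flag || r.verbosity_flag);
  if (flagged.length) {
    console.error(`${flagged.length}/${report.length} samples flagged; audit your prompt and model version`);
  }
}
```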
Code Example: Consistent Scoring Across Resume Formats
Here are two complete POST /api/v1/score-resume requests showing how the same candidate scores consistently when their resume is submitted in different formats: a verbose, bullet-heavy version and a minimalist plain-text version.
curl -X POST https://stackwright.polsia.app/api/v1/score-resume \
-H "Content-Type: application/json" \
-H "X-API-Key: sk-sw-demo-stackwright2025" \
-d '{
"resume": "Alex Rivera — Senior Backend Engineer\n\nPAYMENTS PLATFORM LEAD, FinEdge Inc (2020–2024)\n• Architected idempotent payment processing pipeline, $35M/day transaction volume\n• Reduced p99 API latency from 820ms to 98ms via async queue refactor\n• Led team of 4 engineers; conducted weekly code reviews and architecture reviews\n• Maintained 99.98% uptime across 18-month on-call rotation\n• Introduced observability stack (OpenTelemetry + Grafana), cut MTTR from 52min to 9min\n\nSENIOR ENGINEER, CloudBase (2017–2020)\n• Built multi-tenant data pipeline processing 200M events/day in Node.js\n• Designed PostgreSQL sharding strategy supporting 10× growth without migration\n• Open source: pg-batch-insert library, 1.4k GitHub stars",\n "job_description": "Senior backend engineer for payments API. Must have: production payment processing experience, Node.js, PostgreSQL, async systems design, on-call ownership.",\n "required_skills": ["Node.js", "PostgreSQL", "payment processing", "async systems", "on-call"]\n }'curl -X POST https://stackwright.polsia.app/api/v1/score-resume \
-H "Content-Type: application/json" \
-H "X-API-Key: sk-sw-demo-stackwright2025" \
-d '{
"resume": "Alex Rivera. Backend engineer, 7 years. Led payments infrastructure at FinEdge (2020-2024): owned async transaction pipeline at $35M/day scale, p99 820ms→98ms, 18mo on-call. CloudBase (2017-2020): Node.js data pipeline, 200M events/day, PostgreSQL sharding. OSS: pg-batch-insert.",\n "job_description": "Senior backend engineer for payments API. Must have: production payment processing experience, Node.js, PostgreSQL, async systems design, on-call ownership.",\n "required_skills": ["Node.js", "PostgreSQL", "payment processing", "async systems", "on-call"]\n }'Both requests return scores within the expected calibration window. Here's what the responses look like side-by-side:
A 5-point spread between the two formats (89 vs 84) is within acceptable calibration tolerance. Both correctly identify this as a strong_yes hire signal. The leadership score drops in the minimalist version because team management evidence is implicit — if this were a leadership-critical role, that gap would be worth flagging to the recruiter.
Stackwright's scoring API includes calibration anchors tuned for engineering roles. The demo key sk-sw-demo-stackwright2025 works on the live endpoint — try scoring the same candidate in different formats and observe the consistency for yourself.
Compliance: What HR Leaders Need to Know
AI screening tools are under increasing regulatory scrutiny. New York City's Local Law 144 requires employers using AEDT (Automated Employment Decision Tools) to conduct annual bias audits and notify candidates. The EU AI Act classifies hiring AI as "high risk", requiring documentation and human oversight. Illinois, Maryland, and other states have similar requirements in various stages of enactment.
The practical requirements for defensible AI resume scoring:
- Document your evaluation criteria — What dimensions are you scoring, and why? Be specific. "Technical depth" means X, as evidenced by Y. This documentation is your defense if a hiring decision is challenged.
- Human review for borderline candidates — AI scores below 60 or in the "maybe" range should always have a human review step before rejection. Use the score to prioritize, not to auto-reject.
- Candidate disclosure — In jurisdictions with AEDT laws, candidates must be told AI was used in screening. Build this into your application flow.
- Adverse impact monitoring — Track pass-through rates by demographic group if you have that data. An algorithm that passes 40% of applicants overall but 20% of a protected group is a legal and ethical problem regardless of intent.
- Audit trail — Log every scoring call with inputs, outputs, and timestamps. You need this for both audit compliance and debugging when a score seems wrong; a minimal logging sketch follows this list.
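Here is that sketch: a wrapper around the scoring call that appends one JSON line per invocation. The file path and wrapper shape are illustrative; a production system would typically write to a database instead:

```javascript
// Minimal audit-trail wrapper: one JSON line per scoring call.
// File-based logging is illustrative; production would use a database.
const fs = require('node:fs');

async function scoreResumeWithAudit(resume, jobDescription, skills) {
  const timestamp = new Date().toISOString();
  const result = await scoreResume(resume, jobDescription, skills);
  fs.appendFileSync('scoring-audit.jsonl', JSON.stringify({
    timestamp,
    input: { resume, jobDescription, skills },
    output: result,
  }) + '\n');
  return result;
}
```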
This is worth saying explicitly in your internal documentation. A fit_score of 72 doesn't mean "reject". It means "this candidate shows these strengths and these gaps relative to this role". The hiring decision belongs to a human. The AI's job is to surface information consistently, not to make the call.
Try It: Score a Real Resume
Stackwright is a production implementation of contextual resume scoring — calibrated anchors, bias-detection tooling, and a full audit log included. The API is live, the docs have a browser-based test console, and the demo key sk-sw-demo-stackwright2025 gives you 10 calls to run your own calibration experiments right now.
Test contextual scoring vs keyword matching
Score the same resume in two formats. Check the spread. Then try redacting employer names and see if the score holds — that's your credential bias signal.
Summary
Three things to implement if you're evaluating AI for resume screening:
- Replace keyword matching with contextual scoring — evaluate what candidates can do, not what words they used to describe it. This alone recovers the strong candidates that keyword filters systematically reject.
- Add calibration anchors to your scoring prompt — define what 40/60/80/95 looks like for each role family. Without anchors, your score distribution is floating and inconsistent across resume formats.
- Run bias audits on a schedule — credential anchoring and verbosity bias are active in LLM scoring by default. Test for them monthly. The differential testing approach in the bias detection section above takes about 20 minutes to set up and surfaces the most common failure modes.
Fair resume scoring isn't about being lenient — it's about being accurate. A system that scores based on evidence of what someone can do will outperform a keyword filter on both precision and recall, and it'll hold up better when the hiring process is scrutinized.
Ready to integrate?
Start scoring resumes in minutes. Free tier ships immediately — no credit card. Pro starts at $49/mo for production scale.