- Article source credibility has already been verified upstream
- Input name and DOB are assumed correct; DOB expected in YYYY-MM-DD (with basic format validation)
- Name commonality is assessed globally (cultural/geographic context would be considered in future development)
- Only name and DOB are used as inputs — many other useful signals (occupation, location, employer, family members) are intentionally excluded in this task
Manually created test data used to inform focus areas and scoring can be found in data/ams_test_dataset_filled.csv
URL → [1. Fetch, Extract & Parse] → [2. LLM Extraction] → [3. Deterministic Checks] → [4. LLM Assessment] → [5. Scoring]
- LLM call 1 is extraction only
- LLM call 2 is judgement only

This separates the concerns of the LLM rather than asking it to do two things at once. Each prompt (and therefore the LLM) stays focused, which makes the output more reliable. A model asked to simultaneously extract and assess tends to let its assessment colour its extraction.
- Fetch article URL, extract clean text
- Extract publish date from metadata or webpage (null if not found)
- Fallback if unavailable: record failure reason, attempt metadata-only extraction, try cached version if possible
- If nothing extractable → verdict must be refer (never discard on fetch failure)
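The fallback ordering above can be sketched as follows. This is a minimal illustration, not the actual implementation; the fetch strategies are passed in as named callables (the names `full_text`, `metadata_only`, `cached` are assumptions matching the bullets above).

```python
def fetch_article(url, fetchers):
    """Try each fetch strategy in order; record why earlier ones failed.

    `fetchers` is an ordered list of (name, fn) pairs, e.g. full_text,
    metadata_only, cached -- the strategy names are illustrative only.
    """
    failures = []
    for name, fn in fetchers:
        try:
            text = fn(url)
            if text:
                return {"content_source": name, "text": text, "failures": failures}
        except Exception as exc:
            failures.append(f"{name}: {exc}")
    # Nothing extractable: never discard on fetch failure, always refer
    return {"content_source": None, "text": None, "failures": failures,
            "forced_verdict": "refer"}
```

The key property is the last return: a fetch failure can only ever force `refer`, never `discard`.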
Single LLM call to extract structured facts. No match assessment yet — extraction only. With more time, this could be combined with NER (Named Entity Recognition) systems.
- Article language / script used
- Country of origin (where was the article published) if available
- Extracted entities as `name` + `entity_type`, where `entity_type` is one of `person | company | animal | other`
- Each entity also includes `is_deceased: true | false | null` (entity-level only; no input-name matching in extraction)
- All name forms used in the article for the screened candidate, cultural origin, name commonality
- Gender signals - counts of he / she / they / them pronouns used for the entity
- DOB if stated, age if stated, life events with dates and an `event_type` enum
- For each life event date, extract structured components where possible: `event_date_year`, `event_date_month`, `event_date_day` (integers or null)
- Also extract the type of event (mainly for deterministic checks):
  - `birth` (person's own birth event, tolerated implied age range `-1` to `1`; do not confuse with `children_mentioned`)
  - `first_job_internship` (roughly 16-26)
  - `employment` (at least 14)
  - `prison_sentence` (deterministic: <10 impossible, 10-15 highly implausible, 16+ generally plausible)
  - `university_graduation` (roughly 21-24)
  - `founded_company` (at least 18)
  - `military_service` (18+)
  - `marriage` (at least 16-18 depending on context)
  - `children_mentioned` (at least 16-18)
  - `retirement` (roughly 60+, with early retirement possible)
  - `other` (only if none apply)
- Twitter / X handle
- LinkedIn profile
- Facebook profile
- Instagram handle
- Note: handles are unreliable as primary signals (accounts get reused, people have multiples) but potentially useful for further research (see PART 2)
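Pulling the bullets above together, the extraction call targets a structure along these lines. Field names here are illustrative assumptions matching the bullets, not a fixed spec:

```python
# Illustrative target schema for LLM call 1 (extraction only).
# Every field corresponds to a bullet above; names are assumptions.
EXAMPLE_EXTRACTION = {
    "language": "en",
    "country_of_origin": "GB",
    "entities": [
        {
            "name": "Jane Doe",                 # hypothetical example entity
            "entity_type": "person",            # person | company | animal | other
            "is_deceased": None,                # true | false | null
            "name_forms": ["Jane Doe", "J. Doe"],
            "name_commonality": "uncommon",
            "pronoun_counts": {"he": 0, "she": 4, "they": 1},
            "dob": None,                        # only if explicitly stated
            "stated_age": 34,
            "life_events": [
                {
                    "event_type": "university_graduation",
                    "event_date_year": 2012,
                    "event_date_month": 6,
                    "event_date_day": None,     # integers or null per component
                },
            ],
            "social_handles": {"twitter": "@janedoe", "linkedin": None},
        },
    ],
}
```

Note there is no match field anywhere in this structure; matching is deferred to the deterministic checks and the second LLM call.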
Fast, deterministic rule-based checks run before the second LLM call.
Name
- Normalise both strings (lowercase, trim, remove punctuation)
- Check in the extracted names AND in the raw article text to confirm existence
- Output: `exact`, `derived_match`, or `no_exact_match`
- (In future, would consider looking at explicit romanisation / transliterations of names so we can test these deterministically)
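A minimal sketch of the normalisation and deterministic name check described above. The `derived_match` rule shown (same tokens, different order, e.g. surname-first) is one assumed interpretation of "derived", not the full set of rules:

```python
import re
import unicodedata

def normalise(name: str) -> str:
    """Lowercase, trim, strip punctuation and accents for matching."""
    name = unicodedata.normalize("NFKD", name)
    name = "".join(c for c in name if not unicodedata.combining(c))
    name = re.sub(r"[^\w\s]", " ", name.lower())
    return " ".join(name.split())

def deterministic_name_check(input_name, extracted_names, article_text):
    """Return exact / derived_match / no_exact_match (labels from above).

    Checks the extracted names AND the raw article text to confirm
    the exact form actually exists in the article.
    """
    target = normalise(input_name)
    candidates = {normalise(n) for n in extracted_names}
    if target in candidates and target in normalise(article_text):
        return "exact"
    # Assumed "derived" rule: same tokens in a different order
    if any(sorted(target.split()) == sorted(c.split()) for c in candidates):
        return "derived_match"
    return "no_exact_match"
```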
DOB
- Exact DOB match (parsed to ISO)
- Component match scoring against input DOB:
  - `+1` when the year matches
  - `+2` when year and month match
  - `+3` when year, month, and day all match
  - (year-only and year-and-month matches are captured as positive for cases where that's the only DOB data available)
  - If a provided component conflicts (e.g. month or day mismatch), treat as inconsistent
- Age arithmetic: stated age + publish date → implied birth year, check vs. input DOB (±1 year)
- Flag: day ≤ 12 → possible day/month transposition
- Flag: 01/01/XXXX → possibly approximate
- If an ambiguous date format is used, e.g. 02/03/2000 in a US article, treat either reading as potentially valid
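The component scoring and the two flags above can be sketched as a single function. This is an illustrative sketch of the rules as stated (it omits the age-arithmetic and ambiguous-format branches for brevity):

```python
from datetime import date

def dob_component_score(input_dob: date, year, month, day):
    """Score article DOB components against the input DOB.

    Returns (score, flags): +1 year, +2 year+month, +3 full match;
    conflicting stated components are inconsistent; 01/01 and possible
    day/month transpositions are flagged and scored neutrally.
    """
    if year is None:
        return 0, ["not_found"]
    if year != input_dob.year:
        return -4, ["inconsistent"]
    if month is None:
        return 1, []                      # year-only match
    if month != input_dob.month:
        if (month, day) == (1, 1):
            return 0, ["approximate_flagged"]   # 01/01: possibly approximate
        # day <= 12: possible day/month transposition
        if day is not None and day <= 12 and (month, day) == (input_dob.day, input_dob.month):
            return 0, ["transposition_flagged"]
        return -4, ["inconsistent"]
    if day is None:
        return 2, []                      # year + month match
    if day != input_dob.day:
        return -4, ["inconsistent"]
    return 3, []                          # full match
```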
Second LLM call. Receives extraction output + deterministic results. Provides independent assessments.
Name match
exact | likely_variant | possible | unlikely | no_match + reason
Considers: nicknames, initials, middle names, cultural ordering, transliteration, particles, name uniqueness
Deceased check (input-name specific)
input_name_is_deceased: true | false | null + reason
Computed in assessment call using input_name against extracted person entities and article context.
Gender match
consistent | inconsistent | ambiguous | not_found + reason
Considers: whether the gender implied by the input name and the gender used in the article match (a person could transition with both a name and gender change, but this is uncommon, so it is ignored for this analysis)
Only negative scoring for this section (a positive gender match means very little)
DOB match Owned by deterministic checks (not by LLM in call 2). LLM receives deterministic DOB output as context but does not return a separate DOB verdict field.
Life Event match
impossible | implausible | plausible + reason
Considers: the age at which the life event took place and how plausible it would be for it to occur at that age
Sentiment
positive | negative | neutral | n/a + reason
Only assessed if the screened person is identified as a person in the article and name is not no_match
Scoped to this individual specifically, not the article overall (with more time would refine implementation)
Code combines all signals into a final verdict in two layers:
- fetch/extract failed or incomplete (`content_source != full_text` or fetch failure reason present) → refer
- exact input-name match to a non-person entity type (`company` / `animal` / `other`) → discard
- screened input name is assessed as deceased (`input_name_is_deceased = true`) → discard (assuming it's a live application, the person making it cannot be confirmed deceased, even if all other details match)
Name score (highest weight)
| Deterministic vs LLM | Score |
|---|---|
| `exact` + `exact` | +5 |
| `no_exact_match` + `likely_variant` | +2 |
| `no_exact_match` + `possible` | +0 |
| `no_exact_match` + `unlikely` | -2 |
| `no_exact_match` + `no_match` | -5 |
| mixed `exact`/`likely_variant` or `exact`/`possible` | reduced positive |
| any disagreement | -1 uncertainty penalty |
Name commonality thresholding (no commonality multiplier on score)

Name commonality does not scale points; it only changes the match threshold. (With more time to test, I would experiment with having it shift the discard threshold as well.)
- rare: match threshold `>= 3`
- uncommon: match threshold `>= 4`
- unknown: match threshold `>= 4`
- common: match threshold `>= 5`
- very_common: match threshold `>= 7`
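As a small sketch, the threshold lookup is just a table with a conservative fallback (falling back to the `unknown` bucket for unrecognised labels is an assumption, not stated above):

```python
# Match thresholds from the list above; commonality never scales points,
# it only moves the bar the name score must clear.
MATCH_THRESHOLDS = {
    "rare": 3,
    "uncommon": 4,
    "unknown": 4,
    "common": 5,
    "very_common": 7,
}

def match_threshold(commonality: str) -> int:
    # Assumption: unrecognised labels fall back to the "unknown" bucket
    return MATCH_THRESHOLDS.get(commonality, MATCH_THRESHOLDS["unknown"])
```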
DOB score (deterministic-only)
| Deterministic DOB result | Score |
|---|---|
| `exact` | +4 |
| `age_consistent` | +2 |
| `transposition_flagged` / `approximate_flagged` | +0 |
| `not_found` | +0 |
| `inconsistent` | -4 (or -2 when exactly 1 day off) |
Life event score (per event, deterministic baseline with LLM-preferred resolution)
| Plausibility | Score |
|---|---|
| `impossible` | -3 each |
| `highly_implausible` | -2 each |
| `implausible` | -1 each |
| `plausible` | +0 |
| `unknown` | +0 |
For now, life events are scored purely negatively (or neutral at 0).
With occupation/background inputs in the future, life events could also contribute positively when role progression and context align with the article.
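The deterministic baseline for life-event scoring can be sketched from the age ranges in the extraction enum. The mapping of "outside the rough range" to `-1` (implausible) and "before birth" to `-3` (impossible) is an assumed simplification; only the `prison_sentence` tiers are fully specified above:

```python
# Rough plausible age ranges per event_type, from the extraction enum.
# None means unbounded on that side.
EVENT_AGE_RANGES = {
    "first_job_internship": (16, 26),
    "employment": (14, None),
    "university_graduation": (21, 24),
    "founded_company": (18, None),
    "military_service": (18, None),
    "marriage": (16, None),
    "children_mentioned": (16, None),
    "retirement": (60, None),
}

def life_event_score(event_type: str, age_at_event: int) -> int:
    """Per-event penalty (negative-only scoring, per the table above)."""
    if event_type == "prison_sentence":       # explicitly tiered above
        if age_at_event < 10:
            return -3                         # impossible
        if age_at_event <= 15:
            return -2                         # highly implausible
        return 0                              # 16+ generally plausible
    if age_at_event < 0:
        return -3                             # event before birth: impossible
    lo, hi = EVENT_AGE_RANGES.get(event_type, (None, None))
    if (lo is not None and age_at_event < lo) or (hi is not None and age_at_event > hi):
        return -1                             # outside rough range: implausible
    return 0                                  # plausible / unknown
```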
Deterministic vs LLM resolution
- Deterministic outputs are baseline signals
- Where both deterministic and LLM provide the same signal family (especially life events), LLM judgment is preferred for scoring
- If deterministic is more negative than LLM for the same signal, add a warning flag for analyst review
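The resolution rule above reduces to a small function. This is a sketch of the stated preference order; the flag name is an illustrative assumption:

```python
def resolve_signal(det_score: int, llm_score: int):
    """Prefer the LLM judgment for scoring, but flag for analyst review
    when the deterministic check was more negative than the LLM."""
    flag = "deterministic_more_negative" if det_score < llm_score else None
    return llm_score, flag
```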
Gender score (negative-only)
| Condition | Score |
|---|---|
| `highly_inconsistent` + strong/strong evidence | -3 |
| `inconsistent` + one strong side | -1 |
| `consistent` / `unknown` / `ambiguous` / `unisex` | +0 |
- rare: match if score `>= 3`
- uncommon: match if score `>= 4`
- unknown: match if score `>= 4`
- common: match if score `>= 5`
- very_common: match if score `>= 7`
- discard if score `<= -3`
- otherwise refer
Sentiment
- Sentiment is reported for analyst context only and does not contribute to the score.
Output
- verdict: `match` | `refer` | `discard`
- confidence: `high` | `medium` | `low` (derived from verdict strength and distance from threshold)
- scoring_explanation: explicit breakdown of score components, thresholds, and gate decisions (I have included a match outcome as this could, with testing, reduce the need for the analyst to review)
(For some extraction and analysis, I've used LLMs over alternative, more deterministic methods, primarily because finding the data for, and implementing, the alternative methods would be time-consuming for an MVP. Calling this out as an FYI; some mentions below as well.)
Pipeline & extraction
- Improve article fetcher (website url scraper) robustness and completeness — input quality is the biggest single lever on output quality
- Also experiment with ways to bypass bot checks / paywalls (where legal and ethical to do so). I would also compare against feeding URLs directly to an LLM, with guardrails built to guard against hallucinations
- Better extraction, and parsing of dates (including publish date, which is sometimes NOT in the metadata, but instead in the HTML itself)
- Cross-check LLM-extracted entities against raw article text to catch hallucinated DOBs or life events
- Experiment with NER systems alongside the LLM extraction call for higher confidence entity identification
- ONS (and/or other census) data to perform a more robust assessment of the popularity / commonality of names (potentially even in the context of the time period)
Scoring & logic
- Tune scoring thresholds against a larger labelled dataset — current thresholds are principled but not empirically validated in any real depth; any tuning without held-out ground truth is guesswork
- Tiered DOB scoring: greater penalty the further the article DOB deviates from the input, rather than a flat inconsistent score
- Tighter subject-scoping: confirm the named person is not just mentioned in the article but is the actual subject of the adverse sentiment
Evaluation
- The current test set is heavily weighted toward well-known public figures (to guard against false negatives); a more realistic distribution would include obscure individuals where corroborating signals are sparse (i.e. increase the number / proportion of true negatives)