AgentLiar Detector: Catch Coding Agents That Falsely Claim Task Completion
AI coding agents are getting better at completing tasks. They are also getting better at appearing to complete tasks. An agent that claims "done" when it has created placeholder files, written empty tests, or quietly narrowed the scope of the original requirement is harder to catch than one that simply fails, because the failure is hidden inside output that looks correct at a glance.
AgentLiar is a production-ready system that detects when coding agents falsely claim task completion. It runs four independent verification checks, produces a weighted confidence score from 0 to 100, and delivers structured evidence in JSON, Markdown, or console output - usable as a CLI tool, Python library, GitHub Action, or HTTP API.
Features
4 Independent Checks - File, Test, Scope, and LLM Judge.
Confidence Scoring - weighted aggregation on a 0–100 scale.
Multiple Interfaces - CLI, Python API, GitHub Action, and HTTP API.
Adversarial Detection - catches placeholder implementations, empty tests, and scope narrowing.
Structured Reports - JSON and Markdown output with evidence.
Production Ready - type hints, error handling, logging, and async support.
Architecture
The async orchestrator dispatches four independent checks: File, Test, and Scope, which run locally, plus an optional LLM Judge that calls OpenRouter. It then produces a weighted 0–100 confidence score, delivered as JSON, Markdown, or console output and suitable for CI gating.
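The dispatch pattern described above can be sketched with asyncio.gather. This is a minimal illustration, not AgentLiar's actual orchestrator: CheckResult, the check_* stubs, run_checks, and the api_key parameter are all hypothetical names, and the stub scores are made up.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional


@dataclass
class CheckResult:
    name: str
    score: float  # 0-100, higher means the claim looks more credible


# Hypothetical check type: an async callable that takes no arguments.
Check = Callable[[], Awaitable[CheckResult]]


async def check_files() -> CheckResult:
    return CheckResult("file", 100.0)  # stub score for illustration


async def check_tests() -> CheckResult:
    return CheckResult("test", 40.0)


async def check_scope() -> CheckResult:
    return CheckResult("scope", 70.0)


async def llm_judge() -> CheckResult:
    await asyncio.sleep(0)  # stands in for the remote OpenRouter call
    return CheckResult("llm_judge", 60.0)


async def run_checks(api_key: Optional[str] = None) -> list[CheckResult]:
    """Dispatch the local checks, plus the judge only when a key is set."""
    checks: list[Check] = [check_files, check_tests, check_scope]
    if api_key:
        checks.append(llm_judge)
    # gather() runs the coroutines concurrently and preserves input order.
    return list(await asyncio.gather(*(check() for check in checks)))


if __name__ == "__main__":
    print([result.name for result in asyncio.run(run_checks())])
```

Because gather() preserves input order, each result can be matched back to its check without extra bookkeeping.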
The Four Verification Checks
1. File Check
Detects missing expected files
Identifies unexpected new files
Finds placeholder content: TODO, FIXME, pass-only
Validates file sizes and content
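One way to approximate the placeholder detection listed above is to combine a marker regex with a pass over the syntax tree. The sketch below is an assumption about how such a check could work, not the project's implementation; find_placeholders is a hypothetical helper and it only handles Python sources.

```python
import ast
import re

MARKER = re.compile(r"\b(TODO|FIXME)\b")


def find_placeholders(source: str) -> list[str]:
    """Return human-readable findings for placeholder-looking content."""
    findings: list[str] = []

    # Textual markers: TODO / FIXME anywhere on a line.
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = MARKER.search(line)
        if match:
            findings.append(f"line {lineno}: {match.group(1)} marker")

    # Structural placeholders: functions whose body is nothing but `pass`.
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if all(isinstance(stmt, ast.Pass) for stmt in node.body):
                findings.append(
                    f"line {node.lineno}: function '{node.name}' has a pass-only body"
                )
    return findings


SAMPLE = (
    "def save(user):\n"
    "    pass\n"
    "\n"
    "# TODO: validate input\n"
    "def load():\n"
    "    return 1\n"
)

for finding in find_placeholders(SAMPLE):
    print(finding)
```

Parsing with ast avoids flagging the word "pass" when it appears inside strings or comments, which a plain text search would do.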
2. Test Check
Detects empty test bodies
Identifies tests without assertions
Finds skipped tests
Validates claimed versus actual test counts
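The test-level signals above can be illustrated with a small AST walk. This is a hedged sketch rather than AgentLiar's code: audit_tests is a hypothetical helper, the "skip" match on decorators is a heuristic, and the assertion test only sees bare assert statements, so it would misreport tests that rely on pytest.raises or unittest-style self.assert* calls.

```python
import ast


def _is_empty(func: ast.AST) -> bool:
    """True when the body holds only `pass`, `...`, or a docstring."""
    return all(
        isinstance(stmt, ast.Pass)
        or (isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Constant))
        for stmt in func.body
    )


def audit_tests(source: str) -> dict[str, list[str]]:
    """Classify test functions found in a Python test module."""
    report: dict[str, list[str]] = {
        "all": [], "empty": [], "no_assertions": [], "skipped": [],
    }
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        if not node.name.startswith("test"):
            continue
        report["all"].append(node.name)
        if _is_empty(node):
            report["empty"].append(node.name)
        elif not any(isinstance(n, ast.Assert) for n in ast.walk(node)):
            report["no_assertions"].append(node.name)
        # Heuristic: any decorator mentioning "skip" counts as a skipped test.
        if any("skip" in ast.unparse(dec) for dec in node.decorator_list):
            report["skipped"].append(node.name)
    return report


SAMPLE_TESTS = '''
import pytest

def test_empty():
    pass

def test_no_assert():
    total = 1 + 1

@pytest.mark.skip(reason="later")
def test_skipped():
    assert True

def test_real():
    assert 1 + 1 == 2
'''

report = audit_tests(SAMPLE_TESTS)
claimed_tests = 6  # hypothetical number taken from the agent's claim
print(report)
print("count mismatch:", len(report["all"]) != claimed_tests)
```

Comparing len(report["all"]) with the number the agent reported is one simple way to validate claimed versus actual test counts.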
3. Scope Check
Detects silent scope narrowing: "only", "for now"
Identifies partial implementations
Finds TODO markers in code
Validates requirements coverage
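Scope narrowing often shows up as hedging language in the agent's own completion message. The sketch below scans a claim for the two phrases named above; find_scope_narrowing is a hypothetical helper, and a real check would likely need a longer phrase list and some handling for harmless uses of "only".

```python
import re

# The two phrases come from the list above; extend as needed.
NARROWING_PHRASES = ("only", "for now")


def find_scope_narrowing(claim: str) -> list[tuple[str, str]]:
    """Return (phrase, sentence) pairs where the claim hedges its scope."""
    hits: list[tuple[str, str]] = []
    # Naive sentence split on terminal punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", claim.strip()):
        for phrase in NARROWING_PHRASES:
            if re.search(rf"\b{re.escape(phrase)}\b", sentence, re.IGNORECASE):
                hits.append((phrase, sentence))
    return hits


CLAIM = "Implemented login. Only the happy path is handled for now."

for phrase, sentence in find_scope_narrowing(CLAIM):
    print(f"{phrase!r} in: {sentence}")
```

Returning the full sentence alongside the phrase gives the report something concrete to quote as evidence.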
4. LLM Judge
Independent assessment via OpenRouter
Structured JSON output
Timeout and retry logic
Optional - works without an API key
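Timeout and retry handling around a judge call might look like the following. This is a sketch under stated assumptions: call_with_retry and make_flaky_judge are hypothetical, the judge is a local stub rather than a real OpenRouter request, and the reply fields ("score", "verdict") are invented for illustration.

```python
import asyncio
import json
from typing import Awaitable, Callable


async def call_with_retry(
    call: Callable[[], Awaitable[str]],
    *,
    timeout: float = 10.0,
    attempts: int = 3,
    backoff: float = 0.5,
) -> dict:
    """Await `call`, retrying on timeouts and on replies that are not JSON."""
    last_error: Exception | None = None
    for attempt in range(attempts):
        try:
            raw = await asyncio.wait_for(call(), timeout=timeout)
            return json.loads(raw)
        except (asyncio.TimeoutError, json.JSONDecodeError) as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Exponential backoff between attempts.
                await asyncio.sleep(backoff * 2**attempt)
    raise RuntimeError(f"judge failed after {attempts} attempts") from last_error


def make_flaky_judge() -> Callable[[], Awaitable[str]]:
    """Stub judge: the first reply is malformed, the second is valid JSON."""
    replies = iter(["not json", '{"score": 72, "verdict": "partial"}'])

    async def judge() -> str:
        return next(replies)

    return judge


if __name__ == "__main__":
    print(asyncio.run(call_with_retry(make_flaky_judge(), backoff=0.0)))
```

Treating malformed JSON the same as a timeout keeps the caller's contract simple: it either gets a parsed dict or a single exception.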
Quick Start
Installation
pip install -e .
Or pip install agentliar once published. Requires Python 3.10+.
CLI Usage
Prepare sample inputs from examples/simple_task.json, then run:
agentliar verify \
--task-file .tmp/task.txt \
--claim-file .tmp/claim.json \
--changes-file .tmp/changes.json \
--format markdown
Use agentliar config to inspect configuration and agentliar analyze .tmp/task.txt to review a task file.
Python API
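import asyncio
from agentliar import Verifier

async def main() -> None:
    verifier = Verifier()
    # task, claim_payload, and changes_payload are placeholders for your own inputs
    result = await verifier.verify(
        task_description=task,
        claim=claim_payload,
        file_changes=changes_payload,
    )
    # Read result.score, result.passed, result.confidence_level, result.reports
    print(result.score, result.passed)

# verify() is a coroutine, so it needs an event loop outside async code
asyncio.run(main())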
GitHub Action
Use the GitHub Action with task, claim, and change files, a confidence threshold, and an optional OPENROUTER_API_KEY secret when you want the LLM Judge path enabled.
HTTP API
Start the API server:
python -m agentliar.server
# or
uvicorn agentliar.server:app --host 0.0.0.0 --port 8000
Then POST /verify with the task, claim, and file-change payloads. The response returns score, pass/fail, and evidence blocks.
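A client call could be sketched with the standard library alone. The JSON field names below mirror the Python API keyword arguments and are an assumption, as are the shapes of the claim and file-change payloads; check the server's actual schema before relying on them. build_verify_request and send are hypothetical helpers.

```python
import json
import urllib.request


def build_verify_request(
    base_url: str, task: str, claim: dict, file_changes: list
) -> urllib.request.Request:
    """Build (but do not send) a POST /verify request."""
    payload = {
        # Assumed field names, copied from the Python API keyword arguments.
        "task_description": task,
        "claim": claim,
        "file_changes": file_changes,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/verify",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


request = build_verify_request(
    "http://localhost:8000",
    task="Add input validation to the signup form",
    claim={"summary": "Done, all tests pass"},  # hypothetical claim shape
    file_changes=[{"path": "signup.py", "status": "modified"}],  # hypothetical
)
print(request.get_method(), request.full_url)
# With the server running: report = send(request)
```

Building the request separately from sending it makes the payload easy to inspect or log before anything goes over the wire.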
Confidence Score Interpretation
90–100 - High. Task appears fully completed.
70–89 - Medium. Task likely complete with minor issues.
50–69 - Low. Task partially completed.
30–49 - Critical. Significant issues detected.
0–29 - Failed. Task likely not completed.
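The bands above translate directly into a lookup. In this minimal sketch, confidence_level is a hypothetical function and the label strings are taken from the table, so they may not match what result.confidence_level returns. Fractional scores fall into the band of the next lower boundary, so 89.5 is Medium.

```python
def confidence_level(score: float) -> str:
    """Map a 0-100 confidence score to the band names used above."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    if score >= 90:
        return "High"
    if score >= 70:
        return "Medium"
    if score >= 50:
        return "Low"
    if score >= 30:
        return "Critical"
    return "Failed"


for sample in (95, 72, 50, 31, 12):
    print(sample, confidence_level(sample))
```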
Configuration
Create a .env file. Set OPENROUTER_API_KEY and OPENROUTER_MODEL only if you want LLM Judge mode. The check weights must sum to 1.0. CONFIDENCE_THRESHOLD controls the pass/fail cutoff.
Recommended LLM Judge models (May 2026):
anthropic/claude-haiku-4-5 - cheap and fast judging
anthropic/claude-sonnet-4-6 or openai/gpt-5.4 - higher-quality judging
openai/gpt-4.1-mini - budget option
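The requirement that check weights sum to 1.0 can be enforced at aggregation time. The sketch below assumes a plain weighted average; aggregate is a hypothetical function, and the weight values and check names are invented for illustration rather than taken from the project's defaults.

```python
import math


def aggregate(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-check scores (0-100) into one weighted confidence score."""
    if not math.isclose(sum(weights.values()), 1.0, abs_tol=1e-9):
        raise ValueError(f"weights must sum to 1.0, got {sum(weights.values())}")
    if scores.keys() != weights.keys():
        raise ValueError("every check needs exactly one weight")
    return sum(scores[name] * weights[name] for name in weights)


# Hypothetical weights; the real defaults live in the project's configuration.
WEIGHTS = {"file": 0.3, "test": 0.3, "scope": 0.2, "llm_judge": 0.2}
SCORES = {"file": 100.0, "test": 40.0, "scope": 70.0, "llm_judge": 60.0}

print(round(aggregate(SCORES, WEIGHTS), 1))
```

Failing fast on a bad weight sum is preferable to silently producing a score outside the 0–100 range.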
Use Cases
CI/CD Integration - automatically verify PR claims before merging.
Code Review - get an independent assessment of task completion alongside a human review.
Agent Monitoring - detect when AI agents overstate progress in automated pipelines.
Quality Gates - block merges below a confidence threshold.
Documentation - generate verification reports for stakeholders.
Security
No hardcoded secrets
API keys via environment variables only
No data persistence
Local processing except for LLM Judge
Project Structure
src/agentliar/ # Checks, orchestration, scoring, reports, API, CLI, server
tests/
├── unit/ # Unit tests
├── adversarial/ # Adversarial tests
└── integration/ # Integration tests
examples/ # Sample inputs
action.yml # GitHub Action definition
pyproject.toml # Packaging and tooling
Testing
pytest # Full suite
pytest --cov=agentliar --cov-report=html # With coverage
pytest tests/unit/ # Unit tests only
pytest tests/adversarial/ # Adversarial tests only
pytest tests/integration/ # Integration tests only
Code Quality
ruff check . # Linting
ruff format . # Formatting
mypy src tests # Type checking
How I Built This Using NEO
This project was built using NEO. NEO is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks, including AI model evals, prompt optimization, and end-to-end AI pipeline development.
The requirement was a production-ready verification system for detecting false completion claims from coding agents - running four independent checks locally, with an optional LLM Judge via OpenRouter, and exposing the result through a CLI, Python API, GitHub Action, and HTTP API. NEO built the full implementation: the async orchestrator dispatching all four checks, the File, Test, Scope, and LLM Judge check modules, the weighted confidence scorer, the JSON and Markdown report generators, the Click CLI with verify, config, and analyze commands, the FastAPI HTTP server, the GitHub Action definition in action.yml, and the test suite split across unit, adversarial, and integration coverage.
How You Can Use and Extend This With NEO
Use it as a CI gate on every PR that includes AI-generated code.
Add the GitHub Action to your workflow with a confidence threshold. Any PR whose confidence score from the file, test, and scope checks falls below your threshold is blocked before merge - automatically, without a reviewer having to spot the placeholder implementation manually.
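Outside GitHub Actions, the same gate can be approximated in any CI system by turning a JSON report into an exit code. This is a sketch, not part of AgentLiar: gate is a hypothetical helper, and it assumes the report exposes a top-level "score" field.

```python
import json


def gate(report_json: str, threshold: float) -> int:
    """Return a process exit code: 0 to allow the merge, 1 to block it."""
    report = json.loads(report_json)
    score = float(report["score"])  # assumed top-level field
    if score < threshold:
        print(f"FAIL: confidence {score:.0f} is below threshold {threshold:.0f}")
        return 1
    print(f"PASS: confidence {score:.0f} meets threshold {threshold:.0f}")
    return 0


exit_code = gate('{"score": 42}', threshold=70)
# In a CI step: sys.exit(gate(Path("report.json").read_text(), 70))
```

A non-zero exit code is what most CI systems treat as a failed step, so no extra integration is needed.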
Use the LLM Judge for higher-confidence verification on critical tasks.
Set OPENROUTER_API_KEY and configure a judge model for tasks where the local checks alone are not sufficient. The LLM Judge runs independently from the other three checks and adds a cross-model perspective to the confidence score.
Extend it with additional check types.
The four checks share a common async interface in the orchestrator. A new check follows the same pattern and its weight is added to the configuration. The orchestrator, scorer, and reporters pick it up automatically.
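To make the extension pattern concrete, here is a sketch of what a shared async check interface and one new check might look like. The Check protocol, CheckReport, and ChangelogCheck are all hypothetical; the project's real interface may use different names and arguments, so treat this as a shape to adapt.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class CheckReport:
    name: str
    score: float  # 0-100
    evidence: list[str] = field(default_factory=list)


class Check(Protocol):
    """Assumed common interface: a name plus one async run() method."""

    name: str

    async def run(self, task: str, changed_files: list[str]) -> CheckReport: ...


class ChangelogCheck:
    """Example extension: did the change set touch a changelog file?"""

    name = "changelog"

    async def run(self, task: str, changed_files: list[str]) -> CheckReport:
        touched = [
            path for path in changed_files
            if path.lower().endswith("changelog.md")
        ]
        if touched:
            return CheckReport(self.name, 100.0, [f"updated {touched[0]}"])
        return CheckReport(self.name, 0.0, ["no changelog entry found"])


if __name__ == "__main__":
    new_check: Check = ChangelogCheck()
    print(asyncio.run(new_check.run("Add export feature", ["src/export.py"])))
```

Since the check weights must sum to 1.0, giving a new check a weight means rebalancing the existing ones.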
Final Notes
Agents that falsely claim completion are harder to catch than agents that fail outright - because the output exists and looks plausible. AgentLiar makes the verification systematic: four independent checks, a weighted confidence score, and structured evidence that tells you exactly where the claim breaks down.
The code is at https://github.com/dakshjain-1616/AgentLiar
You can also build with NEO in your IDE using the VS Code extension or Cursor.
You can use NEO MCP with Claude Code: https://heyneo.com/claude-code


