AI Grandpa Returns: Building a Chatbot That Talks to Phone Scammers

After a popular Habr article about an AI pretending to be an elderly man to waste scammers' time turned out to be hypothetical rather than real, one developer built the actual working system — and documented every technical problem along the way.

Many people remember the article about an AI that posed as an elderly man, kept phone scammers on the line for 31 minutes, and recorded everything. The piece went viral; only later did it emerge that the author had described a hypothetical scenario rather than an actual result — which understandably drew criticism.

The proposed approach, however, is realistic and achievable with available hardware and services. This article is about actually building it.

Why Not the Original Stack?

The original article described a local stack: Asterisk + SIP provider + dedicated server, Silero VAD for end-of-utterance detection in 20ms, local Whisper transcription in 400ms, Llama-3 70B generation in 500ms, Piper voice synthesis in 150ms. Total latency: approximately 1.1 seconds — nearly acceptable.

Two things seemed off: the processing speed appeared inflated by roughly a factor of two, and local TTS models produce obviously synthetic-sounding audio.

The hardware constraint was real: an old machine with 6GB of VRAM cannot fit Llama-3 70B or a quality Whisper variant. But hardware turned out not to matter. Instead of Asterisk, this implementation uses ngrok and a browser: ngrok generates a URL of the form https://xxxx.ngrok-free.app, and a WebRTC connection is established through it — no telephone infrastructure required.

The Architecture

  • Audio capture: WebRTC via browser, exposed publicly via ngrok
  • Silence detection: RMS amplitude analysis — a turn ends after 0.5 seconds during which the signal stays below the silence threshold
  • Transcription: OpenAI gpt-4o-transcribe — significantly more accurate than local Whisper for Russian with background noise
  • Response generation: Claude Haiku (Anthropic) — stable in roleplay scenarios and under pressure
  • Voice synthesis: ElevenLabs v3 with a custom voice — more natural-sounding, supports streaming

Total pipeline latency: 6–8 seconds between the caller finishing a sentence and the first audio of the response. This is unnatural for ordinary conversation, but elderly people do pause to think before answering — the delay is maskable.
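The stages above run strictly back to back per caller turn, which is why their latencies add up. A minimal sketch of the flow — the stage functions here are stubs standing in for the real API calls, and all names and sleep times are illustrative, not the project's code:

```python
import asyncio

# Stubs standing in for gpt-4o-transcribe, Claude Haiku, and ElevenLabs;
# the token sleeps mark where the real (multi-second) API latency sits.
async def transcribe(audio: bytes) -> str:
    await asyncio.sleep(0.01)
    return "transcribed caller speech"

async def generate_reply(transcript: str) -> str:
    await asyncio.sleep(0.01)
    return "Eh? Speak up, please. §SHORT§"

async def synthesize(reply: str) -> bytes:
    await asyncio.sleep(0.01)
    return b"\x00" * 320  # one 20 ms PCM frame as a placeholder

async def handle_turn(audio: bytes) -> bytes:
    """One caller turn: audio in, synthesized response audio out.
    The three awaits run sequentially, so their latencies sum."""
    transcript = await transcribe(audio)
    reply = await generate_reply(transcript)
    return await synthesize(reply)

if __name__ == "__main__":
    print(len(asyncio.run(handle_turn(b"\x01" * 160))))  # 320
```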

The Character: Gennady Petrovich Sokolov

The persona is a 78-year-old retired electrical engineer. His behavioral profile:

  • Law-abiding and cautious — asks for everything in writing
  • Slow information processing — asks for repetition often
  • Technical pedantry — insists on bureaucratic formalities
  • Hearing difficulties — mishears things, especially names
  • Financial anxiety — becomes distressed at any mention of money transfers

The system prompt encodes these traits precisely. Key rules:

  • Respond only to the most recent thing said — maximum two sentences
  • Never repeat a name back if transcription may have mangled it — reintroduce himself instead
  • Provide fictional data, never real information
  • Use action markers for pauses and sounds

The action markers embedded in Claude's responses are processed by the synthesis layer:

  • §SHORT§ — 1–2 second pause
  • §LONG§ — 3–5 second pause
  • §SIGH§, §COUGH§, §HMM§ — sound effects
  • §AWAY1§, §AWAY2§ — extended absence of 1–2 minutes
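A straightforward way to act on these markers is to split the model's reply into speech segments, pauses, and sound effects before handing text to synthesis. The parser below is an illustrative sketch, not the project's code; it assumes markers always have the §NAME§ shape, and the concrete pause durations are midpoints of the ranges above:

```python
import re

# Marker -> action tables; durations are midpoints of the ranges listed
# above (AWAY pauses fixed at 60/120 s here for clarity).
PAUSES = {"SHORT": 1.5, "LONG": 4.0, "AWAY1": 60.0, "AWAY2": 120.0}
SOUNDS = {"SIGH", "COUGH", "HMM"}

MARKER_RE = re.compile(r"§([A-Z0-9]+)§")

def parse_markers(text: str):
    """Split a reply into (kind, payload) actions:
    ('speak', text), ('pause', seconds), or ('sound', name)."""
    actions = []
    pos = 0
    for m in MARKER_RE.finditer(text):
        chunk = text[pos:m.start()].strip()
        if chunk:
            actions.append(("speak", chunk))
        name = m.group(1)
        if name in PAUSES:
            actions.append(("pause", PAUSES[name]))
        elif name in SOUNDS:
            actions.append(("sound", name))
        pos = m.end()
    tail = text[pos:].strip()
    if tail:
        actions.append(("speak", tail))
    return actions

print(parse_markers("Eh? §SHORT§ Speak up. §SIGH§"))
# [('speak', 'Eh?'), ('pause', 1.5), ('speak', 'Speak up.'), ('sound', 'SIGH')]
```

The synthesis layer then sends only the "speak" chunks to ElevenLabs and renders pauses and sounds locally.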

Detecting When a Scammer Is Shouting

RMS analysis of the audio signal distinguishes tone levels: calm speech at 2,000–4,000, raised voice at 4,000–6,000, shouting at 7,000+. When the threshold is exceeded, Claude receives a note in context and responds with confusion and distress — which tends to cause scammers to moderate their tone or escalate in ways that waste more of their time.
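A sketch of this classification, assuming the RMS values refer to 16-bit PCM sample amplitudes and mapping the unspecified 6,000–7,000 band to "raised" (the function name and frame handling are illustrative):

```python
import numpy as np

def classify_tone(frame: np.ndarray) -> str:
    """Map the RMS amplitude of a 16-bit PCM frame to a tone label,
    using the thresholds from the text (6,000-7,000 treated as 'raised')."""
    rms = float(np.sqrt(np.mean(frame.astype(np.float64) ** 2)))
    if rms >= 7000:
        return "shouting"   # this label triggers the note in Claude's context
    if rms >= 4000:
        return "raised"
    return "calm"

print(classify_tone(np.full(1600, 3000, dtype=np.int16)))  # calm
```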

The Core of the Implementation

#!/usr/bin/env python3
"""
Project Gennady 3.0 — WebRTC Server
Browser → WebRTC audio → Whisper API → Claude → ElevenLabs → WebRTC audio → Browser

Run:
    python server.py
    ngrok.exe http 8080

Install:
    pip install aiohttp aiortc numpy anthropic httpx pydub python-dotenv openai scipy
"""

import os, sys, re, wave, random, asyncio, tempfile, fractions, datetime
from pathlib import Path
import numpy as np

# ... full implementation available in the article repository

Problems Encountered and How They Were Solved

Recognition hallucinations. On short audio fragments the transcription model produced invented text. Solution: raise the RMS threshold, require a minimum buffer duration of 0.8 seconds, switch to gpt-4o-transcribe, add context hints.
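The duration and loudness gate can be sketched as follows; the sample rate, the RMS floor value, and the function name are assumptions, not taken from the project:

```python
import numpy as np

SAMPLE_RATE = 16000   # assumed capture rate
MIN_SECONDS = 0.8     # minimum buffer length from the fix above
RMS_FLOOR = 500       # assumed noise floor; tune per microphone

def worth_transcribing(samples: np.ndarray) -> bool:
    """Drop fragments that are too short or too quiet to transcribe;
    these were the inputs on which the model invented text."""
    if len(samples) < MIN_SECONDS * SAMPLE_RATE:
        return False
    rms = float(np.sqrt(np.mean(samples.astype(np.float64) ** 2)))
    return rms >= RMS_FLOOR
```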

Name distortion. "Sokolov" was transcribed as "Vysokovolk," "Gostokolov," and other inventions. Solution: the prompt forbids Gennady from repeating the incorrect version aloud — he simply reintroduces himself.

Dialogue became a monologue. While Gennady was generating a response, new caller phrases accumulated in the buffer and were sent with stale context. Solution: clear the buffer after each response, wait for a new utterance.
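One way to implement this fix is a turn buffer that ignores incoming audio while a reply is in flight and is cleared once the reply finishes; class and method names here are illustrative, not from the project:

```python
class TurnBuffer:
    """Accumulates caller audio frames; frames arriving while a response is
    being generated are dropped, and the buffer is cleared after each reply
    so stale context never reaches the model."""
    def __init__(self):
        self.frames: list[bytes] = []
        self.responding = False

    def push(self, frame: bytes) -> None:
        if not self.responding:       # ignore speech during our own reply
            self.frames.append(frame)

    def take_turn(self) -> bytes:
        """Hand the accumulated audio to the pipeline and start replying."""
        self.responding = True
        audio, self.frames = b"".join(self.frames), []
        return audio

    def reply_done(self) -> None:
        self.frames.clear()           # drop anything buffered mid-reply
        self.responding = False
```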

Phrases split mid-sentence. The silence detector fired on internal pauses within a sentence. Solution: require 0.5 seconds of continuous silence before treating a pause as end-of-turn.
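A sketch of the resulting end-of-turn detector, under assumed parameters (16 kHz audio, 20 ms frames, an RMS silence threshold of 500 — none of these values come from the project):

```python
import numpy as np

SAMPLE_RATE = 16000
SILENCE_RMS = 500     # assumed threshold; tune per setup
END_OF_TURN_MS = 500  # require 0.5 s of continuous silence

class TurnDetector:
    """Counts consecutive silent milliseconds; fires end-of-turn only after
    0.5 s of unbroken silence, so pauses inside a sentence don't split it."""
    def __init__(self):
        self.silent_ms = 0
        self.heard_speech = False

    def feed(self, frame: np.ndarray) -> bool:
        """Feed one PCM frame; returns True when the turn has ended."""
        rms = float(np.sqrt(np.mean(frame.astype(np.float64) ** 2)))
        if rms < SILENCE_RMS:
            self.silent_ms += 1000 * len(frame) // SAMPLE_RATE
        else:
            self.heard_speech = True
            self.silent_ms = 0    # any speech resets the silence run
        return self.heard_speech and self.silent_ms >= END_OF_TURN_MS
```

With 20 ms frames, the detector fires on the 25th consecutive silent frame after speech (25 × 20 ms = 500 ms), and any loud frame in between resets the count.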

Repeated reactions. The same response ("please don't shout") repeated verbatim. Solution: a prompt rule requiring the topic to be closed after the first response.

Response latency. The full Whisper + Claude + TTS cycle takes 6–7 seconds. The bottleneck is voice synthesis — local models sound obviously synthetic, faster alternatives are unacceptable in quality. Progress in this area should resolve the issue within a year.

Results and Costs

Two example conversations with the AI persona are included in the article. The recording experiment required significant time and approximately $5 in API costs — OpenAI, Anthropic, and ElevenLabs combined. The $5 ElevenLabs tier includes 30 minutes of synthesis.

The pipeline for generating a convincing phone persona is essentially real today. Within a year the more pressing problem may be the reverse: distinguishing an AI from a live human in a phone call.

On the original article: the core criticism was fair. The author chose to describe a concept rather than an implementation, which misled readers about the actual difficulty of the task. This is what the actual implementation looks like.

[Figure: system architecture diagram]