How to Automate Invoice Processing from Email: AI Pipeline for Accounting

The problem

The weekly routine that quietly eats hours

Invoice processing from email is invisible work. It does not feel like a problem because each individual action takes only minutes. But multiply those minutes across every supplier, every week, every month — and the numbers add up fast.

The specific pain here: five suppliers each send invoices in completely different formats. Different file naming conventions. Different field positions. Different PDF structures — some text-based, some scanned images. No two documents look the same. A human learns to handle this variation instinctively. A naive automation breaks on the first edge case.

The target outcome was clear: invoices arrive by email, get processed without any human involvement, land in the correct Google Drive folder with a standardised filename, and get logged to a database — with a Telegram notification for each successful run. If anything goes wrong, it should fail loudly, not silently.

The key design decision: AI for normalisation, not for retrieval

Gemini 2.5 Flash is not the source of truth — the PDF is. A deterministic baseline parser runs first (regex, keyword triggers, field extraction). Gemini verifies and normalises on top. When the AI API is unavailable, the fallback still produces usable output. The LLM handles variation; the parser handles facts.

End-to-end pipeline

How the pipeline flows

GitHub Actions cron → IMAP fetch unseen emails → Attachment filter (PDF only)
↓
Dedup check (message ID + hash) → pdf-parse → raw text → Baseline regex parser
↓
Gemini 2.5 Flash verify + normalise → Google Drive upload
↓
Supabase log → Telegram notification

GitHub Actions cron trigger

The workflow runs on a schedule, hourly or daily depending on invoice volume. No server is needed, and the GitHub Actions free tier comfortably covers a normal business volume.

IMAP fetch and filter

Connects to the inbox, finds unseen messages, extracts PDF attachments only. Message IDs recorded immediately for deduplication.
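
As a rough illustration of the "PDF only" filter, the sketch below checks both the declared type and the first bytes of the file. The MailAttachment shape and the function names are hypothetical (they are not from any IMAP library), and the magic-byte check is an extra safeguard this sketch assumes rather than something the pipeline is known to do.

```typescript
// Hypothetical shape of an attachment after the IMAP message has been parsed.
interface MailAttachment {
  filename: string;
  contentType: string;
  content: Uint8Array;
}

// "%PDF-" as bytes: a well-formed PDF file starts with this header.
const PDF_MAGIC = [0x25, 0x50, 0x44, 0x46, 0x2d];

function looksLikePdf(att: MailAttachment): boolean {
  // Some mail clients send PDFs as application/octet-stream, so accept the extension too.
  const declaredPdf =
    att.contentType.toLowerCase().startsWith("application/pdf") ||
    att.filename.toLowerCase().endsWith(".pdf");
  // Reject files that claim to be PDFs but do not start with the PDF header.
  const hasMagic = PDF_MAGIC.every((byte, i) => att.content[i] === byte);
  return declaredPdf && hasMagic;
}

function filterPdfAttachments(atts: MailAttachment[]): MailAttachment[] {
  return atts.filter(looksLikePdf);
}
```

Checking the header as well as the name keeps inline logos or mislabelled files from reaching the parser.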

Deduplication — two layers

Attachment-level: message ID + file hash. Business-level: supplier name + invoice number. Handles re-forwarded emails, cron restarts, and duplicate PDFs in one message.

PDF parsing + baseline extraction

pdf-parse extracts raw text. A deterministic parser runs regex and keyword triggers to pull structured fields: supplier, invoice number, date, total, VAT amount.
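
A minimal sketch of what such a baseline parser could look like, working on the raw text that pdf-parse returns. The supplier names, trigger phrases, and regex patterns here are illustrative assumptions; the real patterns depend on each supplier's layout.

```typescript
interface BaselineFields {
  supplier: string | null;
  invoiceNumber: string | null;
  date: string | null;
  total: string | null;
  vatAmount: string | null;
}

// Hypothetical keyword triggers: a supplier is recognised by a phrase in its PDF text.
const SUPPLIER_TRIGGERS: Array<[string, RegExp]> = [
  ["Acme GmbH", /acme\s+gmbh/i],
  ["Example Ltd", /example\s+ltd/i],
];

// Illustrative patterns only. Each one captures the field value in group 1.
const PATTERNS = {
  invoiceNumber: /invoice\s*(?:no\.?|number|#)\s*:?\s*([A-Z0-9][A-Z0-9\-\/]*)/i,
  date: /(?:invoice\s+)?date\s*:?\s*(\d{4}-\d{2}-\d{2}|\d{1,2}[.\/]\d{1,2}[.\/]\d{4})/i,
  total: /\btotal(?:\s+due|\s+amount)?\s*:?\s*(?:EUR|€)?\s*(\d[\d.,]*\d)/i,
  vatAmount: /\bVAT(?:\s*\(?\d{1,2}\s*%\)?)?\s*:?\s*(?:EUR|€)?\s*(\d[\d.,]*\d)/i,
};

function firstGroup(text: string, pattern: RegExp): string | null {
  const match = pattern.exec(text);
  return match?.[1] ?? null;
}

function detectSupplier(text: string): string | null {
  for (const [name, trigger] of SUPPLIER_TRIGGERS) {
    if (trigger.test(text)) return name;
  }
  return null;
}

// Returns raw strings as found in the PDF; null marks a field the regex missed.
function parseBaseline(rawText: string): BaselineFields {
  return {
    supplier: detectSupplier(rawText),
    invoiceNumber: firstGroup(rawText, PATTERNS.invoiceNumber),
    date: firstGroup(rawText, PATTERNS.date),
    total: firstGroup(rawText, PATTERNS.total),
    vatAmount: firstGroup(rawText, PATTERNS.vatAmount),
  };
}
```

Values are kept as raw strings on purpose: the parser reports what the document says, and a null tells the next stage exactly which fields still need attention.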

Gemini 2.5 Flash normalisation

Gemini verifies and normalises what the baseline extracted with low confidence: it normalises date formats, fills gaps the regex missed, and handles supplier-specific formatting quirks. It runs only on fields the baseline could not confidently extract.
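
One way to decide which fields go to the model is a cheap format check per field. The article does not define "confidently", so the rules below (ISO dates, plain decimal amounts) are assumptions made for this sketch, as are all the names.

```typescript
type FieldName = "supplier" | "invoiceNumber" | "date" | "total" | "vatAmount";
type Extracted = Record<FieldName, string | null>;

// Hypothetical confidence rules: a field counts as confident only if it is
// present and already in the normalised target format.
const CHECKS: Record<FieldName, (value: string) => boolean> = {
  supplier: (v) => v.trim().length > 0,
  invoiceNumber: (v) => /^[A-Z0-9][A-Z0-9\-\/]*$/i.test(v),
  date: (v) => /^\d{4}-\d{2}-\d{2}$/.test(v),
  total: (v) => /^\d+\.\d{2}$/.test(v),
  vatAmount: (v) => /^\d+\.\d{2}$/.test(v),
};

// Lists the fields that are missing or fail their check, in a stable order.
function fieldsNeedingModel(fields: Extracted): FieldName[] {
  return (Object.keys(CHECKS) as FieldName[]).filter((name) => {
    const value = fields[name];
    return value === null || !CHECKS[name](value);
  });
}
```

If the returned list is empty, the API call can be skipped entirely, which keeps both cost and exposure to model errors low.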

Google Drive filing

PDF uploaded to the correct supplier subfolder with a standardised filename: YYYY-MM-DD_Supplier_InvoiceNumber.pdf. Folder structure mirrors the client's existing accounting system.
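
The filename convention can be sketched as a small pure function. The sanitising rules (stripping accents, replacing other characters with hyphens) are an assumption of this example; the article only specifies the YYYY-MM-DD_Supplier_InvoiceNumber.pdf pattern.

```typescript
// Makes a value safe for a filename segment. Underscores are replaced too,
// so "_" stays unambiguous as the separator between segments.
function sanitizeSegment(value: string): string {
  return value
    .normalize("NFKD")
    .replace(/[\u0300-\u036f]/g, "") // drop combining accents left by NFKD
    .replace(/[^A-Za-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

function buildInvoiceFilename(
  isoDate: string,
  supplier: string,
  invoiceNumber: string,
): string {
  if (!/^\d{4}-\d{2}-\d{2}$/.test(isoDate)) {
    throw new Error(`Expected YYYY-MM-DD, got "${isoDate}"`);
  }
  return `${isoDate}_${sanitizeSegment(supplier)}_${sanitizeSegment(invoiceNumber)}.pdf`;
}
```

For example, buildInvoiceFilename("2024-03-12", "Müller & Söhne GmbH", "INV/2024/0042") returns "2024-03-12_Muller-Sohne-GmbH_INV-2024-0042.pdf". Rejecting non-ISO dates here forces normalisation to happen before filing, not after.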

Supabase log + Telegram alert

Every attachment gets a processing_log record — status, extracted fields, Drive file ID, timestamp. Nothing is silently lost. Telegram sends a summary after each run.

Two-layer LLM design

Why not just use Gemini for everything?

Pure LLM extraction is unreliable for financial documents. The model can hallucinate amounts, misread currency symbols, and produce inconsistent date formats across runs.

The baseline parser handles what regex does well: supplier names in known positions, invoice numbers matching known patterns, totals in expected fields. It is fast, deterministic, and costs nothing to run.

Gemini handles what regex cannot — variations in field position across supplier formats, non-standard date strings, edge cases where the PDF structure differs from the template. It sees only the fields the baseline could not confidently extract.

The result: deterministic accuracy where possible, AI flexibility where needed — and the pipeline still works if the API is down.
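
The fallback behaviour might be wired up roughly like this. The Normaliser type is a hypothetical stand-in for whatever function wraps the Gemini call; it is not the Gemini SDK's API, and treating "not confident" as simply "null" is a simplification for the sketch.

```typescript
type Fields = Record<string, string | null>;

// Hypothetical wrapper around the LLM call: gets the raw text and the names
// of the missing fields, returns suggested values for some or all of them.
type Normaliser = (
  rawText: string,
  missing: string[],
) => Promise<Record<string, string>>;

interface Resolved {
  fields: Fields;
  source: "baseline" | "baseline+llm";
  warning?: string;
}

async function resolveFields(
  rawText: string,
  baseline: Fields,
  normalise: Normaliser,
): Promise<Resolved> {
  const missing = Object.keys(baseline).filter((key) => baseline[key] === null);
  if (missing.length === 0) return { fields: baseline, source: "baseline" };

  try {
    const suggested = await normalise(rawText, missing);
    const merged: Fields = { ...baseline };
    for (const key of missing) {
      // Only fill gaps: values the baseline extracted are never overwritten.
      const value = suggested[key];
      if (typeof value === "string" && value.trim() !== "") merged[key] = value;
    }
    return { fields: merged, source: "baseline+llm" };
  } catch (err) {
    // API down or rate-limited: keep the baseline output and surface a warning.
    const reason = err instanceof Error ? err.message : String(err);
    return { fields: baseline, source: "baseline", warning: `LLM step skipped: ${reason}` };
  }
}
```

Because the model's answer is merged only into the gaps, a hallucinated total can never replace one the parser read directly from the PDF.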

Deduplication

Two layers — nothing processed twice

Invoices arrive in unpredictable ways: re-forwarded from a colleague, the same PDF attached to two follow-up emails, the cron job restarting mid-run. Without proper deduplication, any of these scenarios causes double-processing or double-filing.

Layer 1: Attachment-level. Message ID + SHA-256 hash of the PDF file. Catches re-sends of the exact same file before any processing begins.

Layer 2: Business-level. Supplier name + invoice number. Catches the same invoice arriving as a different attachment (renamed, re-exported from accounting software).
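
A sketch of how the two keys could be built, using Node's built-in crypto module. Combining message ID and hash into a single key, and the normalisation applied to the business key, are assumptions of this example. In the real pipeline the "seen" sets would be lookups against the database rather than in-memory sets.

```typescript
import { createHash } from "node:crypto";

// Layer 1: identifies one specific file inside one specific email.
function attachmentKey(messageId: string, pdf: Uint8Array): string {
  const hash = createHash("sha256").update(pdf).digest("hex");
  return `${messageId.trim()}:${hash}`;
}

// Layer 2: identifies the invoice itself, regardless of how the file looks.
function businessKey(supplier: string, invoiceNumber: string): string {
  const norm = (s: string) => s.trim().toLowerCase().replace(/\s+/g, " ");
  return `${norm(supplier)}|${norm(invoiceNumber)}`;
}

// Returns true if the key was already seen; otherwise records it.
function isDuplicate(seen: Set<string>, key: string): boolean {
  if (seen.has(key)) return true;
  seen.add(key);
  return false;
}
```

Layer 1 is checked before parsing, since it needs only the raw bytes. Layer 2 can only run after extraction, because it depends on the supplier and invoice number.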

What gets logged

Nothing silently lost — full observability

Every attachment gets a record in Supabase regardless of outcome — whether it processed successfully, was deduplicated, failed parsing, or hit an API error. The record includes: attachment filename, message ID, processing status, extracted fields, Drive file ID if uploaded, error message if failed, and timestamp.
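
The record described above could be modelled like this. The column names and status values are hypothetical; only the list of fields comes from the description.

```typescript
type ProcessingStatus = "processed" | "duplicate" | "parse_failed" | "api_error";

// Hypothetical row shape for the processing_log table.
interface ProcessingLogRecord {
  attachment_filename: string;
  message_id: string;
  status: ProcessingStatus;
  extracted_fields: Record<string, string | null> | null;
  drive_file_id: string | null;
  error_message: string | null;
  processed_at: string; // ISO 8601 timestamp
}

interface LogInput {
  attachmentFilename: string;
  messageId: string;
  status: ProcessingStatus;
  extractedFields?: Record<string, string | null>;
  driveFileId?: string;
  error?: unknown;
}

// Builds a complete row for every outcome, so failures are logged like successes.
function toLogRecord(input: LogInput, now: Date = new Date()): ProcessingLogRecord {
  const { error } = input;
  return {
    attachment_filename: input.attachmentFilename,
    message_id: input.messageId,
    status: input.status,
    extracted_fields: input.extractedFields ?? null,
    drive_file_id: input.driveFileId ?? null,
    error_message:
      error === undefined ? null : error instanceof Error ? error.message : String(error),
    processed_at: now.toISOString(),
  };
}
```

Using explicit nulls instead of omitted keys means every row has the same shape, which keeps audit queries simple.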

This means the client can always audit what happened to any specific invoice. "Where did the Siemens invoice from March 12 go?" The answer is one SQL query away.

The Telegram notification sends a run summary after each cron execution: how many emails checked, how many attachments found, how many processed, how many skipped as duplicates, any errors. If the pipeline runs silently with zero activity for too long, that itself becomes a signal worth investigating.
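
The run summary and the "too quiet" check can both be small pure functions, kept separate from the code that actually calls the Telegram Bot API. The field names, message wording, and idle threshold below are illustrative assumptions.

```typescript
interface RunSummary {
  emailsChecked: number;
  attachmentsFound: number;
  processed: number;
  duplicatesSkipped: number;
  errors: string[];
}

// Builds the plain-text message body for one cron run.
function formatRunSummary(s: RunSummary): string {
  const lines = [
    "Invoice pipeline run",
    `Emails checked: ${s.emailsChecked}`,
    `Attachments found: ${s.attachmentsFound}`,
    `Processed: ${s.processed}`,
    `Skipped as duplicates: ${s.duplicatesSkipped}`,
    `Errors: ${s.errors.length}`,
  ];
  for (const error of s.errors) lines.push(`- ${error}`);
  return lines.join("\n");
}

// True when no invoice has been processed for longer than the allowed window.
function isSuspiciouslyQuiet(
  lastProcessedAt: Date,
  now: Date,
  maxIdleHours: number,
): boolean {
  const idleHours = (now.getTime() - lastProcessedAt.getTime()) / 3600000;
  return idleHours > maxIdleHours;
}
```

A sensible maxIdleHours depends on how often suppliers normally invoice: a week of silence may be normal for one client and a red flag for another.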

FAQ

Common questions about invoice automation

How does AI extract data from PDF invoices?

A two-layer approach works best. First, a deterministic parser runs regex and keyword triggers to extract structured fields — supplier name, invoice number, date, total. Then Gemini 2.5 Flash verifies and normalises the output, handles edge cases, and fills fields the regex missed. This way, if the AI API is down, the baseline extraction still runs.

How do you prevent duplicate invoice processing?

Two deduplication layers: at the attachment level using message ID plus file hash, and at the business level using supplier name plus invoice number. This handles re-forwarded emails, cron job restarts, and cases where the same PDF arrives twice in one message. Every attachment gets a processing log record regardless of outcome — nothing is silently lost.

What is the infrastructure cost for an automated invoice pipeline?

If you use GitHub Actions as the scheduler, the infrastructure cost is effectively $0. On the GitHub Free plan, private repositories get 2,000 Actions minutes per month. Each job is billed in whole minutes, rounded up, so an hourly check that finishes in a few seconds still counts as one minute: roughly 24 × 31 = 744 minutes in the longest month, well inside the allowance. The only variable cost is AI API usage (Gemini 2.5 Flash tokens per PDF), which is minimal for standard business invoice volumes.

Tech stack

Tools used

Node.js, TypeScript, IMAP, pdf-parse, Gemini 2.5 Flash, Google Drive API, Supabase, GitHub Actions, Telegram Bot API, VS Code + Claude Code

Have a manual document processing routine in your business?

I build automation pipelines that eliminate recurring manual work: invoice processing, document classification, data extraction from email. Works with any document format — PDFs, Word files, scanned images. Based in Munich, working with clients across Europe.
