The weekly routine that quietly eats hours
Invoice processing from email is invisible work. It does not feel like a problem because each individual action takes only minutes. But multiply those minutes across every supplier, every week, every month, and the numbers add up fast.
The specific pain here: five suppliers each send invoices in completely different formats. Different file naming conventions. Different field positions. Different PDF structures: some text-based, some scanned images. No two documents look the same. A human learns to handle this variation instinctively. A naive automation breaks on the first edge case.
The target outcome was clear: invoices arrive by email, get processed without any human involvement, land in the correct Google Drive folder with a standardised filename, and get logged to a database, with a Telegram notification for each successful run. If anything goes wrong, it should fail loudly, not silently.
The key design decision: AI for normalisation, not for retrieval
Gemini 2.5 Flash is not the source of truth; the PDF is. A deterministic baseline parser runs first (regex, keyword triggers, field extraction). Gemini verifies and normalises on top. When the AI API is unavailable, the fallback still produces usable output. The LLM handles variation; the parser handles facts.
How the pipeline flows
Scheduled run → Inbox fetch (PDF attachments only)
↓
Dedup check (message ID + hash) → pdf-parse → raw text → Baseline regex parser
↓
Gemini 2.5 Flash verify + normalise → Google Drive upload
↓
Supabase log → Telegram notification
Runs on a schedule, hourly or daily depending on invoice volume. No server needed. Free tier covers any normal business volume easily.
Connects to the inbox, finds unseen messages, extracts PDF attachments only. Message IDs recorded immediately for deduplication.
Attachment-level: message ID + file hash. Business-level: supplier name + invoice number. Handles re-forwarded emails, cron restarts, and duplicate PDFs in one message.
pdf-parse extracts raw text. A deterministic parser runs regex and keyword triggers to pull structured fields: supplier, invoice number, date, total, VAT amount.
The AI verifies baseline output, normalises date formats, fills gaps the regex missed, and handles supplier-specific formatting quirks. Runs only on fields the baseline could not confidently extract.
PDF uploaded to the correct supplier subfolder with a standardised filename: YYYY-MM-DD_Supplier_InvoiceNumber.pdf. Folder structure mirrors the client's existing accounting system.
Every attachment gets a processing_log record: status, extracted fields, Drive file ID, timestamp. Nothing is silently lost. Telegram sends a summary after each run.
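As an illustration of the Drive upload step, here is a minimal sketch of how the standardised filename YYYY-MM-DD_Supplier_InvoiceNumber.pdf could be built. The function names and the sanitising rule (anything outside A-Z, a-z, 0-9 becomes a hyphen) are assumptions for this example, not the production code.

```typescript
// Hypothetical helper: makes a supplier name or invoice number safe for a filename.
// Underscores are reserved as separators, so they are replaced inside the parts too.
function sanitizePart(part: string): string {
  return part
    .trim()
    .replace(/[^A-Za-z0-9]+/g, "-") // collapse spaces, slashes, dots into one hyphen
    .replace(/^-+|-+$/g, ""); // drop leading and trailing hyphens
}

// Builds YYYY-MM-DD_Supplier_InvoiceNumber.pdf from already-normalised fields.
function buildInvoiceFilename(
  isoDate: string,
  supplier: string,
  invoiceNumber: string,
): string {
  if (!/^\d{4}-\d{2}-\d{2}$/.test(isoDate)) {
    throw new Error(`expected a YYYY-MM-DD date, got "${isoDate}"`);
  }
  return `${isoDate}_${sanitizePart(supplier)}_${sanitizePart(invoiceNumber)}.pdf`;
}

// buildInvoiceFilename("2024-03-12", "Siemens AG", "INV/2024/0042")
// returns "2024-03-12_Siemens-AG_INV-2024-0042.pdf"
```

Rejecting any date that is not already in ISO form keeps the filename from silently inheriting a supplier-specific date format. Note that this ASCII-only rule would also strip umlauts, so a real version may need a wider character set.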
Why not just use Gemini for everything?
Pure LLM extraction is unreliable for financial documents. The model can hallucinate amounts, misread currency symbols, and produce inconsistent date formats across runs.
The baseline parser handles what regex handles perfectly: supplier names in known positions, invoice numbers matching known patterns, totals in expected fields. It is fast, deterministic, and costs nothing.
Gemini handles what regex cannot: variations in field position across supplier formats, non-standard date strings, edge cases where the PDF structure differs from the template. It sees only the fields the baseline could not confidently extract.
The result: deterministic accuracy where possible, AI flexibility where needed, and the pipeline still works if the API is down.
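To make the baseline idea concrete, here is a small sketch of a deterministic parser that runs over raw PDF text. The supplier triggers and the regex patterns are invented for illustration; real supplier layouts would each need their own tuned patterns.

```typescript
// Field names come from the text; null means "the baseline found nothing".
interface BaselineFields {
  supplier: string | null;
  invoiceNumber: string | null;
  date: string | null;
  total: string | null;
  vatAmount: string | null;
}

// Keyword triggers: lowercase needle -> canonical supplier name (hypothetical list).
const SUPPLIER_TRIGGERS: ReadonlyArray<readonly [string, string]> = [
  ["siemens", "Siemens"],
  ["acme gmbh", "Acme"],
];

// Illustrative patterns only, not the patterns used in production.
const INVOICE_NUMBER = /\bInvoice\s*(?:Number|No\.?|#)\s*:?\s*([A-Z0-9][A-Z0-9\/-]*)/i;
const DATE = /\bDate\s*:?\s*(\d{4}-\d{2}-\d{2}|\d{1,2}[.\/]\d{1,2}[.\/]\d{2,4})/i;
const TOTAL = /\bTotal\s*:?\s*(?:EUR|€)?\s*(\d[\d.,]*\d|\d)/i;
const VAT = /\bVAT(?:\s*\(?\d{1,2}\s*%\)?)?\s*:?\s*(?:EUR|€)?\s*(\d[\d.,]*\d|\d)/i;

// Returns the first capture group of the pattern, or null when it does not match.
function grab(text: string, pattern: RegExp): string | null {
  const match = pattern.exec(text);
  return match?.[1] ?? null;
}

function parseBaseline(text: string): BaselineFields {
  const lower = text.toLowerCase();
  const hit = SUPPLIER_TRIGGERS.find(([needle]) => lower.includes(needle));
  return {
    supplier: hit ? hit[1] : null,
    invoiceNumber: grab(text, INVOICE_NUMBER),
    date: grab(text, DATE),
    total: grab(text, TOTAL),
    vatAmount: grab(text, VAT),
  };
}
```

The parser returns raw strings on purpose: it reports what is on the page, and leaves date and number normalisation to a later step.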
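One way to enforce "the AI only touches what the baseline was unsure about" is sketched below. The per-field confidence score, the 0.8 threshold, and the function names are assumptions made for this example.

```typescript
type FieldName = "supplier" | "invoiceNumber" | "date" | "total" | "vatAmount";

interface Extracted {
  value: string | null;
  confidence: number; // 0..1, hypothetical score assigned by the baseline parser
}

type BaselineResult = Record<FieldName, Extracted>;

const CONFIDENCE_THRESHOLD = 0.8; // assumed cut-off, tune per supplier

// Fields the model is allowed to look at: missing or low-confidence ones.
function fieldsNeedingAi(baseline: BaselineResult): FieldName[] {
  return (Object.keys(baseline) as FieldName[]).filter(
    (name) =>
      baseline[name].value === null ||
      baseline[name].confidence < CONFIDENCE_THRESHOLD,
  );
}

// Merges model answers back in, ignoring anything the baseline was confident about.
function mergeAiAnswers(
  baseline: BaselineResult,
  aiAnswers: Partial<Record<FieldName, string>>,
): Record<FieldName, string | null> {
  const allowed = new Set<FieldName>(fieldsNeedingAi(baseline));
  const merged = {} as Record<FieldName, string | null>;
  for (const name of Object.keys(baseline) as FieldName[]) {
    const aiValue = aiAnswers[name];
    merged[name] =
      allowed.has(name) && aiValue !== undefined ? aiValue : baseline[name].value;
  }
  return merged;
}
```

With this shape, even if the model returns a different total for an invoice where the regex was confident, the merge step discards it.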
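The "still works if the API is down" property can be sketched as a thin wrapper around the AI call. The `Verifier` type stands in for whatever function wraps the Gemini request; its signature and the `Outcome` shape are hypothetical.

```typescript
type Fields = Record<string, string | null>;

// Hypothetical signature for the function that calls the model and returns final fields.
type Verifier = (baseline: Fields, rawText: string) => Promise<Fields>;

interface Outcome {
  fields: Fields;
  source: "baseline+ai" | "baseline-only";
  aiError?: string;
}

// Tries the AI pass; on any failure, returns the untouched baseline and records why.
async function extractWithFallback(
  baseline: Fields,
  rawText: string,
  verify: Verifier,
): Promise<Outcome> {
  try {
    const fields = await verify(baseline, rawText);
    return { fields, source: "baseline+ai" };
  } catch (err) {
    const aiError = err instanceof Error ? err.message : String(err);
    return { fields: baseline, source: "baseline-only", aiError };
  }
}
```

Recording `source` and `aiError` on the result means a degraded run is still visible later, rather than looking identical to a fully verified one.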
Two layers: nothing processed twice
Invoices arrive in unpredictable ways: re-forwarded from a colleague, the same PDF attached to two follow-up emails, the cron job restarting mid-run. Without proper deduplication, any of these scenarios causes double-processing or double-filing.
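The two layers can be sketched as two kinds of keys checked against a store of already-seen values. The in-memory `Set` below is only a stand-in for a persistent table, and the key formats and names are assumptions for this example.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shapes, not taken from the production pipeline.
interface Attachment {
  messageId: string;
  content: Uint8Array;
}

interface InvoiceIdentity {
  supplier: string;
  invoiceNumber: string;
}

// Layer 1 (attachment-level): same bytes in the same message give the same key.
function attachmentKey(att: Attachment): string {
  const hash = createHash("sha256").update(att.content).digest("hex");
  return `${att.messageId}:${hash}`;
}

// Layer 2 (business-level): same supplier + invoice number, whichever email carried it.
function businessKey(id: InvoiceIdentity): string {
  return `${id.supplier.trim().toLowerCase()}|${id.invoiceNumber.trim().toUpperCase()}`;
}

// Stand-in for a persistent store; a real run would need this to survive restarts.
class DedupStore {
  private readonly seen = new Set<string>();

  // Returns true the first time a key is claimed, false on every repeat.
  claim(key: string): boolean {
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    return true;
  }
}
```

Layer 1 catches cron restarts and the same PDF attached twice to one message. A re-forwarded email arrives with a new message ID, so it passes layer 1 and is caught by layer 2 once the supplier and invoice number are extracted.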
Nothing silently lost: full observability
Every attachment gets a record in Supabase regardless of outcome โ whether it processed successfully, was deduplicated, failed parsing, or hit an API error. The record includes: attachment filename, message ID, processing status, extracted fields, Drive file ID if uploaded, error message if failed, and timestamp.
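A sketch of what such a record could look like as a typed row follows. The column names and status values are assumptions; only the list of fields comes from the description above.

```typescript
type ProcessingStatus = "processed" | "duplicate" | "parse_failed" | "api_error";

// Hypothetical column names for the processing_log table.
interface ProcessingLogRecord {
  attachment_filename: string;
  message_id: string;
  status: ProcessingStatus;
  extracted_fields: Record<string, string | null> | null;
  drive_file_id: string | null; // set only when the file was uploaded
  error_message: string | null; // set only when processing failed
  created_at: string; // ISO 8601 timestamp
}

interface LogInput {
  attachmentFilename: string;
  messageId: string;
  status: ProcessingStatus;
  extractedFields?: Record<string, string | null>;
  driveFileId?: string;
  errorMessage?: string;
}

// Builds one row per attachment, whatever the outcome was.
function buildLogRecord(input: LogInput, now: Date = new Date()): ProcessingLogRecord {
  const failed = input.status === "parse_failed" || input.status === "api_error";
  if (failed && !input.errorMessage) {
    throw new Error("failed records must carry an error message");
  }
  return {
    attachment_filename: input.attachmentFilename,
    message_id: input.messageId,
    status: input.status,
    extracted_fields: input.extractedFields ?? null,
    drive_file_id: input.status === "processed" ? (input.driveFileId ?? null) : null,
    error_message: failed ? (input.errorMessage ?? null) : null,
    created_at: now.toISOString(),
  };
}
```

Refusing to build a failed record without an error message is one small way to keep "fail loudly" true at the data level.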
This means the client can always audit what happened to any specific invoice. "Where did the Siemens invoice from March 12 go?" The answer is one SQL query away.
The Telegram notification sends a run summary after each cron execution: how many emails checked, how many attachments found, how many processed, how many skipped as duplicates, any errors. If the pipeline runs silently with zero activity for too long, that itself becomes a signal worth investigating.
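The run summary and the "too quiet" signal can be sketched as two small pure functions. The message wording, the field names, and the idea of a fixed quiet-hours threshold are assumptions; sending the text to Telegram is left out.

```typescript
interface RunStats {
  emailsChecked: number;
  attachmentsFound: number;
  processed: number;
  duplicatesSkipped: number;
  errors: string[];
}

// Formats the per-run summary as plain text, one counter per line.
function formatRunSummary(stats: RunStats): string {
  const lines = [
    "Invoice run summary",
    `Emails checked: ${stats.emailsChecked}`,
    `Attachments found: ${stats.attachmentsFound}`,
    `Processed: ${stats.processed}`,
    `Skipped as duplicates: ${stats.duplicatesSkipped}`,
    `Errors: ${stats.errors.length}`,
  ];
  for (const message of stats.errors) {
    lines.push(`- ${message}`);
  }
  return lines.join("\n");
}

// True when no attachment has been seen for longer than the allowed quiet period.
function isSuspiciouslyQuiet(
  lastActivity: Date,
  now: Date,
  maxQuietHours: number,
): boolean {
  const quietMs = now.getTime() - lastActivity.getTime();
  return quietMs > maxQuietHours * 60 * 60 * 1000;
}
```

A sensible threshold depends on how often suppliers normally send invoices; for a weekly cadence, something above seven days would avoid false alarms.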
I build automation pipelines that eliminate recurring manual work: invoice processing, document classification, data extraction from email. Works with any document format: PDFs, Word files, scanned images. Based in Munich, working with clients across Europe.