My tax script kept breaking. Claude Code read the whole thing and fixed it.

I had a working Google Apps Script that processed my tax invoices. It used Gemini directly — upload a PDF, get JSON back, write a row to the sheet. It worked. But it kept breaking in new ways, and every time I opened it to fix something, I lost track of what had changed and why.

🗂️ Before: Gemini in the loop, me in the middle

The original setup called Gemini from a browser chat window or the Apps Script editor. I’d describe a problem, get a function rewrite, paste it in, run it, repeat. It worked fine for isolated fixes but fell apart across sessions — by version three, I’d explain the same architectural decisions over and over because nothing was in context.

More importantly: to test anything, I had to point it at real invoice data. Real vendor names, real amounts, real files in Google Drive. Every iteration of the script had full access to everything.

🔒 The privacy shift: Claude Code in read-only mode

When I switched to Claude Code, the first thing I configured was read-only mode. Claude Code can be given granular tool permissions — I allowed it to read files and search the codebase, but not write to Drive, not send API calls on my behalf, not touch the spreadsheet.

This matters because the script processes real financial documents. I wanted an AI collaborator that could understand the full codebase — including the folder IDs, the sheet structure, the split logic — without being able to act on any of it autonomously. Read-only meant I could open the whole project without second-guessing what it might touch.

The other consideration was MCP (Model Context Protocol). Claude Code supports MCP servers that give it access to external tools — Google Drive, Sheets, web search. I deliberately didn’t connect those for this project. The script already had production credentials baked into CONFIG. Adding an MCP layer that could read or write those resources would have blurred the line between “AI understanding the code” and “AI with access to my data.” For a tax automation touching real invoices, that line matters.

🤖 What actually changed

The biggest shift wasn’t the model — it was context persistence. Claude Code reads the files. Not what I paste: the actual project, in full. When I said “add duplicate detection,” it looked at how the log sheet was structured, found where MD5 hashes would naturally live, and wrote detection logic that fit the existing pattern. I didn’t explain the architecture. It was already there.

That changes the unit of work. Instead of “fix this function,” I could say “there’s a problem with how we handle vendors that have multiple invoices in the same batch run” — and get a solution that understood what “batch run” meant in this specific script.

🔢 v1 to v7 — what each version actually added

Seven versions in, the script had grown MD5 deduplication, vendor split rules, a self-learning category brain, quarantine handling, and post-run verification. I could no longer hold it in my head, and every fix meant re-explaining the architecture to a chat window that had forgotten what it had seen before. The fix wasn’t a better model. It was giving the model the whole codebase instead of pasted fragments.

💡 The moment where it felt different

There was a session where I asked Claude Code to fix the year routing logic — the part that decides which archive folder to move a file to based on the invoice date. I’d been handling it with a chain of if (year === 2026)... else if (year === 2025)... that would silently fail for any new year.

Claude Code looked at the existing FOLDER_MAPPING config — a dictionary of sheet IDs to folder IDs I’d already built for the file verification step — and suggested routing year decisions through that same config rather than duplicating the logic. I hadn’t connected those two pieces. It did, because it could see both at once.

That’s a different kind of collaboration than paste-and-ask. It’s closer to pair programming with someone who has actually read your code.

Here’s the full pipeline, end to end:

graph TD
    A[Invoice in Drive Inbox] --> B{MD5 in log?}
    B -- Yes --> C[Move to Quarantine]
    B -- No --> D["Gemini API\nextract JSON"]
    D --> E{Vendor split rule?}
    E -- Match --> F["Apply split logic\nwrite multiple rows"]
    E -- No match --> G["Brain lookup\n+ category fallback"]
    F & G --> H["Write rows to year tab\nLog to script_log"]
    H --> I["Verify Drive folder\nvs sheet entries"]
    I --> J["Archive file\nto year folder"]

📝 What I actually learned

Start with a hard rule, not an AI decision. Every time I reached for the model to make a categorization call, I eventually replaced it with a deterministic rule. The AI is good at extraction — reading an invoice and returning structured data. It’s unreliable for decisions that have a correct answer I could just encode.

The audit trail is the product. The script_log tab started as a debug tool. It became the thing I actually check. Every decision the script made is there — including why something went to quarantine. That’s more useful than the automation itself.

Read-only mode is the right default. Letting an AI read everything while acting on nothing is a comfortable working posture for anything touching real data. It’s worth setting up explicitly rather than assuming it.

Building something that iterates is faster than building something perfect. v1 was embarrassingly simple. v7 handles edge cases I didn’t know existed when I started. I didn’t design for v7 — I got there one version at a time.

→ If you want the technical detail

The full script — source, config, and a function-by-function walkthrough — is in AI Invoice Automation (Gemini).