Fine-Tuning · Case Study

LoRA Fine-Tuning for Finance: How We Gained 31 Points Over the Base Model

Prabhakar Gupta · Principal AI Architect · 08 Apr 2026 · 8 min read

The headline is the +31 percentage points. The lesson is where they came from: not the GPUs, not hyperparameters — the six weeks we spent on data before training a single step.

The task: classifying banking workflows against RBI compliance requirements — dense regulatory language where the difference between "compliant" and "breach" hangs on one clause's interpretation. The base model with careful prompting and retrieval plateaued in the low 60s on our golden set. The failure mode wasn't missing knowledge — the circulars were right there in context. It was judgment: the model kept applying generic reasoning to a domain with its own logic. That's the signature of a behaviour problem, and behaviour problems are fine-tuning problems.

01The data work that actually won

We built ~12,000 labelled examples, and the preparation decisions made or broke the result. Label quality over quantity: every example dual-reviewed by compliance officers; disagreements adjudicated, not averaged — noisy labels poison fine-tunes far faster than small datasets starve them. Hard negatives on purpose: we over-sampled near-miss cases, pairs that look compliant and aren't, because that boundary is the entire job. Reasoning in the targets: outputs included a short structured justification citing the relevant clause, not just a label — training the reasoning pattern, not the answer key. A held-out set the training team never saw, drawn from a later time period, so our score meant something about the future, not the past.

02Why LoRA, and the run itself

Low-Rank Adaptation let us tune a 7B-class model on a single GPU, iterate in hours instead of days, and ship adapters we could version, A/B, and roll back like deployment artifacts — operational properties that matter more in an enterprise than squeezing the last point from a full fine-tune. The run that finally went to production was unglamorous: conservative learning rate, early stopping against the eval set, three failed experiments quietly discarded. Final score: 31 points over the prompted base model on held-out data, with DPO alignment afterwards to tighten output discipline against preferred formats.

The honest accounting

Roughly 70% of project effort was data and evaluation, 10% was training, 20% was deployment and monitoring. Teams budgeting the reverse are why "we tried fine-tuning, it didn't work" is such a common sentence.

Bottom line: fine-tuning is not a shortcut — it's a data-quality discipline with a training step at the end. Get the labels, the hard negatives, and the held-out set right, and the percentage points follow.

No spam. Unsubscribe anytime. New Tuesdays.

Build systems, not demos

My live 8-week Agentic AI course covers all of this in working code — batch 01 starts 7 July, limited to 50 seats.

View the course →