Per-tenant fine-tuning without leaking your data
Shared models memorize training data. Per-tenant fine-tuning is the only way to specialize a model for your team without your data leaking to someone else's queries. Here's how it actually works.
Every healthcare AI vendor wants to claim their model is "trained for your domain." Most of them mean they trained one big model on a pile of medical text and serve the same weights to every customer. That works fine for general medical knowledge. It breaks the moment your team's data needs to influence YOUR model without influencing anyone else's.
This is the per-tenant fine-tuning problem. Here's how Sanolith solves it with what we call a Sano adapter.
Why a shared model is a problem
A model fine-tuned on aggregated customer data has memorized some of that data. Multiple papers (Carlini et al. 2021, Nasr et al. 2023) show that LLMs can be prompted to regurgitate specific training examples. If your team's curated Q&A goes into a shared training run, another team's prompts can fish those Q&As out.
For commodity domain knowledge this is fine. For your institution's specific protocols, chart patterns that are redacted but still identifying, and internal terminology, it's a leak.
The fix is per-tenant fine-tuning: your data trains a model that only you serve.
The cost-of-doing-it-naively
The naïve approach is: train a separate full-parameter copy of the base model for each customer.
For a 70B base model, that means ~140GB of weights per customer. 100 customers = 14TB of weights. Loading any one customer's model into GPU memory takes ~30 seconds. Switching between customers is impossible at chat-response latencies.
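The arithmetic behind those figures is easy to check. A minimal sketch, assuming 16-bit weights (2 bytes per parameter) and decimal gigabytes; the function name is illustrative only:

```python
def full_weights_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate size of a full set of model weights in decimal GB."""
    # (params_billion * 1e9 params) * bytes_per_param / (1e9 bytes per GB)
    return params_billion * bytes_per_param


per_customer_gb = full_weights_gb(70)      # 70B parameters at 16-bit precision
fleet_tb = per_customer_gb * 100 / 1000    # 100 customers, converted to TB

print(f"{per_customer_gb:.0f} GB per customer, {fleet_tb:.0f} TB for 100 customers")
```

Halving the precision halves both numbers, but the cost still grows linearly with customer count.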
You could mitigate with model sharding, hot/warm caches, dedicated GPU pools per customer. All of these are real engineering work and the cost scales with customer count.
The clean answer is a Sano adapter.
How a Sano adapter works
A Sano adapter is what Sanolith ships when your team trains a private model. Under the hood, it's a Low-Rank Adaptation (LoRA) adapter built with the open-source peft library. LoRA is a parameter-efficient fine-tuning method, and we don't try to hide it because the technique is industry-standard and verifiable. Instead of retraining the full model, the trainer:
1. Freezes the base model weights
2. Injects small adapter matrices into specific layers (typically attention + MLP)
3. Trains ONLY the adapter weights on your data
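To make the three steps concrete, here is a toy, dependency-free sketch of a single LoRA-adapted linear layer. It is not the peft implementation or Sanolith's trainer; the class name, the tiny 4x4 weight matrix, and the rank and alpha values are all illustrative assumptions. The frozen weight W is left untouched, and the only trainable values live in the two small matrices A and B.

```python
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class LoRALinear:
    """Toy linear layer: frozen base weight W plus a low-rank update B @ A."""

    def __init__(self, w, rank, alpha=16.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(w), len(w[0])
        self.w = w  # frozen: training never touches this
        # A starts small and random, B starts at zero, so the adapter
        # contributes nothing until training updates B.
        self.a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.b = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x, use_adapter=True):
        base = matvec(self.w, x)
        if not use_adapter:
            return base
        delta = matvec(self.b, matvec(self.a, x))
        return [y + self.scale * d for y, d in zip(base, delta)]

    def trainable_params(self):
        return len(self.a) * len(self.a[0]) + len(self.b) * len(self.b[0])


w = [[1.0, 2.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 3.0],
     [0.5, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
layer = LoRALinear(w, rank=1)
x = [1.0, 1.0, 1.0, 1.0]

# Before training, the adapted layer matches the base layer exactly.
print(layer.forward(x) == layer.forward(x, use_adapter=False))
print(layer.trainable_params(), "trainable vs", len(w) * len(w[0]), "frozen")
```

At real model sizes the gap between trainable and frozen parameters is far larger than in this 4x4 toy, which is why the adapter file is so small.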
The adapter for a typical Llama-3-8B is ~50MB. For a 70B base it's ~200MB. Compare that to ~16GB and ~140GB of full weights, respectively.
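A rough size estimate shows where figures of this order come from. Each adapted weight matrix of shape (d_in, d_out) gains rank * (d_in + d_out) parameters. The sketch below assumes Llama-3-8B-style layer shapes, adapters on all attention and MLP projections, 16-bit storage, and ranks of 8 and 16. The actual rank and target layers are not stated here, so treat the output as an order-of-magnitude check rather than the real configuration.

```python
def lora_params(shapes, rank):
    """Parameters added by LoRA across a list of (d_in, d_out) matrices."""
    return sum(rank * (d_in + d_out) for d_in, d_out in shapes)


# Assumed Llama-3-8B-style shapes: hidden 4096, KV projection 1024,
# MLP width 14336, 32 transformer layers.
HIDDEN, KV, MLP, LAYERS = 4096, 1024, 14336, 32
per_layer = [
    (HIDDEN, HIDDEN),  # q_proj
    (HIDDEN, KV),      # k_proj
    (HIDDEN, KV),      # v_proj
    (HIDDEN, HIDDEN),  # o_proj
    (HIDDEN, MLP),     # gate_proj
    (HIDDEN, MLP),     # up_proj
    (MLP, HIDDEN),     # down_proj
]

for rank in (8, 16):
    n = lora_params(per_layer, rank) * LAYERS
    print(f"rank {rank}: {n / 1e6:.1f}M params, {n * 2 / 1e6:.0f} MB at 16-bit")
```

Under these assumptions the adapter lands in the tens of megabytes, the same range as the ~50MB figure, and well under 1% of the full 8B weights.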
At inference time, the base model loads once per GPU. Adapters are tiny and hot-swap in <100ms. One GPU can serve dozens of tenant-specific adapters from the same warm base model.
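One common way to keep "dozens of adapters per GPU" manageable is a least-recently-used cache: adapters that were used recently stay resident, and cold ones are unloaded to make room. The sketch below is a generic illustration, not the serving layer's actual code; `AdapterCache`, `load_fn`, and the capacity of 2 are hypothetical.

```python
from collections import OrderedDict


class AdapterCache:
    """Keeps at most `capacity` adapters resident; evicts the least recently used."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn          # hypothetical: fetches adapter weights for a tenant
        self._resident = OrderedDict()  # tenant_id -> adapter, coldest first

    def get(self, tenant_id):
        if tenant_id in self._resident:
            self._resident.move_to_end(tenant_id)  # mark as most recently used
            return self._resident[tenant_id]
        adapter = self.load_fn(tenant_id)          # cache miss: load from storage
        self._resident[tenant_id] = adapter
        if len(self._resident) > self.capacity:
            self._resident.popitem(last=False)     # unload the coldest adapter
        return adapter

    def resident(self):
        return list(self._resident)


cache = AdapterCache(capacity=2, load_fn=lambda t: f"weights-for-{t}")
for tenant in ["a", "b", "a", "c"]:
    cache.get(tenant)
print(cache.resident())  # ['a', 'c']: tenant b was the least recently used
```

The base model is never part of this cache. It stays loaded, and only the small adapters move in and out.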
What changes vs. shared training
A Sano adapter trained on Tenant A's data:
- Captures patterns from Tenant A only
- Does NOT modify the base model weights other tenants use
- Lives in storage scoped to Tenant A
- Loads into GPU memory only when a Tenant A request arrives
- Unloads when not in use
If Tenant B prompts the same base model, the model has no access to Tenant A's adapter. The path:
1. Request arrives tagged with tenant_id
2. Router loads base model + Tenant B's adapter (if any), NOT Tenant A's
3. Inference runs with Tenant B's adapter in the forward pass
4. Response returned
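The routing rule in that path can be sketched in a few lines. This is an illustration under stated assumptions, not Sanolith's router: the `Request` type, the `ADAPTERS` registry, and the bucket paths are hypothetical. The point it shows is that the adapter is chosen from the request's tenant_id alone, so nothing in the prompt can select another tenant's adapter.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Request:
    tenant_id: str  # assumed to come from authentication, not from user input
    prompt: str


# Hypothetical registry: tenant_id -> adapter location in tenant-scoped storage.
ADAPTERS: Dict[str, str] = {
    "tenant-a": "s3://example-bucket/tenant-a/adapter",
    "tenant-b": "s3://example-bucket/tenant-b/adapter",
}


def resolve_adapter(request: Request, registry: Dict[str, str]) -> Optional[str]:
    """Return the requesting tenant's adapter, or None to serve the base model alone."""
    return registry.get(request.tenant_id)


def handle(request: Request, registry: Dict[str, str]) -> dict:
    adapter = resolve_adapter(request, registry)
    # A real router would now run inference with the base model plus `adapter`.
    return {"tenant_id": request.tenant_id, "adapter": adapter}


print(handle(Request("tenant-b", "What is our first-line NSAID?"), ADAPTERS))
```

A tenant with no adapter gets `None` and falls back to the base model, which matches the "(if any)" in step 2.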
Tenant A's adapter never enters the computation, so nothing it learned from Tenant A's data can surface in Tenant B's output. Cross-tenant memorization is impossible by construction, not by policy.
What's in the adapter
The adapter encodes:
- Your team's preferred answer style (concise vs verbose, citations vs prose)
- Your formulary preferences (when multiple drugs are equivalent, which one your institution stocks)
- Your safety constraints (specific contraindications your team flags more aggressively)
- Vocabulary specific to your specialty (oncology terms, pediatric terms, rare-disease terms)
What's NOT in the adapter:
- Patient identifiers. These never enter training because the data is PHI-redacted before it reaches the trainer.
- Your raw chart notes. Only the curated Q&A produced by your team's reviewers.
- Anything not in the consented training corpus.
The training loop
For Sanolith specifically, the pipeline is:
1. Your team curates Q&A from clinical conversations (post-redaction)
2. A curation reviewer approves each example before it joins the training set
3. The training job runs on dedicated GPU (paideia service)
4. The trained Sano adapter is uploaded to a tenant-scoped S3 path
5. The inference layer (vLLM) hot-loads the adapter for that tenant's requests
The whole cycle is auditable. Every example that influenced your model is in the training set; every training set is in object storage with the run that produced the adapter.
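One way to make that link between a training set and an adapter checkable is a manifest that records a content hash of the approved examples alongside the run. The sketch below is an assumption about how such a record could look, not Sanolith's actual format; the field names (`approved_by`, `run_id`, `training_set_sha256`) are hypothetical.

```python
import hashlib
import json


def training_set_digest(examples):
    """Stable SHA-256 over the examples, independent of dict key order."""
    canonical = json.dumps(examples, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_manifest(tenant_id, run_id, examples):
    # Reviewer gate: only examples with a recorded approver join the training set.
    approved = [e for e in examples if e.get("approved_by")]
    return {
        "tenant_id": tenant_id,
        "run_id": run_id,
        "example_count": len(approved),
        "training_set_sha256": training_set_digest(approved),
    }


examples = [
    {"q": "Preferred reply style?", "a": "Lead with the answer.", "approved_by": "reviewer-1"},
    {"q": "Unreviewed question", "a": "Draft answer", "approved_by": None},
]
manifest = build_manifest("tenant-a", "run-001", examples)
print(manifest["example_count"], manifest["training_set_sha256"][:12])
```

Storing the manifest next to the adapter lets an auditor recompute the hash from the archived training set and confirm it matches the run.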
Who owns the adapter
This matters more than it sounds. Three positions you'll hear from vendors:
1. "We own all weights." (You can't take it with you on churn.)
2. "You own the weights, but they live in our infrastructure." (Custodianship model.)
3. "You own the weights and we'll export them on request." (Hard one to find in practice.)
The right answer for healthcare is #3. The Sano adapter is a derivative work trained on YOUR data; you own the derivative. The vendor is the custodian under the BAA. On churn, you get an export and the vendor purges within the BAA's deletion SLA.
If a vendor says "we own the weights," your team's institutional knowledge is now hostage. Walk away.
Why this beats RAG alone
Retrieval-Augmented Generation (RAG) is the alternative: keep the model untrained, look up your documents at inference time, stuff them into the context window.
RAG is necessary. It's not sufficient.
RAG handles facts that change ("our updated formulary lists ibuprofen 600mg as the first-line NSAID"). Fine-tuning handles patterns that don't change easily through context ("our team prefers concise replies that lead with the answer and back it up in two sentences").
The combination is the win. RAG for current facts, Sano adapters for stable preferences and style. Sanolith ships both per tenant.
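The division of labor can be sketched as a single request that carries both pieces: retrieved text in the prompt for current facts, and an adapter name for stable style. Everything here is a toy under stated assumptions. The word-overlap retriever stands in for a real search index, and `build_request` and the adapter naming scheme are hypothetical.

```python
import re


def tokens(text):
    """Lowercased word set, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9-]+", text.lower()))


def retrieve(query, documents, k=1):
    """Toy retriever: rank documents by how many words they share with the query."""
    query_words = tokens(query)
    ranked = sorted(documents, key=lambda d: len(query_words & tokens(d)), reverse=True)
    return ranked[:k]


def build_request(tenant_id, query, documents):
    context = "\n".join(retrieve(query, documents))
    return {
        "adapter": f"{tenant_id}-adapter",                      # stable style and preferences
        "prompt": f"Context:\n{context}\n\nQuestion: {query}",  # current facts
    }


docs = [
    "Formulary update: ibuprofen 600mg is the first-line NSAID.",
    "Parking permits renew every January.",
]
req = build_request("tenant-a", "Which NSAID is first-line?", docs)
print(req["adapter"])
print(req["prompt"])
```

Updating the formulary only means changing a document. Changing how the team wants answers phrased is what a new adapter training run is for.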
The math on cost
For a team of 30 clinicians sending ~50 chats/day each:
- 1,500 chats/day, ~30,000/month (at roughly 20 working days)
- Inference cost on a shared 70B model: ~$1,200/month (at current open-weight pricing)
- Adapter training: ~$15 per training run (a few GPU-hours on a single H100)
- Storage: <$1/month for the adapter weights
So per-tenant fine-tuning adds roughly $50/month of infrastructure cost (for example, about three training runs plus storage) to a team using $1,200/month of inference. That is about 4% overhead for a model that's genuinely specialized to your team.
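The overhead figure can be reproduced from the line items above. The one number the list does not give is how many training runs happen per month; the sketch below assumes three, which is an assumption for illustration only.

```python
def overhead(inference_usd, runs_per_month, usd_per_run, storage_usd):
    """Monthly fine-tuning cost and its share of the inference bill."""
    tuning = runs_per_month * usd_per_run + storage_usd
    return tuning, tuning / inference_usd


chats_per_day = 30 * 50  # 30 clinicians, ~50 chats each
tuning, share = overhead(
    inference_usd=1200,
    runs_per_month=3,    # assumption: run frequency is not stated in the list above
    usd_per_run=15,
    storage_usd=1,       # upper bound from the "<$1/month" line item
)
print(chats_per_day, tuning, f"{share:.1%}")
```

With three runs a month this comes to $46, or about 3.8% of the inference bill, in line with the rounded $50 and 4%. Retraining weekly would push it to roughly $60 to $75, still a single-digit percentage.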
That's the trade. A 4% cost premium for a model your team owns, that doesn't leak to other tenants, that improves over time as your reviewers approve more examples.
The math works.