Is Sanolith actually HIPAA-compliant?

We sign a BAA. PHI is redacted before any inference call. All data is encrypted at rest and in transit. Audit logs are append-only, tenant-scoped, and exportable. We undergo annual third-party HIPAA security risk assessments.

How is Sanolith different from ChatGPT Enterprise with a BAA?

ChatGPT Enterprise is one model from one vendor. Sanolith routes to whichever model fits the task (GPT, Claude, Llama, your own fine-tune) without rewriting integrations. Clinical tools (PubMed, DailyMed, RxNav, FAERS) are native. The audit ledger captures per-prompt redaction events, not just access logs.

What does the PHI redactor actually catch?

Names, MRN, DOB, SSN, ITIN, phone, email, addresses, dates within 1 day of admission, plus 40+ other identifiers. It is fail-closed: if the redactor errors, the request errors. PHI does not pass through to inference under any failure mode.
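Fail-closed means a redaction failure blocks the request instead of letting the raw prompt through. Here is a minimal sketch of that control flow. It uses a deliberately tiny regex set (SSN, email, phone) as a stand-in for the real detector. `redact`, `fail_closed_completion`, and `RedactionError` are hypothetical names, not Sanolith's API.

```python
import re
from typing import Callable

# Hypothetical toy patterns; a production redactor covers many more identifiers.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


class RedactionError(RuntimeError):
    """Raised when redaction fails; the request is blocked, never forwarded."""


def redact(text: str) -> str:
    """Replace each matched identifier with a bracketed placeholder label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


def fail_closed_completion(
    prompt: str,
    infer: Callable[[str], str],
    redactor: Callable[[str], str] = redact,
) -> str:
    try:
        clean = redactor(prompt)
    except Exception as exc:  # any redactor failure closes the path to inference
        raise RedactionError("redaction failed; request blocked") from exc
    return infer(clean)  # only ever reached with redacted text
```

The design choice is where the `try` sits. Inference is called only after redaction returns normally, so no exception path can reach `infer` with the original prompt.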

Who owns the fine-tuned model trained on our data?

The tenant owns the Sano adapter weights. They're trained on the tenant's data and live inside the tenant boundary. Sanolith is the custodian under the BAA, not the owner. On churn, the tenant receives an export of the weights, and Sanolith purges them from its infrastructure within the BAA's deletion SLA.

Can we bring our own model or our own GPUs?

Yes, on the Enterprise tier. Point Sanolith at your vLLM cluster, your AWS Bedrock account, or your on-prem inference endpoint. Same redactor, same audit, same API. Self-hosted inference works for air-gapped deployments.

What happens to our data if we churn?

On termination, the customer receives a full export of documents, audit ledger, and fine-tuned model weights within 30 days. Sanolith purges all tenant data (embeddings, chat history, audit ledger backups) within 60 days, per the BAA. Certified destruction report on request.

How long does it take to go live?

Starter tier is self-serve, live in 15 minutes. Team tier with BAA takes about 5 business days for paperwork and onboarding. Enterprise with custom integrations runs 2 to 4 weeks depending on scope.

Is the Sanolith audit ledger really tamper-evident?

Append-only Postgres table with row-level security, plus hash-chained checkpoints written hourly to immutable object storage. Every entry is timestamped and signed, and the chain can be recomputed from the entries and checked against the checkpoints.
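Hash chaining is what makes the ledger tamper-evident. Each entry's hash covers the previous hash, so altering or deleting any earlier entry changes the head hash that a checkpoint recorded. Below is a minimal sketch using Python's hashlib. It assumes JSON-serializable entries and a hypothetical all-zero genesis value, and it leaves out entry signing and the Postgres side entirely.

```python
from __future__ import annotations

import hashlib
import json

GENESIS = "0" * 64  # hypothetical starting value for an empty chain


def entry_hash(prev_hash: str, entry: dict) -> str:
    """Hash an entry together with the previous hash, so editing any entry changes every later hash."""
    payload = json.dumps(entry, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def chain_head(entries: list[dict]) -> str:
    """Fold the whole ledger into one head hash (what an hourly checkpoint would store)."""
    head = GENESIS
    for entry in entries:
        head = entry_hash(head, entry)
    return head


def verify(entries: list[dict], checkpoint: str) -> bool:
    """Recompute the chain from the entries and compare it with a stored checkpoint."""
    return chain_head(entries) == checkpoint
```

Sorting keys during serialization keeps the hash stable regardless of dict insertion order. Without that, two identical entries could hash differently.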

Per-tenant fine-tuning without leaking your data

Every healthcare AI vendor wants to claim their model is "trained for your domain." Most of them mean they trained one big model on a pile of medical text and serve the same weights to every customer. That works fine for general medical knowledge. It breaks the moment your team's data needs to influence YOUR model without influencing anyone else's.

This is the per-tenant fine-tuning problem. Here's how Sanolith solves it with what we call a Sano adapter.

Why a shared model is a problem

A model fine-tuned on aggregated customer data has memorized some of that data. Multiple papers (Carlini et al. 2021, Nasr et al. 2023) show that LLMs can be prompted to regurgitate specific training examples. If your team's curated Q&A goes into a shared training run, another team's prompts can fish those Q&As out.

For commodity domain knowledge this is fine. For your institution's specific protocols, chart patterns that are redacted but still identifying, or internal terminology, it's a leak.

The fix is per-tenant fine-tuning: your data trains a model that only you serve.

The cost of doing it naively

The naïve approach is to train a separate full-parameter copy of the base model for each customer.

For a 70B base model, that means ~140GB of weights per customer. 100 customers = 14TB of weights. Loading any one customer's model into GPU memory takes ~30 seconds. Switching between customers is impossible at chat-response latencies.

You could mitigate this with model sharding, hot/warm caches, or dedicated GPU pools per customer. All of these are real engineering work, and their cost scales with customer count.

The clean answer is a Sano adapter.

How a Sano adapter works

A Sano adapter is what Sanolith ships when your team trains a private model. Under the hood, it's a Low-Rank Adaptation (LoRA) adapter built with the open-source peft library. LoRA is a parameter-efficient fine-tuning method, and we don't try to hide it because the technique is industry-standard and verifiable. Instead of retraining the full model, the trainer:

1. Freezes the base model weights
2. Injects small adapter matrices into specific layers (typically attention + MLP)
3. Trains ONLY the adapter weights on your data
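The three steps above reduce to one small piece of linear algebra. The layer computes W x + (alpha / r) B A x, where W stays frozen and only the low-rank factors A and B are trained. Below is a minimal pure-Python sketch of a single adapted layer. It assumes toy dimensions, a zero-initialized B (the usual LoRA convention, so the adapter starts as a no-op), and plain SGD. `LoRALinear` is illustrative only and is not the peft API.

```python
from __future__ import annotations

import random


def matvec(m: list[list[float]], x: list[float]) -> list[float]:
    """Dense matrix-vector product for small illustrative matrices."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


class LoRALinear:
    """Frozen weight W (d_out x d_in) plus a trainable low-rank update B @ A."""

    def __init__(self, weight: list[list[float]], rank: int = 2, alpha: float = 2.0, seed: int = 0) -> None:
        self.weight = weight  # step 1: frozen; nothing below ever writes to it
        d_out, d_in = len(weight), len(weight[0])
        rng = random.Random(seed)
        self.rank, self.scale = rank, alpha / rank
        # step 2: inject adapter matrices; A is small random, B is zero so the update starts at 0
        self.a = [[rng.gauss(0.0, 0.1) for _ in range(d_in)] for _ in range(rank)]
        self.b = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x: list[float]) -> list[float]:
        base = matvec(self.weight, x)
        delta = matvec(self.b, matvec(self.a, x))
        return [y + self.scale * d for y, d in zip(base, delta)]

    def train_step(self, x: list[float], target: list[float], lr: float = 0.05) -> float:
        """Step 3: one SGD step on 0.5 * ||forward(x) - target||^2 that updates only A and B.

        Returns the loss measured before the update.
        """
        h = matvec(self.a, x)
        err = [y - t for y, t in zip(self.forward(x), target)]
        grad_b = [[self.scale * e * hk for hk in h] for e in err]
        bt_err = [sum(self.b[i][k] * err[i] for i in range(len(err))) for k in range(self.rank)]
        grad_a = [[self.scale * g * xj for xj in x] for g in bt_err]
        for row, grow in zip(self.b, grad_b):
            for k, g in enumerate(grow):
                row[k] -= lr * g
        for row, grow in zip(self.a, grad_a):
            for j, g in enumerate(grow):
                row[j] -= lr * g
        return 0.5 * sum(e * e for e in err)
```

Both gradients are computed before either factor is updated, which keeps the step a true gradient step. Zero-initializing B means the first forward pass returns exactly the base model's output.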

The adapter for a typical Llama-3-8B is ~50MB. For a 70B base, it's ~200MB. Compare that with ~16GB and ~140GB for the full weights.
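Those sizes follow from simple parameter counting: each adapted d_in-to-d_out projection adds r·(d_in + d_out) trainable values. The sketch below works through that arithmetic under several assumptions: rank 8, adapters on all seven linear projections of each Llama 3 decoder layer, bf16 storage (2 bytes per value), and the published Llama 3 configuration dimensions. The rank and target modules a Sano adapter actually uses are not stated in the text.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ModelDims:
    hidden: int
    intermediate: int
    layers: int
    kv_dim: int  # num_key_value_heads * head_dim (grouped-query attention)


LLAMA3_8B = ModelDims(hidden=4096, intermediate=14336, layers=32, kv_dim=8 * 128)
LLAMA3_70B = ModelDims(hidden=8192, intermediate=28672, layers=80, kv_dim=8 * 128)


def lora_params(d: ModelDims, rank: int) -> int:
    """Trainable LoRA parameters when every linear projection in every layer is adapted."""
    projections = [  # (d_in, d_out) for each adapted projection
        (d.hidden, d.hidden),        # q_proj
        (d.hidden, d.kv_dim),        # k_proj
        (d.hidden, d.kv_dim),        # v_proj
        (d.hidden, d.hidden),        # o_proj
        (d.hidden, d.intermediate),  # gate_proj
        (d.hidden, d.intermediate),  # up_proj
        (d.intermediate, d.hidden),  # down_proj
    ]
    # A is rank x d_in and B is d_out x rank, so each projection adds rank * (d_in + d_out)
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in projections)
    return per_layer * d.layers


def size_mb(params: int, bytes_per_param: int = 2) -> float:
    """On-disk size in megabytes (10**6 bytes), assuming bf16 by default."""
    return params * bytes_per_param / 1e6


for name, dims in [("Llama-3-8B", LLAMA3_8B), ("Llama-3-70B", LLAMA3_70B)]:
    p = lora_params(dims, rank=8)
    print(f"{name}: {p:,} adapter params, ~{size_mb(p):.0f}MB")
```

Under these assumptions, the 8B adapter comes to about 21M parameters (~42MB) and the 70B adapter to about 104M parameters (~207MB). Both land in the same range as the figures above. Doubling the rank doubles both.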

At inference time, the base model loads once per GPU. Adapters are tiny and hot-swap in <100ms. One GPU can serve dozens of tenant-specific adapters from the same warm base model.
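The hot-swap pattern can be pictured as a small least-recently-used cache of adapters sitting beside one warm base model. This is a minimal sketch under stated assumptions: `AdapterCache`, the `load` callback, and the capacity are hypothetical, and the code does not reproduce vLLM's own adapter management.

```python
from __future__ import annotations

from collections import OrderedDict
from typing import Callable


class AdapterCache:
    """Keep at most `capacity` tenant adapters resident beside one shared base model."""

    def __init__(self, capacity: int, load: Callable[[str], object]) -> None:
        self.capacity = capacity
        self._load = load  # cold path: fetch adapter weights from storage
        self._resident: OrderedDict[str, object] = OrderedDict()
        self.evicted: list[str] = []

    def get(self, tenant_id: str) -> object:
        if tenant_id in self._resident:
            self._resident.move_to_end(tenant_id)  # hit: mark as most recently used
            return self._resident[tenant_id]
        adapter = self._load(tenant_id)
        self._resident[tenant_id] = adapter
        if len(self._resident) > self.capacity:
            stale, _ = self._resident.popitem(last=False)  # unload the least recently used
            self.evicted.append(stale)
        return adapter

    def resident(self) -> list[str]:
        return list(self._resident)
```

The base model never appears in the cache because it is loaded once and shared. Only the small per-tenant adapters move in and out.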

What changes vs. shared training

A Sano adapter trained on Tenant A's data:

Captures patterns from Tenant A only
Does NOT modify the base model weights other tenants use
Lives in storage scoped to Tenant A
Loads into GPU memory only when a Tenant A request arrives
Unloads when not in use

If Tenant B prompts the same base model, the model has no access to Tenant A's adapter. The path:

1. Request arrives tagged with tenant_id
2. Router loads base model + Tenant B's adapter (if any), NOT Tenant A's
3. Inference runs with Tenant B's adapter in the forward pass
4. Response returned

Tenant A's adapter never enters the computation. Cross-tenant memorization is impossible by construction, not by policy.
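The routing rule can be sketched as a pure function from an authenticated request to the pieces of the forward pass. The sketch assumes a hypothetical registry that maps each tenant_id to a tenant-prefixed storage key. `Request`, `plan_forward_pass`, `BASE_MODEL`, and the `tenants/<id>/` layout are illustrative names, not Sanolith's.

```python
from __future__ import annotations

from dataclasses import dataclass

BASE_MODEL = "shared-base"  # hypothetical identifier for the warm base model


@dataclass(frozen=True)
class Request:
    tenant_id: str  # taken from the authenticated session, never from the prompt text
    prompt: str


def plan_forward_pass(req: Request, registry: dict[str, str]) -> tuple[str, str | None]:
    """Return (base model, adapter key) for this request; other tenants' entries are never read."""
    adapter_key = registry.get(req.tenant_id)  # None: serve the base model alone
    # Defense in depth: a misconfigured registry must not hand back another tenant's path.
    if adapter_key is not None and not adapter_key.startswith(f"tenants/{req.tenant_id}/"):
        raise PermissionError(f"adapter {adapter_key!r} is not scoped to {req.tenant_id!r}")
    return BASE_MODEL, adapter_key
```

The prefix check includes the trailing slash, so a tenant named `tenant-a` cannot match a path belonging to `tenant-ab`.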

What's in the adapter

The adapter encodes:

Your team's preferred answer style (concise vs verbose, citations vs prose)
Your formulary preferences (when multiple drugs are equivalent, which one your institution stocks)
Your safety constraints (specific contraindications your team flags more aggressively)
Vocabulary specific to your specialty (oncology terms, pediatric terms, rare-disease terms)

What's NOT in the adapter:

Patient identifiers. These never enter training because the data is PHI-redacted before it reaches the trainer.
Your raw chart notes. Only the curated Q&A produced by your team's reviewers.
Anything not in the consented training corpus.

The training loop

For Sanolith specifically, the pipeline is:

1. Your team curates Q&A from clinical conversations (post-redaction)
2. A curation reviewer approves each example before it joins the training set
3. The training job runs on dedicated GPU (paideia service)
4. The trained Sano adapter is uploaded to a tenant-scoped S3 path
5. The inference layer (vLLM) hot-loads the adapter for that tenant's requests

The whole cycle is auditable. Every example that influenced your model is in the training set; every training set is in object storage with the run that produced the adapter.
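One way to make that audit trail concrete is a per-run manifest. It pins the approved examples by content hash and records them next to tenant-scoped storage keys. This is a sketch under assumptions: the `Example` record, the `tenants/<id>/runs/<run>/` key layout, and the manifest fields are all hypothetical. The text only says that adapters land under a tenant-scoped S3 path.

```python
from __future__ import annotations

import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    question: str
    answer: str
    approved_by: str | None  # reviewer id; None means not yet approved


def example_digest(ex: Example) -> str:
    """Content hash of one curated Q&A pair."""
    body = json.dumps({"q": ex.question, "a": ex.answer}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def build_manifest(tenant_id: str, run_id: str, examples: list[Example]) -> dict:
    """Record exactly which approved examples fed a training run, under tenant-scoped keys."""
    approved = [ex for ex in examples if ex.approved_by]  # reviewer gate from step 2
    prefix = f"tenants/{tenant_id}/runs/{run_id}/"
    return {
        "tenant_id": tenant_id,
        "run_id": run_id,
        "training_set_key": prefix + "training_set.jsonl",
        "adapter_key": prefix + "adapter/",
        "example_sha256": [example_digest(ex) for ex in approved],
    }
```

Hashing each example, rather than only the whole set, lets an auditor check whether a specific Q&A pair influenced a given adapter.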

Who owns the adapter

This matters more than it sounds. Three positions you'll hear from vendors:

1. "We own all weights." (You can't take it with you on churn.)
2. "You own the weights, but they live in our infrastructure." (Custodianship model.)
3. "You own the weights and we'll export them on request." (Hard one to find in practice.)

The right answer for healthcare is #3. The Sano adapter is a derivative work trained on YOUR data; you own the derivative. The vendor is the custodian under the BAA. On churn, you get an export and the vendor purges within the BAA's deletion SLA.

If a vendor says "we own the weights," your team's institutional knowledge is now hostage. Walk away.

Why this beats RAG alone

Retrieval-Augmented Generation (RAG) is the alternative: keep the model untrained, look up your documents at inference time, stuff them into the context window.

RAG is necessary. It's not sufficient.

RAG handles facts that change ("our updated formulary lists ibuprofen 600mg as the first-line NSAID"). Fine-tuning handles patterns that don't change easily through context ("our team prefers concise replies that lead with the answer and back it up in two sentences").

The combination is the win. RAG for current facts, Sano adapters for stable preferences and style. Sanolith ships both per tenant.
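The sketch below shows how the two halves meet at request time: retrieval supplies current facts in the prompt, and the tenant's adapter supplies style and preferences. The keyword-overlap retriever is a toy stand-in for embedding search. `build_request` and the request dict are hypothetical shapes, not a Sanolith or vLLM API.

```python
from __future__ import annotations

import re


def tokens(text: str) -> set[str]:
    """Lowercased word tokens, with punctuation dropped."""
    return set(re.findall(r"[a-z0-9-]+", text.lower()))


def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Toy keyword-overlap retrieval; a real system would use embeddings."""
    q = tokens(query)
    scored = sorted(docs, key=lambda d: len(q & tokens(d)), reverse=True)
    return [d for d in scored[:k] if q & tokens(d)]  # drop docs with no overlap


def build_request(tenant_id: str, query: str, docs: list[str], adapters: dict[str, str]) -> dict:
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return {
        "adapter": adapters.get(tenant_id),  # stable style and preferences (fine-tuned)
        "prompt": f"Context:\n{context}\n\nQuestion: {query}",  # current facts (retrieved)
    }
```

Updating the formulary only means changing `docs`. The adapter does not need retraining to pick up the new fact.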

The math on cost

For a team of 30 clinicians sending ~50 chats/day each:

1,500 chats/day, ~30,000/month (assuming ~20 working days)
Inference cost on a shared 70B model: ~$1,200/month (at current open-weight pricing)
Adapter training: ~$15 per training run (a few GPU-hours on a single H100)
Storage: <$1/month for the adapter weights

So per-tenant fine-tuning (a few training runs a month plus storage) adds roughly $50/month of infrastructure cost to a team using $1,200/month of inference. That's about 4% overhead for a model that's genuinely specialized to your team.
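The arithmetic above is written out below so the assumptions are explicit. Two inputs are not stated in the text and are assumed here: about 20 working days a month (which is how 1,500 chats/day becomes ~30,000/month) and three training runs a month. The other figures come from the list above.

```python
from __future__ import annotations


def monthly_overhead(
    clinicians: int = 30,
    chats_per_day: int = 50,
    working_days: int = 20,  # assumption: turns 1,500 chats/day into ~30,000/month
    inference_per_month: float = 1_200.0,
    cost_per_run: float = 15.0,
    runs_per_month: int = 3,  # assumption: a few retraining runs per month
    storage_per_month: float = 1.0,  # upper bound from the text (<$1/month)
) -> tuple[int, float, float]:
    """Return (chats per month, adapter cost per month, adapter cost as a share of inference)."""
    chats = clinicians * chats_per_day * working_days
    adapter_cost = runs_per_month * cost_per_run + storage_per_month
    return chats, adapter_cost, adapter_cost / inference_per_month


chats, adapter_cost, overhead = monthly_overhead()
print(f"{chats:,} chats/month, ${adapter_cost:.0f}/month for adapters, {overhead:.1%} overhead")
```

With three runs a month, this comes to about $46, or 3.8%. The text's rounder $50 and 4% sit in the same range. Retraining more often scales the adapter line linearly.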

That's the trade: a 4% cost premium for a model your team owns, that doesn't leak to other tenants, and that improves over time as your reviewers approve more examples.

The math works.