Technical Due Diligence & M&A

The AI Due Diligence Checklist Nobody Is Using Yet

Traditional technical due diligence misses the AI-specific risks. Here's the checklist I developed for evaluating AI features in startup acquisitions and funding rounds — model provenance, data rights, eval evidence, and vendor lock-in.

Craig Hoffmeyer · 7 min read

Last quarter I conducted technical due diligence for a growth-stage investor evaluating an AI-native SaaS company. The standard DD checklist — code quality, architecture, security, team, ops — came back clean. The engineering was solid. But when I dug into the AI-specific dimensions, the picture changed. The company's core feature depended entirely on a single model provider with no fallback. Their training data included customer inputs with no clear licensing. Their eval suite was three manual test cases. And their cost model assumed pricing that the model provider had already signaled would increase.

None of this showed up on the traditional DD checklist. The investor would have funded a company sitting on four material risks that a standard technical assessment missed entirely.

AI features introduce risks that did not exist two years ago, and the DD frameworks have not caught up. This article is the AI-specific checklist I have built through conducting these assessments.

The six categories

The AI DD checklist covers six categories that traditional DD either misses or treats superficially.

1. Model provenance and dependency

Which models does the product use? Map every model: foundation models (GPT-4, Claude, Gemini), fine-tuned models, open-weights models, and any models embedded in third-party tools.

What is the vendor concentration? If 100% of AI functionality depends on a single provider, what happens if that provider changes pricing, changes terms of service, rate-limits the account, or goes down? The risk is not theoretical — pricing changes and API deprecations have hit production products repeatedly.

Is there a fallback strategy? Can the product switch to an alternative model within days, not months? This requires an abstraction layer between the product and the model provider. Products that call the OpenAI API directly from business logic are locked in; products that route through a model gateway can switch.
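The abstraction layer does not need to be elaborate. Here is a minimal sketch of the gateway pattern, in Python; the provider names, pinned model strings, and the idea of passing adapter callables are all illustrative, not any specific vendor's API:

```python
# Minimal model-gateway sketch: business logic calls gateway.complete(),
# never a provider SDK directly. Routes and adapters are illustrative.
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ModelRoute:
    provider: str                  # e.g. "openai", "anthropic"
    model: str                     # a pinned version string, not "latest"
    call: Callable[[str], str]     # adapter wrapping the provider's SDK


class ModelGateway:
    def __init__(self, primary: ModelRoute, fallback: Optional[ModelRoute] = None):
        self.primary = primary
        self.fallback = fallback

    def complete(self, prompt: str) -> str:
        # Try the primary provider; on any failure, fail over if configured.
        try:
            return self.primary.call(prompt)
        except Exception:
            if self.fallback is None:
                raise
            return self.fallback.call(prompt)
```

In DD terms, the question is simply whether something like `ModelGateway` exists between the product and the provider SDK. If every feature imports the provider's client directly, a migration is a rewrite; if calls route through one seam, it is a configuration change plus re-running the eval suite.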

Are model versions pinned? A product running on "latest" will break when the provider ships a model update that changes output format, reasoning patterns, or capability. Pinned versions with controlled upgrades are the mature pattern.

Is there any fine-tuned or custom-trained model? If yes, who owns the trained weights? Where is the training data? Can the model be reproduced from scratch if the training infrastructure changes?

2. Data rights and provenance

What data was used for any fine-tuning or RAG pipeline? This is the highest-risk category. If the company fine-tuned a model on customer data, do the customer agreements permit that use? If the RAG pipeline ingests data from third-party sources, are the licenses clear?

Do customers know their data flows through an AI model? Privacy policies and terms of service should explicitly describe how customer data is used in AI features. Vague language is a liability risk.

Is customer data sent to third-party AI providers? If customer data is sent to OpenAI, Anthropic, or Google for inference, does the data processing agreement (DPA) cover that use? Are there customers in jurisdictions (EU, certain US states) with data residency requirements?

Is there PII in the training data or prompts? Personal information flowing through AI pipelines creates GDPR, CCPA, and sector-specific (HIPAA, FERPA) compliance obligations. The company should be able to demonstrate how PII is handled in the AI data flow.

Can customer data be deleted from the AI pipeline? If a customer requests deletion under GDPR or CCPA, can the company remove their data from the RAG index, any fine-tuning datasets, and any cached responses?
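A credible answer to the deletion question usually looks like a single routine that touches every stage of the pipeline. This is a hypothetical sketch; the store names and the `delete(filter=...)` method are placeholders, though most vector databases expose an equivalent delete-by-metadata call:

```python
# Hypothetical deletion routine covering each stage of an AI pipeline.
# Store interfaces are illustrative placeholders.
def delete_customer_data(customer_id, rag_index, response_cache, finetune_manifest):
    # 1. Vector index: drop every chunk tagged with the customer's ID.
    rag_index.delete(filter={"customer_id": customer_id})

    # 2. Cached responses: evict entries derived from the customer's inputs
    #    (this sketch assumes cache keys are prefixed with the customer ID).
    for key in [k for k in response_cache if k.startswith(customer_id)]:
        del response_cache[key]

    # 3. Fine-tuning data: mark records for exclusion before the next
    #    training run; a retrain is still needed to purge learned weights.
    for record in finetune_manifest:
        if record["customer_id"] == customer_id:
            record["excluded"] = True
```

The comment on step 3 is the part companies most often miss in DD: excluding data from future training runs does not remove it from already-trained weights.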

3. Evaluation and quality assurance

Is there a formal eval suite? A company shipping AI features without an eval suite is shipping blind. I wrote about this in why your eval suite matters more than your prompt. The eval suite should cover accuracy, safety, latency, and cost.

How many eval examples exist? Under 30 is a warning sign. 50–100 is the minimum viable suite. Over 200 suggests maturity.

How are evals run? Manually or automated? On every code change or occasionally? The answer reveals how seriously the company takes AI quality.

Is there regression detection? When the company updates a prompt, changes a model, or modifies the pipeline, do they have automated detection for quality regressions? Without this, model upgrades are a gamble.
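The mechanics of regression detection are simple enough that their absence is telling. A minimal CI-style check, sketched here with illustrative metric names and an assumed per-metric score in the 0–1 range, compares candidate eval scores against a recorded baseline:

```python
# Minimal regression gate: compare candidate eval scores against a
# recorded baseline and report any metric that drops past a tolerance.
# Metric names and the 0-1 score scale are illustrative assumptions.
def check_regression(baseline: dict, candidate: dict, tolerance: float = 0.02) -> list:
    regressions = []
    for metric, base_score in baseline.items():
        new_score = candidate.get(metric, 0.0)
        if new_score < base_score - tolerance:
            regressions.append(f"{metric}: {base_score:.2f} -> {new_score:.2f}")
    return regressions
```

A prompt change, model swap, or pipeline modification ships only if this list comes back empty. Companies with something like this in CI treat AI quality as an engineering property; companies without it are eyeballing outputs.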

Are there production quality metrics? User feedback, error rates, hallucination rates, and user engagement with AI features. These are the real-world evals that complement the offline eval suite.

4. Cost model and unit economics

What is the current AI inference cost per user per month? This number is the AI-specific COGS. It should be a tracked metric, not a guess.
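The calculation itself is back-of-envelope; what matters is whether the company tracks the inputs. A sketch, with placeholder per-token rates — substitute the provider's actual pricing:

```python
# Back-of-envelope AI COGS per user per month from token volumes.
# Per-1k-token prices below are placeholders, not any provider's real rates.
def inference_cost_per_user(input_tokens: int, output_tokens: int,
                            price_in_per_1k: float, price_out_per_1k: float) -> float:
    return (input_tokens / 1000) * price_in_per_1k \
         + (output_tokens / 1000) * price_out_per_1k


# Example: 500k input + 100k output tokens per user per month
# at $0.01 / $0.03 per 1k tokens (placeholder rates) -> $8.00 per user.
cost = inference_cost_per_user(500_000, 100_000, 0.01, 0.03)
```

If the company cannot produce its per-user token volumes, it cannot produce this number, and its margin projections are guesses.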

How does the cost scale with usage? Some AI features have linear cost scaling (cost grows proportionally with users). Others have super-linear scaling (cost grows faster than users, because each user generates more tokens over time as they use more features). The scaling curve matters for margin projections.

Is there a cost optimization strategy? Caching, model routing (using cheaper models for simple queries), prompt optimization, batching. I covered this in the hidden cost curve of LLM features. A company without a cost optimization strategy is a company whose margins will compress.

What happens if model pricing increases 2x? The model providers have changed pricing multiple times. A 2x increase is a realistic scenario. Does the company's business model survive it?

5. Safety and guardrails

Is there prompt injection protection? Can a malicious user manipulate the AI feature to leak system prompts, access other users' data, or perform unintended actions?

Is there output filtering? Does the product filter AI-generated output for harmful content, PII leakage, or off-brand responses?

Is there an audit trail? Are AI inputs and outputs logged for debugging, compliance, and incident response? Can the company trace what the AI said to a specific user at a specific time?
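What I look for is one structured record per inference, with enough fields to answer "what did the model say to user X at time T". A minimal sketch, with illustrative field names; hashing rather than storing the raw prompt is one way to keep PII out of the log itself:

```python
# Minimal AI audit log sketch: one JSON line per inference.
# Field names are illustrative; adapt to your logging stack.
import hashlib
import json
import time


def log_inference(log_file, user_id: str, model: str, prompt: str, output: str):
    record = {
        "ts": time.time(),
        "user_id": user_id,
        "model": model,                 # the pinned model version in use
        # Hash the prompt so the log stays PII-free but remains matchable
        # against a prompt produced during incident replay.
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "output": output,
    }
    log_file.write(json.dumps(record) + "\n")
```

Whether to store raw prompts, hashes, or nothing is itself a data-rights decision (category 2); the DD question is only that the decision was made deliberately and the trail exists.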

What is the incident response plan for an AI failure? If the AI feature produces harmful output, what is the process? Kill switch, escalation path, customer communication. This should be documented.

6. Team and capability

Who on the team understands the AI system deeply? A single AI engineer is a key-person risk. The AI knowledge should be distributed across at least two people.

Is there separation between AI experimentation and production systems? Teams that experiment on production pipelines are teams that will have production incidents from experiments.

Can the team migrate between model providers? This is a capability question, not just an architecture question. Has the team ever switched models? Do they have experience with multiple providers?

How to score it

I score each category on a three-point scale: green (no material risk), yellow (manageable risk with a clear plan), red (material risk without a plan). The overall assessment aggregates across categories.
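The aggregation rule described above can be written down in a few lines. This sketch uses my category names as dictionary keys, which is an arbitrary encoding choice:

```python
# Sketch of the scoring rule: a red in a material category escalates to
# a deal-level finding; otherwise the overall rating is the worst score.
SEVERITY = {"green": 0, "yellow": 1, "red": 2}
MATERIAL = {"model dependency", "data rights", "cost model"}


def overall_rating(scores: dict) -> str:
    # Red in model dependency, data rights, or cost model escalates.
    if any(s == "red" and cat in MATERIAL for cat, s in scores.items()):
        return "material finding"
    # Otherwise the worst individual category score carries.
    return max(scores.values(), key=lambda s: SEVERITY[s])
```

The escalation branch encodes the point in the next paragraph: not all reds are equal, and the three material categories dominate the overall assessment.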

A company with all greens is rare and suggests either maturity or a shallow AI feature that does not touch the hard problems. Most companies score yellow across several categories, which is fine as long as the plans are credible. A red in model dependency, data rights, or cost model is a material finding that should factor into the investment decision.

Counterpoint: early-stage companies get a pass on some of this

For pre-seed and seed-stage companies, the checklist is aspirational rather than required. A company with three months of runway and 100 users does not need a comprehensive eval suite or a multi-provider fallback strategy. What they do need is awareness — the founders should be able to articulate the risks even if they have not yet mitigated them.

Your next step

If you are building an AI-native product and expect to raise or be acquired, start with the data rights category. Audit what data flows through your AI pipeline, confirm your customer agreements cover that use, and document it. Data rights issues are the hardest to fix retroactively and the most likely to derail a deal.

Where I come in

I conduct AI-specific technical due diligence for investors and help startups prepare for it. The assessment typically takes 3–5 days and covers all six categories. Book a call if you are an investor evaluating an AI company or a founder preparing for a raise.


Related reading: Inside a Technical Due Diligence · AI Safety for Startups · The Hidden Cost Curve of LLM Features

Need an AI-specific DD? Book a call.
