Do inference costs count?

Usually not. Inference serving live users is production, not research. Compute spent testing a new serving architecture or comparing quantization approaches during development can count.

Does fine-tuning a foundation model qualify?

It can, if you are testing whether the fine-tune improves performance against a measured baseline. Fine-tuning once and shipping without evaluation is a weaker case.

Does using the OpenAI or Anthropic API count as R&D?

Not by itself. It can support a qualifying claim if your team is running structured experiments on top of it, like testing retrieval strategies or evaluation methods.

Do failed training runs count?

Yes. A training run that does not improve the model still counts, since the credit is about eliminating uncertainty, not achieving a specific result.

Does data labeling count?

Labeling costs can count when the labeled data is built specifically to support a qualifying experiment, such as a new evaluation benchmark. Routine data entry does not count.

The R&D tax credit for AI startups

The short answer

Model training, fine-tuning, and evaluation work usually qualifies for the R&D tax credit. The sticking point is separating genuine experimentation from prompt tweaks against a vendor's API.

What qualifies, and what fights you

AI companies that train or fine-tune their own models are running textbook research and development. Architecture choices, hyperparameter sweeps, and evaluation design all involve real uncertainty about whether an approach will work, which is exactly what the credit rewards.

The harder case is a company built as a thin layer on top of a foundation model API. Calling GPT or Claude with a well-crafted prompt and shipping the result is a product decision, not research, unless your team is running structured experiments to measure and improve model behavior. The line is whether you are measuring and iterating against a hypothesis, or just trying prompts until one feels right.

Retrieval-augmented generation systems sit in between. Building the retrieval pipeline, testing chunking strategies, and tuning ranking algorithms against a real evaluation set qualifies. Wiring a vector database into a chat UI with default settings usually does not.

The four-part test, applied to AI startups

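Permitted purpose (sometimes called qualified purpose) is straightforward for AI companies: the work aims to improve the performance, reliability, or quality of a model or a product built around one. The technological requirement is met because the work relies on computer science and engineering, here machine learning and software engineering.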

Elimination of uncertainty shows up clearly in training runs. Nobody knows in advance whether a new architecture, a different training data mix, or a fine-tuning approach will improve accuracy. The process of experimentation is the training and evaluation loop itself: hypothesize, train, measure against a benchmark, adjust, and repeat.
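The loop described above is easier to defend when each pass is written down as it happens. Here is a minimal sketch of one way to log runs, assuming a hypothetical ExperimentRun record. The field names, benchmark name, scores, and GPU hours are all invented for illustration and are not a required format.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ExperimentRun:
    """One pass through the hypothesize, train, measure loop (hypothetical schema)."""
    run_id: str
    run_date: date
    hypothesis: str        # what the team expected to change, stated before training
    change: str            # the architecture, data mix, or fine-tuning change tested
    benchmark: str         # the evaluation set the run was scored against
    baseline_score: float  # score before the change
    result_score: float    # score after the change
    gpu_hours: float

    @property
    def improved(self) -> bool:
        return self.result_score > self.baseline_score


def summarize(runs):
    """Count every run, including those that failed to beat the baseline."""
    return {
        "runs": len(runs),
        "improved": sum(r.improved for r in runs),
        "did_not_improve": sum(not r.improved for r in runs),
        "gpu_hours": sum(r.gpu_hours for r in runs),
    }


# Illustrative records only; the scores and hours are invented.
runs = [
    ExperimentRun("run-001", date(2024, 3, 4),
                  "A wider feed-forward layer improves held-out accuracy",
                  "ffn_dim 2048 -> 4096", "internal-eval-v1", 0.71, 0.70, 40.0),
    ExperimentRun("run-002", date(2024, 3, 11),
                  "Adding domain documents to the data mix improves accuracy",
                  "20% domain data", "internal-eval-v1", 0.71, 0.76, 55.5),
]

summary = summarize(runs)
print(summary)
```

The run that did not improve the score stays in the log alongside the one that did, since both document the uncertainty being tested.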

New to the test itself? Read what software work qualifies as R&D first.

Work that usually qualifies

Training runs and architecture experiments

Comparing model architectures or training configurations to see which one generalizes better on your data qualifies as core experimentation.

Evaluation harness development

Building a system that scores model outputs against a benchmark or a labeled test set, so you can measure whether a change actually helped, is qualifying technical work.
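As a rough illustration, a harness of this kind can be very small. The sketch below assumes exact-match scoring and uses two toy lookup functions as stand-ins for a baseline and a candidate model. The function names and test examples are hypothetical, and a real harness would call actual models and likely use a richer metric.

```python
from typing import Callable, Dict, Sequence, Tuple

Example = Tuple[str, str]  # (input, expected output)


def exact_match_accuracy(predict: Callable[[str], str],
                         examples: Sequence[Example]) -> float:
    """Share of examples where the prediction matches the label, ignoring case and edge whitespace."""
    if not examples:
        raise ValueError("the labeled test set is empty")
    correct = sum(
        predict(question).strip().lower() == expected.strip().lower()
        for question, expected in examples
    )
    return correct / len(examples)


def compare(baseline: Callable[[str], str],
            candidate: Callable[[str], str],
            examples: Sequence[Example]) -> Dict[str, float]:
    """Score both systems on the same test set and report the difference."""
    base = exact_match_accuracy(baseline, examples)
    cand = exact_match_accuracy(candidate, examples)
    return {"baseline": base, "candidate": cand, "delta": cand - base}


# Toy labeled test set and stand-in "models" (hypothetical, for illustration).
test_set = [
    ("2+2", "4"),
    ("capital of France", "Paris"),
    ("3*3", "9"),
    ("opposite of hot", "cold"),
]
baseline_answers = {"2+2": "4", "capital of France": "paris"}
candidate_answers = {"2+2": "4", "capital of France": "Paris", "3*3": "9"}

report = compare(
    lambda q: baseline_answers.get(q, ""),
    lambda q: candidate_answers.get(q, ""),
    test_set,
)
print(report)
```

Scoring both systems on the same fixed test set is what turns "the new version looks better" into a measured result.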

Fine-tuning experiments with uncertain outcomes

Testing whether fine-tuning on domain-specific data improves accuracy over a base model, and measuring the result, qualifies. Running one fine-tune and shipping it without measurement is a weaker claim.

Retrieval pipeline design

Testing chunking strategies, embedding models, and ranking approaches against a real evaluation set to improve retrieval quality qualifies.
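One common way to score retrieval quality is recall@k: the share of known-relevant documents that appear in the top k results, averaged over queries. The sketch below compares two chunking strategies on a tiny labeled set. The strategy names, query IDs, and document IDs are invented, and recall@k is only one of several reasonable metrics.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant document IDs found in the top-k retrieved IDs."""
    if not relevant:
        raise ValueError("each query needs at least one relevant document")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall_at_k(results: Dict[str, List[str]],
                     labels: Dict[str, Set[str]],
                     k: int) -> float:
    """Average recall@k over every labeled query."""
    return sum(recall_at_k(results[q], labels[q], k) for q in labels) / len(labels)


# Hypothetical evaluation set: which documents are relevant to each query.
labels = {"q1": {"d1", "d2"}, "q2": {"d3"}}

# Hypothetical ranked results produced under two chunking strategies.
results_by_strategy = {
    "fixed_512_tokens": {"q1": ["d1", "d9", "d2"], "q2": ["d7", "d3", "d8"]},
    "split_by_heading": {"q1": ["d1", "d2", "d9"], "q2": ["d3", "d7", "d8"]},
}

scores = {
    name: mean_recall_at_k(results, labels, k=2)
    for name, results in results_by_strategy.items()
}
print(scores)
```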

Inference optimization

Quantizing a model or redesigning a serving stack to cut latency while holding accuracy steady involves genuine engineering trade-offs and testing.

Work that usually does not

Prompt tweaks against a stable API

Rewording a prompt sent to a vendor's model until the output looks better, with no structured measurement, does not meet the process of experimentation requirement.

Wrapping a vendor API in a product feature

Calling a foundation model's API as documented and displaying the result does not involve technological uncertainty, even if the product built around it is new.

Which expenses count

GPU compute is the expense that sets AI companies apart. Cloud compute used for training runs, fine-tuning jobs, and evaluation sweeps counts as a qualifying expense. Inference compute serving live production traffic generally does not, since it is not development work.
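Separating the two kinds of compute is simpler if cloud spend is tagged by workload. The sketch below assumes each billing line item carries a workload tag. The tag names and the production inference figure are hypothetical, and the qualifying amounts are chosen to add up to the training compute figure used in the worked example.

```python
# Assumption: these tag names are illustrative, not a cloud provider's schema.
QUALIFYING_WORKLOADS = {"training", "fine-tuning", "evaluation"}


def split_compute(line_items):
    """Split (tag, cost) line items into qualifying and non-qualifying totals."""
    qualifying = sum(cost for tag, cost in line_items if tag in QUALIFYING_WORKLOADS)
    other = sum(cost for tag, cost in line_items if tag not in QUALIFYING_WORKLOADS)
    return qualifying, other


# Invented annual totals in whole dollars.
line_items = [
    ("training", 150_000),
    ("fine-tuning", 40_000),
    ("evaluation", 30_000),
    ("inference-prod", 95_000),  # live production traffic, generally excluded
]

qualifying, other = split_compute(line_items)
print(f"Qualifying compute: ${qualifying:,}")
print(f"Other compute: ${other:,}")
```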

Wages for ML engineers and researchers count, prorated to time spent on qualifying experimentation. That includes the engineers designing eval harnesses and running training jobs, not just the ones writing model code.

US-based contractors, including specialized ML consultants, count at 65% of what you pay them. Data labeling work can also count if it directly supports a qualifying experiment, such as building a new evaluation set.

A worked example

Hypothetical example. An AI startup has 5 ML engineers earning a blended average of $180,000, spending about 75% of their time on training, fine-tuning, and evaluation work.

Wage QRE (5 × $180,000 × 75%): $675,000
Contractor QRE (65% of $150,000): $97,500
GPU training compute: $220,000
Total QRE: $992,500

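Using a rough estimate of 6 to 10% of total QRE, the federal credit lands between about $59,550 and $99,250. A pre-revenue or early-revenue company with under $5 million in gross receipts can apply up to $500,000 of its credit against payroll taxes each year, so the full amount in this example could be used that way.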
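The arithmetic behind this example can be reproduced in a few lines. The sketch below uses only the figures stated above. The 6 to 10% range is the rough estimate from this example rather than a statutory rate, and the variable names are illustrative.

```python
# Inputs from the hypothetical example. Percentages are whole numbers so the
# arithmetic stays in exact integer dollars.
engineers = 5
avg_salary = 180_000
research_time_pct = 75        # share of time on qualifying work
contractor_spend = 150_000
contractor_pct = 65           # contractors count at 65% of what you pay them
gpu_training_compute = 220_000

wage_qre = engineers * avg_salary * research_time_pct // 100
contractor_qre = contractor_spend * contractor_pct // 100
total_qre = wage_qre + contractor_qre + gpu_training_compute

# Rough estimate range used in this example.
credit_low = total_qre * 6 // 100
credit_high = total_qre * 10 // 100

# Annual payroll tax offset limit mentioned above.
PAYROLL_OFFSET_CAP = 500_000
payroll_offset_high = min(credit_high, PAYROLL_OFFSET_CAP)

print(f"Wage QRE: ${wage_qre:,}")
print(f"Contractor QRE: ${contractor_qre:,}")
print(f"Total QRE: ${total_qre:,}")
print(f"Estimated credit: ${credit_low:,} to ${credit_high:,}")
```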

Common questions
