Phantom Radiologists: The Time Math That Exposes Veterinary AI’s Training-Set Problem — Part One: The Labeling Step

SignalPET claims its AI was trained on “over 2 million annotated veterinary radiographs.” Vetology claims “over 300,000 Board Certified veterinary radiologist-reviewed cases.” Antech RapidRead claims “16 million images.” This is Part One of a three-part investigation into whether those numbers can be reconciled with the documented capacity of the North American board-certified veterinary radiologist workforce. This article focuses on the simplest possible AI training task — image-level categorical labeling, the kind the Stanford CheXNeXt study measured at 34.3 seconds per image in PLOS Medicine — and shows the math does not work for the larger claims even at this most charitable level. Part Two addresses bounding-box localization, pixel-level segmentation, and pathology-confirmed ground truth, each of which adds substantial additional time burden on top of the simplest labeling step. Part Three documents the validation-statistics gap between FDA-cleared human radiology AI and commercial veterinary AI, and the corporate-consolidation revenue model that explains why the gap exists. The conclusion of Part One alone is severe enough: the labor required to produce the larger vendor training-corpus claims at the simplest annotation step exceeds what the documented veterinary specialty workforce could plausibly have produced.

VeterinaryTeleradiology.com Editorial Staff  ·  April 2026  ·  Estimated read: 24 minutes  ·  Series: The Math Problem, Part 1 of 3

What This Article Covers — and What It Doesn’t

Before any math, the reader needs to understand what AI training-data labeling actually is, because the term is used loosely in vendor marketing and because the labor implications differ enormously depending on which kind of labeling work is being described.

Image labeling for medical AI is not a single task. It is the umbrella term for at least four distinct annotation activities, each with progressively greater per-image time requirements, and each required for different AI capabilities. This article addresses only the simplest of those activities: image-level categorical classification. Even at the simplest level, calculated against the documented workforce of North American board-certified veterinary radiologists, the math does not reconcile with the larger commercial vendor training-corpus claims. The remaining three annotation activities — bounding-box localization, pixel-level segmentation, and ground-truth correlation against pathology — add substantial additional time burden on top of the figures presented here. They are the subject of Part Two. The validation-statistics evidence base that compares FDA-cleared human radiology AI to commercial veterinary AI, and the corporate-consolidation revenue model that explains why the validation gap exists, are the subject of Part Three.

The four annotation tasks in plain language:

Image-level classification. The radiologist looks at the image and marks which of N pre-defined categories are present. Output is a row of yes/no flags per image. No localization. No measurement. No shape characterization. Pre-defined categories from a fixed list. This is what the Stanford CheXNeXt study measured at 34.3 seconds per image average. It is the foundational first step in supervised AI training, but it is not sufficient on its own to produce AI that shows the user where findings are located, performs measurements, or characterizes disease shapes — capabilities that require the additional annotation tasks below.

Bounding-box annotation. The radiologist draws a rectangle around each abnormality on the image and labels it with the relevant category. The task requires identifying lesion edges, clicking and dragging the rectangle, applying the correct label, verifying the label, and proceeding to the next finding. A single image with three findings requires three boxes. Per-image bounding-box annotation rates documented in peer-reviewed AI training literature are several minutes per image, an order of magnitude longer than image-level classification.

Pixel-level segmentation. The radiologist outlines the lesion boundary at the pixel level, producing a precise shape mask used for measurement, volumetric analysis, and disease shape characterization. Required for any AI that produces measurements such as vertebral heart score, lung field volume, or mass dimensions. Per-image segmentation times for complex cases run several minutes to tens of minutes.

Ground-truth correlation against pathology. The radiologist’s interpretation is checked against an independent reference standard — tissue pathology, surgical confirmation, autopsy, or long-term clinical outcome — to determine whether the labeling is actually correct. This is a separate workflow from the labeling itself, and most veterinary AI training corpora skip it entirely, relying instead on radiologist consensus as the reference standard. The methodological consequences of this shortcut are documented in this publication’s companion analysis of the engineering rigor gap.

This article — Part One of “The Math Problem” — calculates the labor required for Step One only. Part Two calculates the additional labor required for Steps Two through Four, applying primary-source bounding-box and segmentation rates from peer-reviewed AI training literature to the same vendor claims. Part Three closes the series by examining what happens after training is supposedly complete: the validation-statistics evidence base for commercial veterinary AI and the corporate-consolidation revenue model that produces the marketing claims this math addresses. The reader should understand that the math presented here is therefore conservative by an order of magnitude or more relative to the actual labor budget required to produce a credible commercial veterinary AI radiology product. Even so, even at this conservative starting point, the math for the larger claims does not work.

The Math Problem · A Three-Part Investigation

Part 1 (this article): The Labeling Step — Image-Level Classification at Stanford CheXNeXt’s Documented 34.3 Seconds Per Image. The simplest annotation task, the most charitable possible math.

Part 2: The Annotation Steps That Actually Build the Product — Bounding-Box Localization, Pixel Segmentation, and Pathology Correlation. The math the labeling step doesn’t even begin to address, plus three structural infrastructure questions: no veterinary subspecialty fellowship pathway, no pathology-confirmed dataset at scale, and breed-specific anatomic variation.

Part 3: Validation Statistics and Revenue Model — What FDA-cleared human radiology AI is required to demonstrate, what commercial veterinary AI actually demonstrates, and the Mars-Antech-VCA-BluePearl corporate consolidation that explains why the validation gap exists. The vendor/provider separation that aligns incentives on the human side, and the conflict of interest and anticompetitive tying that replace it on the veterinary side.

The Argument in Numbers (Step One Only)

Three U.S. veterinary AI radiology vendors have published, in their own marketing materials, the size of the training corpora used to develop their commercial products. The numbers are large. They are also, in their unqualified form, presented as evidence of clinical reliability — the implicit argument being that an algorithm trained on millions of images annotated by veterinary specialists is well-validated by virtue of the scale alone.

Calculate against documented inputs and the implicit argument falls apart. Even at the simplest possible annotation step.

The Stanford Machine Learning Group’s CheXNeXt paper, published in PLOS Medicine in November 2018 by Pranav Rajpurkar in Andrew Ng’s laboratory, is the most widely cited foundational benchmark in radiology artificial intelligence research. It is also one of the rare published papers that directly documents how long it takes a board-certified radiologist to label diagnostic images for AI training at the simplest annotation step — image-level categorical classification. The paper reports, verbatim: “The average time for radiologists to complete labeling of 420 chest radiographs was 240 minutes (range 180–300 minutes).” That is 34.3 seconds per image at the average, with a documented range of 25.7 to 42.9 seconds per image. The radiologists were marking 14 thoracic pathology categories per chest radiograph as present or absent, with no localization, no bounding boxes, and no measurements required. It is a categorical labeling rate, not a clinical reading rate, and it is for the simplest annotation task in the AI training pipeline.

The American College of Veterinary Radiology — the AVMA-recognized specialty organization for veterinary diagnostic imaging — currently reports “over 800 accredited veterinary radiologists and radiation oncologists.” That figure is inclusive of radiation oncologists, who do not annotate diagnostic radiographs as part of their specialty practice. The most recent published Diplomate breakdown (2019) showed 573 in pure Radiology, 18 dual-boarded, and 95 in Radiation Oncology. Industry sources writing in late 2025 estimate “fewer than 1,000 board-certified veterinary radiologists practice in the United States, a shortage that continues to challenge the profession” — serving a clinical market of more than 80,000 veterinary clinics nationwide. ACVR Executive Director Dr. Tod Drost was quoted in JAVMA News in 2018 observing that with only 43 new diplomates produced annually against approximately 70 open positions per year, “the math doesn’t work out that well.” That was the workforce mismatch in 2018, before the corporate consolidation and AI-vendor expansion of the subsequent seven years compounded it.

Now multiply, for the simplest annotation step alone.

Sixteen million images, at the Stanford CheXNeXt average of 34.3 seconds per image for image-level categorical classification, would require approximately 152,444 hours of dedicated radiologist labeling time. Divided by a 2,080-hour annual full-time work schedule, that is 73.3 radiologist-years. To complete the work in a five-year window, 14.7 board-certified veterinary radiologists would need to work full-time on labeling — no clinical practice, no teaching, no research, no other professional activity. To complete it in two years, 36.7 radiologists would be required full-time. And the applicable human-medicine standard for training data is more demanding: published human-side AI datasets (VinDr-CXR in Nature Scientific Data 2022, the Stanford CheXpert reference set, the FDA-cleared chest x-ray AI training set published in Nature Scientific Reports in October 2024) each relied on multiple independent radiologist annotations per image for their specialist-labeled reference data, with documented adjudication procedures for disagreement. At three independent labels per training image — the practical minimum the human-side AI training literature considers methodologically credible — the 16-million-image figure expands to approximately 220 radiologist-years of dedicated specialist labeling work for the categorical classification step alone.
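
The Step One arithmetic can be checked from the two cited inputs alone. Below is a minimal sketch in Python, assuming only the 34.3-second CheXNeXt rate, the 2,080-hour work year, and the corpus size and label multipliers discussed above; it reproduces this article's arithmetic and is not a model of any vendor's actual workflow.

```python
# Step One labor math, reproduced from the figures cited in this article.
SECONDS_PER_IMAGE = 34.3      # CheXNeXt average for image-level classification
HOURS_PER_FTE_YEAR = 2_080    # 40 hours/week x 52 weeks

def labeling_burden(images, labels_per_image=1, sec_per_image=SECONDS_PER_IMAGE):
    """Return (total hours, full-time radiologist-years) for a labeling corpus."""
    hours = images * labels_per_image * sec_per_image / 3600
    return hours, hours / HOURS_PER_FTE_YEAR

hours, years = labeling_burden(16_000_000)
print(f"single pass: {hours:,.0f} hours = {years:.1f} radiologist-years")
print(f"  finished in 5 years: {years / 5:.1f} radiologists working full-time")
print(f"  finished in 2 years: {years / 2:.1f} radiologists working full-time")

_, years_triple = labeling_burden(16_000_000, labels_per_image=3)
print(f"three independent labels per image: {years_triple:.0f} radiologist-years")
# Expected output: ~152,444 hours, 73.3 years, 14.7 and 36.7 radiologists, ~220 years
```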

The North American diagnostic imaging specialty, which the ACVR’s own leadership has described as facing a workforce crisis, comprises approximately 600 to 700 active Diplomates. Subtracting the Diplomates in academic appointments, the Diplomates working full-time at multispecialty hospitals, the Diplomates in private radiology practice, the Diplomates running their own teleradiology services, and the Diplomates approaching retirement, the population realistically available to be hired full-time for multi-year AI labeling assignments is small. And the cost of removing each one from clinical practice, at a conservative $125-per-study average billing rate (toward the lower end of the $85 to $250 range that prevails for U.S. veterinary teleradiology consultation) and a working-day output of approximately 30 studies per Diplomate per day, runs to roughly $937,500 per Diplomate per year in foregone clinical revenue at routine rates, and well above $1 million per Diplomate per year when stat reads, MRI and CT consultations, and after-hours coverage are included in the mix.

And this is the math for Step One only. It does not include any of the additional annotation work — bounding-box localization, pixel-level segmentation, ground-truth correlation — that actual commercial veterinary AI radiology products require to produce the capabilities they are marketed as having. Each of those additional annotation tasks adds time per image. The full labor budget, calculated through Part Two, is several multiples larger than what is presented here. Part Three then documents the validation-statistics evidence base that this labor was supposedly used to produce, and the corporate revenue model under which the larger vendor claims continue to be marketed without independent verification.

The Math Doesn’t Work — Even at the Simplest Annotation Step

At Stanford CheXNeXt’s documented rate of 34.3 seconds per radiologist-labeled image — for the simplest possible task of image-level categorical classification — Antech’s 16-million-image training claim requires 73.3 radiologist-years of full-time labeling work. At the human-side AI standard of three independent labels per training image, it requires 220 radiologist-years.

The North American board-certified veterinary radiologist population, which the ACVR has publicly described as a workforce in crisis, comprises approximately 600 to 700 active diagnostic imaging Diplomates — virtually all of whom are clinically practicing. The labor pool to produce the claimed training corpus the way the marketing implies does not exist, even at the simplest annotation step. Bounding-box, segmentation, and pathology correlation work — which is what actual commercial AI products require — multiplies these figures further. The reconciliation paths (NLP auto-labeling, non-specialist labeling, AI-generated pseudo-labels, or inflated numbers) are each defensible practices when disclosed. None of them has been disclosed.

The Inputs: Where Every Number Comes From

An argument from the math is only as defensible as the inputs to the math. This article relies on five inputs, each of which is sourced to a primary publication or institutional source that any reader can verify independently.

Per-image categorical labeling rate: 34.3 seconds per image (range 25.7 to 42.9 seconds). Source: Rajpurkar P, Irvin J, Ball RL, et al. Deep learning for chest radiograph diagnosis: A retrospective comparison of the CheXNeXt algorithm to practicing radiologists. PLOS Medicine. 2018;15(11):e1002686. The paper documents that the radiologists in the CheXNeXt validation study labeled 420 chest radiographs in an average of 240 minutes, with a range of 180 to 300 minutes across participants. This calculates to 34.3 seconds per image at the average, 25.7 seconds at the fast end of the documented range, and 42.9 seconds at the slow end. The figure is for image-level categorical classification of 14 thoracic pathologies — applying yes/no flags to pre-defined categories — on chest radiographs in a structured workflow, by U.S. board-certified radiologists. No bounding boxes were drawn. No segmentation was performed. No measurements were recorded. No pathology correlation was applied. The 34.3-second figure is therefore the floor on radiologist labeling effort for AI training, the simplest possible task. Any defensible criticism of this article’s math at the step-one level would have to argue that veterinary specialists perform image-level categorical classification faster than U.S. board-certified human radiologists, which is implausible on its face given the broader scope of veterinary specialty training and case complexity.

Annual radiologist productive labor budget: 2,080 hours. Source: standard U.S. full-time work schedule, defined as 40 hours per week for 52 weeks. This is the textbook calculation used in labor economics and in workforce capacity studies across U.S. healthcare. A radiologist actually working a 2,080-hour year on labeling alone, with no clinical practice, no teaching, no research, no continuing education, no vacation, and no sick days, represents an unrealistic productivity ceiling. Real-world productive labeling capacity, accounting for breaks, fatigue, quality control review, and inter-annotator discussion, is closer to 4–5 productive hours per workday, or approximately 1,000 to 1,250 effective annotation hours per year. Using the 2,080-hour figure errs on the side that helps the vendors — that is, it produces the highest possible images-per-radiologist-per-year figure and the lowest possible number of radiologists required to produce the claimed training corpus. Even at this aggressive productivity assumption, the math does not work for the larger vendor claims at the simplest annotation step.
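
The productivity-ceiling point is easy to quantify. A short sketch, assuming the CheXNeXt average rate and the effective-hours range described above, shows how many images a single radiologist could label per year under each assumption:

```python
# Images labeled per radiologist-year under different annual-hours assumptions,
# at the CheXNeXt average of 34.3 seconds per image (figures from the text above).
SEC_PER_IMAGE = 34.3

scenarios = [
    ("2,080-hour ceiling (used in this article)", 2080),
    ("1,250 effective hours (realistic upper bound)", 1250),
    ("1,000 effective hours (realistic lower bound)", 1000),
]
for label, annual_hours in scenarios:
    images_per_year = annual_hours * 3600 / SEC_PER_IMAGE
    print(f"{label}: ~{images_per_year:,.0f} images per radiologist per year")
# ~218,000 at the ceiling, ~131,000 and ~105,000 at realistic capacity
```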

Active diagnostic imaging Diplomate population in North America: 600 to 700. Source: American College of Veterinary Radiology (acvr.org), which currently states it consists of “over 800 accredited veterinary radiologists and radiation oncologists.” The 2019 official ACVR membership breakdown showed 573 Diplomates in Radiology and 18 dual-boarded, for 591 active diagnostic imaging Diplomates at that time. Growth in the years since (43 new diplomates per year, per ACVR Executive Director Dr. Tod Drost in JAVMA News, 2018) places the current diagnostic imaging Diplomate population in the approximate range of 600 to 700. Industry sources writing in 2025 estimate “fewer than 1,000 board-certified veterinary radiologists practice in the United States” inclusive of all radiology subspecialties. The figure used in this article is conservative; a smaller specialist population would only sharpen the math.

Documented specialist shortage relative to clinical demand. Source: JAVMA News, October 2018, “Specialists in Short Supply.” The article documents Dr. Drost’s analysis that “70 jobs, 43 new people coming in—the math, you can see, doesn’t work out that well.” Subsequent industry reporting (Sage Veterinary Imaging, September 2025; the VET Recruiter, 2024) has continued to characterize the veterinary radiologist workforce as inadequate to meet existing clinical demand, with fewer than 1,000 specialists serving more than 80,000 U.S. clinics. The ACVR’s own documentation of the shortage is the basis on which the economic-rationality argument in this article rests: the same workforce the AI vendors implicitly invoke when claiming “board-certified radiologist-reviewed” training images is the same workforce the specialty college has documented to be insufficient for clinical practice alone, much less for clinical practice plus large-scale AI training labeling on the side.

Three vendor training-corpus claims:

SignalPET: “over 2 million annotated veterinary radiographs,” described in company materials as the largest such corpus in the world.

Vetology: “over 300,000 Board Certified veterinary radiologist-reviewed cases,” with the company also stating its product utilizes “38 different deep-learning architectures.”

Antech RapidRead: “16 million images sourced from an unprecedented library of more than 8 billion images.” Company materials further state that “our team of board-certified radiologists are continually training and measuring the accuracy of the model.”

Each of these figures is taken directly from the vendor’s own publicly accessible marketing materials as accessed at the time of this article’s preparation. No paraphrasing or interpretation has been applied to the numbers themselves. The interpretive question this article addresses is whether the figures, taken at the values the vendors have published, are reconcilable with the documented per-image labeling rates and the documented specialty workforce capacity at the simplest annotation step. Part Two addresses whether they are reconcilable at the more demanding annotation steps that actual commercial product capabilities require. Part Three addresses the validation-statistics evidence and corporate revenue model under which the marketing claims continue to be made.

How the CheXNeXt Study Actually Did the Labeling

To anchor the math properly, it helps to walk through what the CheXNeXt radiologists actually did when they produced the 240-minute average for 420 chest radiographs. Understanding the simplicity of the task makes clear why the rate cannot reasonably be applied to anything more demanding — and why the rate represents the floor on radiologist labeling effort, not a ceiling.

The CheXNeXt validation set consisted of 420 frontal-view chest radiographs selected from the publicly released NIH ChestX-ray14 dataset. The set was curated to contain at least 50 cases of each of the 14 thoracic pathologies the AI system was designed to identify. The pathology categories were: atelectasis, cardiomegaly, consolidation, edema, effusion, emphysema, fibrosis, hernia, infiltration, mass, nodule, pleural thickening, pneumonia, and pneumothorax. The categories were pre-defined by the study designers. The radiologists did not have to come up with the categories; they had to apply them.

For each radiograph, each radiologist marked which of the 14 categories were present. The output was a row of 14 binary flags per image — fourteen yes-or-no decisions, applied to a single chest radiograph, by a board-certified radiologist with the categorical labels visible on screen. No localization. No bounding boxes. No measurements of cardiac silhouette, lung field volume, or rib counts. No characterization of mass shape or border. No assessment of left-versus-right laterality of effusion. No correlation against the patient’s history, prior imaging, or pathology. Just fourteen yes-or-no flags, applied per image.

Six board-certified radiologists from three academic institutions, with an average of 12 years of experience, each labeled the entire validation set. The average time across all six radiologists for the 420-image set was 240 minutes, with a range across radiologists of 180 to 300 minutes. The arithmetic: 240 minutes × 60 seconds = 14,400 seconds, divided by 420 images = 34.3 seconds per image. The fast end of the range: 180 minutes ÷ 420 images = 25.7 seconds per image. The slow end: 300 minutes ÷ 420 images = 42.9 seconds per image.
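
The per-image rates follow from a single division; as a quick check of the arithmetic above:

```python
# CheXNeXt labeling rate, derived from the paper's reported times for 420 images.
images = 420
for minutes, label in [(180, "fast end"), (240, "average"), (300, "slow end")]:
    print(f"{label}: {minutes * 60 / images:.1f} seconds per image")
# fast end: 25.7, average: 34.3, slow end: 42.9
```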

The work was performed in a structured web-based annotation platform with the categorical checklist visible on screen, the image loaded for review, and no requirement to write narrative, provide differential diagnoses, or interact with the patient’s clinical context. By every methodological consideration, this is the simplest possible AI training task. It is the task that produces the fastest defensible per-image labeling rate in the published radiology AI literature. And it is what the math in this article uses, because it is the rate that gives the most charitable possible benefit-of-the-doubt to vendor claims.

Why Veterinary Labeling Is Almost Certainly Slower Than CheXNeXt

A board-certified veterinary radiologist labeling training data for a commercial veterinary AI product is performing a task that is, in every dimension, more complex than what the CheXNeXt radiologists performed. Veterinary AI products cover multiple species (canine, feline, equine, exotic), multiple anatomic regions (thoracic, abdominal, orthopedic, dental, and others), multiple modalities, and substantially more pathology categories than 14. SignalPET’s product describes covering “all body systems and conditions.” Vetology’s product describes screening for over 90 different conditions. Antech RapidRead Dental, launched May 2025, describes “tooth-by-tooth analysis” that requires per-tooth attention across all four quadrants. None of these tasks is comparable in complexity to applying 14 yes/no flags to a chest radiograph in a structured platform.

In other words: the 34.3-second rate this article uses is conservatively chosen to be as fast as possible. The actual veterinary equivalent of even Step One — image-level categorical classification at the breadth of pathology categories these products claim to cover — is realistically slower than the CheXNeXt rate, not faster. Using the CheXNeXt rate produces the lowest possible labor figure, and even at that lowest figure, the math does not reconcile for the larger vendor claims. The actual labor figure, calculated at a defensible veterinary-specific labeling rate, would be larger still.

The Range Sensitivity: Even the Fast End Doesn’t Save the Larger Claims

Anticipating the vendor-defense response — that the Stanford CheXNeXt average rate of 34.3 seconds per image is too slow for what veterinary AI labeling actually requires — this article calculates the math at all three points of the documented Stanford range. At the fastest published radiologist labeling rate (25.7 seconds per image, the floor of the CheXNeXt range), the smallest of the three vendor training claims becomes plausible at single-radiologist labeling, the middle claim becomes borderline, and the largest claim still cannot be reconciled with the available specialist workforce.

The complete sensitivity analysis, calculated as radiologist-years of full-time labeling required to produce each vendor’s claimed corpus at each point of the Stanford labeling-rate range, is presented in the following table. The “FTE rad-years (1 rad)” column represents one full-time radiologist working 2,080 hours per year on nothing but labeling; the “Distributed years (10 rads)” column represents the same calculation distributed across ten radiologists working full-time in parallel; and the “Hours each, if every DI Diplomate (700)” column shows the labeling hours each Diplomate would have to contribute if every active diagnostic imaging Diplomate in North America took part in the task.

Important: this table calculates Step One only — image-level categorical classification at the Stanford CheXNeXt rate. It does not include bounding-box, segmentation, or pathology correlation work. Part Two calculates those.

| Vendor / Claim | Rate per image | Total rad-hours required | FTE rad-years (1 rad) | Distributed years (10 rads) | Hours each, if every DI Diplomate (700) |
|---|---|---|---|---|---|
| Vetology (300,000 cases) | 25.7 s (fast) | 2,142 hrs | 1.03 yrs | 0.10 yrs | 3.1 hrs each |
| Vetology (300,000 cases) | 34.3 s (avg) | 2,858 hrs | 1.37 yrs | 0.14 yrs | 4.1 hrs each |
| Vetology (300,000 cases) | 42.9 s (slow) | 3,575 hrs | 1.72 yrs | 0.17 yrs | 5.1 hrs each |
| SignalPET (2,000,000 images) | 25.7 s (fast) | 14,278 hrs | 6.86 yrs | 0.69 yrs | 20.4 hrs each |
| SignalPET (2,000,000 images) | 34.3 s (avg) | 19,056 hrs | 9.16 yrs | 0.92 yrs | 27.2 hrs each |
| SignalPET (2,000,000 images) | 42.9 s (slow) | 23,833 hrs | 11.46 yrs | 1.15 yrs | 34.0 hrs each |
| Antech RapidRead (16,000,000 images) | 25.7 s (fast) | 114,222 hrs | 54.9 yrs | 5.49 yrs | 163 hrs each |
| Antech RapidRead (16,000,000 images) | 34.3 s (avg) | 152,444 hrs | 73.3 yrs | 7.33 yrs | 218 hrs each |
| Antech RapidRead (16,000,000 images) | 42.9 s (slow) | 190,667 hrs | 91.7 yrs | 9.17 yrs | 272 hrs each |

The table makes the asymmetry between the three claims unmistakable, even at this most charitable annotation step. Vetology’s 300,000-case figure, even at the slow end of the Stanford range, requires only 1.72 radiologist-years of single-pass labeling — feasible for a small team working on the problem over a multi-year period, though still requiring documentation that has not been published. SignalPET’s 2-million-image figure requires up to 11.46 radiologist-years of single-pass labeling — implausible for the small dedicated team Vetology’s number suggests, but conceivable if multiple Diplomates contributed labeling work part-time over a multi-year period under a documented quality control protocol. Antech’s 16-million-image figure requires up to 91.7 radiologist-years of single-pass labeling — a labor expenditure that exceeds the lifetime career output of any individual Diplomate by a factor of roughly three, and that, even at the fastest end of the Stanford range, would require every diagnostic imaging Diplomate in North America to set aside roughly four full-time weeks of dedicated labeling work (163 hours each) to complete.
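
Every cell in the table can be regenerated from three inputs: the claimed corpus size, the Stanford rate, and the 2,080-hour work year. A minimal sketch, reproducing the table's figures (rounding may differ slightly from the table's own conventions):

```python
# Regenerate the Step One sensitivity table from the cited inputs.
HOURS_PER_FTE_YEAR = 2_080
DI_DIPLOMATES = 700

claims = {"Vetology": 300_000, "SignalPET": 2_000_000, "Antech RapidRead": 16_000_000}
rates = {"fast": 25.7, "avg": 34.3, "slow": 42.9}   # seconds per image (CheXNeXt range)

for vendor, images in claims.items():
    print(vendor)
    for label, sec_per_image in rates.items():
        hours = images * sec_per_image / 3600
        fte_years = hours / HOURS_PER_FTE_YEAR
        print(f"  {label:<4} {hours:>11,.0f} h | {fte_years:6.2f} FTE-yrs | "
              f"{fte_years / 10:5.2f} yrs with 10 rads | "
              f"{hours / DI_DIPLOMATES:6.1f} h per Diplomate (700)")
```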

And these are the single-pass numbers, for the simplest annotation step. They assume each image was labeled by one specialist, with no second reader, no adjudication, no cross-validation. That is below the methodological floor that any peer-reviewed human-side AI training-set publication considers acceptable, and substantially below the standard FDA-cleared radiology AI products are required to meet.

The Three-Radiologist Standard: What Human-Side AI Actually Does

In peer-reviewed human-side radiology AI publications, the standard for training-data labeling is multiple independent radiologist annotations per image, with documented adjudication procedures for disagreement. This is not a luxury; it is the recognized methodological floor for credible AI training-data generation in medical imaging.

The Vietnamese VinDr-CXR dataset, published in Nature’s Scientific Data in 2022, established the contemporary public-dataset standard: “Each scan in the training set was independently labeled by 3 radiologists, while each scan in the test set was labeled by the consensus of 5 radiologists.” A team of 17 experienced radiologists labeled 18,000 chest radiographs to that standard, with the methodology, the labeling platform (VinDr Lab), and the adjudication procedures all documented in the published paper.

The Stanford CheXpert dataset (224,316 images) used a different approach — automated rule-based label extraction from existing radiology reports — but explicitly disclosed the methodology and open-sourced the labeling tool (“the CheXpert labeler”) so that other research groups could inspect and validate it. The test set of 500 studies was labeled by eight board-certified radiologists individually, with the majority vote of five serving as ground truth and the remaining three benchmarking radiologist performance.

The Spanish PadChest dataset (160,000 images, published in Medical Image Analysis in 2020) explicitly disclosed that “27% were manually annotated by trained physicians and the remaining set was labeled using a supervised method based on a recurrent neural network with attention mechanisms.” The hybrid approach — partial human labeling, partial automated labeling — was documented with the percentages broken out, the labeling methodology described, and the resulting label noise characteristics analyzed in the publication.

Apply the three-radiologist standard to the vendor claims, and the radiologist-years required for the larger claims expand correspondingly. The 16-million-image Antech RapidRead claim, at three independent specialist labels per image and the Stanford CheXNeXt average rate of 34.3 seconds per image per pass, requires approximately 220 radiologist-years of dedicated full-time labeling work for image-level classification alone. The cumulative career output of a board-certified veterinary radiologist working a 30-year clinical career at full-time is, by definition, 30 radiologist-years. Producing 220 radiologist-years of dedicated labeling work would therefore require taking approximately 7.3 Diplomates out of clinical practice for their entire careers — or an equivalent distribution across more Diplomates working part-time on the project. The veterinary radiologist workforce, the ACVR has documented, is not large enough to support either configuration.
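
Expressed against a 30-year clinical career, the three-label figure works out as follows (a one-line check of the arithmetic in the paragraph above):

```python
# Three independent labels per image, 16 million images, CheXNeXt average rate.
rad_years = 16_000_000 * 3 * 34.3 / 3600 / 2080      # ~220 radiologist-years
print(f"{rad_years:.0f} radiologist-years = {rad_years / 30:.1f} full 30-year careers")
```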

The three-radiologist standard is not the only one available. The 2024 Nature Scientific Reports paper documenting an FDA-cleared chest x-ray AI system reports that 17 board-certified radiologists with a median 14 years of experience manually annotated a development dataset of 341,355 chest x-ray cases. The result: 6,202,776 individual labels generated across the 17 radiologists. That is what FDA-cleared training looked like for one chest x-ray product: 17 specialists, 341,355 cases, 6.2 million labels. The Antech RapidRead claim of 16 million training images is approximately 47 times larger than this benchmark — in a specialty whose total active diagnostic imaging Diplomate population is a small fraction of the size of the U.S. human radiologist workforce.
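
The scale comparison is itself a short calculation from the figures reported in the Scientific Reports paper and the Antech claim; the labels-per-case ratio below is derived here, not quoted from the paper.

```python
# FDA-cleared benchmark (17 radiologists, 341,355 cases, 6,202,776 labels)
# versus the Antech RapidRead 16-million-image claim.
fda_cases, fda_labels = 341_355, 6_202_776
antech_images = 16_000_000
print(f"labels per case in the FDA-cleared development set: {fda_labels / fda_cases:.1f}")
print(f"Antech claim relative to that development set: {antech_images / fda_cases:.0f}x")
# -> about 18 labels per case; the claim is ~47x the benchmark corpus
```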

The Specialist Shortage: Why the Time Math Compounds With the Economics

The argument so far has assumed, charitably, that Diplomates were available to do the labeling work in the volumes the math implies. The labor-economics reality is more constraining. The ACVR has publicly documented, repeatedly and over a period of years, that the veterinary radiology specialty workforce is inadequate to meet existing clinical demand, much less to provide a parallel labor pool for AI training.

The earliest formal acknowledgment of the shortage in the trade press appeared in JAVMA News in October 2018, in the article “Specialists in Short Supply.” ACVR Executive Director Dr. Tod Drost, quoted in that article, framed the workforce mismatch in the bluntest possible terms: “Last year, there were 43 new diplomates of the ACVR. So, 70 jobs, 43 new people coming in—the math, you can see, doesn’t work out that well.” That was the workforce situation in 2018, before the corporate consolidation accelerated the rate at which Diplomates moved from independent and academic practice into corporate teleradiology and AI ventures, before the post-pandemic surge in pet ownership increased clinical demand, and before three veterinary AI vendors published training-corpus claims of 300,000, 2 million, and 16 million images respectively.

By late 2025, the shortage was being characterized in starker terms. Industry reporting from Sage Veterinary Imaging, published in September 2025, framed the specialty workforce capacity as follows: “fewer than 1,000 board-certified veterinary radiologists practice in the United States, a shortage that continues to challenge the profession” — serving “more than 80,000 clinics nationwide.” Dr. Jimmy Barr, BluePearl Veterinary Partners’ chief medical officer, was quoted in the same 2018 JAVMA News article observing that “the demand for specialists has outstripped supply” and predicting the gap would continue widening. By 2025, that prediction had been confirmed by every available data source.

The economic-rationality consequence of the documented shortage is severe. A board-certified veterinary radiologist’s clinical billing rate, in the U.S. veterinary teleradiology market, runs approximately $85 to $250 per radiograph study read — with the lower end of that range applying to routine standard-turnaround consultations and the upper end applying to stat reads, complex cases, MRI and CT consultations, and after-hours coverage. At a conservative average of $125 per study (toward the lower end of that range) and a working-day output of 30 studies per Diplomate per day, that is $3,750 per Diplomate per day in clinical revenue, or approximately $937,500 per Diplomate per year at 250 working days. With a more typical mix that includes stat coverage and complex cases at higher per-study rates, annual per-Diplomate clinical revenue commonly runs to $1.2 million to $1.8 million. To take a Diplomate offline from clinical practice for one full year of dedicated AI labeling work would therefore require compensating that Diplomate for approximately $937,500 to $1.8 million in foregone clinical revenue, in addition to whatever salary or contract fee the AI vendor was paying for the labeling work itself.
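
The per-Diplomate opportunity-cost figure is straightforward to reproduce; a minimal sketch using the routine-rate assumptions stated above:

```python
# Foregone clinical revenue for one Diplomate-year at routine teleradiology rates.
rate_per_study = 125      # USD, conservative routine consultation rate
studies_per_day = 30
working_days_per_year = 250

per_day = rate_per_study * studies_per_day
per_year = per_day * working_days_per_year
print(f"${per_day:,} per day  ->  ${per_year:,} per Diplomate per year")
# -> $3,750 per day, $937,500 per year, before stat/MRI/CT/after-hours premiums
```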

Apply this to the vendor claims at the simplest annotation step. To produce the Antech RapidRead 16-million-image training corpus at the Stanford CheXNeXt average rate, single-radiologist labeling pass for image-level classification only, the labor would require 73.3 radiologist-years of dedicated specialist work. At a conservative $937,500 per radiologist-year in opportunity-cost compensation alone (the routine-rate figure derived above), the labor expenditure runs to approximately $69 million for just the categorical labeling step. At the three-radiologist standard that human-side AI publications treat as the methodological minimum, the figure runs to approximately $206 million. Using the more realistic mid-range opportunity cost of $1.2 million per Diplomate-year, the three-annotator standard figure rises to approximately $264 million. Mars Petcare, Antech’s parent company, is a privately held conglomerate that does not file public earnings reports for the Antech subsidiary, so direct verification of these labor expenditures against documented company financials is not possible. But neither has Antech disclosed labor expenditures of this magnitude in any marketing material, press release, or industry communication this publication has reviewed. Part Three of this investigation examines the corporate-consolidation revenue model under which these undisclosed expenditures would have to fit, and the structural reasons such a fit is implausible.
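
Those corpus-level figures are products of the radiologist-years calculated earlier and the per-year opportunity cost; a quick check:

```python
# Opportunity-cost totals for the 16-million-image claim, Step One only.
single_pass_years = 73.3        # one label per image
triple_pass_years = 220.0       # three independent labels per image

print(f"single pass at $937,500/yr:  ${single_pass_years * 937_500 / 1e6:5.0f} million")
print(f"triple pass at $937,500/yr:  ${triple_pass_years * 937_500 / 1e6:5.0f} million")
print(f"triple pass at $1.2M/yr:     ${triple_pass_years * 1_200_000 / 1e6:5.0f} million")
# -> roughly $69M, $206M, and $264M respectively
```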

And again, this is the cost for image-level categorical classification alone. The bounding-box, segmentation, and pathology correlation work that actual commercial AI products require multiplies these figures further. Part Two calculates those.

The Reconciliation Problem: How Could the Numbers Possibly Be True?

The math forces a question. If 16 million radiographs were not — and could not have been — manually labeled by board-certified veterinary radiologists in the way the marketing implies, even at the simplest categorical level, then how were the labels generated? The technical literature on medical AI training-data preparation provides a finite list of possibilities, each of which is methodologically defensible when disclosed and each of which represents a different claim from what the marketing language suggests.

Possibility One: NLP Extraction from Existing Reports

The first and most common method for generating large-scale medical AI training labels is automated natural language processing of the radiology reports that already exist in the vendor’s database. Every veterinary teleradiology service generates radiologist-authored interpretive reports as a matter of routine clinical practice. Those reports, in aggregate, contain the diagnostic vocabulary that — extracted, normalized, and mapped to imaging metadata — can produce machine-readable training labels at virtually unlimited scale without any fresh radiologist labeling effort.

This is exactly how the largest publicly available human chest x-ray AI training datasets were built. The NIH ChestX-ray14 dataset (112,120 images) used NLP to extract 14 thoracic pathology labels from existing radiology reports. The Stanford CheXpert dataset (224,316 images) used a documented automated rule-based labeler. The Spanish PadChest dataset (160,000 images) used a recurrent neural network with attention mechanisms to generate labels for 73% of its corpus, with only the remaining 27% manually annotated. Each of these projects was transparent about the methodology, the label noise characteristics, and the limitations of automated label generation. The Stanford CheXNeXt paper devoted significant analysis to “the partially incorrect labels in the ChestX-ray14 dataset” and developed a two-stage training process specifically to account for the noise.

If a veterinary AI vendor used NLP extraction from its existing report database to generate training labels — a method that scales to millions of images per radiologist-week of supervisory effort, rather than per radiologist-year of direct labeling — the math reconciles immediately. Sixteen million NLP-extracted labels supervised by a small team of radiologists is achievable. Sixteen million directly radiologist-labeled images is not.

The methodological problem is not the use of NLP extraction. It is the absence of disclosure. A vendor describing its product as trained on a “radiologist-reviewed” corpus, when the actual labeling methodology was NLP extraction with sparse specialist supervision, is making a representation that a reasonable reader would understand differently from what the actual methodology supports.

Possibility Two: Non-Specialist Human Labeling

The second possibility is that the labeling work was performed by humans, but not by board-certified specialists. General-practice veterinarians, residents in radiology training programs, technicians under specialist supervision, or contract clinicians with documented but non-DACVR credentials could each perform high-volume labeling work at substantially lower opportunity cost than fully-trained specialists. The question is whether the resulting labels meet the standard the marketing implies.

A general-practice veterinarian is qualified to identify many radiographic findings — gastric dilatation, obvious fractures, large effusions, gross orthopedic abnormalities. A GP is not, by virtue of GP training, qualified to read all of the findings a board-certified veterinary radiologist reads, and is not the standard the ACVR/ECVDI 2024 teleradiology consensus statement contemplates for clinical-quality interpretation. Training labels generated by GP-level review are, technically, defensible only for the categories of findings GP-level training reliably identifies. Using GP-labeled training data to develop an AI marketed as a substitute for radiologist consultation is methodologically problematic, even if the labels themselves are accurate within their proper scope.

The disclosure question is again the issue. A vendor describing its corpus as “veterinary radiologist-reviewed” when the actual reviewers were a mix of specialists, residents, and GPs is making a claim the reader will reasonably understand differently from what the methodology supports. The 2025 ACVR/ECVDI position statement on AI explicitly identifies transparency about training data as a necessary condition for product use: “Artificial intelligence systems that do not… provide transparency of their underlying methodology, training, and testing sets… should not be used in veterinary practice.”

Possibility Three: AI-Generated Labels — AI Training AI

The third possibility is the one most worth discussing in detail, because it represents the modern standard practice in large-scale medical AI training and also the largest gap between current practice and disclosure expectations. In contemporary computer-vision research, large training corpora are routinely built using a mix of human-labeled “seed” data and AI-generated “pseudo-labels” propagated through self-supervised or semi-supervised techniques.

The methodology works as follows. A small initial dataset is labeled by humans (specialists or otherwise). An initial AI model is trained on that human-labeled seed set. The trained model is then applied to a much larger unlabeled corpus to generate predicted labels — pseudo-labels — for each image. A subset of the pseudo-labels is sampled and reviewed by human specialists for quality control. The pseudo-labeled corpus, with the model’s confidence-weighted predictions, is then used as training data for a more capable subsequent model. The process can be iterated, with each generation of model producing better pseudo-labels for the next, in a self-training loop that scales to corpus sizes no team of human specialists could ever directly label.

This is, in effect, AI training AI. The technique is widely used in modern computer vision and is methodologically defensible when implemented with appropriate quality control and validation. Self-training, pseudo-labeling, and semi-supervised learning are standard topics in the academic AI literature, with extensive discussion of their failure modes and the safeguards required to prevent error propagation.
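
For readers unfamiliar with the technique, the loop can be illustrated in a few lines. The sketch below uses a toy logistic-regression classifier on synthetic data; the data, the confidence threshold, and the number of generations are illustrative assumptions, not a description of any vendor's pipeline.

```python
# Minimal self-training / pseudo-labeling loop on toy data (illustration only).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Small human-labeled "seed" set and a much larger unlabeled pool.
X_seed = rng.normal(size=(500, 20))
y_seed = (X_seed[:, 0] > 0).astype(int)          # stand-in for specialist labels
X_pool = rng.normal(size=(50_000, 20))           # unlabeled corpus

model = LogisticRegression().fit(X_seed, y_seed)

for generation in range(3):
    # The current model labels the pool; only confident predictions are kept.
    proba = model.predict_proba(X_pool)[:, 1]
    confident = (proba > 0.9) | (proba < 0.1)
    pseudo_labels = (proba > 0.5).astype(int)

    # In practice only a sampled fraction of these pseudo-labels is reviewed by
    # a specialist; any systematic error in the seed model propagates into the rest.
    X_train = np.vstack([X_seed, X_pool[confident]])
    y_train = np.concatenate([y_seed, pseudo_labels[confident]])
    model = LogisticRegression().fit(X_train, y_train)

    print(f"generation {generation}: {len(y_train):,} training labels, "
          f"{int(confident.sum()):,} of them pseudo-labeled")
```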

The specific failure mode that matters here is what AI researchers call “label-noise amplification.” If the initial AI model has systematic errors — for example, it underperforms on dachshunds, or on geriatric patients, or on cases where positioning is non-standard — those errors get propagated into the pseudo-labels. The next-generation model then trains on data that includes the systematic errors of its predecessor, often amplifying rather than correcting them. The result, if not carefully managed, is an AI system whose accuracy claims are statistically valid against its own self-generated test data but whose performance on independent external data is materially worse than the headline numbers suggest.

This is precisely the failure mode the Joslyn et al. peer-reviewed commentary, published in Frontiers in Veterinary Science in June 2025, identified in the only externally co-authored validation study of a major commercial veterinary AI radiology product. The commentary documented sensitivity collapse to 0.444 in difficult cases, a class-imbalanced test set that allowed a trivial “always normal” classifier to score 84% accuracy, and the explicit observation that the AI software was “continuously updated and does not have version numbers” — the algorithm version traceability problem that, in the FDA-regulated human medical device universe, would constitute a regulatory disqualification. Part Three of this investigation examines these validation findings in detail and the corporate-consolidation revenue model under which they continue to be marketed.

If a vendor’s training corpus was built using AI-generated pseudo-labels with sparse specialist quality control, several things follow. First, the vendor’s accuracy claims need to be evaluated against external data — independent test sets the vendor’s pipeline was not trained or validated against — to be meaningful. Second, the vendor’s algorithm version policy matters profoundly, because pseudo-labeled training tends to be unstable across iterations. Third, the vendor’s marketing language describing the training corpus as “specialist-reviewed” needs to be understood as describing the seed data and the quality-control sampling, not the bulk of the labels themselves. None of this is bad practice if disclosed. All of it changes what the marketing claim means. None of the three vendors named in this article has disclosed the methodology at this level.

Possibility Four: Inflated Numbers

The fourth possibility is the simplest. The published headline figures may not represent what they appear to represent. Antech’s “16 million images sourced from an unprecedented library of more than 8 billion images” is itself a layered claim — the 16 million is described as the training corpus, while the 8 billion is described as the broader library. The 8 billion figure most plausibly represents cumulative imaging volume Antech has handled across its services business over years of operation, not training data. The 16 million is the figure that requires scrutiny.

Three subsidiary possibilities exist within the inflated-numbers explanation. The 16 million may represent total images in the database — including duplicate views of the same study (front and lateral, multi-region surveys), images marked as suboptimal or excluded from training, and images from cases that were never actually used to train the production model. It may represent total annotations rather than total images — if each image has multiple labels (e.g., separate labels for cardiomegaly, alveolar pattern, pleural effusion), counting the labels rather than the images can inflate the apparent corpus size by an order of magnitude. Or it may represent a forward-looking aspirational figure — the size of the corpus the vendor intends to have trained against by some future date, rather than the size of the corpus actually used to develop the product currently in the field.

None of these subsidiary explanations is necessarily disqualifying. All of them are different from what the headline implies. A clinic adopting an AI radiology product on the implicit understanding that it was trained on 16 million specialist-reviewed canine and feline radiographs may be making a different decision than the same clinic would make if it understood that the 16 million figure included multiple views per study, technician-flagged quality issues, multi-label counts, or projected future training data.

Part Two Preview: The Annotation Steps That Actually Build the Product

Everything calculated to this point has assumed the simplest annotation task in the AI training pipeline: image-level categorical classification, the application of yes/no flags to pre-defined pathology categories at the Stanford CheXNeXt rate of 34.3 seconds per image. The math at this most charitable level does not work for the larger vendor claims. The math at the more demanding annotation steps that produce the actual capabilities commercial AI products are marketed as having does not work even more dramatically.

For an AI product that shows the user where on a radiograph a finding is located — drawing a box around the cardiomegaly, the alveolar pattern, the foreign body, or the fracture — the training data must include bounding-box annotations for those findings. For an AI product that produces measurements such as vertebral heart score, lung field volume, or mass dimensions, the training data must include pixel-level segmentation of those structures. For an AI product whose accuracy claims are meaningful, the training data must include some fraction of cases with ground-truth correlation against an independent reference standard — pathology, surgical findings, or outcome confirmation.

Each of these additional annotation tasks is documented in the peer-reviewed AI training literature with per-image time figures that are, in every published case, multiples of the categorical classification rate. The following table previews the published figures that Part Two applies to the same vendor claims this article addresses. Per-image rates are drawn from the indicated peer-reviewed sources and are stated at the upper end of what each source documents, to remain conservative.

| Annotation Step | What the radiologist does | Per-image rate (peer-reviewed) | Multiple of CheXNeXt rate | Source |
|---|---|---|---|---|
| Step 1: Image-level classification | Apply yes/no flags to pre-defined pathology categories. No localization. The basis for this article’s math. | 34.3 sec/image (avg); range 25.7–42.9 sec | 1× (baseline) | Stanford CheXNeXt, PLOS Medicine, 2018 |
| Step 2: Bounding-box localization | Draw a rectangle around each abnormality and label it. Per-image time scales with number of findings. | Median ~6 minutes/study (IQR 2.8–10.6 min); ~73 sec per individual structure | ~10× to 18× | Radiology: AI, PMC8017380, 2021 (coronary CT angiography study) |
| Step 3: Pixel-level segmentation | Outline lesion boundary at the pixel level. Required for measurements, volumetric analysis, shape characterization. | Several minutes to tens of minutes per image, depending on lesion complexity | ~10× to 50× | Multiple peer-reviewed sources documented in Part Two |
| Step 4: Pathology correlation | Cross-reference radiologist labels against necropsy, surgical, or outcome ground truth. Separate workflow per case. | Variable; requires a ground-truth dataset that most vet AI vendors do not have | N/A (different workflow) | Cohen, Fischetti, Daverio; Vet Radiol Ultrasound, 2023 |

The implication for the vendor claims is direct. If Antech RapidRead’s 16-million-image training corpus required even Step 2 (bounding-box localization) work — which the product’s capability of showing users where findings are located on the image necessarily implies — the labor required is not 73.3 radiologist-years at the Stanford CheXNeXt rate, but somewhere on the order of 700 to 1,300 radiologist-years at the bounding-box rate, before any segmentation work or pathology correlation is added. The cumulative career output of every active North American diagnostic imaging Diplomate, doing nothing else for their entire careers, would be insufficient.
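
The order-of-magnitude preview is a two-line calculation from the bounding-box study times in the table above; the per-study rates are the peer-reviewed figures cited there, and treating one study as one image is a simplifying assumption.

```python
# Bounding-box labor for 16 million images at the median and upper-IQR study times.
HOURS_PER_FTE_YEAR = 2_080
images = 16_000_000
for minutes_per_image in (6.0, 10.6):
    years = images * minutes_per_image / 60 / HOURS_PER_FTE_YEAR
    print(f"{minutes_per_image:>4} min/image -> {years:,.0f} radiologist-years")
# -> ~769 radiologist-years at the median, ~1,359 at the upper IQR bound
```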

Part Two of this investigation applies the bounding-box and segmentation rates to the same vendor claims, documents the resulting labor figures, and addresses the question Part One could not: how the actual capabilities of the marketed products — finding localization, measurement extraction, multi-pathology assessment — could possibly have been trained without one of the four reconciliation paths (NLP extraction, non-specialist labeling, AI-generated pseudo-labels, or inflated numbers) doing most of the work. The answer, foreshadowed by the math in this article, is that they could not. Part Three then closes the series by examining the validation-statistics evidence base on commercial veterinary AI and the corporate-consolidation revenue model that has produced the marketing claims at issue.

What Honest Disclosure Would Look Like

The CLAIM checklist (Checklist for Artificial Intelligence in Medical Imaging), published by Mongan, Moy, and Kahn in Radiology: Artificial Intelligence in 2020, is the standard documentation framework used by peer-reviewed human medical imaging journals to evaluate AI manuscripts. Its requirements specify what an honest disclosure of training-data methodology looks like. The veterinary AI vendors do not currently meet this standard. They could, if they chose to. The information clinics need to make informed decisions about adopting these products is the same information CLAIM has required from peer-reviewed authors for six years.

Five specific disclosures, in order of importance, would resolve the questions this article raises.

First, the total size of the training corpus, broken down by composition. Number of unique studies. Number of unique animals. Number of unique images (counting separate views as separate images). Number of total image-level annotations (counting multiple labels per image as multiple annotations). Each of these numbers is different. A vendor that has published only one of these numbers, presented as the headline figure, has not fully disclosed the corpus.

Second, the percentage of training labels generated by each labeling methodology. Manually labeled by board-certified veterinary radiologists (DACVR or DECVDI). Manually labeled by residents in approved radiology residency programs. Manually labeled by general-practice veterinarians or other non-board-certified DVMs. Manually labeled by veterinary technicians or other non-DVM personnel. Automatically extracted by NLP from existing reports. AI-generated pseudo-labels propagated from a seed model. Each of these methodologies has different reliability characteristics. The percentage breakdown is the disclosure that turns “the corpus is large” into “the corpus is credible.”

Third, the breakdown by annotation type. What fraction of the corpus has only image-level categorical labels. What fraction has bounding-box localization. What fraction has pixel-level segmentation. What fraction has measurement extraction. Each annotation type supports different AI capabilities; the breakdown documents what the product was actually trained to do versus what marketing suggests.

Fourth, the number of independent annotators per training image and the adjudication methodology. Single-pass labeling, two-pass with disagreement resolution by a third specialist, three-pass with majority vote, or some other documented protocol. The number of annotators per image — and the procedures for handling annotator disagreement — are the difference between a label that means “one specialist thought this looked like cardiomegaly” and a label that means “the consensus of three specialists, after documented adjudication, classified this as cardiomegaly.” (A minimal sketch of a majority-vote adjudication rule appears after the fifth disclosure below.)

Fifth, the algorithm version frozen for any cited validation study and the vendor’s update policy. If the validation study was conducted on version 2.3.1 and the clinic is currently using a continuously-updated version with no version number, the validation does not necessarily describe the product the clinic is actually using. The Joslyn et al. commentary’s identification of the “continuously updated and does not have version numbers” issue — directly traceable to the vendor’s own self-description in the only published external validation study — is the disclosure failure that compounds every other disclosure failure. Part Three examines the validation-statistics consequences in detail.
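
As a concrete illustration of the fourth disclosure, a minimal majority-vote adjudication rule for a single categorical finding might look like the sketch below; the function name and labels are hypothetical, and real protocols (VinDr-CXR, CheXpert) are documented in far more detail.

```python
# Toy majority-vote adjudication across independent annotators (illustration only).
from collections import Counter

def adjudicate(labels):
    """Return the label chosen by an outright majority of annotators,
    or 'needs adjudication' when no label wins a majority."""
    top_label, count = Counter(labels).most_common(1)[0]
    return top_label if count > len(labels) / 2 else "needs adjudication"

print(adjudicate(["cardiomegaly", "cardiomegaly", "normal"]))       # -> cardiomegaly
print(adjudicate(["normal", "cardiomegaly", "pleural effusion"]))   # -> needs adjudication
```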

A vendor that produces these five disclosures, with the methodology documented and the percentages broken out, has built the product to a standard the clinic can evaluate. A vendor that does not is asking the clinic to take the headline figure on faith. The 2025 ACVR/ECVDI position statement is explicit on this point: “There is currently no commercially available product for diagnostic imaging that meets these standards.” This article identifies, with arithmetic, why that statement is not rhetorical. It is descriptive of a specific gap between what the vendors claim and what the documented specialty workforce could have produced — at the simplest annotation step, before any of the more demanding work is even considered.

The Bottom Line — Part One

Stanford published the radiologist labeling rate in 2018: 34.3 seconds per image at the average, 25.7 to 42.9 seconds across the documented range, for the simplest possible annotation task of image-level categorical classification. The ACVR has published the diagnostic imaging Diplomate population: approximately 600 to 700 active specialists in North America, in a workforce the specialty college’s own leadership has described as facing a profession-level crisis. Multiply the inputs and the math forces a conclusion. Vetology’s 300,000-case claim sits at the plausible borderline of a small specialist team’s multi-year output, even at this simplest annotation step. SignalPET’s 2-million-image claim strains the available labor pool but is conceivable with sustained specialist contribution. Antech RapidRead’s 16-million-image claim cannot be reconciled with board-certified specialist labeling within the documented North American Diplomate workforce — at any rate within Stanford’s published range, at any reasonable distribution across specialists, at any defensible economic-rationality assumption about specialist availability for a multi-year labeling assignment.
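That comparison can be reproduced from the two published inputs in a few lines. The sketch below is illustrative only: it applies the CheXNeXt average rate and a standard 2,080-hour work year to each vendor's headline figure, and it treats cases and images interchangeably, which is the most charitable reading for the vendors.

```python
# Reproducing the Part One comparison from the cited inputs (illustrative only).
# Rate: CheXNeXt average of 34.3 seconds per image (PLOS Medicine, 2018).
# Work year: a standard 2,080-hour full-time schedule.

SECONDS_PER_IMAGE = 34.3
HOURS_PER_WORK_YEAR = 2080

headline_claims = {
    "Vetology (300,000 cases)": 300_000,
    "SignalPET (2 million images)": 2_000_000,
    "Antech RapidRead (16 million images)": 16_000_000,
}

for vendor, images in headline_claims.items():
    hours = images * SECONDS_PER_IMAGE / 3600
    years = hours / HOURS_PER_WORK_YEAR
    print(f"{vendor}: {hours:,.0f} specialist-hours, about {years:,.1f} specialist-years")
```

Run as written, the sketch returns roughly 1.4 specialist-years for the Vetology figure, 9.2 for SignalPET, and 73.3 for Antech RapidRead, at single-pass image-level labeling only.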

And this is the math at the simplest possible step. Image-level categorical classification only. No bounding boxes drawn. No segmentation performed. No pathology correlation applied. The actual annotation work required to produce the capabilities commercial veterinary AI products are marketed as having — finding localization, measurement extraction, multi-pathology assessment, ground-truth-validated accuracy — adds substantially more time per image, in some cases by a factor of 50 or more. Part Two of this investigation calculates that math. The conclusion, foreshadowed by the figures in this article, is that the gap between what the vendors claim and what the documented specialty workforce could have produced is not a close call; it is a matter of orders of magnitude. Part Three then examines the validation-statistics evidence base behind the commercial products these labor figures were supposedly used to build — and the corporate-consolidation revenue model that has allowed the marketing claims to continue without independent verification.

The four reconciliation paths — NLP auto-labeling, non-specialist human labeling, AI-generated pseudo-labels, or inflated numbers — each represent methodologically defensible practice when disclosed. None has been disclosed by the vendors at the level human-side AI publication standards require. The CLAIM checklist exists. The 2025 ACVR/ECVDI position statement has called for exactly this kind of disclosure. The vendors have not produced it. The clinic deciding whether to adopt a veterinary AI radiology product on the basis of “trained on millions of specialist-reviewed cases” should understand that the math, calculated against the only published radiologist labeling rate and the documented North American specialty workforce, does not support the claim as the marketing presents it — even at the simplest annotation step.


Frequently Asked Questions

What is image labeling in AI training, and what does it actually involve?

Image labeling for AI training is not a single task. It is the umbrella term for several distinct annotation activities, each with different time requirements and each required for different AI capabilities. The simplest form is image-level categorical classification: a radiologist looks at the image and marks which of N pre-defined categories are present, typically by selecting checkboxes from a fixed list. This is what the Stanford CheXNeXt study measured at 34.3 seconds per image average — the radiologists were marking 14 thoracic pathology categories per chest radiograph as present or absent, with no localization required. Image-level classification is the foundational first step in supervised AI training, but it is not sufficient on its own to produce the kind of AI products that are marketed as showing veterinarians where findings are located on a radiograph or that produce measurements such as vertebral heart score. Those capabilities require additional annotation work — bounding-box drawing for localization, pixel-level segmentation for measurement and shape analysis, and ground-truth correlation against pathology for validation. Each subsequent annotation step takes substantially more time per image than the categorical classification step the CheXNeXt 34.3-second figure measures. This article focuses exclusively on the labeling step, which is the simplest annotation task and the one that produces the most charitable possible math for the vendors. Part Two covers the additional annotation work and what it implies for the actual labor budget required. Part Three addresses the validation-statistics gap between FDA-cleared human radiology AI and commercial veterinary AI, plus the corporate-consolidation revenue model that explains the gap.

How long does it take a board-certified radiologist to label a single chest radiograph for AI training at the simplest level?

The most authoritative published figure for radiologist image-level labeling time on chest radiographs comes from the Stanford Machine Learning Group’s CheXNeXt study, published in PLOS Medicine in 2018 (Rajpurkar P, Irvin J, Ball RL, et al. Deep learning for chest radiograph diagnosis: A retrospective comparison of the CheXNeXt algorithm to practicing radiologists. PLOS Medicine. 2018;15(11):e1002686). The paper reports that “the average time for radiologists to complete labeling of 420 chest radiographs was 240 minutes (range 180–300 minutes).” Calculated as a per-image rate, that is 34.3 seconds per image at the average, with a documented range of 25.7 to 42.9 seconds per image across the radiologists who participated. This is for the simplest annotation task: applying pre-defined categorical labels — yes/no flags for each of 14 thoracic pathology categories — to images already in a structured workflow. No bounding boxes were drawn. No segmentation was performed. No measurements were recorded. No pathology correlation was applied. The 34.3-second figure is therefore the floor on radiologist labeling effort for AI training, the simplest possible task. Bounding-box annotation, pixel segmentation, multi-label localization, and species- or modality-specific contextual interpretation each add per-image time on top of this baseline. For the analysis of those additional time burdens, see Part Two of this investigation. For the validation-statistics comparison between FDA-cleared human radiology AI and commercial veterinary AI, see Part Three.
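The per-image rate itself is a straightforward conversion from the session times the paper reports. A minimal check, using only the 420-image set size and the reported 180-to-300-minute range:

```python
# Converting the reported CheXNeXt session times (Rajpurkar et al., 2018)
# into per-image rates: 420 radiographs labeled per session, 180-300 minutes.

IMAGES_PER_SESSION = 420

for label, minutes in [("fastest reported", 180), ("average", 240), ("slowest reported", 300)]:
    print(f"{label}: {minutes * 60 / IMAGES_PER_SESSION:.1f} seconds per image")

# Prints 25.7, 34.3, and 42.9 seconds per image, matching the figures cited above.
```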

How many board-certified veterinary radiologists are there, and how many are available to label AI training data?

The American College of Veterinary Radiology, the AVMA-recognized specialty organization for veterinary radiology in the United States, currently reports “over 800 accredited veterinary radiologists and radiation oncologists.” That figure includes radiation oncologists, who do not annotate diagnostic radiographs as part of their specialty practice. The most recent published breakdown (2019) showed 573 Diplomates in pure Radiology, 18 dual-boarded, and 95 in Radiation Oncology. As of 2025, industry sources estimate fewer than 1,000 board-certified veterinary radiologists practice in the United States serving more than 80,000 clinics nationwide — a documented shortage that the ACVR’s own leadership has publicly described as a profession-level workforce crisis. ACVR Executive Director Dr. Tod Drost was quoted in JAVMA News in 2018 observing that with only 43 new diplomates produced annually against approximately 70 open positions per year, “the math doesn’t work out that well.” The implication for AI training: the same specialist labor pool the vendors implicitly invoke when claiming “board-certified radiologist-reviewed” training images is the same labor pool that has documented insufficiency to meet existing clinical demand. Diplomates available to perform large-scale AI labeling work, in addition to or in place of clinical practice, are a vanishingly small subset of an already inadequate specialty population.

Is it physically possible for a board-certified veterinary radiologist to have manually labeled 16 million radiographs for AI training even at the simplest categorical level?

Calculating from primary sources: at the Stanford CheXNeXt paper’s documented average rate of 34.3 seconds per image — for image-level categorical labeling alone, the simplest annotation task — labeling 16 million radiographs would require approximately 152,444 radiologist-hours, or 73.3 radiologist-years of full-time work at a 2,080-hour annual schedule. To complete the work in five years, 14.7 radiologists would need to be working full-time exclusively on labeling — no clinical practice, no teaching, no research, no other professional activity. To complete it in two years, 36.7 radiologists would be required full-time. The American College of Veterinary Radiology counts approximately 600 to 700 active diagnostic imaging Diplomates in North America, the great majority of whom are clinically practicing full-time at standard specialist billing rates. The applicable human-medicine standard for training data labeling is more demanding: training data for FDA-cleared AI radiology products typically uses three independent radiologist annotations per image, with adjudication. Multiplied by three, the 16 million figure would require approximately 220 radiologist-years of dedicated specialist labeling work for the categorical classification step alone, the equivalent of the entire working careers of approximately 7.3 Diplomates at a roughly 30-year career. And these calculations cover only Step One of the training pipeline — image-level categorical labeling. Bounding-box localization, pixel segmentation, and pathology correlation each add substantial additional time burdens on top of this figure. The Part Two analysis quantifies those additional burdens. Part Three examines the validation-statistics evidence base and the corporate revenue model that produces the marketing claims this math addresses.
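Every figure in that answer follows from the same two inputs. A minimal sketch of the arithmetic, assuming a 2,080-hour work year and, for the per-Diplomate comparison, a roughly 30-year specialist career:

```python
# The 16-million-image scenario at the CheXNeXt average rate (illustrative only).

SECONDS_PER_IMAGE = 34.3
HOURS_PER_WORK_YEAR = 2080
CAREER_YEARS = 30            # assumed length of a full specialist career
IMAGES = 16_000_000

hours = IMAGES * SECONDS_PER_IMAGE / 3600      # roughly 152,444 specialist-hours
years = hours / HOURS_PER_WORK_YEAR            # roughly 73.3 specialist-years

print(f"Single-pass labeling: {hours:,.0f} hours, {years:.1f} specialist-years")
print(f"  completed in 5 years: {years / 5:.1f} full-time radiologists")
print(f"  completed in 2 years: {years / 2:.1f} full-time radiologists")

# Three independent annotators per image, the adjudicated standard typical of
# FDA-cleared human radiology AI training data:
triple_years = years * 3
print(f"Triple-annotated: about {triple_years:.0f} specialist-years, "
      f"or {triple_years / CAREER_YEARS:.1f} full {CAREER_YEARS}-year careers")
```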

What is the difference between image-level labeling, bounding-box annotation, and pixel-level segmentation?

These are three distinct AI training annotation tasks with progressively higher time requirements per image. Image-level classification: the radiologist marks which categorical labels apply to the entire image, without specifying where on the image the findings are located. This is the simplest form of annotation and what the Stanford CheXNeXt study measured at 34.3 seconds per image average. Bounding-box annotation: the radiologist draws a rectangle around each abnormality on the image and labels it with the relevant category. The radiologist must identify the lesion’s edges, click and drag the rectangle, label it, and verify the label. A peer-reviewed study published in Radiology: Artificial Intelligence (PMC8017380) documented bounding-box annotation rates for coronary CT angiography studies at a median of 6.08 minutes per study (interquartile range 2.8–10.6 minutes), or 73 seconds per vessel — an order of magnitude longer than image-level categorical classification. Pixel-level segmentation: the radiologist outlines the lesion boundary at the pixel level, producing a precise shape mask used for measurement, volumetric analysis, and shape characterization. Published per-image segmentation times for complex cases run several minutes to tens of minutes. The progression from image-level classification to bounding-box annotation to pixel-level segmentation typically increases annotation time per image by factors of 5x to 50x. AI products that produce localizations on images, perform measurements like vertebral heart score, or output disease-specific shape characterizations require some combination of these higher-effort annotation steps in their training data — and consequently require labor budgets several multiples larger than the image-level classification math alone implies.
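The 5x-to-50x progression translates directly into the labor budget. A rough sketch of how a fixed corpus scales across annotation types, using the per-image classification rate quoted above, the per-study bounding-box median as a stand-in, and a hypothetical mid-range segmentation time (the one-million-image corpus is purely for illustration):

```python
# How annotation type scales the specialist labor budget for a fixed corpus.
# The classification rate is the CheXNeXt average; the bounding-box figure is the
# per-study median from the CCTA study (PMC8017380), used here per image as a
# rough stand-in; the segmentation time is a hypothetical mid-range value.

HOURS_PER_WORK_YEAR = 2080
CORPUS_IMAGES = 1_000_000    # hypothetical one-million-image corpus, illustration only

seconds_per_image = {
    "image-level classification": 34.3,
    "bounding-box localization": 6.08 * 60,
    "pixel-level segmentation": 10 * 60,
}

for task, seconds in seconds_per_image.items():
    years = CORPUS_IMAGES * seconds / 3600 / HOURS_PER_WORK_YEAR
    print(f"{task}: about {years:,.1f} specialist-years per million images")
```

Even with rough stand-ins, the step from categorical labels to localization multiplies the budget by roughly an order of magnitude, which is the scale of additional burden Part Two quantifies.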

Could veterinary AI vendors have used AI to generate the labels for their training data?

Yes, this is technically feasible and methodologically defensible — provided the practice is disclosed, the methodology is documented, and the resulting label quality is independently validated. Several public human-side AI training datasets used automated label generation. The NIH ChestX-ray14 dataset (112,120 images) used natural language processing to extract labels from existing radiology reports rather than fresh radiologist labeling; the methodology was documented and the resulting label noise was extensively analyzed in subsequent peer-reviewed papers. The Stanford CheXpert dataset (224,316 images) used “the CheXpert labeler, an automated rule-based labeler to extract observations from the free text radiology reports to be used as structured labels for the images,” with the labeler itself open-sourced for independent inspection. The PadChest dataset (160,000 images, published in Medical Image Analysis, 2020) explicitly disclosed that “27% were manually annotated by trained physicians and the remaining set was labeled using a supervised method based on a recurrent neural network with attention mechanisms.” These are all defensible practices when disclosed. The methodological problem in veterinary AI is not the use of automated labeling itself; it is the absence of documentation specifying what fraction of training labels were human-generated by what category of clinician, what fraction were automatically extracted from existing reports, and what fraction were generated or refined by AI in semi-supervised or self-supervised pipelines.

What are the four possible explanations for the gap between vendor training-set claims and the documented specialist labor pool?

When the time math and the specialist workforce math are applied to the larger vendor training-corpus claims, four reconciliation possibilities emerge. Each is methodologically defensible if disclosed, and each represents a different claim from what marketing language typically implies. First, NLP extraction from existing reports — labels generated by automated text processing of radiology reports already in the vendor’s database, the way ChestX-ray14 and CheXpert built their public datasets. This is a legitimate method when disclosed. Second, non-specialist human labeling — labeling performed by general-practice veterinarians, residents, technicians, or non-board-certified clinicians, with limited or no specialist quality control. This shifts the meaning of “reviewed by radiologists” substantially, and in some implementations may not constitute specialist review at all. Third, AI-generated or AI-assisted labels — pseudo-labels produced by an earlier model trained on a smaller human-labeled seed set, then propagated to the larger corpus through self-supervised or semi-supervised techniques. This is widely used in modern computer vision but introduces label-noise dynamics that require documented quality-control methodology. Fourth, inflated numbers — the actual size of the human-reviewed training corpus is materially smaller than the headline figure, with the larger number representing total imaging volume the vendor has handled rather than training corpus specifically. Any of the four is news. None has been disclosed by the vendors named in this article at the level human-side AI publication standards would require.

What disclosures should clinics demand from veterinary AI vendors before signing a contract?

Clinics evaluating commercial veterinary AI radiology products should request, in writing, the following disclosures regarding training-set methodology — modeled on the documentation requirements of the CLAIM checklist (Checklist for Artificial Intelligence in Medical Imaging) used by peer-reviewed human medical imaging journals. First, the total size of the training corpus, broken down by species, body region, and pathology class. Second, the percentage of training labels generated by board-certified veterinary radiologists, by general-practice veterinarians or residents, by technicians, by automated NLP extraction from existing reports, and by AI-generated pseudo-labels in semi-supervised or self-supervised pipelines. Third, the breakdown by annotation type: what fraction of the corpus has image-level classification only, bounding-box localization, pixel-level segmentation, measurement extraction, and pathology-confirmed ground truth. Fourth, the number of independent annotators per training image and the methodology for adjudicating disagreement between annotators. Fifth, the algorithm version frozen for any cited validation study and the vendor’s update policy for the version the clinic is currently using. A vendor that cannot or will not produce this documentation has not built the product to a standard that human-side AI publication would consider adequate. The 2025 ACVR/ECVDI position statement on AI explicitly identifies transparency about training data as a necessary condition for veterinary AI adoption — a standard that, in the position statement’s own words, no commercially available veterinary diagnostic imaging product currently meets. For more on what credible AI radiology validation looks like, see our coverage of the engineering rigor gap; for the regulatory framework that would constrain these products on the human side, see our coverage of the regulatory gap; for the bounding-box and segmentation math plus structural infrastructure analysis, see Phantom Radiologists Part Two; and for the validation-statistics evidence and the corporate-consolidation revenue model, see Phantom Radiologists Part Three.


Vendor Marketing Materials Quoted in This Article
  • SignalPET: Training corpus claim of “over 2 million annotated veterinary radiographs” sourced from company materials at https://www.signalpet.com/.
  • Vetology: Training corpus claim of “over 300,000 Board Certified veterinary radiologist-reviewed cases” and “38 different deep-learning architectures” sourced from https://vetology.net/ai/.
  • Antech RapidRead (Mars): Training corpus claim of “16 million images sourced from an unprecedented library of more than 8 billion images” sourced from https://www.antechdiagnostics.com/imaging-services/rapidread/.

Editorial & Legal Disclaimer. VeterinaryTeleradiology.com is an independent industry publication. This article is Part One of a three-part investigation into the relationship between commercial veterinary AI radiology vendor training-corpus claims and the documented capacity of the board-certified veterinary radiologist specialty workforce. Part One focuses exclusively on the simplest annotation task — image-level categorical classification — using the labeling-rate benchmark from Stanford CheXNeXt published in PLOS Medicine. Part Two addresses the additional annotation tasks (bounding-box localization, pixel-level segmentation, ground-truth correlation) that produce the actual capabilities marketed in commercial products. Part Three documents the validation-statistics evidence base for commercial veterinary AI versus FDA-cleared human radiology AI, and the corporate-consolidation revenue model that has produced the marketing claims at issue. The math in this article is conservative by design — it uses the most charitable possible inputs — and is therefore the floor on the labor required, not the ceiling.

This article is based entirely on publicly available and documented sources, each identified in the Primary Documents Referenced and Vendor Marketing Materials sections above. Sources include: peer-reviewed papers published in PLOS Medicine, Nature Scientific Data, Nature Scientific Reports, Medical Image Analysis, Frontiers in Veterinary Science, JAVMA, Radiology: Artificial Intelligence, and Veterinary Radiology & Ultrasound; institutional sources including the American College of Veterinary Radiology and the American Veterinary Medical Association; trade-press reporting in JAVMA News; and publicly accessible vendor marketing materials and product descriptions from SignalPET, Vetology, and Antech Diagnostics. No confidential sources, non-public documents, or unverified information is relied upon in this article. Every factual claim, every input to the math, and every conclusion is attributable to one or more of the above primary or secondary sources.

This article presents a quantitative analysis applying a published per-image radiologist labeling rate (Stanford CheXNeXt, PLOS Medicine 2018) and a documented specialty workforce population (American College of Veterinary Radiology) to publicly stated training-corpus claims by three commercial veterinary AI radiology vendors. The mathematical calculations presented are reproducible from the inputs cited. The conclusions drawn — specifically, that the larger training-corpus claims cannot be reconciled with board-certified specialist labeling within the documented specialty workforce, even at the simplest annotation step — are descriptive of the arithmetic, not assertions of legal wrongdoing or fraudulent representation. The four reconciliation possibilities identified (NLP extraction from existing reports, non-specialist human labeling, AI-generated pseudo-labels, or composition differences in the headline corpus number) are each methodologically defensible practices when disclosed; the article identifies the absence of disclosure as the central methodological gap, not the practices themselves.

The article does not assert that any vendor has engaged in fraud, misrepresentation, or unfair trade practices. It does not assert that any vendor’s training methodology fails on its merits. It asserts, descriptively, that the published training-corpus claims are not supported at the level of methodological detail the human-side AI literature considers standard, and that the gap between the marketing claims and the documented specialty workforce capacity invites disclosure that the vendors have not produced. Each vendor named in this article is invited to publish the training-data methodology disclosures the CLAIM checklist requires; any disclosure supported by documentary evidence will be published in full by this publication. This invitation is extended directly and without prejudice to SignalPET, Vetology, Antech Diagnostics, Mars Petcare, and any other vendor whose products are discussed.

This publication is not a law firm and does not provide legal advice. Veterinarians, state regulators, and other readers with specific factual or legal questions should consult qualified counsel. The mathematical analysis presented is intended to inform reader and regulatory consideration of how the marketing claims of large commercial veterinary AI vendors compare against documented specialty workforce capacity. It is not a substitute for vendor-specific due diligence by clinics evaluating these products for adoption.
