How Human Radiology AI Actually Gets Built — RSNA Working Groups, Public Datasets of 100,000+ Images, Kaggle Competitions With Hundreds of Teams, FDA-Cleared Concurrent Reading Aids — and the Wild West of Veterinary AI Where None of That Exists
A peer-reviewed Frontiers commentary published in June 2025 by four veterinary AI researchers — including the lead author of the ACVR/ECVDI’s official AI position statement — methodically dismantled the only published external validation study of a major veterinary AI radiology product. Circular ground truth, severe class imbalance, sensitivity of 0.444 in difficult cases, the wrong statistical test, no version traceability. That is the state of validation in commercial veterinary AI. On the human side, by contrast, a model called CheXNet was trained on 112,120 publicly released chest radiographs in 2017, validated against three independent cardiothoracic specialists, published in PLOS Medicine, and then beaten on the public leaderboard by hundreds of subsequent teams. That is what the scientific method looks like in medical AI. The veterinary industry skipped it.
Why This Article Is Different From the Last One
An earlier piece in this publication mapped the regulatory gap that allows AI-primary veterinary radiology reads to operate without FDA oversight, without state practice-act enforcement, and without the reimbursement gatekeeping that protects human patients. The conclusion of that piece was that the safeguards which protect human-side AI radiology are absent on the veterinary side, and that the ACVR and ECVDI have formally said so without any enforcer responding.
This piece is about something different — and, in some ways, more fundamental. Even if every regulatory layer were operational tomorrow, there would still be the question of whether the underlying products are scientifically sound. The regulatory gap matters because it lets bad products onto the market. The scientific gap matters because the products themselves were never built the way medical AI is supposed to be built. They lack the upstream infrastructure — the public datasets, the open challenges, the peer-reviewed external validations, the multi-reader multi-case study designs, the version tracking, the locked-algorithm verification — that the human radiology field spent the last decade constructing before a single product was cleared for clinical use.
The human side built the science. The FDA then audited it. On the veterinary side, no equivalent science exists, the regulator has nothing to audit, and the vendors fill the vacuum with marketing claims. That is the gap this article documents.
The autonomous-AI-replacing-the-radiologist business model that veterinary AI vendors operate today does not exist anywhere in U.S. human clinical medicine. Zero FDA-cleared products. Zero approved billing pathways. Zero state medical board endorsements. Across roughly 700–950 FDA-cleared radiology AI devices, not one is labeled to issue a diagnostic interpretation that bypasses radiologist review.
The veterinary AI vendors are not selling products that lag a few years behind their human-side equivalents. They are selling products that have no human-side equivalent at all — products whose business model is structurally illegal in U.S. human medicine, and which exist on the veterinary side only because none of the three regulatory layers that block them on the human side reaches the veterinary market. That is the asymmetry, in its sharpest form. Every clinic considering the use of these products should understand this before signing a contract.
What the Human Side Has, Specifically, Refused to Build
Before walking through how human radiology AI actually got built, it is worth establishing a single point as concretely as possible: the business model the veterinary AI vendors operate is not an emerging human-side technology that veterinary medicine has gotten ahead of. It is not a model that exists in human medicine in a more cautious or more regulated form. It is a model that does not exist in human medicine at all, and it does not exist for three independent reasons that each, on their own, would be sufficient to prevent it.
The FDA has refused to clear it. The Food and Drug Administration has authorized somewhere between 700 and 950 radiology AI devices over the past decade — depending on which counting methodology is used — across chest imaging, mammography, brain CT, cardiac, orthopedic, and dozens of other categories. Not one of those devices is labeled for autonomous diagnostic interpretation. The labeled intended use language is uniform across the entire portfolio: “concurrent reading aid,” “computer-aided detection,” “computer-aided triage,” “decision support during interpretation by qualified clinician.” Every device presumes a radiologist reads the study, evaluates the AI’s output, and signs the final report. The FDA has not refused to clear autonomous AI by oversight; it has refused as a matter of policy, recently and publicly. After the FDA’s December 2024 workshop on AI integration in medical imaging, the American College of Radiology and the Radiological Society of North America jointly told the agency that it is unlikely the FDA could provide reasonable assurance of the safety and effectiveness of autonomous AI in radiology under the current evidentiary framework. The FDA’s January 2025 draft guidance on AI-Enabled Device Software Functions emphasizes continued human oversight as a structural requirement. No vendor has chosen to attempt an autonomous primary-reader clearance, because the evidentiary burden would be unprecedented and the outcome would be uncertain.
State medical practice acts make it illegal regardless of FDA action. Every state in the country defines diagnosis as the practice of medicine and reserves diagnostic acts to licensed physicians. An AI software company is not a licensed physician. A product that issues a diagnostic interpretation directly to a non-radiologist physician — in a form that physician relies on without independent review — runs into unauthorized-practice-of-medicine prohibitions at the state level. State medical boards regularly investigate and prosecute unauthorized practice; the prohibitions have teeth. Even if the FDA cleared an autonomous AI radiology product tomorrow, deploying it in a way that bypassed radiologist review would expose the deploying party to state medical board action in fifty separate jurisdictions. No vendor has chosen to test these statutes by deploying an autonomous diagnostic product into U.S. human clinical practice.
CMS will not pay for it. The Centers for Medicare & Medicaid Services does not reimburse a radiology interpretation unless a licensed physician personally reviews the study and signs the report. Private insurers follow Medicare’s lead almost universally. An autonomous AI interpretation with no physician signature is not a billable professional service. The economic model collapses before the regulatory model is even tested: even a hospital that wanted to deploy autonomous AI to cut costs would find no insurance reimbursement for the resulting interpretations, and no commercial AI vendor has been willing to build a product whose intended deployment is unbillable.
The combined effect of these three independent constraints is that autonomous AI replacing the radiologist on diagnostic radiograph interpretation is not an activity that occurs in U.S. human clinical medicine. It is not happening in some hospitals and being debated in others. It is not happening anywhere. The closest exception that proves the rule is IDx-DR, an AI system cleared in 2018 to autonomously screen for diabetic retinopathy in primary care offices. That product is cleared only for a narrow screening task — detecting “more than mild” diabetic retinopathy in adults with diabetes who have not been previously diagnosed — only after years of pre-market clinical trials, only with a referral pathway requiring an ophthalmologist for any positive screen, and only outside the radiology specialty entirely. It is not analogous to a radiograph diagnostic interpretation. It is the single FDA-approved instance of autonomous AI diagnostic activity in U.S. clinical medicine, and it operates under conditions that no veterinary AI radiology product comes close to meeting. (For a deeper analysis of how the FDA, state practice acts, and CMS reimbursement layers each independently block this business model on the human side — and why none of them currently reach the veterinary market — see the companion regulatory analysis, The Safeguards That Don’t Apply Here.)
Now compare that to what is operational on the veterinary side today. SignalPET’s SignalSTAT product page states explicitly that the service “does not include a human (radiologist) review.” Vetology’s CEO has publicly stated that “an AI product MUST be 100% autonomous to have a valid result. If a human intervenes during any part of the result creation, it’s not artificial intelligence, it’s human intelligence.” Antech’s RapidRead delivers AI-generated reports with no DACVR overread on the overwhelming majority of cases. SignalPET alone reports processing 50,000 weekly radiographs across more than 2,300 clinics. The corresponding human-side products do not exist, are not approved, are not legally permitted under state practice acts, and would not be reimbursed if they did exist. They are blocked by three independent regulatory walls, each of which is sufficient on its own to keep them off the U.S. human medical market.
This is the framing that should accompany every other observation in this article. The veterinary AI vendors are not behind their human-side counterparts on validation rigor; they have no human-side counterparts. They are not lagging on transparency standards; they are operating in a category that does not exist in the comparison field. The “wild west” descriptor in this article’s title is not rhetorical hyperbole — it is the precise and accurate description of a market segment that has no equivalent in the regulated medical AI universe and that the veterinary AI industry has built specifically by occupying the space the human medical system has explicitly refused to allow.
What the Human Radiology Field Built — and Why It Took a Decade
To understand the asymmetry, it helps to walk through how human chest x-ray AI actually came into being. The story begins in 2017, when the National Institutes of Health released a dataset called ChestX-ray14: 112,120 frontal-view chest radiographs from 30,805 unique patients, labeled for 14 thoracic pathologies, made publicly available for any researcher to download. This was not a vendor’s proprietary training set. It was a public good, contributed by an academic institution to seed a field.
Within months, a team at Stanford’s Machine Learning Group — led by Pranav Rajpurkar in Andrew Ng’s lab — published a paper called CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning. They trained a 121-layer DenseNet convolutional neural network on the public NIH dataset and reported that it exceeded the average performance of four practicing Stanford radiologists on the F1 metric for pneumonia detection. The paper was posted to arXiv on November 14, 2017. The follow-up paper, CheXNeXt, validated against three independent cardiothoracic specialist radiologists with an average of 15 years of experience and published in PLOS Medicine in 2018.
What happened next is the part that matters. CheXNet was not a product launch. It was a benchmark. Hundreds of subsequent research groups downloaded the same NIH dataset, replicated CheXNet’s results, identified its weaknesses, and published improvements. Eight years later — in May 2025, while this article was being researched — researchers were still publishing papers reproducing and improving CheXNet using the same public data, with full reproducibility code on GitHub. The most recent reproduction reports an average AUC-ROC of 0.85 across the 14 pathologies and notes that even after years of effort, F1 scores on rare findings remain modest. That is what honest scientific progress looks like: an open benchmark, a transparent baseline, public competition to improve it, and an honest accounting of what still does not work.
A DenseNet-121 is a deep convolutional neural network with 121 layers in which each layer, within a dense block, receives the outputs of every layer that comes before it. That dense connectivity encourages feature reuse and makes the architecture effective at extracting features from medical images. It is the architecture CheXNet used and the architecture much subsequent radiology AI work has been built around.
AUC-ROC (area under the receiver operating characteristic curve) is a measure of how well a classifier separates positive from negative cases across all possible decision thresholds. A value of 1.0 is perfect; 0.5 is no better than chance. An AUC-ROC of 0.85 across 14 pathologies means the model has learned something real, but is still substantially imperfect on individual classes.
F1 score is the harmonic mean of precision and recall, useful for imbalanced datasets where one outcome (normal) is far more common than the other (abnormal). F1 of 0.39 — the score the recent CheXNet reproduction reports — sounds low, and it is. It tells you that even after eight years of sustained academic effort on a public dataset of 112,120 images, getting deep learning to consistently identify diverse thoracic pathology is hard.
Why veterinary readers should care: vendor claims of “92 percent accuracy” or “95 percent accuracy” without specifying AUC-ROC, F1, sensitivity, specificity, prevalence, and operating point are essentially marketing copy. Accuracy on its own — especially against a class-imbalanced test set where 84 percent of cases are normal — can be inflated by a classifier that simply calls everything normal. The technical literature on chest x-ray AI has known this for almost a decade. Veterinary AI marketing has not caught up.
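To make the arithmetic concrete, here is a minimal Python sketch using made-up numbers that mirror the 84 percent normal split described above. A classifier that calls every case normal posts 84 percent accuracy while detecting nothing; a hypothetical imperfect model posts lower accuracy but is the only one of the two doing diagnostically useful work.

```python
# Illustrative only: a 1,000-case test set with the 84/16 normal/abnormal split.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score, roc_auc_score

rng = np.random.default_rng(0)

y_true = np.array([0] * 840 + [1] * 160)          # 0 = normal, 1 = abnormal

# Classifier A: trivially calls every case normal.
trivial_pred = np.zeros_like(y_true)

# Classifier B: a hypothetical imperfect model, simulated as a suspicion score
# that is only moderately higher for abnormal cases, thresholded at 0.5.
scores = np.clip(rng.normal(0.35 + 0.25 * y_true, 0.20), 0, 1)
model_pred = (scores >= 0.5).astype(int)

print("all-normal classifier:",
      f"accuracy={accuracy_score(y_true, trivial_pred):.2f}",
      f"sensitivity={recall_score(y_true, trivial_pred, zero_division=0):.2f}",
      f"F1={f1_score(y_true, trivial_pred, zero_division=0):.2f}")

print("imperfect real model: ",
      f"accuracy={accuracy_score(y_true, model_pred):.2f}",
      f"sensitivity={recall_score(y_true, model_pred):.2f}",
      f"F1={f1_score(y_true, model_pred):.2f}",
      f"AUC-ROC={roc_auc_score(y_true, scores):.2f}")
```

The trivial classifier matches the 84 percent headline number while catching zero abnormalities; its F1 and sensitivity are zero. That is why accuracy quoted without prevalence, sensitivity, and an operating point tells a reader almost nothing.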
The Open-Challenge Infrastructure: Why Human Radiology AI Has a Public Leaderboard
Public datasets like ChestX-ray14 are only the beginning. Sitting on top of them is an entire infrastructure of organized scientific competition that the veterinary field has no equivalent of. The Radiological Society of North America, working with subspecialty groups like the Society of Thoracic Radiology, has run an annual AI challenge since 2017. Each one releases a new public dataset, defines a clinically meaningful task, and invites the world to compete on it.
The 2017 challenge was on pediatric bone age estimation. The 2018 challenge focused on pneumonia detection on chest radiographs, partnering with the NIH and the Society of Thoracic Radiology and using cases drawn from ChestX-ray14. Over 1,400 teams participated in the training phase. Kaggle, Google’s data-science competition platform, hosted the challenge and contributed $30,000 in prize money. The 2019 challenge, on intracranial hemorrhage detection on head CT, used a dataset of more than 25,000 head CT scans — the first multiplanar dataset used in an RSNA challenge. The 2020 challenge, on pulmonary embolism detection, used the RSNA-STR Pulmonary Embolism CT (RSPECT) Dataset: more than 12,000 CT pulmonary angiography studies from five international research centers, comprising almost 1.8 million expertly annotated images, labeled by more than 80 expert thoracic radiologists. More than 700 international teams competed. The 2021 challenge, conducted with SIIM and FISABIO, focused on COVID-19 pneumonia detection and grading; 1,786 participants from 82 countries formed 1,305 teams.
What does this infrastructure produce that vendor proprietary development does not? Three things, none of which any veterinary AI vendor’s product currently has.
First, an externally validated public benchmark. Anyone — academic researcher, regulator, hospital system, journalist — can take the same dataset, run a candidate algorithm against it, and compare results against the published leaderboard. There is no “we believe our accuracy is 92 percent.” There is a published number, generated under controlled conditions, that anyone can reproduce.
Second, a community of independent researchers stress-testing every claim. The published winners’ solutions to RSNA challenges are open source. The runner-up papers are published in Radiology: Artificial Intelligence. Every weakness in every winning model gets identified and published, and the next year’s competition incorporates lessons learned. This is what peer review looks like at scale, applied to engineering rather than to text.
Third, a labeled dataset with documented ground truth. The RSNA-STR pulmonary embolism dataset was labeled by more than 80 thoracic radiologists, with documented adjudication procedures for disagreement. The CheXNet validation set was labeled by three cardiothoracic specialists with consensus reference standards. When a paper reports “AUC-ROC of 0.85,” it is reporting performance against a labeling protocol that is itself documented and reviewable.
The veterinary field has none of this. There is no public veterinary radiograph dataset of 100,000+ images. There is no annual veterinary AI challenge. There is no veterinary equivalent of Kaggle leaderboards. There is no community of independent researchers downloading the same data, training their own models, and publishing improvements over published baselines. The vendors — SignalPET, Vetology, Antech RapidRead — operate on proprietary datasets they do not release, with labeling protocols they do not disclose, against benchmarks they have not externally validated. Each company’s accuracy claim is internally generated, internally audited, and internally marketed. There is no public infrastructure that would let anyone else check the work.
What the FDA Actually Approves on the Human Side, and What It Refuses To
Once the academic and competitive infrastructure has produced credible candidate algorithms, the FDA’s role begins. It is critical to understand what FDA clearance is and is not. It is not a finding that an AI product is accurate or clinically beneficial. It is a finding that the product is “substantially equivalent” to a previously cleared predicate device, that its labeled intended use is supported by validation data, and that its risk-benefit profile is acceptable for the use case described in the labeling. The labeled intended use — the language printed in the device’s official documentation — is the entire game. It defines what the product is allowed to be marketed for and how it must be presented to clinicians.
The pattern across radiology AI clearances is striking and consistent. Take Gleamer’s BoneView, an AI fracture-detection algorithm cleared by the FDA in 2022 for adults and in 2025 for pediatric patients. The FDA 510(k) Summary for K212365 states the intended use directly: “BoneView is intended for use as a concurrent reading aid during the interpretations of radiographs. BoneView is for prescription use only and is indicated for adults only.” The clearance covers radiographs of the limbs, pelvis, rib cage, and dorsolumbar vertebra. The algorithm flags suspicious areas with bounding boxes. The radiologist still reads the image, validates the AI’s flagged regions, and signs the report. The product is described in Gleamer’s own materials as software that “detects fractures in X-ray images and submits them to radiologists for final validation.”
This is the template across the entire FDA-cleared radiology AI universe. Aidoc’s chest CT triage products for pulmonary embolism and intracranial hemorrhage flag suspicious cases for radiologist review at the worklist level — they do not generate reports. RapidAI’s stroke-detection products do the same for large vessel occlusion. Lunit INSIGHT CXR and Annalise.ai’s chest x-ray products produce computer-aided detection markings that radiologists evaluate, accept, or override. Hologic’s mammography AI, Therapixel’s MammoScreen, and iCAD’s ProFound all operate as concurrent reading aids during radiologist interpretation. Viz.ai’s stroke pathway tools route imaging studies to neurointerventionalists on a priority basis but do not autonomously diagnose. Across the FDA’s cleared radiology AI portfolio — which now includes hundreds of products across cardiology, neurology, orthopedics, oncology, breast imaging, and more — the structural principle is uniform: the AI is an assistive tool, the human radiologist is the diagnostician, and the labeled intended use makes that division explicit and binding.
What the FDA has not cleared is equally informative. There is no FDA-cleared autonomous primary reader of diagnostic imaging in clinical use today. There is no AI product that is allowed under its labeling to issue a diagnostic report directly to a non-radiologist physician without a radiologist in the loop. The closest exception, IDx-DR for diabetic retinopathy screening, was cleared for an extremely narrow use case (diagnosing more-than-mild DR in adults with diabetes who have not been previously diagnosed) and only after years of pre-market clinical study. Even there, the product is a screening tool, not a primary diagnostic reader.
The professional societies actively defend this boundary. In a joint letter to the FDA following the agency’s December 2024 workshop on AI integration in medical imaging, the American College of Radiology and the Radiological Society of North America told the agency it is unlikely the FDA could provide reasonable assurance of the safety and effectiveness of autonomous AI in radiology patient care without more rigorous testing, surveillance, and other oversight than currently exists. ACR’s own 2024 member survey found that 95 percent of radiologists who use AI in clinical practice would not use AI algorithms without a physician overread. The boundary is not abstract — it is what radiologists themselves have demanded and continue to demand of the regulator.
The Worked Example: How One FDA-Cleared Product Was Actually Validated
Walking through how Gleamer’s BoneView came to market makes the contrast with veterinary AI unmistakable. BoneView was developed beginning in 2018, received CE mark certification in Europe in March 2020 (the European conformity marking that plays a role broadly comparable to FDA clearance), and obtained FDA 510(k) clearance for adult use in March 2022. Between European clearance and U.S. clearance, the company conducted ClinicalTrials.gov-registered prospective study NCT04532580: Clinical Validation of BoneView for FDA Submission: Evaluation of the Ability of the Artificial Intelligence Software, Boneview, to Improve Physicians’ and Radiologists’ Performances in Detecting Fractures on Bone X-Rays Radiographs.
The study was an observational reader study comparing physician performance with and without BoneView assistance. The 510(k) submission included specificity and sensitivity calculations with 95 percent Clopper-Pearson confidence intervals at high-sensitivity and high-specificity operating points, broken out by anatomical subgroup. Boston University School of Medicine published an independent validation study showing BoneView improved fracture detection sensitivity by 10.4 percent, shortened reading time by 6.3 seconds per patient on average, and reduced false-negative rates by 29 percent when used as a clinician assistant.
Sensitivity is the proportion of actually-positive cases that the test correctly identifies as positive. (If 100 cases truly have a fracture and the AI catches 90, sensitivity is 0.90.) Specificity is the proportion of actually-negative cases the test correctly identifies as negative.
These two numbers trade off against each other. A test that calls everything positive has 100 percent sensitivity and 0 percent specificity. A test that calls everything negative has 100 percent specificity and 0 percent sensitivity. A useful diagnostic test has both, at a clinically appropriate operating point — and in screening contexts (where missing a disease is worse than a false alarm), high sensitivity matters most.
MRMC stands for Multi-Reader Multi-Case study design. It is the FDA-recognized methodology for proving an AI product is non-inferior to (or superior to) human readers. Multiple radiologists independently read the same set of cases with and without AI assistance; statistical methods (Dorfman-Berbaum-Metz [DBM], Obuchowski-Rockette [OR], Hillis) account for the fact that the same case is read multiple times and the same reader reads multiple cases. Human-side AI clearance routinely involves MRMC studies.
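A full MRMC analysis is run with validated implementations of those methods; the sketch below, on entirely simulated reader scores, conveys only the core idea that uncertainty has to be estimated over both clusters at once. It uses a simplified case-and-reader bootstrap rather than the DBM or OR machinery an actual submission would use, and every number in it is invented.

```python
# Simplified sketch of the MRMC idea on simulated data: resample cases AND readers
# together so the uncertainty estimate respects the clustered structure (every case
# is read by several readers, every reader reads many cases). Real submissions use
# validated DBM / Obuchowski-Rockette / Hillis implementations; this is intuition only.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n_cases, n_readers = 200, 6

y = rng.integers(0, 2, size=n_cases)  # hypothetical ground truth (0 normal, 1 abnormal)

# Hypothetical reader suspicion scores (rows = readers, columns = cases).
unaided = np.clip(rng.normal(0.40 + 0.20 * y, 0.25, size=(n_readers, n_cases)), 0, 1)
# Simulate AI assistance as modestly better separation of the abnormal cases.
aided = np.clip(unaided + 0.10 * y + rng.normal(0, 0.05, size=(n_readers, n_cases)), 0, 1)

def mean_reader_auc(scores: np.ndarray, truth: np.ndarray) -> float:
    return float(np.mean([roc_auc_score(truth, row) for row in scores]))

observed = mean_reader_auc(aided, y) - mean_reader_auc(unaided, y)

deltas = []
for _ in range(2000):
    c = rng.integers(0, n_cases, size=n_cases)      # resample cases with replacement
    r = rng.integers(0, n_readers, size=n_readers)  # resample readers with replacement
    yb = y[c]
    if yb.min() == yb.max():                        # AUC needs both classes present
        continue
    deltas.append(mean_reader_auc(aided[np.ix_(r, c)], yb)
                  - mean_reader_auc(unaided[np.ix_(r, c)], yb))

lo, hi = np.percentile(deltas, [2.5, 97.5])
print(f"AUC gain with AI assistance: {observed:.3f} (95% CI {lo:.3f} to {hi:.3f})")
```

The design choice that matters is the double resampling: a confidence interval built as if every reading were independent, the way a plain z-test treats the data, will be too narrow.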
Class imbalance matters because — as the Joslyn et al. commentary on the SignalPET-funded study points out — if 84 percent of test cases are normal and only 16 percent are abnormal, a trivial classifier that calls everything normal will achieve 84 percent accuracy. Reporting “84 percent accuracy” against such a test set without disclosing prevalence is misleading. The standard is to report sensitivity, specificity, and area under the curve at multiple operating points, and to test against externally curated cases with documented prevalence — all of which BoneView’s submission did and which most veterinary AI products’ marketing materials do not.
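As a small illustration of that reporting standard, the sketch below computes sensitivity and specificity with exact Clopper-Pearson intervals on hypothetical counts at the same 84/16 prevalence. The counts are invented and do not describe any vendor’s product.

```python
# Illustrative only: exact (Clopper-Pearson) 95% intervals for sensitivity and
# specificity on a hypothetical 1,000-case test set with 84/16 prevalence.
from scipy.stats import beta

def clopper_pearson(k: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact two-sided confidence interval for a binomial proportion k/n."""
    lower = 0.0 if k == 0 else beta.ppf(alpha / 2, k, n - k + 1)
    upper = 1.0 if k == n else beta.ppf(1 - alpha / 2, k + 1, n - k)
    return lower, upper

tp, fn = 110, 50    # hypothetical: abnormal cases caught vs. missed (160 abnormal)
tn, fp = 790, 50    # hypothetical: normal cases cleared vs. falsely flagged (840 normal)

sens, spec = tp / (tp + fn), tn / (tn + fp)
s_lo, s_hi = clopper_pearson(tp, tp + fn)
p_lo, p_hi = clopper_pearson(tn, tn + fp)

print(f"sensitivity {sens:.3f} (95% CI {s_lo:.3f} to {s_hi:.3f})")
print(f"specificity {spec:.3f} (95% CI {p_lo:.3f} to {p_hi:.3f})")
print(f"trivial 'call everything normal' accuracy: {(tn + fp) / (tp + fn + tn + fp):.2f}")
```

Reported this way, the trivial 84 percent baseline sits in the same table as the model’s actual sensitivity, which is the entire point of disclosing prevalence alongside the operating point.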
Even after FDA clearance, BoneView’s compliance obligations did not end. The FDA’s January 2025 draft guidance on AI-Enabled Device Software Functions imposes ongoing post-market surveillance requirements: continuous performance monitoring, Medical Device Reporting (MDR) for clinically significant errors, and — for adaptive algorithms — pre-approved Predetermined Change Control Plans that specify in advance which kinds of model updates are permissible without re-clearance. Algorithm version tracking is mandatory. Every clinical use can in principle be associated with the specific frozen version of the model that produced the output.
That is the standard. Public dataset, peer-reviewed foundational paper, registered prospective clinical trial, MRMC reader study, FDA 510(k) submission with sensitivity/specificity at documented operating points, intended-use labeling that legally constrains marketing, professional society oversight, ongoing post-market surveillance, and version control. It is not a perfect system, and human-side AI has its own failure modes — but the system exists, it is documented, and it produces a defensible scientific record that any clinician, regulator, or plaintiff’s lawyer can audit.
The Veterinary Side: What Is Actually Public, and What Is Not
Now turn to veterinary AI. The largest commercial veterinary AI radiology vendors describe their training datasets in marketing materials with impressive numbers and almost no specifics.
SignalPET reports its AI was trained on “over 2 million annotated veterinary radiographs” — the largest such corpus in the world, by their account. The annotation methodology is not disclosed. The labeling protocol is not published. The breakdown by species, breed, body region, equipment, image quality, and pathology prevalence is not available. The data is not accessible to academic researchers. There is no version of this dataset that an independent group could download, train their own model on, and publish a comparison.
Antech’s RapidRead reports training on “16 million images sourced from an unprecedented library of more than 8 billion images” — the latter figure presumably representing the cumulative imaging volume Antech has handled across its veterinary imaging services business. The training-set composition is not disclosed. Antech has stated that “our team of board-certified radiologists are continually training and measuring the accuracy of the model” but has not published the methodology for that measurement, the test sets used, or the performance results in a peer-reviewed forum.
Vetology states its AI was “built using a foundation of over 300,000 Board Certified veterinary radiologist-reviewed cases” and that it utilizes “38 different deep-learning architectures.” The training-set composition is not disclosed. Vetology’s CEO, Dr. Seth Wallack — a board-certified veterinary radiologist — has published a position on the company’s website that “an AI product MUST be 100% autonomous to have a valid result. If a human intervenes during any part of the result creation, it’s not artificial intelligence, it’s human intelligence.” This is a design philosophy, not a validation methodology.
The Joslyn et al. commentary in Frontiers in Veterinary Science, published in June 2025, is uncompromising on what this opacity means in practice. Reviewing the Ndiaye et al. study (the SignalPET/Edinburgh head-to-head comparison that SignalPET cites in its marketing), Joslyn and colleagues — including Dr. Ryan Appleby, lead author of the ACVR/ECVDI position statement — write that the AI software is “proprietary and ‘continuously updated and does not have version numbers’… the absence of fixed versioning or a detailed algorithm description prevents replication and raises concerns about whether future iterations will behave similarly.” They observe that “transparency about training data is also limited, described broadly as a large, multi-institutional dataset.” They note that the authors of the original study claimed to follow the CLAIM checklist (Checklist for Artificial Intelligence in Medical Imaging) but in fact omitted multiple elements that the checklist requires.
The “continuously updated and does not have version numbers” issue deserves particular attention. In the FDA-regulated human medical device world, this would not be a feature; it would be a regulatory disqualification. Any product that updates its algorithm continuously, with no version traceability and no notification to clinicians using it, has failed the most basic post-market surveillance requirement: the ability to associate a specific clinical output with a specific frozen version of the software. If the AI flagged a fracture last Tuesday and missed one this Tuesday, was the algorithm the same? In FDA-cleared products, the answer is documented. In veterinary AI products that are continuously and silently retrained, the answer is “we don’t know.”
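In engineering terms, the fix is not exotic. The sketch below is a hypothetical minimal record of what version traceability requires: every AI output stored alongside the frozen model version and a checksum of the exact weights that produced it. The field names are illustrative and do not correspond to any vendor’s actual schema.

```python
# Hypothetical sketch of minimal version traceability for an AI read: every output
# is stored with the exact model identity that produced it, so a later question
# ("which algorithm read this study?") has a documented answer. Field names are
# illustrative only; no vendor's actual schema is implied.
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass(frozen=True)
class AIReadRecord:
    study_uid: str          # DICOM Study Instance UID of the radiograph series
    model_name: str
    model_version: str      # frozen, human-readable version string
    weights_sha256: str     # checksum of the exact weight file used
    finding: str
    created_at: str

def weights_checksum(path: str) -> str:
    """Checksum the deployed weight file so the record can be audited later."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

record = AIReadRecord(
    study_uid="1.2.840.0000.example",
    model_name="thoracic-screen",
    model_version="2026.01.3",
    weights_sha256="<sha256 of deployed weights>",   # e.g. weights_checksum("model.onnx")
    finding="alveolar pattern present",
    created_at=datetime.now(timezone.utc).isoformat(),
)
print(json.dumps(asdict(record), indent=2))
```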
The SignalPET/Edinburgh Study, Read Carefully
The Ndiaye et al. study published in Frontiers in Veterinary Science in February 2025 is the only externally co-authored peer-reviewed validation study of a major commercial veterinary AI radiology product to date. SignalPET’s marketing prominently cites it. The study’s findings, as reported in the SignalPET marketing materials, are uniformly favorable: “SignalPET’s AI matched the best radiologist in overall accuracy,” “Exceptional Specificity,” “Consistent Performance” in challenging cases, “Reliable Results.”
The Joslyn et al. peer-reviewed commentary on that same study, published four months later in the same journal, identified multiple methodological problems serious enough to call the conclusions into question. The commentary is open access and worth reading in full. Several of its specific findings are worth surfacing here, because they cut directly against the marketing claims SignalPET has built around the study.
Circular ground truth. The study did not validate against an independent gold standard such as surgical or pathological confirmation. Instead, “ground truth” was defined as the majority opinion of the participating radiologists — and the AI’s own output was included in establishing the consensus. As Joslyn et al. note, “the ground truth should be independent from the variable being evaluated, so including the AI’s own output in establishing the correct answer is a form of circular logic — the tool being evaluated helps decide whether its prediction is considered correct.”
Severe class imbalance. Of all reported findings in the study, 84 percent were determined on consensus to be normal and only 16 percent abnormal. As Joslyn et al. observe, “the authors even note that a naive strategy of calling everything ‘normal’ would be correct 84 percent of the time on this dataset. Indeed, one of the participating radiologist’s accuracy was not significantly better than this trivial baseline.” The headline accuracy claim — “AI matched the best radiologist” — does not survive contact with the prevalence data.
Low sensitivity, declining further on hard cases. The AI’s overall sensitivity was 0.688 — meaning it correctly identified about 69 percent of true abnormalities. In low-ambiguity cases (where the radiologists mostly agreed), sensitivity dropped to 0.578. In high-ambiguity cases (where the radiologists disagreed), sensitivity collapsed to 0.444. As Joslyn et al. write: “High sensitivity is essential for ruling out disease and is a critical requirement for any screening test. However, the AI demonstrated overall low sensitivity (0.688), which declined further in both low-ambiguity (0.578) and high-ambiguity (0.444) settings. Therefore, the authors’ conclusion suggesting the use of AI as a screening tool is contradictory.”
To translate: an AI marketed as a screening tool for use by general-practice veterinarians is correctly identifying fewer than half of abnormalities in cases where experienced specialists disagree about what is on the film. Those are exactly the cases where a screening tool needs to be most reliable, because they are the cases the GP is least equipped to interpret independently.
Inadequate statistics. The study used z-tests for proportions to evaluate differences between observers — a basic test that assumes independence of observations. The data violated independence in two ways: each case was read by multiple radiologists, and each radiologist read multiple cases. Joslyn et al. note that this is precisely the data structure that requires Multi-Reader Multi-Case methodology — generalized estimating equations, two-family gatekeeping, bootstrapping — and that “the statistical significance reported is likely overstated” as a result. They also note that the study reports “no statistically significant difference between the AI and the single best radiologist (p ≈ 0.08 in one metric), and declared them equivalent, but with only 50 cases this could simply reflect limited power rather than true equality.” A short simulation following this list of findings shows just how little statistical power 50 cases provide.
No version traceability. The AI software was, by the study’s own description, “continuously updated and does not have version numbers.” The authors used the July 2022 version. Whether the version a clinic is using today produces the same outputs is unknown and unknowable.
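The power problem raised in the statistics finding above is easy to demonstrate. The simulation below uses hypothetical reader accuracies of 0.80 and 0.90 and, like the study’s z-tests, treats the two sets of reads as independent samples of 50 cases; most runs find no significant difference even though a real difference exists by construction.

```python
# Minimal sketch of the power problem: with only 50 cases, a genuine accuracy gap
# (here a hypothetical 0.80 vs 0.90) is usually NOT detected by a two-proportion
# z-test, so "no significant difference" cannot be read as "equivalent".
# Numbers are illustrative, not taken from the study.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n_cases, n_sims = 50, 10_000
acc_a, acc_b = 0.80, 0.90          # hypothetical true accuracies of two readers

def two_prop_z_pvalue(k1: int, k2: int, n: int) -> float:
    p1, p2 = k1 / n, k2 / n
    pooled = (k1 + k2) / (2 * n)
    se = np.sqrt(pooled * (1 - pooled) * (2 / n))
    if se == 0:
        return 1.0
    z = (p1 - p2) / se
    return 2 * (1 - norm.cdf(abs(z)))

rejections = 0
for _ in range(n_sims):
    k_a = rng.binomial(n_cases, acc_a)
    k_b = rng.binomial(n_cases, acc_b)
    if two_prop_z_pvalue(k_a, k_b, n_cases) < 0.05:
        rejections += 1

print(f"Power to detect 0.80 vs 0.90 with 50 cases: {rejections / n_sims:.2f}")
# Typically well under 0.5: most runs "find no difference" even though one exists.
```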
Joslyn and colleagues’ conclusion is restrained but unmistakable: “Given the fundamental flaws highlighted — lack of independent ground truth, small sample size, methodological biases, and inadequate statistical analysis — the study’s conclusions are highly questionable, necessitating extreme caution regarding clinical uptake. Over-enthusiastic interpretation of results could mislead practitioners or policymakers into adopting AI tools prematurely or with insufficient oversight.”
This is not a fringe critique. The commentary’s lead author published a 2022 paper in the official journal of the ACVR (Veterinary Radiology & Ultrasound) on evaluating veterinary AI algorithms. The senior author is Ryan Appleby, the same Ryan Appleby who is the lead author of the 2025 ACVR/ECVDI position statement on AI. The Ontario Veterinary College and Murdoch University, where the commentary’s authors are based, are mainstream academic veterinary institutions. This is the established veterinary AI research community telling the rest of the profession that the only published external validation of a major commercial product is methodologically inadequate to support the marketing claims being made about it.
Pathology, Necropsy, and the “Ground Truth” Veterinary AI Has Almost Never Used
The ground-truth problem in the SignalPET/Edinburgh study points to a deeper issue across veterinary AI development. In human radiology AI, the gold standard for ground truth is, wherever possible, an independent confirmatory test: tissue pathology for masses, surgical findings for fractures and obstructions, autopsy for catastrophic missed diagnoses, long-term clinical outcome for screening tasks. Radiologist consensus is used when nothing better is available, but it is not the preferred reference, because radiologists agree with each other more readily than they agree with the actual disease state — a phenomenon called inter-observer correlation, and one of the reasons why “the AI agrees with the radiologists” is a weaker validation than “the AI agrees with the necropsy report.”
A 2023 paper in Veterinary Radiology & Ultrasound by Cohen, Fischetti, and Daverio at the Animal Medical Center in New York investigated exactly this. The authors compared veterinary radiologist reports against necropsy findings for a large series of cases and quantified the radiologist error rate using the gold standard of tissue confirmation. Their findings established that even experienced board-certified veterinary radiologists make meaningful errors when their interpretations are checked against pathology — which is, of course, the entire point of using pathology as ground truth in the first place. If radiologists themselves are imperfect against necropsy, an AI trained against radiologist consensus inherits all of those errors, plus whatever errors the AI itself introduces, plus errors introduced by labeling protocols.
Veterinary AI vendors have, with rare exceptions, not used pathology-confirmed datasets for training or validation. The reasons are partly practical: pathology confirmation requires a referral hospital or a teaching institution, and most companion-animal cases that go through general practice never receive necropsy. Building a pathology-confirmed training set of 100,000 veterinary radiographs is logistically harder than building a radiologist-consensus-labeled set of 2 million. The result is that the entire commercial veterinary AI radiology field is built on top of a labeling methodology that human radiology AI considers a fallback rather than a primary reference standard.
This matters because it puts a ceiling on how good veterinary AI can ever get under the current development paradigm. An AI trained to agree with radiologists will, at best, be as accurate as the radiologists who labeled its training data. It cannot exceed that ceiling, because the ceiling is built into the labels. To go above the ceiling — to actually catch findings that even good radiologists miss — the field would need pathology-confirmed datasets, prospective outcome-tracking studies, and the willingness to accept that “the AI disagreed with the radiologist and was right” is sometimes the correct verdict. None of that infrastructure exists in commercial veterinary AI today.
The Asymmetry, in Engineering Terms
Pulled together, the comparison between human-side and veterinary-side AI development looks like this. Each row represents a piece of scientific infrastructure that the human radiology field built before — and during — its AI clearance era, and that the veterinary field has either skipped or is not currently doing.
| Development Pillar | Human Radiology AI | Veterinary Radiology AI |
|---|---|---|
| Public training datasets | NIH ChestX-ray14 (112,120 images), MIMIC-CXR (377,000), CheXpert (224,316), RSNA-STR PE (1.8M images), and dozens more — all open access | None at scale. All major commercial training sets are proprietary and unavailable for independent audit |
| Open AI competitions | Annual RSNA AI Challenges since 2017 on Kaggle, with 700–1,800 international teams per event; MICCAI Grand Challenges; SIIM-RSNA collaborations | No veterinary equivalent of any kind. No public leaderboard, no annual challenge, no organized academic competition |
| Foundational benchmark papers | CheXNet (Stanford 2017), CheXNeXt (PLOS Med 2018), and hundreds of subsequent peer-reviewed papers improving on the public baseline | Limited literature; only one major commercial vendor has published any external validation study, and that study is itself the subject of a published peer-reviewed methodological critique |
| Working groups and standards | RSNA Machine Learning Steering Subcommittee; CLAIM checklist for AI manuscript reporting; FDA/Health Canada/MHRA Good Machine Learning Practice principles | ACVR AI Education and Development Committee published one position statement in 2025; no veterinary CLAIM equivalent; no veterinary GMLP equivalent |
| Ground truth methodology | Pathology, surgical, or outcome-based confirmation preferred; multi-radiologist consensus with documented adjudication used as fallback; prevalence and operating points reported | Radiologist-consensus labels predominate; pathology-confirmed datasets at scale are rare to nonexistent in the commercial training pipelines |
| Statistical methodology | MRMC reader studies with DBM/OR/Hillis methods; Clopper-Pearson confidence intervals; AUC-ROC across operating points; FROC for detection tasks | Inadequate. The single major published validation used z-tests for proportions, ignoring the multi-reader multi-case data structure that requires more sophisticated analysis |
| Algorithm versioning | Locked algorithm or pre-approved Predetermined Change Control Plan required; every clinical use can be tied to a specific frozen software version | Not required. Major vendor product is “continuously updated and does not have version numbers.” Reproducibility and post-hoc audit are impossible |
| Intended use labeling | FDA 510(k) labeled intended use language (“concurrent reading aid,” “computer-aided detection”) legally binds marketing and clinical deployment | No intended-use labeling regime. Marketing claims are unconstrained by any regulatory document |
| Reader study before launch | MRMC clinical validation, often via ClinicalTrials.gov-registered prospective studies, required before clearance | No equivalent regime. Some vendor-funded studies exist; methodology criticized in published peer-reviewed commentary |
| Post-market surveillance | FDA Medical Device Reporting; manufacturer monitoring; real-world performance tracking; algorithm drift detection | Not required. No central reporting system. No public registry of veterinary AI errors or patient harms |
Every row of that table is a piece of infrastructure the human side built — sometimes through the FDA, sometimes through professional societies, sometimes through academic medical centers volunteering datasets to the public. The veterinary side either skipped each one or has only a recent, limited, single-instance equivalent. The market did not wait for any of this to be built before commercial products launched, and the products that have launched have consequently been built without the upstream infrastructure that makes good medical AI possible.
The Real-World Performance Gap
This engineering asymmetry is not theoretical. It produces measurable differences in product behavior that should concern any veterinarian using these tools.
The first is sensitivity collapse on hard cases — the precise pattern Joslyn et al. documented in the SignalPET/Edinburgh study. The AI performs adequately on easy cases and degrades on the cases where the radiologists themselves disagree. This is the exact opposite of what a screening tool should do. A screening tool should be most useful precisely in the cases where the GP cannot reliably interpret the film themselves — which are the cases where the radiologists also disagree. An AI whose sensitivity drops to 0.444 on hard cases is useful as a confidence booster on easy cases and a liability on hard ones.
The second is silent algorithm drift. A veterinary AI product that updates “continuously” without version tracking is, from the user’s perspective, a different product every week. The clinic that adopted SignalPET in 2023 and validated it against their own patient population is using a different algorithm in 2026. There is no notification when the model is retrained. There is no released change log. There is no mechanism for the clinic to revert to the version they originally validated. From an engineering standpoint, this violates basic principles of safety-critical software deployment.
The third is hidden error patterns. Without public datasets and independent audit, systematic failure modes — “the AI underperforms on dachshunds compared to retrievers,” “the AI struggles with feline thoracic radiographs taken on certain DR systems,” “the AI’s specificity collapses on geriatric patients” — go undetected. Human-side AI catches these patterns through external validation studies. Veterinary AI does not run those studies, so the patterns persist invisibly until enough individual cases accumulate to make them anecdotally obvious.
The fourth is the absence of differentials. Even Ndiaye et al. acknowledged in their study that the AI “does not provide differential diagnoses.” This is a fundamental architectural difference between AI screening and radiologist consultation. A radiologist looking at a thoracic radiograph generates a differential — alveolar pattern with right middle lobe distribution, prioritizing aspiration, hemorrhage, pneumonia, and atelectasis in approximately that order based on signalment and history. The AI returns a label: “alveolar pattern present.” The clinical work of integrating that finding with the patient’s history, signalment, physical exam, and other diagnostics — the actual diagnostic act — has been removed from the AI output entirely. What the AI produces is closer to a check-the-box screening result than to a radiologist consult, regardless of how the report is formatted.
Why the Working-Group Question Matters
One of the things that makes the human-side scientific infrastructure work is the existence of organized working groups that operate outside any single vendor’s commercial interest. The RSNA Machine Learning Steering Subcommittee. The DICOM standards working groups, including Working Group 23 (Artificial Intelligence/Application Hosting) and Working Group 25 (Veterinary Medicine), the latter one of the few standards bodies whose remit reaches veterinary imaging at all. The CLAIM checklist authors. The FDA/Health Canada/MHRA joint Good Machine Learning Practice principles working group. Each of these bodies sets standards, defines reporting requirements, and audits vendor claims against documented criteria — without itself selling product.
The veterinary AI field has, as of 2026, exactly one organized body that has published authoritative guidance: the ACVR/ECVDI Artificial Intelligence Education and Development Committee, which produced the 2025 position statement. That committee has no enforcement authority. It cannot certify products. It cannot require disclosure of training data. It cannot mandate reader studies before launch. It can — and did — publish a position statement saying no current commercial product meets the standard for use in practice. The vendors continued operating.
Compare this to what the RSNA built. The RSNA Machine Learning Steering Subcommittee curates public datasets, organizes annual challenges, peer-reviews competition results in Radiology: Artificial Intelligence, publishes the CLAIM checklist that journals use to evaluate AI manuscripts, and convenes vendor-academic working groups to develop performance standards. Its work is funded by RSNA membership dues, supplemented by Kaggle’s contribution of platform infrastructure. It produces concrete public goods — datasets, benchmarks, standards — that any researcher can use. It is, in effect, the upstream science that makes downstream FDA review possible.
Building an analogous structure on the veterinary side would not require new law. It would require funding (modest, by comparison to commercial AI development budgets), institutional commitment from at least one major veterinary academic center to host a public dataset, and willingness from at least one specialty college to organize a challenge and publish standards. The commercial vendors have shown no inclination to seed this infrastructure, because their proprietary datasets are their competitive advantage and their unaudited accuracy claims are their marketing edge. The infrastructure will be built — if it is built — by the academic community, the specialty colleges, and possibly the AVMA.
What an Honest Validation Study Would Look Like
For veterinary AI to be evaluated to anything resembling the standard human-side AI is held to, the necessary study design is well-understood from twenty years of human medical imaging informatics. The components are not exotic; they are missing.
A defensible external validation study of a commercial veterinary AI radiology product would have, at minimum, the following components. A test set of at least several hundred cases — ideally from multiple institutions, not just the academic center collaborating with the vendor. Pathology, surgical, or long-term clinical outcome confirmation as ground truth wherever possible, with radiologist consensus only as a fallback. Documented case-selection methodology with disease prevalence reported, so accuracy claims can be evaluated against the trivial baseline of always-predicting-the-most-common-class. Multi-reader multi-case study design with appropriate statistical methods (DBM, OR, Hillis), not z-tests for proportions. AUC-ROC across operating points, sensitivity and specificity reported with confidence intervals, and breakdown by anatomical region and pathology class. A locked algorithm version for the duration of the study, with the version number reported in the manuscript. Adherence to the CLAIM checklist for medical imaging AI, with all 42 items either reported or explicitly disclosed as omitted. Publication in a peer-reviewed venue with the dataset (or at least the test set) made available to other researchers for replication.
None of this is unusual. All of it is standard practice in human medical imaging AI research. The Ndiaye et al. study failed on most of these dimensions, and was published anyway, in part because the veterinary AI literature has not yet matured to the point where reviewers consistently catch these failures. Joslyn et al.’s commentary is a marker that the literature is starting to mature. It is not yet at the point where vendor-funded studies of inadequate methodology would be rejected at the editorial stage.
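One concrete slice of that standard reporting, sketched below on invented data, is the per-class breakdown: sensitivity, specificity, and prevalence computed from case-level predictions for each pathology class, so every accuracy claim can be checked against the trivial baseline for that class. The column names and values are illustrative only.

```python
# Illustrative only: the per-class breakdown an honest validation report would
# include, computed from case-level predictions. Column names and data are invented.
import pandas as pd

df = pd.DataFrame({
    "pathology": ["pleural effusion"] * 6 + ["pneumothorax"] * 6,
    "truth":     [1, 1, 1, 0, 0, 0,   1, 1, 0, 0, 0, 0],
    "predicted": [1, 1, 0, 0, 0, 1,   1, 0, 0, 0, 0, 0],
})

def per_class_metrics(g: pd.DataFrame) -> pd.Series:
    tp = int(((g.truth == 1) & (g.predicted == 1)).sum())
    fn = int(((g.truth == 1) & (g.predicted == 0)).sum())
    tn = int(((g.truth == 0) & (g.predicted == 0)).sum())
    fp = int(((g.truth == 0) & (g.predicted == 1)).sum())
    return pd.Series({
        "n": len(g),
        "prevalence": (tp + fn) / len(g),
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    })

report = pd.DataFrame({name: per_class_metrics(g) for name, g in df.groupby("pathology")}).T
print(report)
```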
Three Things a Sophisticated Veterinary Buyer Should Demand
For the practicing veterinarian deciding whether to adopt a commercial AI radiology product — or for a hospital network considering deployment at scale — several specific disclosures should be requested before signing.
First, the algorithm version and update policy. If the answer is “we update continuously and do not version” — the answer Vetology, SignalPET, and most other vendors give — that should be a flag. Ask whether the vendor will, at any point in the contract term, freeze the version the clinic is using and notify the clinic of any update. Most vendors will not commit to this, because their continuous-update model is a competitive feature, not a bug. The clinic should understand that they are using a different product every week and that the validation studies they read may not describe the version they are actually using.
Second, the confusion matrix. Vetology’s own CEO has said it himself: ask for the confusion matrix. A confusion matrix is a 2×2 (or larger) table showing true positives, true negatives, false positives, and false negatives by pathology class. From it, sensitivity, specificity, positive predictive value, and negative predictive value can be computed (a short sketch after this list shows exactly what those four counts yield). A vendor that cannot or will not produce a confusion matrix on its own product, broken out by pathology class and tested against external data, has not validated its product to a defensible standard. The clinic should treat that as a disqualifying gap.
Third, the methodology of any cited validation study. If the vendor cites a peer-reviewed study, read the study itself rather than the marketing summary. Read the published commentaries on the study, if any exist. The Joslyn et al. commentary on the Ndiaye et al. SignalPET study is open access. Reading both — the original and the critique — takes about ninety minutes and changes the calculus on what the original paper actually establishes. The clinic that signs up based on the marketing summary alone has not done the diligence the situation warrants.
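On the confusion-matrix request in the second item, the sketch below shows what the four counts let a buyer compute. The numbers are hypothetical; the point of the request is to obtain real ones, per pathology class, measured on an external test set with documented prevalence.

```python
# Illustrative only: what the four cells of a confusion matrix let a buyer compute.
# These counts are hypothetical.
def metrics_from_confusion(tp: int, fp: int, fn: int, tn: int) -> dict:
    return {
        "sensitivity": tp / (tp + fn),   # of truly abnormal cases, share flagged
        "specificity": tn / (tn + fp),   # of truly normal cases, share cleared
        "ppv": tp / (tp + fp),           # of cases the AI flags, share truly abnormal
        "npv": tn / (tn + fn),           # of cases the AI clears, share truly normal
    }

# Hypothetical single-class example on 1,000 external cases.
print(metrics_from_confusion(tp=45, fp=30, fn=15, tn=910))
```

Positive and negative predictive values shift with prevalence, so the same algorithm will produce different predictive values in a referral-hospital caseload than in a general-practice caseload. That is one more reason the prevalence of the vendor’s test set has to be disclosed alongside the matrix.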
The Field Will Catch Up — But It Has Not Yet
None of this is an argument against AI in veterinary radiology. It is an argument that AI in veterinary radiology should occupy the same role it occupies in human radiology: assistive, augmentative, concurrent-reading, triage, measurement, workflow optimization. The 2025 ACVR/ECVDI position statement is explicit that AI should always be used with a qualified veterinary professional, preferably a board-certified radiologist, in the loop. The Joslyn et al. commentary states that “AI remains a promising adjunct, not a replacement, for veterinary radiologists.” Both documents call for the field to adopt the standards human radiology AI has spent twenty years building. Neither document endorses, anywhere, the proposition that veterinary AI should serve as a primary diagnostic reader without specialist oversight.
The argument that AI-primary reads “expand access” for clinics without ready access to a board-certified radiologist deserves direct examination, because it is the central justification the vendors offer for the entire model. It does not survive scrutiny. Every clinic currently using SignalPET, Vetology, or Antech RapidRead has, by definition, installed the imaging hardware, configured DICOM transmission, integrated cloud-based image submission into its workflow, and trained its staff to send studies to a remote service. The technical capability to send images to a remote destination exists in every one of those clinics. The destination is a vendor choice, not an access constraint. A clinic capable of submitting cases to SignalPET’s cloud is equally capable of submitting them to PetRad, Golden Hour, Vets Choice Radiology, Cornell University’s teleradiology service, or any of the dozens of independent and academic teleradiology services that provide DACVR consultations. The infrastructure is identical. What differs is the price per study and the turnaround time — neither of which is an access barrier in the technical sense the word implies.
The veterinary AI vendors have promoted “access” framing precisely because it provides a clinical-public-good justification for what is, in operational reality, a price-driven substitution of algorithmic throughput for specialist labor. There is a legitimate conversation to be had about the cost of board-certified veterinary radiology and whether the supply of DACVRs is adequate to meet demand. That conversation is not the same conversation as whether veterinary patients should receive primary diagnostic reads from unvalidated AI without specialist oversight. Conflating the two is a category error, and the field should not allow it to stand unchallenged.
The deeper problem with the “access” framing is what it implies about the standard of care veterinary patients are owed. On the human side, no serious participant in medical AI policy argues that patients in underserved areas should receive autonomous AI diagnostic reports while patients in wealthy urban hospitals get specialist consults. The standard of care is the standard of care. Where access is a problem, the answer is to expand specialist coverage — through telemedicine, residency training, distributed reading networks — not to lower the standard of who is allowed to issue a diagnostic interpretation. The veterinary AI vendors are arguing, in effect, that veterinary patients in underserved settings should get a different and lower standard of diagnostic care than veterinary patients in well-served settings, with that standard delivered by software trained on undisclosed data, validated by methodologically inadequate studies, and continuously updated without version traceability. That is not what “expanding access” means in any other branch of medicine. It should not mean that here.
What is happening today is that commercial products have outrun the science. The marketing has outrun the validation. The scale of deployment — 50,000 weekly radiographs through SignalPET alone — has outrun the public infrastructure that would let anyone independently audit whether the products work as claimed. The veterinary AI field is in roughly the position human radiology AI was in around 2014, before ChestX-ray14, before CheXNet, before the RSNA challenges, before the FDA cleared the first software-as-a-medical-device (SaMD) radiology device. The difference is that human radiology AI in 2014 was almost entirely academic; the products were not yet in clinical deployment. The veterinary AI field in 2026 is the opposite: products are deployed at industrial scale, and the academic infrastructure is just beginning to develop.
That is a strange and uncomfortable place for a profession to be. The 2025 ACVR/ECVDI position statement is the field’s first formal acknowledgment of the gap. The Joslyn et al. commentary is the first peer-reviewed methodological critique of a major commercial product’s validation evidence. These are signs of maturation. They are not yet maturity. The vendors selling AI-primary reads today are operating in a window between the products being available and the science catching up to evaluate them.
That window will close. When it does, products that cannot survive scrutiny against the standards human radiology AI now meets routinely will be in trouble. Veterinarians using those products without understanding their limitations will be in trouble too — not because the AI is necessarily wrong on any given case, but because the malpractice plaintiff’s bar will eventually figure out what the engineering literature already shows: that a product trained on undisclosed data, with non-existent versioning, validated against a methodologically inadequate study, is a product whose accuracy claims are essentially marketing rather than science. When that happens, the legal liability that has flowed downhill to the GP veterinarian will start flowing back uphill to the vendor, the clinic that adopted the vendor’s product, and the corporate acquirer that endorsed it. The professional society warned the field in 2025. The peer-reviewed commentary documented the methodological inadequacies in mid-2025. What happens next is the question of whether the rest of the profession was paying attention.
The Bottom Line, in Engineering Terms
Human radiology AI took roughly a decade to mature into a field with public datasets, open competitions, foundational peer-reviewed papers, working groups setting standards, FDA-cleared products with labeled intended use, MRMC validation requirements, locked-algorithm versioning, post-market surveillance, and active professional society oversight. By the time the FDA cleared the first chest x-ray AI device, the upstream science was already in place to support it. The regulator audited science the field had built; it did not invent the science from scratch.
Veterinary AI radiology skipped the upstream science. The commercial products went to market on proprietary datasets that were never released, without organized public competitions, without foundational peer-reviewed papers establishing benchmarks, without working groups setting reporting standards, without intended-use labeling, without MRMC validation as a precondition for sale, and without algorithm versioning that allows post-hoc audit of clinical errors. The 2025 ACVR/ECVDI position statement said this. The Joslyn et al. commentary documented it in specifics. Vendors continued selling.
The engineering reality is that veterinary AI radiology products in commercial use today were not built to the standards that would clear the bar in human medicine — and the standard veterinary patients are entitled to receive is not lower than the standard human patients receive. The same evidentiary expectations, validation methodologies, and oversight structures that protect human patients from premature AI deployment should protect veterinary patients too. The professional society has said so. The peer-reviewed methodological literature has said so. The documents are public and the conclusion they collectively reach is unambiguous: a veterinary diagnostic AI product that has not been built to a standard recognizable to human medical AI is not a product that should be issuing primary diagnostic interpretations on veterinary patients, regardless of how the marketing frames the access question. The infrastructure to do this work properly already exists for any clinic that wants to use it. The vendors offering AI-primary reads are not solving an access problem. They are selling a price-driven substitute for the standard of care, and the field should be unsparing in saying so.
Frequently Asked Questions
What is the difference between human-side and veterinary-side AI radiology development?
The human-side AI radiology field built the upstream scientific infrastructure for medical AI before commercial products were cleared for clinical deployment. That infrastructure includes public training datasets at scale (NIH ChestX-ray14 with 112,120 images, MIMIC-CXR with 377,000, CheXpert with 224,316, the RSNA-STR Pulmonary Embolism dataset with approximately 1.8 million annotated images), organized open competitions (the annual Radiological Society of North America AI Challenge running since 2017, with 700 to 1,800 international teams per event), foundational peer-reviewed benchmark papers (CheXNet from Stanford in 2017, CheXNeXt in PLOS Medicine in 2018, and hundreds of follow-on papers), working groups setting reporting standards (RSNA Machine Learning Steering Subcommittee, the CLAIM checklist for AI manuscript reporting, FDA/Health Canada/MHRA Good Machine Learning Practice principles), and FDA 510(k) clearance with locked algorithms and labeled intended use that legally constrains marketing claims. The commercial veterinary AI radiology field skipped most of this. Major commercial products from SignalPET, Vetology, and Antech RapidRead are built on proprietary training datasets that are not publicly released, without organized open competition between independent research teams, without externally validated public benchmarks, and without algorithm version traceability. The 2025 ACVR/ECVDI position statement published in JAVMA stated categorically that “currently, no commercially available AI products for veterinary diagnostic imaging meet the required standards for transparency, validation, or safety.” For the regulatory analysis of why this is operationally permissible on the veterinary side, see our coverage of the safeguards that don’t apply to veterinary AI radiology.
What is the CheXNet study and why does it matter for understanding AI radiology validation?
CheXNet is a foundational AI radiology benchmark study published in November 2017 by a team at Stanford University’s Machine Learning Group led by Pranav Rajpurkar in Andrew Ng’s laboratory. The team trained a 121-layer DenseNet convolutional neural network on the National Institutes of Health’s publicly released ChestX-ray14 dataset — 112,120 frontal-view chest radiographs from 30,805 unique patients, labeled for 14 thoracic pathologies — and reported that the resulting model exceeded the average performance of four practicing Stanford radiologists on the F1 metric for pneumonia detection. The follow-up paper, CheXNeXt, validated the model against three independent cardiothoracic specialist radiologists with an average of 15 years of experience and was published in PLOS Medicine in 2018. CheXNet matters as a reference point for AI radiology validation because of what happened after it was published: it was not a product launch but a benchmark. Hundreds of subsequent academic research groups downloaded the same NIH dataset, replicated CheXNet’s results, identified its weaknesses, and published improvements in peer-reviewed venues. As of 2025, eight years after the original publication, researchers are still publishing papers reproducing and improving CheXNet using the same public data, with full reproducibility code on GitHub. This is what credible AI radiology validation looks like — an open benchmark, a transparent baseline, public competition to improve it, and an honest accounting in peer-reviewed literature of what still does not work. The veterinary AI field has no equivalent foundational study or public benchmark.
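For readers who want to see what that benchmark amounts to in code, the following is a minimal sketch, not the Stanford team's released code: a DenseNet-121 backbone with its classifier head replaced for 14 independent binary findings, which is the public architecture the paper describes. The dummy inputs and labels are placeholders, and the ImageNet-pretrained initialization the paper uses is omitted so the sketch runs offline.

```python
# Minimal sketch of the model class CheXNet describes: a DenseNet-121
# backbone with its classifier head replaced by a 14-way multi-label output.
# Illustrative only; not the Stanford team's code. The paper initializes
# from ImageNet-pretrained weights, omitted here so the sketch runs offline.
import torch
import torch.nn as nn
from torchvision import models

NUM_FINDINGS = 14  # the 14 thoracic pathology labels in ChestX-ray14

model = models.densenet121(weights=None)
model.classifier = nn.Linear(model.classifier.in_features, NUM_FINDINGS)

# Multi-label setup: each finding is an independent yes/no prediction,
# trained with a per-label sigmoid via binary cross-entropy.
criterion = nn.BCEWithLogitsLoss()

# One forward pass on a dummy batch; radiographs are resized to the
# 3-channel 224x224 inputs the ImageNet-style backbone expects.
dummy_batch = torch.randn(4, 3, 224, 224)
logits = model(dummy_batch)            # shape: (4, 14)
probabilities = torch.sigmoid(logits)  # per-finding probabilities

# Dummy labels stand in for the dataset's per-image finding annotations.
dummy_labels = torch.randint(0, 2, (4, NUM_FINDINGS)).float()
loss = criterion(logits, dummy_labels)
```

The point of showing it is how little of this is secret: the architecture, the loss, and the dataset are all public, which is what made the hundreds of follow-on replications possible.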
What did the Joslyn et al. commentary find about the SignalPET validation study?
The Joslyn et al. commentary, published in June 2025 in Frontiers in Veterinary Science (Joslyn SK, Faulkner J, Ma D, Appleby R; Vol. 12, article 1615947), is a peer-reviewed methodological critique of the Ndiaye et al. study published February 2025 in the same journal — the only externally co-authored peer-reviewed validation study of a major commercial veterinary AI radiology product to date, which SignalPET cites in its marketing materials. The senior author of the commentary, Dr. Ryan Appleby, is the same Ryan Appleby who is the lead author of the 2025 ACVR/ECVDI position statement on AI. The commentary identifies several methodological problems. First, circular ground truth — the study did not validate against an independent gold standard such as surgical or pathological confirmation but instead defined ground truth as the majority opinion of the participating radiologists, with the AI’s own output included in establishing the consensus. Second, severe class imbalance — 84 percent of reported findings were normal and only 16 percent abnormal, meaning a trivial classifier that called every case normal would achieve 84 percent accuracy. Third, sensitivity collapse on difficult cases — the AI’s overall sensitivity was 0.688, dropping to 0.578 in low-ambiguity cases and 0.444 in high-ambiguity cases (precisely the cases where a screening tool needs to be most reliable). Fourth, inadequate statistics — the study used z-tests for proportions, ignoring the Multi-Reader Multi-Case data structure that requires generalized estimating equations and similar more sophisticated methods. Fifth, no version traceability — the AI software was “continuously updated and does not have version numbers.” The commentary’s conclusion: “Given the fundamental flaws highlighted — lack of independent ground truth, small sample size, methodological biases, and inadequate statistical analysis — the study’s conclusions are highly questionable, necessitating extreme caution regarding clinical uptake.”
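The class-imbalance point is easiest to see with arithmetic. The sketch below uses the reported 84/16 prevalence split and an arbitrary 1,000-case total to show that a classifier which calls every case normal scores 84 percent accuracy while detecting nothing; every number other than the prevalence is invented for illustration.

```python
# Worked illustration of the class-imbalance point from the commentary:
# with 84% of findings normal, a classifier that calls everything normal
# reaches 84% accuracy while detecting nothing. The prevalence is the
# reported figure; the 1,000-case total is an arbitrary round number.
total = 1000
abnormal = int(total * 0.16)   # 160 abnormal findings
normal = total - abnormal      # 840 normal findings

# "Always normal" baseline: every abnormal case becomes a false negative.
true_negatives = normal
false_negatives = abnormal
accuracy = true_negatives / total   # 0.84
sensitivity = 0 / abnormal          # 0.0 -- catches nothing

print(f"accuracy={accuracy:.2f}, sensitivity={sensitivity:.2f}")

# A reported sensitivity of 0.444 on high-ambiguity cases means more than
# half of the difficult abnormal findings are missed, even though headline
# accuracy can still look respectable because of the imbalance.
```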
What is the CLAIM checklist and why does it matter for AI radiology validation?
The CLAIM checklist — Checklist for Artificial Intelligence in Medical Imaging — is a 42-item reporting standard published by Mongan, Moy, and Kahn in Radiology: Artificial Intelligence in 2020 (Mongan J, Moy L, Kahn CE. Checklist for artificial intelligence in medical imaging (CLAIM): a guide for authors and reviewers. Radiol Artif Intell. 2020;2(2):e200029). It establishes the documentation requirements that any peer-reviewed manuscript reporting an AI medical imaging study should meet: how training data was collected and labeled, how validation was performed, what statistical methods were used, how ground truth was established, how the algorithm version was specified, what operating points were reported, and how performance metrics were calculated. CLAIM exists because the medical imaging research community recognized that AI manuscripts were being published with insufficient methodological detail to allow independent evaluation or replication. The checklist gives journal reviewers a structured way to assess whether a submission meets the basic transparency requirements for an AI claim. Adherence to CLAIM is now standard practice in human medical imaging journals. The Joslyn et al. commentary on the SignalPET validation study notes that the original study claimed to follow CLAIM but in fact omitted multiple required elements. There is currently no veterinary equivalent of CLAIM that vendor-funded studies are evaluated against, and consequently the veterinary AI literature has not yet matured to the point where reviewers consistently catch the kind of methodological inadequacies the Joslyn et al. commentary documented.
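To make concrete what a structured reporting check looks like in practice, here is an illustrative sketch only: a few CLAIM-style reporting categories, paraphrased rather than quoted from the checklist's 42 official items, represented as a simple self-check, with the notes drawn from the failures described elsewhere in this article.

```python
# Illustrative sketch of a CLAIM-style reporting self-check. The item
# wording below is paraphrased for illustration, not the official 42-item
# checklist text, and the notes reflect the failures discussed above.
from dataclasses import dataclass

@dataclass
class ReportingCheck:
    item: str
    reported: bool
    notes: str = ""

manuscript_checks = [
    ReportingCheck("Training data source and labeling process described", False,
                   "proprietary dataset, collection not disclosed"),
    ReportingCheck("Ground truth / reference standard defined", True,
                   "radiologist consensus only; no pathology confirmation"),
    ReportingCheck("Algorithm version specified", False,
                   "continuously updated, no version numbers"),
    ReportingCheck("Statistical methods appropriate to study design", False,
                   "z-tests used despite MRMC structure"),
]

missing = [c.item for c in manuscript_checks if not c.reported]
print(f"{len(missing)} of {len(manuscript_checks)} sampled items unreported:")
for item in missing:
    print(" -", item)
```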
What is ground truth in AI radiology validation, and why does it matter?
Ground truth in AI radiology validation refers to the reference standard against which an AI system’s predictions are evaluated to determine accuracy. Different reference standards have different reliability. The gold standard in human radiology AI is, wherever available, an independent confirmatory test — tissue pathology for masses and tumors, surgical findings for fractures and obstructions, autopsy for catastrophic missed diagnoses, and long-term clinical outcome for screening tasks. Radiologist consensus is used as a fallback when nothing better is available; it is the weakest acceptable standard because radiologist errors are correlated rather than independent: radiologists agree with each other more readily than they agree with the actual disease state. An AI trained against radiologist-consensus labels will, at best, be as accurate as the radiologists who labeled its training data. It cannot exceed that ceiling because the ceiling is built into the labels. Most commercial veterinary AI radiology products are trained and validated against radiologist-consensus labels rather than pathology-confirmed reference standards. This puts a fundamental ceiling on how good these systems can be: they are by construction calibrated to agree with radiologists, not to agree with the underlying disease state. A 2023 paper in Veterinary Radiology & Ultrasound by Cohen, Fischetti, and Daverio at the Animal Medical Center quantified veterinary radiologist error rates against necropsy findings — establishing that even experienced board-certified veterinary radiologists make meaningful errors when their interpretations are checked against tissue confirmation. To exceed that ceiling, the veterinary AI field would need pathology-confirmed datasets, prospective outcome-tracking studies, and the willingness to accept that “the AI disagreed with the radiologist and was right” is sometimes the correct verdict. None of that infrastructure exists in commercial veterinary AI today.
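A toy simulation makes the ceiling argument concrete. The sketch below invents a prevalence, a correlated panel miss rate, and a model that simply reproduces the consensus; none of the numbers come from any cited study. Measured against the consensus the model looks perfect, while measured against the true disease state it inherits every miss the panel shared.

```python
# Toy simulation of the label-ceiling problem: when the reference standard
# is radiologist consensus rather than the true disease state, a model that
# mimics the consensus looks better than it really is. All numbers here are
# invented for illustration and do not come from any cited study.
import random

random.seed(0)
N = 100_000
TRUE_PREVALENCE = 0.16     # roughly the abnormal fraction discussed above
SHARED_MISS_RATE = 0.25    # fraction of abnormals the panel jointly misses
                           # (correlated error: subtle lesions fool everyone)

agree_with_consensus = 0
agree_with_truth = 0
for _ in range(N):
    truly_abnormal = random.random() < TRUE_PREVALENCE
    # Correlated panel error: some abnormal cases are missed by the whole
    # panel, so the consensus label is wrong in the same direction.
    consensus_abnormal = truly_abnormal and random.random() >= SHARED_MISS_RATE
    # A model trained on consensus labels is, at best, a copy of them.
    model_abnormal = consensus_abnormal
    agree_with_consensus += model_abnormal == consensus_abnormal
    agree_with_truth += model_abnormal == truly_abnormal

print(f"agreement with consensus: {agree_with_consensus / N:.3f}")  # 1.000
print(f"agreement with truth:     {agree_with_truth / N:.3f}")      # lower
```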
Why does it matter that veterinary AI products are continuously updated and do not have version numbers?
The “continuously updated and does not have version numbers” description comes directly from the Ndiaye et al. study of SignalPET’s AI radiology software published in Frontiers in Veterinary Science in February 2025. In the FDA-regulated human medical device universe, this software practice would not be a feature; it would be a regulatory disqualification. Basic post-market surveillance requirements for AI medical devices require the ability to associate a specific clinical output with a specific frozen version of the software. The FDA’s January 2025 draft guidance on AI-Enabled Device Software Functions specifies that any adaptive algorithm requires a pre-approved Predetermined Change Control Plan (PCCP) defining in advance which kinds of model updates are permissible without re-clearance, with version tracking and clinician notification. Algorithm version tracking is mandatory; every clinical use can in principle be associated with the specific frozen version of the model that produced the output. The implications of not having this for veterinary AI products are direct. First, a clinic that adopted a product in 2023 and validated it against their own patient population is using a different algorithm in 2026 — there is no notification when the model is retrained, no released change log, no mechanism to revert to the version originally validated. Second, if the AI flagged a fracture last Tuesday and missed one this Tuesday, no one can determine whether the algorithm was the same on both days. Third, post-hoc audit of clinical errors is structurally impossible. The Joslyn et al. commentary specifically flags this issue: “the absence of fixed versioning or a detailed algorithm description prevents replication and raises concerns about whether future iterations will behave similarly.” From an engineering standpoint, a continuously updated unversioned medical AI product violates basic principles of safety-critical software deployment.
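For contrast, version traceability is not an exotic engineering requirement. The following is a minimal sketch, with invented field names and file paths rather than any vendor's actual schema, of an inference record that ties each clinical output to a content hash of the exact model weights that produced it.

```python
# Minimal sketch of version traceability: every inference record carries a
# content hash of the exact model weights that produced it, so a clinical
# output can be tied back to a frozen model. Field names, the release tag,
# and the file path are illustrative assumptions, not any vendor's schema.
import hashlib
import json
from datetime import datetime, timezone

def weights_fingerprint(weights_path: str) -> str:
    """Content hash of the serialized model weights actually loaded."""
    h = hashlib.sha256()
    with open(weights_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def inference_record(study_id: str, findings: dict, weights_path: str,
                     release_tag: str) -> str:
    """Audit record associating one study's output with one frozen model."""
    record = {
        "study_id": study_id,
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "model_release": release_tag,              # e.g. "2025.06.1"
        "weights_sha256": weights_fingerprint(weights_path),
        "findings": findings,
    }
    return json.dumps(record)

# Usage (commented out because the weights file here is hypothetical):
# if the model is retrained, the hash changes, and last Tuesday's output
# remains attributable to last Tuesday's weights.
# print(inference_record("CASE-001", {"pleural_effusion": 0.87},
#                        "model_2025_06_1.pt", "2025.06.1"))
```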
What is the FDA’s regulatory standard for AI radiology in human medicine, and what does it actually approve?
The U.S. Food and Drug Administration has authorized between approximately 700 and 950 radiology AI devices over the past decade — depending on the counting methodology used — across chest imaging, mammography, brain CT, cardiac, orthopedic, and dozens of other categories. Not one of those devices is labeled for autonomous diagnostic interpretation. The labeled intended use language is uniform across the entire portfolio: “concurrent reading aid,” “computer-aided detection,” “computer-aided triage,” “decision support during interpretation by qualified clinician.” Every device presumes a board-certified radiologist reads the study, evaluates the AI’s output, and signs the final report. A representative example is Gleamer’s BoneView, an AI fracture-detection algorithm cleared by the FDA in 2022 (510(k) Summary K212365). The clearance language states directly: “BoneView is intended for use as a concurrent reading aid during the interpretations of radiographs.” The algorithm flags suspicious areas with bounding boxes; the radiologist still reads the image, validates the AI’s flagged regions, and signs the report. The professional societies actively defend this boundary. In a joint letter to the FDA following the agency’s December 2024 workshop on AI integration in medical imaging, the American College of Radiology and the Radiological Society of North America jointly told the agency it is unlikely the FDA could provide reasonable assurance of the safety and effectiveness of autonomous AI in radiology patient care without more rigorous testing, surveillance, and other oversight than currently exists. The ACR’s own 2024 member survey found that 95 percent of radiologists who use AI in clinical practice would not use AI algorithms without a physician overread. The single FDA-approved instance of autonomous AI diagnostic activity in U.S. clinical medicine is IDx-DR, cleared in 2018 only for narrow diabetic retinopathy screening in primary care offices, only after years of pre-market clinical trials, with required ophthalmologist referral for any positive screen. For the regulatory contrast with veterinary AI radiology, see our coverage of the safeguards that don’t apply to veterinary AI radiology.
What would honest external validation of a veterinary AI radiology product look like?
A defensible external validation study of a commercial veterinary AI radiology product would have, at minimum, the following components — none of which are exotic, all of which are standard practice in human medical imaging AI research. First, a test set of at least several hundred cases — ideally drawn from multiple institutions rather than only the academic center collaborating with the vendor, to demonstrate generalizability beyond a single clinical setting. Second, pathology, surgical, or long-term clinical outcome confirmation as ground truth wherever possible, with radiologist consensus used only as a fallback when no better reference is available. Third, documented case-selection methodology with disease prevalence reported, so that accuracy claims can be evaluated against the trivial baseline of always-predicting-the-most-common-class — the failure mode the Joslyn et al. commentary identified in the Ndiaye et al. study. Fourth, Multi-Reader Multi-Case (MRMC) study design with appropriate statistical methods (DBM, Obuchowski-Rockette, or Hillis methodology), not z-tests for proportions that violate independence assumptions. Fifth, AUC-ROC reported across operating points, with sensitivity and specificity reported with confidence intervals, broken out by anatomical region and pathology class. Sixth, a locked algorithm version maintained for the duration of the study, with the version number reported in the manuscript so future iterations can be distinguished from the validated version. Seventh, adherence to the CLAIM checklist for medical imaging AI, with all 42 items either reported in the manuscript or explicitly disclosed as omitted. Eighth, publication in a peer-reviewed venue with the test set (or at least a representative subset) made available to other researchers for replication. The Ndiaye et al. study failed on most of these dimensions and was published anyway, in part because the veterinary AI literature has not yet matured to the point where reviewers consistently catch these failures. The Joslyn et al. commentary is a marker that the literature is starting to mature, but the field is not yet at the point where vendor-funded studies of inadequate methodology are rejected at the editorial stage.
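Two of those ingredients, confidence intervals on sensitivity and specificity and the trivial majority-class baseline, are simple enough to sketch directly. The code below uses simulated predictions as placeholders and does not attempt the MRMC analysis a real study would require.

```python
# Hedged sketch of two ingredients listed above: sensitivity and specificity
# with bootstrap confidence intervals, plus the trivial majority-class
# baseline that headline accuracy should be compared against. The arrays are
# simulated placeholders; a real study would use its locked test set and a
# full MRMC analysis, which this sketch does not attempt.
import numpy as np

rng = np.random.default_rng(42)

# y_true: True = abnormal (reference standard); y_pred: True = AI flags abnormal.
y_true = rng.random(500) < 0.16
y_pred = np.where(y_true, rng.random(500) < 0.69, rng.random(500) < 0.10)

def sens_spec(t, p):
    sens = (t & p).sum() / t.sum()
    spec = (~t & ~p).sum() / (~t).sum()
    return sens, spec

# Nonparametric bootstrap over cases for 95% confidence intervals.
boot = np.array([sens_spec(y_true[i], y_pred[i])
                 for i in (rng.integers(0, len(y_true), len(y_true))
                           for _ in range(2000))])
sens_ci = np.percentile(boot[:, 0], [2.5, 97.5])
spec_ci = np.percentile(boot[:, 1], [2.5, 97.5])

sens, spec = sens_spec(y_true, y_pred)
baseline_accuracy = max(y_true.mean(), 1 - y_true.mean())  # always-normal
print(f"sensitivity {sens:.3f} (95% CI {sens_ci[0]:.3f}-{sens_ci[1]:.3f})")
print(f"specificity {spec:.3f} (95% CI {spec_ci[0]:.3f}-{spec_ci[1]:.3f})")
print(f"trivial majority-class accuracy baseline: {baseline_accuracy:.3f}")
```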
Primary Documents Referenced
- Ndiaye YS, Cramton P, Chernev C, Ockenfels A, Schwarz T. Comparison of radiological interpretation made by veterinary radiologists and state-of-the-art commercial AI software for canine and feline radiographic studies. Front Vet Sci. 2025;12:1502790. Open access. SignalPET-funded study; the only major external validation of a commercial veterinary AI radiology product to date.
- Joslyn SK, Faulkner J, Ma D, Appleby R. Commentary: Comparison of radiological interpretation made by veterinary radiologists and state-of-the-art commercial AI software for canine and feline radiographic studies. Front Vet Sci. 2025;12:1615947. Open access. Peer-reviewed methodological critique by the ACVR/ECVDI position statement’s senior author and colleagues.
- Appleby RB, Difazio M, Cassel N, Hennessey R, Basran PS. American College of Veterinary Radiology and European College of Veterinary Diagnostic Imaging position statement on artificial intelligence. JAVMA. 2025;263(6):773–776. Open access.
- Cohen EB, Gordon IK. First, do no harm. Ethical and legal issues of artificial intelligence and machine learning in veterinary radiology and radiation oncology. Vet Radiol Ultrasound. 2022;63(S1):840–844. PMC.
- Joslyn S, Alexander K. Evaluating artificial intelligence algorithms for use in veterinary radiology. Vet Radiol Ultrasound. 2022;63(S1):871–879.
- Cohen J, Fischetti AJ, Daverio H. Veterinary radiologic error rate as determined by necropsy. Vet Radiol Ultrasound. 2023;64(4):573–584. The pathology-confirmed error-rate study from the Animal Medical Center.
- Rajpurkar P, Irvin J, Zhu K, et al. CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning. arXiv:1711.05225. November 2017. arXiv.
- Rajpurkar P, Irvin J, Ball RL, et al. Deep learning for chest radiograph diagnosis: A retrospective comparison of the CheXNeXt algorithm to practicing radiologists. PLOS Medicine. 2018;15(11):e1002686. Open access.
- Mongan J, Moy L, Kahn CE. Checklist for artificial intelligence in medical imaging (CLAIM): a guide for authors and reviewers. Radiol Artif Intell. 2020;2(2):e200029.
- FDA / Health Canada / UK MHRA. Good Machine Learning Practice for Medical Device Development: Guiding Principles. October 2021. FDA.gov.
- FDA / Health Canada / UK MHRA. Transparency for Machine Learning-Enabled Medical Devices: Guiding Principles. June 2024. FDA.gov.
- Gleamer BoneView 510(k) Summary K212365. FDA AccessData. Intended-use language: “BoneView is intended for use as a concurrent reading aid during the interpretations of radiographs.”
- RSNA AI Challenges archive. RSNA.org. 2017 pediatric bone age, 2018 pneumonia, 2019 intracranial hemorrhage, 2020 pulmonary embolism, 2021 COVID-19, and subsequent annual challenges.
- The Safeguards That Don’t Apply Here: How Veterinary AI Radiology Vendors Operate Outside Every Rule That Governs the Human Side — The companion regulatory analysis: FDA clearance, state practice acts, and reimbursement gatekeeping, and why none of them reach veterinary AI.
Editorial & Legal Disclaimer. VeterinaryTeleradiology.com is an independent industry publication. This article is based entirely on publicly available and documented sources, each identified in the Primary Documents Referenced section above. Sources include: peer-reviewed papers and commentaries published in Frontiers in Veterinary Science, Veterinary Radiology & Ultrasound, JAVMA, PLOS Medicine, and Radiology: Artificial Intelligence; FDA 510(k) Summaries available on the FDA AccessData public database; RSNA AI Challenge archives and accompanying peer-reviewed dataset descriptions; FDA, Health Canada, and UK MHRA jointly published guidance documents on Good Machine Learning Practice and Machine Learning-Enabled Medical Device Transparency; published vendor marketing materials, terms of service, and product descriptions from SignalPET, Vetology, Antech Diagnostics, and Gleamer; and academic preprints posted to arXiv. No confidential sources, non-public documents, or unverified information is relied upon in this article. Every factual claim is attributable to one or more of the above primary or secondary sources.
This article presents documented facts, structural and methodological observations, and questions for reader and regulatory consideration. It does not assert legal conclusions, make criminal accusations, or impute wrongdoing, fraud, or illegal conduct to any individual or entity. Characterizations of vendor products are based on those vendors’ own publicly posted marketing materials and on peer-reviewed scientific literature evaluating those products. Where this article describes a peer-reviewed study or peer-reviewed commentary, the characterization reflects the published authors’ own conclusions and the documented methodology of those publications.
The methodological critique of the Ndiaye et al. study reflected in this article is the published, peer-reviewed analysis of Joslyn et al. (2025) in Frontiers in Veterinary Science, including all specific factual claims about ground truth methodology, sample size, class imbalance, sensitivity values, and statistical methods. Readers are encouraged to read both the original Ndiaye et al. paper and the Joslyn et al. commentary in their entirety, both of which are open access.
Vendors named in this article are characterized based on their own public marketing materials and on independent published scientific analysis of their products. Any vendor whose product offerings, training methodology, validation regime, versioning practice, or operational model differs from the description above is invited to contact this publication with the specifics, and any corrections supported by documentary evidence will be published in full. This invitation is extended directly and without prejudice to SignalPET, Vetology, Antech Diagnostics, Gleamer, IDEXX, Radimal, Patterson Veterinary, and any other vendor whose products are discussed in this article.