Modernizing DoD AI: Overcoming Testing Bottlenecks

1. Executive Summary

The United States Department of Defense (DoD) is actively pursuing substantial investments in artificial intelligence (AI) and autonomous drone technologies. The strategic objective is to field combat-credible, decentralized, and intelligent systems capable of operating at machine speed across contested multi-domain environments. However, while modernization efforts heavily prioritize platform capabilities—such as hardware procurement, airframe manufacturing, and sensor integration—there is a critical misalignment regarding the systemic requirements necessary to design, build, test, operate, and evolve these tools.¹ A pervasive tendency exists within the defense establishment to fixate on the physical technology itself while overlooking the underlying distributed systems infrastructure and evaluation mechanisms that make the technology viable.¹

This report provides a strategic evaluation of the critical bottlenecks within the Department’s Test, Evaluation, Verification, and Validation (TEVV) enterprise. The central finding indicates that legacy testing infrastructure, which was built for deterministic hardware and human-piloted systems, is fundamentally ill-equipped to handle non-deterministic AI behaviors, continuously learning algorithms, and distributed swarm logic.¹ Traditional physical test ranges are constrained by safety limitations, geographical boundaries, and an inherent inability to replicate the millions of edge-case scenarios required to validate reinforcement learning models.³ Furthermore, traditional regulatory mechanisms, such as static Safety Review Boards (SRBs) and point-in-time Authorization to Operate (ATO) certifications, create bureaucratic friction that stifles the rapid deployment and iterative updating of software-defined weapons.⁵

To bridge the gap between technological ambition and operational readiness, DoD leadership must pivot toward modernizing the TEVV ecosystem. This requires a systemic shift away from platform-centric acquisition toward architecture-centric and software-first paradigms. Strategic imperatives include the wide-scale adoption of highly realistic simulation environments and digital twins to enable hardware-in-the-loop (HITL) and software-in-the-loop (SITL) testing at scale.³ Regulatory frameworks must also evolve concurrently; leadership must champion the transition from static security reviews to Continuous Authorization to Operate (cATO) protocols,⁹ and replace legacy Technology Readiness Levels (TRLs) with a nuanced AI Readiness Framework (AIRL) that accounts for data integrity, algorithmic alignment, and human-machine teaming.¹¹ Only by addressing these TEVV bottlenecks can the DoD ensure that warfighters are equipped with autonomous tools that are not only lethal and survivable but, fundamentally, trustworthy.

2. The Evolving Threat Landscape and the Autonomy Imperative

To understand the inadequacy of the current TEVV enterprise, it is necessary to examine how AI and autonomy alter the fundamental nature of military systems. The DoD’s integration of AI spans a broad spectrum, ranging from decision-support systems (AI-DSS) designed to accelerate the joint targeting cycle, to highly autonomous unmanned aerial vehicles (UAVs) and unmanned surface vehicles (USVs) capable of executing independent kinetic action when communications are severed.¹²

The Distinction Between Automation and Autonomy

A frequent point of confusion in defense acquisitions is the conflation of “automation” and “autonomy.” Automation refers to a system’s ability to undertake a narrow, constrained task with low levels of complexity, where that task is highly repetitive and independent of choice.¹⁵ Legacy flight control software, radar tracking algorithms, and autopilot waypoint navigation operate on pre-programmed, rules-based logic.⁵ Conversely, autonomy involves empowering a system to make “how” decisions to achieve a task within the constraints of defined parameters, requiring an inherent level of artificial intelligence to process variables and adapt to changing environments.¹⁶

The Shift from Deterministic to Non-Deterministic Systems

Historically, military aviation and maritime acquisitions have relied on deterministic systems. In a deterministic software framework, a specific input will always yield the exact same output. Testing these systems involves verifying that the software code correctly executes its programmed logic under defined parameters, usually through rigorous code-level hazard analysis.⁵

Modern military AI, particularly deep neural networks and reinforcement learning agents, is inherently non-deterministic. These models do not follow explicit, human-coded rules; instead, they recognize complex patterns within vast datasets and generate probabilistic outputs.¹² An autonomous drone trained via reinforcement learning may react differently to the same tactical scenario depending on subtle environmental variations, sensor noise, or adversarial electronic warfare (EW) interference. Consequently, traditional software testing methodologies—which rely on verifying every possible line of code or structural logic path—cannot be successfully applied to AI.⁵
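The contrast can be made concrete with a minimal Python sketch. It is illustrative only: the gain, noise level, action names, and probabilities are hypothetical values chosen for the example, and the "policy" is a toy stand-in rather than a real trained model. The deterministic rule can be verified by checking one input against one expected output, whereas the stochastic policy can only be characterized by the distribution of its actions over many trials.

```python
import random


def rule_based_pitch_command(altitude_error_m: float) -> float:
    """Deterministic rule: identical input always yields identical output."""
    gain_deg_per_m = 0.02  # hypothetical proportional gain
    return max(-10.0, min(10.0, gain_deg_per_m * altitude_error_m))


def learned_policy_action(threat_bearing_deg: float, rng: random.Random) -> str:
    """Toy stand-in for a learned policy: noisy sensing, then a sampled action."""
    sensed_bearing = threat_bearing_deg + rng.gauss(0.0, 5.0)  # hypothetical sensor noise
    p_break_left = 0.8 if sensed_bearing > 0 else 0.2  # hypothetical action probabilities
    return "break_left" if rng.random() < p_break_left else "break_right"


def action_frequencies(threat_bearing_deg: float, trials: int, seed: int) -> dict:
    """Estimate the action distribution for one fixed tactical scenario."""
    rng = random.Random(seed)
    counts = {"break_left": 0, "break_right": 0}
    for _ in range(trials):
        counts[learned_policy_action(threat_bearing_deg, rng)] += 1
    return {action: n / trials for action, n in counts.items()}


if __name__ == "__main__":
    # Same input, same output, every time.
    print(rule_based_pitch_command(100.0), rule_based_pitch_command(100.0))
    # Same scenario, but the outcome is a distribution rather than a single answer.
    print(action_frequencies(threat_bearing_deg=2.0, trials=1000, seed=0))
```

The testing consequence is that the second function has no single "correct" output to assert against; an evaluator can only bound how often each behavior occurs under representative conditions.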

The Historical Context of Software Failures in Combat

The stakes of failing to properly evaluate software-driven military systems are historically severe. Existing studies examining military accidents frequently utilize “normal accidents” and “high reliability organizations” theories, which highlight that software development life cycles often expand the causal timeline of accidents beyond immediate battlefield decisions to structural choices made years earlier in software design.¹⁸

During the Cold War, computerized early warning systems produced significant near-miss nuclear crises due to algorithmic misinterpretations. Software integration flaws also contributed to the 1988 USS Vincennes shootdown of an Iranian airliner, the 2003 Patriot missile fratricides, and the 2017 USS McCain collision.¹⁸ In the case of the USS McCain, naval reviews indicated that designers added automation without adequately considering the effects on operators trained on legacy equipment.¹⁸ As the DoD transitions to AI, the risk of devastating military accidents rises sharply if the underlying software is decoupled from rigorous, operationally representative testing environments. AI applications deployed with subtle failure modes, warped incentives, or susceptibility to automation bias present an unacceptable operational risk.¹⁹

3. The Fundamentals of Machine Speed Warfare and True Swarm Architectures

A critical issue impeding TEVV modernization is the conceptual dilution of “swarming” within defense acquisitions. Current modernization discourse and industry marketing often conflate robotic maneuver en masse with true swarm intelligence. As highlighted by defense distributed systems experts, the deployment of dozens or even hundreds of drones controlled by a single operator, or following a pre-scripted leader-follower formation, does not constitute a true swarm.¹

The Illusion of Plurality vs. Singular Cohesion

True swarming requires resilient, collaborative, autonomous problem-solving at machine speed.¹ In a genuine swarm, there is no single point of failure; the entity operates as a singular cohesive unit rather than a plural collection of independent platforms.¹ The U.S. defense industry has largely failed to deliver distributed systems for useful, resilient, collaborative swarming behavior, instead focusing on producing large quantities of individual airframes. By characterizing groups of their products as “swarms,” defense contractors have confused customers and blunted the demand signal that should be fostering a breakthrough capability in distributed systems architecture.¹

Current U.S. approaches to multi-drone operations typically utilize a “one-to-many” model.¹ In this model, multiple drones maneuver in sync under the direction of a central processor or a single human operator utilizing pre-scripted formations. These centralized systems are highly vulnerable; if the leader node is compromised, jammed, or destroyed, the entire group fails.¹ Furthermore, they lack the ability to dynamically adapt at machine speed if the tactical situation changes unexpectedly.

Cloud Independence and Consensus-Based State Management

To achieve true swarming, individual drones must operate on a distributed systems infrastructure. This requires constant, decentralized consensus-building among individual nodes to maintain a shared Common Operating Picture (COP).¹ The drones must continuously agree on the state of the world, target assignments, and navigational hazards without relying on a central server.
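As a rough illustration of serverless agreement, the sketch below implements distributed average consensus with Metropolis weights in Python. This is one textbook technique chosen for the example, not a description of any fielded swarm protocol; the five-drone line topology and the range estimates are hypothetical. Each node exchanges values only with its direct neighbors, yet every node converges on the same shared estimate.

```python
def consensus_round(values: dict, neighbors: dict) -> dict:
    """One synchronous round: each node nudges its estimate toward its neighbors'.

    Metropolis weights, 1 / (1 + max(degree_i, degree_j)), are symmetric,
    so the network-wide average is preserved on every round.
    """
    updated = {}
    for i, x_i in values.items():
        estimate = x_i
        for j in neighbors[i]:
            weight = 1.0 / (1 + max(len(neighbors[i]), len(neighbors[j])))
            estimate += weight * (values[j] - x_i)
        updated[i] = estimate
    return updated


def run_consensus(values: dict, neighbors: dict, tol: float = 1e-6,
                  max_rounds: int = 10_000):
    """Iterate until all nodes agree to within tol; return (estimates, rounds used)."""
    rounds = 0
    while rounds < max_rounds and max(values.values()) - min(values.values()) >= tol:
        values = consensus_round(values, neighbors)
        rounds += 1
    return values, rounds


if __name__ == "__main__":
    # Hypothetical scenario: five drones in a line, each with a noisy
    # range-to-target estimate in metres. No node sees the whole network.
    estimates = {0: 1000.0, 1: 1020.0, 2: 980.0, 3: 1010.0, 4: 990.0}
    links = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
    agreed, rounds = run_consensus(estimates, links)
    print(agreed, rounds)
```

The number of rounds returned is itself a useful test quantity: it grows as connectivity thins, which is the kind of behavior an evaluator would want to observe when links are degraded.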

A significant challenge in modernizing this capability is the reliance on cloud provider dependencies. Modernization professionals often design systems that require high-bandwidth connections to centralized cloud servers.¹ However, military operators in contested environments cannot rely on uninterrupted high-bandwidth connectivity due to adversarial electronic warfare and spectrum jamming.¹ True swarms must be cloud-independent, utilizing locally self-contained software infrastructure to coordinate via ad hoc, short-range connections that can self-heal dynamically when nodes drop out.¹

The Implications for Testing and Evaluation

The architectural shift toward distributed systems renders legacy testing paradigms obsolete. Evaluating a true swarm requires testing the communication protocols, routing logic, and consensus algorithms that allow multiple autonomous systems from potentially different vendors to function as a cohesive team.²¹ The CNAS report “Lessons in Learning: Ensuring Interoperability for Autonomous Systems in the Department of Defense” (https://www.cnas.org/press/press-release/cnas-releases-new-report-lessons-in-learning-ensuring-interoperability-for-autonomous-systems-in-the-department-of-defense) emphasizes that unlike human pilots who can rely on informal coordination over radio, autonomous systems must use preprogrammed, shared protocols.²¹

Legacy ranges lack the instrumentation to effectively monitor, record, and evaluate the internal logic and network state of hundreds of distributed nodes operating simultaneously. If a physical swarm fails to execute a mission, determining whether the failure was caused by a hardware sensor malfunction, network packet loss due to natural atmospheric interference, or a logical flaw in the decentralized consensus algorithm is exceptionally difficult without robust synthetic telemetry mirroring the event.¹ Therefore, the evaluation of interoperability and swarm communication constitutes a primary bottleneck in the current TEVV enterprise.

4. Inadequacies of Legacy Physical Test Ranges

The Director, Operational Test and Evaluation (DOT&E) has repeatedly emphasized that the DoD must rapidly and rigorously test its systems across contested domains to determine operational effectiveness, survivability, and lethality.²³ While traditional physical test ranges remain vital for final operational validation and flight qualification, they present severe bottlenecks for the iterative development of non-deterministic AI and autonomous swarms.³

Geographic Constraints and Safety Footprints

Physical test ranges are geographically bounded and heavily regulated by civilian aviation authorities, such as the Federal Aviation Administration (FAA), and stringent range safety protocols. The testing of unmanned aircraft systems must adhere to strict guidelines regarding population density categories and operational boundaries.²⁴ Testing a drone swarm intended to simulate hundreds of interconnected autonomous munitions requires massive expanses of unencumbered airspace and electromagnetic spectrum.

Physical ranges must manage strict safety footprints to ensure that an anomalous algorithm does not cause an autonomous vehicle to breach civilian airspace or cause damage to property and life.²⁵ As the role of unmanned systems expands, external sensor suites grow more complex, and data processing accelerates, these systems push the limits of what human operators can safely command and control, which introduces significant risks to physical testing.²⁵ Implementing Sense and Avoid (SAA) systems in coordination with standard aviation protocols requires careful physical bounding that artificially limits the parameters under which an AI can be tested.²⁶

The Logistical Burden of Contested Environment Replication

Replicating the dense, contested environments of modern warfare is practically impossible in the physical world at scale. Establishing a realistic test environment requires deploying complex arrays of adversarial surface-to-air missile (SAM) simulators, GPS spoofing equipment, multi-spectral camouflage, dynamic moving targets, and dense electronic warfare interference.³ The logistics, fuel costs, personnel requirements, and maintenance overhead required to coordinate a single live-fly event with these assets are prohibitive. This limits testing frequency to a handful of highly scripted, tightly controlled events per year, far too few to support iterative software development.

The Iteration Deficit in Machine Learning

The most significant limitation of physical ranges is the iteration deficit. The development of capable AI agents relies on reinforcement learning (RL), a process where an algorithm learns optimal behavior through trial and error over millions of iterations.⁴ In a simulated environment, an autonomous agent can fight a million dogfights in a single day, exploring different tactical geometries, reacting to dynamic threats, and learning from its failures.⁴

If the DoD relies primarily on live-flight testing, an AI model might experience only a few dozen engagements per month. This slow feedback loop is incompatible with modern software development life cycles and the pace of adversarial technological advancement. As highlighted by Shield AI (https://shield.ai/autonomy-for-the-world-x-62-vista/), combat-ready AI agents demand continuous training around the clock in simulation environments before they are ever loaded onto a physical airframe for validation.⁴ Relying on physical ranges for the bulk of algorithmic training fundamentally throttles the speed of innovation.
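The scale of the iteration deficit can be estimated with simple arithmetic, sketched here in Python. The simulation figure comes from the text above; reading "a few dozen" as three dozen live engagements per month is an assumption made purely for illustration.

```python
def live_months_to_match(sim_engagements_per_day: float,
                         live_engagements_per_month: float) -> float:
    """Months of live-flight testing needed to equal one day of simulation."""
    return sim_engagements_per_day / live_engagements_per_month


SIM_PER_DAY = 1_000_000  # "a million dogfights in a single day" (from the text)
LIVE_PER_MONTH = 36      # assumption: "a few dozen" read as three dozen

if __name__ == "__main__":
    months = live_months_to_match(SIM_PER_DAY, LIVE_PER_MONTH)
    print(f"{months:,.0f} months, or about {months / 12:,.0f} years of live flight")
```

Under these assumptions, a single day of simulated training corresponds to more than two thousand years of live-flight engagements, which is why live ranges are better reserved for validation than for training.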

[Insert image: A visual contrasting the linear, constrained nature of physical testing against the high-volume, limitless iterations of synthetic simulation environments.]

5. Modernizing the TEVV Ecosystem through Synthetic Simulation

To overcome the iteration deficit of physical ranges and establish the statistical confidence required for deployment, the DoD must heavily invest in modernizing simulation environments. Synthetic range capability is not merely an alternative to live testing; it is the foundational prerequisite for developing credible autonomous systems.³ Development and assessment of these systems must be accelerated with credible synthetic range capabilities that support hardware-in-the-loop (HITL) and software-in-the-loop (SITL) evaluation within operationally representative conditions.³

The Role of Digital Twins and Synthetic Data

Digital engineering is rapidly becoming a standard practice across DoD acquisition and sustainment, embedding virtual-first approaches into lifecycle management as directed by DoD Instruction 5000.97.²⁸ Central to this shift is the creation of digital twins. A digital twin is a high-fidelity virtual representation of a physical object, process, or environment that mirrors its real-world counterpart to predict future behavior, powered by real-time data inputs.²⁹

In the context of drone TEVV, digital twins allow engineers to run virtual stress tests, environmental simulations, and edge-case scenarios before physical prototypes are even constructed, thereby catching design flaws early and reducing physical prototyping costs.²⁸ For example, the U.S. Army recently contracted Duality AI to utilize its Falcon digital twin simulation platform to develop an AI-based anti-drone detection system.³⁰ By generating massive volumes of synthetic data representing diverse adversarial environments, digital twins provide the fuel necessary for reinforcement learning engines, enabling faster and more cost-effective deployment through a digital-first approach.³⁰

Case Study: Aerospace Autonomy and the VISTA X-62A

The U.S. Air Force’s X-62A Variable In-Flight Simulation Test Aircraft (VISTA) represents the gold standard for bridging synthetic simulation with live-flight validation. Developed by Lockheed Martin Skunk Works in collaboration with Calspan Corporation for the U.S. Air Force Test Pilot School, VISTA is a heavily modified F-16 utilizing an open systems architecture.³² This architecture allows the aircraft to mimic the aerodynamic performance characteristics of other airframes and host non-vendor-locked third-party AI applications.³²

VISTA is a critical asset because it enables the physical validation of algorithms trained in synthetic environments. In a recent milestone, an autonomous AI agent developed by Shield AI took control of the VISTA and executed tactical basic fighter maneuvers (dogfighting) against a human-piloted F-16.⁴ The success of this live test was entirely predicated on the millions of daily synthetic dogfights the AI agent executed in simulation.⁴ VISTA provides the crucial hardware-in-the-loop validation, proving that an algorithm optimized in a sterile digital environment can handle the latency, sensor noise, and dynamic aerodynamic realities of physical flight—operating at speeds up to Mach 2 and altitudes of 50,000 feet—without requiring the procurement of a dedicated, single-purpose test drone.⁴

Case Study: Maritime Autonomy and the Naval Autonomous Test System

In the maritime domain, testing autonomous unmanned surface vehicles (USVs) introduces unique complexities. USVs must comply with collision regulations, navigate complex wave dynamics, and maneuver through shallow waterways.³⁶ To address this, the Navy and academic partners are developing the Naval Autonomous Test System (NATS).³

NATS is a simulation framework built on platforms like Unity, MATLAB-Simulink, and ROS2 (Robot Operating System).³⁶ It creates digital twins of the real-world maritime environment, allowing for the evaluation of autonomous navigation algorithms through rigorous software-in-the-loop (SITL) testing.⁸ The framework models the complex interactions between a vessel’s control algorithms and realistic environmental factors, generating three-dimensional navigation environments by combining actual wave spectra with Gerstner waves.³⁸ By utilizing high-resolution bathymetric data from major U.S. ports, NATS can test a USV’s ability to navigate confined waterways and avoid grounding—scenarios that are too dangerous and costly to test iteratively with physical vessels.³⁶ This framework provides a simulation that encompasses the challenges of complex maritime operations, assisting developers in discovering unpredicted interactions and improving system robustness.⁸
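For readers unfamiliar with Gerstner waves, the following Python sketch shows the basic idea in two dimensions (horizontal position and height). It is not the NATS implementation: it uses the standard deep-water dispersion relation, and the wave components listed are hypothetical rather than drawn from measured spectra. Each component moves a surface point around a circle, and summing several components yields an irregular sea surface.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2


def gerstner_surface_point(x0: float, t: float, components) -> tuple:
    """Displaced (x, z) position of the surface point whose rest position is x0.

    components: iterable of (amplitude_m, wavelength_m, phase_rad) tuples.
    Assumes deep water, so angular frequency omega = sqrt(g * k).
    """
    x, z = x0, 0.0
    for amplitude, wavelength, phase in components:
        k = 2.0 * math.pi / wavelength       # wavenumber
        omega = math.sqrt(G * k)             # deep-water dispersion relation
        theta = k * x0 - omega * t + phase   # phase of a wave travelling in +x
        x -= amplitude * math.sin(theta)     # horizontal displacement
        z += amplitude * math.cos(theta)     # vertical displacement
    return x, z


if __name__ == "__main__":
    # Hypothetical sea state: a long swell plus two shorter wind-wave components.
    sea_state = [(1.0, 50.0, 0.0), (0.4, 18.0, 1.3), (0.15, 7.0, 2.1)]
    for x0 in (0.0, 10.0, 20.0):
        print(x0, gerstner_surface_point(x0, t=3.0, components=sea_state))
```

In a full simulator, the amplitudes and wavelengths would be sampled from a measured or modeled wave spectrum and extended to a two-dimensional surface; the sketch only conveys the underlying geometry.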

Quantifying Swarm Performance in Synthetic Environments

When testing massive drone swarms in these synthetic environments, traditional platform-centric performance metrics must be adapted. Swarm TEVV requires tracking distributed behaviors, evaluating how effectively algorithms balance speed, solution quality, scalability, and reliability.³⁹

Key metrics evaluated in simulation include:

Convergence Speed: The measure of how quickly the decentralized algorithm finds a solution or reaches consensus among nodes, quantified by tracking computation time or the number of iterations needed to reach a predefined error threshold.³⁹
Solution Quality: Measured using error rates, fitness values, or comparisons to ground-truth solutions (e.g., using standardized benchmark test functions like Rastrigin or Schwefel) to ensure the swarm selects optimal paths or targets.³⁹
Scalability: Evaluated by increasing the problem size—such as adding hundreds of new drones to the network—and observing performance degradation to ensure the swarm maintains coordination under load.³⁹
Robustness and Fault Tolerance: The swarm’s ability to adapt to dynamic constraints, tested by introducing synthetic sensor noise, simulating node attrition, or altering constraints mid-execution and measuring success rates across multiple runs.³⁹
Formation Integrity and Leadership Error: The ability of the swarm to maintain spatial coherence under varying degrees of communication latency and positioning errors.²⁷
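
Several of the metrics above reduce to simple computations over logged simulation runs. The Python sketch below shows one possible formulation, assuming each run records its best error after every iteration; the function names and the example histories are hypothetical. It includes the Rastrigin benchmark function mentioned above, an iterations-to-threshold measure of convergence speed, and a success rate across runs as a simple robustness indicator.

```python
import math


def rastrigin(x) -> float:
    """Rastrigin benchmark function; global minimum of 0 at the origin."""
    return 10.0 * len(x) + sum(xi * xi - 10.0 * math.cos(2.0 * math.pi * xi) for xi in x)


def iterations_to_threshold(best_error_history, threshold: float):
    """Convergence speed: first iteration (1-based) at which error <= threshold.

    Returns None if the run never reached the threshold.
    """
    for iteration, error in enumerate(best_error_history, start=1):
        if error <= threshold:
            return iteration
    return None


def success_rate(run_histories, threshold: float) -> float:
    """Robustness: fraction of runs that reached the error threshold at all."""
    hits = sum(1 for history in run_histories
               if iterations_to_threshold(history, threshold) is not None)
    return hits / len(run_histories)


if __name__ == "__main__":
    # Hypothetical logged histories: one nominal run, one with injected node attrition.
    nominal = [12.0, 4.5, 0.9, 0.08, 0.009]
    degraded = [15.0, 9.0, 3.0]
    print(rastrigin([0.0, 0.0]))
    print(iterations_to_threshold(nominal, 0.1))
    print(success_rate([nominal, degraded], 0.1))
```

Scalability can be assessed with the same helpers by repeating the runs at increasing swarm sizes and comparing how the iteration counts and success rates change.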

[Insert image: A structured table outlining the core metrics used to evaluate autonomous swarm performance within synthetic simulation environments.]

6. The Administrative Stranglehold: Safety Review Boards and Certification

Beyond the physical limitations of test ranges, the DoD’s administrative and safety certification apparatus presents a severe bottleneck. Before any weapon system can be deployed, it must be evaluated by Safety Review Boards (SRBs) and certification agents to ensure it operates within acceptable risk parameters. For legacy systems, this is achieved through established systems engineering processes and a rigid adherence to Level of Rigor (LOR) standards.⁵

The Failure of Level of Rigor (LOR) for Machine Learning

The Office of the Under Secretary of Defense for Research and Engineering (OUSD(R&E)) has explicitly identified that applying traditional LOR to machine learning is fundamentally insufficient to mitigate risk.⁵ In traditional software engineering, SRBs rely on extensive documentation that provides a deep understanding of implemented behavior. Engineers conduct low-level design hazard analyses, code-level hazard analyses, and Requirements-Based Structural Coverage Analysis at the Modified Condition/Decision Coverage (MC/DC) level to guarantee that every line of code executes as intended, evaluating data and control coupling.⁵

Machine learning models, however, function effectively as “black boxes.” They lack the transparency required for traditional software safety means, making it impossible to create the artifacts that map how a neural network weighs billions of parameters to arrive at a specific target identification.⁵ Consequently, SRBs are left evaluating a system without the deep analytical insight they have historically relied upon. For high-criticality tasks, known as Safety Flight Critical Index 1 (SFCI 1) functions, the uncertainty associated with ML precludes the ability to provide sufficient confidence based solely on developmental assurance and LOR.⁵

The Operational Design Domain (ODD) Mismatch

A fundamental challenge for SRBs evaluating autonomous systems is the inherent misalignment between the Operational Design Domain (ODD) and the Training Data Distribution.⁵ The ODD defines the specific conditions under which a system is designed to function (e.g., specific altitudes, weather conditions, threat landscapes). The training data represents the dataset used to teach the AI how to operate within that ODD.

By the nature of machine learning, the training data will always be a limited sample or subset of the infinite complexities of the real-world ODD.⁵ While developers strive for robust generalization, the real world will always present edge cases—novel visual patterns, unpredictable adversary tactics, or anomalous sensor inputs—that were absent from the training set. This guarantees a remaining margin of uncertainty and an expected lower success rate in operational deployment than in controlled testing.⁵ SRBs, traditionally tasked with eliminating uncertainty, struggle to certify systems where residual uncertainty is an architectural reality.
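Residual uncertainty of this kind can be quantified even when it cannot be eliminated. As one hedged illustration, the Python sketch below computes a Wilson score confidence interval for a success rate observed over a finite number of test trials. The trial counts are hypothetical, and the interval only describes performance on conditions resembling those tested; it says nothing about inputs outside the tested distribution.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial success rate.

    z = 1.96 corresponds to roughly 95% confidence.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2.0 * trials)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1.0 - p_hat) / trials + z * z / (4.0 * trials * trials)
    )
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Hypothetical result: 990 correct identifications in 1,000 simulated trials.
    low, high = wilson_interval(successes=990, trials=1000)
    print(f"observed 99.0%, plausible range {low:.3%} to {high:.3%}")
```

A review board presented with an interval rather than a single percentage can at least see how much of the claimed performance is supported by the volume of testing performed.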

Automation Bias and the Challenge of Explainability

When SRBs evaluate decision-support systems (AI-DSS)—such as AI designed to filter reconnaissance data and recommend targets for human commanders—they face the challenge of evaluating human-machine teaming. AI models can suffer from subtle failure modes; they may act deceptively, tell human operators what they want to hear based on warped incentives, or generate hallucinations under battlefield conditions.¹⁹

The lack of robust explainability limits the ability of operators to verify the reasoning behind an AI’s target recommendation.⁵ Empirical evidence from recent conflicts indicates that utilizing AI-DSS to accelerate phases within the joint targeting cycle risks encouraging over-reliance on unverified outputs (automation bias), potentially exacerbating civilian harm rather than preventing it.¹⁴ If SRBs mandate perfect explainability before deployment, AI acquisition will stall indefinitely, as current technical capabilities in ML explainability remain far from providing commensurate insight.⁵

Operationalizing Unmanned Systems Safety Precepts

To navigate these challenges, leadership must guide SRBs to accept new frameworks of governability. The Unmanned Systems Safety Guide for DoD Acquisition outlines specific safety engineering precepts categorized into Programmatic, Operational, and Design Safety Precepts (PSP, OSP, DSP).²⁵ These precepts assist program managers in mitigating hazards unique to unmanned capabilities.

State-of-the-art mitigations for managing machine learning include the use of deterministic checkpoints within software architecture, which provide run-time assurance that the autonomous function does not exceed defined safety limits.²⁵ Additionally, implementing strict bounding of autonomous functions—such as physical/temporal bounds and enforcing human-in-the-loop or human-on-the-loop oversight—reduces risk.²⁵ SRBs must be trained to evaluate systems based on reliable containment, operator override mechanisms, and statistical performance boundaries rather than demanding perfect code transparency.
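A deterministic checkpoint of this kind can be pictured as a small, fully inspectable wrapper around the learned component. The Python sketch below is a simplified illustration under stated assumptions: the envelope limits, the recovery climb rate, and the command fields are all hypothetical, and a real run-time assurance design would cover far more states. The point is that the wrapper's logic is rule-based and can be analyzed with traditional methods, regardless of how opaque the autonomy function that feeds it is.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Command:
    bank_deg: float
    climb_rate_mps: float


@dataclass(frozen=True)
class SafetyEnvelope:
    max_bank_deg: float
    min_altitude_m: float
    max_mission_s: float


RECOVERY_CLIMB_MPS = 5.0  # hypothetical recovery climb rate


def runtime_assurance(ai_cmd: Command, altitude_m: float, elapsed_s: float,
                      env: SafetyEnvelope,
                      operator_override: Optional[Command] = None):
    """Deterministic checkpoint: returns (command to execute, reason)."""
    if operator_override is not None:
        return operator_override, "operator_override"   # human always wins
    if elapsed_s > env.max_mission_s:
        return Command(0.0, 0.0), "temporal_bound"      # hold wings level
    if altitude_m < env.min_altitude_m and ai_cmd.climb_rate_mps < 0:
        return Command(0.0, RECOVERY_CLIMB_MPS), "altitude_floor"
    bank = max(-env.max_bank_deg, min(env.max_bank_deg, ai_cmd.bank_deg))
    if bank != ai_cmd.bank_deg:
        return Command(bank, ai_cmd.climb_rate_mps), "bank_clamped"
    return ai_cmd, "pass_through"


if __name__ == "__main__":
    envelope = SafetyEnvelope(max_bank_deg=60.0, min_altitude_m=150.0, max_mission_s=1800.0)
    print(runtime_assurance(Command(85.0, 0.0), altitude_m=1000.0, elapsed_s=10.0, env=envelope))
    print(runtime_assurance(Command(10.0, -5.0), altitude_m=100.0, elapsed_s=10.0, env=envelope))
```

Because every branch in the wrapper is explicit, it is the kind of artifact that existing code-level hazard analysis can cover, which allows the certification argument to rest on containment rather than on transparency of the model itself.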

7. Shifting Paradigms in Cybersecurity: The cATO Initiative

If the DoD successfully modernizes its physical and synthetic testing environments, it will generate highly capable AI models at unprecedented speeds. However, if these models must pass through the legacy cybersecurity and deployment authorization pipelines, the strategic advantage of rapid iteration will be lost. To address this, the Department is undertaking a paradigm shift in how software and AI updates are approved for operational use.

The Limitations of Traditional ATOs

Historically, defense software requires an Authorization to Operate (ATO) certification. The ATO process relies heavily on the Risk Management Framework (RMF) and involves point-in-time, document-heavy technical security assessments.⁹ Securing an ATO is notoriously slow, rigid, and resource-intensive. A program office might spend months generating the required administrative paperwork to prove compliance, securing approval for a system that could face entirely new cyber vulnerabilities six months later.¹⁰

For autonomous drones relying on AI, a static ATO is highly detrimental. Adversarial tactics evolve daily; an AI model trained to recognize enemy assets must be continuously updated with new data to remain relevant.⁴¹ As former Acting DoD CIO Katie Arrington noted, relying on software within a static ATO environment fails the warfighter because the operational environment and the adversary are constantly dynamic.⁴¹ If every model retrain or software patch triggers a multi-month ATO recertification process, the drone swarm will always be fighting with outdated intelligence.

Transitioning to Continuous Authorization to Operate (cATO)

To enable the rapid, secure deployment of software updates, the DoD is implementing the Continuous Authorization to Operate (cATO) framework. The DoD defines cATO as a state achieved when an organization that develops, secures, and operates a system demonstrates sufficient maturity in its ability to maintain a resilient cybersecurity posture, rendering traditional risk assessments and authorizations redundant.⁶

Under the cATO framework, the focus shifts from evaluating a static piece of software to evaluating the maturity of the software factory that produces it.⁹ If a DoD software delivery organization utilizes approved DevSecOps platforms that meet DoD Enterprise DevSecOps Reference Designs, implements continuous risk monitoring, and practices active cyber defense, it can be granted a cATO.⁹ This authorizes the organization to continuously develop, assess, and deploy software updates directly to the field—including pushing new AI models to operational drones—without awaiting secondary administrative approvals, provided the updates remain within the established risk tolerances.⁹

Programs like the Software Fast Track (SWIFT) initiative are actively seeking to replace legacy ATO mechanisms with automated, AI-driven risk assessments, using third-party assessments of companies’ cybersecurity postures against defined risk criteria.⁴¹ The U.S. Army has already begun applying the cATO framework to existing systems like Nett Warrior and Gabriel Nimbus, marking a fundamental cultural shift from compliance-based administration to threat-based continuous risk management.⁶ For drone TEVV, securing a cATO for autonomy software factories is a critical prerequisite for maintaining tactical agility.

8. Maturing the Evaluation Standard: Transitioning to the AI Readiness Framework

A core administrative mechanism utilized by the DoD to manage defense acquisitions is the Technology Readiness Assessment (TRA), which relies heavily on Technology Readiness Levels (TRLs).⁴³ Originally developed by NASA in the 1970s and formally endorsed by the DoD in 2001, the 1-to-9 TRL scale was designed to gauge the maturity of hardware systems.⁴³

The Failure of TRLs for AI Capabilities

The TRL framework measures progression from basic observed principles (TRL 1) to analytical proof of concept (TRL 3), component validation in a laboratory (TRL 4), prototype demonstration in a relevant environment (TRL 6), and finally, successful system operations in combat (TRL 9).⁴⁴ While TRLs are highly effective for evaluating the structural maturity of a drone’s airframe, propulsion system, or sensor hardware, they are fundamentally inadequate for evaluating the maturity of the AI algorithms controlling the drone.¹¹

Current technology readiness assessments fail to capture critical AI-specific risk factors, such as data integrity, algorithmic bias, model drift, and the quality of human-machine interaction.¹¹ A traditional TRL assessment assumes a linear development path where a component works identically in a high-fidelity lab environment as it does in an operational setting.⁴³ As previously established, the non-deterministic nature of AI and the mismatch between training data and the operational environment mean this linear assumption is false. An AI model that exhibits flawless target recognition on a simulated range may fail completely when encountering novel weather patterns or adversarial camouflage in the field.⁵

The Proposed AI Readiness Framework (AIRL)

To ensure justified confidence in AI-enabled systems prior to deployment, the DoD and associated policy experts are advocating for the adoption of a dedicated AI Readiness Framework, analogous to but expanded beyond traditional TRLs.¹¹ This framework provides decision-makers with a multidimensional view of an autonomous system’s maturity, acknowledging that readiness requires organizational commitment and the addressing of skills and capability gaps.⁴⁷

A comprehensive AI Readiness Framework requires evaluating several core pillars that go beyond mere software functionality:

Justified Confidence and Alignment: Ensuring the AI system’s probabilistic outputs are tightly aligned with commander intent and rules of engagement, and that performance degradation in edge cases is statistically quantified and understood.¹¹
Data Readiness Level (DRL): Assessing the maturity, security, and representativeness of the data used to train the algorithm. A highly advanced algorithm trained on low-quality, incomplete, or biased data has a low DRL and represents a severe operational risk.¹¹
Human Readiness Level (HRL): Evaluating the interface between the AI and the operator. This measures whether the system is understandable, whether operators are sufficiently trained to recognize algorithmic hallucination or failure, and whether effective override mechanisms (governability) are in place to prevent automation bias.¹¹
Governance and Continuous Benchmarking: The establishment of standardized AI safety benchmarks and monitoring protocols to track performance gaps over time.¹¹ A federally coordinated benchmarking hub, spearheaded by entities like the Chief Digital and Artificial Intelligence Office (CDAO) and Defense Innovation Unit (DIU), is critical for delivering uniform evaluations across the DoD.⁴⁹
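As one illustration of what "statistically quantified" edge-case performance could look like, the sketch below computes a Wilson score interval for a success rate observed over a limited number of trials. The choice of the Wilson interval and the function name are assumptions made for illustration; the framework described above does not prescribe a specific statistical method.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Same observed 95% success rate, very different evidence behind it.
few_trials = wilson_interval(19, 20)
many_trials = wilson_interval(950, 1000)
```

The point of the comparison is that 19 correct recognitions out of 20 rare-weather trials supports a far weaker claim than 950 out of 1,000, even though both report 95 percent. Reporting the lower bound rather than the point estimate keeps that difference visible to decision-makers.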

Evaluation Domain	Traditional TRL Focus	AI Readiness Framework (AIRL) Focus
System Behavior	Deterministic operations; verifies specific logic paths and hardware durability.	Non-deterministic probabilities; evaluates statistical confidence boundaries.
Development Path	Linear hardware progression from lab testing to operational flight validation.	Iterative software cycles requiring continuous model retraining via digital twins.
Environmental Testing	Hardware durability under physical stress (e.g., temperature, vibration, shock).	Algorithmic robustness against out-of-distribution data and adversarial inputs.
Human Interface	Ergonomics and straightforward mechanical operability of physical controls.	Mitigation of automation bias, explainability of AI outputs, and system governability.

Adopting this expanded framework, alongside utilizing tools like the CDAO’s Pathway to AI Readiness and AI Readiness Assessment (AIRA) metrics, allows acquisition professionals to communicate the true readiness of an autonomous system to operational commanders.⁵⁰ This ensures that the deployment of AI is based on comprehensive risk awareness rather than hardware milestones alone.
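A multidimensional readiness view can be represented as a simple record in which the weakest dimension, not the hardware milestone, governs the fielding decision. The sketch below is hypothetical: the class name, the 1-to-9 scale applied to DRL and HRL, and the fielding threshold of 7 are assumptions for illustration rather than values defined by the CDAO or the sources cited here.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ReadinessProfile:
    """Hypothetical readiness record; every level is assumed to run from 1 to 9."""
    trl: int  # Technology Readiness Level (hardware and software maturity)
    drl: int  # Data Readiness Level (assumed scale)
    hrl: int  # Human Readiness Level (assumed scale)

    def __post_init__(self) -> None:
        for name in ("trl", "drl", "hrl"):
            level = getattr(self, name)
            if not 1 <= level <= 9:
                raise ValueError(f"{name} must be between 1 and 9, got {level}")

    def limiting_dimension(self) -> Tuple[str, int]:
        """Return the name and level of the least mature dimension."""
        levels = {"TRL": self.trl, "DRL": self.drl, "HRL": self.hrl}
        name = min(levels, key=levels.get)
        return name, levels[name]

    def fieldable(self, threshold: int = 7) -> bool:
        """Fieldable only if the weakest dimension meets the assumed threshold."""
        return self.limiting_dimension()[1] >= threshold


# A mature airframe paired with immature training data is not ready to field.
profile = ReadinessProfile(trl=9, drl=3, hrl=6)
```

Under these assumptions, a system at TRL 9 with a DRL of 3 reports DRL as its limiting dimension and is not fieldable, which is the kind of gap a hardware-only assessment would hide.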

9. Acquisition Dynamics and Resource Allocation Bottlenecks

The structural issues within TEVV are exacerbated by systemic flaws in defense acquisition protocols. The DoD system acquisition process often outsources software development to contractors while limiting input from military end-users, leading to systems that fail to meet operational realities.¹⁸

When the DoD relies entirely on commercial defense contractors to provide the testing environments and validate their own autonomous systems, it risks vendor lock-in and excessive cost overheads. Investigations into defense contracting have repeatedly highlighted issues with pricing data validation; for example, the DoD inspector general has previously found contractors overcharging the military by vast margins for basic spare parts due to loopholes in the Truth in Negotiations Act (TINA).⁵³ If these same opaque pricing models are applied to proprietary software simulation environments and AI training algorithms, the cost of modernizing drone swarms will become unsustainable.

To mitigate this, the DoD must retain ownership of the testing infrastructure and enforce open systems architectures. By mandating that contractors utilize government-owned digital twins and standardized benchmarking frameworks managed by the CDAO, the DoD can ensure competitive pricing, prevent siloed development efforts, and maintain rigorous, unbiased oversight over the algorithms being integrated into the Joint Force.²¹

10. Strategic Recommendations for DoD Leadership

The transition to an AI-enabled, autonomous Joint Force is not merely an engineering challenge; it is fundamentally an infrastructural and regulatory challenge. To overcome the existing TEVV bottlenecks and realize the strategic potential of drone swarms, DoD leadership must operationalize the following recommendations:

1. Reallocate Funding to Synthetic T&E Infrastructure

Current acquisition budgets heavily favor platform procurement. Leadership must direct significant, sustained funding toward the development of enterprise-wide digital twins, realistic synthetic data generators, and joint simulation frameworks (analogous to the Naval Autonomous Test System and VISTA X-62A capabilities). Autonomous systems cannot mature without the capacity to conduct millions of daily reinforcement learning iterations in high-fidelity, adversarial digital environments.

2. Demand Distributed Systems Architectures for Swarms

Acquisition executives must refine their demand signals to industry. Solicitations for drone swarms must explicitly require cloud-independent, distributed systems architectures capable of localized consensus building. Procuring vast quantities of remotely piloted drones and categorizing them as a “swarm” dilutes the capability and perpetuates vulnerabilities associated with centralized single points of failure.
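To make "localized consensus building" concrete, the sketch below shows a basic averaging consensus in which each drone adjusts a shared value (for example, a heading) using only its immediate neighbors, with no central controller. This is an illustrative simplification and not a fault-tolerant protocol such as Raft; the function names, the step size, and the ring topology in the example are assumptions.

```python
from typing import Dict, List


def consensus_step(
    values: Dict[int, float], neighbors: Dict[int, List[int]], step: float = 0.3
) -> Dict[int, float]:
    """One round: each node moves toward its neighbors' values only."""
    return {
        node: v + step * sum(values[n] - v for n in neighbors[node])
        for node, v in values.items()
    }


def run_consensus(
    values: Dict[int, float],
    neighbors: Dict[int, List[int]],
    rounds: int = 50,
    step: float = 0.3,
) -> Dict[int, float]:
    """Repeat local exchanges; the step should stay below 1 / (max neighbor count)."""
    for _ in range(rounds):
        values = consensus_step(values, neighbors, step)
    return values


# Five drones in a ring, each linked to two neighbors (hypothetical topology).
ring = {0: [4, 1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 0]}
headings = {0: 0.0, 1: 10.0, 2: 20.0, 3: 30.0, 4: 40.0}
agreed = run_consensus(headings, ring)
```

Because no node holds a privileged role, removing one drone from the neighbor map leaves the remaining drones able to converge on their own average. A fleet of remotely piloted drones routed through a central link has no equivalent property.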

3. Accelerate the Transition to Continuous Authorization to Operate (cATO)

The traditional Authorization to Operate (ATO) process is incompatible with the operational tempo required for AI software updates. Leadership must support the DoD CIO’s efforts to implement cATO frameworks across all autonomous systems program offices. Cultivating mature DevSecOps software factories allows for the continuous, secure deployment of refined algorithms directly to the tactical edge without bureaucratic delay.

4. Evolve Safety Review Board (SRB) Methodologies

SRBs must be given the doctrinal authority and the technical tools to evaluate non-deterministic systems. Leadership should issue guidance formally recognizing that legacy Level of Rigor (LOR) standards are insufficient for machine learning. SRBs must shift toward evaluating statistical confidence limits, implementing deterministic checkpoints, and enforcing human override mechanisms, acknowledging that some level of residual uncertainty is inherent to AI.
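The three mechanisms named above (statistical confidence limits, deterministic checkpoints, and human override) can be combined into a single decision gate that wraps a non-deterministic model. The sketch below is a hypothetical illustration, not an SRB-endorsed design; the type names, the geofence-style checkpoint, and the 0.95 confidence floor are assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    PROCEED = "proceed"
    REFER_TO_OPERATOR = "refer_to_operator"
    ABORT = "abort"


@dataclass(frozen=True)
class Proposal:
    """An action proposed by the AI, with its self-reported confidence."""
    action: str
    confidence: float             # model confidence in [0, 1]
    inside_authorized_area: bool  # result of a deterministic geofence check


def gate(
    proposal: Proposal, min_confidence: float = 0.95, operator_veto: bool = False
) -> Decision:
    """Apply the checks in fixed priority order: human, deterministic, statistical."""
    if operator_veto:
        # Human override always wins, regardless of model output.
        return Decision.ABORT
    if not proposal.inside_authorized_area:
        # Deterministic checkpoint: a hard rule the model cannot outvote.
        return Decision.ABORT
    if proposal.confidence < min_confidence:
        # Below the assumed confidence floor, defer to a human decision.
        return Decision.REFER_TO_OPERATOR
    return Decision.PROCEED
```

The gate itself is fully deterministic, so it can be reviewed with conventional methods even though the model behind it cannot. Residual uncertainty is confined to the confidence value and is handled by referral rather than assumed away.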

5. Adopt the AI Readiness Framework (AIRL)

The DoD should formally expand its acquisition taxonomy by integrating AI Readiness Levels alongside traditional Technology Readiness Levels. By establishing distinct, enforceable metrics for Data Readiness Levels (DRL) and Human Readiness Levels (HRL), decision-makers will gain an accurate, comprehensive assessment of an autonomous system’s true combat readiness, ensuring that human commanders can trust the tools they deploy.

The successful deployment of autonomous military systems hinges not on the physical sophistication of the drone itself, but on the rigor, scale, and agility of the digital systems used to test, evaluate, and certify it. Modernizing the TEVV enterprise is the indispensable prerequisite for maintaining technological overmatch in future conflicts.


Sources Used

Drones Aren’t Swarming Yet — But They Could – War on the Rocks, accessed April 24, 2026, https://warontherocks.com/drones-arent-swarming-yet-but-they-could/
How Autonomy Can Transform Naval Operations, accessed April 24, 2026, https://www.onr.navy.mil/media/document/nracfinalreport-autonomynov2012pdf
DOT&E FY2024 Annual Report – Test and Evaluation Resources, accessed April 24, 2026, https://www.dote.osd.mil/Portals/97/pub/reports/FY2024/other/2024teresources.pdf?ver=w-LuBRe62rnH4O8pl5lb9A%3D%3D
Autonomy for the World: X-62 VISTA – Shield AI, accessed April 24, 2026, https://shield.ai/autonomy-for-the-world-x-62-vista/
22 July 24 MEMORANDUM FOR RECORD To: Office of the Under …, accessed April 24, 2026, https://www.cto.mil/wp-content/uploads/2025/08/Memo-Endorse-LOR-Objectives-DistroA.pdf
U.S. Army Officials Launch New Way to Constantly Monitor Risks | AFCEA International, accessed April 24, 2026, https://www.afcea.org/signal-media/cyber-edge/us-army-officials-launch-new-way-constantly-monitor-risks
US DoD Leverages Digital Twin Modelling for All Systems | Amentum, accessed April 24, 2026, https://www.amentum.com/news/us-dod-leverages-digital-twin-modelling-for-all-systems/
Towards a Modelling & Simulation Capability for Training Autonomous Vehicles in Complex Maritime Operations – NATO, accessed April 24, 2026, https://publications.sto.nato.int/publications/STO%20Meeting%20Proceedings/STO-MP-MSG-184/MP-MSG-184-17.pdf
Continuous Authorization to Operate (cATO) Evaluation … – DoD CIO, accessed April 24, 2026, https://dodcio.defense.gov/Portals/0/Documents/Library/cATO-EvaluationCriteria.pdf?ver=A8tLIfPjmp3RpemU6JOhJw%3D%3D
Unpacking the DoD Continuous Authorization to Operate (cATO) Evaluation Criteria – Part III: The Role of Active Cyber Defense in cATO – BreakPoint Labs, accessed April 24, 2026, https://breakpoint-labs.com/unpacking-the-dod-continuous-authorization-to-operate-cato-evaluation-criteria-part-iii-the-role-of-active-cyber-defense-in-cato/
Rethinking Technological Readiness in the Era of AI Uncertainty – arXiv, accessed April 24, 2026, https://arxiv.org/html/2506.11001v1
AI on the Battlefield – Apogee Magazine, accessed April 24, 2026, https://apogee-magazine.com/features/ai-on-the-battlefield/
Future Force: Impact of Autonomous Systems on the Defense Sector, accessed April 24, 2026, https://insideunmannedsystems.com/future-force-impact-of-autonomous-systems-on-the-defense-sector/
Designing Lawful Military AI: Technical and Legal Reflections on Decision-Support and Autonomous Weapon Systems – Perry World House, accessed April 24, 2026, https://perryworldhouse.upenn.edu/news-and-insight/designing-lawful-military-ai-technical-and-legal-reflections-on-decision-support-and-autonomous-weapon-systems/
Air Force Doctrine Note 25-1, Artificial Intelligence (AI) – USAF, accessed April 24, 2026, https://www.doctrine.af.mil/Portals/61/documents/AFDN_25-1/AFDN%2025-1%20Artificial%20Intelligence.pdf
Test & Evaluation of AI-enabled and Autonomous Systems: A Literature Review, accessed April 24, 2026, https://testscience.org/wp-content/uploads/formidable/20/Autonomy-Lit-Review.pdf
View of Not Oracles of the Battlefield: Safety Considerations for AI-Based Military Decision Support Systems, accessed April 24, 2026, https://ojs.aaai.org/index.php/AIES/article/view/31712/33879
Machine Failing: How Systems Acquisition and Software Development Flaws Contribute to Military Accidents – Texas National Security Review, accessed April 24, 2026, https://tnsr.org/2024/10/machine-failing-how-systems-acquisition-and-software-development-flaws-contribute-to-military-accidents/
Comprehension and Control of Frontier Military AI Systems | by Dr. Jerry A. Smith | Medium, accessed April 24, 2026, https://medium.com/@jsmith0475/comprehension-and-control-of-frontier-military-ai-systems-5814ec0890a6
The risks and inefficacies of AI systems in military targeting support, accessed April 24, 2026, https://blogs.icrc.org/law-and-policy/2024/09/04/the-risks-and-inefficacies-of-ai-systems-in-military-targeting-support/
CNAS Releases New Report: Lessons in Learning: Ensuring …, accessed April 24, 2026, https://www.cnas.org/press/press-release/cnas-releases-new-report-lessons-in-learning-ensuring-interoperability-for-autonomous-systems-in-the-department-of-defense
State-of-the-Art and Future Research Challenges in UAV Swarms – ResearchGate, accessed April 24, 2026, https://www.researchgate.net/publication/378115164_State-of-the-Art_and_Future_Research_Challenges_in_UAV_Swarms
FY 2024 Annual Report – DOT&E FY2024 Annual Report, accessed April 24, 2026, https://www.dote.osd.mil/Portals/97/pub/reports/FY2024/other/2024Annual-Report.pdf
Blog Post: Building the Bridge to the Autonomous Sky: How the FAA Can Unlock Drones for the Public Good | News, accessed April 24, 2026, https://ethics.nd.edu/news-and-events/news/blog-post-building-the-bridge-to-the-autonomous-sky-how-the-faa-can-unlock-drones-for-the-public-good/
Unmanned System Safety Engineering Precepts Guide … – USD(R&E), accessed April 24, 2026, https://www.cto.mil/wp-content/uploads/2023/06/UxS-Precepts-2021.pdf
Security analysis of drones systems: Attacks, limitations, and recommendations – PMC – NIH, accessed April 24, 2026, https://pmc.ncbi.nlm.nih.gov/articles/PMC7206421/
Swarm of Drones in a Simulation Environment—Efficiency and Adaptation – MDPI, accessed April 24, 2026, https://www.mdpi.com/2076-3417/14/9/3703
7 Ways That AI-Powered Digital Twins Are Shaping the Future of Defense – Sumaria Blog, accessed April 24, 2026, https://blog.sumaria.com/ai-powered
Digital twins: fostering efficient network modernization – Military Embedded Systems, accessed April 24, 2026, https://militaryembedded.com/comms/communications/digital-twins-fostering-efficient-network-modernization
Pioneering a Digital-First Approach, U.S. Army Contracts Digital Twin Simulation Company, Duality AI, for Development of AI-Based Anti-Drone System, accessed April 24, 2026, https://www.duality.ai/news/pioneering-a-digital-first-approach-u-s-army-contracts-digital-twin-simulation-company-duality-ai-for-development-of-ai-based-anti-drone-system
Digital Twins For A Digital World: Data-Driven Training Optimizing The Ready Medical Force, accessed April 24, 2026, https://www.lineofdeparture.army.mil/Journals/Special-Warfare/Special-Warfare-Archive/2025-E-edition/Digital-Twins/
X-62A VISTA | Lockheed Martin, accessed April 24, 2026, https://www.lockheedmartin.com/en-us/products/x-62a-vista.html
X62A Vista – Edwards Air Force Base – USAF, accessed April 24, 2026, https://www.edwards.af.mil/Units/X62A-Vista/
VISTA X-62A Training Aircraft, USA – Air Force Technology, accessed April 24, 2026, https://www.airforce-technology.com/projects/vista-x-62a-training-aircraft-usa/
General Dynamics X-62 VISTA – Wikipedia, accessed April 24, 2026, https://en.wikipedia.org/wiki/General_Dynamics_X-62_VISTA
A Robust Simulation Framework for Verification and Validation of Autonomous Maritime Navigation in Adverse Weather and Constrained Environments – arXiv, accessed April 24, 2026, https://arxiv.org/html/2603.02487v1
Navy Developing Simulations to Test Autonomous Vessels – National Defense Magazine, accessed April 24, 2026, https://www.nationaldefensemagazine.org/articles/2023/11/17/navy-developing-simulations-to-test-autonomous-vessels
A Virtual System and Method for Autonomous Navigation Performance Testing of Unmanned Surface Vehicles – MDPI, accessed April 24, 2026, https://www.mdpi.com/2077-1312/11/11/2058
How do you evaluate the performance of swarm algorithms? – Milvus, accessed April 24, 2026, https://milvus.io/ai-quick-reference/how-do-you-evaluate-the-performance-of-swarm-algorithms
Evaluating Formation Metrics in Autonomous UAV Swarms Using Raft Under Communication Constraints | Request PDF – ResearchGate, accessed April 24, 2026, https://www.researchgate.net/publication/399104207_Evaluating_Formation_Metrics_in_Autonomous_UAV_Swarms_Using_Raft_Under_Communication_Constraints
New Pentagon program to speed up software acquisition set to launch May 1, accessed April 24, 2026, https://defensescoop.com/2025/04/29/dod-cio-katie-arrington-swift-software-acquisition-ato/
Department of Defense Seeks Speedy Software Development for Maximum Readiness, accessed April 24, 2026, https://fedtechmagazine.com/article/2025/11/department-defense-seeks-speedy-software-development-maximum-readiness
Technology Readiness Assessment Guidebook – USD(R&E), accessed April 24, 2026, https://www.cto.mil/wp-content/uploads/2025/03/TRA-Guide-Feb2025.v2-Cleared.pdf
GAO-20-48G, MAIN TITLE VERSION 6-4: TECHNOLOGY READINESS ASSESSMENT GUIDE [Reissued with revisions on Feb. 11, 2020.], accessed April 24, 2026, https://www.gao.gov/assets/gao-20-48g.pdf
Technology Readiness Levels – NASA, accessed April 24, 2026, https://www.nasa.gov/directorates/somd/space-communications-navigation-program/technology-readiness-levels/
Technology Readiness Levels in the Department of Defense (DoD), accessed April 24, 2026, https://api.army.mil/e2/c/downloads/404585.pdf
AI Readiness: Definition and Frameworks – Udemy Business, accessed April 24, 2026, https://business.udemy.com/it/blog/ai-readiness-definition-and-framework/
Amplifying AI Readiness in the DoD Workforce – Software Engineering Institute, accessed April 24, 2026, https://www.sei.cmu.edu/blog/amplifying-ai-readiness-in-the-dod-workforce/
Codifying and Expanding Continuous AI Benchmarking – Federation of American Scientists, accessed April 24, 2026, https://fas.org/publication/codifying-expanding-continuous-ai-benchmarking/
AI Readiness Assessment Guide – AI-REAL Toolkit, accessed April 24, 2026, https://ai-real.dco.org/assets/frontend/images/AI-Readiness-Assessment-Guide.pdf
Pathway to AI Readiness – Chief Digital and Artificial Intelligence Office, accessed April 24, 2026, https://www.ai.mil/About/Resources/Pathway-to-AI-Readiness/
Artificial Intelligence Readiness Assessment (AIRA) – United Nations Development Programme, accessed April 24, 2026, https://www.undp.org/sites/g/files/zskgke326/files/2025-01/ai_readiness_assessment_bhutan.pdf
How Acquisition Reform Could Make Military AI More Expensive and Less Safe, accessed April 24, 2026, https://www.brennancenter.org/our-work/analysis-opinion/how-acquisition-reform-could-make-military-ai-more-expensive-and-less

Ronin's Grips