This is the second blog in a series. In the first blog (Accelerating Automated Driving System Deployment with Scalable, Data-Driven Evaluation), Mike Stellfox pointed out that the real challenge in AV development has shifted from simply building systems to ensuring we can truly trust them.

In the quest to build trust in self-driving safety, what matters is not how far you drive, but what you find along the way and how completely you understand your scenario coverage. Logging billions of uneventful miles is an inefficient way to prove safety and reliability because those miles contribute little meaningful coverage. A far more powerful method is to actively find and analyze the rare, safety-critical events that truly matter for the safety and performance of an Automated Driving System (ADS).

This article explores a methodology for automated scenario curation: the process of identifying and extracting the most valuable, safety-critical scenarios from massive sets of real-world and simulated drive logs in order to strengthen scenario coverage. By focusing on these meaningful slices of data, we can efficiently and effectively evaluate a system’s true capabilities. This shift from quantity to quality in data analysis is crucial for ensuring that autonomous vehicles are thoroughly tested in the very conditions that pose the greatest risk, ultimately building confidence in their deployment.

Beyond simply finding interesting events, efficient scenario curation is critical for establishing the absence of unreasonable risk. By identifying, extracting, and validating the specific scenarios that pose a genuine risk for the ADS, this process is the foundational layer for both verification and validation (V&V) and AI training workflows. For V&V, it helps pinpoint the root causes of unexpected behavior and streamlines debugging. For training, it enables mining and assembling datasets that represent crucial edge cases, anomalies, and underrepresented conditions, significantly boosting model robustness and an ADS’s ability to handle the unexpected.

The automated approach enabled by the Foretify Evaluate solution allows engineers to tackle three critical tasks:

  1. Finding and analyzing real-world edge cases to build a diverse and representative dataset of scenarios essential for both training and validation.
  2. Objectively measuring testing or training coverage within a given Operational Design Domain (ODD), proving that the ADS has been exposed to a wide variety of relevant situations.
  3. Identifying ODD gaps to provide quantitative insights into micro-ODDs that the ADS has not yet been exposed to. This transforms raw data into actionable intelligence, significantly accelerating the path to a safer ADS.

Finally, by focusing on these valuable, data-driven insights:

  • Engineers and Engineering Leaders can prioritize development around the situations proven to cause system uncertainty or failure in the real world.
  • ADS Safety Assurance Teams can clearly quantify risks and define mitigation methods for development and deployment teams.
  • Deployment teams can apply appropriate constraints to ensure the ADS operates only within the safe subset of its target operational domain (TOD).

Autonomous vehicles (AVs) are built on sensors. Cameras, lidar, and radar act as the vehicle’s eyes and ears, feeding perception systems with the data they need to understand the environment. If these sensors fail or if the vehicle’s software misunderstands them, safety is at risk. Relying only on real-world testing to ensure sensor performance and the (functional) safety of sensor data is cost-intensive. That’s why high-fidelity sensor simulation is no longer optional. It has become a cornerstone of how the industry validates AV and ADAS systems.

Why Sensor Simulation Is Essential

For years, the automotive industry has relied almost entirely on real-world testing. Fleets of cars are driven across countries to capture different weather, light, and traffic conditions: Germany to Finland or Detroit to Canada for winter, down to Spain or Arizona for hot weather. This approach is slow, expensive, and incomplete. Many critical scenarios occur so rarely that test fleets may miss them altogether or fail to capture enough data to validate performance with confidence.

The problem is magnified as AV stacks move toward end-to-end AI models, like those pioneered by Tesla and Wayve. These systems can’t be trained or validated component by component. They need full-stack, sensor-inclusive training and testing. To keep costs under control and timelines competitive, training and testing must shift into simulation.

Limitations of Traditional Training and Testing Data

Real-world campaigns bring hidden inefficiencies. Data often becomes useless if sensor hardware or mounting positions change. Design discussions between engineers and designers, such as whether a lidar belongs on the roof or hidden in the bumper, stall without evidence. Every change means repeating the same expensive mileage to generate new data.

In contrast, simulation allows the same scenarios to be replayed instantly under new conditions. This flexibility is crucial for both accelerating development and resolving internal design trade-offs.

Advantages of Sensor Simulation

High-fidelity sensor simulation unlocks advantages that road testing cannot:

  • Early validation: Teams can test virtual sensor rigs long before front-end freeze.
  • Test reuse: The same scenarios can be run across projects, even if sensor configurations evolve.
  • Design flexibility: Mounting options can be explored virtually to balance technical and design needs.
  • Scalability: Most importantly, teams can scale testing faster and at lower cost, cutting time-to-market.

This last point is the most important: sensor simulation is not just an alternative to road testing; it is the only way to achieve the speed and scale the industry now demands.

How Sensors Are Different (And Difficult To Simulate)

Not all sensors behave the same, and that makes simulation challenging:

  • Camera: Relatively straightforward to simulate, but realism and compute performance are key. Sun glare, fog, rain, and reflections all need to look natural to fool computer vision algorithms.
  • LiDAR: Generates dense point clouds through time-of-flight or FMCW technology. Simulation must account for noise, environmental effects like snow or heavy rain, and wavelength-specific behavior.
  • Radar: The most complex to simulate. Propagation effects such as multipath reflections off guardrails, wet roads, or barriers create “ghost” targets: false objects that move realistically and can confuse tracking algorithms. High-fidelity radar simulation requires deep, sensor-specific modeling (a toy geometric illustration of the ghost effect follows the figure below).
Fig 1: Driving scenario with ghost object presence
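
To make the ghost-target effect more tangible, below is a toy Python sketch of first-order specular multipath: the radar return bounces off a guardrail, so the tracker perceives a phantom object mirrored about the reflecting surface. This is a deliberately simplified 2D, single-bounce illustration of the geometry, not how any production radar model implements it.

```python
import numpy as np

def ghost_from_multipath(target_xy, wall_point, wall_dir):
    """Reflect a real target about a guardrail line (2D, single specular bounce).

    The radar interprets the bounced return as coming from the mirrored
    position, producing a 'ghost' object on the far side of the guardrail.
    All inputs are 2D points/vectors in the ego vehicle frame.
    """
    target = np.asarray(target_xy, dtype=float)
    p = np.asarray(wall_point, dtype=float)          # a point on the guardrail
    d = np.asarray(wall_dir, dtype=float)
    d = d / np.linalg.norm(d)                        # unit direction of the guardrail

    v = target - p
    v_parallel = np.dot(v, d) * d                    # component along the rail
    v_perp = v - v_parallel                          # component toward the rail
    return p + v_parallel - v_perp                   # mirrored (ghost) position

# Example: real car 30 m ahead, 2 m left; guardrail runs parallel, 4 m to the right.
real_target = (30.0, 2.0)
ghost = ghost_from_multipath(real_target, wall_point=(0.0, -4.0), wall_dir=(1.0, 0.0))
print(ghost)  # ~[30., -10.]: a phantom object beyond the guardrail
```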

These differences and challenges are exactly why high-fidelity sensor simulation is so complex, and why only a handful of solutions in the market can perform it effectively.

Foretellix + NVIDIA: A Complete Toolchain

The Foretify platform is the industry’s Physical AI toolchain for training, validation, and safety evaluation of AI-powered AV stacks. It combines scenario-driven generation and evaluation with high-fidelity sensor simulation to close the gap between real-world and virtual testing.

  • Foretify Generate: Creates synthetic data and hyper-realistic scenarios at scale, covering edge cases across diverse operational design domains (ODDs).
  • Foretify Evaluate: Unifies real-world driving data with simulation results to identify coverage gaps and deliver safety-critical performance metrics.
  • Integration with NVIDIA Omniverse, NuRec, and Cosmos: Adds hyper-realistic sensor rendering and vendor-specific models, capturing the behavior of cameras, lidar, and radar with high fidelity. It also enables neural reconstruction, which transforms logged real-world drives into reusable, editable simulation scenes, allowing teams to vary conditions like weather, actors, or sensor placements.

Together, these capabilities give AV developers a data-driven autonomy development toolchain that unifies training and testing. By supporting both open- and closed-loop simulation in the behavior and sensor domains, it enables the generation of diverse training data, accelerates validation, reduces cost, and, most importantly, ensures safe AI-powered autonomy.

Fig. 2: Synthetic Variation with Smart Replay (Behaviors) and Cosmos Transfer (Scenery)

The Road Ahead

Sensor simulation is still early in its adoption curve. Today, only a minority of traditional OEMs use it beyond research projects, while new entrants lean on it heavily. Within five years, we anticipate every automaker will depend on sensor simulation as part of daily development.

The fidelity of simulation is also improving rapidly. Sensor simulation is already highly realistic in nominal scenarios and covers many edge cases and physical behaviors, but the industry is now pushing toward near-complete coverage of real sensor behavior. While some physical road miles will always be required, the majority of training and validation will move into the virtual domain.

With rising cost pressures, shorter timelines, and the shift to end-to-end AI stacks, high-fidelity sensor simulation has become indispensable. Foretellix, in partnership with NVIDIA, is helping the industry accelerate this transformation, delivering the realism and flexibility needed to enable safe AI-powered autonomy at scale.

Learn more about Foretellix’s end-to-end toolchain for AV and ADAS validation.

The autonomous vehicle industry is undergoing a transition from high-definition maps toward real-time perception-based navigation. This shift raises a critical challenge for development teams: their simulation and validation tools still require detailed maps that are becoming increasingly costly and impractical to maintain. This article explores how automated map construction from vehicle perception data offers a solution, enabling AV teams to generate high-quality maps directly from drive logs without the overhead of traditional HD mapping operations. By leveraging the same perception data that enables map-free driving, this approach promises to bridge the gap between operational autonomy and development infrastructure needs.

Not so long ago, autonomous driving applications commonly relied on high-definition (HD) maps for guidance. Such maps were sometimes shared by a fleet of AVs and were updated on the fly as vehicles in the fleet shared data acquired while driving.

At present, the AV industry is moving away from HD maps. Instead, AV sensors and the perception pipeline produce a detailed representation of the AV surroundings, often including the road’s drivable area, lane lines, lane markings, traffic signs and so on. This temporary map is created repeatedly as the vehicle drives. Importantly, this approach doesn’t depend on a real-time communication network, a fleet or a remote server. 

Multiple factors contribute to this trend:

  • AV sensors and perception algorithms have improved sufficiently to reliably produce such temporary local maps.
  • Creating, maintaining and sharing HD maps became a costly ongoing operation. Scaling HD map creation as AV deployment expands to ever-larger geographies has become economically unsustainable.
  • Fusing together map information and perception data is challenging, especially for end-to-end ML stacks.

While AVs seem to operate well without HD maps, the development environment still depends on maps. Simulators require a map, and so do many monitoring, search and evaluation utilities used for validation. To address these needs, AV development teams continue to produce maps of regions where test drives take place, but this costly and time-consuming operation is tapering off.

The absence of detailed maps poses a challenge for development and validation teams. Since maps are required, why not create detailed-enough maps automatically? Presumably, if the data collected by an AV is good enough to drive by, it should be sufficient to produce a detailed map of the AV’s route. Foretellix developed a technology to do just that.

The input to our mapping pipeline is a recorded drive log. Rather than the full recording of all sensor channels and AV stack signaling, the log captures more abstract details produced by the perception pipeline. These may include:

  • The AV global position
  • Drivable area boundaries, road and lane markings
  • Road signs, traffic lights, and on-road markings such as stop lines and arrows
  • Additional road attributes as available. Some examples include: 
    • Lane use – bike lanes, public traffic lanes, sidewalks and such
    • Road surface: asphalt, concrete, gravel for instance
    • Access restrictions, for example toll-roads, private roads

The above information is captured every cycle of the AV stack, typically every 20-100 milliseconds. In principle, integrating this information over time will produce a detailed map of the AV’s route. Unfortunately, several issues need to be addressed before a usable map can be produced.
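
For illustration, a single per-cycle perception record could be represented roughly as in the Python sketch below. The field names and structure are assumptions made for this example and do not reflect Foretellix’s actual log schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (x, y) in a global or locally referenced frame

@dataclass
class PerceptionFrame:
    """One cycle (~20-100 ms) of abstracted perception output used for mapping."""
    timestamp: float                          # seconds since start of log
    ego_position: Point                       # AV global position
    ego_heading: float                        # radians
    drivable_area: List[Point]                # polygon boundary of the drivable area
    lane_boundaries: List[List[Point]]        # one polyline per detected lane line
    road_signs: List[dict] = field(default_factory=list)      # e.g. {"type": "stop", "pos": (x, y)}
    traffic_lights: List[dict] = field(default_factory=list)
    on_road_markings: List[dict] = field(default_factory=list)  # stop lines, arrows, ...
    lane_use: Optional[str] = None            # e.g. "bike", "bus", "general"
    road_surface: Optional[str] = None        # e.g. "asphalt", "gravel"
    access_restriction: Optional[str] = None  # e.g. "toll", "private"
```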

Input data quality varies; the data is not always accurate or even usable. Several factors influence log data quality:

  • Sensor noise, inaccuracy or interference (glare for example)
  • Occlusion, causing parts of the scene to be missing from the log
  • Poor or non-existent road marking and signage
  • Perception errors, like recognizing a shadow across the road as a stop line
  • Inconsistency between time frames, for example where a sign is recognized in some frames but not in others

The mapping pipeline uses several strategies to overcome data quality issues. First, since the data is integrated over time, each element in the scene is typically represented multiple times. For example, a lane boundary line is present in most frames, though it may be missing in some. Interpolating over the series of occurrences will typically fill in the missing details.

Next, to address incomplete perception data, the system employs several compensation algorithms that leverage structural invariants and contextual assumptions. For instance, when lane markings are absent from a road segment, the algorithm analyzes adjacent sections to extract consistent road properties, such as lane count and width, then interpolates these characteristics across the unmarked area. Similarly, when a large vehicle creates prolonged occlusions that obscure lane boundaries, the mere presence of that vehicle provides sufficient evidence to infer the underlying lane presence, as vehicles typically follow established traffic patterns.
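
Here is a minimal sketch of the first strategy: filling in a lane-boundary observation that is missing in some frames by interpolating over the frames in which it was seen. It is a simplification for illustration (one lateral offset per frame, linear interpolation); the production pipeline handles full geometry and far messier data.

```python
import numpy as np

def fill_missing_offsets(frame_times, offsets):
    """Interpolate a lane-boundary lateral offset over frames where it was not seen.

    frame_times: 1D array of frame timestamps.
    offsets: 1D array of the boundary's lateral offset per frame, with np.nan
             where the boundary was occluded or not detected.
    Returns the offsets with gaps filled by linear interpolation over time.
    """
    frame_times = np.asarray(frame_times, dtype=float)
    offsets = np.asarray(offsets, dtype=float)
    seen = ~np.isnan(offsets)
    filled = offsets.copy()
    filled[~seen] = np.interp(frame_times[~seen], frame_times[seen], offsets[seen])
    return filled

times = [0.0, 0.1, 0.2, 0.3, 0.4]
obs = [1.70, np.nan, np.nan, 1.74, 1.75]   # boundary missed in two frames
print(fill_missing_offsets(times, obs))     # -> [1.70, 1.713..., 1.726..., 1.74, 1.75]
```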

Figure 1: Map Construction Pipeline, from drive log to detailed map

Data is collected only along the AV route. This is sufficient for some applications, but simulation may require a broader map, for example when experimenting with variations of the recorded scene. Foretellix’s Smart Replay capability can create many variations that extend beyond the immediate vicinity of the AV, like adding cross traffic at a junction. This requires the crossing road to be present on the map. Simulating synthetic scenarios on the driven area also requires a map broader than what is produced by following a single trajectory.

Fortunately, road testing of AVs is often performed by multiple vehicles driving around in a specific area. Over time, much of that area is covered. The mapping pipeline accumulates the maps created by each individual drive and merges them. This has several advantages:

Improved map quality: Overlapping drives reinforce the interpretation of road data, where imperfections in one drive are compensated by other drives. The merged map is therefore more complete and reliable than any of the individually produced maps.

Producing a complete area map: Stitching maps together creates a unified map of the drive area. That map can be used in full, or clipped to a specific region-of-interest (ROI) on demand. 

Figure 2: A map service leverages drive log fusion to expand map coverage and enhance fidelity through repeated observations.
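
As a toy illustration of how overlapping drives reinforce each other, the sketch below fuses lane-boundary points from several drives on a coarse grid, averaging repeated observations and keeping an observation count as a simple confidence measure. The gridding approach and parameters are assumptions for this example, not the actual fusion algorithm.

```python
from collections import defaultdict

def merge_drives(drives, cell=0.5):
    """Fuse lane-boundary points from several drives on a coarse grid.

    drives: iterable of point lists [(x, y), ...], one list per drive.
    cell:   grid resolution in meters used to decide that two observations
            describe the same physical spot.
    Returns {grid_cell: (mean_x, mean_y, observation_count)}.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for points in drives:
        for x, y in points:
            key = (round(x / cell), round(y / cell))
            acc = sums[key]
            acc[0] += x
            acc[1] += y
            acc[2] += 1
    return {k: (sx / n, sy / n, n) for k, (sx, sy, n) in sums.items()}

drive_a = [(10.0, 1.70), (10.5, 1.71)]
drive_b = [(10.1, 1.72), (20.0, 1.80)]   # overlaps drive_a near x=10, extends coverage to x=20
merged = merge_drives([drive_a, drive_b])
print(merged)
```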

To conclude, the dependence of AV development tools on detailed maps is becoming a challenge: these maps are no longer produced for AV driving, and the Operational Design Domain (ODD) is expanding beyond the areas for which maps exist. As the industry transitions to this new way of development, a technology for creating maps from incomplete AV drive logs is vital. The result is a high-quality map of the drive area, usable for all evaluation, verification and validation needs.

If your AV team’s development process is being challenged by the lack of maps, and you’re interested in learning more or sharing your thoughts, then please contact us to stay in the loop.


Many of today’s most advanced vehicles proudly display their 5-star NCAP safety ratings, and for good reason. These standardized tests have driven major safety improvements across the industry, offering a reliable benchmark for features like Advanced Emergency Braking (AEB) and Crash Avoidance.

But passing NCAP doesn’t always mean a system is ready for the real world. Slight deviations in pedestrian appearance, road layout, or vehicle speed can cause even 5-star-rated systems to miss, delay, or fail to respond.

This discrepancy highlights a growing need: ensuring that ADAS functions perform not just in controlled conditions, but across the variability and unpredictability of actual roads. Bridging that gap requires rethinking how we define, execute, and scale scenario testing.

The Limits of NCAP’s Concrete Scenarios

NCAP (New Car Assessment Program) protocols are designed to evaluate how well a vehicle performs in a tightly defined set of safety-critical scenarios. These tests are extremely specific: fixed layouts, precise vehicle speeds (e.g., 10, 15, or 20 kph), limited target types (adult pedestrian, bicycle, motorcycle), and predetermined trajectories. While this standardization supports repeatability and scoring, it does not reflect the variability of the real world.

CMFtap scenario VUT and EMT paths

As noted in ISO 34505, “From a mathematical perspective the probability of a concrete scenario (concrete values of continuous parameters) is zero.” In other words, the probability that any one NCAP scenario will occur exactly as defined is essentially zero. Junction angles vary. Pedestrians don’t always walk straight. Vehicles arrive at different speeds and lateral positions. Even a small deviation can trigger a different outcome, including total system failure.

We’ve seen this in practice in the tests performed by Luminar Technologies at the CES conference in Las Vegas in 2022 and 2023. A range of 5-star Euro NCAP vehicles, including models from Tesla, BMW, Audi, Mercedes, and Lexus, were tested in scenarios that deviated slightly from the NCAP spec. Sometimes the pedestrian was a child instead of an adult, or crossed at a slight angle instead of straight on. In many of these cases, the vehicle didn’t brake at all or braked too late to prevent a serious impact.

And yet, most OEMs still optimize their ADAS systems to pass these tests alone. The risk here is overfitting: vehicles perform well in NCAP but fail in near-identical situations not explicitly covered by the protocol.

Foretellix Helps Teams Go Further with Abstract Scenarios

To achieve real-world ADAS safety, not just compliance, developers need to go beyond NCAP’s constrained test space. Foretellix’s scenario-based approach enables teams to do both: execute official NCAP test cases and collect KPIs and coverage metrics during those runs, and go further with large-scale, abstract testing to capture real-world variability.

For example, rather than testing a left-turn-across-path (LTAP) scenario at just three ego speeds and three oncoming speeds (as NCAP prescribes), Foretellix can generate and execute hundreds of tests across randomized speeds, angles, actor types, and arrival timings without needing to manually script each case. If a system performs consistently across this abstract scenario space, teams gain confidence not just in its ability to pass NCAP, but to handle reality.

This approach also ensures that developers don’t fall into the trap of false positives or missed edge cases that may appear only in low-frequency conditions.

Example: Lane Support System With Oncoming Vehicle

Here’s what that looks like in practice, using a Lane Support System (LSS) scenario with an oncoming vehicle:

NCAP defines very specific parameters for an Emergency Lane Keeping oncoming vehicle scenario:

  • Road curvature: 0 (straight)
  • Global Vehicle Target (GVT, i.e., the oncoming vehicle) lane offset: 1.5 m
  • Collision overlap: 10% of the ego vehicle’s front bumper
  • Swerve side: Left (driver’s side)
  • Lateral speed values tested: 0.2, 0.3, 0.4, 0.5, 0.6 m/s

These fixed parameters are used to verify Lane Support System (LSS) response under very specific conditions. But real roads aren’t that predictable.

Foretellix abstract testing lets you define parameters that are both broader and more realistic. The list below shows one example set that would be valuable to test (a simple sampling sketch follows the list):

  • Vehicle Under Test (VUT) speed is varied across [50–130] kph
  • GVT oncoming vehicle with speed across [50–130] kph
  • GVT offset ranges from 0 to 3 meters
  • Swerve timing ranges from 5 to 20 seconds before potential impact
  • Swerve amplitude ranges from 1 to 1.5 meters
  • Simulations are run across a wide range of road curvatures and map geometries
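
To illustrate how such an abstract parameter space expands into many concrete tests, here is a simple random-sampling sketch in Python. Foretify’s actual generation relies on coverage-driven constraint solving rather than plain random draws, and the parameter names and curvature bounds below are illustrative assumptions.

```python
import random

# Abstract LSS-oncoming parameter space (values mirror the ranges above).
ABSTRACT_SPACE = {
    "vut_speed_kph":      (50.0, 130.0),
    "gvt_speed_kph":      (50.0, 130.0),
    "gvt_offset_m":       (0.0, 3.0),
    "swerve_lead_time_s": (5.0, 20.0),
    "swerve_amplitude_m": (1.0, 1.5),
    "road_curvature_1pm": (-0.01, 0.01),   # 1/m; 0 is a straight road (illustrative bound)
}

def sample_concrete_tests(n, seed=0):
    """Draw n concrete parameter sets from the abstract ranges."""
    rng = random.Random(seed)
    return [
        {name: round(rng.uniform(lo, hi), 3) for name, (lo, hi) in ABSTRACT_SPACE.items()}
        for _ in range(n)
    ]

for test in sample_concrete_tests(3):
    print(test)
```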

This kind of coverage-driven, insight-rich workflow helps teams catch issues early, improve performance robustness, and make data-backed safety claims with confidence, enabling the following (a short analysis sketch follows the list):

  • Running hundreds of automatically generated abstract scenarios
  • Automatically filtering runs based on outcomes, e.g., largest lane-border exceedance
  • Analyzing aggregated metrics like:
    • Lateral acceleration
    • Minimum distance to target
    • Lateral deviation at key timepoints
  • Quickly identifying fragile behaviors or regressions that wouldn’t show up in standard NCAP testing
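
As a small illustration of the filtering and aggregation steps above, the sketch below ranks hypothetical runs by lane-border exceedance and computes a couple of aggregate metrics. The per-run fields are invented for this example; real results come from the evaluation pipeline.

```python
from statistics import mean

# Hypothetical per-run results; in practice these come from the simulation/evaluation pipeline.
runs = [
    {"id": 1, "max_border_exceedance_m": 0.05, "max_lat_accel_mps2": 2.1, "min_dist_to_target_m": 1.8},
    {"id": 2, "max_border_exceedance_m": 0.42, "max_lat_accel_mps2": 3.4, "min_dist_to_target_m": 0.6},
    {"id": 3, "max_border_exceedance_m": 0.00, "max_lat_accel_mps2": 1.7, "min_dist_to_target_m": 2.3},
]

# Filter: surface the runs with the largest lane-border exceedance first.
worst_first = sorted(runs, key=lambda r: r["max_border_exceedance_m"], reverse=True)
print("Worst run:", worst_first[0]["id"])

# Aggregate: simple fleet-level statistics across all generated runs.
print("Mean max lateral accel [m/s^2]:", round(mean(r["max_lat_accel_mps2"] for r in runs), 2))
print("Closest approach across runs [m]:", min(r["min_dist_to_target_m"] for r in runs))
```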

One Platform, Multiple Validation Modes

Foretellix’s toolchain supports validation across the full range of test types: abstract, concrete, and real-world.

Foretify Evaluate helps teams measure where they are in the development and testing process. It enables coverage analysis, triage, and KPI tracking across both abstract and concrete scenarios. Whether a test comes from a synthetic input, a real-world log, or a concrete NCAP replay, Foretify Evaluate applies the same metrics and analysis, helping teams understand what has been tested and what gaps remain.

Smart Replay, part of Foretify Generate, lets teams recreate exact NCAP tests or run slight variations on them. For example, teams can simulate the same concrete test with a vehicle 10 cm off the centerline or approaching 2 kph faster, conditions that might easily occur in reality but fall outside the rigid NCAP specs. This helps identify fragile behavior early and improves performance robustness.

In parallel, Foretellix’s abstract scenario technology can be used to automatically generate thousands of concrete test instances, systematically covering parameter ranges around the original NCAP setup. This makes it possible to validate both compliance and robustness using the same infrastructure.

Together, these tools allow OEMs and Tier 1s to layer abstract testing around concrete scenarios, improving their safety case without sacrificing compliance. And that approach isn’t just effective; it’s aligned with where regulation is heading.

Aligned with Where Regulation Is Heading

Euro NCAP 2026 marks a shift toward broader safety validation. The new protocols introduce a Robustness Layer that allows small variations in test parameters, such as slight speed or offset changes, and formally support simulation to validate these deviations when physical testing isn’t feasible.

Other standards, including UN DCAS proposals, are following suit. Regulators increasingly recognize that real-world safety depends on more than rigid track tests, requiring scalable, explainable, scenario-based validation.

Foretellix enables teams to meet today’s requirements while preparing for tomorrow’s expectations, providing the flexibility and coverage regulators are beginning to call for.

From Checkbox Compliance to Real-World Confidence

Today, many OEMs test for compliance and hope for generalization. But “hope” isn’t a safety strategy. As simulation credibility improves and regulatory expectations shift, the industry is moving toward coverage-based verification at scale.

Foretellix empowers this transition by giving ADAS teams the ability to validate broadly, measure meaningfully, and act confidently without duplicating effort or sacrificing transparency. The approach combines concrete NCAP tests on test tracks with simulation-based testing using concrete and abstract scenarios. This hybrid approach lowers cost and speeds up the process of achieving NCAP compliance in the quest to improve real-world safety.

Whether you’re preparing for NCAP, scaling real-world validation, or doing both in parallel, Foretellix helps you get there with a smarter, more scalable testing workflow. Contact us today to learn more.

Introduction: The Critical Need for Data-driven Evaluation of Autonomous Driving Systems

As advanced ADAS systems do more of the driving for us and fully autonomous vehicles hit the streets without safety drivers (e.g., Waymo robotaxis and Aurora self-driving trucks), the question isn’t just ‘can we build them?’ but ‘can we trust them?’ Significant advances in AI have accelerated ADS development, but their black-box nature makes formal, human-interpretable performance and safety evaluation even more critical [1]. Given the massive volume of simulation and drive log data, a highly automated, scalable evaluation pipeline is essential for ensuring ADS safety. Finally, as fleet operators and logistics companies investigate how to optimize their businesses by leveraging Automated Driving Systems, they increasingly demand transparent metrics and KPIs to confidently deploy at scale. Waymo’s Safety Impact dashboard, along with their recent publication “Determining Absence of Unreasonable Risk: Approval Guidelines for an Automated Driving System Deployment”, provides key insights into why scalable, data-driven, and transparent safety and performance evaluation is now a make-or-break factor for ADS projects.

Key Challenges in ADS Evaluation Across Test Platforms

While evaluation pipelines have long guided ADS development, custom ‘home-grown’ solutions struggle with higher levels of autonomy such as SAE L3 and L4. Some of the main challenges include:

  • Volume and Variety of Test Data: The massive volume of data, coming from many different test platforms (spanning SIL/HIL/VIL simulation to test-track and public-road drive logs) and extracted in a variety of formats, that needs to be systematically aggregated, evaluated, and analyzed against safety and performance metrics and KPIs.

  • Engineering Efficiency and Scalability: The huge engineering effort required to manually curate and harvest interesting scenarios and events from millions of miles of drive logs, triage issues, and perform scenario likelihood and criticality analysis. Often, different teams using different test platforms (such as real-world drive logs vs. synthetic simulations) are unable to share evaluation metrics, KPIs, and analysis tools, limiting reuse, causing effort duplication, and frequently resulting in inconsistent interpretations and implementations of the same metrics across platforms.

  • Developing the required evaluation content: Creating the required ADS “Evaluators” (KPIs, checks, and coverage metrics) in a reusable and extensible way requires significant engineering effort, and this content is often not portable for evaluating data across test platforms. Additionally, if the evaluators are not captured with a good level of formal abstraction, it can be challenging for humans to interpret their intent.

  • Measuring testing completeness within the ODD: The lack of coverage metrics, based on the requirements and risk dimensions of the ODD, that can be used to determine when testing is complete by aggregating, evaluating, and reporting test results across all test platforms.

The Opportunity for a New Approach – Foretify Evaluate

Foretellix addresses these challenges with Foretify Evaluate: a test platform-agnostic, automated, scalable, and explainable evaluation framework delivering actionable insights for technical and management teams.

Why Foretify Evaluate Stands Out

Foretify Data-Driven ADS Development Platform

In the dynamic realm of self-driving technology, every mile driven—real or simulated—brings new learning opportunities and fresh risks. Foretify Evaluate is purpose-built to unlock those insights:

  • Curated Scenarios from Real-World Drive Logs: Extract, annotate, and evaluate scenarios from vast real-world driving logs using a combination of AI and rule-based automation, enabling the analysis of performance and safety metrics in the context of key scenarios while ensuring your testing reflects real-world complexity, not just theoretical models. This capability keeps validation and testing efforts grounded in reality, targeting the most relevant and impactful situations.

  • Extensive Evaluator Library: Access an ever-expanding library of ready-to-use evaluators within the “Evaluation V-Suite”. This library gives you configurable evaluation content to assess a wide spectrum of AV behaviors, metrics, and KPIs — accelerating time to insight and deployment readiness.

  • Comprehensive Analysis, Real and Virtual: Whether your scenarios play out on bustling city streets or in synthetic simulations, Foretify Evaluate delivers structured, meaningful analytics of evaluation results. The same analytic tools can be used for both real-world and synthetic data; from detailed scenario analysis to aggregated metrics dashboards, the platform highlights performance or safety gaps and critical issues that might otherwise go undetected.

  • Unified ODD Coverage Metrics: Leverage OpenSCENARIO DSL coverage metrics to provide an objective measure of testing completeness within your target ODD (a simplified illustration follows this list). Aggregate and track test coverage across real-world and simulation test platforms, ensuring nothing falls through the cracks.

  • Focus Where It Matters: Foretify Evaluate provides advanced search and triage capabilities to prioritize the riskiest situations and the most significant issues, directing your engineering attention—and your resources—where they’ll have maximum impact.
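
To give a feel for what an ODD coverage metric can boil down to, here is a deliberately simplified Python sketch that buckets a few ODD dimensions and reports which combinations have been exercised. Real coverage models are defined in OpenSCENARIO DSL and are far richer; the dimensions and buckets below are illustrative only.

```python
from itertools import product

# Toy ODD coverage model: each dimension is split into named buckets.
ODD_DIMENSIONS = {
    "ego_speed": ["0-30kph", "30-60kph", "60-100kph", "100kph+"],
    "weather": ["clear", "rain", "fog"],
    "junction": ["none", "4-way", "roundabout"],
}

def coverage(observed_combinations):
    """Fraction of the cross-product of ODD buckets hit by at least one test or drive log."""
    all_bins = set(product(*ODD_DIMENSIONS.values()))
    hit = set(observed_combinations) & all_bins
    missing = all_bins - hit
    return len(hit) / len(all_bins), missing

observed = {("30-60kph", "rain", "4-way"), ("60-100kph", "clear", "none")}
score, gaps = coverage(observed)
print(f"ODD coverage: {score:.1%}, e.g. still missing: {sorted(gaps)[0]}")
```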

Once Foretify Evaluate shines a light on every gap in the safety and performance of your ADS, or your ODD test coverage, Foretify Generate can be used to automatically generate targeted scenarios, close the loop on validation, and advance from insights to action—all in one platform.

What to Expect Next in This Blog Series

Stay tuned for follow-on blog posts that will provide technical deep dives into different facets of the Foretify Evaluate solution, giving a firsthand look at how it delivers scalable, data-driven, and transparent safety and performance evaluation.

References

[1] For a deeper dive on the need for formal abstractions for evaluating AI-based Autonomous Driving Systems, check out this recent blog post from Yoav Hollander, the CTO of Foretellix.


Neural reconstruction has emerged as a promising technology, enabling the creation of realistic 3D simulations from real-world drive data. It lets teams validate increasingly complex systems across a vast range of scenarios, including rare edge cases, without compromising safety or development schedules.

This blog highlights a powerful solution from Foretellix that leverages Foretify’s open-architecture foundation, with integrations that let AV developers adopt neural reconstruction technologies, including NVIDIA NuRec and Parallel Domain Replica. By combining Foretellix’s scenario-based validation and automation toolchain, in particular the Foretellix Smart Replay technology, with neural reconstruction, developers can seamlessly generate, replay, and vary reconstructed scenes at scale.

Neural rendering of a reconstructed scene with NVIDIA’s NuRec orchestrated by Foretellix Foretify Toolchain

Why Neural Reconstruction Is Becoming Mission-Critical

AV development has entered an era where end-to-end models and AI-powered autonomy demand richer, more varied data than ever before. Yet real-world testing alone can’t meet these needs. It’s too slow, too expensive, and too dangerous for rare or critical situations. Neural reconstruction offers a compelling alternative by transforming raw sensor logs into fully rendered, simulation-ready environments with unmatched realism, flexibility to vary conditions, and reusability across development and testing pipelines.

Common Challenges in Neural Reconstruction Workflows

Before diving into how the solution addresses industry needs, it’s important to highlight the common challenges we see AV teams facing in neural reconstruction workflows:

  • Snippet discovery: Sifting through massive log datasets to find specific events (e.g. ego left turn with an oncoming agent at 30–40 mph) remains a slow, manual task.
  • Test case definition & maintenance: It’s difficult to define, validate, and maintain robust tests based on reconstructed scenes.
  • Dynamic scenario variation: Changing actor behaviors or environmental parameters across reconstructions is often tedious and error prone.
  • Hybrid scenario creation: Combining real-world snippets with inserted synthetic actors (e.g. inserting a cut-in vehicle) demands fine-grained control.
  • Coverage visibility: Teams lack insight into whether their reconstructed scenarios meet coverage goals or if they leave safety gaps that may limit the scalability and reusability of neural reconstruction in production AV workflows.

An Integrated Workflow from Raw Logs to Safety-Driven Simulation

To address these real-world challenges, Foretellix’s Foretify data-automation platform offers a cohesive, closed-loop solution that enables teams to move quickly from event detection to varied simulation, bringing the promise of neural reconstruction into production-grade validation and training environments. The workflow is designed to move effortlessly from raw driving logs to simulation-ready, safety-focused scenarios, enabling teams to automate and scale what was once a manual, fragmented process.

At a high level, here’s how the workflow works:

1. Snippet Search and Selection: Using Foretify Evaluate, engineers query large scenario datasets to identify relevant real-world driving events based on scenario intent (e.g., occluded pedestrian crossing from the right side at a given speed) and extract snippets to be reconstructed, modified, and replayed.

Selection of an interesting snippet, and extraction of the corresponding scenario file – ready for modifications
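
For illustration, a greatly simplified version of such a snippet query over pre-tagged events might look like the Python sketch below. The attribute names and query function are assumptions for this example, not Foretify Evaluate’s actual interface.

```python
def find_snippets(snippets, maneuver, min_speed_mph, max_speed_mph, require_oncoming=True):
    """Return snippets matching a scenario intent such as
    'ego left turn with an oncoming agent at 30-40 mph'."""
    return [
        s for s in snippets
        if s["ego_maneuver"] == maneuver
        and (s["oncoming_agent"] or not require_oncoming)
        and min_speed_mph <= s["oncoming_speed_mph"] <= max_speed_mph
    ]

snippets = [
    {"id": "log42/t=311s", "ego_maneuver": "left_turn", "oncoming_agent": True, "oncoming_speed_mph": 35},
    {"id": "log42/t=980s", "ego_maneuver": "left_turn", "oncoming_agent": False, "oncoming_speed_mph": 0},
    {"id": "log77/t=120s", "ego_maneuver": "lane_change", "oncoming_agent": True, "oncoming_speed_mph": 38},
]
print(find_snippets(snippets, "left_turn", 30, 40))  # -> the log42/t=311s snippet
```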

2. Scene Reconstruction: The selected snippets are passed to NVIDIA NuRec or Parallel Domain Replica for neural reconstruction. These tools convert raw logs into 3D digital twins with sensor-level fidelity. As outlined in the following steps, engineers have the flexibility to modify and render novel scenes from the reconstructed environments.

A reconstructed scene with Parallel Domain

3. Scene Variation and Scenario Creation: The Foretellix toolchain allows users to modify reconstructed scenes, adjusting timing, agent behavior, or environmental conditions. For example:

Replay original or modified trajectories:
By leveraging Foretellix’s Smart Replay technology, users can configure the behavior of each actor in the scene with fine-grained control. Actors can be set to:

  • Follow the exact trajectories and behaviors recorded in the original real-world event
  • Follow a slightly modified trajectory to explore variations around the original behavior
  • Be controlled by a reactive behavioral model for dynamic response within the scenario

Additionally, individual actors can be selectively removed from the scene to simplify the environment or isolate specific interactions.
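
To make these per-actor options concrete, here is a hypothetical configuration sketch in Python. The enum values and fields are illustrative assumptions, not Smart Replay’s actual API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class ActorMode(Enum):
    REPLAY_ORIGINAL = "replay_original"    # follow the recorded trajectory exactly
    REPLAY_MODIFIED = "replay_modified"    # follow a perturbed version of the recording
    REACTIVE_MODEL = "reactive_model"      # hand control to a behavioral model
    REMOVED = "removed"                    # delete the actor from the scene

@dataclass
class ActorConfig:
    actor_id: str
    mode: ActorMode
    trajectory_offset_m: float = 0.0       # lateral shift applied in REPLAY_MODIFIED
    time_shift_s: float = 0.0              # start the actor earlier/later than recorded
    model_name: Optional[str] = None       # behavioral model used in REACTIVE_MODEL

scene_config = [
    ActorConfig("recorded_car_17", ActorMode.REPLAY_ORIGINAL),
    ActorConfig("recorded_car_23", ActorMode.REPLAY_MODIFIED, trajectory_offset_m=0.4),
    ActorConfig("recorded_truck_02", ActorMode.REACTIVE_MODEL, model_name="idm_follower"),
    ActorConfig("recorded_ped_05", ActorMode.REMOVED),
]
```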

In the video below, a side-by-side comparison of two scenes rendered with NuRec is shown; the scene on the bottom half has been modified with Foretify Smart Replay.

Modify trajectories, insert new actors:
Leveraging Foretify Smart Replay, in combination with the power of Foretellix’s abstract V-Suite libraries, users can also:

  • Insert new actors (either harvested from other scenes or synthetic)
  • Create novel scenarios with inserted actors, leveraging the Foretellix V-Suite libraries

Actors can also significantly deviate from original paths to explore what-if conditions, alternative maneuvers, or edge-case variations. In the video below, the rendering of a novel scene with PD Replica is shown; the inserted actors are synthetic:

Narrow oncoming scenario rendered in a neurally reconstructed scene with inserted synthetic actors

4. Validation and Coverage Feedback: Foretify Evaluate ingests both reconstructed scenarios and existing driving logs, connecting them to unified safety KPIs and ODD coverage metrics via built-in dashboards, allowing teams to close gaps and ensure validation completeness.

ODD Coverage Dashboard 

Use Cases and Applications

The integrated workflow unlocks a range of use cases across AV development:

  • Training AI Models: ML engineers can generate diverse training datasets by reconstructing rare scenarios and creating safe variations
  • Closed-Loop Simulation: V&V teams can replay real-world edge cases and simulate novel actor behaviors to assess AV safety responses
  • Hybrid Scenario Creation: Developers can insert new agents into existing scenes to test system robustness under complex interactions
  • Safety Evidence Generation: Teams can automatically link each test to scenario coverage dashboards and safety KPIs

For example, a developer might identify a left-turn event from a fleet log, reconstruct it using NuRec or PD Replica, insert a synthetic speeding vehicle using Foretify Generate, and assess how the AV planner responds, all within a single workflow.

What AV Teams Gain from This Integration

By closing the loop from real-world event to simulation-ready variation to KPI tracking, the combined solution from Foretellix with NVIDIA and Parallel Domain offers several benefits:

  • Dramatically faster scenario creation from real-world logs
  • High reusability of reconstructed data, reducing dependence on costly physical testing
  • Scalable variation of edge cases, enabling robust AI training and safety validation
  • Improved coverage tracking, ensuring confidence in both rare and routine scenarios

Laying the Groundwork for Safer AI-Driven Autonomy

As AV stacks become increasingly AI-driven, teams need simulation workflows that match the pace and complexity of development. These integrated solutions powered by Foretellix’s data automation platform and leading neural reconstruction tools from NVIDIA and Parallel Domain make it possible to go from real world drive logs to validated scenarios with unprecedented speed and control.

It’s a major step toward safer, more efficient AI AV development.

Want to see this integration in action? Contact us to learn more.

Developing autonomous vehicle (AV) stacks capable of safely navigating real-world environments involves rigorous training, testing, and validation against countless realistic scenarios. Each scenario must be represented in numerous variations to comprehensively assess and ensure AV stack safety and performance. This complexity underscores the critical importance of systematic scenario variation and management within the AV development process.

Effective scenario variation goes far beyond minor adjustments or changes in the weather. After faithfully replaying a real-world drive in a simulated environment, or generating fully synthetic scenarios of rare and dangerous edge cases, developers must thoughtfully generate diverse and realistic variations encompassing both behavioral dynamics and environmental conditions. Achieving this level of scenario diversity requires intelligent automation and sophisticated simulation technology, particularly at higher levels of autonomy, where AV stacks must perform reliably without human intervention, and in complex operational design domains (ODDs) such as urban driving environments.

Let’s examine ten examples of essential categories for scenario variation (listed below in no particular order), highlighting key considerations for AV developers aiming to achieve robust safety and performance validation:

1. Number and Types of Other Vehicles

To thoroughly test AV stacks, scenarios must include a variety of vehicles moving in multiple directions:

  • Additional vehicles traveling alongside or in the opposite direction to the AV.
  • Diverse vehicle types, from motorcycles and passenger cars to trucks, SUVs, bicycles, and even unconventional vehicles such as tractors or horse-drawn carts.

2. Variations in Maneuvering and Vehicle Dynamics

Realistic variations in vehicle dynamics are crucial:

  • Vehicles performing cut-ins, overtakes, merges, turning, crossing paths, and lane changes.
  • Varied speeds and realistic following distances between vehicles.
  • Constraints ensuring scenarios remain physically plausible, for example avoiding unrealistic events such as a semi-trailer truck overtaking the autonomous vehicle (Ego) at 300 mph and cutting in less than 5 meters in front.

3. Other Vehicle Behaviors

Human driver behaviors in other cars vary widely and significantly affect AV interactions:

  • Scenarios should include overly cautious, aggressive, distracted, or impaired drivers.
  • Variations involving drunk driving, distracted driving, or sudden erratic maneuvers to test AV reaction capabilities under challenging conditions.

4. Number and Diversity of Pedestrians

Pedestrian interactions, or more broadly VRU (Vulnerable Road User) interactions, are inherently complex and varied:

  • Scenarios should range from solitary pedestrians to groups, including individuals with pets, children, or elderly companions.
  • Representation of diverse demographics: varied age groups, genders, body types, races, clothing styles (dark or reflective clothing, traditional or religious attire), and accessories like backpacks or hoodies.

5. Pedestrian Behavior Variations

Pedestrian actions can dramatically influence AV decision-making:

  • Behaviors including running, loitering, sudden stops, and unexpected crossing movements.
  • Simulating unpredictable pedestrian behavior to evaluate AV adaptability and responsiveness such as a child chasing a ball from between parked cars.

6. Animal Encounters

Animals introduce additional unpredictability:

  • Testing scenarios may include encounters with domestic pets (dogs, cats), livestock (horses, cattle), and wild animals (deer, bears, elephants), although these often depend on the ODD.
  • Variations involving different sizes, speeds, and behaviors of animals are critical to ensure AV systems handle these interactions safely.
  • A scenario that is often overlooked is “flocks of birds”. When an AV in an urban environment drives up to a flock of pigeons who all then take off at the same time, the AV can face a literal and metaphorical shit-storm 🙂

7. Stationary Objects on the Road

Stationary obstacles present distinct perception challenges:

  • Variations may include parked vehicles, construction equipment, traffic signs, advertising boards, and various types of gates or barriers.
  • Inclusion of static or dynamic obstacles of varying shapes, sizes, colors, and reflective properties such as plastic bags, rocks or a couch that has fallen off a truck are equally relevant to validate the AV system.

8. Lane Configurations

Roadway designs significantly impact AV performance:

  • Testing variations including single-lane rural roads, multi-lane highways, complex urban intersections, and dynamically changing lane conditions.
  • Junction variations from different locations on the ODD map – for example, there may be multiple configurations of a right turn across the map.
  • Diverse road/lane markings and lack thereof, often a problem in my neighbourhood!

9. Weather and Visibility Conditions

Environmental conditions substantially affect AV sensors and decision-making:

  • Scenarios should range from clear conditions to heavy rain, snow, fog, dust, glare, and low-light, twilight or nighttime situations.
  • Testing AV stack performance across diverse weather events ensures robust sensor performance and accurate object detection under varied visibility conditions.

10. Environmental Diversity

Finally, varying the broader environmental context is essential:

  • Inclusion of urban, suburban, rural, and industrial environments.
  • Variations in landscape elements, such as trees, buildings, fences, and commercial signage, influence AV perception and decision-making.
  • Realistic shading and lighting variations to further test sensor systems.

The complexity outlined above clearly illustrates the non-trivial challenge faced by AV developers when generating meaningful scenario variations of their existing drive logs. It is evident that manual creation or superficial scenario changes fall drastically short of what’s required to adequately validate advanced AV stacks, a challenge that grows exponentially when requiring realistic data for AI-powered AV stack training.

Addressing these challenges effectively requires a closed-loop toolchain. By defining an ODD coverage plan, evaluating existing drive logs, and measuring current ODD coverage, Foretellix’s Foretify reveals the safety gaps that need addressing and applies advanced scenario generation technology to close them. The Foretify toolchain incorporates intelligent constraint solvers that systematically create extensive yet feasible and relevant scenario variations. Automating the generation of robust, diverse, and physically plausible scenario variations, whether from existing real-world drive logs or for fully synthetic simulations, dramatically reduces development time while improving system robustness.
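
As a very rough illustration of the "feasible yet varied" idea, the sketch below generates cut-in variations by rejection sampling against simple plausibility constraints (echoing the semi-trailer example above). Foretify uses intelligent constraint solvers rather than rejection sampling; the constraints and parameters here are illustrative assumptions.

```python
import random

def plausible(params):
    """Reject physically implausible cut-in variations (simplified constraints)."""
    return (
        0 < params["cutin_speed_kph"] <= 130                      # no 300 mph trucks
        and params["cutin_gap_m"] >= 5.0                          # no cut-ins closer than 5 m
        and params["cutin_speed_kph"] > params["ego_speed_kph"]   # cutting in requires overtaking
    )

def generate_variations(n, seed=0):
    """Draw cut-in scenario variations, keeping only plausible ones."""
    rng = random.Random(seed)
    kept = []
    while len(kept) < n:
        candidate = {
            "ego_speed_kph": rng.uniform(60, 120),
            "cutin_speed_kph": rng.uniform(60, 160),
            "cutin_gap_m": rng.uniform(2, 40),
            "cutin_vehicle": rng.choice(["car", "motorcycle", "truck"]),
        }
        if plausible(candidate):
            kept.append(candidate)
    return kept

print(generate_variations(3)[0])
```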

The scenarios generated by Foretellix’s Foretify can either be used to validate the AV stack or as training data for AI-powered AV stacks, greatly shortening the required development time and ensuring comprehensive coverage of the ODD. By leveraging Foretellix’s Foretify toolchain, AV developers and manufacturers can confidently achieve comprehensive training, testing and validation, essential steps toward realizing the full potential of safe and reliable autonomous mobility.

Contact us to get a demo and learn more about how we generate scenario variations within the Foretify toolchain.

As a company, Foretellix is particularly bullish about multi-modal, foundation-model-based V&V assistants that automate processes and take on a growing share of what a V&V engineer does. To help engineers move faster without disrupting their flow, Foretellix now offers customers a new way to integrate its testing capabilities directly into their favorite AI developer assistants, such as Claude Desktop, Cursor, and Foretellix’s own Foretify AI Assistant. This approach is made possible by the Model Context Protocol (MCP), a new standard that connects AI clients and AI agents to external tools in a secure and flexible way.

Instead of switching between environments, engineers can interact with Foretellix platforms such as Foretify and V-Suites using simple prompts inside the AI assistants they already trust. They can ask to search for relevant scenarios, launch specific tests, or analyze results using natural language. 

The assistant handles the request by talking to the MCP server, which connects locally to Foretellix tools running on the customer’s infrastructure. Rather than replacing existing tools, this is a new way to access and extend them, built to fit naturally into how engineers already work.
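
For readers unfamiliar with MCP, here is a minimal sketch of what exposing tools over MCP can look like, using the FastMCP helper from the official Python SDK. The tool names and the placeholder logic behind them are hypothetical; they are not the actual Foretellix MCP server or Foretify API.

```python
from mcp.server.fastmcp import FastMCP

# Hypothetical server exposing a couple of illustrative tools to an AI assistant.
mcp = FastMCP("foretify-bridge-example")

@mcp.tool()
def search_scenarios(query: str, max_results: int = 10) -> list[dict]:
    """Search the scenario library for scenarios matching a natural-language intent."""
    # Placeholder: a real server would call the locally running Foretify APIs here.
    return [{"id": "cut_in_with_ped_crossing", "match": query}][:max_results]

@mcp.tool()
def launch_test_run(scenario_id: str, num_variations: int = 100) -> str:
    """Launch a test run for the given scenario and return a run identifier."""
    return f"run-0001 ({scenario_id}, {num_variations} variations)"  # placeholder

if __name__ == "__main__":
    mcp.run()  # the AI assistant connects to this server and can call the tools above
```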

Part of a Broader Shift Toward AI-Native Engineering

The MCP server reflects a broader evolution in how engineers interact with complex systems. As AI assistants become a natural part of the developer toolkit, teams are looking for ways to connect those assistants to the tools they rely on every day.

Foretellix is at the forefront of this shift. Rather than requiring users to learn a new interface or adapt to a new workflow, we bring testing capabilities into the environments engineers already prefer. This approach makes it easier to adopt AI without introducing risk, overhead, or unnecessary complexity.

The Model Context Protocol is quickly gaining traction across the development community as a standard for integrating AI with software tools. As more AI clients and agent frameworks adopt MCP, Foretellix tools will remain compatible and easy to integrate into a growing ecosystem of developer-focused solutions.

This feature represents a step toward a future where testing, analysis, and even scenario creation can happen in conversation with an intelligent assistant that understands your toolchain and respects your constraints.

What You Can Do With the MCP Bridge

With the MCP server in place, engineers can already perform a wide range of tasks through their AI assistant without leaving their development environment. 

In the example below, a developer using their AI assistant of choice can ask it to find a scenario involving a vehicle cut-in with a pedestrian crossing. The assistant queries the Foretify scenario library, surfaces available options, and allows the user to select and launch a test run.

Once the test is complete, the assistant can retrieve the results and assist with post-run analysis. Engineers can ask questions like “What affected the time to collision?” or “Were there any collisions in this run?” The assistant pulls metrics, performs calculations, and even generates visualizations based on the data with no scripts or manual coding required.

MCP Bridge for AI AV Development

This approach is especially useful for ad hoc analysis and triage. Traditional dashboards are often too rigid or narrowly scoped to answer the specific questions engineers have in the moment. There is no one-size-fits-all solution for analyzing simulation runs, and teams frequently need custom views or insights depending on what they are investigating. 

With the MCP bridge, the AI assistant becomes an analysis co-pilot, suggesting how to approach the results, what metrics to examine, and how to visualize key findings. Engineers can quickly investigate anomalies, explore edge case behavior, and generate tailored dashboards in minutes, all within the same AI interface they already use for documentation and coding support.

By reducing the steps involved in testing and analysis, the MCP integration frees engineers to focus more on high-value problem-solving and less on managing tools.  It allows them to work at a higher level of abstraction, where they can explore complex questions, experiment more freely, and act on ideas without needing to pause for tool configuration or manual scripting. 

This also opens up new possibilities across the development lifecycle, empowering engineers to run simulations, analyze results, and create visualizations on their own before involving additional teams. It enables earlier exploration, faster iteration, and more confident decision-making at every stage of development.

Built to Grow with Your Workflow

The MCP server already enables a range of useful tasks, but its design allows for much more. Foretellix is continuing to expand the capabilities exposed through this integration to support deeper analysis and more advanced testing workflows.

Engineers will be able to perform coverage gap detection, generate complete test plans, and receive AI-assisted guidance on improving scenario coverage. Support for the OpenSCENARIO 2.0 DSL (OSC 2.0), including the ability to explain or generate OSC code using natural language, will also be possible.

As the interface matures, Foretellix’s own Foretify AI Assistant will be able to perform more tasks, and the MCP connection will give external AI assistants increased access to Foretify’s exposed APIs. This means engineers will eventually be able to query and visualize trends across runs, compare results over time, and automate larger testing pipelines, all through the same assistant interface.

This sets a long-term foundation for enabling AI-augmented development within the AV testing stack, with practical use cases available today and a clear path to more advanced functionality.

Try the MCP Integration and Streamline Your AV Testing Workflow

The MCP server can enhance existing development and validation processes. This integration allows engineers to work faster, ask better questions, and extract more value from the tools they already use, all without compromising control over sensitive data.

If your team is using Claude Desktop, Cursor, or another MCP-compatible AI assistant and you’re interested in learning more or sharing your thoughts, then please contact us to stay in the loop.

Foretellix’s Foretify AI Assistant is available for beta testing, and we are actively partnering with teams who want to push the boundaries of what AI-assisted testing can look like. Contact us to learn more.


In this blog, I will look at the near-term future of AI-based autonomy and will discuss:

  1. Some trends in AI-based autonomy – E.g. the move to “end-to-end”
  2. The growing role of V&V in autonomy – And the need for a common tool for both V&V and implementation of AI-based systems
  3. How AI will help V&V – In sensor simulation, behavioral models, text-to-scenario, AI assistants etc.
  4. Why abstractions will always be needed for V&V – And how they should be expressed

Before moving on, let me clarify what I mean by “near-term”, “AI-based autonomy”, “abstractions” and “V&V”.
Please note that this is a blog, not a scientific paper, so kindly forgive some informality.

Definitions

“Near term”: The future is arriving a lot faster lately, so it is hard to be sure about the timeline. In this document, when I refer to “near term” I mean, say, the next five years (but I expect much of this to happen over the next two years).

“AI-based autonomy”: I use “autonomy” as short-hand for safety-critical, embodied (physical) autonomy: Autonomous (or semi-autonomous) vehicles, robots, ships, drones and so on. I’ll use Automated Driving Systems (ADS), and especially Autonomous Vehicles (AVs) as my running example, but the discussion applies to all of the above. Some of the observations probably apply to safety-critical non-embodied autonomy, but I’ll ignore that for now.

I use “AI-based autonomy” for the increasingly ML-based autonomy. It can refer to full end-to-end systems, or to AV architectures where ML is taking a key role (e.g. Compound AI Systems as discussed here).

“Abstractions”: By that I mean “formal abstractions” – the explicit and exact definitions or encoding of specific human knowledge, e.g. from the domains of mathematics, physics, and driving rules. For example, the notions of “time to collision” and “four-way stop” are formal abstractions.
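
As a tiny example of what “explicit and exact” means, time to collision can be encoded directly. The sketch below is a simplified one-dimensional version assuming constant speeds; richer formulations exist.

```python
def time_to_collision(gap_m: float, ego_speed_mps: float, lead_speed_mps: float) -> float:
    """TTC = gap / closing speed, for a lead vehicle directly ahead.

    Returns infinity when the ego is not closing in on the lead vehicle.
    """
    closing_speed = ego_speed_mps - lead_speed_mps
    if closing_speed <= 0:
        return float("inf")
    return gap_m / closing_speed

print(time_to_collision(gap_m=30.0, ego_speed_mps=20.0, lead_speed_mps=10.0))  # 3.0 s
```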

There are also “informal abstractions” – I’ll talk about those separately, and explicitly use the term “informal abstractions”.

“V&V”: Verification and Validation normally refers to the process of gaining confidence in the safety/quality of the system, finding and fixing bugs, and so on. It is an iterative process which uses what I am calling below “data automation facilities”.

“Data automation”: I’ll use the term “data automation facilities” to describe the facilities (machinery, methodologies, content) used in V&V. This includes defining metrics (coverage space, KPIs and checks), generating / matching scenarios, evaluating metrics, handling requirements, triaging, fixing the bugs and so on.

The reason I am calling them “data automation facilities” (rather than just “V&V facilities”) is that these facilities turn out to be extremely useful not just for V&V, but also for the implementation process (e.g. to guide ML training), as we’ll see below.

Expected trends in autonomy

Here are some trends I expect over the next few years (details later):

  • Autonomy will be increasingly AI-based
    • Whether it will be fully end-to-end, or just more-ML-based, is unclear
  • The autonomy market will quickly grow
    • E.g. people are already working on generic, multi-task, foundation-model-based robotics frameworks (here is one)
  • The effort to deploy autonomy will be increasingly about data automation
  • AI will also play an increasing part in doing data automation
  • But data automation also needs transparency and abstractions

At any given time, autonomy builders will look for the most efficient way to achieve the desired level of safety. This will involve the right mix (for that time) of AI in the implementation, AI in the data automation, and non-AI pieces.

The growing role of data automation in AI-based autonomy

Much of system development is about V&V. There is an old joke that all system development is really V&V: you simply start with an empty system, and V&V discovers the “nothing works” bug.

How does that change when we move to safety-critical AI-based autonomy? I claim that data automation (of the kind used in V&V) is much more important in that world: AI makes it easier to construct complex systems, but those systems are often buggy (consider LLM hallucinations) and harder to verify (because they are mostly black-box). So the “V&V ratio” (the ratio of time spent on V&V vs. implementation) keeps growing.

But also, the implementation process itself needs data automation facilities. I’ll discuss this more below, but the short version is as follows: Designing AI-based autonomy is mostly about ML training. And to train the system correctly on the “full space of things it may encounter” you have to explore that space, and also codify what counts as “good” and “bad” behavior in that space. This sounds a lot like coverage and checking.

Also, bug fixes / feature additions are mostly about adding (corrective and negative) examples to the training set. And all that exploring, finding bug examples and so on needs more-or-less the same V&V facilities (which we now call data automation facilities). Note that the AI training world uses the term “data flywheel” for much of this.

Perhaps it is no exaggeration to say that the majority of the effort of building and deploying such systems is about data automation. See below a high-level picture demonstrating that.

[Figure: Data Automation for AI-Centric AVs]

Using data automation in the implementation process

Let’s use an end-to-end AV as our example. A huge part of making it work (and then making it safe enough) is about training the ML system. This involves data automation in two ways:

  • Use data automation to enumerate the various “areas” and train on examples of them
  • Use data automation to find bugs in the “current system”, then fix them via further training

Let me start with the second part (though it often happens later). This is really an extension of the “normal” V&V process (a rough code sketch follows the list):

  • Take the current trained system
  • Do very good V&V
    • Ideally use a detailed “verification plan” (VPlan) which is appropriate for the Operational Design Domain (ODD)
    • Fill coverage using real-world drive log evaluation and synthetic test generation
    • Do checking, failure triage and so on
  • Whenever you find a bug
    • Generalize it to “the problem area”
    • Find / create enough examples for that area
    • Use them to “fix” the training set (using corrective or negative feedback)
  • Loop until the targets (safety, legality, comfort etc.) are reached
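
Here is a rough, self-contained Python sketch of the loop above; the V&V run, example mining, and retraining steps are hypothetical stand-ins for real tooling, not actual Foretify or training APIs.

```python
from dataclasses import dataclass, field

# Rough sketch of the V&V-driven fix loop above. The "run_vv", "mine_examples" and
# "retrain" arguments are hypothetical stand-ins for real tooling, not actual APIs.

@dataclass
class VVResults:
    coverage: float                                  # fraction of the VPlan covered
    failures: list = field(default_factory=list)     # triaged failures, each tagged with an "area"

def meets_targets(results: VVResults, min_coverage: float = 0.95) -> bool:
    return results.coverage >= min_coverage and not results.failures

def fix_loop(training_set: list, run_vv, mine_examples, retrain, max_iters: int = 20) -> list:
    """Iterate: run V&V, generalize failures to areas, mine examples, retrain."""
    for _ in range(max_iters):
        results = run_vv(training_set)
        if meets_targets(results):                   # safety, legality, comfort targets reached
            break
        for area in {failure["area"] for failure in results.failures}:
            training_set += mine_examples(area)      # corrective / negative examples for the area
        retrain(training_set)
    return training_set
```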

But we also need the first part – the second part is not efficient enough: It takes a lot of V&V effort to encounter the bugs. It is more efficient to train on the various areas even before finding any specific bugs:

  • Enumerate the “areas” you can think of (including “difficult areas”)
    • E.g. “handle various other-actor illegal behavior”
  • Create a detailed VPlan accordingly
  • Create “enough” examples for the various areas
    • Using real-world drive log evaluation and synthetic test generation
  • Train on them

In reality, people will use both techniques iteratively, and both need more-or-less the same data automation facilities, as we’ll see below.

V&V and implementation need similar data automation facilities

So we need similar data automation facilities (machinery, methodologies, content) for both purposes. Here are some examples of that (a toy scenario-matching sketch follows the list):

  • Finding examples in real world drive logs
    • Required for both V&V and training (where it is part of the curation process)
    • Enabled by data automation scenario matching
  • Augmenting real world drive logs to improve the coverage
    • Required for both V&V and training
    • Enabled by data automation smart replay of real-world drive snippets (+ variations)
  • Creating fully synthetic scenarios for safety critical, very rare areas
    • Required for both V&V and training
    • Enabled by data automation constrained random generation based on abstract scenarios
  • Using a formal abstraction language to capture scenario definitions, coverage goals, KPIs, etc.
    • Required for both V&V and training (see below)
    • Enabled by using a standard Domain-Specific Language (DSL) for abstract (evaluation and generation) scenario definitions – OpenSCENARIO DSL
  • Doing triage and debug
    • To understand problems
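
As a toy illustration of the first item above (scenario matching over drive logs), here is a sketch that finds low-time-to-collision segments in a pre-processed log; the log format and the 2-second threshold are made up for the example.

```python
# Toy illustration of scenario matching over a drive log.
# Each log sample is a dict with a timestamp and a pre-computed time-to-collision;
# the format and the 2-second threshold are illustrative only.

def match_low_ttc_events(log, ttc_threshold_s=2.0):
    """Return (start, end) timestamps of segments where TTC stays below the threshold."""
    events, start = [], None
    for sample in log:
        below = sample["ttc_s"] < ttc_threshold_s
        if below and start is None:
            start = sample["t"]
        elif not below and start is not None:
            events.append((start, sample["t"]))
            start = None
    if start is not None:
        events.append((start, log[-1]["t"]))
    return events

# Example: two low-TTC segments in a short synthetic log
log = [{"t": t, "ttc_s": ttc} for t, ttc in [(0, 9), (1, 1.5), (2, 1.2), (3, 8), (4, 1.8), (5, 7)]]
print(match_low_ttc_events(log))   # [(1, 3), (4, 5)]
```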

The growing role of AI in data automation

AI-based autonomy depends on doing data automation well, but luckily AI can offer significant help in the data automation process itself. This includes things like:

  • AI-based improved sensor simulation
    • Including neural reconstruction and similar techniques
  • AI-based behavioral models
    • Which exhibit more natural behavior
  • AI-based text-to-scenario and text-to-matched-instances facilities
    • Often based on Vision-Language foundation models
  • AI-based V&V assistants
    • More on this below

V&V assistants: I am particularly bullish about multi-modal, foundation-model-based “V&V assistants”, which can access the computer and do a growing part of what a V&V engineer does. They will, of course, be wrong sometimes. But for any set of tasks where they are (say) “correct” in more than 80% of the cases, they will become extremely useful.

Consider a triage-and-analysis assistant: It runs in parallel to (say) the nightly multi-test execution flow, stops/corrects erratic simulation runs, does triage on its own and so on. But because it may sometimes be wrong, it is crucial that (see the sketch after this list):

  • The assistant will leave a detailed, structured log of what it did (and why)
  • You should be able to chat with it about any detail in the log
  • The assistant’s “decisions” (e.g. failures clustering and categorization) will be “undo-able” (e.g. stored in Git so you can revert/modify them)
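
One minimal way to implement such a structured, undo-able log is to write each decision as a JSON record and commit it to Git, so that any decision can later be reverted or modified. This sketch assumes it runs inside an existing Git repository; the file layout and fields are illustrative.

```python
import datetime
import json
import pathlib
import subprocess

# Minimal sketch of a structured, revertible assistant decision log.
# Assumes an existing Git repository; paths and record fields are illustrative.

LOG_DIR = pathlib.Path("assistant_decisions")

def record_decision(run_id: str, decision: str, reasoning: str) -> pathlib.Path:
    """Write one decision as a JSON file and commit it, so 'git revert' can undo it."""
    LOG_DIR.mkdir(exist_ok=True)
    entry = {
        "run_id": run_id,
        "decision": decision,          # e.g. "cluster this failure with an existing bug"
        "reasoning": reasoning,        # why the assistant decided this
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
    path = LOG_DIR / f"{run_id}.json"
    path.write_text(json.dumps(entry, indent=2))
    subprocess.run(["git", "add", str(path)], check=True)
    subprocess.run(["git", "commit", "-m", f"assistant: {decision} ({run_id})"], check=True)
    return path
```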

Over time, AI assistants will improve: They will be able to auto-create tests (e.g. to try and expose more instances of a suspected bug), and much more. Which raises the following question:

Can all V&V be done by AI?

The short answer is “no”: People are not going to be satisfied with the answer “The AI said it’s OK”. About 7 years ago, I wrote a blog post about a similar topic, and claimed that:

Regulators (and common sense) will insist that verification will be based upon human-supplied requirements, consisting of scenarios, coverage definitions, checks and so on – let’s call them all “rules” for short. So some rule-based specifications will always be here.

Much of this is about abstractions. The rest of this document will discuss in more detail why human-defined abstractions are essential for safe AI.

What do assessors need?

Both AI-based autonomy and AI-assisted V&V will arrive gradually: For instance, over time we’ll have better and better V&V assistants, but at any given time we’ll still need to somehow check that they did the right thing. So the work of V&V will always have some human part in it.

One good way to look at this is to start from the “assessors”: There will always be people whose job is to assess, in a very thorough way, whether the autonomous system is indeed trustworthy. “Assessing” here means making sure that:

  • We tested all the “required situations” (“coverage”)
  • We applied the right, context-dependent criteria to each situation (“checking”)
  • The results were “good enough” to meet the target safety

Who are those assessors? I use this term in a fairly general way: There are various categories of humans who will be required to assess the autonomous system, including:

  • Regulators
  • The jury in an AV-related accident trial (those are not going to disappear)
  • An AV fleet operator (before choosing a specific AV supplier)
  • An OEM (before accepting an external AV stack)
  • An AV company’s management team (before making a “deploy” decision)
  • An AV company’s V&V team (before accepting a change)
  • And so on

How is assessing done?

For an assessor to do a good job, a reasonable starting point is a well-written safety case. And that safety case should (among other things) point at the various situations tested (coverage), the checking done, and the results.

The assessor should be able to dive into that information (safety case, verification plan / coverage results, checks / KPIs, triaged test results and so on) to any depth, to convince themselves that the implementation is now trustworthy (or to understand where it is not).

The assessor may use a V&V assistant (e.g. to understand what-was-tested at various levels of granularity). In fact, I expect this to be pretty helpful, perhaps using “debating assistants”, where one assistant tries to convince the assessor that the system is well-tested, and a second assistant looks for reasons why it is not.

But regardless of whether assessors are aided by human or AI assistants, eventually they need to convince themselves that a “reasonably comprehensive set” of the needed situations was covered and checked, and the results were “good enough”. And this implies transparency, human terminology and well-defined abstractions.

The role of abstractions

Assessing needs abstractions: You need precise (intuitive but formally-defined) terms for expressing the situations, checks and results. Terms like “time to collision”, “unprotected turn”, “rolling stop”, “left-driving-countries”, “minimal risk maneuver” and so on should be formal, precisely-defined abstractions.

Note that the actual end-to-end autonomous implementation is not limited to “thinking” in terms of these abstractions: This is actually a good thing for the implementation (allowing it to also handle vague cases). But you do need abstractions to assess the behavior (and for the implementation process, e.g. to influence training).

Also note that informal abstractions (e.g. AI-based text-to-scenario) are also very useful – more on this below. However, since they are not precisely-defined, they cannot replace the need for formal abstractions.

So abstractions will be crucial for V&V (and data automation in general). This will become clearer below, as I dive into the usage of abstractions in coverage and checking.

Coverage abstractions

Suppose you are starting to define a “coverage map” of your ODD (for either V&V or implementation, as discussed above). You need to consider maneuvers, weather conditions, faults and much else.

You then split e.g. the maneuvers space into scenarios (unprotected turns, four-way stops and so on). You then further split your scenarios into cases (coverage bins): Left and right unprotected turns, time from yellow light to start of crossing (split into specific time-range bins), speed of nearest vehicle when crossing (split into specific speed ranges), “close calls” where the Post Encroachment Time (PET) was less than N seconds, and so on.

Note that these terms (“unprotected left turn”, “PET” etc.) are all abstractions which need to be precisely defined. And then you may want to combine (mix) each of these scenarios with other scenarios involving even more abstractions (“flat tire”, “sensors disabled”, “emergency vehicle enters junction”).

So you need some formal language to evaluate which of these scenarios happened, split them into bins according to which sub-cases happened, measure things like PET and so on. And this needs to be fully-transparent to avoid misinterpretations.
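
Here is a toy sketch of such a coverage map for one scenario; the bin boundaries and field names are made up, and a real verification plan would of course be far richer.

```python
# Toy sketch of coverage bins for an "unprotected left turn" scenario.
# Bin boundaries and field names are illustrative, not a real verification plan.

PET_BINS = [(0, 1), (1, 3), (3, 6)]          # Post Encroachment Time ranges, in seconds
SPEED_BINS = [(0, 20), (20, 40), (40, 70)]   # nearest-vehicle speed ranges, in km/h

def bin_of(value, bins):
    for lo, hi in bins:
        if lo <= value < hi:
            return (lo, hi)
    return None

def cover(events):
    """Count how many matched events fall into each (turn-direction, PET, speed) bin."""
    hits = {}
    for e in events:
        key = (e["direction"], bin_of(e["pet_s"], PET_BINS), bin_of(e["speed_kph"], SPEED_BINS))
        hits[key] = hits.get(key, 0) + 1
    return hits

events = [
    {"direction": "left", "pet_s": 2.4, "speed_kph": 35},
    {"direction": "left", "pet_s": 0.8, "speed_kph": 55},
]
print(cover(events))   # two distinct bins, one hit each
```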

Checking and KPI abstractions

Checking is even more complicated, and needs even more abstractions:

  • You need to compute various KPIs
    • E.g. PET and time-to-collision
  • You need to check for obeying rules and conventions
    • E.g. “no rolling stop”
  • Checking often depends on context
    • E.g. don’t cross a solid dividing line, except if it must be done (e.g. to avoid a running pedestrian on the road)
    • So you also need formal definitions for those “context scenarios”
  • Checking is also country/ODD-dependent
    • E.g. not all countries have four-way stops
    • E.g. if you don’t support rain, you have to check how your vehicle performs a Minimal Risk Maneuver when it starts raining

And all these need to be defined transparently using formal abstractions.
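
As a minimal sketch of one such context-dependent check (everything named here is illustrative):

```python
# Minimal sketch of a context-dependent check: crossing a solid dividing line is a
# violation, unless the "avoiding a pedestrian on the road" context scenario is active.
# Field names and the warning/violation split are illustrative.

def check_solid_line(frame: dict) -> str:
    crossed = frame["crossed_solid_line"]
    avoiding_pedestrian = frame["pedestrian_on_road"] and frame["evasive_maneuver"]
    if not crossed:
        return "pass"
    return "warning" if avoiding_pedestrian else "violation"

print(check_solid_line({"crossed_solid_line": True, "pedestrian_on_road": True, "evasive_maneuver": True}))    # warning
print(check_solid_line({"crossed_solid_line": True, "pedestrian_on_road": False, "evasive_maneuver": False}))  # violation
```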

Note that sometimes you need to combine this with “informal abstractions” (because some of the concepts are inherently vague). More on this below.

About finding unknowns: You may think that finding unknown unsafe situations (as per SOTIF) does not need abstractions – you don’t know what you are looking for, so you can’t express it abstractly.

But that’s not really the case. Looking for unknowns is best done as follows:

  • Complement real-world log data by generating new data (especially for corner cases)
    • While smartly mixing together known, abstract dimensions
    • While randomizing various parameters
    • Note that randomizing and still getting valid results requires abstractions – e.g. for creating an “unprotected left turn with PET < 3 seconds”
  • Look for indications of something starting to go “bad” (e.g. low time-to-collision)
    • This is also usually defined in abstract terms
  • Then maybe use various data-driven search techniques to home in on the problem
    • Thus turning it from unknown to known

The role of informal abstractions

As I mentioned above, informal abstractions (especially multi-modal, foundation-model-based abstractions) can be extremely useful. The trick is when to use them, and how to combine them with formal abstractions.

One example is “text-to-generative-scenarios”: While it is hard to “direct” this precisely, it can still be very useful in many cases (to quickly-and-intuitively guide the generated scenario), as long as it is coupled with the ability to evaluate the results using formal abstractions.

Another example is “text-to-matched-instances”: Sometimes we do need instances of somewhat-vaguely-defined abstractions (“need to yield”, “be considerate to pedestrians” etc.), as part of our coverage and checking mechanisms.

There are complex situations where the “correct” behavior partially depends on informal abstractions. Often an abstraction has a clear, formal subset (for which a rule will give a simple answer), surrounded by a fuzzier, informal area (where the rule is less clear, and violating it should perhaps result in just a warning).

To summarize

These are the four key takeaways:

  • AI-based autonomy is advancing much faster now
  • As a result, the relative importance of V&V is rising
  • Data automation (for both implementation and V&V) is gaining importance
  • Formal abstractions are a central part of that


Realistic and reliable driving data is imperative for training AI-powered AV stacks with end-to-end planning models, and yet real-world data is limited in diversity and scale. The solution is to generate synthetic data that is both scalable and controllable.

This blog outlines the process our team followed using Foretellix’s Foretify development toolchain, integrated with CARLA and NVIDIA Cosmos, to create, select and render high-quality “right-of-way violator” scenarios for the purpose of generating synthetic sensor data. This data feeds into imitation learning pipelines for end-to-end autonomous driving stacks, ensuring diverse, intent-preserving, and scalable training scenarios for improved AV stack performance and safety.

Motivation: Addressing a Critical Data Gap

The process originated from a performance gap identified in an AV stack during validation. The system struggled with a specific traffic scenario: “Driving through a junction with right-of-way violators”.

Upon analysis, AV engineers discovered that:

  • Real-world driving logs lacked sufficient examples of this scenario, especially those where the autonomous vehicle under test (EGO) responds correctly
  • This data scarcity, specifically for safety-critical scenarios, made it impossible to train or evaluate models on this behavior using recorded fleet data alone

To address this, the team enacted the following process to generate synthetic sensor data representing such interactions. The goal was to systematically simulate plausible yet rare violations, ensuring the AV stack learns to handle them through imitation learning with high-quality, diverse, and intent-aligned synthetic data.

 

1. Requirement Analysis

We began with a natural language requirement document that described the “right of way violator” scenario class. This document detailed:

  • Functional objectives of the scenario
  • Expected behaviors of the EGO vehicle (e.g., slowing, stopping, maneuvering)
  • Actions of violating agents, such as a vehicle ignoring or violating stop/yield rules

Our team parsed this information to extract core scenario intents, actor roles and behaviors, and desired outcomes. This human-readable requirement formed the foundation for formal scenario modelling.

2. Abstract Scenario Formalization and Implementation with Foretify Developer and V-Suite Libraries

Using the OpenSCENARIO Domain-Specific Language (DSL), we translated the English-language requirements into a map-agnostic, formalized scenario description. This included:

  • Declarative agent roles (e.g., “violator”, “victim”, “other agents”)
  • Temporal relationships and constraints (e.g., “violator enters intersection 0.5s before EGO”)
  • Desired EGO reactions (e.g., “apply brakes with X deceleration if Time-to-Collision < threshold”)

Foretify V-Suites, a comprehensive library of pre-configured, reusable scenario components, gave us the baseline to jumpstart our scenario definition. Specifically:

  • Junction scenarios from the library served as a starting point
  • These base scenarios were modified and combined to reflect “right of way violator” situations
  • This reuse allowed us to accelerate development and ensure consistency with validated scenario patterns

To guarantee correctness and physical plausibility, we applied our domain model:

  • It provides a set of foundational constraints (e.g., traffic rules, geometry constraints, actor capabilities)
  • These constraints ensure that all generated scenario variants are both valid and realistic, regardless of the underlying map

This abstract representation allowed us to separate scenario logic from geographical layout, enabling flexible and scalable reuse across different maps.
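
As a rough illustration of that separation, here is a hypothetical parameterization in Python; it is not OpenSCENARIO DSL syntax or a Foretify API, just a sketch of keeping scenario logic independent of any concrete map.

```python
from dataclasses import dataclass

# Rough sketch of a map-agnostic "right-of-way violator" scenario description.
# This is a hypothetical parameterization for illustration, not OpenSCENARIO DSL syntax.

@dataclass
class RightOfWayViolation:
    violator_entry_offset_s: float    # how long before the EGO the violator enters the junction
    ego_ttc_brake_threshold_s: float  # EGO applies brakes below this time-to-collision
    junction_type: str                # resolved against a concrete map only at generation time

    def bind_to_map(self, junction_id: str) -> dict:
        """Bind the abstract scenario to one concrete junction on a given map."""
        return {
            "junction": junction_id,
            "violator_entry_offset_s": self.violator_entry_offset_s,
            "ego_ttc_brake_threshold_s": self.ego_ttc_brake_threshold_s,
        }

abstract = RightOfWayViolation(0.5, 2.0, "four_way_yield")
print(abstract.bind_to_map("town03_junction_12"))
```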

Our team then used Foretify Developer tools to implement the scenario. One of the key enablers for this task was Foretify’s controllable driver models. We defined a configurable EGO behavior model that “does the right thing” to avoid collisions (e.g., slowing down or yielding). In contrast, the violating actors were configured to ignore junction rules (e.g., blowing past a yield or stop sign). This dual-driver setup ensured functional fidelity and expressiveness in scenario execution.


3. Runtime Execution with CARLA and Foretify

During runtime, the scenario execution was integrated with CARLA, the open-source autonomous vehicle simulator, to add vehicle dynamics to the co-simulation loop. The architecture included:

  • Foretify orchestrating scenario execution during the co-simulation runtime
  • CARLA simulating vehicle physics, road traction, inertia, and fine-grained movement

This hybrid runtime setup preserved intent, while embedding the richness of realistic dynamics and physical constraints. It ensured that the synthetic sensor data matched real-world driving behaviors under the specified scenario logic.

4. Constraint Solving and Large-Scale Generation

We leveraged Foretellix’s constraint solver technology to generate 6,000+ valid scenario instances that preserved the core intent:

  1. Violator agent always challenges right of way
  2. EGO must always be forced to evaluate and act under time pressure
  3. Other traffic conditions, map elements (differing junction layouts), and timing are randomized within defined bounds

This ensured diversity without compromising the semantic consistency of the scenario class.
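
As a toy illustration of the idea (constrained random sampling with an intent check), with made-up bounds; the actual constraint solver is of course far more capable:

```python
import random

# Toy illustration of constrained random generation: sample parameters within bounds,
# keep only combinations that preserve the scenario intent. The real constraint solver
# is far more capable; the bounds and the intent check here are made up.

def sample_instance(rng):
    return {
        "violator_speed_kph": rng.uniform(20, 70),
        "entry_offset_s": rng.uniform(0.2, 1.5),     # violator enters before the EGO
        "junction_layout": rng.choice(["t_junction", "four_way", "offset_four_way"]),
    }

def preserves_intent(inst):
    # EGO must be forced to evaluate and act under time pressure
    return inst["entry_offset_s"] < 1.2 and inst["violator_speed_kph"] > 25

rng = random.Random(42)
instances = []
while len(instances) < 100:
    candidate = sample_instance(rng)
    if preserves_intent(candidate):
        instances.append(candidate)
print(len(instances), "valid instances")
```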


5. Diversity Analysis and Run Selection

Using our intuitive big data analytics platform, we conducted an in-depth analysis of the generated scenarios:

  • Examined distributions across key metrics (e.g., violator speed, EGO reaction time, collision rate)
  • Ensured broad coverage across corner cases and edge conditions
  • Selected the most representative and diverse runs that optimally meet the scenario intent as the basis for imitation learning

This data-driven approach allowed us to avoid overfitting while maximizing generalization in downstream models.
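
One simple way to think about the run-selection step is greedy farthest-point selection over normalized metrics; this is only an illustration of the idea, not the analytics platform’s actual selection logic, and the metrics are made up.

```python
# Toy illustration of selecting a diverse subset of runs: greedy farthest-point
# selection over two normalized metrics (both assumed to be scaled to [0, 1]).

def distance(a, b):
    return ((a["violator_speed"] - b["violator_speed"]) ** 2 +
            (a["ego_reaction_time"] - b["ego_reaction_time"]) ** 2) ** 0.5

def select_diverse(runs, k):
    selected = [runs[0]]
    while len(selected) < k and len(selected) < len(runs):
        # pick the run farthest from everything already selected
        best = max(runs, key=lambda r: min(distance(r, s) for s in selected))
        selected.append(best)
    return selected

runs = [{"violator_speed": v, "ego_reaction_time": t}
        for v, t in [(0.1, 0.2), (0.15, 0.25), (0.9, 0.8), (0.5, 0.1), (0.85, 0.75)]]
print(select_diverse(runs, 3))
```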

6. Sensor Simulation with NVIDIA Cosmos

Selected runs were processed through Cosmos Transfer, a multi-control world foundation model (WFM), to generate hyper-realistic, physically based sensor simulation scenarios. We used prompt upsampling techniques to expand the dataset across (a toy expansion sketch follows the list):

  • Weather conditions (e.g., fog, rain, glare)
  • Geographic locations (e.g., urban grids, suburban roads, highway ramps)
  • Lighting variations (e.g., dusk, dawn, night-time)
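
As a toy sketch of that combinatorial expansion: it only shows how one base description can be fanned out across the listed dimensions, and is not the actual Cosmos Transfer prompt-upsampling interface.

```python
import itertools

# Toy illustration of expanding one base prompt across weather, location and lighting
# variations. This only shows the combinatorial expansion; it is not the actual
# Cosmos Transfer prompt-upsampling interface.

BASE = "a vehicle violating right of way at a junction while the ego vehicle yields"
WEATHER = ["dense fog", "heavy rain", "low-sun glare"]
LOCATION = ["urban grid", "suburban road", "highway ramp"]
LIGHTING = ["dusk", "dawn", "night-time"]

prompts = [f"{BASE}, {weather}, {location}, {lighting}"
           for weather, location, lighting in itertools.product(WEATHER, LOCATION, LIGHTING)]
print(len(prompts))   # 27 prompt variants from one selected run
```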


With this process, we have demonstrated how the Foretify development toolchain can generate the high-fidelity “right-of-way violator” sensor simulation scenarios required for training an end-to-end, AI-powered AV stack, with a solution that is both scalable and controllable.