TL;DR: Trust in simulation is now a critical part of proving AV safety. This blog explores how Foretellix is creating a unified, measurable methodology to evaluate simulation fidelity and turn simulation from a black box into a defensible part of the validation process.
Autonomous vehicle safety depends on trust, both in the vehicle’s behavior and in the tools used to validate it. As simulation becomes the backbone of AV testing, the industry faces a critical challenge: proving that simulation results accurately reflect the real world. This blog explores Foretellix’s unified approach to making simulation trust measurable, explainable, and repeatable.
Simulation is essential for validating autonomous vehicles because real-world testing alone cannot cover the vast range of scenarios needed to demonstrate safety at scale. However, simulation is only useful if its results can be trusted.
That trust is hard to earn. By nature, simulation is a simplification of reality, built on models, assumptions, and abstractions. If we’re going to rely on it for safety decisions, we need to understand how well it reflects the real world and where it falls short.
Today, there is no standard method to measure this. Regulations require simulation to be “fit for purpose,” but they do not define how to evaluate realism, coverage, or fidelity. Consequently, most AV teams are left improvising. While some companies, such as Waymo through its published safety reports, share data and transparency frameworks, there is still no unified industry method for measuring simulation trust.
Foretellix is working to change that. We’re developing a structured methodology for simulation trustworthiness, grounded in measurable criteria and repeatable processes. The goal is to equip AV teams with a clear foundation for using simulation with confidence, grounded in realism, transparency, and measurable fidelity.
The Challenge of Proving Simulation Realism
Simulation plays a central role in AV development, but the industry still lacks a common framework for evaluating whether simulation results are reliable. While safety standards like ISO 26262 and UNECE regulations acknowledge the need for simulation tools to be “fit for purpose,” they stop short of defining how to assess that fitness, especially when it comes to simulating complex, dynamic traffic environments.
In practice, most teams resort to custom-built comparisons between simulated and real-world drives. These efforts often lack consistency and rely on fuzzy definitions of realism. One engineer might examine actor behavior, another might focus on system-level KPIs, and a third on sensor noise or scenario variety. Without shared terminology, metrics, or methods, it is difficult to know whether the simulation is good enough, or what “good enough” even means.
The result is uncertainty. Test engineers and developers don’t always know which gaps still exist in their testing. OEMs struggle to justify simulation-based validation to regulators or internal stakeholders. And teams are left questioning whether the insights they gain from simulation can be trusted in the real world.
Foretellix’s Vision for Simulation Trustworthiness
To address the ambiguity surrounding simulation credibility, Foretellix is developing a unified methodology that makes simulation trustworthiness measurable, explainable, and repeatable. Instead of treating simulation as a black box, this approach breaks it down into two distinct but interconnected areas: toolchain qualification and simulation fidelity. Together, these two pillars form the basis of Foretellix’s effort to make simulation trustworthiness not just a feeling, but a framework.
Toolchain Qualification
Before simulation results can be trusted, the tools generating and evaluating those results must be proven reliable. That’s the role of toolchain qualification: to ensure that every component of the internal Foretellix toolchain works correctly and is consistently validated through thorough testing.
This qualification process includes:
- Robust development practices: Including modular design, secure coding, and continuous integration to prevent defects early in the development lifecycle.
- Comprehensive QA processes: Combining unit tests, regression tests, and integration tests to validate both individual components and the toolchain as a whole.
- Integration testing: Ensuring that behavioral, physical, and evaluation components interact correctly across abstraction levels and real-world usage scenarios.
- Tool certification: Supporting formal assessments, such as ISO 26262 certification, to demonstrate alignment with industry safety and quality standards.
The primary objective of toolchain qualification is not to directly enhance simulation realism. Rather, it aims to establish confidence in the underlying infrastructure, ensuring that simulation results are not distorted or misreported by issues in the tools that produce, process, or evaluate them.
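To make the QA item above a little more concrete, here is a minimal sketch of what a single “golden value” regression test for one KPI computation could look like. The function names and fixture values are hypothetical illustrations, not Foretellix APIs or real test data.

```python
# Minimal regression-test sketch (pytest style) for one toolchain component.
# compute_min_ttc is a hypothetical KPI function used only for illustration.
import math

def compute_min_ttc(gaps_m, closing_speeds_mps):
    """Smallest time-to-collision over a log; infinite when never closing."""
    ttcs = [g / v for g, v in zip(gaps_m, closing_speeds_mps) if v > 0]
    return min(ttcs) if ttcs else math.inf

def test_min_ttc_matches_golden_value():
    # A tiny "golden" fixture: values checked once by hand and frozen,
    # so any later change in the KPI code is caught by the test suite.
    gaps = [30.0, 20.0, 12.0]
    closing = [2.0, 4.0, 6.0]
    assert math.isclose(compute_min_ttc(gaps, closing), 2.0, rel_tol=1e-9)
```

The value of such tests in a qualified toolchain is that an unintended change in tool behavior is caught in continuous integration before it can silently skew the downstream fidelity measurements described next.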
Simulation Fidelity
Simulation fidelity refers to the degree to which a simulation accurately replicates real-world driving behavior, conditions, and system performance. It’s a structured collection of evaluation categories, each targeting a different aspect of realism and trustworthiness.
Foretellix’s methodology distinguishes between system-level and sub-system-level fidelity. System-level fidelity assesses the end-to-end vehicle behavior in simulation, while sub-system-level fidelity evaluates the fidelity of an isolated simulation component. Both system- and sub-system-level simulation fidelity are further defined through the following dimensions.
1. System-Level Simulation Fidelity
1.1 System Performance Fidelity
Evaluates whether the autonomous system behaves similarly in simulation and in the real world under comparable conditions. This is typically assessed through statistical comparison of real-world drive logs and simulation logs, focusing on system responses and critical events.
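As an illustration of what such a statistical comparison might look like, the sketch below compares one KPI’s distribution across real and simulated braking events using a two-sample Kolmogorov-Smirnov test. The KPI choice, the sample values, and the significance threshold are assumptions made for the example, not Foretellix defaults.

```python
# Hedged sketch: compare one KPI's distribution between real-world drive
# logs and simulation logs with a two-sample Kolmogorov-Smirnov test.
import numpy as np
from scipy.stats import ks_2samp

def kpi_distribution_gap(real_kpi_values, sim_kpi_values, alpha=0.05):
    """Return the KS statistic and whether the two samples are
    statistically indistinguishable at the chosen significance level."""
    stat, p_value = ks_2samp(real_kpi_values, sim_kpi_values)
    return stat, p_value >= alpha

# Example: peak deceleration (m/s^2) per braking event (placeholder values).
real = np.array([2.1, 2.4, 3.0, 2.8, 2.2, 3.3, 2.9])
sim = np.array([2.0, 2.5, 3.1, 2.7, 2.3, 3.2, 2.8])
statistic, consistent = kpi_distribution_gap(real, sim)
print(f"KS statistic={statistic:.3f}, distributions consistent: {consistent}")
```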
1.2 Scenario Reproduction Fidelity
Measures how accurately a specific real-world scenario can be replayed in simulation; a minimal trajectory-comparison sketch follows the list below.
- Ego reproduction fidelity: How closely the ego vehicle’s trajectory, velocity, and other behaviors match between the real-world run and its re-simulation.
- Actor reproduction fidelity: How precisely other agents (e.g., vehicles, pedestrians) are recreated in simulation relative to the original scene.
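A common way to quantify reproduction fidelity is a displacement-error metric between time-aligned trajectories. The sketch below computes average and final displacement error (ADE/FDE); the assumption that both trajectories are already resampled onto the same timestamps, the sample values, and the function names are all for illustration only.

```python
# Hedged sketch of reproduction-fidelity metrics between a recorded
# trajectory and its re-simulated counterpart, assuming both are already
# resampled onto the same timestamps.
import numpy as np

def displacement_errors(real_xy, sim_xy):
    """Average and final displacement error (meters) between two
    time-aligned 2D trajectories of shape (T, 2)."""
    real_xy, sim_xy = np.asarray(real_xy), np.asarray(sim_xy)
    per_step = np.linalg.norm(real_xy - sim_xy, axis=1)
    return per_step.mean(), per_step[-1]

real_traj = [(0.0, 0.0), (5.0, 0.1), (10.0, 0.3), (15.0, 0.6)]
sim_traj = [(0.0, 0.0), (5.1, 0.0), (10.2, 0.2), (15.4, 0.5)]
ade, fde = displacement_errors(real_traj, sim_traj)
print(f"ADE={ade:.2f} m, FDE={fde:.2f} m")
```

The same metric can be applied per actor to score actor reproduction fidelity, with thresholds chosen to match how precise the replay needs to be for the question at hand.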
1.3 Scenario Realism
Assesses how natural and believable the simulation looks and feels, both from a technical and human perspective.
- Actor model realism: Do agents behave in ways consistent with real-world norms?
- Physical maneuver realism: Are motions physically plausible (e.g., no instantaneous sideways jumps, acceleration within vehicle performance limits)? A minimal automated check of this kind is sketched after this list.
- Behavioral maneuver realism: Do agents make reasonable decisions within the given context (e.g., yielding, overtaking, obeying traffic signals)?
- Scenario composition realism: Are the elements of the scene, such as the number of agents, placement, and diversity, consistent with actual road environments?
- Interaction realism: Do agents respond to each other in socially and contextually realistic ways?
- Event flow frequency and distribution realism: Do key events (e.g., cut-ins, near-misses) occur at rates similar to what’s observed in real-world data?
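For the physical maneuver realism item, one hedged example of an automated check is to scan an actor’s speed profile for accelerations or jerks beyond assumed vehicle limits. The limit values and sampling interval below are illustrative placeholders, not calibrated bounds.

```python
# Hedged sketch of a physical-plausibility check for an actor trajectory:
# flag accelerations or jerks outside assumed vehicle limits.
import numpy as np

MAX_ABS_ACCEL = 8.0   # m/s^2, assumed performance limit
MAX_ABS_JERK = 15.0   # m/s^3, assumed comfort/physics limit

def is_physically_plausible(speeds_mps, dt=0.1):
    """Check a sampled speed profile for physically impossible jumps."""
    speeds = np.asarray(speeds_mps, dtype=float)
    accel = np.diff(speeds) / dt
    jerk = np.diff(accel) / dt
    return bool(np.all(np.abs(accel) <= MAX_ABS_ACCEL)
                and np.all(np.abs(jerk) <= MAX_ABS_JERK))

print(is_physically_plausible([10.0, 10.2, 10.5, 10.9]))  # True
print(is_physically_plausible([10.0, 10.2, 25.0, 10.9]))  # False: teleport-like jump
```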
1.4 Scenario Coverage
Quantifies how comprehensively the simulation explores the Operational Design Domain (ODD) and performance space; a simple coverage-ratio sketch follows the list below.
- ODD coverage: Are we testing across the full range of relevant conditions (e.g., road types, curvatures, intersections)?
- Performance coverage: Are we exposing the system to sufficiently challenging or critical conditions to assess safety and robustness?
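One simple way to express ODD coverage is as the fraction of predefined ODD buckets reached by at least one executed scenario. The bucket dimensions below (road type, weather, lighting) are a deliberately tiny, assumed ODD model used only to show the mechanics.

```python
# Hedged sketch of an ODD-coverage ratio over a small, assumed ODD model.
from itertools import product

road_types = ["highway", "urban", "rural"]
weather = ["clear", "rain", "fog"]
lighting = ["day", "night"]

all_buckets = set(product(road_types, weather, lighting))

# Buckets touched by executed scenarios (e.g., extracted from annotated logs).
covered = {
    ("highway", "clear", "day"),
    ("highway", "rain", "day"),
    ("urban", "clear", "night"),
}

coverage = len(covered & all_buckets) / len(all_buckets)
missing = sorted(all_buckets - covered)
print(f"ODD coverage: {coverage:.0%}; {len(missing)} buckets still untested")
```

Real ODD models are far larger and usually weighted by exposure or risk, but the same hit-or-miss bookkeeping underlies the coverage questions listed above.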
2. Sub-System-Level Simulation Fidelity
2.1 Sensor Simulation Fidelity
Evaluates the realism of synthetic sensor outputs. This includes how accurately lidar, radar, and camera simulations reflect actual sensor limitations, noise, and environmental effects.
- Synthetic sensor data fidelity: How close is the simulated raw sensor data to real data in terms of resolution, distortion, and latency?
- Environmental realism: Are lighting, reflectivity, weather, and occlusions accurately modeled to influence sensor behavior?
- Repeatability & determinism: Is the sensor output deterministic under the same conditions, enabling reliable comparisons and regression testing? A minimal repeatability check is sketched below.
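A repeatability check can be as simple as running the same sensor simulation twice with a fixed seed and confirming the outputs agree within a tolerance. `simulate_lidar_frame` below is a hypothetical stand-in for a sensor model, not an actual Foretellix or simulator API.

```python
# Hedged sketch of a repeatability check: run the same sensor simulation
# twice with a fixed seed and verify the outputs match within a tolerance.
import numpy as np

def simulate_lidar_frame(scene_id, seed):
    """Placeholder sensor model: deterministic given (scene_id, seed)."""
    rng = np.random.default_rng(hash((scene_id, seed)) % (2**32))
    return rng.normal(loc=20.0, scale=0.05, size=1000)  # fake range returns (m)

def is_repeatable(scene_id, seed=0, tol=1e-9):
    frame_a = simulate_lidar_frame(scene_id, seed)
    frame_b = simulate_lidar_frame(scene_id, seed)
    return np.allclose(frame_a, frame_b, atol=tol)

print(is_repeatable("scene_042"))  # True: identical seeds give identical frames
```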
2.2 Vehicle Model Fidelity
Assesses the accuracy of the simulated vehicle’s physics and dynamics compared to a real vehicle. This includes:
- Kinematic and dynamic fidelity: How accurately do the simulated vehicle’s acceleration, braking, steering, and suspension match a real vehicle’s performance under various conditions? A stopping-distance comparison of this kind is sketched after this list.
- Tire-road contact model realism: Does the tire model accurately represent real-world tire behavior, including grip, slip, and the effects of surface conditions and temperature?
- Powertrain realism: Does the simulated powertrain (e.g., engine type and transmission) accurately replicate the performance and characteristics of a real vehicle’s powertrain?
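A hedged example of a kinematic-fidelity comparison: measure stopping distance at a few initial speeds on a test track, run the same maneuvers against the vehicle model, and report the relative error. All numbers below are made-up placeholders used only to show the shape of the comparison.

```python
# Hedged sketch of a kinematic-fidelity comparison: relative error between
# measured and simulated stopping distances at a few initial speeds.
import numpy as np

initial_speeds_kph = np.array([30.0, 60.0, 100.0])
measured_stop_m = np.array([5.1, 19.8, 54.0])    # placeholder track measurements
simulated_stop_m = np.array([4.9, 20.5, 56.2])   # placeholder vehicle-model results

rel_error = np.abs(simulated_stop_m - measured_stop_m) / measured_stop_m
for v, err in zip(initial_speeds_kph, rel_error):
    print(f"{v:>5.0f} km/h: stopping-distance error {err:.1%}")
print(f"Worst-case error: {rel_error.max():.1%}")
```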
The Data Backbone for Measuring Simulation Credibility
To evaluate simulation fidelity in a meaningful way, you need comparable data across both simulated and real-world domains. Foretellix addresses this through a centralized database that brings together enriched simulation logs, drive logs, and re-simulations, providing a unified foundation for credibility assessment.
This infrastructure includes:
- Ingestion of drive logs: Real-world recordings from the AV system, which serve as the reference point for evaluating simulated performance and behavior.
- Generation of simulation logs: Logs generated from Foretellix’s scenario-based simulation runs, including both abstract and concrete test cases.
- Re-simulation via Smart Replay: Specific real-world events are reconstructed in simulation. These replays provide a one-to-one comparison between observed and simulated behavior, especially for ego and actor trajectories.
- Enrichment and annotation with Foretify Evaluate: Both drive and simulation logs are processed through Foretify Evaluate to add critical context, such as scenario labels, KPI calculations, safety metrics, and coverage classifications.
Once unified, this annotated dataset enables both trajectory-based comparisons (for reproduction fidelity) and statistical analyses (for system-level and scenario-level fidelity). The result is a structured, traceable, and scalable way to assess whether simulated behavior aligns with the real world, and where deviations or uncertainties remain.
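To make this concrete, the sketch below shows the kind of record such a unified, annotated database might hold, and how paired real and simulated runs of the same scenario type could be pulled from it. The schema, field names, and KPI keys are illustrative assumptions, not the actual Foretify data model.

```python
# Hedged sketch of a unified, annotated log record and a simple pairing step.
from dataclasses import dataclass, field
from collections import defaultdict

@dataclass
class EnrichedLog:
    log_id: str
    source: str                  # "drive", "simulation", or "resimulation"
    scenario_label: str          # e.g., "cut_in", "unprotected_left_turn"
    kpis: dict = field(default_factory=dict)

def pair_by_scenario(logs):
    """Group logs by scenario label so real and simulated runs of the
    same scenario type can be compared side by side."""
    groups = defaultdict(lambda: {"drive": [], "simulation": [], "resimulation": []})
    for log in logs:
        groups[log.scenario_label][log.source].append(log)
    return groups

logs = [
    EnrichedLog("d-001", "drive", "cut_in", {"min_ttc_s": 1.8}),
    EnrichedLog("s-101", "simulation", "cut_in", {"min_ttc_s": 1.7}),
    EnrichedLog("r-201", "resimulation", "cut_in", {"min_ttc_s": 1.8}),
]
for label, by_source in pair_by_scenario(logs).items():
    print(label, {src: len(items) for src, items in by_source.items()})
```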
From Measurement to Confidence
The goal of Foretellix’s approach is not just to describe realism, but to make it actionable. By measuring simulation fidelity across well-defined dimensions, backing those measurements with data, and providing insightful visualizations, AV teams can begin to answer critical questions with clarity:
- Where are the gaps in our current test coverage?
- How much uncertainty is present in our simulation results?
- Which deviations matter, and what’s causing them?
- When can simulation-based findings be trusted to reflect real-world outcomes?
This framework supports both debugging (e.g., identifying why a simulated behavior diverges from its real-world counterpart) and strategic validation (e.g., determining whether enough of the operational domain has been explored to release a system safely).
Ultimately, this shift from intuition-driven evaluation to structured, data-backed assessment lays the foundation for simulation to become a defensible, auditable, and trusted part of the AV safety case.