Accelerating Automated Driving System Deployment with Scalable, Data-Driven Evaluation

Introduction: The Critical Need for Data-Driven Evaluation of Autonomous Driving Systems

As advanced driver assistance systems (ADAS) take on more of the driving for us and fully autonomous vehicles such as Waymo robotaxis and Aurora self-driving trucks hit the streets without safety drivers, the question is no longer just ‘can we build them?’ but ‘can we trust them?’ Significant advances in AI have accelerated ADS development, but the black-box nature of these systems makes formal, human-interpretable performance and safety evaluation even more critical [1]. Given the massive volume of simulation and drive log data, a highly automated, scalable evaluation pipeline is essential for ensuring ADS safety. In addition, as fleet operators and logistics companies investigate how to optimize their businesses by leveraging Automated Driving Systems, they increasingly demand transparent metrics and KPIs before committing to deployment at scale. Waymo’s Safety Impact dashboard, along with their recent publication “Determining Absence of Unreasonable Risk: Approval Guidelines for an Automated Driving System Deployment”, provides key insight into why scalable, data-driven, and transparent safety and performance evaluation is now a make-or-break factor for ADS projects.

Key Challenges in ADS Evaluation Across Test Platforms

While evaluation pipelines have long guided ADS development, custom ‘home-grown’ solutions struggle with higher levels of autonomy such as SAE L3 and L4. Some of the main challenges include:

  • Volume and Variety of Test Data: A massive amount of data, produced by many different test platforms spanning SIL/HIL/VIL simulation, test track, and public road drive logs and extracted in a variety of formats, must be systematically aggregated, evaluated, and analyzed against safety and performance metrics and KPIs.

  • Engineering Efficiency and Scalability: Manually curating and harvesting interesting scenarios and events from millions of miles of drive logs, triaging issues, and performing scenario likelihood and criticality analysis requires an enormous engineering effort. In addition, teams working with different test platforms (for example, real-world drive logs versus synthetic simulations) are often unable to share evaluation metrics, KPIs, and analysis tools, which limits reuse, duplicates effort, and frequently results in inconsistent interpretations and implementations of the same metrics across platforms.

  • Developing the Required Evaluation Content: Creating the required ADS “evaluators” (KPIs, checks, and coverage metrics) in a reusable and extensible way takes significant engineering effort, and the resulting content is often not portable across test platforms. Moreover, if evaluators are not captured at a suitable level of formal abstraction, it can be hard for humans to interpret their intent (a minimal evaluator sketch follows this list).

  • Measuring Testing Completeness Within the ODD: Coverage metrics based on the requirements and risk dimensions of the ODD are lacking, even though they are needed to determine when testing is complete by aggregating, evaluating, and reporting test results across all test platforms.
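To make the portability problem concrete, here is a minimal, hypothetical sketch in Python of how a single safety evaluator, in this case a simple time-to-collision (TTC) check, could be defined once and applied to normalized logs no matter which test platform produced them. Every name here (DriveSample, min_ttc, evaluate_log, and the normalization helpers in the closing comment) is an illustrative assumption, not part of any Foretellix product or API.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Hypothetical, platform-neutral record: each test platform's logs are first
# normalized into this shape before any evaluator runs (names are illustrative).
@dataclass
class DriveSample:
    timestamp_s: float             # time since the start of the log
    ego_speed_mps: float           # ego vehicle speed
    lead_gap_m: float              # distance to the closest lead vehicle
    lead_closing_speed_mps: float  # positive when closing in on the lead

def min_ttc(samples: Iterable[DriveSample]) -> float:
    """Minimum time-to-collision over a log; defined once, reused everywhere."""
    ttcs = [
        s.lead_gap_m / s.lead_closing_speed_mps
        for s in samples
        if s.lead_closing_speed_mps > 0.1  # ignore opening or near-static gaps
    ]
    return min(ttcs) if ttcs else float("inf")

def evaluate_log(samples: List[DriveSample], ttc_threshold_s: float = 2.0) -> dict:
    """Return a pass/fail verdict plus the KPI value for aggregation dashboards."""
    worst_ttc = min_ttc(samples)
    return {"kpi_min_ttc_s": worst_ttc, "passed": worst_ttc >= ttc_threshold_s}

# The same evaluator runs on logs from any source once they are normalized, e.g.:
#   evaluate_log(normalize_simulation_log(sim_run))
#   evaluate_log(normalize_road_log(drive_record))
```

Because the metric is written once against a shared abstraction, simulation and road-test teams cannot drift into inconsistent implementations of the same KPI.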

The Opportunity for a New Approach – Foretify Evaluate

Foretellix addresses these challenges with Foretify Evaluate: a test-platform-agnostic, automated, scalable, and explainable evaluation framework that delivers actionable insights for technical and management teams.

Why Foretify Evaluate Stands Out

Foretify Data-Driven ADS Development Platform

In the dynamic realm of self-driving technology, every mile driven—real or simulated—brings new learning opportunities and fresh risks. Foretify Evaluate is purpose-built to unlock those insights:

  • Curated Scenarios from Real-World Drive Logs: Extract, annotate, and evaluate scenarios from vast real-world driving logs using a combination of AI and rule-based automation (a simplified rule-based sketch follows this list). This enables analysis of performance and safety metrics in the context of key scenarios while ensuring your testing reflects real-world complexity, not just theoretical models, so validation and testing efforts stay grounded in reality and target the most relevant and impactful situations.

  • Extensive Evaluator Library: Access an ever-expanding library of ready-to-use evaluators within the “Evaluation V-Suite”. This library gives you configurable evaluation content to assess a wide spectrum of AV behaviors, metrics, and KPIs — accelerating time to insight and deployment readiness.

  • Comprehensive Analysis, Real and Virtual: Whether your scenarios play out on bustling city streets or in synthetic simulations, Foretify Evaluate delivers structured, meaningful analytics of evaluation results. The same analysis tools apply to both real-world and synthetic data, and from detailed scenario analysis to aggregated metrics dashboards, the platform highlights performance or safety gaps and critical issues that might otherwise go undetected.

  • Unified ODD Coverage Metrics: Leverage OpenSCENARIO DSL coverage metrics to provide an objective measure of testing completeness within your target ODD. Aggregate and track test coverage across real-world and simulation test platforms, ensuring nothing falls through the cracks.

  • Focus Where It Matters: Foretify Evaluate provides advanced search and triage capabilities to prioritize the riskiest situations and the most significant issues, directing your engineering attention—and your resources—where they’ll have maximum impact.
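As a rough illustration of the rule-based side of that scenario curation (AI-based extraction is beyond a short snippet), the hedged Python sketch below tags cut-in events in a normalized drive log. The ActorFrame fields, the find_cut_ins helper, and the thresholds are illustrative assumptions, not Foretellix data structures or APIs.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical per-frame view of one surrounding vehicle relative to the ego.
@dataclass
class ActorFrame:
    timestamp_s: float
    actor_id: int
    lane_offset: int    # 0 = same lane as the ego, +/-1 = adjacent lanes
    gap_ahead_m: float  # longitudinal gap to the ego (positive = ahead of it)

def find_cut_ins(frames: List[ActorFrame], max_gap_m: float = 25.0) -> List[dict]:
    """Tag frames where an actor moves from an adjacent lane into the ego lane
    within a short gap ahead of the ego - a classic cut-in trigger."""
    events = []
    last_lane = {}  # actor_id -> lane_offset seen in the previous frame
    for f in sorted(frames, key=lambda fr: fr.timestamp_s):
        prev = last_lane.get(f.actor_id)
        if prev in (-1, 1) and f.lane_offset == 0 and 0.0 < f.gap_ahead_m <= max_gap_m:
            events.append({
                "scenario": "cut_in",
                "actor_id": f.actor_id,
                "timestamp_s": f.timestamp_s,
                "gap_ahead_m": f.gap_ahead_m,
            })
        last_lane[f.actor_id] = f.lane_offset
    return events
```

Each tagged event can then be scored against the relevant safety KPIs and fed into the same analytics used for synthetic scenarios.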

Once Foretify Evaluate shines a light on every gap in your ADS’s safety and performance, or in your ODD test coverage, Foretify Generate can be used to automatically generate targeted scenarios, close the loop on validation, and advance from insight to action, all in one platform.
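Closing those gaps starts with measuring coverage consistently. As a rough Python stand-in for the declarative coverage constructs that OpenSCENARIO DSL provides, the sketch below buckets evaluated scenarios (represented here as plain attribute dictionaries) into ODD bins and reports how much of the bin cross-product has been exercised, regardless of which platform produced each scenario. The dimensions, bin edges, and the coverage_report helper are illustrative assumptions, not a Foretellix or ASAM API.

```python
from collections import Counter
from itertools import product
from typing import List

# Illustrative ODD dimensions and bins; real definitions would be derived from
# the ODD requirements, e.g. expressed declaratively in OpenSCENARIO DSL.
ODD_BINS = {
    "weather": ["clear", "rain", "snow"],
    "speed_band": ["0-30", "30-60", "60-90", "90-130"],  # km/h
    "road_type": ["urban", "rural", "highway"],
}

def coverage_report(scenarios: List[dict]) -> dict:
    """Bucket evaluated scenarios (from any test platform) into ODD bins and
    report how much of the bin cross-product has been exercised."""
    hits = Counter(
        tuple(s.get(dim, "unknown") for dim in ODD_BINS) for s in scenarios
    )
    all_bins = list(product(*ODD_BINS.values()))
    covered = sum(1 for key in all_bins if hits.get(key, 0) > 0)
    return {
        "covered_bins": covered,
        "total_bins": len(all_bins),
        "coverage_pct": 100.0 * covered / len(all_bins),
    }

# Example: merge scenarios harvested from simulation runs and road testing.
# report = coverage_report(sim_scenarios + road_scenarios)
```

In this simplified view, the uncovered bins point directly at the conditions for which targeted scenarios still need to be generated and tested.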

What to Expect Next in This Blog Series

Stay tuned for follow-on blog posts that will provide technical deep dives into different facets of the Foretify Evaluate solution, giving you a first-hand look at how it delivers scalable, data-driven, and transparent safety and performance evaluation.

References

[1] For a deeper dive into the need for formal abstractions when evaluating AI-based Autonomous Driving Systems, see this recent blog post from Yoav Hollander, CTO of Foretellix.
