How do you measure something that keeps changing? The challenge of evaluating generative AI

Thema:
Artificial intelligence
12 February 2026

In the early days, we mainly used AI to list facts or summarise text. Today, AI systems are taking on increasingly important roles in our lives and work. This raises a crucial question for organisations adopting these tools: how do you evaluate whether your AI system is doing what it’s supposed to do? The nature of language models makes this question far more complex than it is for traditional software. TNO is developing practical measurement tools together with organisations to help them get a grip on AI performance.

The AI that always has an answer

‘Imagine a municipality using an AI chatbot on its website, and a resident asks about the town hall’s opening hours,’ says Marianne Witte-Schaaphok, responsible AI consultant at TNO, describing a simple example. ‘If the chatbot doesn’t actually have that information, there’s still a good chance it will give an answer anyway, because the underlying language model simply invents one.’

This illustrates one of the fundamental differences between AI language models and traditional software. A database shows an error message when information is missing. A language model, however, simply outputs a plausible-sounding answer: one that might be correct, but could also be entirely wrong. This makes evaluating AI tools both crucial and highly complex.

How do you evaluate a black box you don’t understand?

‘The models are so large that we can’t really reason through how they arrive at an answer,’ Marianne explains. The town hall example may seem simple, but it shows the complexity of AI evaluation very clearly. The models are difficult to predict, and their results are not required to adhere to sources. This already makes evaluating correctness a major challenge. ‘If you also want to evaluate biases or discrimination, the question becomes even more complex,’ says Marianne.

The 6 obstacles to reliable evaluation

Together with her colleagues, Marianne identified six crucial challenges organisations face when evaluating their AI tools:

Even developers cannot precisely explain how their models work. They are too large and too complex to fully understand. You can only evaluate the output, not the reasoning behind it.

Ask the same question twice and you may get two different answers. Language models generate text by predicting the next word, which naturally introduces variation. This inconsistency complicates robust evaluation.

An LLM is not a calculator. Its answers may be summaries, explanations, or descriptions — and several different formulations may all be considered correct. Creating a dataset of questions with a single “right” answer is therefore difficult. How do you handle that in an evaluation?

Because of the complexity of AI, humans play a large role in the evaluation. But what counts as a good answer becomes more complicated when you consider that one person may find an answer acceptable while another does not. ‘Five experts can have five different opinions on what a ‘good’ answer is,’ says Marianne. This human subjectivity makes objective evaluation even harder.

People who follow AI news likely see regular updates about how a new model scores on numerous benchmarks. But what do these scores actually tell us? Dataset quality often falls short, it’s not always clear what the benchmarks measure. Not to mention that many are focused on the American context.

As models grow more capable, benchmarks also struggle to keep up. ‘Some models already score 100% simply because the benchmarks no longer challenge them,’ Marianne warns. In some cases, the model may even have encountered benchmark questions during training, making the test outdated.

There are many different benchmarks. Model producers often choose the benchmarks themselves and use the results as clever marketing. The context for which a model is intended is often ignored. In other words: a model with a low score on a certain benchmark might still be perfectly suitable for a specific task. ‘Because benchmarks vary so much in use and purpose, it’s difficult for organisations to judge what a model can actually do,’ Marianne notes.

Responsible AI that works

TNO is building responsible AI that works: systems that are not only effective but also ethical. Responsible AI empowers people, increases societal impact, and keeps control with the user. European values such as privacy and security make AI more usable and reliable, and ensure broad acceptance in society.

A repeatable evaluation toolkit for every context

TNO is working with organisations on an evaluation toolkit they can use to perform robust, context-appropriate evaluations of their AI systems. The goal is to build a testing “pipeline” with reliable methods that can be applied at the right moments. There is no single test or method that always works. The toolkit also helps organisations determine which evaluation method to use when.

You might use quantitative evaluation with indicators and benchmarks, but you can also conduct human evaluations of the output. ‘In many public-sector applications, this will be necessary because the context is so important,’ Marianne explains. Additionally, you can use language models to evaluate other language models. ‘You actually need all methods, depending on the context,’ Marianne concludes.

‘Once you have the right data, indicators, and benchmarks, you also need your tests to be consistent and repeatable,’ she says. ‘Only then can you keep monitoring performance and reliably compare different situations, such as new model versions.’

The 5 properties every evaluation should measure

A robust evaluation toolkit should look at five themes:

This is the core question: does it do what it’s supposed to do? It sounds simple, but language models make it complex. This also includes the broader question of whether the system truly contributes to the intended goal.

To what extent does the model avoid producing biased or discriminatory output? A universal standard is hard to create, since fairness is cultural and context-dependent. ‘We’ve created a Dutch bias benchmark to capture Dutch cultural biases,’ says Marianne. ‘It has been further improved through crowdsourcing.’ Such tools are essential, especially for public-sector AI systems.

How much energy and water does the model consume? ‘There’s been a lot of discussion about energy use for training models,’ says Marianne. ‘If AI becomes a standard part of daily workflows, this becomes an important factor.’

What was the model trained on? How does it generate answers? ‘Many models do not disclose their training data,’ Marianne notes. ‘This makes it difficult to trace certain outputs.’

Besides technical aspects, social, ethical, and organisational dimensions matter too. The balance between goals, values, resources, and risks is crucial. What is the impact of using the algorithm, and what should high-quality AI be allowed to cost for an organisation and for society?

Across all themes, consistency is key. ‘You need to compare apples to apples,’ says Marianne. Only with a reproducible approach can you truly measure whether a new model outperforms the previous one, or whether biases have actually been reduced.

More autonomy requires better evaluation

A resident arriving at a closed town hall because of invented opening hours will get over it quickly. But what if the consequences are more serious? ‘I recently read about a software developer whose AI coding tool deleted an entire database,’ Marianne says. The system didn’t hesitate; it simply executed the command. ‘Luckily, the database could be restored, but it’s easy to imagine how serious this could have been.’

‘When you give AI systems access to other systems, it becomes essential that they only do what they are allowed to do,’ Marianne warns. Think of systems with access to bank data, data management, or personal information. In such cases, rigorous evaluation is vital. And regulatory requirements are growing: the European AI Act obliges organisations to demonstrate how their models perform.

‘Context is everything,’ Marianne concludes. ‘A brainstorming tool you use with colleagues can have a lower bar than a tool that provides medical advice. Evaluation needs to scale with the impact.’

Building better evaluation together

TNO is eager to collaborate with new partners to further advance AI evaluation. Organisations facing complex challenges, such as in healthcare or security, where mistakes can have major consequences, are especially valuable collaborators.

Get in touch with TNO to explore how evaluation can be the key to successful AI implementation in your organisation.

Working together on responsible AI evaluation

TNO develops comprehensive tools to help organisations implement AI responsibly at every level. We are looking for partners to jointly develop practical instruments for:

  • Evaluating AI system performance - technical testing and quality assurance
  • Strengthening critical thinking for users - measurement tools and support functions for employees
  • Governance structures - frameworks for responsible AI policies within organisations

Together, we can ensure that generative AI becomes a force for positive transformation in your organisation.

Meet our expert

  • Marianne Witte-Schaaphok

    Responsible AI consultant at TNO

Get inspired

50 resultaten, getoond 1 t/m 5

Balancing skepticism and blind trust: critical thinking as the key to responsible and effective use of GenAI

Informatietype:
Insight
14 January 2026
TNO is working with major organisations to develop a ‘critical thinking toolbox’ that helps employees think critically while using GenAI.

From reactive to proactive: How organisations gain control over GenAI governance

Informatietype:
Insight
16 December 2025

How TNO is leading the drive towards sovereign, responsible Dutch AI

Informatietype:
Insight
23 October 2025

TNO’s Vision for Responsible AI That Works

Informatietype:
Article
10 October 2025

ObjectivEye: AI-assisted human recruitment

Informatietype:
Insight
5 September 2025