Learning from small datasets

Best-in-class AI algorithms depend on very large amounts of representative training data, in some cases up to 100 million items. All too often, this amount of data is simply not available, and training on small datasets can lead to unreliable outcomes. It is therefore important to develop algorithms that can cope with limited data.


We offer several methods for dealing effectively with small datasets. These include transfer learning, online learning, and the use of high-fidelity models to generate simulated data. All of these reduce the need for training data.
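As an illustration of the simulated-data idea, the sketch below adds samples generated by a model to a handful of real measurements before fitting. Everything in it is an assumption made for the example: the linear process, the `simulator` function standing in for a high-fidelity model, and the noise level are all hypothetical and not taken from any TNO system.

```python
import random

def simulator(x):
    """Hypothetical high-fidelity model of the process (slightly imperfect)."""
    return 2.05 * x + 0.95

def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

rng = random.Random(0)

# Four noisy measurements of the assumed true process y = 2x + 1.
real = [(x, 2.0 * x + 1.0 + rng.gauss(0, 1.0)) for x in (1.0, 2.0, 3.0, 4.0)]

# Simulated samples covering the full operating range 0..10.
simulated = [(x, simulator(x)) for x in (rng.uniform(0, 10) for _ in range(200))]

slope_real, intercept_real = fit_line(real)             # real data only
slope_aug, intercept_aug = fit_line(real + simulated)   # real + simulated
print(f"real only: slope={slope_real:.2f}, augmented: slope={slope_aug:.2f}")
```

The augmented fit is dominated by the simulator, so its quality depends on how faithful that model is. In practice the real samples would typically be weighted more heavily or used to correct the simulator's residual error.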

Addressing the challenges posed by small datasets

Modern machine-learning algorithms have millions of parameters and achieve high predictive accuracy when trained on large datasets. Unfortunately, they perform much worse when trained on small datasets, and often only small datasets are available. Obtaining sufficient data is difficult, time-consuming and expensive, and legal and ethical constraints can further limit the amount of data. For rare events, it might even be impossible to obtain sufficient data.

Running AI applications on small datasets involves reliability and performance risks, and can also introduce bias. This raises several challenges:

1. Developing effective algorithms with small datasets that are reliable, unbiased and safe.
2. Combining small datasets with existing model-based approaches.
3. Coping with the issues of missing data and unreliable and changing data sources.

Small and limited datasets are especially common in the application domains of AI in healthcare, smart operations and predictive maintenance, and autonomous vehicles.

“The learning from small and limited data sets technology allows us to leverage the benefits of current Artificial Intelligence developments without needing unaffordable large annotated effort.” Klamer Schutte, Lead Scientist

Learning from limited and small data sets - What does TNO offer?

  • We develop transfer learning that makes it possible to exploit already available but less representative data.
  • We develop active and online learning that capitalises on the availability of scarce domain expertise to annotate only essential samples.
  • We supplement small datasets by using existing high-fidelity models to generate simulated training data.
  • By integrating domain knowledge, model-based reasoning and machine learning, we reduce the need for training data.
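To make the active-learning point above concrete, here is a minimal sketch of uncertainty sampling on a one-dimensional toy problem. It is an illustration under stated assumptions, not TNO's method: the `oracle` function plays the role of the scarce domain expert, and `TRUE_BOUNDARY`, the pool size and the labelling budget are hypothetical values chosen for the example.

```python
import random

TRUE_BOUNDARY = 0.62  # hypothetical ground truth, unknown to the learner

def oracle(x):
    """Stand-in for a domain expert who labels a single sample on request."""
    return x > TRUE_BOUNDARY

def fit_threshold(labeled):
    """Midpoint between the largest negative and the smallest positive sample."""
    largest_negative = max(x for x, y in labeled if not y)
    smallest_positive = min(x for x, y in labeled if y)
    return (largest_negative + smallest_positive) / 2

def active_learning(pool, budget):
    pool = sorted(pool)
    # Seed with the two extreme samples, assumed to cover both classes.
    labeled = [(pool[0], oracle(pool[0])), (pool[-1], oracle(pool[-1]))]
    unlabeled = pool[1:-1]
    for _ in range(budget):
        if not unlabeled:
            break
        boundary = fit_threshold(labeled)
        # Uncertainty sampling: ask about the sample closest to the
        # current decision boundary, where the model is least certain.
        query = min(unlabeled, key=lambda u: abs(u - boundary))
        unlabeled.remove(query)
        labeled.append((query, oracle(query)))
    return fit_threshold(labeled), len(labeled)

rng = random.Random(0)
pool = [rng.random() for _ in range(1000)]
threshold, n_labels = active_learning(pool, budget=15)
print(f"estimated boundary {threshold:.3f} using {n_labels} of {len(pool)} labels")
```

In this toy setting the learner requests labels for at most 17 of the 1000 samples, because each query is placed where the current model is least certain. Real problems are higher-dimensional and noisier, so the savings are usually smaller than in this idealised case.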