Traffic, working conditions, our food: everything is regulated for our health and safety and strictly supervised. But what about data, given that many decisions are now made on the basis of big data and data analyses? In many cases, it is far from clear whether the quality of the data is adequate and whether we can trust it blindly. It’s time for new ideas. And TNO is taking up the gauntlet.
“The time has come for this topic to be firmly placed on the social agenda”, says Freek Bomhof, a data expert at TNO, the Netherlands Organization for Applied Scientific Research. “We are already in talks with major parties such as Statistics Netherlands (CBS) and the standards institute NEN to share views and experiences. We would welcome other large data processors, within government agencies and the business community, joining this initiative. The reliability of data processing is too important an issue to ignore.” According to TNO, a data analysis authority could be an option. Similar to how the Netherlands Authority for Consumers & Markets (ACM) protects consumer interests, or the Dutch Data Protection Authority (AP) watches over our privacy, a comparable body could monitor the quality and reliability of data. “If we, as a society, make ourselves so dependent on data, we have to take a serious and structured approach to it.”
Attention to data quality
TNO initiated the ‘Making sense of big data’ programme a few years ago. The aim was to create value by combining data from various sources and enriching it. Attention was, and still is, paid both to technical matters such as the standards, protocols and systems needed for the correct exchange of data, and to the quality of the data itself. After all, flawed data analyses can lead to erroneous decisions with significant consequences. The first fatal accident with a self-driving Tesla serves as an example of this. Bomhof also refers to another case from the US, in which scientists thought they could make accurate predictions based on big data. Using the search terms people entered, Google Flu Trends promised to predict where and when a flu epidemic would break out in a large number of countries. Although the model initially seemed to work, its figures ultimately turned out to be far removed from reality.
Tracing and eliminating errors
“It shows how careful you have to be”, says Bomhof. “Because no matter how well thought-out, a minor error at the start of the data pipeline can gradually lead to completely incorrect outcomes. Therefore, we have developed an analysis method to trace and eliminate those errors. This produces a checklist that parties can use to control the quality of their data pipeline.” TNO experts carried out literature reviews and examined a considerable number of real-world cases. From these, they compiled a list of sources of uncertainty that can occur in the data pipeline. Equipped with this list, the experts interviewed those involved in around twenty cases. The interviews confirmed that the uncertainties the experts had identified did indeed occur in practice.
Big data and big lies
“We asked questions such as ‘Do you know for sure that the sensors you use for measuring are working flawlessly? Have you filtered the data properly? Is your data model still valid and correctly trained?’”, Bomhof continues. “It then turns out that many analyses are based on data which is by no means reliable. Yet a manager or city councillor makes decisions on the back of that analysis. We have also reviewed the visualisation of data analyses, because you have big data and big lies. As the saying goes: lies, damned lies, and statistics. What do you present to whom and how? I can present the same results in graphs in different ways and suggest an outcome that the recipient wishes to hear.”
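Questions like those about sensors and filtering can, in part, be turned into automated checks. The following is a minimal sketch in Python, not TNO's actual checklist: the function name `run_checklist`, the three checks, the thresholds and the temperature readings are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class CheckResult:
    question: str
    passed: bool


def run_checklist(
    readings: Sequence[Optional[float]],
    low: float,
    high: float,
    max_missing: float = 0.05,
) -> list[CheckResult]:
    """Run a few illustrative quality checks on raw sensor readings."""
    present = [r for r in readings if r is not None]
    # Fraction of readings that are missing (None); an empty series counts as fully missing.
    missing_fraction = 1 - len(present) / len(readings) if readings else 1.0
    # Every available reading must fall inside the physically plausible range.
    in_range = all(low <= r <= high for r in present)
    # A sensor that only ever reports one value is often stuck.
    not_stuck = len(set(present)) > 1
    return [
        CheckResult("Are enough readings present?", missing_fraction <= max_missing),
        CheckResult("Are all readings within the plausible range?", in_range),
        CheckResult("Does the sensor output vary at all?", not_stuck),
    ]


# Hypothetical temperature readings in degrees Celsius; None marks a gap.
readings = [18.2, 18.4, None, 18.9, 250.0, 19.1]
results = run_checklist(readings, low=-40.0, high=60.0, max_missing=0.2)
for result in results:
    print("PASS" if result.passed else "FAIL", result.question)
```

In this made-up series the implausible value of 250.0 makes the range check fail, while the other two checks pass. Checks of this kind only cover the measurable part of the questions; whether a model is still valid and correctly trained needs separate evaluation.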
Assessing the reliability of data
According to Bomhof, something can go wrong at every step of the pipeline. And with the exponential growth in the volume of data, the probability of errors increases correspondingly. After putting a large number of cases through a scientific ‘grilling’, TNO researchers are now refining methods to test the reliability of data models. They are also developing a concept to eliminate uncertainties from data models as early as the design stage. The first results were recently presented to experts at an international congress. “The reactions were very positive. Apparently nobody has approached the topic in this way before. Who knows, perhaps we’ll soon set an international standard to be able to assess the reliability of data.”
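Why small risks at every step add up can be illustrated with a short calculation. This sketch assumes, purely for illustration, that each pipeline step fails independently and with a known probability; the 2% figure and the function name `prob_any_error` are hypothetical and do not come from TNO's research.

```python
import math


def prob_any_error(step_error_probs):
    """Probability that at least one step introduces an error.

    Assumes the steps fail independently of one another, which is a
    simplification; real pipeline errors are often correlated.
    """
    # The pipeline is error-free only if every single step is error-free.
    return 1 - math.prod(1 - p for p in step_error_probs)


# Hypothetical pipeline: each step has a 2% chance of introducing an error.
five_steps = prob_any_error([0.02] * 5)
ten_steps = prob_any_error([0.02] * 10)
print(f"5 steps:  {five_steps:.3f}")
print(f"10 steps: {ten_steps:.3f}")
```

Under these assumptions a five-step pipeline has roughly a 10% chance of containing at least one error, and doubling the number of steps nearly doubles that chance.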
Spotlight on two cases
Two of the twenty cases referred to above are HERMESS and the municipality of Rotterdam. TNO is currently working with them on developing reliable data models.
Case 1: critical assessment of innovative ETA model
HERMESS is a company that develops innovative solutions for ports and the offshore sector. Its cooperation with TNO aims to predict the estimated time of arrival (ETA) of container ships more accurately. Director Charles Calkoen: “More and more public, but also commercial, data is becoming available, with which we can build useful applications. We use information that ships measure to make the whole underlying logistics chain more efficient. Shipping companies, terminals, hauliers, and port authorities are just some of the parties that benefit from this. We recently completed the prototype for an innovative ETA model, which TNO experts are now critically assessing. This enables us to optimise the system and get it up and running more quickly. Knowing a precise arrival time is essential for the entire chain: it prevents waiting times, increases efficiency, and cuts costs.”
Case 2: hybrid model for improving youth policy
TNO is working with the municipality of Rotterdam to shed more light on the factors that play a role in youth development. Denis Wiering, the municipal Youth Policy programme manager: “As we already have extensive knowledge about the protective and risk factors for our youth policy, that knowledge will be combined with TNO’s big data analysis method. In other words, this will be a hybrid model to both improve policy and make its implementation more effective. We already know many of the factors that play a role in our youth growing up in a promising, safe, and healthy environment and have built a model for this purpose with scientists. For example, in the case of school absenteeism as a predictor of early school leaving, it is important to know how certain factors should be weighted and whether we could intervene at an earlier stage. So we are actually combining two approaches to form a hybrid model: the current state of science and the new kind of analysis that is becoming possible with big-data techniques. Better models help us to support our youth in growing up in a promising, safe, and healthy environment.”