New insights into cancer are needed to help improve care and prevention. This requires broad and rich data, for instance to develop machine-learning models that can evaluate treatment outcomes. However, bringing such data together in a traditional way could compromise privacy. In the LANCELOT project, TNO, IKNL and Janssen collaborate to develop open-source multi-party computation technologies to train logistic regression models on sensitive distributed data in a privacy-preserving way.
The healthcare sector is increasingly making use of the possibilities offered by machine learning and other advanced data analysis techniques. However, these techniques typically require large amounts of data, and access to medical data is understandably heavily regulated and limited by legislation and privacy concerns. A relevant example is the treatment of non-small cell lung cancer: given sufficient data, machine learning techniques such as logistic regression can predict certain outcomes of the disease, for instance whether a patient will survive beyond a given time, depending on various patient and tumour characteristics. However, relevant data is typically scattered across several institutions, and gathering it in one place often poses serious challenges: privacy concerns and regulations might prevent such gathering altogether, especially since the added value of this type of analysis cannot be assessed in advance, which makes it difficult to justify the proportionality of the data collection.
Privacy-enhancing technologies, or PETs for short, are playing an increasingly important role in addressing this conflict. These technologies make it possible to reconcile privacy with the use of data, through advanced mathematical and cryptographic techniques.
This blog post reports on one such initiative, the LANCELOT project. LANCELOT is a collaboration between TNO, the Netherlands Comprehensive Cancer Organisation (IKNL) and the pharmaceutical company Janssen, and studies the feasibility and added value of a PET solution to predict how the medical situation of cancer patients will evolve. In this project we collaborate closely with the more fundamental project SELECTED, which is part of TNO's Appl.AI program and the Netherlands AI Coalition (NLAIC).
TNO has developed a proof of concept for outcome prediction of non-small cell lung cancer. The proof of concept works in two stages: first, records in different datasets are matched in a secure way; second, specific outcomes are predicted with logistic regression in a privacy-preserving way.
Secure Approximate Matching
In many real-world scenarios where organisations wish to perform some joint data analysis, the underlying data is vertically partitioned: different organisations hold different types of data for each individual. Ensuring that the pieces of information belonging to the same individual are correctly matched is therefore a typical challenge with vertically partitioned data. This aspect is often overlooked by proposed PET solutions, and is made worse by the fact that different data sources may use different identifiers to determine which person a given data point belongs to. In many countries, the Netherlands included, the use of national, uniquely identifying codes such as the social security number (known as BSN in the Netherlands) is heavily regulated, and most organisations do not have access to these identifiers. Therefore, other personal data such as name, date of birth, gender at birth and zip code or address are used.
The downside of this approach is that recording and maintaining this type of data is error prone, which makes it harder to find matching records across different organisations. This problem becomes particularly complex if privacy is an issue: in these cases, parties are typically not allowed to learn the identities of people in the overlap; this means that PET solutions have to operate automatically and “blindly”, without an operator that can visually access the data and identify possible mistakes based on common sense.
As part of the LANCELOT project, TNO has developed a protocol to address this issue, which we call secure approximate matching. This protocol consists of several components that can identify various types of errors affecting identifiers of the same person. For instance, one component identifies spelling mistakes that come from homophone names and surnames, for example when the surname “Janssen” is incorrectly recorded as “Jansen” (the two have the same pronunciation in Dutch). Another component identifies possible mistakes in the birth date arising from a small shift in day, month, or year, and detects errors in zip codes.
An important remark here is that these possible mistakes are not “highlighted” and made visible to personnel operating the system: instead, the system computes a mathematical measure expressing how “close” one identifier is to another (where identifiers are close if they are the same or if they differ by minimal errors as above), and then extracts the matched records. These matched records, however, are protected with a form of distributed encryption, and are thus not readable by the parties operating the system. The secure approximate matching library is published as an extension to the “secure inner join” library of TNO.
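To make the idea of a closeness measure more concrete, here is a minimal plaintext sketch in Python. It is not the TNO protocol and offers no privacy at all: it only illustrates how homophone surnames, small date shifts and zip-code errors could be folded into a single distance. The function names, the crude phonetic key, the penalty values and the threshold are all assumptions made for illustration.

```python
from datetime import date

MISMATCH = 3  # assumed penalty for a field that differs by more than a minimal error


def phonetic_key(name: str) -> str:
    """Crude stand-in for a phonetic encoding: lowercase, letters only, repeated letters collapsed."""
    key = []
    for c in name.lower():
        if c.isalpha() and (not key or key[-1] != c):
            key.append(c)
    return "".join(key)


def date_distance(a: date, b: date) -> int:
    """Penalty 1 per component (day, month, year) shifted by exactly one, MISMATCH for larger shifts."""
    penalty = 0
    for x, y in ((a.day, b.day), (a.month, b.month), (a.year, b.year)):
        shift = abs(x - y)
        if shift == 1:
            penalty += 1
        elif shift > 1:
            penalty += MISMATCH
    return penalty


def zip_distance(a: str, b: str) -> int:
    """Number of differing characters after removing spaces and upper-casing."""
    a, b = a.replace(" ", "").upper(), b.replace(" ", "").upper()
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def record_distance(r1: dict, r2: dict) -> int:
    """Total distance between two records: 0 means identical up to homophones."""
    same_sound = phonetic_key(r1["surname"]) == phonetic_key(r2["surname"])
    name_penalty = 0 if same_sound else MISMATCH
    return (name_penalty
            + date_distance(r1["birth_date"], r2["birth_date"])
            + zip_distance(r1["zip"], r2["zip"]))


def is_match(r1: dict, r2: dict, threshold: int = 1) -> bool:
    """Two records match when their distance does not exceed the (assumed) threshold."""
    return record_distance(r1, r2) <= threshold
```

In the actual protocol, a computation of this kind is carried out on protected data, so that neither the distances nor the matched identities become visible to the operators.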
Secure Logistic Regression
Once a securely distributed and complete dataset has been obtained, other techniques can be used to assess the evolution of a patient's medical condition in a privacy-preserving way. More precisely, existing frameworks for secure multi-party computation, or MPC for short, can be leveraged to this end. MPC frameworks allow several parties to evaluate a given function on distributed inputs in a secure way, meaning that no information on the inputs is revealed, and that only the output of the function is made available to the participating parties. In the LANCELOT project, we make use of the MPyC framework, developed by Eindhoven University of Technology, which is written in Python and can thus use the vast machine learning functionality that has been developed for this programming language.
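The core idea behind MPC can be illustrated with a toy example based on additive secret sharing. The sketch below does not use MPyC and is not how MPyC works internally; it simulates three parties in a single Python process, and the modulus and function names are assumptions chosen for illustration. Each input is split into random-looking shares, every party adds up the shares it holds, and only the combination of all partial sums reveals the result.

```python
import secrets

PRIME = 2**61 - 1  # assumed field modulus; any sufficiently large prime works for this toy


def share(value: int, n_parties: int = 3) -> list[int]:
    """Split value into n_parties additive shares modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares


def reconstruct(shares: list[int]) -> int:
    """Recombine all shares; any strict subset of them is uniformly random."""
    return sum(shares) % PRIME


def secure_sum(inputs: list[int], n_parties: int = 3) -> int:
    """Compute the sum of the inputs without any single party seeing an input in the clear."""
    # Every input owner distributes one share of its value to each party.
    all_shares = [share(v, n_parties) for v in inputs]
    # Each party locally adds the shares it received.
    partial_sums = [sum(s[p] for s in all_shares) % PRIME for p in range(n_parties)]
    # Only the final recombination reveals the output.
    return reconstruct(partial_sums)
```

Real frameworks extend this principle to multiplications, comparisons and fixed-point arithmetic, which is what makes training a model on shared data possible.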
TNO has used MPyC to create a privacy-preserving distributed protocol that trains a logistic regression model. More precisely, the protocol takes as input a distributed dataset containing patient data, obtained with secure approximate matching, in which several pathological and non-pathological features are recorded for each patient, including an indicator of the patient's survival at a given point in time; subsequently, the protocol “trains” the logistic regression model, meaning that it computes a function that predicts the survival indicator for new patients. No other information on the patient data is revealed as part of this process.
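For readers unfamiliar with what “training” involves, the following plaintext sketch fits a logistic regression model with plain batch gradient descent. The source does not state which solver the secure protocol uses, so the algorithm, the learning rate and the number of epochs here are assumptions for illustration; the secure version would perform comparable arithmetic on protected values instead of on readable numbers.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w: list[float], b: float, x: list[float]) -> float:
    """Predicted probability that the outcome indicator equals 1."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_logistic(X, y, lr: float = 0.1, epochs: int = 5000):
    """Fit weights w and intercept b by batch gradient descent on the mean log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi  # derivative of the log-loss w.r.t. the linear score
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b
```

Calling `train_logistic(X, y)` on a list of feature rows `X` and 0/1 labels `y` returns the learned parameters, which `predict_proba` then applies to a new patient's features.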
This secure logistic regression protocol has been made available as part of the “secure learning” library published by TNO. With this addition, the library has become an even more powerful tool: it can train different types of regression models (linear regression models are included alongside the logistic models already mentioned) and offers several customisable options for training, e.g. in terms of regularisation (L1/L2), class weights, and cross-validation.
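As a rough illustration of what regularisation and class weights mean for training, the sketch below evaluates a class-weighted log-loss with L1 and L2 penalty terms. This is a common textbook form; the exact objective and scaling conventions used by the library are not stated here, so the formula and all parameter names are assumptions.

```python
import math


def penalised_log_loss(w, b, X, y, l1=0.0, l2=0.0, class_weight=None):
    """Mean class-weighted log-loss plus l1 * ||w||_1 + 0.5 * l2 * ||w||_2^2."""
    class_weight = class_weight or {0: 1.0, 1: 1.0}
    total = 0.0
    for xi, yi in zip(X, y):
        z = sum(wj * xj for wj, xj in zip(w, xi)) + b
        # log(1 + exp(z)) computed stably; the log-loss equals softplus(z) - y * z.
        softplus = max(z, 0.0) + math.log1p(math.exp(-abs(z)))
        total += class_weight[yi] * (softplus - yi * z)
    penalty = l1 * sum(abs(wj) for wj in w) + 0.5 * l2 * sum(wj * wj for wj in w)
    return total / len(X) + penalty
```

Larger `l1` or `l2` values push the weights towards zero, while a larger weight for one class makes errors on that class count more heavily, which can help when one outcome is much rarer than the other.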
Current status and results
Both components described above have been tested within the project in a realistic scenario, involving all three participants in the project (TNO, IKNL and Janssen). The test thus involved three geographically separated machines, one per organisation, and provided useful insights into operational challenges and scalability in a relatively high-latency setting.
Notice that, in theory, only the two data parties need to participate in the protocol; however, the addition of a third party (TNO in this case), which does not supply any input and does not obtain the output of the computation, allows us to make use of highly efficient techniques. Both the secure approximate matching and the secure logistic regression assume that at least three parties are participating, with more parties supported.
The experiments showed good performance of the libraries: with one dataset containing 3 000 records and the other 30 000 records, it took roughly two and a half hours to execute both the matching and the training of the logistic model. While this is significantly longer than what a “plaintext” computation (with no privacy guarantees and no PETs) would require, it is still deemed suitable for the goal, especially given that the solution has not yet been optimised for performance.
Moreover, the experiments also validated the quality of the obtained predictions by comparing them with those of a model trained with “plaintext” techniques (using the state-of-the-art scikit-learn library) on the matched datasets. Once again, the experiment yielded satisfactory results, with minimal differences across all the metrics used (accuracy, precision, and recall of the model on test data).
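A comparison of this kind can be sketched as follows. The code computes accuracy, precision and recall for two sets of binary predictions on the same test labels and reports the absolute gap per metric. The function names are hypothetical and any data passed in would be made up; this is not the evaluation code or the results of the project.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision and recall for binary labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def metric_gaps(y_true, pred_secure, pred_plain):
    """Absolute difference per metric between the secure and the plaintext model."""
    secure = classification_metrics(y_true, pred_secure)
    plain = classification_metrics(y_true, pred_plain)
    return {name: abs(secure[name] - plain[name]) for name in secure}
```

Gaps close to zero on held-out test data indicate that the secure training procedure loses little or no predictive quality compared with the plaintext baseline.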
As a final remark, we stress the fact that the experiment only made use of artificial data, although sampled based on the distribution and insights obtained from real patient data.
The results achieved in the LANCELOT project represent a step forward in the development of PETs in the medical sector by TNO, in collaboration with partners in healthcare (IKNL and Janssen). By bridging the gap between academic research and the operational needs of healthcare organisations, the project helps to reconcile privacy with the potential of modern machine learning techniques. In a follow-up project called HERACLES, TNO will collaborate with IKNL, Janssen and 10 other partners in healthcare and technology to bring the solutions to a higher technology readiness level and to test them in pilots on real data.
LANCELOT is a public-private partnership project partially funded by the PPS-surcharge for Research and Innovation of the Dutch Ministry of Economic Affairs and Climate Policy. There is close collaboration with TNO's Appl.AI program and the Netherlands AI Coalition (NLAIC).