Secure and private statistics with distributed Paillier

6 January 2022

In the last ten years, we have seen a staggering rise in the amount of organizations that collect our personal data. Often, the intentions for collecting such data are benign: examples include tailoring services to our personal needs, research using medical data, and combating poverty. However, it’s not hard to come up with disadvantages to this data collection — cybersecurity risks, privacy violations by curious workers, or ads that follow you around the internet come to mind.

To cope with this complex situation, governments have started to intervene.
For instance, in Europe, we have seen the introduction of the General Data Protection Regulation (GDPR), which describes how personal data must be safeguarded and handled by the organizations that collect it. Still, this provides no clear solution to organizations that would like to do research on data they (or someone else) collected. One possible solution to this problem is found in cryptography, and is called homomorphic encryption. This is a kind of cryptography that still allows you to perform certain computations on encrypted data, such as a statistical analysis.

In a recent project, we have used the homomorphic Paillier cryptosystem to do such statistical analyses. We were approached by the Zuyderland hospital and the health insurance company CZ, who wanted to use a ‘health coach’, an app that patients with IBS could install on their phones to help them manage their symptoms. CZ would like to reimburse costs of this app, but this is only possible if it provably works for all groups of patients. However, Zuyderland could not just share all of its data on doctor visits — such data is highly private. Moreover, they wanted to make additional inferences using data from Statistics Netherlands (CBS), which has additional (legal) requirements to adhere to.

With the help of a TKI grant, the Netherlands Organisation for Applied Scientific Research (TNO) developed a pilot platform that made these statistical inferences possible using homomorphic encryption. This approach was found by an external legal team to be an ideal example of privacy by design as demanded by the GDPR, and we were able to use the platform on real patient data.

In this blog post, we shed some light on a particular aspect of our solution: distributed key generation and decryption. We used this novel variation on the Paillier system to offer the required flexibility to our partners with respect to whom we trust to hold the key. After a quick recap of the ‘regular’ Paillier cryptosystem, we will show you how it works and how it performed for us.

Homomorphic encryption with Paillier

Homomorphic encryption is a kind of cryptography that has a very useful property: it is possible to perform certain computations using encrypted data, without having to decrypt it. For instance, this allows me to send some sensitive data to you in encrypted form, you could compute something using this data and perhaps add some sensitive data of your own, and then send the result back to me. Then, I can decrypt the result and view it, but I don’t know what exact computations you did to get that result. Meanwhile, you know what computations you did, but not the result. This allows people to, for instance, perform certain kinds of statistics on sensitive data without revealing that data to one another.

The homomorphic encryption system we will use in the remainder of this post is the Paillier cryptosystem. To refresh your memory we summarize the main ingredients:

To generate a Paillier keypair, choose two large prime numbers p and q. These are typically chosen such that, for instance, both have 1024 bits.
The public key consists of n ≔ pq and a random element g ∈ (ℤ/n²ℤ)*, usually g ≔ n + 1 (which we will adhere to in this post). The corresponding private key is then the tuple (λ, μ) where¹:
λ := lcm(p—1, q—1),
μ := (L(g^λ mod n²))^-1 mod n,
L(x) := (x—1)/n.
To encrypt a message m ∈ (ℤ/nℤ)*, the sender chooses a random element r ∈ (ℤ/nℤ)*, and computes the ciphertext c ≔ g^m rⁿ mod n².
The recipient then decrypts the ciphertext by computing m = L(c^λ mod n²) ⋅ μ mod n.

The homomorphic properties of the Paillier cryptosystem make it very useful for secure multi-party computations: if you encrypt two messages m₁ and m₂, and multiply the ciphertexts together, the result will decrypt to m₁ + m₂. Moreover, the ciphertexts are randomized: if you encrypt the same number twice the two ciphertexts will be different, only the person with the private key can see that they contain the same thing.

The main constraint of this system is that one single party must be chosen to hold the private key for each computation. Sometimes you can choose to give it to a party who won’t be involved in the computation until it is finished, but that’s not always possible. To be more flexible, the Paillier keys can be generated by a group, so everyone can encrypt data just as easily, but the group needs to cooperate to be able to decrypt anything.

Distributed key generation

We give a high-level overview of the key generation algorithm, leaving out some details; these can be found in the publication on distributed Paillier key generation (pdf) itself. Suppose you have k parties, let’s say k = 3. These will cooperate to execute a key generation protocol, after which everyone will have the same public Paillier key, but there will be no private key in the usual sense. Rather, all three of you will have a part of the private key.

The first task is to come up with n. This must still be the product of two primes, but any participant who knows p or q can also decrypt messages alone, which we don’t want. Therefore, we have to come up with a bi-prime n without allowing a single participant to know p and q.

We do this by using Shamir’s secret sharing: each participant i generates a term pᵢ of p (such that p = p_₁ + … + p_ₖ) and a term qᵢ of q, which they then secret-share with all participants. The chance that p and q end up being prime can be increased by setting all but one participant’s terms to 0 mod 4, and the remaining participant’s terms to 3 mod 4. Each participant then adds their shares, so the participants collectively share p and q. The shares are then multiplied², so the participants end up with a secret-shared n.

After recombining the shares and reconstructing n, they check that n is actually bi-prime, restarting the protocol if it is not. After all, p and q are just random numbers! This test can be done without actually discovering p and q, and will quickly tell you if n is probably bi-prime.

These restarts are the main source of delays in the key generation process, and if you want ‘probably’ to give you any kind of certainty, you might have to restart thousands of times. We were able to reduce the impact of these restarts by running many rounds of the key generation protocol in parallel, so each party’s infrastructure could help with the biprimality checks, and as little time as possible was lost waiting for messages to be sent through the network.

Next, the participants collaborate to compute a kind of secret-shared private key. That means that an encrypted message can be decrypted only if each holder of such a ‘piece of the private key’ collaborates. Whereas normally, decryption means you know the private key λ, our λ should not be known to any single participant. However, we can use a trick: if we choose λ = (p−1) ⋅ (q−1) = n − p − q + 1, all of our parties already know a term of λ: say, the first participant knows λ₁ = n + 1 − p₁ − q₁, and the others know λᵢ = −pᵢ − qᵢ for i > 1.

In the key derivation, we blind λ multiplicatively with a random number β, which is likewise composed from terms β_i chosen randomly by the participants. Then, a secret sharing of λβ is computed by the participants just like they did with pq, which they use to reconstruct θ = k!² λβ mod n. This θ takes the role that μ played in the ‘normal’ Paillier cryptosystem.

Collaborative decryption

The same trick as with regular decryption applies: for each g such that g = 1 mod n, it can be shown that L(g^m mod n²) / L(g mod n²) = m mod n, so it is possible to compute the discrete log quickly for this particular subgroup.³ In this case, each participant i has a share h(i) of λβ, and computes c^h⁽ⁱ⁾, which can be recombined into c^k^{! λ β} in much the same way as for normal secret shares. Then, any participant with the secret key can compute L(c^k^{!² λ β)} mod n²) θ⁻¹ = m mod n to reconstruct the message.

Of course, the participants can also designate just one party to compute the decryption by sending their ‘share’ of the decryption to only that participant. This allows that participant to perform, for instance, extra statistical checks to make sure that no private information is unduly revealed.

So does it work?

Though, yes, distributed key generation and decryption is great to have in theory, you’d be wise to ask if this ‘probabilistic key generation’ works at all in practice. It’s not very useful after all if your customers have to spend weeks just to compute their keys.

For this pilot project, we used the Paillier cryptosystem with distributed key generation with three organizations in the Netherlands. Each of them used their own IT infrastructure to run our software, so there was some network latency between each round of the key generation protocol as well. To be on the safe side, we decided to work with 2048-bits Paillier keys (i.e. ²log p ≈ 2048 ≈ ²log q, so ²log n ≈ 4096).

In this environment, key generation turned out to take about 3 hours on average, and was always finished within the same working day. Since keys are very long-lived, this was very acceptable for our purposes.

In the end, the partners involved were able to run all of their analyses on our pilot platform. This was very exciting — it was the first time in the Netherlands that real people’s data was used for a mathematically secure and private statistical analysis in this way. The success of this pilot made TNO decide to assist in founding a new TNO spin-off company called Linksight. They will bring this innovation to the market and develop it further, and hopefully help many more organizations upgrade the privacy of their data-driven analyses.

The TNO-MPC GitHub page hosts publicly available libraries used for this project. The Paillier cryptographic library, the distributed key generation algorithm and the Shamir secret sharing library are all available, so you can try it out yourself!

This post also appeared on medium.

[1] In the case that g = n + 1, this can be simplified to μ = λ, taking into account that g^x = (n + 1)^x = nx + 1 mod n². We use this same trick during the distributed decryption with θ later on.

[2] Share multiplication is possible under Shamir’s secret sharing, but a factor k! appears, which must be divided out. We recommend the paper for the details, which we omit here in the interest of space.

[3] To understand that this still works for our ciphertexts of the form g^m rⁿ, note that if you raise the ciphertext to the power λ, the random factor r^nλ disappears modulo n². This is because Euler’s totient function φ(n²) = nλ, and by Euler’s theorem r^φ(z) = 1 mod z if r and z are relatively prime.

To encrypt a message m ∈ (ℤ/nℤ)*, the sender chooses a random element r ∈ (ℤ/nℤ)*, and computes the ciphertext c ≔ g^m rⁿ mod n².
The recipient then decrypts the ciphertext by computing m = L(c^λ mod n²) ⋅ μ mod n.