The following are notes taken on lecture 11 of the Harvard CS229br seminar on machine learning theory, concerning privacy in machine learning. Thanks to Prof. Boaz Barak and his team for producing this seminar and sharing it freely.

Privacy

Background

Models are often trained on private data, in which case model outputs may be problematically compromising. Work by Latanya Sweeny demonstrated that simple, publicly-available demographic data can uniquely identify individuals, and managed to obtain the medical records of the governor of Massachusetts in the year $2000$. Narayanan and Shmatikov (2008) managed to de-anonymize movie-raters by cross-referencing Netflix and IMDb datasets. Carlini, Liu, Erlingsson, Kos and Song (2019) showed that large language models memorize private data and can be made to regurgitate sensitive information such as credit card and social security numbers.

With access to enough queries, one can partially reconstruct a model’s weights, and in turn partially reconstruct the training data. Carlini et al. (2021) demonstrated this weakness in GPT-2, and showed that as few as $10$ occurrences of a detail in the training data can lead to memorization. A 2018 VICE article showed that for languages where training data is sparse, Google Translate memorizes whatever resources are available, which often causes biblical phrases to surface in its translations.

Possible solutions come with tradeoffs:

  • Cryptographic methods can ensure privacy, but at a large cost to efficiency and with restrictions on how access to the information is controlled.
  • Differential privacy methods let us explicitly tune the privacy/utility tradeoff, but at present the achievable tradeoff is often not good enough to be desirable.
  • Heuristic approaches can get good privacy at minimal cost to utility, but offer no guarantees.

Cryptography

Private Key Encryption

In private key encryption, a key and a plaintext message are fed to an encryption algorithm to produce a cipher, which can be decrypted using the same key.

\[\begin{matrix} E: &\underbrace{ \{ 0,1 \}^{n} }_{ \text{Key of length }n } &\times &\underbrace{ \{ 0,1 \}^{l} }_{ \text{Message of length }l } &\to &\underbrace{ \{ 0,1 \}^{m} }_{ \text{Cipher of length }m } \\ D: &\underbrace{ \{ 0,1 \}^{n} }_{ \text{Key} } &\times &\underbrace{ \{ 0,1 \}^{m} }_{ \text{Cipher} } &\to &\underbrace{ \{ 0,1 \}^{l} }_{ \text{Message} } \end{matrix}\]

An encryption is correct if encrypting and decrypting a message with the same key retrieves the message:

\[\forall_{k}\forall_{x \in \{ 0,1 \}^{l}}, \; D_{k}(E_{k}(x)) = x\]

One-Time Pad

The one-time pad is a private key encryption algorithm, performed by simply combining the message and the key using modular addition (the XOR bitwise operation), which is reversible when the key is known:

\[E_{k}(x) = x \oplus k, \; D_{k}(y) = y \oplus k\]
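As a concrete illustration, here is a minimal Python sketch of the one-time pad over bytes (so the XOR is applied byte-wise rather than bit-wise); the function names are ours, and the key is sampled uniformly and must be as long as the message:

```python
import secrets

def keygen(length: int) -> bytes:
    # Sample a uniformly random key as long as the message (n = l).
    return secrets.token_bytes(length)

def otp_encrypt(key: bytes, message: bytes) -> bytes:
    # E_k(x) = x XOR k, applied byte by byte.
    return bytes(m ^ k for m, k in zip(message, key))

def otp_decrypt(key: bytes, cipher: bytes) -> bytes:
    # D_k(y) = y XOR k; XOR with the same key undoes the encryption.
    return otp_encrypt(key, cipher)

message = b"attack at dawn"
key = keygen(len(message))
cipher = otp_encrypt(key, message)
assert otp_decrypt(key, cipher) == message  # correctness: D_k(E_k(x)) = x
```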

The one-time pad scheme is perfectly secret, meaning that an adversary with access to the cipher and not the key cannot guess the message any better than by chance:

Perfect Secrecy

\[\forall\text{ Algorithm } A, \; \underset{\substack{x \sim \{ 0,1 \}^{l} \\k \sim \{ 0,1 \}^{n}} }{ \Pr }[A(E_{k}(x)) = x] \leq 2^{-l}\]

Shannon proved that the one-time pad achieves perfect secrecy with $n=l$: observe that for a fixed message $x$, its XOR with a uniformly random key $k$ is itself uniformly distributed, so the cipher $y$ carries no information about the message $x$. Hence for any algorithm $A$:

\[\underset{\substack{x \sim \{ 0,1 \}^{l} \\k \sim \{ 0,1 \}^{n}} }{ \Pr }[A(x \oplus k) = x] = \Pr[A(y) = x]\leq 2^{-l}\]

The one-time pad can be extended to messages over integers modulo some $t$, with the addition performed coordinate-wise mod $t$:

\[k,x \in \mathbb{Z}_{t}^{n}, \; E_{k}(x)= (x_{1}+k_{1} \ \mathrm{mod} \ t, \dots, x_{n} + k_{n} \ \mathrm{mod} \ t)\]

Note that the observation that the cipher is distributed uniformly only holds if we have at least as many possible keys as we have possible messages. This means the key also has to be sampled from the same range $\mathbb{Z}_{t}^{n}$.
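The same idea over $\mathbb{Z}_{t}^{n}$, as a short sketch (again with ad-hoc function names):

```python
import secrets

def otp_mod_encrypt(key, message, t):
    # E_k(x)_i = x_i + k_i (mod t)
    return [(x + k) % t for x, k in zip(message, key)]

def otp_mod_decrypt(key, cipher, t):
    # D_k(y)_i = y_i - k_i (mod t)
    return [(y - k) % t for y, k in zip(cipher, key)]

t = 256
message = [72, 101, 108, 108, 111]
key = [secrets.randbelow(t) for _ in message]  # uniform over Z_t, one symbol per coordinate
assert otp_mod_decrypt(key, otp_mod_encrypt(key, message, t), t) == message
```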

Shannon also proved that every perfectly-secret scheme requires $n\geq l$ (otherwise guessing the key would be easier than guessing the message). This entails that every message must have its own key, meaning that transmitting encrypted messages using such schemes is inefficient, and reusing keys is insecure.

The Venona Project

Historically, the reuse of keys allowed Gene Grabeel and her collaborators to decrypt Soviet communications and expose many Soviet operatives in the West. Modern encryption uses computational schemes in which keys can be shared between messages, making the key exponentially smaller than the volume of data it can encrypt.

Fully Homomorphic Encryption

FHE schemes have the property that, even without knowing the private key, known operations can be performed on the ciphers that yield ciphers of bitwise operations on the underlying messages:

\[\begin{align*} E_{k}(x) \times E_{k}(x') &\to E_{k}(x \ \mathrm{AND} \ x') \\ E_{k}(x) \times E_{k}(x') &\to E_{k}(x \ \mathrm{OR} \ x') \\ E_{k}(x) &\to E_{k}(\mathrm{NOT} \ x) \end{align*}\]

or equivalently,

\[E_{k}(x) \times E_{k}(x') \to E_{k}(x \ \mathrm{NAND} \ x')\]

which forms a universal basis for computation.
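To make the universality claim concrete, here is a small plain-Python check (on plaintext bits, with no encryption involved) that AND, OR and NOT can all be built from NAND alone; under FHE the same gate-by-gate decomposition would be applied to ciphers instead of bits:

```python
def NAND(a: int, b: int) -> int:
    return 1 - (a & b)

# AND, OR and NOT expressed purely in terms of NAND.
def NOT(a):
    return NAND(a, a)

def AND(a, b):
    return NAND(NAND(a, b), NAND(a, b))

def OR(a, b):
    return NAND(NAND(a, a), NAND(b, b))

for a in (0, 1):
    assert NOT(a) == 1 - a
    for b in (0, 1):
        assert AND(a, b) == (a & b)
        assert OR(a, b) == (a | b)
```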

Encryption in FHE is a randomized algorithm. Each bit is encrypted and decrypted using a key, and any two ciphers can be evaluated under NAND. Correctness of evaluation is defined approximately: the total variation distance between the NAND-evaluation of the two ciphers and a cipher of the NAND of the underlying messages is exponentially small in $n$.

\[\forall_{k}\forall_{b,b' \in \{ 0,1 \}}, \; \mathrm{NAND}(E_{k}(b),E_{k}(b')) \equiv E_{k}(\neg(b \wedge b'))\]

The computational secrecy property requires that each bit cannot be guessed better than chance by any algorithm running in time much smaller than exponential; as with correctness, the condition is approximate, requiring the advantage over one half to be exponentially small in $n$. For each bit:

Computational Secrecy

\[\forall \text{ Algorithm } A \text{ of time} \ll \exp(n), \; \underset{\substack{b \sim \{ 0,1 \} \\k \sim \{ 0,1 \}^{n}} }{ \Pr }[A(E_{k}(b)) = b] \leq \frac{1}{2}+\exp(-n)\]

This holds even if the adversary has seen exponentially many ciphers produced with the same key! In practice this amounts to perfect secrecy, at little cost.

FHE in Practice

FHE was proven to exist under reasonable assumptions by Gentry (2009). Later work reduced the required assumptions to the standard Learning With Errors assumption. Some implementations exist, although they are still slow.

FHE for Learning

FHE is promising for machine learning, and there is active research on this topic. Since NAND is a universal basis for computation, we can express any algorithm $A$ on data $x$ in terms of NAND operations, and using the properties of FHE, $A$ can also be performed on the cipher $E_{k}(x)$ to produce $E_{k}(A(x))$. If $A$ is a training algorithm, we can run $A$ on encrypted data to produce an encrypted model $E_{k}(A(x))$. The drawbacks of FHE are the difficulty of querying an encrypted model, as well as the computational overhead of this kind of training, which is currently a factor on the order of one million.

Differential Privacy

Another candidate direction for enforcing privacy in machine learning is differential privacy (For more see The Algorithmic Foundations of Differential Privacy by Cynthia Dwork and Aaron Roth).

For a dataset $\mathcal{X} = \{ x_{1},\dots,x_{i},\dots,x_{n} \}$, an algorithm $A: \mathcal{X} \to h$ is called $\epsilon$-differentially private if, after observing the output $h$, the posterior probability that $x_{i}$ was in $\mathcal{X}$ is only a factor of $e^{\pm \epsilon}$ away from the prior.

Differential Privacy

More formally, consider two datasets that differ in exactly one coordinate, $\forall \mathcal{X},\mathcal{X}' \; \text{s.t. } |\mathcal{X}\Delta \mathcal{X}'| = 1$ (where $\Delta$ stands for the symmetric difference). Then for any set of outputs $S$: \(\forall S, \; \Pr[A(\mathcal{X})\in S] \in e^{ \pm \epsilon }\Pr[A(\mathcal{X}') \in S]\)

where the probability is over the randomness in $A$, which must be randomized. Sometimes an additive factor of $\delta$ is introduced, but we’ll deal with the usual case where $\delta$ is negligible compared to $\epsilon$.

In a context of data privacy, where there might be fear that the inclusion of some datapoint might lead to an undesirable decision, differential privacy ensures that the effect of the inclusion of any datapoint on the probability of any decision is smaller than a factor of $e^{\epsilon}$.
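Since the definition is a condition on two output distributions, it can be checked directly for mechanisms with discrete outputs. The following sketch (the helper name and toy distributions are ours) verifies the multiplicative $e^{\pm\epsilon}$ condition:

```python
import math

def is_eps_dp(p, q, eps):
    """Check the eps-DP condition for two output distributions p and q
    (dicts mapping outputs to probabilities) of a mechanism A on
    neighbouring datasets X and X': every outcome's probability may
    change by at most a factor of e^eps in either direction."""
    outputs = set(p) | set(q)
    return all(
        p.get(h, 0.0) <= math.exp(eps) * q.get(h, 0.0)
        and q.get(h, 0.0) <= math.exp(eps) * p.get(h, 0.0)
        for h in outputs
    )

# Toy mechanism outputting 1 with probability 0.6 on X and 0.5 on X'.
# The worst-case ratio is max(0.6/0.5, 0.5/0.4) = 1.25, so eps must be at least ln(1.25).
assert is_eps_dp({1: 0.6, 0: 0.4}, {1: 0.5, 0: 0.5}, eps=math.log(1.25) + 1e-9)
assert not is_eps_dp({1: 0.6, 0: 0.4}, {1: 0.5, 0: 0.5}, eps=0.1)
```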

One might suggest that a weaker requirement, bounding the additive difference in probabilities, is enough to satisfy privacy needs. However, consider a worst case where the algorithm $A$ outputs some of its input data as-is:

\[A(\mathcal{X}) = \{ x_{i_{1}},\dots,x_{i_{k}} \},\text{ for random } i_{1},\dots,i_{k},k\ll n\]

While this algorithm is obviously not private, it does satisfy the additive definition. If $\mathcal{X},\mathcal{X}'$ differ in one coordinate, the probability that this coordinate is chosen for the leak, making the output distinguish the two datasets, is at most $\frac{k}{n}$, so $k$ can be chosen small enough to satisfy an additive definition of privacy:

\[|\Pr[A(\mathcal{X}) \in S] - \Pr[A(\mathcal{X}') \in S]| \leq \frac{k}{n}\]

Properties of Differential Privacy

1) $\epsilon$-differentially private algorithms can be composed while retaining privacy. If $A$ is $\epsilon$-DP and $A'$ is $\epsilon'$-DP, then $B(\mathcal{X}) = (A(\mathcal{X}),A'(\mathcal{X}))$ is $(\epsilon+\epsilon')$-DP. Proof: considering all outputs $h,h'$ for datasets differing in at most one coordinate, $|\mathcal{X}\Delta \mathcal{X}'| \leq 1$:

\[\Pr[A(\mathcal{X}),A'(\mathcal{X}) = (h,h')] \leq e^{ \epsilon }\Pr[A(\mathcal{X}')=h]\cdot e^{\epsilon'}\Pr[A'(\mathcal{X}')=h']\]

2) Differential privacy is also retained under post-processing. If $A$ is $\epsilon$-DP and $B(\mathcal{X})=f(A(\mathcal{X}))$, then $B$ is $\epsilon$-DP. Proof: considering all outputs $h$ for datasets differing in at most one coordinate, $|\mathcal{X}\Delta \mathcal{X}'| \leq 1$:

\[\begin{align*} \Pr[f(A(\mathcal{X})) = h] &= \sum_{h' \in f^{-1}(h)} \Pr[A(\mathcal{X}) = h'] \\ &\leq e^{ \epsilon }\sum_{h' \in f^{-1}(h)}\Pr[A(\mathcal{X}')=h'] \\ &= e^{\epsilon}\Pr[f(A(\mathcal{X}'))=h] \end{align*}\]

3) Differential privacy can guarantee that training mechanisms are not blatantly broken. Call a training mechanism $\mathcal{X} \to f_{w}$ broken if there exists some transformation of $f_{w}$ that outputs a datapoint $x \in \mathcal{X}$. If the mechanism is $(\epsilon,\delta)$-DP, then the probability that it is broken is bounded by $\frac{e^{\epsilon}}{N}+\delta$, where $\frac{1}{N}$ is the probability of guessing $x$ at random.

Differentially Private Statistics

Suppose we want to publish estimates of some functions summed over our dataset. We want the estimates to be as close to the true values as possible while making sure that we respect privacy.

\[\hat{f}_{1} \approx \sum_{x \in \mathcal{X}} f_{1}(x),\dots,\hat{f}_{k} \approx \sum_{x \in \mathcal{X}} f_{k}(x)\]

Why should we worry about privacy when publishing sums?

Suppose we have a dataset of COVID cases in Cambridge, Massachusetts. A sum statistic for example could be the number of cases corresponding to Harvard professors, or to people who own cats. It turns out that while innocuous on their own, multiple sum statistics can be used to infer private information. Take the following example publication:

  • There are 30 total cases in Cambridge.
  • There are 29 cases of people under the age of $70$.
  • There are 12 cases of people with preexisting liver diseases.
  • There are 11 cases of people under the age of $70$ with preexisting liver diseases.

From the first two, we can deduce that only one positive case in Cambridge is aged $70$ or older. From the other two, we can deduce that exactly one case aged $70$ or older has a liver disease, and it must be that same person. Anyone who knows the over-70 person who tested positive for COVID now also knows they have a liver disease. The same principle holds for larger datasets.
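The deduction is nothing more than arithmetic on the published sums:

```python
total          = 30  # all COVID cases in Cambridge
under_70       = 29  # cases under the age of 70
liver          = 12  # cases with preexisting liver disease
liver_under_70 = 11  # cases under 70 with preexisting liver disease

cases_70_plus = total - under_70        # 1: a single case aged 70 or over
liver_70_plus = liver - liver_under_70  # 1: one 70+ case has a liver disease,
                                        #    and it must be that same person
```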

We can protect privacy by adding noise calibrated to our desired $\epsilon$. This is known as the Laplace mechanism. We draw noise from the Laplace distribution, a symmetric exponential distribution parameterized by a constant $b$, such that the probability of $x$ decays exponentially in $\frac{|x|}{b}$. The parameter $b$ is therefore proportional to the standard deviation.

\[\begin{align*} \Pr[\mathrm{Lap}(b) = x] &= \frac{1}{2b}\exp\left( -\frac{|x|}{b} \right) \\ \sigma^{2} &= 2b^{2} \end{align*}\]

Assume our functions are binary, $f_{i}(x) \in \{0,1\}$ for $i \in [1,k]$, such as functions that select datapoints for inclusion in a count. The Laplace mechanism is then applied with a standard deviation proportional to $\frac{k}{\epsilon}$. In practice, noise of magnitude $\sigma \approx \sqrt{ n }$ is considered acceptable, since the natural sampling error in the data is of a similar order.

\[\hat{f}_{i} = \sum_{x \in \mathcal{X}}f_{i}(x)+\mathrm{Lap}\left( \frac{k}{\epsilon} \right)\]
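A minimal NumPy sketch of this mechanism for releasing $k$ counting queries; the dataset, selector functions and function name below are made up for illustration, and the per-query noise scale $\frac{k}{\epsilon}$ follows from the composition property:

```python
import numpy as np

def private_sums(X, fs, eps, rng=np.random.default_rng(0)):
    """Release len(fs) sum statistics over the dataset X with eps-DP in total.
    Each f in fs is {0,1}-valued, so each sum has sensitivity 1; by the
    composition property the per-query noise scale is k/eps."""
    k = len(fs)
    return [sum(f(x) for x in X) + rng.laplace(scale=k / eps) for f in fs]

# Hypothetical toy dataset of (age, has_liver_disease) records.
X = [(82, True), (45, False), (31, True), (67, False), (25, False)]
stats = private_sums(
    X,
    fs=[lambda x: x[0] >= 70, lambda x: x[1]],  # count of 70+, count with liver disease
    eps=1.0,
)
```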

The Laplace mechanism is $\epsilon$-differentially private. We prove this for a single function $f$; the general case follows from the composition property. We write $f(\mathcal{X})$ to denote the sum of $f(x)$ over the dataset $x \in \mathcal{X}$. Because $f$ is a selecting function, $|f(\mathcal{X})-f(\mathcal{X}')|\leq 1$.

\[\begin{align} \Pr[\hat{f}(\mathcal{X})=v] &= \Pr\left[f(\mathcal{X})+\mathrm{Lap}\left( \frac{1}{\epsilon} \right) = v\right] \\ &= \frac{\epsilon}{2}\exp(-\epsilon|v-f(\mathcal{X})|) \\ &\leq \frac{\epsilon}{2}\exp\left(-\epsilon(|v-f(\mathcal{X}')|-1)\right) \\ & = e^{\epsilon}\cdot \frac{\epsilon}{2}\exp\left(-\epsilon|v-f(\mathcal{X}')|\right) \\ &\boxed{= e^{ \epsilon }\cdot\Pr[\hat{f}(\mathcal{X}')=v]} \end{align}\]

Generalizing beyond sums, we can achieve $\epsilon$-DP for an estimator with standard deviation $\approx \frac{k}{\epsilon}$ of any $f:\mathcal{X}\to \mathbb{R}^{m}$, provided that the sensitivity of $f$ is at most $k$: $\|f(\mathcal{X})-f(\mathcal{X}')\|_{1}\leq k$ for all datasets that differ by a single element, $|\mathcal{X} \Delta \mathcal{X}'| = 1$.

Definition versus Mechanism

The Laplace mechanism is so popular that it is often treated as synonymous with differential privacy, but differential privacy is a definition of privacy, and the Laplace mechanism is just one approach to achieving it. Other noise distributions can be used, including Gaussian noise, usually yielding some $(\epsilon,\delta)$-DP. There are also mechanisms that do not involve the addition of noise at all.

DP for SGD

Abadi et al. demonstrate differential privacy in the context of deep learning by injecting noise into the SGD step. They also introduce gradient clipping, where $\nabla_{\mathcal{L}_{i}^{C}}(w)$ denotes the per-sample gradients of the loss $\mathcal{L}$ at weights $w$ on batch $i$, clipped to norm $C$.

\[w_{t+1} \leftarrow w_{t} - \eta \left[ \nabla_{\mathcal{L}_{i}^{C}}(w) + N(0,\sigma^{2}C^{2}I) \right]\]

They show that for constant $\epsilon,C$ one can achieve $(\epsilon,\delta)$-DP with constant noise scale $\sigma$, with the limitation that the number of SGD steps must be $\ll \left( \frac{n}{b} \right)^{2}$, where $n$ is the number of samples and $b$ is the batch size.
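A simplified PyTorch sketch of one such noisy step, in the spirit of Abadi et al.; per-example gradients are computed naively in a loop here, the function name and signature are ours, and a real implementation would also track the privacy budget with a moments accountant:

```python
import torch

def dp_sgd_step(model, loss_fn, batch_x, batch_y, lr, C, sigma):
    """One noisy SGD step: clip each per-example gradient to norm at most C,
    sum, add Gaussian noise N(0, sigma^2 C^2 I), average, and descend."""
    summed = [torch.zeros_like(p) for p in model.parameters()]
    for x, y in zip(batch_x, batch_y):
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        grads = [p.grad.detach().clone() for p in model.parameters()]
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        factor = torch.clamp(C / (norm + 1e-12), max=1.0)  # clip to norm <= C
        for s, g in zip(summed, grads):
            s.add_(factor * g)
    b = len(batch_x)
    with torch.no_grad():
        for p, s in zip(model.parameters(), summed):
            p -= lr * (s + sigma * C * torch.randn_like(s)) / b
```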

They achieved 97% accuracy on MNIST with a very large epsilon ($\epsilon=8$), which might not be considered private but in practice seems to prevent blatant privacy attacks. For CIFAR-10 they could only get to around 70% accuracy, demonstrating the cost to performance as more and more noise is added. They also show that even huge values of epsilon give some bounds on memorization attacks.

Some approaches other than noise injection have shown promise for better scaling: Papernot et al. trained an ensemble of teacher models on disjoint partitions of the private data, and let a public student model query a noisy aggregate of the teachers' predictions (PATE). Subsequent research has been published on scaling this architecture.

Heuristics

Heuristic methods can provide better privacy, with lower cost to accuracy and faster execution. The price is abandoning absolute guarantees.

InstaHide

Huang et al. (2020) introduced InstaHide, inspired by FHE and optimized for standard training procedures. The private data is mixed together with public data before training.

The authors do not provide a definition of the privacy condition the method satisfies, or a proof thereof, but they attempt to justify the privacy claim empirically. At the very least, one would hope that the encoding is robust against reconstruction attacks, i.e., that it is not possible to efficiently reconstruct the private information from the encoded information.

Mixup Augmentation

The intuition for InstaHide comes from Mixup data augmentation, where networks are trained on convex combinations of examples with different labels to encourage linear interpolation behavior, which has been shown to decrease memorization in the presence of corrupted labels.
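For reference, a minimal sketch of Mixup itself, with $\lambda$ drawn from a Beta distribution as in the original augmentation:

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=1.0, rng=np.random.default_rng(0)):
    """Mixup: train on a convex combination of two examples and of their
    one-hot labels; lambda is drawn from a Beta(alpha, alpha) distribution."""
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```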

The procedure takes two private (secret) examples $x_{1}^{s},x_{2}^{s}$ along with two public examples $x_{1}^{p},x_{2}^{p}$ and combines them linearly. Then, the sign of every pixel in the resulting image is flipped at random, similarly to the one-time pad. In this case, however, instead of a binary value, each pixel is a point in the range $[-1,+1]$.

\[\begin{align*} x' & = \lambda_{1}x_{1}^{s}+\lambda_{2}x_{2}^{s}+\lambda_{3}x_{1}^{p}+\lambda_{4}x_{2}^{p}, & \lambda_{1},\dots,\lambda_{4} \sim [0,1] \text{ s.t.} \sum_{i=1}^{4} \lambda_{i} = 1\\ \tilde{x} & = (x_{1}'k_{1},\dots,x_{n}'k_{n}), & k \sim \{ \pm 1 \}^{n} \end{align*}\]
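A rough sketch of the encoding step as described above; the exact sampling of the $\lambda$'s in the paper may differ (a Dirichlet draw is used here simply to get random weights summing to one), and the function name is ours:

```python
import numpy as np

def instahide_encode(x_priv, x_pub, rng=np.random.default_rng(0)):
    """Sketch of encoding one image: mix two private and two public images
    with random convex weights lambda_1..lambda_4, then flip the sign of
    every pixel independently (pixel values lie in [-1, 1])."""
    lams = rng.dirichlet(np.ones(4))                   # random weights summing to 1
    mixed = (lams[0] * x_priv[0] + lams[1] * x_priv[1]
             + lams[2] * x_pub[0] + lams[3] * x_pub[1])
    signs = rng.choice([-1.0, 1.0], size=mixed.shape)  # the "one-time pad" k ~ {+-1}^n
    return signs * mixed
```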

While the authors showed that this method does not degrade accuracy much, Carlini et al. (2020) showed that InstaHide is insecure. Their attack retrieved visually recognizable reconstructions of $100$ private images encoded into $5000$ “secure” images by InstaHide, in under an hour on a single machine.

The attackers used the fact that the random sign flipping, performed on floating-point values, preserves their absolute value, which retains a non-trivial amount of information, enough to be recognizable by eye.

Since every encoded image contains information about two private images, many encoded images are bound to share information from the same private image, in which case they will exhibit some agreement. Even with only absolute-value data, the attackers trained a neural network to leverage that agreement and classify, with high accuracy, whether a pair of encoded images share a private image in common. Clustering the encoded images accordingly, the average of each cluster approximately recovers the absolute value of the original private image.

The attackers could have stopped there, but they went on to reconstruct the encoding graph, i.e., the values of $\lambda$ that produced each encoded image. Each encoded image is then reduced to a noisy combination of two unknown images:

\[\tilde{x} = \mathrm{abs}(\lambda_{1}x_{i}+\lambda_{2}x_{j}+N)\]

The challenge answered by the attackers contained $5000$ encoded images constructed from only $100$ private images, which results in $5000n$ non-linear equations in $100n$ variables (where $n$ is the image size). Gradient descent was able to retrieve the images despite the non-linearities.

\[\arg \min_{X \in [-1,1]^{n \times t}} \|\mathrm{abs}(AX)-\tilde{X}\|^{2}\]
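A sketch of this recovery step in PyTorch, with shapes chosen so that $A$ is the (reconstructed) $5000\times 100$ mixing matrix and $X$ holds the $100$ unknown images as rows; Adam is used here as the gradient-based optimizer:

```python
import torch

def recover_images(A, X_tilde, n_steps=2000, lr=0.05):
    """Recover the private images from the encoded ones, given the reconstructed
    mixing matrix A (one row of lambdas per encoded image), by minimizing
    ||abs(A X) - X_tilde||^2 with gradient-based optimization."""
    t, n = A.shape[1], X_tilde.shape[1]
    X = torch.zeros(t, n, requires_grad=True)  # unknown private images, one per row
    opt = torch.optim.Adam([X], lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        loss = ((torch.abs(A @ X) - X_tilde) ** 2).sum()
        loss.backward()
        opt.step()
        with torch.no_grad():
            X.clamp_(-1.0, 1.0)                # keep pixel values in [-1, 1]
    return X.detach()
```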

Ambiguous Privacy

This example demonstrates the need for a definition of security that can be proven or tested, which might be more relaxed than that of differential privacy to allow for efficiency.

Why not avoid repeated use of private datapoints?

We must remember that data providers always have a maximum-privacy alternative: not sharing the data at all. If we increased the proportion of public data in the encoding, we would rapidly lose the benefit of our private data, whether due to noise or to distribution shift.

Black Box Recovery

Carlini et al. (2020) showed that even when access is limited to queries of a black-box model, the model's parameters can often be reconstructed.

Summary

Queries can allow an attacker to reconstruct a model; knowledge of the model can allow reconstruction of the training data; and knowledge of the training data, even if encoded, can allow reconstruction of the original private data.

Therefore, definitions of security need to be proposed and mechanisms to ensure them need to be proven. The state of the field today can be summarized in three approaches:

  • Cryptographic approaches can ensure privacy, at great cost to performance and efficiency.
  • Differential privacy can allow for a tuned trade-off between privacy and utility, currently with high cost.
  • Heuristic approaches can be the best of both worlds, but lack guarantees and are at risk of attacks that can break them completely.