Asset Details

Dissertation

Differentially Private Synthetic Data

He, Yiyun

2025

Overview

Differentially private synthetic data provide a powerful mechanism to enable data analysis while protecting sensitive information about individuals. In this thesis, we present a highly effective algorithmic approach for generating $\varepsilon$-differentially private synthetic data in a bounded metric space with near-optimal utility guarantees under the 1-Wasserstein distance. In particular, for a dataset $X$ in the hypercube $[0, 1]^d$, our algorithm generates synthetic dataset $Y$ such that the expected 1-Wasserstein distance between the empirical measure of $X$ and $Y$ is $O((\varepsilon n)^{-1/d})$ for $d \ge 2$, and is $O(\log^2(\varepsilon n)(\varepsilon n)^{-1})$ for $d = 1$. The accuracy guarantee is optimal up to a constant factor for $d \ge 2$, and up to a logarithmic factor for $d = 1$. Our algorithm has a fast running time of $O(\varepsilon d n)$ for all $d \ge 1$ and demonstrates improved accuracy compared to former results for $d \ge 2$.

However, when the data lie in a high-dimensional space, the accuracy of the synthetic data suffers from the curse of dimensionality. We further propose a differentially private algorithm to generate low-dimensional synthetic data efficiently from a high-dimensional dataset with a utility guarantee with respect to the 1-Wasserstein distance. A key step of our algorithm is a private principal component analysis (PCA) procedure with a near-optimal accuracy bound that circumvents the curse of dimensionality. Unlike the standard perturbation analysis, our analysis of private PCA works without assuming the spectral gap for the covariance matrix. For the data lying on a $d'$-dimensional linear subspace, we successfully overcome the curse of high dimensionality and improve the accuracy to $O(n^{-1/d'})$.

We also consider the synthetic data generation with differential privacy under the online setting where data is continually released. For a data stream within the hypercube $[0, 1]^d$ and an infinite time horizon, we develop an online algorithm that generates a differentially private synthetic dataset at each time $t$. This algorithm achieves a near-optimal accuracy bound of $O(\log(t)\, t^{-1/d})$ for $d \ge 2$ and $O(\log^{4.5}(t)\, t^{-1})$ for $d = 1$ in the 1-Wasserstein distance. This result extends the previous work on the continual release model for counting queries to Lipschitz queries. Compared to the offline case, where the entire dataset is available at once, our approach requires only an extra polynomially logarithmic factor in the accuracy bound.
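For readers unfamiliar with the quantities in the overview, the following is a minimal illustrative sketch for the case d = 1. It is not the algorithm of the thesis: it swaps in a standard baseline (a Laplace-perturbed histogram) to produce ε-differentially private synthetic data on [0, 1], and then measures the 1-Wasserstein distance between the empirical measures of X and Y. All names (`laplace`, `private_histogram_synth`, `wasserstein1_1d`), the bin count, and the replace-one notion of neighboring datasets are assumptions made for illustration only.

```python
import random


def laplace(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is distributed as Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_histogram_synth(xs, eps, bins, rng):
    """Baseline eps-DP synthetic data on [0, 1] (NOT the thesis algorithm).

    Assumes every x lies in [0, 1] and that neighboring datasets differ by
    replacing one record, so the dataset size n is treated as public.
    """
    n = len(xs)
    counts = [0] * bins
    for x in xs:
        counts[min(int(x * bins), bins - 1)] += 1
    # Replacing one record changes at most two counts by 1 each, so the
    # histogram has L1 sensitivity 2 and Laplace(2 / eps) noise suffices.
    noisy = [max(c + laplace(2.0 / eps, rng), 0.0) for c in counts]
    if sum(noisy) == 0.0:
        noisy = [1.0] * bins  # degenerate case: fall back to uniform weights
    # Everything below is post-processing of the noisy counts, which does
    # not affect the privacy guarantee.
    idx = rng.choices(range(bins), weights=noisy, k=n)
    return [(i + rng.random()) / bins for i in idx]


def wasserstein1_1d(xs, ys):
    """Exact 1-Wasserstein distance between two equal-size empirical
    measures on the real line: the mean gap between sorted samples."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes samples of equal size")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


if __name__ == "__main__":
    rng = random.Random(0)
    X = [rng.betavariate(2, 5) for _ in range(2000)]
    Y = private_histogram_synth(X, eps=1.0, bins=32, rng=rng)
    print("W1(X, Y) =", wasserstein1_1d(X, Y))
```

The histogram baseline is chosen only because it is short and its privacy argument is elementary; it makes no claim to the accuracy rates stated in the overview, and the bin count here is an arbitrary illustrative choice.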

Publisher

ProQuest Dissertations & Theses

Subject

Information science / Mathematics / Mathematics education

ISBN

9798288800795