How much of my dataset did you use? Quantitative Data Usage Inference in Machine Learning

Sep 20, 2024

Yao Tong

Equal contribution

Jiayuan Ye

Equal contribution

Sajjad Zarifzadeh

Reza Shokri



Abstract

How much of a given dataset was used to train a machine learning model? This is a critical question for data owners assessing the risk of unauthorized data usage and protecting their rights (United States Code, 1976). However, previous work mistakenly treats this as a binary problem (inferring whether all or none, or any or none, of the data was used), which is fragile when faced with real, non-binary data usage risks. To address this, we propose a fine-grained analysis called Dataset Usage Cardinality Inference (DUCI), which estimates the exact proportion of data used. Our algorithm, leveraging debiased membership guesses, matches the performance of the optimal MLE approach (with a maximum error <0.1) but with significantly lower (e.g., $300\times$ less) computational cost.
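To illustrate what "debiased membership guesses" could mean in practice, here is a minimal sketch, not the paper's implementation. It assumes that a membership inference attack has already produced one binary guess per record, and that the attack's true-positive rate and false-positive rate are known or estimated separately. The function name, the parameters `tpr` and `fpr`, and the clipping to [0, 1] are assumptions made for this example.

```python
def debiased_usage_estimate(guesses, tpr, fpr):
    """Estimate the fraction of a dataset used in training.

    guesses: iterable of 0/1 membership guesses, one per record (assumed input).
    tpr, fpr: assumed true/false positive rates of the membership attack.
    """
    guesses = list(guesses)
    if not guesses:
        raise ValueError("need at least one membership guess")
    if tpr <= fpr:
        raise ValueError("attack must be better than random (tpr > fpr)")

    # Raw fraction of records guessed as members. In expectation this is
    # p * tpr + (1 - p) * fpr, where p is the true usage proportion.
    raw = sum(guesses) / len(guesses)

    # Invert that relation to remove the bias introduced by attack errors.
    estimate = (raw - fpr) / (tpr - fpr)

    # Clip to a valid proportion (a choice made for this sketch).
    return min(1.0, max(0.0, estimate))


# Toy example: 1000 records, 400 actually used, attack with TPR 0.8 and FPR 0.1.
# Members yield 320 positive guesses; non-members yield 60 false positives.
toy_guesses = [1] * 320 + [0] * 80 + [1] * 60 + [0] * 540
toy_estimate = debiased_usage_estimate(toy_guesses, tpr=0.8, fpr=0.1)
```

In this toy case the raw fraction of positive guesses is 0.38, which understates the true proportion of 0.4. Correcting for the assumed error rates recovers it.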

Type

Conference paper

Publication

In International Conference on Learning Representations, 2025. [Oral Presentation (Top ∼1.5% among submissions)]

Last updated on Sep 20, 2024

Authors

Yao Tong (she/her)
