Privacy Ethics
Language Models Producing Tabular Data May Reveal Numerical Information from Training Data
Large language models, which have recently begun to be used for generating synthetic tabular data, may inadvertently reveal numerical information contained in their training data. A preprint posted on arXiv demonstrates that, in popular implementations, the models repeat numerical sequences they have memorized instead of inventing entirely new data.
Synthetic versions of tabular data, such as customer records or measurement tables, are of interest to companies and public authorities because they allow analysis without revealing individuals' information. In practice, two ways of using large language models have become common: either fine-tuning a smaller model directly on the tabular data, or feeding example rows to a large model as part of the prompt.
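The prompting approach can be illustrated with a minimal sketch. The prompt wording, the `complete` callable, and the toy data below are assumptions made for illustration only, not the setup used in the paper:

```python
# Minimal sketch of the "feed example rows to a large model" approach described above.
# The prompt format and the `complete` callable are illustrative assumptions.
from typing import Callable


def build_prompt(header: str, example_rows: list[str], n_new: int) -> str:
    """Assemble a prompt that shows real rows and asks for new synthetic ones."""
    examples = "\n".join(example_rows)
    return (
        "Here are rows from a table:\n"
        f"{header}\n{examples}\n"
        f"Generate {n_new} new, realistic but fictional rows in the same format."
    )


def generate_synthetic_rows(complete: Callable[[str], str],
                            header: str,
                            example_rows: list[str],
                            n_new: int = 5) -> list[str]:
    """`complete` is any text-completion function (an LLM API call in practice)."""
    return complete(build_prompt(header, example_rows, n_new)).strip().splitlines()


if __name__ == "__main__":
    # Stub "model" that simply parrots the example rows back from the prompt --
    # precisely the memorization behaviour the article warns about.
    parrot = lambda prompt: "\n".join(prompt.splitlines()[2:4])
    print(generate_synthetic_rows(parrot, "id,balance", ["101,2543.77", "102,980.10"], 2))
```

The stub "model" in the example echoes rows it was shown verbatim, which is exactly the kind of repetition the study identifies as a leak.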
The authors show that in both approaches, models can "leak" strings originating from the training data, such as long numerical sequences. This is a privacy risk: if a synthetic table contains overly exact copies of the original data, an outsider may be able to infer whether a particular record was included in the model's training material.
To illustrate this, the researchers present a simple membership inference attack, which they name LevAtt. The attack assumes that the attacker has access only to the synthetic data produced by the model, not to the model itself. LevAtt specifically targets the numerical strings appearing in synthetic records and uses them to assess what the model may have memorized from the original data.
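As a rough illustration of the idea, the sketch below scores a candidate record by how closely its numeric strings are reproduced in the synthetic table. The use of Levenshtein (edit) distance is an assumption suggested by the attack's name; the paper's actual scoring rule may differ:

```python
# Hedged sketch of a LevAtt-style membership signal: the attacker sees only the
# synthetic table and checks how exactly a candidate record's numeric strings
# are echoed in it. Data and scoring rule are illustrative assumptions.
import re


def numeric_strings(row: str) -> list[str]:
    """Pull out digit sequences (IDs, account numbers, long measurements, ...)."""
    return re.findall(r"\d+", row)


def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def membership_score(candidate_row: str, synthetic_rows: list[str]) -> int:
    """Lower score = the candidate's numeric content is echoed more exactly."""
    targets = numeric_strings(candidate_row)
    synth = [s for row in synthetic_rows for s in numeric_strings(row)]
    if not targets or not synth:
        return -1
    return sum(min(levenshtein(t, s) for s in synth) for t in targets)


if __name__ == "__main__":
    synthetic = ["101,2543.77,FI2112345600000785", "204,13.50,FI2198765400000111"]
    print(membership_score("101,2543.77,FI2112345600000785", synthetic))  # 0: exact echo
    print(membership_score("990,7777.77,FI2155555500000999", synthetic))  # larger distance
```

A low score for a real record suggests the model reproduced its numbers nearly verbatim, which is the signal an attacker would exploit.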
The work emphasizes that even synthetic data is not automatically private if the underlying language model repeats numerical sequences it has learned directly from the training data.
Source: When Tables Leak: Attacking String Memorization in LLM-Based Tabular Data Generation, ArXiv (AI).
This text was generated with AI assistance and may contain errors. Please verify details from the original source.
Original research: When Tables Leak: Attacking String Memorization in LLM-Based Tabular Data Generation
Publisher: ArXiv (AI)
Authors: Joshua Ward, Bochao Gu, Chi-Hua Wang, Guang Cheng
December 28, 2025