Simulating data

If you've used logp, you already understand how to simulate data. In both logp and simulate you define a distribution. In logp the output is an evaluation of a specific point (or points) in the distribution; in simulate you generate from the distribution.

We can simulate from joint distributions

from lace.examples import Animals

animals = Animals()

swims = animals.simulate(['swims'], n=10)

Output:

shape: (10, 1)
┌───────┐
│ swims │
│ ---   │
│ u32   │
╞═══════╡
│ 1     │
│ 0     │
│ 0     │
│ 0     │
│ ...   │
│ 0     │
│ 0     │
│ 0     │
│ 0     │
└───────┘

Or we can simulate from conditional distributions

swims = animals.simulate(['swims'], given={'flippers': 1}, n=10)

Output:

shape: (10, 1)
┌───────┐
│ swims │
│ ---   │
│ u32   │
╞═══════╡
│ 1     │
│ 1     │
│ 1     │
│ 1     │
│ ...   │
│ 1     │
│ 0     │
│ 1     │
│ 0     │
└───────┘

We can simulate multiple values

animals.simulate(
    ['swims', 'coastal', 'furry'],
    given={'flippers': 1},
    n=10
)

Output:

shape: (10, 3)
┌───────┬─────────┬───────┐
│ swims ┆ coastal ┆ furry │
│ ---   ┆ ---     ┆ ---   │
│ u32   ┆ u32     ┆ u32   │
╞═══════╪═════════╪═══════╡
│ 1     ┆ 1       ┆ 0     │
│ 0     ┆ 0       ┆ 1     │
│ 1     ┆ 1       ┆ 0     │
│ 1     ┆ 1       ┆ 0     │
│ ...   ┆ ...     ┆ ...   │
│ 1     ┆ 1       ┆ 0     │
│ 1     ┆ 1       ┆ 0     │
│ 1     ┆ 1       ┆ 1     │
│ 1     ┆ 1       ┆ 1     │
└───────┴─────────┴───────┘

If we want to create a debiased dataset we can do something like this: There are too many land animals in the animals dataset, we'd like an even representation of land and aquatic animals. All we need to do is simulate from the conditionals and concatenate the results.

import polars as pl

n = animals.n_rows

target_col = 'swims'
other_cols = [col for col in animals.columns if col != target_col]

land_animals = animals.simulate(
    other_cols,
    given={target_col: 0},
    n=n//2,
    include_given=True
)

aquatic_animals = animals.simulate(
    other_cols,
    given={target_col: 1},
    n=n//2,
    include_given=True
)

df = pl.concat([land_animals, aquatic_animals])

Output:

# polars df
shape: (50, 85)
┌───────┬───────┬──────┬───────┬─────┬──────────┬──────────┬──────────┬───────┐
│ black ┆ white ┆ blue ┆ brown ┆ ... ┆ solitary ┆ nestspot ┆ domestic ┆ swims │
│ ---   ┆ ---   ┆ ---  ┆ ---   ┆     ┆ ---      ┆ ---      ┆ ---      ┆ ---   │
│ u32   ┆ u32   ┆ u32  ┆ u32   ┆     ┆ u32      ┆ u32      ┆ u32      ┆ i64   │
╞═══════╪═══════╪══════╪═══════╪═════╪══════════╪══════════╪══════════╪═══════╡
│ 1     ┆ 0     ┆ 0    ┆ 1     ┆ ... ┆ 0        ┆ 0        ┆ 0        ┆ 0     │
│ 1     ┆ 0     ┆ 0    ┆ 1     ┆ ... ┆ 1        ┆ 1        ┆ 0        ┆ 0     │
│ 1     ┆ 0     ┆ 0    ┆ 1     ┆ ... ┆ 0        ┆ 0        ┆ 0        ┆ 0     │
│ 0     ┆ 1     ┆ 0    ┆ 0     ┆ ... ┆ 0        ┆ 0        ┆ 0        ┆ 0     │
│ ...   ┆ ...   ┆ ...  ┆ ...   ┆ ... ┆ ...      ┆ ...      ┆ ...      ┆ ...   │
│ 1     ┆ 1     ┆ 0    ┆ 1     ┆ ... ┆ 0        ┆ 1        ┆ 1        ┆ 1     │
│ 1     ┆ 1     ┆ 0    ┆ 1     ┆ ... ┆ 1        ┆ 0        ┆ 0        ┆ 1     │
│ 1     ┆ 1     ┆ 0    ┆ 1     ┆ ... ┆ 0        ┆ 0        ┆ 0        ┆ 1     │
│ 0     ┆ 0     ┆ 0    ┆ 0     ┆ ... ┆ 0        ┆ 0        ┆ 1        ┆ 1     │
└───────┴───────┴──────┴───────┴─────┴──────────┴──────────┴──────────┴───────┘

That's it! We introduced a new keyword argument, include_given, which includes the given conditions in the output so we don't have to add them back manually.

The draw method

The draw method is the in-table version of simulate. draw takes the row and column indices and produces values from the probability distribution describing that specific cell in the table.

otter_swims = animals.draw('otter', 'swims', n=5)

Output:

shape: (5,)
Series: 'swims' [u32]
[
	1
	1
	1
	1
	1
]

Evaluating simulated data

There are a number of ways to evaluate the quality of simulated (synthetic) data:

  • Overlay histograms of synthetic data over histograms of the real data for each variable.
  • Compare the correlation matrices emitted by the real and synthetic data.
  • Train a classifier to classify real and synthetic data. The better the synthetic data, the more difficult it will be for a classifier to identify synthetic data. Note that you must consider the precision of the data. Lace simulates full precision data. If the real data are rounded to a smaller number of decimal places, a classifier may pick up on that. To fix this, simply round the simulated data.
  • Train a model on synthetic data and compare its performance on a real-data test set against a model trained on real data. Close performance to the real-data-trained model indicates higher quality synthetic data.

If you are concerned about sensitive information leakage, you should also measure the similarity each synthetic record to each real record. Secure synthetic data should not contain records that are so so to the originals that they may reproduce sensitive information.