Documentation: User guide | Rust API | Python API
Installation: Rust | Python | CLI
Contents: Problem | Quick start | License


Lace is a probabilistic cross-categorization engine written in rust with an optional interface to python. Unlike traditional machine learning methods, which learn some function mapping inputs to outputs, Lace learns a joint probability distribution over your dataset, which enables users to...

  • predict or compute likelihoods of any number of features conditioned on any number of other features
  • identify, quantify, and attribute uncertainty from variance in the data, epistemic uncertainty in the model, and missing features
  • determine which variables are predictive of which others
  • determine which records/rows are similar to which others on the whole or given a specific context
  • simulate and manipulate synthetic data
  • work natively with missing data and make inferences about missingness (missing not-at-random)
  • work with continuous and categorical data natively, without transformation
  • identify anomalies, errors, and inconsistencies within the data
  • edit, backfill, and append data without retraining

and more, all in one place, without any explicit model building.
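To make "conditioning on any features" concrete, here is a minimal sketch using a tiny hand-written joint distribution over two categorical features. The feature names, the probabilities, and the `condition` helper are all made up for illustration; Lace learns its joint distribution from data and exposes it through its own API rather than through a lookup table like this.

```python
from collections import defaultdict

# Hypothetical joint distribution p(habitat, swims); the numbers are invented
joint = {
    ("water", True): 0.30,
    ("water", False): 0.02,
    ("land", True): 0.08,
    ("land", False): 0.60,
}

def condition(joint, given_index, given_value):
    """Return the distribution of the other feature given one observed feature."""
    other = 1 - given_index
    unnormalized = defaultdict(float)
    for outcome, p in joint.items():
        if outcome[given_index] == given_value:
            unnormalized[outcome[other]] += p
    total = sum(unnormalized.values())
    return {value: p / total for value, p in unnormalized.items()}

# The same joint answers questions in either direction
p_swims_given_water = condition(joint, 0, "water")
p_habitat_given_swims = condition(joint, 1, True)
```

Because the model is a joint distribution rather than a function from fixed inputs to a fixed output, neither feature is privileged as "the target".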

import pandas as pd
import lace

# Create an engine from a dataframe
df = pd.read_csv("animals.csv", index_col=0)
engine = lace.Engine.from_df(df)

# Fit a model to the dataframe over 5000 steps of the fitting procedure
engine.update(5000)

# Show the statistical structure of the data -- which features are likely
# dependent (predictive) on each other
engine.clustermap("depprob", zmin=0, zmax=1)

Animals dataset dependence probability

The Problem

The goal of lace is to fill some of the massive chasm between standard machine learning (ML) methods like deep learning and random forests, and statistical methods like probabilistic programming languages. We wanted to develop a machine that allows users to experience the joy of discovery, and indeed optimizes for it.

Short version

Standard, optimization-based ML methods don't help you learn about your data. Probabilistic programming tools assume you have already learned a lot about your data. Neither approach is optimized for what we think is the most important part of data science: the science part, that is, asking and answering questions.

Long version

Standard ML methods are easy to use. You can throw data into a random forest and start predicting with little thought. These methods attempt to learn a function f(x) -> y that maps inputs x to outputs y. This ease of use comes at a cost. Generally f(x) does not reflect the reality of the process that generated your data, but was instead chosen by whoever developed the approach to be sufficiently expressive to better achieve the optimization goal. This renders most standard ML completely uninterpretable and unable to yield sensible uncertainty estimates.

On the other extreme you have probabilistic tools like probabilistic programming languages (PPLs). A user specifies a model to a PPL in terms of a hierarchy of probability distributions with parameters θ. The PPL then uses a procedure (normally Markov Chain Monte Carlo) to learn about the posterior distribution of the parameters given the data p(θ|x). PPLs are all about interpretability and uncertainty quantification, but they place a number of pretty steep requirements on the user. PPL users must specify the model themselves from scratch, meaning they must know (or at least guess) the model. PPL users must also know how to specify such a model in a way that is compatible with the underlying inference procedure.

Example use cases

  • Combine data sources and understand how they interact. For example, we may wish to predict cognitive decline from demographics, survey or task performance, EKG data, and other clinical data. Combined, this data would typically be very sparse (most patients will not have all fields filled in), and it is difficult to know how to explicitly model the interaction of these data layers. In Lace, we would just concatenate the layers and run them through.
  • Understand the amount and causes of uncertainty over time. For example, a farmer may wish to understand the likelihood of achieving a specific yield over the growing season. As the season progresses, new weather data can be added to the prediction in the form of conditions. Uncertainty can be visualized as variance in the prediction, disagreement between posterior samples, or multi-modality in the predictive distribution (see this blog post for more information on uncertainty).
  • Data quality control. Use surprisal to find anomalous data in the table and use -logp to identify anomalies before they enter the table. Because Lace creates a model of the data, we can also contrive methods to find data that are inconsistent with that model, which we have used to good effect in error finding.
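As a rough illustration of the surprisal idea in the last use case, the sketch below scores each value by its negative log likelihood and ranks the values from most to least surprising. It uses a single Gaussian fitted to a handful of made-up numbers as a stand-in model; Lace would instead score each cell under its learned joint distribution, so treat this only as a picture of the quantity being computed.

```python
import math
import statistics

def gaussian_surprisal(x, mean, stdev):
    """Return -log p(x) under a normal distribution with the given mean and stdev."""
    z = (x - mean) / stdev
    return 0.5 * z * z + math.log(stdev) + 0.5 * math.log(2.0 * math.pi)

# Hypothetical orbital periods in minutes; the last value is a deliberate outlier
periods = [95.0, 96.5, 94.8, 97.1, 95.6, 96.0, 250.0]

mean = statistics.fmean(periods)
stdev = statistics.stdev(periods)

# Rank values from most to least surprising under the stand-in model
ranked = sorted(
    periods,
    key=lambda x: gaussian_surprisal(x, mean, stdev),
    reverse=True,
)
most_surprising = ranked[0]
```

The same score works for a value that is not yet in the table: compute -logp for the candidate and flag it if the score is unusually high.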

Who should not use Lace

There are a number of use cases for which Lace is not suited:

  • Non-tabular data such as images and text
  • Highly optimizing specific predictions
    • Lace would rather over-generalize than overfit.

Quick start

Installation

Lace requires rust.

To install the CLI:

$ cargo install --locked lace-cli

To install pylace:

$ pip install pylace

Examples

Lace comes with two pre-fit example data sets: Satellites and Animals.

>>> from lace.examples import Satellites
>>> engine = Satellites()

# Predict the class of orbit given the satellite has a 75-minute
# orbital period and that it has a missing value of geosynchronous
# orbit longitude, and return epistemic uncertainty via Jensen-
# Shannon divergence.
>>> engine.predict(
...     'Class_of_Orbit',
...     given={
...         'Period_minutes': 75.0,
...         'longitude_radians_of_geo': None,
...     },
... )
('LEO', 0.023981898950561048)

# Find the top 10 most surprising (anomalous) orbital periods in
# the table
>>> engine.surprisal('Period_minutes') \
...     .sort('surprisal', reverse=True) \
...     .head(10)
shape: (10, 3)
┌─────────────────────────────────────┬────────────────┬───────────┐
│ index                               ┆ Period_minutes ┆ surprisal │
│ ---                                 ┆ ---            ┆ ---       │
│ str                                 ┆ f64            ┆ f64       │
╞═════════════════════════════════════╪════════════════╪═══════════╡
│ Wind (International Solar-Terres... ┆ 19700.45       ┆ 11.019368 │
│ Integral (INTErnational Gamma-Ra... ┆ 4032.86        ┆ 9.556746  │
│ Chandra X-Ray Observatory (CXO)     ┆ 3808.92        ┆ 9.477986  │
│ Tango (part of Cluster quartet, ... ┆ 3442.0         ┆ 9.346999  │
│ ...                                 ┆ ...            ┆ ...       │
│ Salsa (part of Cluster quartet, ... ┆ 3418.2         ┆ 9.338377  │
│ XMM Newton (High Throughput X-ra... ┆ 2872.15        ┆ 9.13493   │
│ Geotail (Geomagnetic Tail Labora... ┆ 2474.83        ┆ 8.981458  │
│ Interstellar Boundary EXplorer (... ┆ 0.22           ┆ 8.884579  │
└─────────────────────────────────────┴────────────────┴───────────┘

And similarly in rust:

use lace::prelude::*;
use lace::examples::Example;

fn main() {
    // In rust, you can create an Engine or an Oracle. The Oracle is an
    // immutable version of an Engine; it has the same inference functions as
    // the Engine, but you cannot train or edit data.
    let engine = Example::Satellites.engine().unwrap();

    // Predict the class of orbit given the satellite has a 75-minute
    // orbital period and that it has a missing value of geosynchronous
    // orbit longitude, and return epistemic uncertainty via Jensen-
    // Shannon divergence.
    let (class_of_orbit, uncertainty) = engine
        .predict(
            "Class_of_Orbit",
            &Given::Conditions(vec![
                ("Period_minutes", Datum::Continuous(75.0)),
                ("longitude_radians_of_geo", Datum::Missing),
            ]),
            Some(PredictUncertaintyType::JsDivergence),
            None,
        )
        .unwrap();

    println!("{class_of_orbit:?}, {uncertainty:?}");
}
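The second number returned by predict in the examples above is described as a Jensen-Shannon divergence. The sketch below shows what that quantity measures for categorical predictions: how much a set of distributions disagree with one another. The per-state distributions are invented, and this is not a claim about how Lace computes the value internally.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def js_divergence(dists):
    """Equal-weight Jensen-Shannon divergence of several categorical distributions."""
    n = len(dists)
    mixture = [sum(column) / n for column in zip(*dists)]
    return entropy(mixture) - sum(entropy(d) for d in dists) / n

# Hypothetical predictive distributions over (LEO, MEO, GEO), one per state
agree = [[0.90, 0.05, 0.05], [0.88, 0.07, 0.05], [0.91, 0.04, 0.05]]
disagree = [[0.90, 0.05, 0.05], [0.05, 0.90, 0.05], [0.05, 0.05, 0.90]]

low_uncertainty = js_divergence(agree)
high_uncertainty = js_divergence(disagree)
```

When the distributions agree the divergence is close to zero; when each one is confident about a different answer it approaches its upper bound, the log of the number of distributions.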

Fitting a model

To fit a model to your own data you can use the CLI

$ lace run --csv my-data.csv -n 1000 my-data.lace

...or initialize an engine from a file or dataframe.

>>> import pandas as pd  # Lace supports polars as well
>>> from lace import Engine
>>> engine = Engine.from_df(pd.read_csv("my-data.csv", index_col=0))
>>> engine.update(1_000)
>>> engine.save("my-data.lace")

You can monitor the progress of the training using diagnostic plots

>>> from lace.plot import diagnostics
>>> diagnostics(engine)

Animals MCMC convergence

License

Lace is licensed under the Business Source License v1.1, which restricts commercial use. See LICENSE for full details.

If you would like a license for commercial use, please contact lace@redpoll.ai.

Academic use

Lace is free for academic use. Please cite lace according to the CITATION.cff metadata.

Installation

Installation requires rust, which you can get here.

CLI

The lace CLI is installed with rust via the command

$ cargo install --locked lace-cli

Rust crate

To use the lace crate in a rust project add the following line under the dependencies block in your Cargo.toml:

lace = "<version>"

Python

The python library can be installed with pip

$ pip install pylace

The lace workflow

The typical workflow consists of two or three steps:

  1. Create a codebook
  2. Run/fit/train a model
  3. Ask questions

Step 1 is optional in many cases as Lace usually does a good job of inferring the types of your data. The condensed workflow looks like this.

import pandas as pd
import lace

df = pd.read_csv("mydata.csv", index_col=0)

# 1. Create a codebook (optional)
codebook = lace.Codebook.from_df("mydata", df)

# 2. Initialize a new Engine from the prior. If no codebook is provided, a
# default will be generated
engine = lace.Engine.from_df(df, codebook=codebook)

# 3. Run inference
engine.update(5000)

And similarly in rust:

use polars::prelude::{SerReader, CsvReader};
use lace::prelude::*;

let df = CsvReader::from_path("mydata.csv")
  .unwrap()
  .has_header(true)
  .finish()
  .unwrap();

// 1. Create a codebook (optional)
let codebook = Codebook::from_df(&df, None, None, false).unwrap();

// 2. Build an engine
let mut engine = EngineBuilder::new(DataSource::Polars(df))
  .with_codebook(codebook)
  .build()
  .unwrap();

// 3. Run inference
// Use `run` to fit with the default transition set and update handlers; use
// `update` for more control.
engine.run(5_000);

You can also use the CLI to create codebooks and run inference. Creating a default YAML codebook with the CLI and then manually editing it is a good way to fine-tune models.

$ lace codebook --csv mydata.csv codebook.yaml
$ lace run --csv mydata.csv --codebook codebook.yaml -n 5000 metadata.lace

Create and edit a codebook

The codebook contains information about your data such as the row and column names, the types of data in each column, how those data should be modeled, and all the prior distributions on various parameters.

The default codebook

In the lace CLI, you have the ability to initialize and run a model without specifying a codebook.

$ lace run --csv data.csv -n 5000 metadata.lace

Behind the scenes, lace creates a default codebook by inferring the types of your columns and creating a very broad (but not quite broad enough to satisfy the frequentists) hyper prior, which is a prior on the prior.

We can also create the default codebook in code.

import polars as pl
from lace import Codebook
from lace.examples import ExamplePaths

# Here we get the path to an example csv file, but you can use any file that
# can be read into a polars or pandas dataframe
path = ExamplePaths("satellites").data
df = pl.read_csv(path)

# Infer the default codebook for df
codebook = Codebook.from_df("satellites", df)

And similarly in rust:

use polars::prelude::{CsvReader, SerReader};
use lace::codebook::Codebook;
use lace::examples::Example;

// Load an example file
let paths = Example::Satellites.paths().unwrap();
let df = CsvReader::from_path(paths.data)
    .unwrap()
    .has_header(true)
    .finish()
    .unwrap();

// Create the default codebook
let codebook = Codebook::from_df(&df, None, None, false).unwrap();

Creating a template codebook

Lace is happy to generate a default codebook for you when you initialize a model. You can create and save the default codebook to a file using the CLI. To create a codebook from a CSV file:

$ lace codebook --csv data.csv codebook.yaml

Note that if you love quotes and brackets, and hate being able to comment, you can use JSON for the codebook. Just give the output file a .json extension.

$ lace codebook --csv data.csv codebook.json

If you use a data format with a schema, such as Parquet or IPC (Apache Arrow v2), you make Lace's work a bit easier.

$ lace codebook --ipc data.arrow codebook.yaml

If you want to make changes to the codebook -- the most common of which are editing the Dirichlet process prior, specifying whether certain columns are missing not-at-random, adjusting the prior distributions and disabling hyper priors -- you just open it up in your text editor and get to work.

For example, say we want to mark a column of the satellites data set as missing not-at-random. We first create the template codebook,

$ lace codebook --csv satellites.csv codebook-sats.yaml

open it up in a text editor and find the column of interest

- name: longitude_radians_of_geo
  coltype: !Continuous
    hyper:
      pr_m:
        mu: 0.21544247097911842
        sigma: 1.570659039531299
      pr_k:
        shape: 1.0
        rate: 1.0
      pr_v:
        shape: 6.066108090103747
        scale: 6.066108090103747
      pr_s2:
        shape: 6.066108090103747
        scale: 2.4669698184613824
    prior: null
  notes: null
  missing_not_at_random: false

and change the column metadata to something like this:

- name: longitude_radians_of_geo
  coltype: !Continuous
    hyper:
      pr_m:
        mu: 0.21544247097911842
        sigma: 1.570659039531299
      pr_k:
        shape: 1.0
        rate: 1.0
      pr_v:
        shape: 6.066108090103747
        scale: 6.066108090103747
      pr_s2:
        shape: 6.066108090103747
        scale: 2.4669698184613824
    prior: null
  notes: "This value is only defined for GEO satellites"
  missing_not_at_random: true

Sometimes we have a bit of knowledge that we can transfer to lace in the form of a more specific prior distribution. To do so, we remove the hyper prior and set the prior directly. Note that doing this disables prior parameter inference.

- name: longitude_radians_of_geo
  coltype: !Continuous
    hyper: null
    prior: 
        m: 0.0
        k: 1.0
        v: 1.0
        s2: 3.0
  notes: "This value is only defined for GEO satellites"
  missing_not_at_random: true
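To get a feel for what a fixed prior like the one above implies, it can help to draw a few component parameters from it. The sketch below assumes that m, k, v, and s2 parameterize a normal-inverse-chi-squared prior on a component's mean and variance; that reading of the four fields is an assumption on our part, so check the codebook reference before relying on it. The helper name is hypothetical.

```python
import math
import random

def draw_normal_params(m, k, v, s2, rng):
    """Draw (mu, sigma2) from a normal-inverse-chi-squared(m, k, v, s2) prior."""
    # A chi-squared variate with v degrees of freedom is Gamma(shape=v/2, scale=2)
    chi2 = rng.gammavariate(v / 2.0, 2.0)
    # Scaled inverse-chi-squared draw for the variance
    sigma2 = v * s2 / chi2
    # Given the variance, the mean is normal around m with variance sigma2 / k
    mu = rng.gauss(m, math.sqrt(sigma2 / k))
    return mu, sigma2

rng = random.Random(0)
# Parameter values copied from the codebook snippet above
draws = [draw_normal_params(m=0.0, k=1.0, v=1.0, s2=3.0, rng=rng) for _ in range(5)]
```

If the drawn means and variances look implausible for the column, that is a hint the prior should be adjusted before fitting.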

For a complete list of codebook fields, see the reference.

Run/train/fit a model

Lace is a Bayesian tool so we do posterior sampling via Markov chain Monte Carlo (MCMC). A typical machine learning model will use some sort of optimization method to find one model that fits best; the objective for fitting is different in Lace.

In Lace we use a number of states (or samples), each running MCMC independently to characterize the posterior distribution of the model parameters given the data. Posterior sampling isn't meant to maximize the fit to a dataset, it is meant to help understand the conditions that created the data.

When you fit to your data in Lace, you have options to run a set number of states for a set number of iterations (limited by a timeout). Each state is a posterior sample. More states is better, but the run time of everything increases linearly with the number of states; not just the fit, but also the OracleT operations like logp and simulate. As a rule of thumb, 32 is a good default number of states. But if you find your states tend to strongly disagree on things, it is probably a good idea to add more states to fill in the gaps.
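One way to see why query cost grows linearly with the number of states is that every state has to answer the query before the answers are combined. The sketch below assumes the combination is a plain average of the per-state likelihoods, computed in log space for numerical stability. The log likelihood values are invented, and the function illustrates the idea rather than Lace's internals.

```python
import math

def average_logp(state_logps):
    """Log of the mean likelihood across states, computed as a stable log-mean-exp."""
    top = max(state_logps)
    total = sum(math.exp(lp - top) for lp in state_logps)
    return top + math.log(total / len(state_logps))

# Hypothetical log likelihoods of one observation under four states
state_logps = [-1.2, -1.0, -3.5, -1.1]
combined = average_logp(state_logps)

# A wide spread between states is the kind of disagreement that suggests adding more
spread = max(state_logps) - min(state_logps)
```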

As for number of iterations, you will want to monitor your convergence plots. There is no benefit of early stopping like there is with neural networks. MCMC will usually only do better the longer you run it and Lace is not likely to overfit like a deep network.

A state under MCMC

The above figure shows the MCMC algorithm partitioning a dataset into views and categories.

A (potentially useless) analogy comparing MCMC to optimization

At the risk of creating more confusion than we resolve, let us make an analogy to mountaineering. You have two mountaineers: a gradient ascent (GA) mountaineer and an MCMC mountaineer. You place each mountaineer at a random point in the Himalayas and say "go". GA's goal is to find the peak of Everest. Its algorithm for doing so is simply always to go up and never to go down. GA is guaranteed to find a peak, but unless it is very lucky in its starting position, it is unlikely ever to summit Everest.

MCMC has a different goal: to map the mountain range (posterior distribution). It does this by mostly going up, but sometimes going down as long as it doesn't end up too low. The longer MCMC explores, the better understanding it gains about the Himalayas: an understanding which likely includes a good idea of the position of the peak of Everest.

While GA achieves its goal quickly, it does so at the cost of understanding the terrain, which in our analogy represents the information within our data.

In Lace we place a troop of mountaineers in the mountain range of our posterior distribution. We call individual mountaineers states, samples, or chains. Our hope is that our mountaineers can sufficiently map the information in our data. Of course, the ability of the mountaineers to build this map depends on the size of the space (which is related to the size of the data) and the complexity of the space (the complexity of the underlying process).

In general the posterior of a Dirichlet process mixture is indeed much like the Himalayas in that there are many, many peaks (modes), which makes the mountaineer's job difficult. Certain MCMC kernels do better in certain circumstances, and employing a variety of kernels leads to better results.
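The two mountaineers can be sketched on a one-dimensional range with a small peak and a tall peak. This is a toy, not Lace's sampler: the landscape is an invented two-bump density, the greedy climber stands in for gradient ascent, and the explorer is a plain random-walk Metropolis sampler. All names are hypothetical.

```python
import math
import random

def log_height(x):
    """Log of an unnormalized two-peak density: a small peak near -2, a tall one near +2."""
    a = math.log(0.3) - 0.5 * (x + 2.0) ** 2
    b = math.log(0.7) - 0.5 * (x - 2.0) ** 2
    top = max(a, b)
    return top + math.log(math.exp(a - top) + math.exp(b - top))

def hill_climb(x, step=0.01):
    """Greedy ascent: move uphill in fixed steps until no neighbor is higher."""
    while True:
        here = log_height(x)
        if log_height(x + step) > here:
            x += step
        elif log_height(x - step) > here:
            x -= step
        else:
            return x

def metropolis(x, n_steps, rng, proposal_sd=1.0):
    """Random-walk Metropolis: always accept uphill moves, sometimes accept downhill ones."""
    samples = []
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, proposal_sd)
        log_accept = log_height(proposal) - log_height(x)
        # 1 - random() lies in (0, 1], so the logarithm is always defined
        if math.log(1.0 - rng.random()) < log_accept:
            x = proposal
        samples.append(x)
    return samples

start = -3.0  # both mountaineers begin on the slope of the small peak
ga_peak = hill_climb(start)
samples = metropolis(start, n_steps=20_000, rng=random.Random(0))
share_near_tall_peak = sum(s > 0.0 for s in samples) / len(samples)
```

Started on the wrong slope, the greedy climber stops at the small peak, while the sampler's visits cover both peaks roughly in proportion to their mass. That proportion is the "map" the analogy refers to.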

Our MCMC Kernels

The vast majority of the fitting runtime is spent updating the row-category and column-view assignments. Other updates, such as feature component parameters, CRP parameters, and prior parameters, take a (relatively) insignificant amount of time. Here we discuss the MCMC kernels responsible for most of the work in Lace: the row and column reassignment kernels.

Row kernels

  • slice: Proposes reassignment for each row to an instantiated category or one of many new, empty categories. Slice is good for small tweaks in the assignment, and it is very fast. When there are a lot of rows, slice can have difficulty creating new categories.
  • gibbs: Proposes reassignment of each row sequentially. Generally makes larger moves than slice. Because it is sequential, and accesses data in random order, gibbs is very slow.
  • sams: Proposes mergers and splits of categories. Only considers the rows in one or two categories. Proposes large moves, but cannot make the fine tweaks that slice and gibbs can. Since it proposes big moves, its proposals are often rejected as the run progresses and the state is already fitting fairly well.

Column kernels

The column kernels are generally adapted from the row kernels with some caveats.

  • slice: Same as the row kernel, but over columns.
  • gibbs: The same structurally as the row kernel, but uses random seed control to implement parallelism.

Gibbs is a good choice if the number of columns is high and mixing is a concern.
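All of these kernels move items (rows or columns) between groups, and the text above mentions CRP parameters, so it may help to see what a draw from a Chinese restaurant process looks like on its own. The sketch below samples a partition from the CRP prior only: there is no data and no likelihood, which a real reassignment kernel would also weigh. The function name and the alpha value are illustrative.

```python
import random

def crp_partition(n_items, alpha, rng):
    """Assign n_items to categories with a sequential Chinese restaurant process draw."""
    assignment = []
    counts = []
    for _ in range(n_items):
        # An existing category k has weight counts[k]; a new category has weight alpha
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)  # open a new category
        else:
            counts[k] += 1    # join an existing category
        assignment.append(k)
    return assignment, counts

rng = random.Random(1)
assignment, counts = crp_partition(n_items=20, alpha=1.0, rng=rng)
```

Larger values of alpha make new categories more likely, which is why alpha is itself one of the quantities updated during fitting.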

Fitting models in code

Though the CLI is a convenient way to fit models and generate metadata files outside of python or rust, you may often find yourself wanting to fit in code. Lace gives you a number of options in both rust and python.

Rust

We first initialize a new Engine:

use rand::SeedableRng;
use rand_xoshiro::Xoshiro256Plus;
use polars::prelude::{CsvReader, SerReader};
use lace::prelude::*;
use lace::examples::Example;

// Load an example file
let paths = Example::Satellites.paths().unwrap();
let df = CsvReader::from_path(paths.data)
    .unwrap()
    .has_header(true)
    .finish()
    .unwrap();

// Create the default codebook
let codebook = Codebook::from_df(&df, None, None, false).unwrap();

// Build an rng
let rng = Xoshiro256Plus::from_entropy();

// This initializes 32 states from the prior
let mut engine = Engine::new(
    32, // number of states
    codebook,
    DataSource::Polars(df),
    0,
    rng,
).unwrap();

Now we have two options for fitting. We can use the Engine::run method, which uses a default set of transition operations that prioritizes speed.

engine.run(1_000);

We can also tell lace exactly which transitions to run.

// Run for 100 iterations. Use the Gibbs column reassignment kernel, and
// alternate between the merge-split (Sams) and slice row kernels
let run_config = EngineUpdateConfig::new()
    .n_iters(100)
    .transitions(vec![
        StateTransition::ColumnAssignment(ColAssignAlg::Gibbs),
        StateTransition::StateAlpha,
        StateTransition::RowAssignment(RowAssignAlg::Sams),
        StateTransition::ComponentParams,
        StateTransition::RowAssignment(RowAssignAlg::Slice),
        StateTransition::ComponentParams,
        StateTransition::ViewAlphas,
        StateTransition::FeaturePriors,
    ]);

engine.update(run_config.clone(), ()).unwrap();

Note the second argument to engine.update. This is the update handler, which allows users to do things like attach progress bars, handle Ctrl+C, and collect additional diagnostic information. There are a number of built-ins for common use cases, but you can implement UpdateHandler for your own types if you need extra capabilities. () is the null update handler.

If we wanted a simple progress bar:

use lace::prelude::update_handler::ProgressBar;

engine.update(run_config.clone(), ProgressBar::new()).unwrap();

Or if we wanted a progress bar and a Ctrl+C handler, we can use a tuple of UpdateHandlers.

use lace::prelude::update_handler::CtrlC;

engine.update(
    run_config,
    (ProgressBar::new(), CtrlC::new())
).unwrap();

Python

Let's load an Engine from an example and run it with the default transitions for 100 iterations.

from lace.examples import Satellites

engine = Satellites()
engine.update(100)

As in rust, we can control which transitions are run. Let's just update the row assignments a bunch of times.

from lace import RowKernel, StateTransition

engine.update(
    500,
    timeout=10,              # each state can run for at most 10 seconds
    checkpoint=250,          # save progress every 250 iterations
    save_path="mydata.lace",
    transitions=[
        StateTransition.row_assignment(RowKernel.slice()),
        StateTransition.view_alphas(),
    ],
)

Convergence

When training a neural network, we monitor for convergence in the error or loss. When, say, we see diminishing returns in our loss function with each epoch, or we see overfitting in the validation set, it is time to stop. Convergence in MCMC is a bit different. We say our Markov chain has converged when it has settled into a situation in which it is producing draws from the posterior distribution. Early in the chain, it is rapidly moving away from the low-probability area in which it was initialized and into the higher-probability areas more representative of the posterior.

To monitor convergence, we observe the score (which is proportional to the likelihood) over time. If the score stops increasing and begins to oscillate, one of two things has happened: we have settled into the posterior distribution, or the Markov Chain has gotten stuck on an island of high likelihood. When a model is identifiable (meaning that each unique parameter set creates a unique model) the posterior distribution is unimodal, which means there is only one peak, which is easily mapped.
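The "stops increasing and begins to oscillate" check is usually done by eye on a plot, but a crude numerical version can be sketched as follows. The window size, tolerance, and function name are arbitrary choices for illustration, and the score traces are synthetic. Note that a plateau alone cannot distinguish a converged chain from one stuck on an island of high likelihood, so comparing traces across states is still needed.

```python
from statistics import fmean

def has_plateaued(scores, window=50, tol=1e-3):
    """Return True when the mean score of the last window is no longer rising."""
    if len(scores) < 2 * window:
        return False
    recent = fmean(scores[-window:])
    previous = fmean(scores[-2 * window:-window])
    # Relative improvement of the latest window over the one before it
    gain = (recent - previous) / max(abs(previous), 1e-12)
    return gain < tol

# Synthetic traces: one still climbing, one that climbs and then oscillates
rising = [-1000.0 + 5.0 * i for i in range(150)]
settled = rising + [-255.0 + (1.0 if i % 2 else -1.0) for i in range(200)]

still_rising = has_plateaued(rising)
has_settled = has_plateaued(settled)
```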