Preparing your data for Lace

Compared with many other machine learning tools, Lace has very few requirements for data: data columns may be integer, continuous, or categorical string types; empty cells do not need to be filled in; and the table must contain a string row index column labeled ID or Index (case-insensitive).

Supported data types for inference

Lace supports several data types, and more can be supported (with some work from us).

Continuous data

Continuous columns are modeled as mixtures of Gaussian distributions. Find an explanation of the parameters in the codebook.

Categorical data

Note: For categorical columns, lace currently supports up to 256 unique values.

Categorical columns are modeled as mixtures of categorical distributions. Find an explanation of the parameters in the codebook.

Count data

Support exists for a count data type, which is modeled as a mixture of Poisson distributions, but it has some drawbacks that make it best to convert count data to continuous in most cases:

  • The Poisson distribution is a single parameter model so the location and variance of the mixture components cannot be controlled individually. In the Poisson model, higher magnitude means higher variance.
  • The hyper prior for count data is finicky and can often cause underflow/overflow errors when the underlying data do not look like Poisson distributions.

Note: If you use count data, do so because you know that the underlying mixture components will be Poisson-like, and be sure to set the prior and unset the hyperprior in the codebook.
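If you decide to convert a count column to continuous instead, a simple cast in pandas is enough (the column name here is a hypothetical example):

```python
import pandas as pd

# Hypothetical column of event counts that we choose to model as
# continuous rather than as Poisson-distributed count data
df = pd.DataFrame({"n_events": [3, 0, 12, 7]})

# Cast to float so the cells are written with decimals and the column
# is treated as continuous
df["n_events"] = df["n_events"].astype(float)
```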


Lace is pretty forgiving when it comes to data. You can have missing values, string values, and numerical values all in the same table; but there are some rules your data must follow for Lace to interpret it correctly. Here you will learn how to make sure Lace understands your data properly.

Accepted formats

Lace currently accepts the following data formats:

  • CSV
  • CSV.gz (gzipped CSV)
  • Parquet
  • IPC (Apache Arrow v2)
  • JSON (as output by the pandas method df.to_json('mydata.json'))
  • JSON Lines
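As a sketch, all of these formats can be produced from a pandas DataFrame; the file names below are placeholders:

```python
import pandas as pd

# A tiny frame with a string row index, written to each schemaless format
df = pd.DataFrame({"x": [1.0, 2.5]}, index=["a", "b"])
df.index.name = "ID"

df.to_csv("mydata.csv")     # CSV
df.to_csv("mydata.csv.gz")  # gzipped CSV (compression inferred from extension)
df.to_json("mydata.json")   # JSON
df.reset_index().to_json("mydata.jsonl", orient="records", lines=True)  # JSON Lines

# Parquet and Arrow IPC work the same way via df.to_parquet(...) and
# df.reset_index().to_feather(...), but both require the pyarrow package.
```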

Using a schemaless data format

Formatting your data properly will help Lace understand it. Under the hood, Lace uses polars to read data formats into a DataFrame. For more information about I/O in polars, see the polars API documentation.

Here are the rules:

  1. Real-valued (continuous) cells must have decimals.
  2. Integer-valued cells, whether count or categorical, must not have decimals.
  3. Categorical data cells may be integers (up to 255) or string values.
  4. In a CSV, missing cells should be left empty.
  5. A row index is required. The index label should be 'ID' or 'Index' (case-insensitive).

Not following these rules will confuse the codebook and could cause parsing errors.
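As a minimal sketch, here is a hypothetical table that follows all of the rules above, built with pandas (the column names and values are illustrative only):

```python
import pandas as pd

# Continuous values carry decimals, the count column uses nullable Int64
# so it is written without decimals, categoricals are strings, missing
# cells are left as None, and the row index is labeled 'ID'
df = pd.DataFrame(
    {
        "height_cm": [180.2, 165.0, None],             # continuous
        "n_pets": pd.array([2, 0, 1], dtype="Int64"),  # count (no decimals)
        "color": ["red", None, "blue"],                # categorical strings
    },
    index=["a", "b", "c"],
)
df.index.name = "ID"
df.to_csv("mydata.csv")  # missing cells are written as empty fields
```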

Row and column names

Row and column indices or names must be strings. If you were to create a codebook from a CSV with integer row and column indices, Lace would convert them to strings.
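You can also do that conversion yourself in pandas before writing the file; a quick sketch:

```python
import pandas as pd

# A frame with default integer row and column labels
df = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]])

# Convert both axes to string labels, as Lace would when building a codebook
df.index = df.index.astype(str)
df.columns = df.columns.astype(str)
```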

Tips on creating valid data with pandas

When reading data from a CSV, pandas will convert integer columns with missing cells to float values, since floats can represent NaN, which is how pandas represents missing data. You have a couple of options for saving your CSV file with both missing cells and properly formatted integers:

You can coerce the types to Int64, pandas' nullable integer type (basically Int plus NaN), and then write to CSV.

import pandas as pd

# A column with a missing value; pandas stores it as float by default
df = pd.DataFrame([10, None, 30], columns=['my_int_col'])

# Int64 is nullable, so the missing cell is preserved while the
# integers are written without decimals
df['my_int_col'] = df['my_int_col'].astype('Int64')
df.to_csv('mydata.csv', index_label='ID')

If you have a lot of columns, or particularly long ones, you might find it much faster to reformat as you write to the CSV, in which case you can use the float_format option of DataFrame.to_csv:

df.to_csv('mydata.csv', index_label='ID', float_format='%g')
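As a quick check of what '%g' does, here is a sketch with a hypothetical float column containing a missing value:

```python
import pandas as pd

# An integer column that pandas holds as float because of the missing cell
df = pd.DataFrame({"my_int_col": [10.0, None, 30.0]})

# '%g' writes whole floats without a trailing '.0'; the missing cell is
# written as an empty field
df.to_csv("mydata.csv", index_label="ID", float_format="%g")
```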