Compared with many other machine learning tools, Lace has very few requirements
for data: data columns may be integer, continuous, or categorical string types;
empty cells do not need to be filled in; and the table must contain a
string row index column labeled 'ID'.
Lace supports several data types, and more can be supported (with some work from us).
Note: For categorical columns, lace currently supports up to 256 unique values.
Lace also supports a count data type, which is modeled as a mixture of Poisson distributions, but it has drawbacks that make converting the data to continuous the better choice in most cases:
- The Poisson distribution is a single-parameter model, so the location and variance of the mixture components cannot be controlled independently. In a Poisson model the variance equals the mean, so a higher magnitude means a higher variance.
- The hyper prior for count data is finicky and can often cause underflow/overflow errors when the underlying data do not look Poisson-distributed.
Note: If you use count data, do so because you know that the underlying mixture components will be Poisson-like, and be sure to set the prior and unset the hyperprior in the codebook.
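To make the first drawback concrete, here is a minimal sketch in plain Python (it does not use Lace, and the helper name poisson_mean_var is hypothetical). It computes the mean and variance of a Poisson distribution numerically from its probability mass function; both come out equal to the rate, so a component centered on larger counts is necessarily wider.

```python
import math


def poisson_mean_var(rate, n_terms=2000):
    """Numerically compute the mean and variance of Poisson(rate) from its pmf.

    Hypothetical helper for illustration only; the infinite sum is truncated
    at n_terms, which is ample for the rates used below.
    """
    mean = 0.0
    second_moment = 0.0
    for k in range(n_terms):
        # log pmf = k*log(rate) - rate - log(k!), evaluated in log space
        log_p = k * math.log(rate) - rate - math.lgamma(k + 1)
        p = math.exp(log_p)
        mean += k * p
        second_moment += k * k * p
    return mean, second_moment - mean * mean


for rate in (2.0, 20.0, 200.0):
    mean, var = poisson_mean_var(rate)
    print(f"rate={rate:6.1f}  mean={mean:8.3f}  variance={var:8.3f}")
```

Because one parameter fixes both quantities, a mixture of Poissons cannot place a narrow component at a high count, which is one reason converting to continuous is usually the safer choice.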
Lace is pretty forgiving when it comes to data. You can have missing values, string values, and numerical values all in the same table; but there are some rules that your data must follow for the platform to pick up on things. Here you will learn how to make sure Lace understands your data properly.
Lace currently accepts the following data formats:
- CSV.gz (gzipped CSV)
- IPC (Apache Arrow v2)
- JSON (as output by pandas)
- JSON Lines
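As an illustration of the gzipped CSV format, the sketch below writes a small table to a .csv.gz file using only the Python standard library and reads it back. The table contents, column names, and file path are made up for the example; the 'ID' index column and the empty cells for missing values follow the conventions this page describes.

```python
import csv
import gzip
import os
import tempfile

# Hypothetical example table; note the 'ID' column and the empty (missing) cells.
header = ["ID", "height", "n_legs", "color"]
rows = [
    ["row_0", "1.5", "4", "brown"],
    ["row_1", "", "2", ""],
    ["row_2", "0.7", "", "grey"],
]

# Write into a temporary directory so the example leaves the working directory alone.
path = os.path.join(tempfile.mkdtemp(), "mydata.csv.gz")

# Text mode ("wt") lets the csv module write through the gzip stream.
with gzip.open(path, "wt", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(rows)

# Read the file back to confirm the round trip.
with gzip.open(path, "rt", newline="", encoding="utf-8") as f:
    round_trip = list(csv.reader(f))

print(round_trip[0])
```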
Formatting your data properly will help the platform understand your data.
Under the hood, Lace uses polars to read these data formats into a DataFrame.
For more information about I/O in polars, see the polars API documentation.
Here are the rules:
- Real-valued (continuous data) cells must have decimals.
- Integer-valued cells, whether count or categorical, must not have decimals.
- Categorical data cells may be integers (up to 255) or string values.
- In a CSV, missing cells should be empty.
- A row index is required. The index label should be 'ID'.
Not following these rules will confuse the codebook and could cause parsing errors.
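The cell-level rules above can be checked before handing a file to Lace. The following is a minimal sketch in plain Python, not Lace's own parser: the function names, the type labels ("continuous", "count", "categorical"), and the deliberately strict regular expressions are assumptions made for this example.

```python
import re

# Strict patterns (an assumption of this sketch): integers have no decimal
# point, and real values must contain one.
_INT = re.compile(r"[+-]?\d+")
_REAL = re.compile(r"[+-]?(\d+\.\d*|\.\d+)([eE][+-]?\d+)?")

# Which kinds of non-missing cell each column type may contain.
ALLOWED = {
    "continuous": {"real"},
    "count": {"integer"},
    "categorical": {"integer", "string"},
}


def classify_cell(cell):
    """Label a raw CSV cell as 'missing', 'integer', 'real', or 'string'."""
    text = cell.strip()
    if text == "":
        return "missing"
    if _INT.fullmatch(text):
        return "integer"
    if _REAL.fullmatch(text):
        return "real"
    return "string"


def find_violations(cells, column_type):
    """Return the positions of cells that break the rules for column_type."""
    bad = []
    for i, cell in enumerate(cells):
        kind = classify_cell(cell)
        if kind == "missing":
            continue  # empty cells are always allowed
        if kind not in ALLOWED[column_type]:
            bad.append(i)
        elif column_type == "categorical" and kind == "integer" and int(cell) > 255:
            bad.append(i)  # integer categories may go up to 255
    return bad


print(find_violations(["1.5", "2", "", "3.0"], "continuous"))      # [1]
print(find_violations(["0", "255", "256", "red"], "categorical"))  # [2]
print(find_violations(["3", "4.0", ""], "count"))                  # [1]
```

A check like this only flags formatting problems; whether a column is best treated as continuous, count, or categorical is still a modeling decision.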
Row and column indices or names must be strings. If you create a codebook from a CSV with integer row and column indices, Lace converts them to strings.
When reading data from a CSV, pandas converts integer columns with missing
cells to float values, since floats can represent NaN, which is how pandas
represents missing data. You have a couple of options for saving your CSV file
with both missing cells and properly formatted integers:
You can coerce the types to Int64, which is basically Int plus support for
missing values, and then write to CSV.
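import pandas as pd
df = pd.DataFrame([10,20,30], columns=['my_int_col'])
# 'Int64' (capital I) is the nullable integer dtype, so missing cells stay missing
df['my_int_col'] = df['my_int_col'].astype('Int64')
df.to_csv('mydata.csv', index_label='ID')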
If you have a lot of columns or particularly long columns, you might find it
much faster just to reformat as you write to the CSV, in which case you can
use the float_format option in DataFrame.to_csv:
df.to_csv('mydata.csv', index_label='ID', float_format='%g')
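To compare the two options side by side, here is a small sketch using a made-up column that contains a missing value. It writes to strings rather than files so the output is easy to inspect; the column name and row labels are hypothetical.

```python
import pandas as pd

# A missing cell forces pandas to store this integer column as float64.
df = pd.DataFrame({"my_int_col": [10, None, 30]}, index=["a", "b", "c"])

# Default output: the integers pick up a trailing ".0".
default_csv = df.to_csv(index_label="ID")

# Option 1: coerce to the nullable Int64 dtype before writing.
int64_csv = df.astype({"my_int_col": "Int64"}).to_csv(index_label="ID")

# Option 2: keep float64 and reformat while writing.
fmt_csv = df.to_csv(index_label="ID", float_format="%g")

print(default_csv.splitlines())  # ['ID,my_int_col', 'a,10.0', 'b,', 'c,30.0']
print(int64_csv.splitlines())    # ['ID,my_int_col', 'a,10', 'b,', 'c,30']
print(fmt_csv.splitlines())      # ['ID,my_int_col', 'a,10', 'b,', 'c,30']
```

Note that float_format applies to every float column in the table, so whole-number values in genuinely continuous columns would also lose their decimals; coercing individual columns to Int64 avoids that.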