User Guide

TL;DR These are notes and highlights. For usage see the Workflow Outline, Examples Gallery, and API

`fitgrid` `Epochs`

Fitting linear regression models in Python and R with formulas like y ~ a + b + a:b in Python (patsy, statsmodels.formula.api) or R (lm, lme4::lmer) assumes the data are represented as a 2D array with the variables in named columns, y, a, b and values (“observations”, “scores”) in rows.

fitgrid follows this format with the additional assumption that the data are vertically stacked fixed-length time-series “epochs”, so the user must specify two additional row indexes that together uniquely identify the epoch and timestamp of each data row.

Epochs data format

Data for fitgrid modeling should be prepared as a single pandas.DataFrame with a pandas.MultiIndex as follows:

MultiIndex names and data types

Each epoch must have a unique integer identifier in the epoch_id index. The index values need not be ordered and gaps are allowed but duplicate epoch indices are not. The time index integer timestamps must be the same for all epochs. Gaps and irregular intervals between timestamps are allowed, duplicate timestamps within an epoch are not.

epoch_id (default): integer
time (default): integer

The MultiIndex names default to epoch_id and time but any index column may be designated as the epoch or time index, provided the conditions are met. Event-relative epochs are often time-stamped so the event is at time=0 but this is not a fitgrid requirement and the temporal resolution of the timeseries is not specified. For data epochs with time index values -10, 0, 10, 20, the FitGrid[-10:20, channels] object will be the same whether the timestamps denote milliseconds or months.

Data columns

channel data columns: numeric
predictor variable columns: numeric, string, boolean

Example:

A canonical source of data epochs for fitgrid are multichannel “strip charts” as in EEG and MEG recordings. In this case, the epochs are regularly sampled fixed-length segments extracted from the strip chart and time-stamped relative to an experimental event.

Data Ingestion 

Rows and columns epochs data can be loaded into a fitgrid Epochs object from a pandas.DataFrame in memory or read from files in feather or HDF5 format. For details on using these data formats see pandas.read_feather() and and pandas.read_hdf().

Data Simulation 

fitgrid has a built-in method fitgrid.fake_data.generate() that returns a Epochs data objects or pandas.DataFrame for testing.

EEG Sample Data

Optional sample files containing previously prepared EEG data epochs are availble for download from the Zenodo archive at eeg-workshops/mkpy_data_examples/data.

The files can be installed into the fitgrid package with fitgrid.sample_data.get_file() and accessed thereafter as fitgrid.DATA_DIR(<filename>) or downloaded manually to another location.

The sample data with msNNN.epochs.* filenames contain segmented EEG epochs and the msNNNN infix gives the length of the epoch in millesconds. Files with the shortest epochs (100 ms) are suitable for testing, those with longer epochs (1500, 3000 ms) are more representative of actual experimental EEG data. The feather format versions *.epochs.feather are recommended for use with pandas pandas.read_feather().

Additional information about the experimental designs for these sample files is available online at https://eeg-workshops.github.io/mkpy_data_examples.

Fitting a model

The following methods populate the FitGrid[times,channel] object with statsmodels results for OLS model fits and lme4::lmer for linear mixed-effects fits.

Ordinary least squares: fitgrid.lm

lm_grid = fitgrid.lm(
    epochs_fg,
    RHS='1 + categorical + continuous'
)

Linear mixed-effects: fitgrid.lmer

lmer_grid = fitgrid.lmer(
    epochs_fg,
    RHS='1 + continuous + (continuous | categorical)'
)

User-defined (experimental): fitgrid.run_model

The `FitGrid[times, channels]`

Slice by time, channel

Slice the FitGrid[times, channels] with pandas.DataFrame range : and label slicers. The range includes the upper bound.

lm_grid[:, ["MiCe", "MiPa"]]
lm_grid[-100:300, :]
lm_grid[0, "MiPa"]

Access results

Query the FitGrid[times, channels] results like a single fit object. Result grids are returned as a pandas.DataFrame or another FitGrid[times, channels] which can be queried the same way.

lmg_grid.params
lmg_grid.llf

Slice and access

lm_grid[-100:300, ["MiCe", "MiPa"].params

`fitgrid.LMFitGrid()` methods

The fitted OLS grid provides time-series plots of selected model results: estimated coefficients fitgrid.lm.plot_betas() and adjusted \(R^2\) fitgrid.lm.plot_adj_rsquared() (see also fitgrid.utils() for additional model summary wrappers).

Saving and loading grids

Running models on large datasets can take a long time. fitgrid lets you save your grid to disk so you can restore them later without having to refit the models. However, saving and loading large grids may still be slow and generate very large files.

Suppose you run lmer like so:

grid = fitgrid.lmer(epochs, RHS='x + (x|a)')

Save the grid:

grid.save('lmer_results')

Later you can reload the grid:

grid = fitgrid.load_grid('lmer_results')

Warning

Fitted grids are saved and loaded with Python pickle which is not guaranteed to be portable across different versions of Python. Unpickling unknown files is not secure (for details see the Python docs). Only load grids you trust such as those you saved yourself. For reproducibility and portability fit the grid, collect the results you need, and export the dataframe to a standard data interchange format.

Model comparisons and summaries

To reduce memory demands when comparing sets of models, fitgrid provides a convenience wrapper, fitgrid.utils.summarize, that iteratively fits a list of models and collects a lightweight summary dataframe with key results for model interpretation and comparison. Unlike FitGrid[times, channels] objects, the summary dataframe format is the same for fitgrid.lm and fitgrid.lmer. Some helper functions are available for visualizing selected summary results.

Model and data diagnostics (WIP)

Model and data diagnostics in the fitgrid framework is work in progress. For ordinary least squares fitting, there is some support for the native statsmodels OLS diagnostic measures. Diagostics that can be computed analytically from a single model fit, e.g., via the hat matrix diagonal, may be useable but many are not for realistically large data sets. The per-observation diagnostic measures, e.g., the influence of observations on estimated parameters, are the same size as the original data multiplied by the number of model parameters which may overload memory and measures that require on leave-one-observation-out model refitting take intractably long for large data sets. A minimal effort is made to guard the user from known trouble but the general policy is fitgrid stays out of the way so you can try what you want. If it works great, if it chokes, that’s the nature of the beast you are modeling.

Support for linear-mixed effects diagnostics in fitgrid is limited to a variance inflation factor computation implemented in Python as a proof-of-concept. fitgrid does not interface with mixed-effect model diagnostics libraries in R and plans are to improve support for mixed-effects modeling in Python rather than expand further into the R ecosystem.

fitgrid under the hood

How mixed effects models are run

Mixed effects models do not have a complete implementation in Python, so we interface with R from Python and use lme4 in R. The results that you get when fitting mixed effects models in fitgrid are the same as if you used lme4 directly, because we use lme4 (indirectly).

Multicore model fitting

On a multicore machine, it may be possible to significantly speed fitting by computing the models in parallel especially for fitgrid.lmer. For least squares models, fitgrid.lm uses statsmodels under the hood, which in turn employs numpy for calculations. numpy itself depends on linear algebra libraries that might be configured to use multiple threads by default. This means that on a 48 core machine, common linear algebra calculations might use 24 cores automatically, without any explicit parallelization. So when you explicitly parallelize your calculations using Python processes (say 4 of them), each process might start 24 threads. In this situation, 96 CPU bound threads are wrestling each other for time on the 48 core CPU. This is called oversubscription and results in slower computations.

To deal with this when running fitgrid.lm, we try to instruct the linear algebra libraries your numpy distribution depends on to only use a single thread in every computation. This then lets you control the number of CPU cores being used by setting the n_cores parameter in fitgrid.lm and fitgrid.lmer.

If you are using your own 8-core laptop, you might want to use all but one core, so set something like n_cores=7. On a shared machine, it’s a good idea to run on half or 3/4 of the cores if no one else is running heavy computations.

Note that fitgrid parallel processing counts the “logical” cores available to the operating system and this may differ from the number of physical cores, depending on the system hardware and setting, e.g., Intel CPUs with hyperthreading enabled. The Python package psutil and psutil.cpu_count(logical=True) and psutil.cpu_count(logical=False) may be useful for interrogating the system about the available resources.