User Guide

TL;DR These are notes and highlights. For usage, see the Workflow Outline, Examples Gallery, and API.

fitgrid.Epochs

Fitting linear regression models with formulas like y ~ a + b + a:b in Python (patsy, statsmodels.formula.api) or R (lm, lme4::lmer) assumes the data are represented as a 2D array with the variables in named columns (y, a, b) and the values (“observations”, “scores”) in rows.
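As a rough sketch of what such a formula stands for, the example below builds the design matrix for y ~ a + b + a:b by hand with numpy and solves the least-squares problem directly. The data and coefficient values are made up for illustration. In practice, patsy or R constructs the design matrix from the named columns.

```python
import numpy as np

# Illustrative data: two numeric predictors, one value per observation (row)
rng = np.random.default_rng(0)
a = rng.normal(size=50)
b = rng.normal(size=50)

# Noise-free response generated from known, made-up coefficients
true_beta = np.array([1.0, 2.0, 3.0, 0.5])  # intercept, a, b, a:b
y = true_beta[0] + true_beta[1] * a + true_beta[2] * b + true_beta[3] * a * b

# Design matrix for y ~ a + b + a:b: intercept, main effects, interaction
X = np.column_stack([np.ones_like(a), a, b, a * b])

# Ordinary least squares estimate of the four coefficients
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Because the response here is noise-free, beta_hat recovers true_beta up to floating point error.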

fitgrid follows this format with the additional assumption that the data are vertically stacked fixed-length time-series “epochs”, so the user must specify two additional columns of values that together uniquely identify the epoch and time of the data row.

Format

Specification: Data for fitgrid modeling should be prepared as a single pandas.DataFrame with these columns and data types:

  • epoch_id: integer

  • time: integer

  • a set of channel data columns: numeric

  • a set of predictor variable columns: numeric, string, boolean

Each epoch must have a unique integer identifier in the epoch_id column. Gaps are OK, duplicates are not.

The integer time-stamps must be the same for all epochs. Gaps are OK, duplicates are not.

Column names may be chosen freely (epoch and time index column names default to epoch_id, time).

Notes: The index column names default to epoch_id and time but any column may be designated as the epoch or time index, provided the conditions are met. Standard practice is to sequence the rows so time stamps are nested within epochs. Event-relative epochs are generally time-stamped so the event is at time=0 but this is not a fitgrid requirement.
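The specification above can be sketched with plain pandas. The example builds a small dataframe in the described layout and checks the epoch and time conditions. The channel names, predictor names, and the check_epochs_format helper are hypothetical and are not part of fitgrid.

```python
import numpy as np
import pandas as pd

n_epochs, n_times = 3, 4
epoch_ids = [10, 11, 15]        # gaps are OK, duplicates are not
times = np.arange(-4, 12, 4)    # -4, 0, 4, 8: the same time stamps in every epoch

rng = np.random.default_rng(0)
epochs_df = pd.DataFrame({
    "epoch_id": np.repeat(epoch_ids, n_times),   # time stamps nested within epochs
    "time": np.tile(times, n_epochs),
    "MiCe": rng.normal(size=n_epochs * n_times),  # channel data columns (numeric)
    "MiPa": rng.normal(size=n_epochs * n_times),
    "continuous": np.repeat([0.1, 0.5, 0.9], n_times),    # predictor columns
    "categorical": np.repeat(["a", "b", "a"], n_times),
})


def check_epochs_format(df, epoch_id="epoch_id", time="time"):
    """Hypothetical helper: True if df meets the index conditions above."""
    # both index columns must hold integers
    if not (pd.api.types.is_integer_dtype(df[epoch_id])
            and pd.api.types.is_integer_dtype(df[time])):
        return False
    # each (epoch, time) pair must identify exactly one row
    if df.duplicated([epoch_id, time]).any():
        return False
    # every epoch must have the same set of time stamps
    stamps = {tuple(sorted(group[time])) for _, group in df.groupby(epoch_id)}
    return len(stamps) == 1


format_ok = check_epochs_format(epochs_df)
```

Dropping a single row, or appending a copy of an existing row, makes the check fail because the epochs no longer share the same time stamps or a row is no longer uniquely identified.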

Example: A canonical source of data epochs for fitgrid is multichannel “strip charts”, as in EEG and MEG recordings. In this case, “epochs” are fixed-length segments extracted from the strip chart and time-stamped relative to an experimental event.

_images/eeg_epochs.png

Data Ingestion

Epochs data in this rows-and-columns format can be loaded into a fitgrid.Epochs object from a pandas.DataFrame in memory or read from files in feather or HDF5 format.

For details on these data formats see pandas.read_feather and pandas.read_hdf.

Data Simulation

fitgrid has a built-in function that generates data and creates Epochs data for testing.

Fitting a model

The following methods populate the FitGrid[time, channel] object with statsmodels results for OLS model fits and lme4::lmer results for linear mixed-effects fits.

  • Ordinary least squares: fitgrid.lm()

    lm_grid = fitgrid.lm(
        epochs_fg,
        RHS='1 + categorical + continuous'
    )
    
  • Linear mixed-effects: fitgrid.lmer()

    lmer_grid = fitgrid.lmer(
        epochs_fg,
        RHS='1 + continuous + (continuous | categorical)'
    )
    
  • User-defined (experimental): fitgrid.run_model()

The FitGrid[time, channel]

Slice by time, channel

Slice the FitGrid with pandas.DataFrame range : and label slicers. The range includes the upper bound.

lm_grid[:, ["MiCe", "MiPa"]]
lm_grid[-100:300, :]
lm_grid[0, "MiPa"]
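The inclusive upper bound follows the pandas convention for label-based slicing. The sketch below illustrates that convention on a plain pandas.Series with made-up time stamps. It does not use a FitGrid and says nothing about how FitGrid implements slicing internally.

```python
import pandas as pd

# Stand-in for one channel of results indexed by integer time stamps
times = pd.Index(range(-200, 500, 100), name="time")   # -200, -100, ..., 400
values = pd.Series(range(len(times)), index=times)

# Label-based slicing with .loc includes both end points
by_label = values.loc[-100:300]
sliced_times = list(by_label.index)    # [-100, 0, 100, 200, 300]

# Positional slicing of a Python list excludes the upper bound
by_position = list(range(5))[1:4]      # [1, 2, 3]
```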

Access results

Query the FitGrid results like a single fit object. Result grids are returned as a pandas.DataFrame or another FitGrid, which can be queried the same way.

lm_grid.params
lm_grid.llf

Slice and access

lm_grid[-100:300, ["MiCe", "MiPa"]].params

LMFitGrid methods

The fitted OLS grid provides time-series plots of selected model results: estimated coefficients with fitgrid.lm.plot_betas() and adjusted \(R^2\) with fitgrid.lm.plot_adj_rsquared() (see also fitgrid.utils for additional model summary wrappers).

Saving and loading grids

Running models on large datasets can take a long time. fitgrid lets you save your grid to disk so you can restore it later without having to refit the models. However, saving and loading large grids may still be slow and generate very large files.

Suppose you run lmer like so:

grid = fitgrid.lmer(epochs, RHS='x + (x|a)')

Save the grid:

grid.save('lmer_results')

Later you can reload the grid:

grid = fitgrid.load_grid('lmer_results')

Warning

Fitted grids are saved and loaded with Python pickle, which is not guaranteed to be portable across different versions of Python. Unpickling unknown files is not secure (for details see the Python docs). Only load grids you trust, such as those you saved yourself. For reproducibility and portability, fit the grid, collect the results you need, and export the dataframe to a standard data interchange format.
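A minimal sketch of the export step, assuming the collected results are an ordinary pandas.DataFrame. The small hand-made dataframe below is a hypothetical stand-in, and its index levels and channel columns are illustrative rather than a guarantee of what fitgrid returns. CSV is used here as one example of a standard interchange format.

```python
import io

import pandas as pd

# Hypothetical stand-in for a results dataframe collected from a fitted grid:
# rows indexed by time and coefficient name, one column per channel
index = pd.MultiIndex.from_product(
    [[0, 4], ["Intercept", "continuous"]], names=["time", "beta"]
)
params = pd.DataFrame(
    {"MiCe": [1.0, 0.5, 1.1, 0.4], "MiPa": [2.0, 0.3, 2.1, 0.2]}, index=index
)

# Write plain-text CSV; an in-memory buffer stands in for a file on disk
buffer = io.StringIO()
params.to_csv(buffer)

# Read it back, restoring the two index levels
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=["time", "beta"])
```

Unlike a pickled grid, the exported file holds only the selected results, so it can be read without fitgrid and across Python versions.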

Model comparisons and summaries

To reduce memory demands when comparing sets of models, fitgrid provides a convenience wrapper, fitgrid.utils.summarize, that iteratively fits a list of models and collects a lightweight summary dataframe with key results for model interpretation and comparison. Unlike the primary FitGrid, the summary dataframe format is the same for fitgrid.lm and fitgrid.lmer. Some helper functions are available for visualizing selected summary results.

Model and data diagnostics (WIP)

Model and data diagnostics in the fitgrid framework are a work in progress. For ordinary least squares fitting, there is some support for the native statsmodels OLS diagnostic measures. Diagnostics that can be computed analytically from a single model fit, e.g., via the hat matrix diagonal, may be usable, but many are not for realistically large data sets. The per-observation diagnostic measures, e.g., the influence of observations on estimated parameters, are the same size as the original data multiplied by the number of model parameters, which may overload memory, and measures that rely on leave-one-observation-out model refitting take intractably long for large data sets. A minimal effort is made to guard the user from known trouble, but the general policy is that fitgrid stays out of the way so you can try what you want. If it works, great; if it chokes, that’s the nature of the beast you are modeling.

Support for linear mixed-effects diagnostics in fitgrid is limited to a variance inflation factor computation implemented in Python as a proof-of-concept. fitgrid does not interface with mixed-effects model diagnostics libraries in R, and plans are to improve support for mixed-effects modeling in Python rather than expand further into the R ecosystem.
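To make the two cases concrete, the sketch below computes the hat matrix diagonal (leverage) for a single illustrative design matrix with numpy, then counts how many values a per-observation, per-parameter influence measure would need across a grid. The design matrix and grid dimensions are made up, and the code does not call fitgrid or statsmodels.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_params = 200, 3

# Illustrative design matrix: intercept plus two numeric predictors
X = np.column_stack([np.ones(n_obs), rng.normal(size=(n_obs, 2))])

# Hat matrix H = X (X'X)^-1 X'. With the thin QR factorization X = QR, H = QQ',
# so the diagonal of H is the row-wise sum of squares of Q and the full
# n_obs x n_obs matrix is never formed.
Q, _ = np.linalg.qr(X)
leverage = np.sum(Q**2, axis=1)

# A per-observation influence measure needs one value per observation and
# parameter for every cell of the time x channel grid
n_times, n_channels = 100, 32   # hypothetical grid size
n_influence_values = n_obs * n_params * n_times * n_channels   # 1920000
```

The leverages sum to the number of model parameters, which is a quick sanity check on the computation.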

fitgrid under the hood

How mixed effects models are run

Mixed effects models do not have a complete implementation in Python, so we interface with R from Python and use lme4 in R. The results that you get when fitting mixed effects models in fitgrid are the same as if you used lme4 directly, because we use lme4 (indirectly).

Multicore model fitting

On a multicore machine, it may be possible to significantly speed up fitting by computing the models in parallel. fitgrid.lm uses statsmodels under the hood to fit a linear least squares model, which in turn employs numpy for calculations. numpy itself depends on linear algebra libraries that might be configured to use multiple threads by default. This means that on a 48-core machine, common linear algebra calculations might use 24 cores automatically, without any explicit parallelization. So when you explicitly parallelize your calculations using Python processes (say 4 of them), each process might start 24 threads. In this situation, 96 CPU-bound threads are wrestling each other for time on the 48-core CPU. This is called oversubscription and results in slower computations.
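The arithmetic behind oversubscription can be written out as a short sketch. The two helper functions are hypothetical names used only for illustration.

```python
def total_threads(n_processes, threads_per_process):
    """Number of CPU-bound threads competing for cores."""
    return n_processes * threads_per_process


def is_oversubscribed(n_processes, threads_per_process, n_cores):
    """True when more CPU-bound threads are running than there are cores."""
    return total_threads(n_processes, threads_per_process) > n_cores


# The scenario from the text: 4 processes, 24 threads each, 48 cores
threads = total_threads(4, 24)            # 96
crowded = is_oversubscribed(4, 24, 48)    # True

# With one thread per process, the same 4 processes occupy only 4 cores
relaxed = is_oversubscribed(4, 1, 48)     # False
```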

To deal with this when running fitgrid.lm, we try to instruct the linear algebra libraries your numpy distribution depends on to only use a single thread in every computation. This then lets you control the number of CPU cores being used by setting the n_cores parameter in fitgrid.lm() and fitgrid.lmer().

If you are using your own 8-core laptop, you might want to use nearly all of the cores, so set something like n_cores=7. On a shared machine, it’s a good idea to run on half or 3/4 of the cores if no one else is running heavy computations.

Note that fitgrid parallel processing counts the “logical” cores available to the operating system, and this may differ from the number of physical cores, depending on the system hardware and settings, e.g., Intel CPUs with hyperthreading enabled. The Python package psutil may be useful for interrogating the system about the available resources: psutil.cpu_count(logical=True) reports logical cores and psutil.cpu_count(logical=False) reports physical cores.
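One possible way to turn these rules of thumb into a value for n_cores is sketched below. The suggest_n_cores helper is hypothetical and not part of fitgrid. It uses os.cpu_count() from the standard library, which reports logical cores, as a stand-in for psutil.

```python
import os


def suggest_n_cores(fraction=0.75, logical_cores=None):
    """Hypothetical helper: a fraction of the logical cores, but at least 1."""
    if logical_cores is None:
        # os.cpu_count() counts logical cores and may return None
        logical_cores = os.cpu_count() or 1
    return max(1, int(logical_cores * fraction))


# Shared 48-core machine: half or three quarters of the cores
shared_half = suggest_n_cores(0.5, logical_cores=48)             # 24
shared_three_quarters = suggest_n_cores(0.75, logical_cores=48)  # 36

# Personal 8-core laptop: leave one core free
laptop = suggest_n_cores(7 / 8, logical_cores=8)                 # 7

# Whatever machine this runs on
this_machine = suggest_n_cores()
```

The result would then be passed as the n_cores argument described above.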