This is a list of tools available in fitgrid.
Data Ingestion
Functions that read epochs tables and create Epochs and load FitGrid objects.
- fitgrid.epochs_from_dataframe(dataframe, time, epoch_id, channels)
Construct Epochs object from a Pandas DataFrame epochs table.
The DataFrame should contain columns with names defined by epoch_id and time as index columns.
- Parameters
dataframe (pandas DataFrame) – a pandas DataFrame object
time (str) – time column name
epoch_id (str) – epoch identifier column name
channels (list of str) – list of string channel names
- Returns
epochs – an Epochs object with the data
- Return type
Epochs
- fitgrid.epochs_from_hdf(filename, key, time, epoch_id, channels)
Construct Epochs object from an HDF5 file containing an epochs table.
The HDF5 file should contain columns with names defined by epoch_id and time, either as index columns or as regular columns. Accepting regular columns is a convenience; in general, input epochs tables should contain these columns in the index.
- Parameters
filename (str) – HDF5 file name
key (str) – group identifier for the dataset when HDF5 file contains more than one
time (str) – time column name
epoch_id (str) – epoch identifier column name
channels (list of str) – list of string channel names
- Returns
epochs – an Epochs object with the data
- Return type
Epochs
Data Simulation
fitgrid has a built-in function that generates data and creates Epochs:
- fitgrid.generate(n_epochs=10, n_samples=100, n_categories=2, n_channels=32, time='time', epoch_id='epoch_id', seed=None, return_type='epochs')
Return Epochs object or pandas.DataFrame with fake EEG data.
- Parameters
n_epochs (int) – number of epochs per category to be generated
n_samples (int) – number of samples in a single epoch
n_categories (int) – number of levels of the categorical variable
n_channels (int) – number of time series representing EEG channels
time (str, defaults to defaults.TIME) – time column name
epoch_id (str, defaults to defaults.EPOCH_ID) – epoch identifier column name
seed ({None, int, array_like}, optional) – Random number generation seed. The default, None, lets the data vary from run to run. Set seed to a 32-bit unsigned integer to generate the same fake data from run to run. See numpy.random.RandomState for details.
return_type (str {epochs, dataframe}) – return fitgrid.Epochs or the fitgrid.Epochs.table dataframe
- Returns
epochs – Epochs object or just the data
- Return type
fitgrid.Epochs or pandas.DataFrame
Notes
n_epochs and n_categories interact in the sense that n_epochs epochs are generated for each level of the categorical variable. In other words, the true number of epochs in the generated data is equal to n_epochs * n_categories.
For example, the default n_epochs = 10 and n_categories = 2 produces 20 epochs, 10 per category.
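The arithmetic in this note can be checked with a short sketch. It does not call fitgrid; total_epochs is a hypothetical helper name used only for illustration.

```python
def total_epochs(n_epochs=10, n_categories=2):
    """Epochs in the generated data: n_epochs for each category level."""
    return n_epochs * n_categories


default_total = total_epochs()      # 10 epochs x 2 categories
larger_total = total_epochs(25, 3)  # 25 epochs x 3 categories
```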
Epochs methods
Models and plotting.
- fitgrid.epochs.Epochs.plot_averages(self, channels=None, negative_up=True)
Plot grand mean averages for each channel, negative up by default.
- Parameters
channels (list of str, optional, defaults to all channels) – list of channel names to plot the averages
negative_up (bool, optional, default True) – by convention, ERPs are plotted negative voltage up
- Returns
fig (matplotlib.figure.Figure) – figure containing plots
axes (numpy.ndarray of matplotlib.axes.Axes) – axes objects
Model running
- fitgrid.lm(epochs, LHS=None, RHS=None, parallel=False, n_cores=4, quiet=False, eval_env=4)
Run ordinary least squares linear regression on the epochs.
- Parameters
epochs (Epochs) – epochs object on which regression is to be run
LHS (list of str, optional, defaults to all channels) – list of channels for the left hand side of the regression formula
RHS (str) – right hand side of the regression formula
parallel (bool, defaults to False) – change to True to run in parallel
n_cores (int, defaults to 4) – number of processes to use for computation
quiet (bool, defaults to False) – set to True to disable fitting progress bar
eval_env (int or patsy.EvalEnvironment, defaults to 4) – environment to use for evaluating patsy formulas, see patsy docs
- Returns
grid – LMFitGrid object containing the results of the regression
- Return type
LMFitGrid
- fitgrid.lmer(epochs, LHS=None, RHS=None, family='gaussian', conf_int='Wald', factors=None, permute=None, ordered=False, REML=True, parallel=False, n_cores=4, quiet=False)
Fit lme4 linear mixed model by interfacing with R.
- Parameters
epochs (Epochs) – epochs object on which lmer is to be run
LHS (list of str, optional, defaults to all channels) – list of channels for the left hand side of the lmer formula
RHS (str) – right hand side of the lmer formula
family (str, defaults to ‘gaussian’) – distribution link function to use
conf_int (str, defaults to ‘Wald’)
factors (dict, optional) – Keys should be column names in data to treat as factors. Values should be a list containing unique variable levels if dummy coding or polynomial coding is desired; otherwise, values should themselves be dictionaries with unique variable levels as keys and desired contrast values (as specified in R!) as values.
permute (int, defaults to None) – if non-zero, computes parameter significance tests by permuting test statistics rather than parametrically. Permutation is done by shuffling observations within clusters to respect the random effects structure of the data.
ordered (bool, defaults to False) – whether factors should be treated as ordered polynomial contrasts; this will parameterize a model with K-1 orthogonal polynomial regressors beginning with a linear contrast based on the factor order provided
REML (bool, defaults to True) – change to False to use ML estimation
parallel (bool, defaults to False) – change to True to run in parallel
n_cores (int, defaults to 4) – number of processes to use for computation
quiet (bool, defaults to False) – set to True to disable fitting progress bar
- Returns
grid – LMERFitGrid object containing the results of lmer fitting
- Return type
LMERFitGrid
- fitgrid.run_model(epochs, function, channels=None, parallel=False, n_cores=4, quiet=False)
Run an arbitrary model on the epochs.
- Parameters
epochs (Epochs) – the epochs object on which the model is to be run
function (Python function) – function that runs a model, see Notes below for details
channels (list of str) – list of channels to serve as dependent variables
parallel (bool, defaults to False) – set to True in order to run in parallel
n_cores (int, defaults to 4) – number of processes to run in parallel
quiet (bool, defaults to False) – set to True to disable progress bar display
- Returns
grid – a FitGrid object containing the results
- Return type
FitGrid
Notes
The function should take two parameters, data and channel, run some model on the data, and return an object containing the results. data will be a snapshot across epochs at a single timepoint, containing all channels of interest. channel is the name of the target variable that the function runs the model against (uses it as the dependent variable).
Examples
Here’s an example of a function that can be passed to run_model:
    # ols is assumed here to be the statsmodels formula interface
    from statsmodels.formula.api import ols

    def regression(data, channel):
        formula = channel + ' ~ continuous + categorical'
        return ols(formula, data).fit()
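The calling contract described in these Notes can be illustrated with a toy sketch. It is not fitgrid's implementation: run_grid_sketch and channel_mean are hypothetical names, and each snapshot is a plain list of dicts standing in for the pandas DataFrame that fitgrid would pass as data.

```python
from statistics import mean


def channel_mean(data, channel):
    """Toy 'model': mean of one channel across epochs at one timepoint."""
    return mean(row[channel] for row in data)


def run_grid_sketch(snapshots, function, channels):
    """Call function(data, channel) once per (timepoint, channel) cell."""
    return {
        (time, channel): function(data, channel)
        for time, data in snapshots.items()
        for channel in channels
    }


# Two timepoints; each snapshot holds two epochs with two channels.
snapshots = {
    0: [{"MiPf": 1.0, "MiCe": 2.0}, {"MiPf": 3.0, "MiCe": 4.0}],
    4: [{"MiPf": 5.0, "MiCe": 6.0}, {"MiPf": 7.0, "MiCe": 8.0}],
}
grid = run_grid_sketch(snapshots, channel_mean, channels=["MiPf", "MiCe"])
```

Any function with the same (data, channel) signature could be dropped in place of channel_mean; the result is one returned object per timepoint and channel.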
FitGrid methods
- fitgrid.fitgrid.FitGrid.save(self, filename)
Save FitGrid object to file (reload with fitgrid.load_grid).
- Parameters
filename (str) – file name to use
LMFitGrid methods
Plotting and statistics.
- fitgrid.fitgrid.LMFitGrid.influential_epochs(self, top=None)
Return dataframe with top influential epochs ranked by Cook’s-D.
- Parameters
top (int, optional, default None) – how many top epochs to return, all epochs by default
- Returns
top_epochs – dataframe with epoch_id as index and aggregated Cook’s-D as values
- Return type
pandas DataFrame
Notes
Cook’s distance is aggregated by simple averaging across time and channels.
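The aggregation in this note can be sketched in plain Python. This is an illustration of "simple averaging across time and channels", not fitgrid's code: rank_epochs_by_mean_cooks_d is a hypothetical helper, and the dict keyed by (time, epoch_id, channel) with made-up values stands in for the real grid of Cook's D values.

```python
from collections import defaultdict
from statistics import mean


def rank_epochs_by_mean_cooks_d(cooks_d, top=None):
    """Average Cook's D per epoch over all times and channels, then rank."""
    per_epoch = defaultdict(list)
    for (_time, epoch_id, _channel), value in cooks_d.items():
        per_epoch[epoch_id].append(value)
    ranked = sorted(
        ((epoch_id, mean(values)) for epoch_id, values in per_epoch.items()),
        key=lambda pair: pair[1],
        reverse=True,  # most influential epoch first
    )
    return ranked if top is None else ranked[:top]


# Made-up values: 2 timepoints x 3 epochs x 2 channels.
per_epoch_values = {0: (0.5, 0.25, 0.25, 0.5), 1: (1.0, 2.0, 1.0, 2.0), 2: (0.0, 0.0, 0.5, 0.5)}
cells = [(0, "MiPf"), (0, "MiCe"), (4, "MiPf"), (4, "MiCe")]
cooks_d = {
    (time, epoch_id, channel): value
    for epoch_id, values in per_epoch_values.items()
    for (time, channel), value in zip(cells, values)
}

ranked = rank_epochs_by_mean_cooks_d(cooks_d)
top_two = rank_epochs_by_mean_cooks_d(cooks_d, top=2)
```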
- fitgrid.fitgrid.LMFitGrid.plot_betas(self, legend_on_bottom=False)
Plot betas of the model, one plot per channel, overplotting betas.
- Parameters
legend_on_bottom (bool, defaults to False) – set to True to plot single legend below all channel plots
- Returns
fig (matplotlib.figure.Figure) – figure containing plots
axes (numpy.ndarray of matplotlib.axes.Axes) – axes objects
- fitgrid.fitgrid.LMFitGrid.plot_adj_rsquared(self)
Plot adjusted \(R^2\) as a heatmap with marginal bar and line.
- Returns
fig (matplotlib.figure.Figure) – figure containing plots
gs (matplotlib.gridspec.GridSpec) – grid specification that determines locations and sizes of subplots
bar, heatmap, colorbar, line (matplotlib.axes._subplots.AxesSubplot) – subplot objects
Utilities
model fit summaries
- fitgrid.utils.summary.summarize(epochs_fg, modeler, LHS, RHS, parallel=True, n_cores=4, **kwargs)
Fit the data with one or more model formulas and return summary information.
Convenience wrapper, useful for keeping memory use manageable when gathering betas and fit measures for a stack of models.
- Parameters
epochs_fg (fitgrid.epochs.Epochs) – as returned by fitgrid.epochs_from_dataframe() or fitgrid.epochs_from_hdf(), NOT a pandas.DataFrame.
modeler ({‘lm’, ‘lmer’}) – class of model to fit, lm for OLS, lmer for linear mixed-effects. Note: the RHS formula language must match the modeler.
LHS (list of str) – the data columns to model
RHS (model formula or list of model formulas to fit) – see the Python package patsy docs for the lm formula language and the R library lme4 docs for the lmer formula language.
parallel (bool)
n_cores (int) – number of cores to use. Experiment to find what works, but be considerate of other users when running on a shared machine.
**kwargs (key=value arguments passed to the modeler, optional)
- Returns
summary_df – indexed by timestamp, model_formula, beta, and key, where the keys are ll.l_ci, uu.u_ci, AIC, DF, Estimate, P-val, SE, T-stat, has_warning, logLike.
- Return type
pandas.DataFrame
Examples
>>> lm_formulas = [
...     '1 + fixed_a + fixed_b + fixed_a:fixed_b',
...     '1 + fixed_a + fixed_b',
...     '1 + fixed_a',
...     '1 + fixed_b',
...     '1',
... ]
>>> lm_summary_df = fitgrid.utils.summarize(
...     epochs_fg,
...     'lm',
...     LHS=['MiPf', 'MiCe', 'MiPa', 'MiOc'],
...     RHS=lm_formulas,
...     parallel=True,
...     n_cores=4
... )
>>> lmer_formulas = [
...     '1 + fixed_a + (1 + fixed_a | random_a) + (1 | random_b)',
...     '1 + fixed_a + (1 | random_a) + (1 | random_b)',
...     '1 + fixed_a + (1 | random_a)',
... ]
>>> lmer_summary_df = fitgrid.utils.summarize(
...     epochs_fg,
...     'lmer',
...     LHS=['MiPf', 'MiCe', 'MiPa', 'MiOc'],
...     RHS=lmer_formulas,
...     parallel=True,
...     n_cores=12,
...     REML=False
... )
- fitgrid.utils.summary.plot_betas(summary_df, LHS, alpha=0.05, fdr=None, figsize=None, s=None, df_func=None, **kwargs)
Plot model parameter estimates for each data column in LHS.
- Parameters
summary_df (pd.DataFrame) – as returned by fitgrid.utils.summary.summarize
LHS (list of str) – column names of the data fitgrid.fitgrid docs
alpha (float) – alpha level for false discovery rate correction
fdr (str {None, ‘BY’, ‘BH’}) – Add markers for FDR adjusted significant \(p\)-values. BY is Benjamini and Yekutieli, BH is Benjamini and Hochberg, None suppresses the markers.
df_func ({None, function}) – plot function(degrees of freedom), e.g., np.log10, lambda x: x
s (float) – scatterplot marker size for BH and lmer decorations
kwargs (dict) – keyword args passed to pyplot.subplots()
- Returns
figs
- Return type
list
- fitgrid.utils.summary.plot_AICmin_deltas(summary_df, figsize=None, gridspec_kw=None, **kwargs)
Plot FitGrid min delta AICs and fitter warnings.
Thresholds of AIC_min delta at 2, 4, 7, 10 are from Burnham & Anderson 2004, see Notes.
- Parameters
summary_df (pd.DataFrame) – as returned by fitgrid.utils.summary.summarize
figsize (2-tuple) – pyplot.figure figure size parameter
gridspec_kw (dict) – matplotlib.gridspec key : value parameters
kwargs (dict) – keyword args passed to plt.subplots(…)
- Returns
f, axs
- Return type
matplotlib.pyplot.Figure
Notes
[BurAnd2004] p. 270-271. Where \(AIC_{min}\) is the lowest AIC value for “a set of a priori candidate models well-supported by the underlying science \(g_{i}, i = 1, 2, ..., R)\)”,
\[\Delta_{i} = AIC_{i} - AIC_{min}\]
“is the information loss experienced if we are using fitted model \(g_{i}\) rather than the best model, \(g_{min}\) for inference.” …
“Some simple rules of thumb are often useful in assessing the relative merits of models in the set: Models having \(\Delta_{i} \leq 2\) have substantial support (evidence), those in which \(4 \leq \Delta_{i} \leq 7\) have considerably less support, and models having \(\Delta_{i} > 10\) have essentially no support.”
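A minimal sketch of the \(\Delta_{i}\) calculation and the quoted rules of thumb follows. It is independent of fitgrid: aic_min_deltas and support_label are hypothetical helpers, the AIC values are made up, and deltas that fall between the quoted ranges are deliberately left unlabeled because the quotation does not cover them.

```python
def aic_min_deltas(aics):
    """Delta_i = AIC_i - AIC_min for each candidate model."""
    aic_min = min(aics.values())
    return {model: aic - aic_min for model, aic in aics.items()}


def support_label(delta):
    """Label a delta using only the ranges stated in the quoted rule."""
    if delta <= 2:
        return "substantial"
    if 4 <= delta <= 7:
        return "considerably less"
    if delta > 10:
        return "essentially none"
    return "not covered by the quoted rule"


# Made-up AIC values for four candidate formulas at a single grid cell.
aics = {"1 + a + b": 100.0, "1 + a": 101.5, "1 + b": 105.0, "1": 112.0}
deltas = aic_min_deltas(aics)
labels = {model: support_label(delta) for model, delta in deltas.items()}
```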
lm diagnostics
- fitgrid.utils.lm.get_vifs(epochs, RHS, quiet=False)
- fitgrid.utils.lm.list_diagnostics()
Display statsmodels diagnostics implemented in fitgrid.utils.lm
- fitgrid.utils.lm.get_diagnostic(lm_grid, diagnostic, do_nobs_loop=False)
Fetch statsmodels diagnostic as a Time x Channel dataframe
statsmodels implements a variety of data and model diagnostic measures. For some, it also computes a version of a recommended critical value or \(p\)-value. Use these at your own risk after careful study of the statsmodels source code. For details visit statsmodels.stats.outliers_influence.OLSInfluence.html
For a catalog of the measures available for fitgrid.lm() run this in Python
>>> fitgrid.utils.lm.list_diagnostics()
Warning
Data diagnostics can be very large and very slow, see Notes for details.
By default all values of the diagnostics are computed; this dataframe can be pruned with the fitgrid.utils.lm.filter_diagnostic() function.
By default slow diagnostics are not computed; this can be forced by setting do_nobs_loop=True.
- Parameters
lm_grid (fitgrid.LMFitGrid) – As returned by fitgrid.lm().
diagnostic (string) – As implemented in statsmodels, e.g., “cooks_distance”, “dffits_internal”, “est_std”, “dfbetas”.
do_nobs_loop (bool) – True forces slow leave-one-observation-out model refitting.
- Returns
diagnostic_df (pandas.DataFrame) – Channels are in columns. Model measures are row indexed by time; data measures add an epoch row index; parameter measures add a parameter row index.
sm_1_df (pandas.DataFrame) – The supplementary values statsmodels returns, or None, same shape as diagnostic_df.
Notes
Size: diagnostic_df values for data measures like cooks_distance and hat_matrix_diagonal are the size of the original data plus a row index; for some data measures like dfbetas, they are the size of the data multiplied by the number of regressors in the model.
Speed: Leave-one-observation-out (LOOO) model refitting takes as long as it takes to fit one model multiplied by the number of observations. This can be intractable for large datasets. Diagnostic measures calculated from the original fit like cooks_distance and dffits_internal are tractable even for large data sets.
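A back-of-the-envelope sketch of these size and speed notes follows. It assumes one model per time x channel cell with one observation per epoch; diagnostic_size_sketch is a hypothetical helper, and the regressor count passed in below (3) is an assumption for illustration, not something fitgrid reports.

```python
def diagnostic_size_sketch(n_times, n_epochs, n_channels, n_regressors):
    """Rough value and refit counts implied by the Size and Speed notes."""
    data_values = n_times * n_epochs * n_channels
    return {
        # one value per observation, e.g. cooks_distance
        "per_observation_values": data_values,
        # one value per observation per regressor, e.g. dfbetas
        "dfbetas_values": data_values * n_regressors,
        # leave-one-observation-out: one refit per observation in a model
        "looo_refits_per_model": n_epochs,
        # summed over every time x channel model in the grid
        "looo_refits_grid": n_times * n_channels * n_epochs,
    }


# 100 samples, 20 epochs, 32 channels; 3 regressors is an assumed count.
sizes = diagnostic_size_sketch(n_times=100, n_epochs=20, n_channels=32, n_regressors=3)
```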
Examples
    import fitgrid

    # fake data
    epochs_fg = fitgrid.generate()
    lm_grid = fitgrid.lm(
        epochs_fg,
        LHS=epochs_fg.channels,
        RHS='continuous + categorical',
        parallel=True,
        n_cores=4,
    )

    # data diagnostic, one dataframe with the values
    ess_press, _ = fitgrid.utils.lm.get_diagnostic(lm_grid, 'ess_press')

    # Cook's D dataframe AND the p-values statsmodels computes
    cooks_Ds, sm_pvals = fitgrid.utils.lm.get_diagnostic(lm_grid, 'cooks_distance')

    # this fails because it requires LOOO loop
    dfbetas_df, _ = fitgrid.utils.lm.get_diagnostic(lm_grid, 'dfbetas')

    # this succeeds by forcing LOOO loop calculation
    dfbetas_df, _ = fitgrid.utils.lm.get_diagnostic(
        lm_grid, 'dfbetas', do_nobs_loop=True
    )
- fitgrid.utils.lm.filter_diagnostic(diagnostic_df, how, bound_0, bound_1=None, format='long')
Select a subset of a fitgrid statsmodels diagnostic dataframe by value.
Use this to identify time points, epochs, parameters, channels with outlying or potentially influential data.
- Parameters
diagnostic_df (pandas.DataFrame) – As returned by
fitgrid.utils.lm.get_diagnostic()
how ({‘above’, ‘below’, ‘inside’, ‘outside’}) – slice diagnostic_df above or below bound_0 or inside or outside the closed interval [bound_0, bound_1].
bound_0 (scalar or array-like) – bound_0 is the mandatory boundary for all how. See pandas.DataFrame.gt and pandas.DataFrame.lt documents for binary comparisons with dataframes.
bound_1 (scalar or array-like) – bound_1 is the mandatory upper bound for how=“inside” and how=“outside”.
format ({‘long’, ‘wide’}) – The long format pivots the channel columns into a row index and returns just those times, (epochs, parameters), channels that pass the filter. The wide format returns filtered_df with the same shape as diagnostic_df, with those datapoints that pass the filter in their original row, column location and NaNs elsewhere.
- Returns
selected_df
- Return type
pandas.DataFrame
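The four how options can be illustrated on scalar values with a small sketch. It is not fitgrid's implementation: passes_filter is a hypothetical helper, the strict comparisons for 'above' and 'below' are an assumption based on the reference to pandas.DataFrame.gt and pandas.DataFrame.lt, and the interval is treated as closed as stated in the parameter description.

```python
def passes_filter(value, how, bound_0, bound_1=None):
    """Return True if a single diagnostic value passes the filter."""
    if how == "above":
        return value > bound_0  # assumed strict, as in DataFrame.gt
    if how == "below":
        return value < bound_0  # assumed strict, as in DataFrame.lt
    if how in ("inside", "outside"):
        if bound_1 is None:
            raise ValueError("bound_1 is required for 'inside' and 'outside'")
        inside = bound_0 <= value <= bound_1  # closed interval
        return inside if how == "inside" else not inside
    raise ValueError(f"unknown how: {how!r}")


values = [0.1, 0.5, 1.0, 2.0]
above = [v for v in values if passes_filter(v, "above", 0.5)]
below = [v for v in values if passes_filter(v, "below", 0.5)]
inside = [v for v in values if passes_filter(v, "inside", 0.5, 1.0)]
outside = [v for v in values if passes_filter(v, "outside", 0.5, 1.0)]
```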
lmer diagnostics
- fitgrid.utils.lmer.get_lmer_dfbetas(epochs, factor, **kwargs)
Fit lmers leaving out factor levels one by one, compute DFBETAS.
- Parameters
epochs (Epochs) – Epochs object
factor (str) – column name of the factor of interest
**kwargs – keyword arguments to pass on to fitgrid.lmer, like RHS
- Returns
dfbetas – dataframe containing DFBETAS values
- Return type
pandas.DataFrame
Examples
Example calculation showing how to pass in model fitting parameters:
    dfbetas = fitgrid.utils.lmer.get_lmer_dfbetas(
        epochs=epochs,
        factor='subject_id',
        RHS='x + (x|a)'
    )
Notes
DFBETAS is computed according to the following formula [NieGroPel2012]:
\[DFBETAS_{ij} = \frac{\hat{\gamma}_i - \hat{\gamma}_{i(-j)}}{se\left(\hat{\gamma}_{i(-j)}\right)}\]
for parameter \(i\) and level \(j\) of factor.
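The formula can be worked through for a single parameter with a few lines of Python. This is a sketch with made-up numbers, not the fitgrid routine: dfbetas_for_parameter is a hypothetical helper that takes the full-data estimate together with the estimates and standard errors obtained after leaving out each factor level.

```python
def dfbetas_for_parameter(estimate_full, estimates_without, ses_without):
    """DFBETAS_ij for one parameter i, keyed by the left-out level j."""
    return {
        level: (estimate_full - estimates_without[level]) / ses_without[level]
        for level in estimates_without
    }


# Made-up estimates for one parameter and two levels of the factor.
estimate_full = 2.0
estimates_without = {"s01": 1.5, "s02": 2.25}  # refit without each level
ses_without = {"s01": 0.25, "s02": 0.5}        # standard errors of those refits

dfbetas = dfbetas_for_parameter(estimate_full, estimates_without, ses_without)
```

A large absolute value flags a level whose removal shifts the estimate by many standard errors.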