fitgrid.utils.lm module
- fitgrid.utils.lm.filter_diagnostic(diagnostic_df, how, bound_0, bound_1=None, format='long')
Select a subset of a fitgrid statsmodels diagnostic dataframe by value.
Use this to identify time points, epochs, parameters, and channels with outlying or potentially influential data.
- Parameters
diagnostic_df (pandas.DataFrame) – As returned by fitgrid.utils.lm.get_diagnostic().
how ({‘above’, ‘below’, ‘inside’, ‘outside’}) – slice diagnostic_df above or below bound_0 or inside or outside the closed interval (bound_0, bound_1).
bound_0 (scalar or array-like) – bound_0 is the mandatory boundary for every value of how. See the pandas.DataFrame.gt and pandas.DataFrame.lt documentation for binary comparisons with dataframes.
bound_1 (scalar or array-like) – bound_1 is the mandatory upper bound for how=”inside” and how=”outside”.
format ({‘long’, ‘wide’}) – The long format pivots the channel columns into a row index and returns just those times, (epochs, parameters), and channels that pass the filter. The wide format returns filtered_df with the same shape as diagnostic_df: datapoints that pass the filter stay in their original row and column location, and all other cells are NaN.
- Returns
selected_df
- Return type
pandas.DataFrame
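The how and format options can be illustrated with plain pandas. The following is a minimal sketch of the documented behavior, not fitgrid's implementation: the helper sketch_filter, the toy dataframe, and its index and column names are hypothetical. It also assumes strict gt/lt comparisons at the bounds, following the pandas methods referenced above; whether the real function includes the endpoints should be checked against the source.

```python
import pandas as pd

# Hypothetical stand-in for a diagnostic dataframe: rows indexed by
# (Time, Epoch_idx), one column per channel. Names and values are made up.
index = pd.MultiIndex.from_product([[0, 4], [1, 2]], names=["Time", "Epoch_idx"])
diagnostic_df = pd.DataFrame(
    {"channel0": [0.1, 0.9, 0.3, 0.2], "channel1": [0.7, 0.2, 0.1, 0.8]},
    index=index,
)
diagnostic_df.columns.name = "channel"


def sketch_filter(df, how, bound_0, bound_1=None, format="long"):
    """Illustrative re-creation of the documented options (not fitgrid code)."""
    if how in ("inside", "outside") and bound_1 is None:
        raise ValueError(f"how='{how}' requires bound_1")

    # Assumption: strict comparisons, so values equal to a bound are excluded.
    if how == "above":
        mask = df.gt(bound_0)
    elif how == "below":
        mask = df.lt(bound_0)
    elif how == "inside":
        mask = df.gt(bound_0) & df.lt(bound_1)
    elif how == "outside":
        mask = df.lt(bound_0) | df.gt(bound_1)
    else:
        raise ValueError(f"unknown how: {how!r}")

    # wide: same shape as the input, NaN wherever the filter fails
    wide = df.where(mask)
    if format == "wide":
        return wide
    if format == "long":
        # long: pivot channel columns into the row index, keep only the hits
        return wide.stack().dropna()
    raise ValueError(f"unknown format: {format!r}")


long_hits = sketch_filter(diagnostic_df, "above", 0.5)
wide_hits = sketch_filter(diagnostic_df, "above", 0.5, format="wide")
print(long_hits)
print(wide_hits)
```

The long result is convenient for listing the offending time, epoch, and channel combinations, while the wide result keeps the original layout for masking or plotting.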
- fitgrid.utils.lm.get_diagnostic(lm_grid, diagnostic, do_nobs_loop=False)
Fetch statsmodels diagnostic as a Time x Channel dataframe
statsmodels implements a variety of data and model diagnostic measures. For some, it also computes a version of a recommended critical value or \(p\)-value. Use these at your own risk after careful study of the statsmodels source code. For details visit statsmodels.stats.outliers_influence.OLSInfluence.html
For a catalog of the measures available for fitgrid.lm(), run this in Python:
>>> fitgrid.utils.lm.list_diagnostics()
Warning
Data diagnostics can be very large and very slow to compute; see Notes for details.
By default all values of the diagnostics are computed; the resulting dataframe can be pruned with the fitgrid.utils.lm.filter_diagnostic() function.
By default slow diagnostics are not computed; this can be forced by setting do_nobs_loop=True.
- Parameters
lm_grid (fitgrid.LMFitGrid) – As returned by fitgrid.lm().
diagnostic (string) – As implemented in statsmodels, e.g., “cooks_distance”, “dffits_internal”, “est_std”, “dfbetas”.
do_nobs_loop (bool) – True forces slow leave-one-observation-out model refitting.
- Returns
diagnostic_df (pandas.DataFrame) – Channels are in columns. Model measures are row indexed by time; data measures add an epoch row index; parameter measures add a parameter row index.
sm_1_df (pandas.DataFrame) – The supplementary values statsmodels returns, or None; same shape as diagnostic_df.
Notes
Size: diagnostic_df values for data measures like cooks_distance and hat_matrix_diagonal are the size of the original data plus a row index. For some data measures, like dfbetas, they are the size of the data multiplied by the number of regressors in the model.
Speed: Leave-one-observation-out (LOOO) model refitting takes as long as it takes to fit one model multiplied by the number of observations. This can be intractable for large datasets. Diagnostic measures calculated from the original fit like cooks_distance and dffits_internal are tractable even for large data sets.
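The size and speed notes above amount to simple multiplication, which can be worked through before requesting a diagnostic. This is a back-of-envelope sketch only: the function estimate_diagnostic_cost and all input numbers are hypothetical, and it assumes one model per time point and channel, with the epochs as the observations in each model.

```python
def estimate_diagnostic_cost(n_times, n_epochs, n_channels, n_regressors, ms_per_fit):
    """Rough value counts and LOOO refit time; every input is hypothetical."""
    # Assumption: the original data holds one value per time, epoch, and channel.
    n_data_values = n_times * n_epochs * n_channels

    # Assumption: one model per (time, channel) cell, fit across epochs.
    n_models = n_times * n_channels
    looo_ms_per_model = ms_per_fit * n_epochs  # one refit per observation

    return {
        # data measures such as cooks_distance: same size as the data
        "data_measure_values": n_data_values,
        # measures such as dfbetas: data size times number of regressors
        "per_regressor_values": n_data_values * n_regressors,
        "looo_ms_per_model": looo_ms_per_model,
        "looo_ms_whole_grid": looo_ms_per_model * n_models,
    }


cost = estimate_diagnostic_cost(
    n_times=250, n_epochs=400, n_channels=32, n_regressors=3, ms_per_fit=2
)
for name, value in cost.items():
    print(f"{name}: {value:,}")
```

With these made-up inputs a dfbetas-style measure holds three times as many values as the data, and a serial leave-one-observation-out pass over the whole grid would take on the order of hours even at 2 ms per fit.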
Examples
import fitgrid

# fake data
epochs_fg = fitgrid.generate()
lm_grid = fitgrid.lm(
    epochs_fg,
    LHS=epochs_fg.channels,
    RHS='continuous + categorical',
    parallel=True,
    n_cores=4,
)

# data diagnostic, one dataframe with the values
ess_press, _ = fitgrid.utils.lm.get_diagnostic(
    lm_grid, 'ess_press'
)

# Cook's D dataframe AND the p-values statsmodels computes
cooks_Ds, sm_pvals = fitgrid.utils.lm.get_diagnostic(
    lm_grid, 'cooks_distance'
)

# this fails because it requires LOOO loop
dfbetas_df, _ = fitgrid.utils.lm.get_diagnostic(
    lm_grid, 'dfbetas'
)

# this succeeds by forcing LOOO loop calculation
dfbetas_df, _ = fitgrid.utils.lm.get_diagnostic(
    lm_grid, 'dfbetas', do_nobs_loop=True
)