Introduction

The mkpy Python package provides transparent open access to human and machine readable continuous and epoched EEG data for reproducible analysis in scientific computing frameworks. mkpy is built around an unopinionated EEG data interchange format (mkh5) that smoothly and flexibly merges multichannel EEG data with the many other other kinds of heterogenous experimental data required for single trial EEG data analyses. The format is implemented in HDF5 which makes it portable across scientific computing platforms within and between labs.

Familiar problems

mkpy addresses problems that EEG researchers working with open-source EEG data analysis toolboxes such as EEGLAB, ERPLAB, Brainstorm, MNE Python likely have encountered or will.

The functional unit for EEG research is the experiment: subjects, stimuli, responses, dependent and independent variables. There are few principled bounds on the electrophysiological investigation of human information processing: what happens before, during, and after individuals are stimulated (if they are) and respond (if they do). The space of possible experimental designs and variables is vast and defined by the diversity of scientific aims and creativity of researchers.

The functional unit for EEG data analysis is the discrete time-series, longer or shorter stretches of digitized EEG data stored in one or more computer files. Analysis of discrete time-series data includes time-domain averaging along with other types of analysis: Fourier, wavelet, linear discriminant, principled component, independent component, multiple regression, multivariate pattern recognition, etc., etc., etc..

The functional unit for statistical analysis at present is the 2-D rows x columns data table e.g., data.frame, tibble, data.table for modeling in R; numpy.ndarray and pandas.DataFrame for modeling with statsmodels in python.

In the context of an EEG experiment, measures derived from signal processing transformations are but one variable among many. The problem, then, is how to knit all the relevant variables together into a data table so the statistical analyses can be conducted to answer the research question. This is not conceptually difficult, it is a form of database merge that combines and aligns values from heterogenous sources and exposes the result in a useful format. At present, however, the process is encumbered by the wide variety of binary EEG data file formats and data structures that have emerged over time as EEG data acquisition systems and open-source signal processing toolboxes have evolved.

Toward a solution

The mkpy framework shifts the focus in EEG data analysis away from EEG signal processing data files and data structures and toward experimental design and analyses as the functional unit.

Rather than trying to decide in advance what may, must, and cannot go in EEG data file+header formats or in-memory EEG data structures, the approach is to let experimenters make those decisions and concentrate instead on making the files (mkh5) easy to work with regardless of what data they contain and what scientific computing environment they find themselves in. This approach is familiar from the concept of human-readable text data interchange formats like Javascript Object Notation (JSON) and YAML Ain’t Markup Language (YAML): simple but flexible (recursive) containers for arbitrarily complex content. The Hierarchical Data Format v. 5 (HDF5) file specification is a (less-simple) recursive container framework for binary data.

Likewise the mkh5 data model is also simple recursive container format for data from EEG experiments: a single format for continuous EEG, epoched single trial EEG, and the cornucopia of non-EEG variables that must be knit together with the EEG data when analyzing and interpreting designed experiments.

The format is familar: table-like blocks of time-series data travel with a dictionary-like header of keys and values. Simplification comes from enforcing a single, flexible format on the structure of the datablock and headers. Human and cross-platform machine readability comes from implementing the datablock as an HDF5 compound data type and the header as a single JSON string.

an interval of time is modeled by an uninterrupted series of fixed-rate digital samples with no temporal pauses, discontinuities, gaps, or boundaries. The interval may be short, e.g., 1 sample (a.k.a. “event”) or longer, e.g., 1000 samples (a.k.a “epoch”), or much longer (100000 samples, a.k.a. “continuous EEG”). These differ only in length not kind.
for a given interval of n samples there two kinds of experimental variables: those that change during the interval (the data streams) and those that do not (constants).

Data streams

Values that change during the interval are stored in a table-like block: rows (samples) x columns (data streams). Each data stream column is named, and each stream is of a single scalar data type: floating point, integer, unsigned integer, string, boolean. The number and data type of columns may vary ad lib.

Constants

Values that do not change during the interval are stored as a recursive key:value map terminating in data-typed scalars. The header travels with the data block and is encoded as a single utf-8 JSON string. Maps may be added to the header ad lib.

That’s it.

There is no “EEG data file format” per se. Instead there are a few simple rules (syntax) governing how the data in the data block and header are organized. The data streams that appear in the data block columns and the information embedded in the header are whatever the researcher finds useful for a particular purpose. And as these purposes change, the information that travels with the EEG data can change as well.

mkh5 files are constructed by translating a native EEG data recording file into a sequence of one or more data blocks. The data blocks can be arranged in the HDF5 file or files to suit the requirements of the experiment. The data blocks from accidentally split sessions with multiple data files can be glued back together into a single file if needed. Long sessions can be split into different mkh5 files. The data blocks can be organized into separate files for each subject and analyzed separately or combined into a single large file and analyzed together. Within a single file, multi-session designs can be organized by nesting subjects in sessions or sessions in subjects or not nested at all.

The tabular data block and the headers are both “stretchy” by design. Columns and header entries can be added or deleted provided only that the that the time-delta is constant between rows and the header tracks the name, index, and data-type of the data block columns. The JSON header can hold anything that JSON can encode which covers most of what experimenters generally need to track by way of experimental variables.

Once an mkh5 file is constructed, continuous or single-trial data analysis of single subjects or entire experiments is greatly simplified since there is no principled distinction between parts of data files and whole data files or subjects and experiments. All of these are the same data structure: a sequence of one or more well-formed data blocks + header.

After the initial conversion to mkh5 no further import/export parsers are required and cross-platform portability is immediate. By using built-in functions or 3rd party hdf5 library wrappers, the mkh5 HDF5 files can be read and written with a few lines of code in Python, MATLAB, and R on linux, Mac OS, and Windows.

The mkh5 data block maps directly to data frames/tables in R, Pandas, and MATLAB. The JSON header maps directly to native structures, e.g., R named lists, Python dicts, and MATLAB structs. This makes merging continuous and single-trial EEG data with arbitrary non-EEG variables from other sources entirely straightforward by taking advantage of existing table transformation functions: row and column slicing operations by name or index to access parts of a single table; table row and column stacking operations to construct new tables; function application by row, by column, and by group using column variables as the grouping factor.

Furthermore, any data the experimenter has embedded in the header or imported from an external source can be readily merged with the data block column EEG time-series and this can be whatever heterogenous information the experimenter deems useful, from recording session parameters and free-form experimenter notes to electrode locations, pre- and post-test scores, biomarkers, demographics, artifact screening criteria, etc., etc.. The JSON header format allows ready access to whatever information is traveling with the EEG data in the data block but does not require any particular header fields or content beyond the column index. Likewise the external data import also allows easy access to heterogenous information without requiring any of it. This flexibility allows the experimenter to smoothly marry the sampled EEG data to whatever sorts of experimental variables are useful for whatever sort of analysis is needed to answer research question.

In sum, the mkh5 approach is to define a simple, consistent structure that is extensible in simple, consistent ways. The flexible data blocks and headers can stretch as needed to accommodate very different experimental designs and analyses while the consistent format streamlines the development of what must inevitably be semi-custom analysis pipelines.

`mkpy` overview

Although originally designed as a data interchange format for Kutas lab binary ERPSS .crw and .log files, the file format is unopinionated and EEG data from any system could be stored as mkh5.

In addition to database-like operations to create, update, and retrieve mkh5 format HDF5 files, the mkpy package provides utilities for visualizing and screening continuous and epoched EEG data (pygarv), for tagging EEG data with experimental variables of interest from the header and external data sources, and for exporting single trial epochs is various data interchange formats for convenient analysis.

mkh5: EEG data (floating point), timestamps (unsigned int), event codes (ints), trigger lines, position sensor data, etc.., values change from sample to sample. These and other such data streams go in the tabular data block columns. Subject information, apparatus settings, etc. don’t change from sample to sample. That goes in the header.
pygarv/mkh5viewer: EEG artifact screening tests are stored on disk in human and machine-readable YAML. When the tests are applied, pass-fail results are logged in a datablock column so it travels with the data from then on. The tests and parameters themselves are stored in the datablock header. Tests can be viewed and edited interactively.
codetagger: Arbitrary experimental design factors and levels are pulled from the header and/or imported from external YAML, Excel .xlsx, or tab-separated text and anchored to pattern-matched time-stamped integer event codes in the continuous EEG data before extracting single trial epochs and data reduction.
export_epochs: The tagged single-trial EEG data epochs can be exported in tabular format as HDF5, feather, or tab-separated text for downstream analysis.

Introduction

Familiar problems

Toward a solution

mkpy overview

`mkpy` overview