.yhdr and codemap file formats

User input files provide information in addition to the .crw and .log data.

There are three kinds:

  1. YAML format header files supplement the .crw header with optional recording session details, notes, subject-specific data, and apparatus settings including electrode locations. The .yhdr data is merged with the .crw and .log files when new data is added to the .h5 file.

  2. Codemap files are used to tag event codes with experimental variables associated with the event, e.g., experimental conditions, stimulus-specific properties for stimulus events, responses.

  3. YAML format .yhdx header extractor files retrieve header information to merge with code-mapped event information when tagging event codes, e.g., subject, experiment, or apparatus-specific variables.

YAML header files: .yhdr

The YAML header is an open-ended mechanism for storing extra nuggets of information with the EEG data that are useful for record keeping or subsequent data analysis. For instance:

  • recording session information

  • subject variables, e.g., DOB, meds, neuropsych scores

  • instrument settings, e.g., bioamp gain and filter, electrode locations

YAML header specification

  1. Must conform to YAML syntax.

  2. Must contain at least one YAML document, may contain more.

  3. Each YAML document must contain the key name with a string label as its value, and may contain more.

Note

For portability between Python, MATLAB, and R, all types of missing data should be coded with the JSON value “null” (string). However, for use in Python/Pandas only, “null” for strings and .NaN for numeric data is OK and may be more efficient.

Further specifications can be adopted as needed for special purposes, e.g., importing mkh5 data into other applications.
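The specification above can be checked mechanically. This is a minimal sketch using PyYAML's safe_load_all on an inline header; the validation itself is illustrative, not part of the mkpy API.

```python
# Minimal sketch: load a multi-document .yhdr and check the spec above.
import yaml

yhdr_text = """\
---
name: runsheet
dob: 11/17/92
---
name: apparatus
"""

docs = list(yaml.safe_load_all(yhdr_text))

# spec: at least one document, each with a string label under the key 'name'
assert len(docs) >= 1
for doc in docs:
    assert isinstance(doc.get("name"), str)
```

Note that dob: 11/17/92 loads as a plain string, since slash-delimited dates are not a YAML scalar type.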

Silly Example:

## I am a minimal, legal YAML header file
---
name: i_am_pointless

Slightly Less Silly Example:

## I have some genuinely useful information
---
name: runsheet
dob: 11/17/92
adrc_id: M001A1
mood_vas:
  pre: 4
  post: 3

Sample .yhdr (0.2.7)

---
name: runsheet

experiment: sample
subid: s000

license: Creative Commons Attribution-NonCommercial-ShareAlike (CC BY-NC-SA) 4.0

notes: This work is licensed under a
  Creative Commons Attribution-NonCommercial-ShareAlike 4.0
  International
  License. https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode


---
# Recording apparatus information used to populate MNE python and
# EEGLAB data structures. For reproducibility this yaml doc should be
# included in the mkpy.mkh5 .yhdr so it is baked into the mkh5 header
# and travels with the data.

name: apparatus

# set for MNE python data import ... 26 "eeg", eog, misc are extra
mne_montage_name: 26chan


# ALL CHAN settings via YAML anchor-reference syntax
common_ref: &A1 A1
gain20K: &20K 20000
gain10K: &10K 10000
lp: &lp 100.0
hp: &hp 000.01


# Notes:
#   * pos = positive input to differential bioamp
#   * neg = negative input
#   * #n indicates original dig header channel index

# digitized EEG bioamp output data streams
streams:
  # 0
  lle:
    pos: lle
    neg: *A1
    gain: *10K
    hphz: *hp
    lphz: *lp
    mne_type: eeg

  # 1
  lhz:
    pos: lhz
    neg: *A1
    gain: *10K
    hphz: *hp
    lphz: *lp
    mne_type: eeg

  # 2
  MiPf:
    pos: MiPf
    neg: *A1
    gain: *10K
    hphz: *hp
    lphz: *lp
    mne_type: eeg

  # 3
  LLPf:
    pos: LLPf
    neg: *A1
    gain: *10K
    hphz: *hp
    lphz: *lp
    mne_type: eeg

  # 9
  LLFr:
    pos: LLFr
    neg: *A1
    gain: *20K
    hphz: *hp
    lphz: *lp
    mne_type: eeg

  # 23
  LLTe:
    pos: LLTe
    neg: *A1
    gain: *20K
    hphz: *hp
    lphz: *lp
    mne_type: eeg

  # 25
  LLOc:
    pos: LLOc
    neg: *A1
    gain: *20K
    hphz: *hp
    lphz: *lp
    mne_type: eeg

  # 27
  MiOc:
    pos: MiOc
    neg: *A1
    gain: *20K
    hphz: *hp
    lphz: *lp
    mne_type: eeg

  # 7
  LDFr:
    pos: LDFr
    neg: *A1
    gain: *20K
    hphz: *hp
    lphz: *lp
    mne_type: eeg

  # 17
  LDCe:
    pos: LDCe
    neg: *A1
    gain: *20K
    hphz: *hp
    lphz: *lp
    mne_type: eeg

  # 19
  LDPa:
    pos: LDPa
    neg: *A1
    gain: *20K
    hphz: *hp
    lphz: *lp
    mne_type: eeg

  # 5
  LMPf:
    pos: LMPf
    neg: *A1
    gain: *20K
    hphz: *hp
    lphz: *lp
    mne_type: eeg

  # 11
  LMFr:
    pos: LMFr
    neg: *A1
    gain: *20K
    hphz: *hp
    lphz: *lp
    mne_type: eeg

  # 13
  LMCe:
    pos: LMCe
    neg: *A1
    gain: *20K
    hphz: *hp
    lphz: *lp
    mne_type: eeg

  # 21
  LMOc:
    pos: LMOc
    neg: *A1
    gain: *20K
    hphz: *hp
    lphz: *lp
    mne_type: eeg

  # 15
  MiCe:
    pos: MiCe
    neg: *A1
    gain: *20K
    hphz: *hp
    lphz: *lp
    mne_type: eeg

  # 16
  MiPa:
    pos: MiPa
    neg: *A1
    gain: *20K
    hphz: *hp
    lphz: *lp
    mne_type: eeg

  # 30
  rle:
    pos: rle
    neg: *A1
    gain: *20K
    hphz: *hp
    lphz: *lp
    mne_type: eeg

  # 31
  rhz:
    pos: rhz
    neg: *A1
    gain: *20K
    hphz: *hp
    lphz: *lp
    mne_type: eeg

  # 28
  A2:
    pos: A2
    neg: *A1
    gain: *20K
    hphz: *hp
    lphz: *lp
    mne_type: eeg

  # 4
  RLPf:
    pos: RLPf
    neg: *A1
    gain: *20K
    hphz: *hp
    lphz: *lp
    mne_type: eeg

  # 10
  RLFr:
    pos: RLFr
    neg: *A1
    gain: *20K
    hphz: *hp
    lphz: *lp
    mne_type: eeg

  # 24
  RLTe:
    pos: RLTe
    neg: *A1
    gain: *20K
    hphz: *hp
    lphz: *lp
    mne_type: eeg

  # 26
  RLOc:
    pos: RLOc
    neg: *A1
    gain: *20K
    hphz: *hp
    lphz: *lp
    mne_type: eeg

  # 8
  RDFr:
    pos: RDFr
    neg: *A1
    gain: *20K
    hphz: *hp
    lphz: *lp
    mne_type: eeg

  # 18
  RDCe:
    pos: RDCe
    neg: *A1
    gain: *20K
    hphz: *hp
    lphz: *lp
    mne_type: eeg

  # 20
  RDPa:
    pos: RDPa
    neg: *A1
    gain: *20K
    hphz: *hp
    lphz: *lp
    mne_type: eeg

  # 6
  RMPf:
    pos: RMPf
    neg: *A1
    gain: *20K
    hphz: *hp
    lphz: *lp
    mne_type: eeg

  # 12
  RMFr:
    pos: RMFr
    neg: *A1
    gain: *20K
    hphz: *hp
    lphz: *lp
    mne_type: eeg

  # 14
  RMCe:
    pos: RMCe
    neg: *A1
    gain: *20K
    hphz: *hp
    lphz: *lp
    mne_type: eeg

  # 22
  RMOc:
    pos: RMOc
    neg: *A1
    gain: *20K
    hphz: *hp
    lphz: *lp
    mne_type: eeg

  # 29
  HEOG:
    pos: lhz
    neg: rhz
    gain: *10K
    hphz: *hp
    lphz: *lp
    mne_type: eog


# 3D RAS based on a measured red cap

space:
  coordinates: cartesian
  distance_unit: cm
  orientation: ras
fiducials:
  lpa:
    x: -6.9
    y: 0.0
    z: 0.0
  nasion:
    x: 0.0
    y: 8.5
    z: 0.0
  rpa:
    x: 6.9
    y: 0.0
    z: 0.0
sensors:
  A1:
    x: -6.2
    y: -3.4
    z: -0.1
  A2:
    x: 6.2
    y: -3.4
    z: -0.1
  GND:
    x: 0.0
    y: 9.6
    z: 8.5
  LDCe:
    x: -7.0
    y: 0.6
    z: 10.2
  LDFr:
    x: -5.7
    y: 4.8
    z: 10.0
  LDPa:
    x: -5.8
    y: -3.3
    z: 10.1
  LLFr:
    x: -7.7
    y: 2.5
    z: 5.1
  LLOc:
    x: -5.0
    y: -6.6
    z: 6.1
  LLPf:
    x: -5.7
    y: 7.2
    z: 5.0
  LLTe:
    x: -7.8
    y: -2.2
    z: 5.5
  LMCe:
    x: -3.9
    y: -0.3
    z: 13.3
  LMFr:
    x: -2.7
    y: 4.3
    z: 13.2
  LMOc:
    x: -2.3
    y: -5.8
    z: 10.2
  LMPf:
    x: -2.3
    y: 8.2
    z: 9.9
  MiCe:
    x: 0.0
    y: 1.7
    z: 14.7
  MiOc:
    x: 0.0
    y: -8.3
    z: 6.0
  MiPa:
    x: 0.0
    y: -2.7
    z: 13.2
  MiPf:
    x: 0.0
    y: 9.8
    z: 5.6
  RDCe:
    x: 7.0
    y: 0.6
    z: 10.2
  RDFr:
    x: 5.7
    y: 4.8
    z: 10.0
  RDPa:
    x: 5.8
    y: -3.3
    z: 10.1
  RLFr:
    x: 7.7
    y: 2.5
    z: 5.1
  RLOc:
    x: 5.0
    y: -6.6
    z: 6.1
  RLPf:
    x: 5.7
    y: 7.2
    z: 5.0
  RLTe:
    x: 7.8
    y: -2.2
    z: 5.5
  RMCe:
    x: 3.9
    y: -0.3
    z: 13.3
  RMFr:
    x: 2.7
    y: 4.3
    z: 13.2
  RMOc:
    x: 2.3
    y: -5.8
    z: 10.2
  RMPf:
    x: 2.3
    y: 8.2
    z: 9.9
  lhz:
    x: -6.7
    y: 5.5
    z: 1.1
  lle:
    x: -4.5
    y: 7.7
    z: -1.7
  rhz:
    x: 6.7
    y: 5.5
    z: 1.1
  rle:
    x: 4.5
    y: 7.7
    z: -1.7
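Because the coordinates above are plain YAML, they are easy to sanity check numerically; for instance, the lpa to rpa distance should match the measured head width. The arithmetic below is an illustration using values copied from the sample, not an mkpy feature.

```python
# Sanity-check the sample fiducials: lpa -> rpa distance in cm.
import math

lpa = (-6.9, 0.0, 0.0)  # values copied from the sample .yhdr above
rpa = (6.9, 0.0, 0.0)

head_width_cm = math.dist(lpa, rpa)
print(head_width_cm)  # 13.8
```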

YAML header data extractors: .yhdx

A .yhdx YAML header extraction file is used to extract information from the stored header so it can be included in an event table.

The format of the .yhdx is identical to the YAML header except that in the extractor file the terminal values are replaced with variable names that will be the column names for the extracted values.

Wherever the extractor key: value path exactly matches the structure of the header document, the data will be extracted.

The column name may but need not be the same as the key.

For example, suppose a .yhdr YAML header file is used to inject additional information into the mkh5 data file that looks like this:

---
name: runsheet
dob: 11/17/92
adrc_id: M001A1
mood_vas:
  pre: 4
  post: 3

For this header, a .yhdx header extractor like so

---
name: runsheet

adrc_id: adrc_id
mood_vas:
  pre: mood_pre
  post: mood_post

pulls out this data in (wide) tabular format.

adrc_id    mood_pre    mood_post
M001A1     4           3
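One way to picture the matching rule: walk the extractor and the header in parallel, emitting a column wherever the key paths coincide. The recursive sketch below illustrates the idea and is not the mkpy implementation; in particular, it simply skips the document-selecting name key.

```python
def extract(header, extractor):
    """Return {column_name: value} wherever extractor paths match header."""
    row = {}
    for key, col in extractor.items():
        if key == "name" or key not in header:
            continue  # 'name' selects the document; unmatched keys are skipped
        if isinstance(col, dict) and isinstance(header[key], dict):
            row.update(extract(header[key], col))  # descend in parallel
        elif not isinstance(col, dict):
            row[col] = header[key]  # col names the output column
    return row

header = {
    "name": "runsheet",
    "dob": "11/17/92",
    "adrc_id": "M001A1",
    "mood_vas": {"pre": 4, "post": 3},
}
extractor = {
    "name": "runsheet",
    "adrc_id": "adrc_id",
    "mood_vas": {"pre": "mood_pre", "post": "mood_post"},
}

print(extract(header, extractor))
# {'adrc_id': 'M001A1', 'mood_pre': 4, 'mood_post': 3}
```

Note how dob drops out because the extractor never mentions it, and how mood_vas/pre is renamed to the column mood_pre.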

Event codemap files: .xlsx, .ytbl, .txt

These user defined helper files contain information about how to tag certain (sequences of) integer event codes in the data stream with experimental design information.

Codemap specification

  • Regardless of the file format, the codemap is always tabular: rows x columns.

  • Two special columns control what (sequences of) codes are matched.

    1. regexp (mandatory)

      Specifies the pattern of the code sequence to match, at least one code, possibly flanked by others.

    2. ccode (optional)

      [New in v0.2.4] If present in the code map, the ccode column restricts matches to data where the log_ccode in the HDF5 data block equals ccode in the code map. This emulates the familiar behavior of Kutas Lab ERPSS cdbl code sequence pattern matching where, e.g., event code 1 with ccode==0 is a calibration pulse and event code 1 with ccode==1 is an experimental stimulus.

  • The rest of the columns in the codemap are user-defined tags that attach to the codes that match the pattern. These may be general information, experimental variable factor levels, or numeric covariates. There may be a few or many, though many tags multiply the storage requirements in RAM or on disk when tagging time series of continuous data instead of the (typically) relatively small numbers of event codes.
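The ccode restriction can be pictured with pandas: after regexp matching, a match only survives if the data block's log_ccode equals the code map's ccode. The frame below is a toy example; the column names follow the text, but the filtering step is illustrative, not the mkpy implementation.

```python
# Hypothetical matches of a code map row with regexp='(#1)', ccode=1.
import pandas as pd

matches = pd.DataFrame({
    "log_evcodes": [1, 1],
    "log_ccodes":  [0, 1],  # 0 = calibration pulse, 1 = stimulus
    "ccode":       [1, 1],  # from the matching code map row
})

# keep only matches where the data block ccode equals the map ccode
tagged = matches[matches["log_ccodes"] == matches["ccode"]]
print(len(tagged))  # 1
```

Only the experimental stimulus survives; the calibration pulse with the same event code is excluded.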

Warning

Code sequence patterns are matched within each mkh5 data block and cannot span data block boundaries by design.

How it works: event code sequence pattern matching

Event codes are numbers and regular expressions match strings, so behind the scenes the sequence of event codes in a data block is mapped to a space-separated string representation.

Example:

log code sequence: [1, 1, 1, 11, 1024, 1]

stringified: ' 1 1 1 11 1024 1'
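A minimal sketch of that mapping; the leading space follows the example above, but the exact delimiting is an internal detail of the matcher.

```python
# Map a sequence of integer event codes to the matcher's string form.
codes = [1, 1, 1, 11, 1024, 1]
stringified = "".join(f" {c}" for c in codes)
print(repr(stringified))  # ' 1 1 1 11 1024 1'
```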

Pattern matching definitions:

numerals

0 1 2 3 4 5 6 7 8 9

event code

one or more numerals with or without a preceding minus sign.

Examples: '1', '27', '12172', '-13864'

code sequence

one or more event codes separated by a single whitespace

Example: '1 27 12172 -13864'

code pattern

a regular expression that matches a code or code sequence; wild cards and quantified pattern matches are allowed

Examples: '1', '(12)', '12\d+', '(1[23]\d+) 1 (1024)'

anchor pattern

a code pattern of the form (#...) where the ... is a non-anchor code_pattern.

Examples: '(#1)', (#12\d+), (#1[23]\d+), (#1024)

search pattern

a regular expression with exactly one anchor pattern flanked by zero or more code patterns on either side, separated by a single whitespace

code_pattern* anchor_pattern code_pattern*
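The definitions above can be sketched with Python's re module. Rewriting the (#...) anchor marker as a named capture group is an illustration of the semantics, not the mkpy implementation.

```python
# Apply a search pattern with one (#...) anchor to a stringified sequence.
import re

codes = " 1 1 1 11 1024 1"        # stringified code sequence
search_pattern = r"(11) (#1024)"  # one flanking capture, one anchor

# translate the anchor marker into an ordinary named capture group
py_pattern = search_pattern.replace("(#", "(?P<anchor>")
m = re.search(py_pattern, codes)

print(m.group("anchor"))  # 1024
print(m.group(1))         # 11
```

The anchor code is always captured; the flanking (11) is captured here because it was written as a capture group.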

Code map specification

In addition to the specification above:

  • The regexp column must be a regular expression string pattern

  • The remaining columns can be numeric or string values, with or without missing data, provided the values in a column all have the same data type:

    floating point: 17.222, 0.287, 10e27, 10e-3, etc.

    integer: -1, 0, 1, 27, 10001729

    unsigned integer: 0, 1, 312

    boolean: True, False

    string-like: 'hi', 'hi/short', 'abra/ca/dabra'

    This is not enforced, violate at your own risk.

  • Missing data, None, and NaN values are supported in event tables and epoch tables for all data types except boolean.

    Warning

    All numeric data columns containing NaNs, None, or missing data are converted to floating point in the epochs tables stored in the mkh5 data file. There are alternatives, but they are worse.

    NaN, None conversions are as follows:

    Series type     from            to hdf5
    float-like      np.NaN, None    np.nan
    int-like        pd.NaN, None    np.nan, int coerced to float
    uint-like       pd.NaN, None    np.nan, int coerced to float
    string-like     pd.NaN, None    b'NaN'
    boolean-like    pd.NaN, None    not allowed
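The int-to-float coercion in the conversion rules above is ordinary pandas behavior, easy to confirm: an integer Series can only hold NaN by becoming floating point.

```python
import numpy as np
import pandas as pd

intact = pd.Series([1, 2, 3])
with_missing = pd.Series([1, 2, None])

print(intact.dtype.kind)          # i  (integer)
print(with_missing.dtype)         # float64
print(np.isnan(with_missing[2]))  # True
```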

File types

A codemap can be any of these file types:

Excel

an .xlsx file and (optional) named worksheet readable by pandas.read_excel()

CodeTagger('myexpt/code_tag_table.xlsx')
CodeTagger('myexpt/code_tag_table.xlsx!for_evoked')
CodeTagger('myexpt/code_tag_table.xlsx!for_mixed_effects')

Tabbed text

a rows x columns tab-delimited text file readable by pandas.read_csv(..., sep="\t").

YAML

a yaml map readable by yaml.load(), mock-tabular format described below.
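For instance, the tabbed-text flavor can be read with pandas as follows. The table contents and column names here are a toy example; only the reader call matters.

```python
# Read a toy tab-delimited codemap with pandas.
import io
import pandas as pd

tabbed = "regexp\tstim\n(#1)\tstandard\n(#2)\ttarget\n"
codemap = pd.read_csv(io.StringIO(tabbed), sep="\t")

print(list(codemap.columns))  # ['regexp', 'stim']
print(len(codemap))           # 2
```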

File formats

Excel and Tabbed-text
  1. the data must be tabular in n rows and m columns (n, m >= 2)

  2. column labels must be in the first row

  3. the columns must include 'regexp', by convention the first column

  4. there must be at least one tag column; there may be more

regexp       col_label_1    <col_label_m>*
pattern_1    code_tag_11    <code_tag_1m>*
pattern_n    code_tag_n1    <code_tag_nm>*

YAML files

The YAML can be any combination of inline (JSON-ic) and YAML indentation that PyYAML yaml.load can handle.

  1. must have one YAML document with two keys: columns and rows.

  2. the first column item must be regexp.

  3. the columns may continue ad lib.

  4. each row must be a YAML sequence with the same number of items as there are columns

Example

---
'columns':
  [regexp, probability, frequency, source]
'rows':
  - ['(#1)', hi,   880, oboe]
  - ['(#2)', hi,   440, oboe]
  - ['(#3)', lo,   880, oboe]
  - ['(#4)', lo,   440, oboe]
  - ['(#5)', hi,   880, tuba]
  - ['(#6)', hi,   440, tuba]
  - ['(#7)', lo,   880, tuba]
  - ['(#8)', lo,   440, tuba]
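A codemap document like the one above maps onto a DataFrame directly. The sketch below uses yaml.safe_load, which is an assumption; the spec only requires a map that PyYAML can load.

```python
import yaml
import pandas as pd

ytbl = """\
'columns':
  [regexp, probability, frequency, source]
'rows':
  - ['(#1)', hi, 880, oboe]
  - ['(#2)', hi, 440, oboe]
"""

doc = yaml.safe_load(ytbl)
codemap = pd.DataFrame(doc["rows"], columns=doc["columns"])

print(codemap.shape)  # (2, 4)
```

Quoting the regexp strings matters: unquoted, '(#1)' would be parsed as a YAML flow sequence followed by a comment.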

Row value data types

regexp: regular expression (flanker* (#anchor) flanker*)

This is the code sequence search pattern. Log codes are separated by a single whitespace for matching. The regexp has exactly one anchor pattern capture group (#...) optionally flanked by zero or more code patterns.

Flanking code patterns may include capture groups (…) and non-capture groups (?:…).

All captured event code patterns and their sample indexes (and other book-keeping info) are extracted and loaded into the returned code_tag_table. The anchor code pattern is always captured.

Additional columns: scalar (string, float, int, bool)

this is not checked, violate at your own risk

That’s it. All the real work is done by 1) specifying regular expressions that match useful patterns and sequences of codes and 2) specifying the optional column values that label the matched codes in useful ways, e.g., by specifying factors and levels of experimental design, or numeric values for regression modeling or …

Notes

  • Missing data are allowed as values but discouraged because 1) they are handled differently by the pandas csv reader vs. the yaml and Excel readers, 2) the resulting NaNs and Nones coerce np.int and np.str dtype columns into np.object dtype and incur a performance penalty, 3) np.object dtypes are not readily serialized to hdf5 … h5py gags and pytables pickles them, and 4) they may lead to other unknown pathologies.

  • For yaml files, if missing values are unavoidable, coding them with the yaml value .NAN is recommended for all cases … yes, even string data. The yaml value null maps to None and behaves differently in python/numpy/pandas. This is not enforced, violate at your own risk

  • Floating point precision. Reading code tag maps from yaml text files and directly from Excel .xlsx files introduces the same rounding errors for floating point numbers, e.g., 0.357 -> 0.35699999999999998. Reading text files introduces a different rounding error, e.g., 0.35700000000000004.

  • There is no provision for <t:n-m> time interval constraints on code patterns. Maybe someday.
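The null vs. .NAN distinction noted above is easy to confirm with PyYAML: null loads as Python None while .NAN loads as a float NaN, and the two behave quite differently downstream.

```python
import math
import yaml

doc = yaml.safe_load("a: null\nb: .NAN\n")

print(doc["a"] is None)      # True
print(math.isnan(doc["b"]))  # True
```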

YAML Notes

The same information can be formatted in various ways, with different tradeoffs.

  • key: value maps are handy when order doesn’t matter.

---
name: runsheet
dob: 11/17/92
adrc_id: M001A1
mood_vas:
  pre: 4
  post: 3
---
MiPf:
  pos:   MiPf
  neg:   A1
  gain:  10000
  hphz:  0.01
  lphz:  100.0

  • sequences are handy when order matters

block_order:
  - A
  - B
  - B
  - A

They are not mutually exclusive:

Example: three ways to encode the following mixed tabular data:

index   pos    neg   gain    hphz   lphz
MiPf    MiPf   A1    20000   0.01   100.0
HEOG    lhz    rhz   10000   0.01   100.0

  • pure nested sequences (YAML “flow” syntax, c.f. JSON).

table: [ [index, pos,   neg,  gain,   hphz, lphz ],
         [MiPf,  MiPf,   A1,  20000,  0.01, 100.0],
         [HEOG,  lhz,   rhz,  10000,  0.01, 100.0] ]

Virtues: The structure of the table is obvious. Data is visually
compact, fairly easy to read, type, and proofread.

Vices: There's no way to tell column headings from data except by
the order convention.  Departures break the processing pipeline or
corrupt the data.
  • key: sequence maps (c.f. headed .csv)

columns:
    [index, pos,   neg,  gain,   hphz, lphz ]
rows:
  - [MiPf,  MiPf,   A1,  20000,  0.01, 100.0]
  - [HEOG,  lhz,   rhz,  10000,  0.01, 100.0]

Virtues: The structure of the table is obvious, data is fairly
compact, easy to read, type, and proofread. Columns headings
are explicitly tagged and segregated from data.

Vices: Data retrieval is by implicit index, *i*-th row of *j*-th column.
  • Nested key: value maps:

    MiPf:
      pos:   MiPf
      neg:   A1
      gain:  20000
      hphz:  0.01
      lphz:  100.0
    
    HEOG:
      pos:   lhz
      neg:   rhz
      gain:  10000
      hphz:  0.01
      lphz:  100.0
    

    Virtues: Each data point is explicitly labeled and can be extracted by a unique slash-path tag: MiPf/gain.

    Vices: The structure of the table is not obvious. Explicit key:value labelling increases storage overhead. Retrieval by tag is slow compared to retrieval by index.
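The retrieval tradeoff can be made concrete: by slash-path tag in the nested map vs. by row/column index in the columns/rows form. The by_tag helper below is hypothetical, not an mkpy function.

```python
import yaml

nested = yaml.safe_load("MiPf: {pos: MiPf, neg: A1, gain: 10000}")
tabular = yaml.safe_load(
    "{columns: [index, pos, neg, gain], rows: [[MiPf, MiPf, A1, 10000]]}"
)

def by_tag(doc, path):
    """Follow a slash-path like 'MiPf/gain' through nested maps."""
    for key in path.split("/"):
        doc = doc[key]
    return doc

gain_by_tag = by_tag(nested, "MiPf/gain")
gain_by_index = tabular["rows"][0][tabular["columns"].index("gain")]

print(gain_by_tag, gain_by_index)  # 10000 10000
```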

Hint

For header data you plan to automatically extract with get_event_table('some_data', 'some_yhdx'), shallow key:value maps are likely the easiest to work with.