mkpy.codetagger module
- class mkpy.codetagger.CodeTagger(cmf)
Bases: object
Tag pattern-matched sequences of time-indexed integers with key:value metadata.
In the intended use case, the integer values are event markers recorded on an event or timing track of a digital data acquisition system.
Each integer maps to a specific event or event-type, e.g., a stimulus onset/offset, a response, a timing mark.
A sequence of integers corresponds to a sequence of events or event types.
The metadata (tabular rows x columns) add information beyond the numeric value of the integer and may include mixed data types: strings, integers, floats. They are suitable for decorating events and epochs of data with useful information such as experimental design factor levels (Easy, Hard) or continuous covariates (age of acquisition).
The original use case was to find and decorate ERPSS log event codes with experimental information.
The mechanism here is general, abstracting away from the source of the integer 1-D arrays and the intended purpose of the metadata.
Note
Starting with mkpy v0.2.1 codemaps allow but do not require an Index.
Notes
The UI for specifying a code tag map can be any of these file types:
- Excel
an .xlsx file and (optional) named worksheet readable by pandas.read_excel()
CodeTagger('myexpt/code_tag_table.xlsx')
CodeTagger('myexpt/code_tag_table.xlsx!for_evoked')
CodeTagger('myexpt/code_tag_table.xlsx!for_mixed_effects')
- Tabbed text
a rows x columns tab-delimited text file readable by pandas.read_csv(…, sep="\t").
- YAML
a YAML map readable by yaml.load(), in the mock-tabular format described below.
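The `workbook.xlsx!worksheet` form in the Excel examples packs a file path and an optional worksheet name into one string. A minimal sketch of how such a string could be split is shown here. `split_workbook_spec` is a hypothetical helper written for illustration, not a function in mkpy, and it assumes the worksheet name is whatever follows the first `!`.

```python
def split_workbook_spec(spec):
    """Split 'path.xlsx!sheet' into (path, sheet); sheet is None if absent."""
    path, sep, sheet = spec.partition("!")
    # No '!' (or nothing after it) means no worksheet was named
    return path, (sheet if sep and sheet else None)


print(split_workbook_spec("myexpt/code_tag_table.xlsx!for_evoked"))
print(split_workbook_spec("myexpt/code_tag_table.xlsx"))
```

The path and sheet name could then be handed to a spreadsheet reader; how mkpy itself resolves the string may differ.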
File formats
- Excel and Tabbed-text
the data must be tabular in n rows and m columns (n, m >= 2)
column labels must be in the first row
the column labels must include ‘regexp’
there must be at least one tag column, there may be more
regexp       col_label_1    <col_label_m>*
pattern_1    code_tag_11    <code_tag_1m>*
…            …              …
pattern_n    code_tag_n1    <datum_nm>*
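The layout rules for tabular files can be checked mechanically. The following sketch reads tab-delimited text with the standard library `csv` module and enforces the rules listed above. `check_code_tag_table` is a hypothetical helper for illustration, not part of mkpy, and unlike a pandas reader it leaves every cell as a string.

```python
import csv
import io


def check_code_tag_table(text):
    """Parse tab-delimited text and check the layout rules for a code tag table."""
    table = [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]
    if len(table) < 2:
        raise ValueError("need a header row and at least one pattern row")
    header, body = table[0], table[1:]
    if "regexp" not in header:
        raise ValueError("column labels must include 'regexp'")
    if len(header) < 2:
        raise ValueError("need at least one tag column besides 'regexp'")
    for i, row in enumerate(body, start=1):
        if len(row) != len(header):
            raise ValueError(f"row {i} has {len(row)} cells, expected {len(header)}")
    # One dict per pattern row, keyed by column label
    return [dict(zip(header, row)) for row in body]


example = "regexp\tbin\tsource\n(#1)\t1\toboe\n(#2)\t2\ttuba\n"
for record in check_code_tag_table(example):
    print(record)
```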
- YAML files
The YAML can be any combination of inline (JSON-ic) and YAML indentation that PyYAML yaml.load can handle.
must have one YAML document with two keys: columns and rows
the columns must start with regexp
the columns may continue ad lib.
each row must be a YAML sequence with the same number of items as there are columns
Example
---
'columns': [regexp, bin, probability, frequency, source]
'rows':
  - ['(#1)', 1, hi, 880, oboe]
  - ['(#2)', 2, hi, 440, oboe]
  - ['(#3)', 3, lo, 880, oboe]
  - ['(#4)', 4, lo, 440, oboe]
  - ['(#5)', 5, hi, 880, tuba]
  - ['(#6)', 6, hi, 440, tuba]
  - ['(#7)', 7, lo, 880, tuba]
  - ['(#8)', 8, lo, 440, tuba]
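Once parsed, a document like this is a mapping with a `columns` list and a `rows` list. The sketch below checks the rules stated above and zips each row with the column labels. To stay dependency-free it starts from a Python dict standing in for the parsed YAML (an assumption for illustration; in practice the dict would come from a YAML loader). `rows_to_records` is a hypothetical helper, not part of mkpy.

```python
def rows_to_records(code_map):
    """Check a parsed columns/rows mapping and return one dict per row."""
    if set(code_map) != {"columns", "rows"}:
        raise ValueError("expected exactly two keys: columns and rows")
    columns, rows = code_map["columns"], code_map["rows"]
    if not columns or columns[0] != "regexp":
        raise ValueError("the columns must start with regexp")
    for i, row in enumerate(rows):
        if len(row) != len(columns):
            raise ValueError(f"row {i} has {len(row)} items, expected {len(columns)}")
    return [dict(zip(columns, row)) for row in rows]


# Stand-in for the parsed YAML example (first and last rows only)
parsed = {
    "columns": ["regexp", "bin", "probability", "frequency", "source"],
    "rows": [
        ["(#1)", 1, "hi", 880, "oboe"],
        ["(#8)", 8, "lo", 440, "tuba"],
    ],
}
records = rows_to_records(parsed)
print(records[0]["regexp"], records[0]["source"])
```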
Row value data types
- regexp: regular expression of the form pattern* (#pattern) pattern*
This is the code sequence search pattern. Log codes are separated by a single whitespace for matching. The regexp has one or more time-locking pattern capture groups, of which one begins with the time-marking anchor symbol, #.
Flanking code patterns may be capture groups (…) and non-capture groups (?:…).
All matched time-locking codes, the anchor code, and distance between them are extracted from the mkpy.mkh5 datablocks and merged with the code tags in the returned event_table for all capture groups.
- Additional columns: scalar (string, float, int, bool)
That’s it. All the real work is done by 1) specifying regular expressions that match useful (sequences of) integer event codes and 2) specifying the optional column values that label the matched codes in useful ways, e.g., by specifying factors and levels of experimental design, or numeric values for regression modeling or …
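The matching and tagging idea can be illustrated with a simplified sketch. It joins the codes with single spaces, strips the `#` from the anchor group, runs each row's regular expression over the string, and attaches that row's tag columns to every match. This is an illustration under stated assumptions, not mkpy's implementation: `tag_events` and the `condition` column are hypothetical, escaped parentheses in patterns are not handled, and the timing information and mkh5 datablocks are left out.

```python
import re


def tag_events(codes, code_map):
    """Return one dict per pattern match: anchor position, matched codes, tags."""
    code_str = " ".join(str(c) for c in codes)
    # Map the character offset where each code starts to its index in `codes`
    offset_to_index, offset = {}, 0
    for i, code in enumerate(codes):
        offset_to_index[offset] = i
        offset += len(str(code)) + 1

    events = []
    for row in code_map:
        pattern = row["regexp"]
        anchor_at = pattern.index("(#")
        # Anchor group number: capturing '(' that precede it, plus one
        anchor_group = len(re.findall(r"\((?!\?)", pattern[:anchor_at])) + 1
        plain = pattern.replace("(#", "(", 1)
        # Lookarounds align matches to whole codes, so 1 cannot match inside 11
        rx = re.compile(r"(?<!\S)(?:" + plain + r")(?!\S)")
        tags = {k: v for k, v in row.items() if k != "regexp"}
        for m in rx.finditer(code_str):
            idx = offset_to_index.get(m.start(anchor_group))
            if idx is None:
                continue  # anchor did not land on a code boundary
            events.append({"anchor_index": idx, "anchor_code": codes[idx],
                           "match_codes": m.groups(), **tags})
    return sorted(events, key=lambda e: e["anchor_index"])


codes = [1, 1024, 11, 2, 1024, 1, 1024]
code_map = [
    {"regexp": "(#1) (1024)", "condition": "easy"},
    {"regexp": "(#2) (1024)", "condition": "hard"},
]
events = tag_events(codes, code_map)
for event in events:
    print(event)
```

Each returned dict plays the role of one row in an event table: the anchor's position in the code sequence, the captured codes, and the tag values from the matching row.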
Notes
The values in any given column should all be the same data type: string, integer, boolean, float. This is not enforced, violate at your own risk.
Missing data are allowed as values but discouraged because 1) they are handled differently by the pandas csv reader vs. the yaml and Excel readers, 2) the resulting NaNs and None coerce np.int and np.str dtype columns into np.object dtype and incur a performance penalty, 3) np.object dtypes are not readily serialized to hdf5 (h5py gags and pytables pickles them), and 4) they may lead to other unknown pathologies.
For yaml files, if missing values are unavoidable, coding them with the yaml value .NAN is recommended for all cases … yes, even string data. The yaml value null maps to None and behaves differently in python/numpy/pandas. This is not enforced, violate at your own risk.
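Since column type consistency is not enforced, a quick pre-flight check can catch mixed columns before a code map is used. The sketch below works on plain Python values, treats NaN as the tolerated missing-value marker (following the recommendation above), and reports any column that holds more than one type, which includes columns with None. `column_type_report` is a hypothetical helper, not part of mkpy, and it says nothing about how pandas would later coerce the dtypes.

```python
import math


def is_nan(value):
    """True only for a float NaN."""
    return isinstance(value, float) and math.isnan(value)


def column_type_report(columns, rows):
    """Map each column label to the set of type names found in that column."""
    report = {}
    for j, label in enumerate(columns):
        # NaN is treated as the missing-value marker and left out of the type set
        report[label] = {type(row[j]).__name__ for row in rows if not is_nan(row[j])}
    return report


columns = ["regexp", "bin", "source"]
rows = [
    ["(#1)", 1, "oboe"],
    ["(#2)", 2, float("nan")],  # missing value coded as NaN (YAML .NAN)
    ["(#3)", None, "tuba"],     # missing value coded as None (YAML null)
]
report = column_type_report(columns, rows)
mixed = [label for label, kinds in report.items() if len(kinds) > 1]
print(mixed)  # ['bin']
```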
Floating point precision. Reading code tag maps from yaml text files and directly from Excel .xlsx files introduces the same rounding errors for floating point numbers, e.g., 0.357 -> 0.35699999999999998. Reading text files introduces a different rounding error, e.g., 0.35700000000000004.
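One practical consequence is that float tag values read from different file formats should be compared with a tolerance rather than with `==`. A minimal sketch using the two values quoted above follows; the variable names and the tolerance are illustrative choices, not mkpy settings.

```python
import math
from decimal import Decimal

from_yaml_or_excel = 0.35699999999999998  # value quoted for yaml and Excel
from_text = 0.35700000000000004           # value quoted for text files

# Decimal(float) shows the exact binary value stored for each number
print(Decimal(from_yaml_or_excel))
print(Decimal(from_text))

# Compare with a relative tolerance instead of exact equality
same_enough = math.isclose(from_yaml_or_excel, from_text, rel_tol=1e-9)
print(same_enough)
```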
There is no provision for <t:n-m> time interval constraints on code patterns. Maybe someday.