`spudtr` epochs dataframe format

spudtr epochs are pandas.DataFrame objects.

There are three key elements:

epoch_id: an index-like integer column, where each value designates a unique epoch
time: an index-like column of integer timestamps, the same in each epoch
data: the rest of the columns

There must be at least one epoch.

There must be at least one timepoint.

All the epochs must be timestamped exactly the same way.

NOTE: timestamps are integers and may be negative, zero, or positive. The units are unspecified: they could be milliseconds, months, nanoseconds, hours, or anything else.
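The rules above can be sketched with plain pandas. This is a minimal illustration, not spudtr's own check: the helper name `looks_like_epochs` and the toy column names and values are assumptions made for this example, and it does not test that the timestamps are integers.

```python
import pandas as pd


def looks_like_epochs(df, epoch_id="epoch_id", time="time"):
    """Hypothetical helper (not part of spudtr): True if df follows the rules above."""
    # both index-like columns must be present
    if epoch_id not in df.columns or time not in df.columns:
        return False
    # at least one epoch and at least one timepoint means at least one row
    if len(df) == 0:
        return False
    # collect each epoch's timestamps in row order
    stamps = [tuple(grp[time]) for _, grp in df.groupby(epoch_id)]
    first = stamps[0]
    # no timestamp repeats within an epoch, and every epoch is stamped the same way
    return len(set(first)) == len(first) and all(s == first for s in stamps)


# two epochs (ids 0 and 7), three timestamps each, one data column
good_df = pd.DataFrame(
    {
        "epoch_id": [0, 0, 0, 7, 7, 7],
        "time": [-1, 0, 1, -1, 0, 1],
        "channel0": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    }
)

# same layout, but the second epoch is stamped differently
bad_df = good_df.assign(time=[-1, 0, 1, 0, 1, 2])

print(looks_like_epochs(good_df))  # True
print(looks_like_epochs(bad_df))  # False
```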

[1]:

import pandas as pd
from matplotlib import pyplot as plt

from spudtr import get_demo_df, P3_1500_FEATHER
from spudtr import epf
import spudtr.fake_epochs_data as fake_data

Example: simulated categorical and continuous data

The epoch_id column is “epoch_id”. There are four epochs, two per category: 0, 1, 2, 3.

The time column is “days”. There are 32 days in each epoch: 0, 1, 2, …, 31.

The rest of the columns are the data recorded in each epoch at each time stamp.

[2]:

n_epochs_per_category = 2
sim_epochs_df, channels = fake_data._generate(
    n_epochs=n_epochs_per_category,
    n_samples=32,
    n_categories=2,
    n_channels=4,
    time="days",
    epoch_id="epoch_id",
    seed=10,
)
display(sim_epochs_df)

	epoch_id	days	categorical	continuous	channel0	channel1	channel2	channel3
0	0	0	cat0	0.771321	-13.170787	-30.197057	19.609869	43.177612
1	0	1	cat0	0.020752	4.233125	-7.726009	-65.298259	41.464399
2	0	2	cat0	0.633648	8.191480	21.915223	18.568468	27.639613
3	0	3	cat0	0.748804	-48.557122	-50.952045	14.317029	-17.186617
4	0	4	cat0	0.498507	-17.193401	50.222266	0.782896	38.251473
...	...	...	...	...	...	...	...	...
123	3	27	cat1	0.744603	33.167254	-7.658414	14.630878	14.329468
124	3	28	cat1	0.469785	-60.531560	0.774228	1.689442	0.882024
125	3	29	cat1	0.598256	16.216221	66.028993	16.373534	4.854384
126	3	30	cat1	0.147620	-43.268966	26.531028	-20.493672	-12.327708
127	3	31	cat1	0.184035	-48.265511	-41.604676	-19.770519	27.925069

128 rows × 8 columns

Example: EEG data

The epoch index column is epoch_id. The epoch ids run from 0 to 600, but there are 600 epochs, not 601, because epoch_id 392 was excluded: the relevant event marked a pause in the recording, not a stimulus. The epoch ids must be unique, but they can be gappy and out of order.

The time column is time_ms. There are 375 digital samples in each epoch at 4 ms intervals: -748, -744, …, 744, 748.

The rest of the columns are the data recorded in each epoch at each time stamp.
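The sample count follows from the interval arithmetic. A quick sketch in plain Python, using only the endpoints and the 4 ms step stated above:

```python
# timestamps from -748 ms to 748 ms inclusive, in 4 ms steps
time_ms = list(range(-748, 748 + 1, 4))

n_samples = len(time_ms)
print(n_samples)  # 375
print(time_ms[:2], time_ms[-2:])  # [-748, -744] [744, 748]

# a 4 ms sampling interval corresponds to 1000 / 4 = 250 samples per second
print(1000 / 4)  # 250.0
```

The 250 samples per second agrees with the dblock_srate value of 250.0 in the data.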

[3]:

eeg_epochs_df = get_demo_df(P3_1500_FEATHER)
display(len(eeg_epochs_df["epoch_id"].unique()))
eeg_epochs_df

[3]:

	epoch_id	time_ms	sub_id	eeg_artifact	dblock_path	log_evcodes	log_ccodes	dblock_srate	ccode	instrument	...	RMOc	LLTe	RLTe	LLOc	RLOc	MiOc	A2	HEOG	rle	rhz
0	0	-748	sub000	0	sub000/dblock_0	0	0	250.0	1	eeg	...	-25.093750	-0.753906	1.480469	-13.414062	-18.937500	-17.734375	5.660156	98.875000	-39.500000	38.375000
1	0	-744	sub000	0	sub000/dblock_0	0	0	250.0	1	eeg	...	-24.593750	0.502441	-2.466797	-17.640625	-17.468750	-15.304688	1.968750	104.750000	-38.031250	41.281250
2	0	-740	sub000	0	sub000/dblock_0	0	0	250.0	1	eeg	...	-16.484375	-1.507812	3.947266	-15.648438	-10.085938	-11.171875	8.367188	102.062500	-33.656250	43.718750
3	0	-736	sub000	0	sub000/dblock_0	0	0	250.0	1	eeg	...	-11.804688	-15.070312	9.867188	-14.906250	-7.378906	-8.742188	9.351562	100.562500	-42.906250	37.406250
4	0	-732	sub000	0	sub000/dblock_0	0	0	250.0	1	eeg	...	-6.394531	-4.019531	9.125000	-10.679688	-6.886719	-8.015625	8.125000	98.375000	-43.875000	37.906250
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
224995	600	732	sub000	0	sub000/dblock_4	0	0	250.0	0	cal	...	-4.671875	-3.517578	-4.441406	-4.718750	-4.671875	-3.400391	-4.429688	-4.406250	-3.900391	-4.371094
224996	600	736	sub000	0	sub000/dblock_4	0	0	250.0	0	cal	...	-4.179688	-4.019531	-4.195312	-4.222656	-4.425781	-3.644531	-4.429688	-4.160156	-3.412109	-4.371094
224997	600	740	sub000	0	sub000/dblock_4	0	0	250.0	0	cal	...	-4.425781	-3.767578	-4.441406	-3.974609	-4.425781	-3.400391	-4.429688	-4.160156	-3.900391	-4.859375
224998	600	744	sub000	0	sub000/dblock_4	0	0	250.0	0	cal	...	-4.425781	-4.269531	-4.195312	-4.222656	-4.425781	-3.886719	-4.429688	-4.406250	-3.900391	-4.371094
224999	600	748	sub000	0	sub000/dblock_4	0	0	250.0	0	cal	...	-4.179688	-4.019531	-3.947266	-4.222656	-4.179688	-3.400391	-4.183594	-4.406250	-3.412109	-4.371094

225000 rows × 47 columns

[4]:

eeg_epochs_df.query("epoch_id == 392")

[4]:

	epoch_id	time_ms	sub_id	eeg_artifact	dblock_path	log_evcodes	log_ccodes	dblock_srate	ccode	instrument	...	RMOc	LLTe	RLTe	LLOc	RLOc	MiOc	A2	HEOG	rle	rhz

0 rows × 47 columns
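Gaps like this can be found by comparing the ids that occur with the full range between the smallest and largest id. The sketch below uses a small stand-in dataframe, not the demo data, with ids that are gappy and out of order; the values in it are made up for illustration.

```python
import pandas as pd

# stand-in epochs: ids 3, 0, 4, 1 (id 2 is missing), two timestamps each
toy_df = pd.DataFrame(
    {
        "epoch_id": [3, 3, 0, 0, 4, 4, 1, 1],
        "time_ms": [0, 4, 0, 4, 0, 4, 0, 4],
        "MiOc": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    }
)

# unique ids as plain Python ints, in order of first appearance
ids = toy_df["epoch_id"].unique().tolist()
n_epochs = len(ids)

# ids between the smallest and largest that never occur
missing = sorted(set(range(min(ids), max(ids) + 1)) - set(ids))

print(n_epochs)  # 4
print(missing)  # [2]
print(len(toy_df.query("epoch_id == 2")))  # 0
```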

Always check the epoch x time format

When things go well, the check quietly succeeds.

When they don’t, the check raises an exception and the reason appears on the last line of the traceback.

Example: This check of the simulated data SUCCEEDS.

[5]:

epf.check_epochs(sim_epochs_df, ['channel0', 'channel1'], epoch_id="epoch_id", time="days")

Example: this check FAILS because the data column named “bogus_channel0” doesn’t exist in the data.

[6]:

epf.check_epochs(sim_epochs_df, ['bogus_channel0', 'channel1'], epoch_id="epoch_id", time="days")

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [6], in <cell line: 1>()
----> 1 epf.check_epochs(sim_epochs_df, ['bogus_channel0', 'channel1'], epoch_id="epoch_id", time="days")

File ~/miniconda/envs/env_3.9/lib/python3.9/site-packages/spudtr/epf.py:193, in check_epochs(epochs_df, data_streams, epoch_id, time)
    169 def check_epochs(epochs_df, data_streams, epoch_id=EPOCH_ID, time=TIME):
    170     """check epochs data are in spudtr format
    171
    172     Parameters
   (...)
    190
    191     """
--> 193     _ = _epochs_QC(epochs_df, data_streams, epoch_id=epoch_id, time=time)

File ~/miniconda/envs/env_3.9/lib/python3.9/site-packages/spudtr/epf.py:49, in _epochs_QC(epochs_df, data_streams, epoch_id, time)
     47 missing_channels = set(data_streams) - set(epochs_df.columns)
     48 if missing_channels:
---> 49     raise ValueError(
     50         "data_streams should all be present in the epochs dataframe, "
     51         f"the following are missing: {list(missing_channels)}"
     52     )
     54 # check no duplicate column names in index and regular columns
     55 names = list(epochs_df.index.names) + list(epochs_df.columns)

ValueError: data_streams should all be present in the epochs dataframe, the following are missing: ['bogus_channel0']
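The traceback shows that this test is a set difference between the requested data streams and the dataframe columns. A standalone sketch of the same idea, using a plain list of column names in place of a dataframe; the function name `find_missing` is made up for this example and is not part of spudtr.

```python
def find_missing(data_streams, columns):
    """Hypothetical helper: requested names that are not among the columns."""
    return sorted(set(data_streams) - set(columns))


# column names of the simulated epochs dataframe shown earlier
columns = [
    "epoch_id", "days", "categorical", "continuous",
    "channel0", "channel1", "channel2", "channel3",
]

print(find_missing(["channel0", "channel1"], columns))  # []
print(find_missing(["bogus_channel0", "channel1"], columns))  # ['bogus_channel0']

# raise and catch a ValueError the way a caller of a check might
try:
    missing = find_missing(["bogus_channel0", "channel1"], columns)
    if missing:
        raise ValueError(f"the following are missing: {missing}")
except ValueError as err:
    print(f"check failed: {err}")
```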

Example: this check FAILS because the time column named “hours” doesn’t exist.

[7]:

epf.check_epochs(sim_epochs_df, ['channel0', 'channel1'], epoch_id="epoch_id", time="hours")

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [7], in <cell line: 1>()
----> 1 epf.check_epochs(sim_epochs_df, ['channel0', 'channel1'], epoch_id="epoch_id", time="hours")

File ~/miniconda/envs/env_3.9/lib/python3.9/site-packages/spudtr/epf.py:193, in check_epochs(epochs_df, data_streams, epoch_id, time)
    169 def check_epochs(epochs_df, data_streams, epoch_id=EPOCH_ID, time=TIME):
    170     """check epochs data are in spudtr format
    171
    172     Parameters
   (...)
    190
    191     """
--> 193     _ = _epochs_QC(epochs_df, data_streams, epoch_id=epoch_id, time=time)

File ~/miniconda/envs/env_3.9/lib/python3.9/site-packages/spudtr/epf.py:60, in _epochs_QC(epochs_df, data_streams, epoch_id, time)
     57     raise ValueError("Duplicate column names not allowed.")
     59 # epoch_id and time must be the columns in the epochs_df
---> 60 _validate_epochs_df(epochs_df, epoch_id=epoch_id, time=time)
     62 # check values of epoch_id in every time group are the same, and
     63 # unique in each time group. Make our own copy so we are immune to
     64 # modification to original table
     65 table = epochs_df.copy().reset_index().set_index(epoch_id).sort_index()

File ~/miniconda/envs/env_3.9/lib/python3.9/site-packages/spudtr/epf.py:30, in _validate_epochs_df(epochs_df, epoch_id, time)
     28 for key, val in {"epoch_id": epoch_id, "time": time}.items():
     29     if val not in epochs_df.columns:
---> 30         raise ValueError(f"{key} column not found: {val}")

ValueError: time column not found: hours
