spudtr epochs dataframe format

spudtr epochs are pandas.DataFrame objects.

There are three key elements:

  1. epoch_id: an index-like integer column in which each value designates a unique epoch

  2. time: an index-like column of integer timestamps, the same in each epoch

  3. the remaining columns, which hold the data

There must be at least one epoch.

There must be at least one timepoint.

All the epochs must be timestamped exactly the same way.

NOTE: timestamps are integers and may be positive or negative. The units are unspecified: they could be milliseconds, months, nanoseconds, hours, or anything else.
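The requirements above can be illustrated with plain pandas. The following is a minimal sketch, not spudtr's own validation code: the helper name timestamps_match is hypothetical, and it reads "timestamped exactly the same way" as "every epoch has the same sequence of timestamps", which is an assumption.

```python
import pandas as pd

# Two epochs with three timepoints each; epoch_id and time are ordinary columns.
epochs_df = pd.DataFrame(
    {
        "epoch_id": [0, 0, 0, 1, 1, 1],
        "time": [-1, 0, 1, -1, 0, 1],
        "channel0": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    }
)


def timestamps_match(df, epoch_id="epoch_id", time="time"):
    """Hypothetical helper: True if there is at least one epoch and
    every epoch has the same sequence of timestamps."""
    per_epoch = df.groupby(epoch_id)[time].apply(tuple)
    return bool(len(per_epoch) >= 1 and per_epoch.nunique() == 1)


# All epochs share the timestamps -1, 0, 1.
same_times = timestamps_match(epochs_df)

# Dropping the last row leaves epoch 1 one timestamp short of epoch 0.
ragged_times = timestamps_match(epochs_df.iloc[:-1])
```

For real data, the check_epochs function shown later on this page is the appropriate tool; this sketch only makes the rules concrete.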

[1]:
import pandas as pd
from matplotlib import pyplot as plt

from spudtr import get_demo_df, P3_1500_FEATHER
from spudtr import epf
import spudtr.fake_epochs_data as fake_data

Example: simulated categorical and continuous data

The epoch_id column is named “epoch_id”. There are four epochs: 0, 1, 2, 3.

The time column is named “days”. There are 32 days in each epoch: 0, 1, 2, …, 31.

The rest of the columns are the data recorded in each epoch at each time stamp.

[2]:
n_epochs_per_category = 2
sim_epochs_df, channels = fake_data._generate(
    n_epochs=n_epochs_per_category,
    n_samples=32,
    n_categories=2,
    n_channels=4,
    time="days",
    epoch_id="epoch_id",
    seed=10,
)
display(sim_epochs_df)
     epoch_id  days  categorical  continuous    channel0    channel1    channel2    channel3
  0         0     0         cat0    0.771321  -13.170787  -30.197057   19.609869   43.177612
  1         0     1         cat0    0.020752    4.233125   -7.726009  -65.298259   41.464399
  2         0     2         cat0    0.633648    8.191480   21.915223   18.568468   27.639613
  3         0     3         cat0    0.748804  -48.557122  -50.952045   14.317029  -17.186617
  4         0     4         cat0    0.498507  -17.193401   50.222266    0.782896   38.251473
...       ...   ...          ...         ...         ...         ...         ...         ...
123         3    27         cat1    0.744603   33.167254   -7.658414   14.630878   14.329468
124         3    28         cat1    0.469785  -60.531560    0.774228    1.689442    0.882024
125         3    29         cat1    0.598256   16.216221   66.028993   16.373534    4.854384
126         3    30         cat1    0.147620  -43.268966   26.531028  -20.493672  -12.327708
127         3    31         cat1    0.184035  -48.265511  -41.604676  -19.770519   27.925069

128 rows × 8 columns
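The epoch and timepoint counts for a dataframe like this can be confirmed with a few pandas calls. The sketch below uses a hypothetical stand-in dataframe (stand_in_df, with zeros in place of the simulated values) that has the same epoch_id and days layout as the output above, so it runs without spudtr installed.

```python
import numpy as np
import pandas as pd

# Stand-in with the same layout as the simulated data: 4 epochs x 32 days.
n_epochs, n_samples = 4, 32
stand_in_df = pd.DataFrame(
    {
        "epoch_id": np.repeat(np.arange(n_epochs), n_samples),
        "days": np.tile(np.arange(n_samples), n_epochs),
        "channel0": np.zeros(n_epochs * n_samples),
    }
)

# Number of distinct epochs.
epoch_count = stand_in_df["epoch_id"].nunique()

# Number of timepoints recorded in each epoch.
samples_per_epoch = stand_in_df.groupby("epoch_id")["days"].count()

# First and last timestamp across all epochs.
first_day = int(stand_in_df["days"].min())
last_day = int(stand_in_df["days"].max())
```

The same three expressions can be applied to sim_epochs_df directly.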

Example: EEG data

The epoch index column is epoch_id. The epoch ids run from 0 to 600, but there are 600 epochs, not 601, because epoch_id 392 was excluded: the relevant event marked a pause in the recording, not a stimulus. Epoch ids must be unique, but they may have gaps and need not be in order.

The time column is time_ms. There are 375 digital samples in each epoch at 4 ms intervals: -748, -744, …, 744, 748.

The rest of the columns are the data recorded in each epoch at each time stamp.
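The sample count follows from the sampling rate and the epoch boundaries. As a small arithmetic sketch, assuming the 250 Hz rate that appears in the dblock_srate column of the output below:

```python
import numpy as np

srate_hz = 250.0                   # value of dblock_srate in the demo data
step_ms = int(1000 / srate_hz)     # sampling interval in milliseconds

# Timestamps from -748 ms to 748 ms inclusive, one per sample.
time_ms = np.arange(-748, 748 + step_ms, step_ms)
n_samples = len(time_ms)

# 600 epochs with n_samples rows each gives the dataframe length.
n_rows = 600 * n_samples
```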

[3]:
eeg_epochs_df = get_demo_df(P3_1500_FEATHER)
display(len(eeg_epochs_df["epoch_id"].unique()))
eeg_epochs_df
600
[3]:
epoch_id time_ms sub_id eeg_artifact dblock_path log_evcodes log_ccodes dblock_srate ccode instrument ... RMOc LLTe RLTe LLOc RLOc MiOc A2 HEOG rle rhz
0 0 -748 sub000 0 sub000/dblock_0 0 0 250.0 1 eeg ... -25.093750 -0.753906 1.480469 -13.414062 -18.937500 -17.734375 5.660156 98.875000 -39.500000 38.375000
1 0 -744 sub000 0 sub000/dblock_0 0 0 250.0 1 eeg ... -24.593750 0.502441 -2.466797 -17.640625 -17.468750 -15.304688 1.968750 104.750000 -38.031250 41.281250
2 0 -740 sub000 0 sub000/dblock_0 0 0 250.0 1 eeg ... -16.484375 -1.507812 3.947266 -15.648438 -10.085938 -11.171875 8.367188 102.062500 -33.656250 43.718750
3 0 -736 sub000 0 sub000/dblock_0 0 0 250.0 1 eeg ... -11.804688 -15.070312 9.867188 -14.906250 -7.378906 -8.742188 9.351562 100.562500 -42.906250 37.406250
4 0 -732 sub000 0 sub000/dblock_0 0 0 250.0 1 eeg ... -6.394531 -4.019531 9.125000 -10.679688 -6.886719 -8.015625 8.125000 98.375000 -43.875000 37.906250
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
224995 600 732 sub000 0 sub000/dblock_4 0 0 250.0 0 cal ... -4.671875 -3.517578 -4.441406 -4.718750 -4.671875 -3.400391 -4.429688 -4.406250 -3.900391 -4.371094
224996 600 736 sub000 0 sub000/dblock_4 0 0 250.0 0 cal ... -4.179688 -4.019531 -4.195312 -4.222656 -4.425781 -3.644531 -4.429688 -4.160156 -3.412109 -4.371094
224997 600 740 sub000 0 sub000/dblock_4 0 0 250.0 0 cal ... -4.425781 -3.767578 -4.441406 -3.974609 -4.425781 -3.400391 -4.429688 -4.160156 -3.900391 -4.859375
224998 600 744 sub000 0 sub000/dblock_4 0 0 250.0 0 cal ... -4.425781 -4.269531 -4.195312 -4.222656 -4.425781 -3.886719 -4.429688 -4.406250 -3.900391 -4.371094
224999 600 748 sub000 0 sub000/dblock_4 0 0 250.0 0 cal ... -4.179688 -4.019531 -3.947266 -4.222656 -4.179688 -3.400391 -4.183594 -4.406250 -3.412109 -4.371094

225000 rows × 47 columns

[4]:
eeg_epochs_df.query("epoch_id == 392")
[4]:
epoch_id time_ms sub_id eeg_artifact dblock_path log_evcodes log_ccodes dblock_srate ccode instrument ... RMOc LLTe RLTe LLOc RLOc MiOc A2 HEOG rle rhz

0 rows × 47 columns
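The empty query result shows that epoch_id 392 is absent. Gaps like this can also be found without knowing the missing id in advance. The sketch below uses a hypothetical stand-in series of epoch ids (0 to 600 without 392, two rows per epoch) rather than the demo data, so it runs on its own.

```python
import numpy as np
import pandas as pd

# Stand-in epoch_id column: ids 0..600 with 392 left out, two rows per epoch.
ids = np.array([i for i in range(601) if i != 392])
stand_in_ids = pd.Series(np.repeat(ids, 2), name="epoch_id")

# Distinct ids actually present, as plain Python ints.
present = stand_in_ids.unique().tolist()

# Ids in the full min..max range that never occur.
missing = sorted(set(range(min(present), max(present) + 1)) - set(present))
```

Replacing stand_in_ids with eeg_epochs_df["epoch_id"] applies the same gap search to the demo data.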

Always check the epoch x time format

When the format is correct, the check succeeds silently.

When it is not, the check raises an exception and the reason appears on the last line of the traceback.

Example: This check of the simulated data SUCCEEDS.

[5]:
epf.check_epochs(sim_epochs_df, ['channel0', 'channel1'], epoch_id="epoch_id", time="days")

Example: this check FAILS because the data column named “bogus_channel0” does not exist in the data.

[6]:
epf.check_epochs(sim_epochs_df, ['bogus_channel0', 'channel1'], epoch_id="epoch_id", time="days")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [6], in <cell line: 1>()
----> 1 epf.check_epochs(sim_epochs_df, ['bogus_channel0', 'channel1'], epoch_id="epoch_id", time="days")

File ~/miniconda/envs/env_3.9/lib/python3.9/site-packages/spudtr/epf.py:193, in check_epochs(epochs_df, data_streams, epoch_id, time)
    169 def check_epochs(epochs_df, data_streams, epoch_id=EPOCH_ID, time=TIME):
    170     """check epochs data are in spudtr format
    171
    172     Parameters
   (...)
    190
    191     """
--> 193     _ = _epochs_QC(epochs_df, data_streams, epoch_id=epoch_id, time=time)

File ~/miniconda/envs/env_3.9/lib/python3.9/site-packages/spudtr/epf.py:49, in _epochs_QC(epochs_df, data_streams, epoch_id, time)
     47 missing_channels = set(data_streams) - set(epochs_df.columns)
     48 if missing_channels:
---> 49     raise ValueError(
     50         "data_streams should all be present in the epochs dataframe, "
     51         f"the following are missing: {list(missing_channels)}"
     52     )
     54 # check no duplicate column names in index and regular columns
     55 names = list(epochs_df.index.names) + list(epochs_df.columns)

ValueError: data_streams should all be present in the epochs dataframe, the following are missing: ['bogus_channel0']

Example: this check FAILS because the time column named “hours” does not exist.

[7]:
epf.check_epochs(sim_epochs_df, ['channel0', 'channel1'], epoch_id="epoch_id", time="hours")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [7], in <cell line: 1>()
----> 1 epf.check_epochs(sim_epochs_df, ['channel0', 'channel1'], epoch_id="epoch_id", time="hours")

File ~/miniconda/envs/env_3.9/lib/python3.9/site-packages/spudtr/epf.py:193, in check_epochs(epochs_df, data_streams, epoch_id, time)
    169 def check_epochs(epochs_df, data_streams, epoch_id=EPOCH_ID, time=TIME):
    170     """check epochs data are in spudtr format
    171
    172     Parameters
   (...)
    190
    191     """
--> 193     _ = _epochs_QC(epochs_df, data_streams, epoch_id=epoch_id, time=time)

File ~/miniconda/envs/env_3.9/lib/python3.9/site-packages/spudtr/epf.py:60, in _epochs_QC(epochs_df, data_streams, epoch_id, time)
     57     raise ValueError("Duplicate column names not allowed.")
     59 # epoch_id and time must be the columns in the epochs_df
---> 60 _validate_epochs_df(epochs_df, epoch_id=epoch_id, time=time)
     62 # check values of epoch_id in every time group are the same, and
     63 # unique in each time group. Make our own copy so we are immune to
     64 # modification to original table
     65 table = epochs_df.copy().reset_index().set_index(epoch_id).sort_index()

File ~/miniconda/envs/env_3.9/lib/python3.9/site-packages/spudtr/epf.py:30, in _validate_epochs_df(epochs_df, epoch_id, time)
     28 for key, val in {"epoch_id": epoch_id, "time": time}.items():
     29     if val not in epochs_df.columns:
---> 30         raise ValueError(f"{key} column not found: {val}")

ValueError: time column not found: hours
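Both failures above surface as a ValueError, so a script can catch the exception and report the reason instead of stopping. The following is a minimal sketch: passes_check and stand_in_check are hypothetical names, and stand_in_check only imitates the missing-column failure so that the example runs without spudtr.

```python
def passes_check(check, *args, **kwargs):
    """Hypothetical wrapper: run a format check and return (ok, message)
    instead of letting a ValueError propagate."""
    try:
        check(*args, **kwargs)
    except ValueError as err:
        return False, str(err)
    return True, ""


def stand_in_check(columns, data_streams):
    """Stand-in for a real check: raise ValueError if a data stream is missing."""
    missing = [name for name in data_streams if name not in columns]
    if missing:
        raise ValueError(f"the following are missing: {missing}")


columns = ["epoch_id", "days", "channel0", "channel1"]

# All requested data streams exist, so the check passes.
ok, ok_msg = passes_check(stand_in_check, columns, ["channel0", "channel1"])

# One requested data stream does not exist, so the check fails with a reason.
bad, bad_msg = passes_check(stand_in_check, columns, ["bogus_channel0", "channel1"])
```

With spudtr installed, the same wrapper should accept the real check, for example passes_check(epf.check_epochs, sim_epochs_df, ['channel0', 'channel1'], epoch_id="epoch_id", time="days").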