spudtr
epochs dataframe format
spudtr epochs are pandas.DataFrame objects.
There are three key elements:
epoch_id: an index-like integer column, where each value designates a unique epoch
time: an index-like column of integer timestamps, the same in each epoch
the rest of the data columns
There must be at least one epoch.
There must be at least one timepoint.
All the epochs must be timestamped exactly the same way.
NOTE: timestamps are integers and may be positive or negative. The units are unspecified: they could be milliseconds, months, nanoseconds, or hours.
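The format rules above can be spot-checked with plain pandas. The following is a minimal sketch, not spudtr's own validation: the toy dataframe, the column names `epoch_id` and `time`, and the helper name `looks_like_epochs` are all assumptions made for illustration.

```python
import pandas as pd

# Toy data: 2 epochs x 3 timestamps, one data column (names are illustrative).
toy_df = pd.DataFrame(
    {
        "epoch_id": [0, 0, 0, 1, 1, 1],
        "time": [-1, 0, 1, -1, 0, 1],
        "channel0": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    }
)


def looks_like_epochs(df, epoch_id="epoch_id", time="time"):
    """Hypothetical helper: True if df satisfies the format rules."""
    if df.empty:
        return False  # need at least one epoch and one timepoint
    # Collect each epoch's timestamps, in row order, keyed by epoch id.
    stamps = {eid: tuple(grp[time]) for eid, grp in df.groupby(epoch_id)}
    # No timestamp may repeat within an epoch ...
    no_repeats = all(len(set(s)) == len(s) for s in stamps.values())
    # ... and every epoch must carry exactly the same timestamps.
    return no_repeats and len(set(stamps.values())) == 1


print(looks_like_epochs(toy_df))  # True
print(looks_like_epochs(toy_df.drop(index=5)))  # False: epoch 1 lost a timestamp
```

For real work, the spudtr check demonstrated later on this page is the one to rely on; this sketch only makes the rules concrete.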
[1]:
import pandas as pd
from matplotlib import pyplot as plt
from spudtr import get_demo_df, P3_1500_FEATHER
from spudtr import epf
import spudtr.fake_epochs_data as fake_data
Example: simulated categorical and continuous data
The epoch_id column is “epoch_id”; there are four epochs: 0, 1, 2, 3.
The time column is “days”; there are 32 days in each epoch: 0, 1, 2, …, 31.
The rest of the columns are the data recorded in each epoch at each time stamp.
[2]:
n_epochs_per_category = 2
sim_epochs_df, channels = fake_data._generate(
    n_epochs=n_epochs_per_category,
    n_samples=32,
    n_categories=2,
    n_channels=4,
    time="days",
    epoch_id="epoch_id",
    seed=10,
)
display(sim_epochs_df)
| epoch_id | days | categorical | continuous | channel0 | channel1 | channel2 | channel3 |
---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | cat0 | 0.771321 | -13.170787 | -30.197057 | 19.609869 | 43.177612 |
1 | 0 | 1 | cat0 | 0.020752 | 4.233125 | -7.726009 | -65.298259 | 41.464399 |
2 | 0 | 2 | cat0 | 0.633648 | 8.191480 | 21.915223 | 18.568468 | 27.639613 |
3 | 0 | 3 | cat0 | 0.748804 | -48.557122 | -50.952045 | 14.317029 | -17.186617 |
4 | 0 | 4 | cat0 | 0.498507 | -17.193401 | 50.222266 | 0.782896 | 38.251473 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
123 | 3 | 27 | cat1 | 0.744603 | 33.167254 | -7.658414 | 14.630878 | 14.329468 |
124 | 3 | 28 | cat1 | 0.469785 | -60.531560 | 0.774228 | 1.689442 | 0.882024 |
125 | 3 | 29 | cat1 | 0.598256 | 16.216221 | 66.028993 | 16.373534 | 4.854384 |
126 | 3 | 30 | cat1 | 0.147620 | -43.268966 | 26.531028 | -20.493672 | -12.327708 |
127 | 3 | 31 | cat1 | 0.184035 | -48.265511 | -41.604676 | -19.770519 | 27.925069 |
128 rows × 8 columns
Example: EEG data
The epoch index column is epoch_id. There are 600 epochs, numbered 0, 1, 2, …, 600. There are 600 rather than 601 epochs here because epoch_id 392 was excluded: the relevant event marked a pause in the recording, not a stimulus. The epoch ids must be unique, but they can be gappy and out of order.
The time column is time_ms. There are 375 digital samples in each epoch at 4 ms intervals: -748, -744, …, 744, 748.
The rest of the columns are the data recorded in each epoch at each time stamp.
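The sample count follows from the timestamps: stepping from -748 to 748 in 4 ms increments gives 375 values, and 600 epochs of 375 samples each account for the 225000 rows of the dataframe. A quick arithmetic check in plain Python (the variable names are illustrative only):

```python
# Timestamps from -748 ms to 748 ms inclusive, 4 ms apart (250 samples/second).
time_ms = list(range(-748, 748 + 1, 4))

n_samples = len(time_ms)
n_rows = 600 * n_samples  # 600 epochs, each with the same timestamps

print(n_samples)  # 375
print(n_rows)  # 225000
```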
[3]:
eeg_epochs_df = get_demo_df(P3_1500_FEATHER)
display(len(eeg_epochs_df["epoch_id"].unique()))
eeg_epochs_df
600
[3]:
| epoch_id | time_ms | sub_id | eeg_artifact | dblock_path | log_evcodes | log_ccodes | dblock_srate | ccode | instrument | ... | RMOc | LLTe | RLTe | LLOc | RLOc | MiOc | A2 | HEOG | rle | rhz |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | -748 | sub000 | 0 | sub000/dblock_0 | 0 | 0 | 250.0 | 1 | eeg | ... | -25.093750 | -0.753906 | 1.480469 | -13.414062 | -18.937500 | -17.734375 | 5.660156 | 98.875000 | -39.500000 | 38.375000 |
1 | 0 | -744 | sub000 | 0 | sub000/dblock_0 | 0 | 0 | 250.0 | 1 | eeg | ... | -24.593750 | 0.502441 | -2.466797 | -17.640625 | -17.468750 | -15.304688 | 1.968750 | 104.750000 | -38.031250 | 41.281250 |
2 | 0 | -740 | sub000 | 0 | sub000/dblock_0 | 0 | 0 | 250.0 | 1 | eeg | ... | -16.484375 | -1.507812 | 3.947266 | -15.648438 | -10.085938 | -11.171875 | 8.367188 | 102.062500 | -33.656250 | 43.718750 |
3 | 0 | -736 | sub000 | 0 | sub000/dblock_0 | 0 | 0 | 250.0 | 1 | eeg | ... | -11.804688 | -15.070312 | 9.867188 | -14.906250 | -7.378906 | -8.742188 | 9.351562 | 100.562500 | -42.906250 | 37.406250 |
4 | 0 | -732 | sub000 | 0 | sub000/dblock_0 | 0 | 0 | 250.0 | 1 | eeg | ... | -6.394531 | -4.019531 | 9.125000 | -10.679688 | -6.886719 | -8.015625 | 8.125000 | 98.375000 | -43.875000 | 37.906250 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
224995 | 600 | 732 | sub000 | 0 | sub000/dblock_4 | 0 | 0 | 250.0 | 0 | cal | ... | -4.671875 | -3.517578 | -4.441406 | -4.718750 | -4.671875 | -3.400391 | -4.429688 | -4.406250 | -3.900391 | -4.371094 |
224996 | 600 | 736 | sub000 | 0 | sub000/dblock_4 | 0 | 0 | 250.0 | 0 | cal | ... | -4.179688 | -4.019531 | -4.195312 | -4.222656 | -4.425781 | -3.644531 | -4.429688 | -4.160156 | -3.412109 | -4.371094 |
224997 | 600 | 740 | sub000 | 0 | sub000/dblock_4 | 0 | 0 | 250.0 | 0 | cal | ... | -4.425781 | -3.767578 | -4.441406 | -3.974609 | -4.425781 | -3.400391 | -4.429688 | -4.160156 | -3.900391 | -4.859375 |
224998 | 600 | 744 | sub000 | 0 | sub000/dblock_4 | 0 | 0 | 250.0 | 0 | cal | ... | -4.425781 | -4.269531 | -4.195312 | -4.222656 | -4.425781 | -3.886719 | -4.429688 | -4.406250 | -3.900391 | -4.371094 |
224999 | 600 | 748 | sub000 | 0 | sub000/dblock_4 | 0 | 0 | 250.0 | 0 | cal | ... | -4.179688 | -4.019531 | -3.947266 | -4.222656 | -4.179688 | -3.400391 | -4.183594 | -4.406250 | -3.412109 | -4.371094 |
225000 rows × 47 columns
[4]:
eeg_epochs_df.query("epoch_id == 392")
[4]:
epoch_id | time_ms | sub_id | eeg_artifact | dblock_path | log_evcodes | log_ccodes | dblock_srate | ccode | instrument | ... | RMOc | LLTe | RLTe | LLOc | RLOc | MiOc | A2 | HEOG | rle | rhz |
---|
0 rows × 47 columns
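An empty query result is one way to confirm that an epoch id is absent. To list every gap at once, one option is to compare the ids that are present against the full integer range. The sketch below uses a small made-up dataframe rather than the demo EEG data; the ids, values, and variable names are illustrative assumptions.

```python
import pandas as pd

# Made-up epochs: ids are unique but gappy and out of order, 2 timestamps each.
toy_df = pd.DataFrame(
    {
        "epoch_id": [5, 5, 0, 0, 3, 3, 1, 1],
        "time_ms": [0, 4, 0, 4, 0, 4, 0, 4],
        "MiOc": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    }
)

present = set(toy_df["epoch_id"].unique().tolist())
# Every integer between the smallest and largest id that has no rows.
missing = sorted(set(range(min(present), max(present) + 1)) - present)

print(len(present))  # 4
print(missing)  # [2, 4]
print(len(toy_df.query("epoch_id == 2")))  # 0
```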
Always check the epoch x time format
When things go well, the check quietly succeeds.
When they don’t, the reason appears at the bottom of the error messages.
Example: This check of the simulated data SUCCEEDS.
[5]:
epf.check_epochs(sim_epochs_df, ['channel0', 'channel1'], epoch_id="epoch_id", time="days")
Example: this check FAILS because the data column named “bogus_channel0” doesn’t exist in the data.
[6]:
epf.check_epochs(sim_epochs_df, ['bogus_channel0', 'channel1'], epoch_id="epoch_id", time="days")
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [6], in <cell line: 1>()
----> 1 epf.check_epochs(sim_epochs_df, ['bogus_channel0', 'channel1'], epoch_id="epoch_id", time="days")
File ~/miniconda/envs/env_3.9/lib/python3.9/site-packages/spudtr/epf.py:193, in check_epochs(epochs_df, data_streams, epoch_id, time)
169 def check_epochs(epochs_df, data_streams, epoch_id=EPOCH_ID, time=TIME):
170 """check epochs data are in spudtr format
171
172 Parameters
(...)
190
191 """
--> 193 _ = _epochs_QC(epochs_df, data_streams, epoch_id=epoch_id, time=time)
File ~/miniconda/envs/env_3.9/lib/python3.9/site-packages/spudtr/epf.py:49, in _epochs_QC(epochs_df, data_streams, epoch_id, time)
47 missing_channels = set(data_streams) - set(epochs_df.columns)
48 if missing_channels:
---> 49 raise ValueError(
50 "data_streams should all be present in the epochs dataframe, "
51 f"the following are missing: {list(missing_channels)}"
52 )
54 # check no duplicate column names in index and regular columns
55 names = list(epochs_df.index.names) + list(epochs_df.columns)
ValueError: data_streams should all be present in the epochs dataframe, the following are missing: ['bogus_channel0']
Example: this check FAILS because the time column named “hours” doesn’t exist.
[7]:
epf.check_epochs(sim_epochs_df, ['channel0', 'channel1'], epoch_id="epoch_id", time="hours")
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [7], in <cell line: 1>()
----> 1 epf.check_epochs(sim_epochs_df, ['channel0', 'channel1'], epoch_id="epoch_id", time="hours")
File ~/miniconda/envs/env_3.9/lib/python3.9/site-packages/spudtr/epf.py:193, in check_epochs(epochs_df, data_streams, epoch_id, time)
169 def check_epochs(epochs_df, data_streams, epoch_id=EPOCH_ID, time=TIME):
170 """check epochs data are in spudtr format
171
172 Parameters
(...)
190
191 """
--> 193 _ = _epochs_QC(epochs_df, data_streams, epoch_id=epoch_id, time=time)
File ~/miniconda/envs/env_3.9/lib/python3.9/site-packages/spudtr/epf.py:60, in _epochs_QC(epochs_df, data_streams, epoch_id, time)
57 raise ValueError("Duplicate column names not allowed.")
59 # epoch_id and time must be the columns in the epochs_df
---> 60 _validate_epochs_df(epochs_df, epoch_id=epoch_id, time=time)
62 # check values of epoch_id in every time group are the same, and
63 # unique in each time group. Make our own copy so we are immune to
64 # modification to original table
65 table = epochs_df.copy().reset_index().set_index(epoch_id).sort_index()
File ~/miniconda/envs/env_3.9/lib/python3.9/site-packages/spudtr/epf.py:30, in _validate_epochs_df(epochs_df, epoch_id, time)
28 for key, val in {"epoch_id": epoch_id, "time": time}.items():
29 if val not in epochs_df.columns:
---> 30 raise ValueError(f"{key} column not found: {val}")
ValueError: time column not found: hours
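Because a failed check raises `ValueError`, a script can catch the exception and report the reason instead of stopping with a full traceback. The sketch below does not call spudtr: `check_columns` is a hypothetical stand-in that mirrors only the two column checks visible in the tracebacks above, so that the example runs with pandas alone.

```python
import pandas as pd


def check_columns(epochs_df, data_streams, epoch_id, time):
    """Hypothetical stand-in for the column checks shown in the tracebacks."""
    missing = set(data_streams) - set(epochs_df.columns)
    if missing:
        raise ValueError(f"the following are missing: {sorted(missing)}")
    for key, val in {"epoch_id": epoch_id, "time": time}.items():
        if val not in epochs_df.columns:
            raise ValueError(f"{key} column not found: {val}")


toy_df = pd.DataFrame({"epoch_id": [0, 0], "days": [0, 1], "channel0": [0.5, 0.7]})

# (data_streams, time column) combinations: one good, two bad.
attempts = [
    (["channel0"], "days"),
    (["bogus_channel0"], "days"),
    (["channel0"], "hours"),
]

problems = []
for streams, time_col in attempts:
    try:
        check_columns(toy_df, streams, epoch_id="epoch_id", time=time_col)
    except ValueError as err:
        problems.append(str(err))  # keep the reason, carry on

print(problems)
```

The same `try` / `except ValueError` pattern should apply when the call inside the `try` block is `epf.check_epochs`, since the tracebacks show it raising `ValueError` in both failure cases.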