# Dataset curation
## Subject naming convention
Basic convention: `sub-XXX`. Example:

```
sub-001
sub-002
```

Multi-institution/multi-pathology convention: `sub-<site><pathology>XXX`.

Example of a multi-institution dataset:

```
sub-mon001    # mon stands for Montreal
sub-tor001    # tor stands for Toronto
```

Example of a multi-institution/multi-pathology dataset: for a multi-pathology dataset (two or more distinct diseases plus healthy controls), it is convenient to also include the pathology in the subject ID, for example:

```
sub-torDCM001    # tor stands for Toronto, DCM for Degenerative Cervical Myelopathy
sub-torHC001     # tor stands for Toronto, HC for Healthy Controls
sub-zurSCI001    # zur stands for Zurich, SCI for Spinal Cord Injury
```
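The convention above can be checked mechanically with a regular expression. A minimal sketch (the assumption that site codes are three lowercase letters and pathology codes are two or three uppercase letters is ours, inferred from the examples):

```python
import re

# Matches sub-XXX, sub-<site>XXX, or sub-<site><PATHOLOGY>XXX,
# e.g. sub-001, sub-mon001, sub-torDCM001.
# Assumption: site = 3 lowercase letters, pathology = 2-3 uppercase letters.
SUBJECT_ID = re.compile(r"^sub-([a-z]{3})?([A-Z]{2,3})?\d{3}$")

def is_valid_subject_id(subject_id: str) -> bool:
    """Return True if the folder name follows the naming convention."""
    return SUBJECT_ID.fullmatch(subject_id) is not None
```

For example, `is_valid_subject_id("sub-torDCM001")` is true, while `is_valid_subject_id("subject-1")` is not.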
Data collected from actual subjects goes under their specific sub-folder, for example:
```
sub-001
├── anat
│   ├── sub-001_T1w.json
│   ├── sub-001_T1w.nii.gz
│   ├── sub-001_T2star.json
│   ├── sub-001_T2star.nii.gz
│   ├── sub-001_T2w.json
│   └── sub-001_T2w.nii.gz
└── dwi
    ├── sub-001_dwi.bval
    ├── sub-001_dwi.bvec
    ├── sub-001_dwi.json
    └── sub-001_dwi.nii.gz
```
Many kinds of data have a place specified for them by BIDS. See file naming conventions and the MRI and Microscopy extensions for full details.
Warning
TODO: describe neuropoly-specific BIDS entities, like bp-cspine or acq-MTon
## BIDS template
⚠️ Every dataset must have the following files:

```
├── README.md
├── dataset_description.json
├── participants.tsv
├── participants.json
├── code/
│   └── curate.py
├── sub-XXX
│   └── anat
│       └── sub-XXX_T1w.nii.gz
├── ...
└── derivatives
    ├── dataset_description.json
    └── labels
        ├── sub-XXX
        └── ...
```
For details, see the BIDS specification.
## README.md

The README.md is a Markdown file describing the dataset in more detail. Below is a template; adapt it to your dataset.

README.md template:
```
# <NAME OF DATASET>

This is an <MRI/Microscopy> dataset acquired in the context of the <XYZ> project.

<IF DATASET CONTAINS DERIVATIVES>It also contains <manual segmentation/labels> of <MS lesions/tumors/etc> from <one/two/or more> expert raters, located under the derivatives folder.

## Contact Person

Dataset shared by: <NAME AND EMAIL>
<IF THERE WAS EMAIL COMM>Email communication: <DATE OF EMAIL AND SUBJECT>
<IF THERE IS A PRIMARY PROJECT/MODEL>Repository: https://github.com/<organization>/<repository_name>

## <IF DATA ARE MISSING FOR SOME SUBJECT(S)>Missing data

<LIST HERE MISSING SUBJECTS>
```
## dataset_description.json

The dataset_description.json is a JSON file describing the dataset.

dataset_description.json template (refer to the BIDS spec to know what version to fill in here):
```json
{
    "BIDSVersion": "BIDS X.Y.Z",
    "Name": "<dataset_name>",
    "DatasetType": "raw"
}
```
Warning

The dataset_description.json file within the top-level dataset should include `"DatasetType": "raw"`.
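The file can also be generated from a script. A minimal sketch using only the standard library (the default version string below is a placeholder of ours; check the BIDS spec for the current one):

```python
import json
from pathlib import Path

def write_dataset_description(dataset_dir, name, bids_version="BIDS X.Y.Z",
                              dataset_type="raw"):
    """Write a minimal dataset_description.json into dataset_dir.

    bids_version is a placeholder: look up the current version in the BIDS spec.
    """
    description = {
        "BIDSVersion": bids_version,
        "Name": name,
        "DatasetType": dataset_type,  # "raw" for the top-level dataset
    }
    path = Path(dataset_dir) / "dataset_description.json"
    path.write_text(json.dumps(description, indent=4) + "\n")
    return path
```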
## participants.tsv

The participants.tsv is a TSV file and should include at least the following columns:
| participant_id | source_id | species | age | sex | pathology | institution |
|---|---|---|---|---|---|---|
| sub-001 | 001 | homo sapiens | 30 | F | HC | montreal |
| sub-002 | 005 | homo sapiens | 40 | O | MS | montreal |
| sub-003 | 007 | homo sapiens | n/a | n/a | MS | toronto |
participants.tsv template:

```
participant_id	source_id	species	age	sex	pathology	institution
sub-001	001	homo sapiens	30	F	HC	montreal
sub-002	005	homo sapiens	40	O	MS	montreal
sub-003	007	homo sapiens	n/a	n/a	MS	toronto
```
Other columns may be added if the data exist to fill them and they are worth keeping.
Warning

Indicate missing values with `n/a` (for "not available"), not with empty cells!
Warning
This is a Tab-Separated-Values file. Make sure to use tabs between entries if editing with a text editor. Most spreadsheet software can read and write .tsv correctly.
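Writing the file with the standard `csv` module avoids both pitfalls (wrong delimiter, empty cells). A sketch, not a prescribed tool; the column list mirrors the table above:

```python
import csv

COLUMNS = ["participant_id", "source_id", "species", "age",
           "sex", "pathology", "institution"]

def write_participants_tsv(path, rows):
    """Write participants.tsv with real tab delimiters.

    rows is a list of dicts; any missing or empty value becomes "n/a",
    as BIDS requires.
    """
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS, delimiter="\t")
        writer.writeheader()
        for row in rows:
            writer.writerow({col: row.get(col) or "n/a" for col in COLUMNS})
```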
## participants.json

The participants.json is a JSON file providing a legend for the columns in participants.tsv, with longer descriptions, units, and, in the case of categorical variables, the allowed levels.
participants.json template:

```json
{
    "participant_id": {
        "Description": "Unique Participant ID",
        "LongName": "Participant ID"
    },
    "source_id": {
        "Description": "Subject ID in the unprocessed data",
        "LongName": "Subject ID in the unprocessed data"
    },
    "species": {
        "Description": "Binomial species name of participant",
        "LongName": "Species"
    },
    "age": {
        "Description": "Participant age",
        "LongName": "Participant age",
        "Units": "years"
    },
    "sex": {
        "Description": "Sex of the participant as reported by the participant",
        "Levels": {
            "M": "male",
            "F": "female",
            "O": "other"
        }
    },
    "pathology": {
        "Description": "The diagnosis of pathology of the participant",
        "LongName": "Pathology name",
        "Levels": {
            "HC": "Healthy Control",
            "DCM": "Degenerative Cervical Myelopathy (synonymous with CSM - Cervical Spondylotic Myelopathy)",
            "MS": "Multiple Sclerosis",
            "SCI": "Traumatic Spinal Cord Injury"
        }
    },
    "institution": {
        "Description": "Human-friendly institution name",
        "LongName": "BIDS Institution ID"
    }
}
```
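A quick way to catch columns that were added to participants.tsv but never documented in participants.json is to diff the two. A sketch using only the standard library:

```python
import csv
import json

def check_participants_legend(tsv_path, json_path):
    """Return the set of participants.tsv columns missing from participants.json."""
    with open(tsv_path, newline="") as f:
        columns = next(csv.reader(f, delimiter="\t"))  # header row
    with open(json_path) as f:
        legend = json.load(f)
    return set(columns) - set(legend)
```

An empty result means every column has a legend entry.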
## code/

The data cleaning and curation script(s) that create the sub-XXX/ folders should be kept with them, under the code/ folder. Within reason, every dataset should have a script that, when run like

```
python code/curate.py path/to/sourcedata ./
```

unpacks, converts, and renames all the images and related files in path/to/sourcedata/ into BIDS format in the current dataset (./).
This program should be committed first, before the curated data it produces. Afterwards, every commit that modifies the code should also re-run it, and the code and re-curated data should be committed in tandem.
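A hypothetical skeleton for such a script. The conversion step is entirely dataset-specific (real data usually needs e.g. DICOM-to-NIfTI conversion); the file-matching pattern and renaming below are illustrative assumptions, not part of any convention:

```python
#!/usr/bin/env python
"""Curate a source dataset into BIDS format.

Usage: python code/curate.py path/to/sourcedata ./
"""
import argparse
import shutil
from pathlib import Path

def curate(source_dir: Path, bids_dir: Path) -> None:
    # Illustrative only: copy each source T1w image into sub-XXX/anat/,
    # numbering subjects in sorted order. Replace with real conversion logic.
    for i, src in enumerate(sorted(source_dir.glob("*_T1w.nii.gz")), start=1):
        sub = f"sub-{i:03d}"
        anat = bids_dir / sub / "anat"
        anat.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, anat / f"{sub}_T1w.nii.gz")

def main():
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("sourcedata", type=Path)
    parser.add_argument("bids_dir", type=Path)
    args = parser.parse_args()
    curate(args.sourcedata, args.bids_dir)

if __name__ == "__main__":
    main()
```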
Note

Analysis scripts should not be kept here. Keep them in separate repositories, usually public on GitHub, with instructions on how to use them. See PIPELINE-DOC.
## Changelog policy
We use `git log` to track our changes. That means care should be taken to write good commit messages: they are there to help both you and future researchers understand how the dataset evolved.
Good commit message examples:

```
git commit -m 'Segment spines of subjects 010 through 023

Produced manually, using fsleyes.'
```

or

```
git commit -m 'Add new subjects provided by <email_address>'
```
If you choose to also fill in BIDS's optional CHANGES file, make sure it reflects the `git log`.
## Derivatives Structure

The derivatives are files generated from the top-level dataset, such as segmentations or labels.
Convention for derivatives JSON metadata:

```json
{
    "Author": "Firstname Lastname",
    "Date": "YYYY-MM-DD HH:MM:SS"
}
```
NOTE: "Date" is optional. We usually include it when running manual corrections via Python scripts.
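A manual-correction script could build that sidecar like this (a sketch; the field names follow the convention above, the function name is ours):

```python
import json
from datetime import datetime

def derivative_sidecar(author, include_date=True):
    """Build the JSON metadata string for a derivative file."""
    metadata = {"Author": author}
    if include_date:
        # "Date" is optional; usually added by manual-correction scripts
        metadata["Date"] = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    return json.dumps(metadata, indent=4)
```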
Warning

The derivatives folder must include its own dataset_description.json file (with `"DatasetType": "derivative"`).