Train

Three models are used as part of the Phase 02 Results Management system, each responsible for one of the tasks listed below.

determines if there are lung findings, adrenal findings, or no findings
if findings are found, determine the relevant portion of the note that made that decision,
and, if there are lung findings, determine if a chest CT is recommended.

The functions to train these models are provided in the file nmrezman.phase02.train.general.py and named as follows.

nmrezman.phase02.train.general.train_findings_model()
nmrezman.phase02.train.general.train_comment_model()
nmrezman.phase02.train.general.train_lung_recommended_proc_model()

Before training these models, pretraining was performed via nmrezman.phase02.train.general.pretrain_roberta_base().

Pretraining RoBERTa Base Model

As a first step, we pretrain a DistilRoBERTa base model using radiology reports.

Training was run via the script:

python -m nmrezman.phase02.train.pretrain --data_path /path/to/data/reports_df.gz --output_dir /path/to/results/phase02/pretrain --logging_dir /path/to/results/phase02/pretrain/logging --wandb_dir /path/to/results/phase02/pretrain --do_reporting True

nmrezman.phase02.train.general.pretrain_roberta_base(data_path: str, output_dir: str, logging_dir: str, do_reporting: bool = True, wandb_dir: Optional[str] = None)[source]

Pretrain the model based on custom dataset

Parameters

data_path (str) – Path to the dataframe file with the reports and labels
output_dir (str) – Path to save model checkpoints
logging_dir (str) – Path to save 🤗 logging data
do_reporting (bool) – Boolean to determine whether 🤗 will report to logs to all (True) or no (False) supported integrations
wandb_dir (bool) – Path to save the wandb logging directory

Lung Findings, Adrenal Findings, or No Findings Model

This model classifies whether the report contains lung findings, adrenal findings, or no findings. This is the first model the report is run through. This is an MLM RoBERTa-based model.

Training was run via the script:

python -m nmrezman.phase02.train.train_findings --data_path /path/to/data/reports_df.gz --model_pretrained_path /path/to/results/phase02/pretrain/checkpoint-XXXXX --output_dir /path/to/results/phase02/findings/ --logging_dir /path/to/results/phase02/findings/logging --result_fname /path/to/results/phase02/findings/findings_best_result.log --wandb_dir /path/to/results/phase02/findings/findings_recommend/ --do_reporting True

nmrezman.phase02.train.general.train_findings_model(data_path: str, model_pretrained_path: str, output_dir: str, logging_dir: str, result_fname: str, do_reporting: bool = True, wandb_dir: Optional[str] = None)[source]

Trains the Phase 02 Lung, Adrenal, or No Findings Model.

Parameters

data_path (str) – Path to the dataframe file with the reports and labels
model_pretrained_path (str) – Path / filename to pretrained model checkpoint
output_dir (str) – Path to save model checkpoints
logging_dir (str) – Path to save 🤗 logging data
result_fname (str) – Path / filename to save model evaluation metrics
do_reporting (bool) – Boolean to determine whether 🤗 will report to logs to all (True) or no (False) supported integrations
wandb_dir (bool) – Path to save the wandb logging directory

Comment Extraction Model

This model classifies the comment in the report that indicate the relevant finding. This model is run if the Findings model identifies findings were found. This is a Question-Answer based model.

Training was run via the script:

python -m nmrezman.phase02.train.train_comment --data_path /path/to/data/reports_df.gz --output_dir /path/to/results/phase02/comment/ --result_fname_prefix results

nmrezman.phase02.train.general.train_comment_model(data_path: str, output_dir: str = 'comment_model', result_fname_prefix: str = 'results')[source]

Trains the Comment Extraction Hhase 02 MLM model.

Parameters

data_path (str) – Path to the dataframe file with the reports and labels
output_dir (str) –

Path to save training and evaluation logging and results. Model checkpoints are saved in
<output_dir_str>/output_dir. Evaluation results are in output_dir
result_fname_prefix (str) – Result file name prefix to save *.csv and *.json in output_dir

Lung Recommended Procedure Model

This model classifies whether a Chest CT or some other (“ambiguous”) procedure is recommended. This model is run if the Findings model identifies lung findings were found. This is an MLM RoBERTa-based model.

Training was run via the script:

python -m nmrezman.phase02.train.train_findings --data_path /path/to/data/reports_df.gz --model_pretrained_path /path/to/results/phase02/pretrain/checkpoint-XXXXX --output_dir /path/to/results/phase02/lung_recommend/ --logging_dir /path/to/results/phase02/lung_recommend/logging --result_fname /path/to/results/phase02/lung_recommend/lung_recommend_best_result.log --wandb_dir /path/to/results/phase02/lung_recommend/ --do_reporting True

nmrezman.phase02.train.general.train_lung_recommended_proc_model(data_path: str, model_pretrained_path: str, output_dir: str, logging_dir: str, result_fname: str, do_reporting: bool = True, wandb_dir: Optional[str] = None)[source]

Trains the Lung Recommended Procedure Phase 02 MLM model. Recommends “Chest CT” or “Ambiguous” procedure for “Lung Findings”.

Parameters

data_path (str) – Path to the dataframe file with the reports and labels
model_pretrained_path (str) – Path / filename to pretrained model checkpoint
output_dir (str) – Path to save model checkpoints
logging_dir (str) – Path to save 🤗 logging data
result_fname (str) – Path / filename to save model evaluation metrics
do_reporting (bool) – Boolean to determine whether 🤗 will report to logs to all (True) or no (False) supported integrations
wandb_dir (bool) – Path to save the wandb logging directory