Train

Three models are used as part of the Phase 02 Results Management system, each responsible for one of the tasks listed below.

  • determine whether there are lung findings, adrenal findings, or no findings,

  • if there are findings, identify the portion of the note that supports that decision,

  • and, if there are lung findings, determine whether a chest CT is recommended.
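
The order in which the three models are applied can be sketched as follows. This is only an illustration of the control flow described above, not the package's inference code: the function `run_phase02`, the three callables, and the label strings "lung", "adrenal", and "none" are hypothetical stand-ins, and passing the full report to the recommendation step is an assumption.

```python
from typing import Callable, Optional, TypedDict


class Phase02Result(TypedDict):
    finding: str                          # "lung", "adrenal", or "none" (hypothetical labels)
    comment: Optional[str]                # relevant portion of the note, if any
    chest_ct_recommended: Optional[bool]  # only set for lung findings


def run_phase02(
    report: str,
    classify_findings: Callable[[str], str],
    extract_comment: Callable[[str], str],
    recommends_chest_ct: Callable[[str], bool],
) -> Phase02Result:
    """Chain three stand-in models in the order described in the list above."""
    # Step 1: every report goes through the findings classifier first.
    finding = classify_findings(report)
    if finding == "none":
        # No findings: the other two models are never called.
        return Phase02Result(finding=finding, comment=None, chest_ct_recommended=None)

    # Step 2: findings were reported, so extract the supporting comment.
    comment = extract_comment(report)

    # Step 3: only lung findings are checked for a chest CT recommendation.
    chest_ct = recommends_chest_ct(report) if finding == "lung" else None
    return Phase02Result(finding=finding, comment=comment, chest_ct_recommended=chest_ct)
```

With stub callables, a report classified as "adrenal" yields a comment but leaves `chest_ct_recommended` as `None`, since the third model is skipped.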

The functions to train these models are provided in the file nmrezman.phase02.train.general.py and are documented below.

Before training these models, pretraining was performed via nmrezman.phase02.train.general.pretrain_roberta_base().

Pretraining RoBERTa Base Model

As a first step, we pretrain a DistilRoBERTa base model using radiology reports.

Training was run via the script:

python -m nmrezman.phase02.train.pretrain --data_path /path/to/data/reports_df.gz --output_dir /path/to/results/phase02/pretrain --logging_dir /path/to/results/phase02/pretrain/logging --wandb_dir /path/to/results/phase02/pretrain --do_reporting True

nmrezman.phase02.train.general.pretrain_roberta_base(data_path: str, output_dir: str, logging_dir: str, do_reporting: bool = True, wandb_dir: Optional[str] = None)

Pretrains the model on a custom dataset.

Parameters
  • data_path (str) – Path to the dataframe file with the reports and labels

  • output_dir (str) – Path to save model checkpoints

  • logging_dir (str) – Path to save 🤗 logging data

  • do_reporting (bool) – Whether 🤗 reports logs to all (True) or no (False) supported integrations

  • wandb_dir (Optional[str]) – Path to save the wandb logging directory
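
The same pretraining run can also be started from Python rather than the command line. The sketch below is a minimal example that mirrors the directory layout of the command shown earlier in this section; the helper names `pretrain_kwargs` and `run_pretraining` are hypothetical, and only the keyword names come from the signature documented here. The import is deferred so that the helper can be used without the nmrezman package installed.

```python
from pathlib import Path
from typing import Any, Dict


def pretrain_kwargs(data_path: str, results_dir: str) -> Dict[str, Any]:
    """Build keyword arguments matching the pretrain_roberta_base signature."""
    out = Path(results_dir)
    return {
        "data_path": data_path,
        "output_dir": str(out),               # model checkpoints
        "logging_dir": str(out / "logging"),  # logging data
        "do_reporting": True,                 # report to all supported integrations
        "wandb_dir": str(out),                # wandb logging directory
    }


def run_pretraining(data_path: str, results_dir: str) -> None:
    """Start pretraining; requires the nmrezman package to be installed."""
    # Imported lazily so that building the arguments has no heavy dependencies.
    from nmrezman.phase02.train.general import pretrain_roberta_base

    pretrain_roberta_base(**pretrain_kwargs(data_path, results_dir))
```

Calling `run_pretraining("/path/to/data/reports_df.gz", "/path/to/results/phase02/pretrain")` would then pass the same paths as the command-line example.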

Lung Findings, Adrenal Findings, or No Findings Model

This model classifies whether the report contains lung findings, adrenal findings, or no findings. This is the first model the report is run through. This is an MLM RoBERTa-based model.

Training was run via the script:

python -m nmrezman.phase02.train.train_findings --data_path /path/to/data/reports_df.gz --model_pretrained_path /path/to/results/phase02/pretrain/checkpoint-XXXXX --output_dir /path/to/results/phase02/findings/ --logging_dir /path/to/results/phase02/findings/logging --result_fname /path/to/results/phase02/findings/findings_best_result.log --wandb_dir /path/to/results/phase02/findings/findings_recommend/ --do_reporting True

nmrezman.phase02.train.general.train_findings_model(data_path: str, model_pretrained_path: str, output_dir: str, logging_dir: str, result_fname: str, do_reporting: bool = True, wandb_dir: Optional[str] = None)

Trains the Phase 02 Lung, Adrenal, or No Findings Model.

Parameters
  • data_path (str) – Path to the dataframe file with the reports and labels

  • model_pretrained_path (str) – Path / filename to pretrained model checkpoint

  • output_dir (str) – Path to save model checkpoints

  • logging_dir (str) – Path to save 🤗 logging data

  • result_fname (str) – Path / filename to save model evaluation metrics

  • do_reporting (bool) – Whether 🤗 reports logs to all (True) or no (False) supported integrations

  • wandb_dir (Optional[str]) – Path to save the wandb logging directory
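
The `model_pretrained_path` argument points at a checkpoint directory written during pretraining (`checkpoint-XXXXX` in the command above). As a minimal sketch, assuming the pretraining output directory contains subdirectories named `checkpoint-<step>`, the hypothetical helper below picks the one with the highest step number. The highest step is not necessarily the best-performing checkpoint, so treat this as a convenience rather than a selection rule.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed naming scheme: checkpoint-<integer step>
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(pretrain_output_dir: str) -> Optional[Path]:
    """Return the checkpoint-<step> subdirectory with the highest step, if any."""
    best_step = -1
    best_path: Optional[Path] = None
    for child in Path(pretrain_output_dir).iterdir():
        match = _CHECKPOINT_RE.match(child.name)
        if not (match and child.is_dir()):
            continue  # skip logging folders, stray files, and other names
        step = int(match.group(1))  # compare numerically, not as strings
        if step > best_step:
            best_step, best_path = step, child
    return best_path
```

The returned path can be converted with `str(...)` and passed as `model_pretrained_path`; the function returns `None` when no matching directory exists.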

Comment Extraction Model

This model identifies the comment in the report that indicates the relevant finding. It is run only if the Findings model identifies findings. This is a Question-Answer based model.

Training was run via the script:

python -m nmrezman.phase02.train.train_comment --data_path /path/to/data/reports_df.gz --output_dir /path/to/results/phase02/comment/ --result_fname_prefix results

nmrezman.phase02.train.general.train_comment_model(data_path: str, output_dir: str = 'comment_model', result_fname_prefix: str = 'results')

Trains the Comment Extraction Phase 02 MLM model.

Parameters
  • data_path (str) – Path to the dataframe file with the reports and labels

  • output_dir (str) – Path to save training and evaluation logging and results. Model checkpoints are saved in <output_dir_str>/output_dir. Evaluation results are in output_dir.

  • result_fname_prefix (str) – Prefix for the *.csv and *.json result file names saved in output_dir
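
After training, the evaluation results can be collected from `output_dir`. The sketch below assumes only what the parameter descriptions state: the result files are `*.csv` and `*.json` files in `output_dir` whose names start with `result_fname_prefix`. The helper name `find_result_files` is hypothetical, and the exact file names written by training are not known here.

```python
from pathlib import Path
from typing import List


def find_result_files(output_dir: str, result_fname_prefix: str = "results") -> List[Path]:
    """List *.csv and *.json files in output_dir whose names start with the prefix."""
    root = Path(output_dir)
    return sorted(
        path
        for path in root.iterdir()
        if path.is_file()
        and path.name.startswith(result_fname_prefix)
        and path.suffix in {".csv", ".json"}
    )
```

For the command shown above, this would be called as `find_result_files("/path/to/results/phase02/comment/", "results")`.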