Train

The Phase 01 Results Management system uses four models, one for each of the following tasks:

  • determine whether the report contains findings or no findings,

  • if findings are present, determine whether they are lung or adrenal findings,

  • if findings are present, identify the portion of the note that supports that decision,

  • and, if there are lung findings, determine whether a chest CT is recommended.
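The order in which the four tasks above depend on each other can be sketched as a small routing function. This is only an illustration of the control flow described in the list: `ReportResult`, `run_cascade`, `demo`, and the keyword-based stand-in predictors are hypothetical names introduced here, not part of nmrezman, and the trained models would take the place of the lambdas.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ReportResult:
    has_findings: bool
    finding_type: Optional[str] = None          # "lung" or "adrenal"
    comment: Optional[str] = None               # relevant portion of the note
    chest_ct_recommended: Optional[bool] = None


def run_cascade(
    note: str,
    has_findings: Callable[[str], bool],
    finding_type: Callable[[str], str],
    extract_comment: Callable[[str], str],
    recommends_chest_ct: Callable[[str], bool],
) -> ReportResult:
    """Route one report through the four tasks in the order listed above."""
    # Task 1: findings vs no findings; stop early when nothing is found.
    if not has_findings(note):
        return ReportResult(has_findings=False)

    result = ReportResult(has_findings=True)
    # Tasks 2 and 3 only run when findings are present.
    result.finding_type = finding_type(note)
    result.comment = extract_comment(note)
    # Task 4 only runs for lung findings.
    if result.finding_type == "lung":
        result.chest_ct_recommended = recommends_chest_ct(note)
    return result


def demo(note: str) -> ReportResult:
    """Run the cascade with toy keyword rules (hypothetical stand-ins for the models)."""
    return run_cascade(
        note,
        has_findings=lambda n: "nodule" in n.lower(),
        finding_type=lambda n: "adrenal" if "adrenal" in n.lower() else "lung",
        extract_comment=lambda n: next(
            s.strip() for s in n.split(".") if "nodule" in s.lower()
        ),
        recommends_chest_ct=lambda n: "chest ct" in n.lower(),
    )
```

Passing the predictors in as callables keeps the routing logic separate from how each model is loaded and run, which the text above does not specify.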

The functions that train these models are provided in the module nmrezman.phase01.train.general and are named as follows.

Findings vs No Findings Model

This model classifies whether the report contains findings or no findings. It is the first model each report is run through. This BiLSTM model uses GloVe word embeddings.

Training was run via the script:

python -m nmrezman.phase01.train.train_findings --data_path /path/to/data/reports_df.gz --glove_embedding_path /path/to/data/glove.6B.300d.txt --model_checkpoint_name /path/to/results/phase01/findings/findings_best_model.h5 --result_fname /path/to/results/phase01/findings/findings_best_result.log --tokenizer_fname /path/to/results/phase01/findings/tokenizer.gz
nmrezman.phase01.train.general.train_findings_model(data_path: str, glove_embedding_path: str, model_checkpoint_name: str = 'findings_best_model.h5', result_fname: str = 'findings_best_result.log', tokenizer_fname: str = 'tokenizer.gz')

Trains the Findings vs No Findings Phase 01 BiLSTM model.

Parameters
  • data_path (str) – Path to the dataframe file with the preprocessed impressions and labels in new_note and selected_finding columns, respectively

  • glove_embedding_path (str) – Path to the pre-downloaded GloVe Stanford pretrained word vectors glove.6B.300d as found at https://nlp.stanford.edu/projects/glove/

  • model_checkpoint_name (str) – Path / filename to save model checkpoints

  • result_fname (str) – Path / filename to save model evaluation metrics

  • tokenizer_fname (str) – Path / filename to save tokenizer
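The same training run can be started from Python instead of the command line. The sketch below uses the function name and keyword arguments from the signature above; the helper names `findings_training_kwargs` and `train_findings`, and the folder layout (copied from the example command), are assumptions made for illustration. Whether the function creates missing output folders is not stated here, so it is safest to assume they must already exist.

```python
from pathlib import Path


def findings_training_kwargs(data_dir: str, results_dir: str) -> dict:
    """Build keyword arguments for train_findings_model from two base folders."""
    data = Path(data_dir)
    out = Path(results_dir) / "phase01" / "findings"
    return {
        "data_path": str(data / "reports_df.gz"),
        "glove_embedding_path": str(data / "glove.6B.300d.txt"),
        "model_checkpoint_name": str(out / "findings_best_model.h5"),
        "result_fname": str(out / "findings_best_result.log"),
        "tokenizer_fname": str(out / "tokenizer.gz"),
    }


def train_findings(data_dir: str, results_dir: str) -> None:
    """Call the documented training function with the paths built above."""
    # Imported lazily so this sketch can be loaded without nmrezman installed.
    from nmrezman.phase01.train.general import train_findings_model

    train_findings_model(**findings_training_kwargs(data_dir, results_dir))


# Inspect the arguments without starting a training run.
kwargs = findings_training_kwargs("/path/to/data", "/path/to/results")
for name, value in kwargs.items():
    print(f"{name} = {value}")
```

Calling `train_findings("/path/to/data", "/path/to/results")` would start the actual training and requires nmrezman, the report dataframe, and the GloVe vectors to be available at those placeholder paths.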

Lung vs Adrenal Findings Model

This model classifies whether the report contains lung or adrenal findings. It is run only if the Findings vs No Findings model identifies findings. This BiLSTM model uses BioWordVec word embeddings.

Training was run via the script:

python -m nmrezman.phase01.train.train_lung_adrenal --data_path /path/to/data/reports_df.gz --bioword_path /path/to/data/BioWordVec_PubMed_MIMICIII_d200.bin --model_checkpoint_name /path/to/results/phase01/lung_adrenal/lung_adrenal_best_model.h5 --result_fname /path/to/results/phase01/lung_adrenal/lung_adrenal_best_result.log --tokenizer_fname /path/to/results/phase01/findings/tokenizer.gz
nmrezman.phase01.train.general.train_lung_adrenal_model(data_path: str, bioword_path: str, model_checkpoint_name: str = 'lung_adrenal_best_model.h5', result_fname: str = 'lung_adrenal_best_result.log', tokenizer_fname: str = 'tokenizer.gz')

Trains the Lung vs Adrenal Findings Phase 01 BiLSTM model.

Parameters
  • data_path (str) – Path to the dataframe file with the preprocessed impressions and labels in new_note and selected_finding columns, respectively

  • bioword_path (str) – Path to the BioWordVec pretrained word vectors BioWordVec_PubMed_MIMICIII_d200.bin, available from https://ftp.ncbi.nlm.nih.gov/pub/lu/Suppl/BioSentVec/BioWordVec_PubMed_MIMICIII_d200.bin

  • model_checkpoint_name (str) – Path / filename to save model checkpoints

  • result_fname (str) – Path / filename to save model evaluation metrics

  • tokenizer_fname (str) – Path / filename to save tokenizer
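Pretrained word vectors such as BioWordVec or GloVe are commonly connected to a model by building an embedding matrix whose rows follow the tokenizer's word indices. The sketch below shows that general technique only; it is not confirmed to be how nmrezman does it. The function name `build_embedding_matrix` is hypothetical, the vectors are assumed to be already loaded into a word-to-vector mapping, and word indices are assumed to start at 1 with row 0 reserved for padding.

```python
from typing import List, Mapping, Sequence


def build_embedding_matrix(
    word_index: Mapping[str, int],
    vectors: Mapping[str, Sequence[float]],
    dim: int,
) -> List[List[float]]:
    """Return a (len(word_index) + 1) x dim matrix aligned with tokenizer indices.

    Assumes indices run from 1 to len(word_index); row 0 stays zero for padding.
    """
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, idx in word_index.items():
        vec = vectors.get(word)
        # Words missing from the pretrained vocabulary keep an all-zero row.
        if vec is not None and len(vec) == dim:
            matrix[idx] = [float(x) for x in vec]
    return matrix


# Toy 2-dimensional vectors; the real BioWordVec file named above is 200-dimensional.
toy_index = {"nodule": 1, "adrenal": 2, "unseenword": 3}
toy_vectors = {"nodule": [0.1, 0.2], "adrenal": [0.3, 0.4]}
toy_matrix = build_embedding_matrix(toy_index, toy_vectors, dim=2)
print(len(toy_matrix), "rows of width", len(toy_matrix[0]))
```

A matrix built this way is typically used to initialize the embedding layer that feeds the BiLSTM, so the tokenizer saved via tokenizer_fname and the matrix must share the same word indices.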

Comment Extraction Model

This model identifies the comment in the report that indicates the relevant finding. It is run only if the Findings vs No Findings model identifies findings. It is an XGBoost-based model.

Training was run via the script:

python -m nmrezman.phase01.train.train_comment --data_path /path/to/data/reports_df.gz --model_checkpoint_name /path/to/results/phase01/comment/comment_best_model.sav --result_fname /path/to/results/phase01/comment/comment_best_result.log
nmrezman.phase01.train.general.train_comment_model(data_path: str, model_checkpoint_name: str = 'comment_best_model.sav', result_fname: str = 'comment_best_result.log')

Trains the Comment Extraction Phase 01 XGBoost model.

Parameters
  • data_path (str) – Path to the dataframe file with the preprocessed impressions and labels in new_note and selected_finding columns, respectively

  • model_checkpoint_name (str) – Path / filename to save model checkpoints

  • result_fname (str) – Path / filename to save model evaluation metrics
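When the three models are retrained together, the commands shown in this section can be assembled in one place. The sketch below only builds and prints the command lines; the script names, flags, and placeholder paths are copied from the commands above, while the helper `build_command` and the constants `DATA`, `RESULTS`, and `COMMANDS` are hypothetical names introduced for this example.

```python
import shlex
import sys
from typing import Dict, List


def build_command(script: str, options: Dict[str, str]) -> List[str]:
    """Return an argv list for `python -m nmrezman.phase01.train.<script>`."""
    argv = [sys.executable, "-m", f"nmrezman.phase01.train.{script}"]
    for flag, value in options.items():
        argv += [f"--{flag}", value]
    return argv


DATA = "/path/to/data"                # placeholder, as in the commands above
RESULTS = "/path/to/results/phase01"  # placeholder, as in the commands above

COMMANDS = [
    build_command("train_findings", {
        "data_path": f"{DATA}/reports_df.gz",
        "glove_embedding_path": f"{DATA}/glove.6B.300d.txt",
        "model_checkpoint_name": f"{RESULTS}/findings/findings_best_model.h5",
        "result_fname": f"{RESULTS}/findings/findings_best_result.log",
        "tokenizer_fname": f"{RESULTS}/findings/tokenizer.gz",
    }),
    build_command("train_lung_adrenal", {
        "data_path": f"{DATA}/reports_df.gz",
        "bioword_path": f"{DATA}/BioWordVec_PubMed_MIMICIII_d200.bin",
        "model_checkpoint_name": f"{RESULTS}/lung_adrenal/lung_adrenal_best_model.h5",
        "result_fname": f"{RESULTS}/lung_adrenal/lung_adrenal_best_result.log",
        # Same tokenizer path as in the lung/adrenal command shown above.
        "tokenizer_fname": f"{RESULTS}/findings/tokenizer.gz",
    }),
    build_command("train_comment", {
        "data_path": f"{DATA}/reports_df.gz",
        "model_checkpoint_name": f"{RESULTS}/comment/comment_best_model.sav",
        "result_fname": f"{RESULTS}/comment/comment_best_result.log",
    }),
]

for argv in COMMANDS:
    # Print only; to launch a run, pass argv to subprocess.run(argv, check=True).
    print(shlex.join(argv))
```

Building each command as a list rather than a single string avoids shell quoting problems if a real path contains spaces.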