Train

Four models are used as part of the Phase 01 Results Management system, each responsible for one of the tasks listed below.

Determine whether the report contains findings or no findings.
If there are findings, determine whether they are lung or adrenal findings.
If there are findings, identify the portion of the note that supports that decision.
If there are lung findings, determine whether a chest CT is recommended.
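
The routing between the four tasks can be sketched as follows. This is only an illustration of the control flow described above: the `Phase01Result` container, the `run_cascade` function, and the keyword-based `toy_*` stand-ins are hypothetical names introduced here, not part of nmrezman, and the real models are trained classifiers rather than keyword rules.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Phase01Result:
    """Hypothetical container for the outputs of the four tasks."""
    has_findings: bool
    finding_type: Optional[str] = None      # "lung" or "adrenal"
    comment: Optional[str] = None           # portion of the note behind the decision
    recommended_proc: Optional[str] = None  # "Chest CT" or "Ambiguous"


def run_cascade(
    note: str,
    findings_model: Callable[[str], bool],
    lung_adrenal_model: Callable[[str], str],
    comment_model: Callable[[str], str],
    lung_proc_model: Callable[[str], str],
) -> Phase01Result:
    # Task 1: findings vs no findings. Nothing else runs without findings.
    if not findings_model(note):
        return Phase01Result(has_findings=False)

    result = Phase01Result(has_findings=True)
    # Tasks 2 and 3 both run whenever findings are present.
    result.finding_type = lung_adrenal_model(note)
    result.comment = comment_model(note)
    # Task 4 runs only for lung findings.
    if result.finding_type == "lung":
        result.recommended_proc = lung_proc_model(note)
    return result


# Hypothetical keyword rules standing in for the trained models.
def toy_findings(note: str) -> bool:
    text = note.lower()
    return "nodule" in text or "adrenal" in text


def toy_lung_adrenal(note: str) -> str:
    return "adrenal" if "adrenal" in note.lower() else "lung"


def toy_comment(note: str) -> str:
    # Return the first sentence that mentions a finding keyword.
    for sentence in note.split("."):
        if toy_findings(sentence):
            return sentence.strip()
    return ""


def toy_lung_proc(note: str) -> str:
    return "Chest CT" if "chest ct" in note.lower() else "Ambiguous"


example = run_cascade(
    "6 mm nodule in the right upper lobe. Recommend chest CT.",
    toy_findings, toy_lung_adrenal, toy_comment, toy_lung_proc,
)
print(example)
```

Passing the models in as callables keeps the routing logic independent of how each model is loaded or implemented.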

The functions that train these models are provided in the module nmrezman.phase01.train.general (general.py) and are named as follows.

nmrezman.phase01.train.general.train_findings_model()
nmrezman.phase01.train.general.train_lung_adrenal_model()
nmrezman.phase01.train.general.train_comment_model()
nmrezman.phase01.train.general.train_lung_recommended_proc_model()
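
As an alternative to the command-line scripts, the four functions could be driven from Python. The sketch below assumes the placeholder directory layout used in the script examples on this page; `build_training_plan` and `run_training_plan` are hypothetical helpers introduced for illustration, and only the function names and keyword arguments come from the documented signatures.

```python
import importlib
from pathlib import Path
from typing import Dict, List, Tuple

Plan = List[Tuple[str, Dict[str, str]]]


def build_training_plan(data_dir: str, results_dir: str) -> Plan:
    """Collect keyword arguments for each training function (hypothetical helper)."""
    data = Path(data_dir)
    results = Path(results_dir) / "phase01"
    reports = str(data / "reports_df.gz")
    # The script examples on this page pass the same tokenizer path to every BiLSTM model.
    tokenizer = str(results / "findings" / "tokenizer.gz")

    def outputs(subdir: str, stem: str, ext: str) -> Dict[str, str]:
        return {
            "model_checkpoint_name": str(results / subdir / f"{stem}_best_model.{ext}"),
            "result_fname": str(results / subdir / f"{stem}_best_result.log"),
        }

    return [
        ("train_findings_model", {
            "data_path": reports,
            "glove_embedding_path": str(data / "glove.6B.300d.txt"),
            "tokenizer_fname": tokenizer,
            **outputs("findings", "findings", "h5"),
        }),
        ("train_lung_adrenal_model", {
            "data_path": reports,
            "bioword_path": str(data / "BioWordVec_PubMed_MIMICIII_d200.bin"),
            "tokenizer_fname": tokenizer,
            **outputs("lung_adrenal", "lung_adrenal", "h5"),
        }),
        ("train_comment_model", {
            "data_path": reports,
            **outputs("comment", "comment", "sav"),
        }),
        ("train_lung_recommended_proc_model", {
            "data_path": reports,
            "tokenizer_fname": tokenizer,
            **outputs("lung_recommend", "lung_recommend", "h5"),
        }),
    ]


def run_training_plan(plan: Plan) -> None:
    """Call each training function in turn; requires nmrezman to be installed."""
    general = importlib.import_module("nmrezman.phase01.train.general")
    for name, kwargs in plan:
        getattr(general, name)(**kwargs)


plan = build_training_plan("/path/to/data", "/path/to/results")
for name, kwargs in plan:
    print(name, sorted(kwargs))
```

Calling `run_training_plan(plan)` would then start the actual training runs; building the plan alone performs no training and does not need the package.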

Findings vs No Findings Model

This model classifies whether the report contains findings or no findings. It is the first model each report is run through. It is a BiLSTM model that uses GloVe word embeddings.

Training was run via the script:

python -m nmrezman.phase01.train.train_findings --data_path /path/to/data/reports_df.gz --glove_embedding_path /path/to/data/glove.6B.300d.txt --model_checkpoint_name /path/to/results/phase01/findings/findings_best_model.h5 --result_fname /path/to/results/phase01/findings/findings_best_result.log --tokenizer_fname /path/to/results/phase01/findings/tokenizer.gz

nmrezman.phase01.train.general.train_findings_model(data_path: str, glove_embedding_path: str, model_checkpoint_name: str = 'findings_best_model.h5', result_fname: str = 'findings_best_result.log', tokenizer_fname: str = 'tokenizer.gz')

Trains the Findings vs No Findings Phase 01 BiLSTM model.

Parameters

data_path (str) – Path to the dataframe file with the preprocessed impressions and labels in new_note and selected_finding columns, respectively
glove_embedding_path (str) – Path to the pre-downloaded GloVe Stanford pretrained word vectors glove.6B.300d as found at https://nlp.stanford.edu/projects/glove/
model_checkpoint_name (str) – Path / filename to save model checkpoints
result_fname (str) – Path / filename to save model evaluation metrics
tokenizer_fname (str) – Path / filename to save tokenizer
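
For reference, the GloVe text files store one token per line followed by its space-separated vector components (300 values per line for glove.6B.300d.txt). The sketch below parses that layout with the standard library only; `load_glove_vectors` is a hypothetical helper, and how train_findings_model itself reads the file is not shown on this page.

```python
import io
from typing import Dict, List, TextIO


def load_glove_vectors(handle: TextIO) -> Dict[str, List[float]]:
    """Parse GloVe text format: '<token> <v1> <v2> ... <vN>' on each line."""
    vectors: Dict[str, List[float]] = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


# Tiny in-memory sample with 3-dimensional vectors instead of the real 300.
sample = io.StringIO("nodule 0.1 -0.2 0.3\nadrenal 0.4 0.5 -0.6\n")
glove = load_glove_vectors(sample)
print(len(glove), glove["nodule"])
```

With the real file, the handle would come from `open(glove_embedding_path, encoding="utf-8")`.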

Lung vs Adrenal Findings Model

This model classifies whether the report contains lung or adrenal findings. It is run if the Findings vs No Findings model identifies findings. It is a BiLSTM model that uses BioWordVec word embeddings.

Training was run via the script:

python -m nmrezman.phase01.train.train_lung_adrenal --data_path /path/to/data/reports_df.gz --bioword_path /path/to/data/BioWordVec_PubMed_MIMICIII_d200.bin --model_checkpoint_name /path/to/results/phase01/lung_adrenal/lung_adrenal_best_model.h5 --result_fname /path/to/results/phase01/lung_adrenal/lung_adrenal_best_result.log --tokenizer_fname /path/to/results/phase01/findings/tokenizer.gz

nmrezman.phase01.train.general.train_lung_adrenal_model(data_path: str, bioword_path: str, model_checkpoint_name: str = 'lung_adrenal_best_model.h5', result_fname: str = 'lung_adrenal_best_result.log', tokenizer_fname: str = 'tokenizer.gz')

Trains the Lung vs Adrenal Findings Phase 01 BiLSTM model.

Parameters

data_path (str) – Path to the dataframe file with the preprocessed impressions and labels in new_note and selected_finding columns, respectively
bioword_path (str) – Path to the BioWordVec pretrained word vectors BioWordVec_PubMed_MIMICIII_d200.bin, as downloaded from https://ftp.ncbi.nlm.nih.gov/pub/lu/Suppl/BioSentVec/BioWordVec_PubMed_MIMICIII_d200.bin
model_checkpoint_name (str) – Path / filename to save model checkpoints
result_fname (str) – Path / filename to save model evaluation metrics
tokenizer_fname (str) – Path / filename to save tokenizer
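
A common way to combine a tokenizer with pretrained word vectors is to build an embedding matrix whose row i holds the vector for the token with index i. The sketch below illustrates that general technique. It is an assumption that the training function works this way; `build_embedding_matrix` is a hypothetical helper, and the word index is assumed to follow the Keras-style convention of contiguous indices starting at 1, with row 0 reserved for padding.

```python
from typing import Dict, List


def build_embedding_matrix(
    word_index: Dict[str, int],
    vectors: Dict[str, List[float]],
    dim: int,
) -> List[List[float]]:
    """Return a (len(word_index) + 1) x dim matrix; unknown tokens stay all-zero."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, idx in word_index.items():
        vec = vectors.get(word)
        if vec is not None and len(vec) == dim:
            matrix[idx] = list(vec)
    return matrix


# Toy example with 2-dimensional vectors; the BioWordVec file named above is 200-dimensional.
word_index = {"nodule": 1, "adrenal": 2, "unseen": 3}
vectors = {"nodule": [0.1, 0.2], "adrenal": [0.3, 0.4]}
matrix = build_embedding_matrix(word_index, vectors, dim=2)
print(matrix)
```

Tokens missing from the pretrained vocabulary keep a zero row here; other initializations, such as small random values, are equally plausible.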

Comment Extraction Model

This model identifies the comment in the report that indicates the relevant finding. It is run if the Findings vs No Findings model identifies findings. It is an XGBoost-based model.

Training was run via the script:

python -m nmrezman.phase01.train.train_comment --data_path /path/to/data/reports_df.gz --model_checkpoint_name /path/to/results/phase01/comment/comment_best_model.sav --result_fname /path/to/results/phase01/comment/comment_best_result.log

nmrezman.phase01.train.general.train_comment_model(data_path: str, model_checkpoint_name: str = 'comment_best_model.sav', result_fname: str = 'comment_best_result.log')

Trains the Comment Extraction Phase 01 XGBoost model.

Parameters

data_path (str) – Path to the dataframe file with the preprocessed impressions and labels in new_note and selected_finding columns, respectively
model_checkpoint_name (str) – Path / filename to save model checkpoints
result_fname (str) – Path / filename to save model evaluation metrics

Lung Recommended Procedure Model

This model classifies whether a chest CT is recommended for a report with lung findings. It is run if the Lung vs Adrenal Findings model identifies lung findings. It is a BiLSTM model.

Training was run via the script:

python -m nmrezman.phase01.train.train_lung_recommended_proc_model --data_path /path/to/data/reports_df.gz --model_checkpoint_name /path/to/results/phase01/lung_recommend/lung_recommend_best_model.h5 --result_fname /path/to/results/phase01/lung_recommend/lung_recommend_best_result.log --tokenizer_fname /path/to/results/phase01/findings/tokenizer.gz

nmrezman.phase01.train.general.train_lung_recommended_proc_model(data_path: str, model_checkpoint_name: str = 'lung_recommend_best_model.h5', result_fname: str = 'lung_recommend_best_result.log', tokenizer_fname: str = 'tokenizer.gz')

Trains the Lung Recommended Procedure Phase 01 BiLSTM model. Recommends “Chest CT” or “Ambiguous” procedure for “Lung Findings”.

Parameters

data_path (str) – Path to the dataframe file with the preprocessed impressions and labels in new_note and selected_finding columns, respectively
model_checkpoint_name (str) – Path / filename to save model checkpoints
result_fname (str) – Path / filename to save model evaluation metrics
tokenizer_fname (str) – Path / filename to save tokenizer