Code Overview

Code Summary

The code is split into two main sections: Phase 01 and Phase 02. Phase 01 models include stacked biLSTM models used in the first phase of deployment. Phase 02 models include transformer models pretrained on NM radiology reports and later finetuned for various tasks in the pipeline. The block diagram below shows the different models for each phase of the project.

Note

Phase 01 refers to the original models deployed. Phase 02 refers to the updated models, which were refined after the initial clinical deployment with the aim of improving scalability and model performance. Moreover, we leveraged the latest advances for deep learning NLP. With respect to machine learning frameworks, Phase 01 utilizes the Tensorflow and Keras libraries, while Phase 02 leverages the 🤗 (Hugging Face) platform.

Phase 01: BiLSTM and XGBoost Models

Explore the source code for training the models and using them to classify reports. These were the initial models used.

Phase 02: MLM Models

Explore the source code for training the new models and using them to classify reports. These models are currently implemented at NM as part of a Result Management system.

Code Organization

The code is organized as shown below (condensed such that python __init__.py files, etc. are not included). Note that files train/train_**.py are ease-of-use scripts, which train models that are defined in train/general.py. Likewise, the run_classifier.py files are scripts to easily classify raw radiology report text when provided with trained model weights.

src/
├─nmrezman/
│ ├─phase01/
│ │ ├─classify/
│ │ │ ├─classifier.py
│ │ │ └─run_classifier.py
│ │ ├─train/
│ │ │ ├─general.py
│ │ │ ├─train_comment.py
│ │ │ ├─train_findings.py
│ │ │ ├─train_lung_adrenal.py
│ │ │ └─train_lung_recommended_proc.py
│ │ └─models.py
│ ├─phase02/
│ │ ├─classify/
│ │ │ ├─classifier.py
│ │ │ └─run_classifier.py
│ │ └─train/
│ │   ├─general.py
│ │   ├─pretrain.py
│ │   ├─train_comment.py
│ │   ├─train_findings.py
│ │   └─train_lung_recommended_proc.py
│ └─utils.py
└─setup.py

Using This Code

This documentation provides the source code used to train all models, which can be modified to fit your needs. There are a several different ways you could go about this.

Training can be run from a cloned repo by running the script as a module. For example, to train the Phase 01 Findings vs No Findings model, use the command:

cd src
python -m nmrezman.phase01.train.train_findings --data_path /path/to/data/reports_df.gz --glove_embedding_path /path/to/data/glove.6B.300d.txt --model_checkpoint_name /path/to/results/phase01/findings/findings_best_model.h5 --result_fname /path/to/results/phase01/findings/findings_best_result.log --tokenizer_fname /path/to/results/phase01/findings/tokenizer.gz

Directly run the scripts or import the functions into python once nmrezman has been pip installed as a python package from either GitHub directly or, if the repo is cloned locally, from the local directory. See the commands below.

pip install "git+https://github.com/mozzilab/NM_Radiology_AI.git@main#egg=nmrezman"

or if the repo is already installed locally

pip install /path/to/repo/NM_Radiology_AI

Once pip installed, the training functions can be imported directly.

from nmrezman.phase01.train.general import train_findings_model

result = train_findings_model(
      data_path="/path/to/data/reports_df.gz",
      glove_embedding_path="/path/to/data/glove.6B.300d.txt",
      model_checkpoint_name="/path/to/results/phase01/findings/findings_best_model.h5",
      result_fname="/path/to/results/phase01/findings/findings_best_result.log",
      tokenizer_fname="/path/to/results/phase01/findings/tokenizer.gz",
)

Last but not least, use the pre-built container with everything packaged in, ready to go. The image contains the complete environment used to build these models as well as a click-through walkthrough to get you started. The source code for the container image is available in our github repo, and the pre-built image is publicly available on our docker-hub repository, mozzilab/nmrezman.

The only requirements are that docker is installed and all the required drivers are up to date.

GPU command (suggested):
```
docker run -it --rm --net=host -e ip_addr=${IP_ADDR} --ulimit memlock=-1 --gpus all mozzilab/nmrezman:latest
```
CPU command (suggested):
```
docker run -it --rm --net=host -e ip_addr=${IP_ADDR} mozzilab/nmrezman:latest
```
Required args:
- -it - opens an interactive tty, effectively it just takes you straight to the cmd line inside the container
- -net=host - binds the host computer’s network to the container, so all ports are inherently exposed. You can specify specific ports for Jupyter and code server by including the port binding(s) in the format -p 8081:8081 along the environmental variable(s) -e VSCODE_PORT=8081 & -e JUPYTER_PORT=8889
- mozzilab/nmrezman:latest - name of the container image
- --gpus all - if using GPU(s) to run the model, you must include this flag
- --ulimit memlock=-1 - prevent the locking of shared memory, need if running GPUs
Optional args:
- --mount type=bind,src=${PATH_TO_DATA},dst=/workspace/data - bind a folder into the /workspace/data folder in the container.
- --rm - sets container to be ephemeral, so all resources are disposed of upon the container being stopped.
- -e ip_addr=${IP_ADDR} - only if operating on remote machine, will make the ip:port auto-print message work nicely (ctrl-click).

Warning

The code will likely need to be modified to suit your needs (at a minimum, preprocessing raw reports and dataframe structuring). Generalizability of this code to other health care systems is not guaranteed and only reflects the 10 hospitals and one electronic medical record for which it was tested. However, modifying the code (e.g., preprocessing, base model checkpoints, model constants) may yield similar results.