# Demo: Pretraining the NM Results Management Language Model with Custom Corpus

For Masked Language Modeling (MLM), we randomly mask some tokens by replacing them with [MASK], and the labels are adjusted so that the loss is computed only on the masked tokens. In this example, we use a sample of ten radiology reports to pretrain from an initial RoBERTa checkpoint.
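The label convention can be sketched in a few lines of plain Python. This is an illustrative toy, not the 🤗 implementation (the real data collator used later in this demo also sometimes substitutes a random token or leaves the token unchanged): labels keep the original id only at masked positions and use -100 everywhere else, so the loss ignores unmasked tokens.

```python
import random

def mask_tokens(input_ids, mask_token_id, mlm_probability=0.15, ignore_index=-100):
    """Toy MLM masking: hide ~15% of tokens and build labels for only those positions."""
    masked_inputs = list(input_ids)
    labels = [ignore_index] * len(input_ids)  # loss is skipped wherever labels stay -100
    for i, token_id in enumerate(input_ids):
        if random.random() < mlm_probability:
            labels[i] = token_id              # predict the original token here
            masked_inputs[i] = mask_token_id  # hide it in the input
    return masked_inputs, labels

# Toy token ids; 103 stands in for the [MASK] id
inputs, labels = mask_tokens([101, 2054, 2003, 1037, 3231, 102], mask_token_id=103)
```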

First, the data is loaded. For pretraining, we are only concerned with the radiology report. We will pretrain the model to predict masked words in the report.

[1]:

import os
import joblib
from IPython.display import display, HTML

# Define the path to the data
base_path = os.getcwd()
data_path = os.path.abspath(os.path.join(base_path, "..", "demo_data.gz"))

# Import data
modeling_df = joblib.load(data_path)
display(HTML(modeling_df.head().to_html()))
rpt_num note selected_finding selected_proc selected_label new_note
0 1 PROCEDURE: CT CHEST WO CONTRAST. HISTORY: Wheezing TECHNIQUE: Non-contrast helical thoracic CT was performed. COMPARISON: There is no prior chest CT for comparison. FINDINGS: Support Devices: None. Heart/Pericardium/Great Vessels: Cardiac size is normal. There is no calcific coronary artery atherosclerosis. There is no pericardial effusion. The aorta is normal in diameter. The main pulmonary artery is normal in diameter. Pleural Spaces: Few small pleural calcifications are present in the right pleura for example on 2/62 and 3/76. The pleural spaces are otherwise clear. Mediastinum/Hila: There is no mediastinal or hilar lymph node enlargement. Subcentimeter minimally calcified paratracheal lymph nodes are likely related to prior granulomas infection. Neck Base/Chest Wall/Diaphragm/Upper Abdomen: There is no supraclavicular or axillary lymph node enlargement. Limited, non-contrast imaging through the upper abdomen is within normal limits. Mild degenerative change is present in the spine. Lungs/Central Airways: There is a 15 mm nodular density in the nondependent aspect of the bronchus intermedius on 2/52. The trachea and central airways are otherwise clear. There is mild diffuse bronchial wall thickening. There is a calcified granuloma in the posterior right upper lobe. The lungs are otherwise clear. CONCLUSIONS: 1. There is mild diffuse bronchial wall thickening suggesting small airways disease such as asthma or bronchitis in the appropriate clinical setting. 2. A 3 mm nodular soft tissue attenuation in the nondependent aspect of the right bronchus intermedius is nonspecific, which could be mucus or abnormal soft tissue. A follow-up CT in 6 months might be considered to evaluate the growth. 3. Stigmata of old granulomatous disease is present. 
&#x20; FINAL REPORT Attending Radiologist: Lung Findings CT Chest A 3 mm nodular soft tissue attenuation in the nondependent aspect of the right bronchus intermedius is nonspecific, which could be mucus or abnormal soft tissue. A follow-up CT in 6 months might be considered to evaluate the growth. support devices: none. heart/pericardium/great vessels: cardiac size is normal. there is no calcific coronary artery atherosclerosis. there is no pericardial effusion. the aorta is normal in diameter. the main pulmonary artery is normal in diameter. pleural spaces: few small pleural calcifications are present in the right pleura for example on 2/62 and 3/76. the pleural spaces are otherwise clear. mediastinum/hila: there is no mediastinal or hilar lymph node enlargement. subcentimeter minimally calcified paratracheal lymph nodes are likely related to prior granulomas infection. neck base/chest wall/diaphragm/upper abdomen: there is no supraclavicular or axillary lymph node enlargement. limited, non-contrast imaging through the upper abdomen is within normal limits. mild degenerative change is present in the spine. lungs/central airways: there is a 15 mm nodular density in the nondependent aspect of the bronchus intermedius on 2/52. the trachea and central airways are otherwise clear. there is mild diffuse bronchial wall thickening. there is a calcified granuloma in the posterior right upper lobe. the lungs are otherwise clear. conclusions: 1. there is mild diffuse bronchial wall thickening suggesting small airways disease such as asthma or bronchitis in the appropriate clinical setting. 2. a 3 mm nodular soft tissue attenuation in the nondependent aspect of the right bronchus intermedius is nonspecific, which could be mucus or abnormal soft tissue. a follow-up ct in 6 months might be considered to evaluate the growth. 3. stigmata of old granulomatous disease is present.
2 3 EXAM: MRI ABDOMEN W WO CONTRAST CLINICAL INDICATION: Cirrhosis of liver without ascites, unspecified hepatic cirrhosis type (CMS-HCC) TECHNIQUE: MRI of the abdomen was performed with and without contrast. Multiplanar imaging was performed. 8.5 cc of Gadavist was administered. COMPARISON: DATE and priors FINDINGS: On limited views of the lung bases, no acute abnormality is noted. There may be mild distal esophageal wall thickening. On the out of phase series, there is suggestion of some signal gain within the hepatic parenchyma. This is stable. A tiny cystic nonenhancing focus is seen anteriorly in the right hepatic lobe (9/10), unchanged. A subtly micronodular hepatic periphery is noted. There are few subtle hypervascular lesions in the right hepatic lobe, without significant washout. The portal vein is patent. Some splenorenal shunting is redemonstrated, similar to the comparison exam. The spleen measures 12.4 cm in length. No focal splenic lesion is appreciated. There are several small renal lesions again seen, many of which again demonstrate T1 shortening. On the postcontrast subtraction series, no obvious enhancement is noted. The adrenal glands and pancreas are intact. There is mild cholelithiasis, without gallbladder wall thickening or pericholecystic fluid. No free abdominal fluid is visualized. IMPRESSION: 1. Stable cirrhotic appearance of the liver. Few subtly hypervascular hepatic lesions do not demonstrate washout, and probably relate to perfusion variants. No particularly suspicious hepatic mass is seen. 2. Mild splenomegaly to 12.4 cm redemonstrated. Splenorenal shunting is again seen. 3. Scattered simple and complex renal cystic lesions, nonenhancing, stable from March 2040. 4. Incidentally, there is evidence of signal gain in the liver on the out of phase series. This occasionally may represent iron overload. &#x20; FINAL REPORT Attending Radiologist: No Findings NaN No label on limited views of the lung bases, no acute abnormality is noted. 
there may be mild distal esophageal wall thickening. on the out of phase series, there is suggestion of some signal gain within the hepatic parenchyma. this is stable. a tiny cystic nonenhancing focus is seen anteriorly in the right hepatic lobe (9/10), unchanged. a subtly micronodular hepatic periphery is noted. there are few subtle hypervascular lesions in the right hepatic lobe, without significant washout. the portal vein is patent. some splenorenal shunting is redemonstrated, similar to the comparison exam. the spleen measures 12.4 cm in length. no focal splenic lesion is appreciated. there are several small renal lesions again seen, many of which again demonstrate t1 shortening. on the postcontrast subtraction series, no obvious enhancement is noted. the adrenal glands and pancreas are intact. there is mild cholelithiasis, without gallbladder wall thickening or pericholecystic fluid. no free abdominal fluid is visualized. impression: 1. stable cirrhotic appearance of the liver. few subtly hypervascular hepatic lesions do not demonstrate washout, and probably relate to perfusion variants. no particularly suspicious hepatic mass is seen. 2. mild splenomegaly to 12.4 cm redemonstrated. splenorenal shunting is again seen. 3. scattered simple and complex renal cystic lesions, nonenhancing, stable from march 2040. 4. incidentally, there is evidence of signal gain in the liver on the out of phase series. this occasionally may represent iron overload.

## Preprocess the Data

First, the impression (i.e., the findings / conclusions section) of the report is extracted, any doctor signatures are removed, and the report is lowercased. This preprocessing section may need to be modified to accommodate your healthcare system’s reports, formatting, etc. The preprocess_note function is adapted from nmrezman.utils.preprocess_input.

[2]:

def keyword_split(x, keywords, return_idx: int = 2):
    """
    Extract portion of string given a list of possible delimiters (keywords) via partition method
    """
    for keyword in keywords:
        if x.partition(keyword)[2] != '':
            return x.partition(keyword)[return_idx]
    return x

def preprocess_note(note):
    """
    Get the impression from the note, remove doctor signature, and lowercase
    """
    impression_keywords = [
        "impression:",
        "conclusion(s):",
        "conclusions:",
        "conclusion:",
        "finding:",
        "findings:",
    ]
    signature_keywords = [
        "&#x20",
    ]
    impressions = keyword_split(str(note).lower(), impression_keywords)
    impressions = keyword_split(impressions, signature_keywords, return_idx=0)
    return impressions

# Preprocess the note
modeling_df["impression"] = modeling_df["note"].apply(preprocess_note)
modeling_df = modeling_df[modeling_df["impression"].notnull()]
modeling_df["impression"] = modeling_df["impression"].apply(lambda x: str(x.encode('utf-8')) +"\n"+"\n")
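As a quick standalone sanity check of the preprocessing, the snippet below repeats the two helpers from the cell above and runs them on a made-up one-line report (the report text is hypothetical, not from the demo data):

```python
def keyword_split(x, keywords, return_idx=2):
    # Return the chosen piece of the first partition that succeeds
    for keyword in keywords:
        if x.partition(keyword)[2] != '':
            return x.partition(keyword)[return_idx]
    return x

def preprocess_note(note):
    # Lowercase, keep text after the impression keyword, drop text after the signature marker
    impression_keywords = ["impression:", "conclusion(s):", "conclusions:", "conclusion:", "finding:", "findings:"]
    signature_keywords = ["&#x20"]
    impressions = keyword_split(str(note).lower(), impression_keywords)
    impressions = keyword_split(impressions, signature_keywords, return_idx=0)
    return impressions

# Hypothetical one-line report
note = "FINDINGS: Lungs clear. IMPRESSION: 1. No acute disease. &#x20; FINAL REPORT Attending Radiologist:"
print(preprocess_note(note))  # prints " 1. no acute disease. "
```

Note that the impression keyword and the signature marker are both removed, leaving only the lowercased impression text.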


Next, the dataset is split into train and test sets, reserving 20% for the test set.

[3]:

from sklearn.model_selection import train_test_split

# Split into train and test data
train, test = train_test_split(modeling_df, test_size=0.2, random_state=7867)
train = train.reset_index()
test = test.reset_index()


The data is then put into 🤗 Datasets to be used with the 🤗 Trainer. This allows the Trainer to easily extract the data and labels.

[4]:

from datasets import Dataset, DatasetDict

# Import the data into a dataset
train_dataset = Dataset.from_pandas(train["impression"].to_frame())
test_dataset = Dataset.from_pandas(test["impression"].to_frame())
dataset = DatasetDict({"train": train_dataset, "test": test_dataset})


## Tokenize the Datasets

First, we define a tokenizer to map words or word fragments to tokens. Here, we use the checkpoint of 🤗’s pretrained DistilRoBERTa base model. Padding is done on the left side since NM radiology reports generally have the findings at the end of the report. Note that you can change out the tokenizer and model to start from a different RoBERTa checkpoint (e.g., roberta-large).

[5]:

from transformers import AutoTokenizer

# Specify the model checkpoint for tokenizing and get tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "distilroberta-base",
    use_fast=True,
)

# Tokenize the impression text, dropping the raw string column
def tokenize_function(examples):
    return tokenizer(examples["impression"])

tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    num_proc=1,
    remove_columns=["impression"],
)


We group the texts together and chunk them into samples of length block_size. We use a block_size of 128, but you can adjust this to your needs. You can skip this step if your dataset is composed of individual sentences. This is ultimately the dataset we will use for training.

[6]:

def group_texts(examples):
    # Sample chunked into size block_size
    block_size = 128

    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])

    # We drop the small remainder. We could add padding if the model supported it rather than dropping it.
    # This represents the maximum length based on the block size
    # You can customize this part to your needs.
    max_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, max_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()

    return result

# Group the text into chunks to get "sentence-like" data structure
lm_dataset = tokenized_dataset.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=1,
)
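To see the chunking behavior in isolation, here is a standalone version of the same logic with a hypothetical block_size of 4 (instead of 128) so the effect is visible on toy data:

```python
def group_texts(examples, block_size=4):
    # Concatenate all lists, then slice into fixed-size chunks,
    # dropping the remainder that does not fill a full block
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = len(concatenated[next(iter(examples))])
    max_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, max_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

# Three "reports" totaling 10 tokens: two full chunks of 4, last 2 tokens dropped
batch = {"input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]]}
chunks = group_texts(batch)
print(chunks["input_ids"])  # prints [[1, 2, 3, 4], [5, 6, 7, 8]]
```

The labels are a copy of the input ids; the data collator in the next section overwrites them during masking.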


## Pretrain the Model

The data_collator is a function that is responsible for taking the samples and batching them into tensors. Here we want to apply the random masking. We could do it as a preprocessing step (like we do for tokenization), but then the tokens would always be masked the same way at each epoch. By doing this step inside the data_collator, we ensure the random masking is done in a new way each time we go over the data.

To do this masking for us, 🤗 provides a DataCollatorForLanguageModeling (see their docs). We can adjust the probability of the masking; here we have chosen a probability of 15%.

[7]:

from transformers import DataCollatorForLanguageModeling

# Define a data collator to accomplish random masking
# By doing this step in the data_collator (vs as a pre-processing step like we do for tokenization),
# we ensure random masking is done in a new way each time we go over the data (i.e., per epoch)
data_collator = DataCollatorForLanguageModeling(
tokenizer=tokenizer,
mlm=True,
mlm_probability=0.15,
)


Here we define the model checkpoint from which we will start training and then begin training using the 🤗 Trainer, which will train according to the parameters specified in the 🤗 TrainingArguments. 🤗 will take care of all the training for us! When done, the last checkpoint will be used as the starting checkpoint for fine-tuning the Lung, Adrenal, or No Findings model and Lung Recommended Procedure model.

[8]:

from transformers import AutoModelForMaskedLM
from transformers import Trainer, TrainingArguments

# Define the model
model = AutoModelForMaskedLM.from_pretrained("distilroberta-base")
# Define the training parameters and 🤗 Trainer
training_args = TrainingArguments(
    output_dir="/path/to/results/phase02/demo",
    overwrite_output_dir=True,
    num_train_epochs=4,
    per_device_train_batch_size=32,
    fp16=True,
    save_steps=2,
    save_total_limit=2,
    evaluation_strategy="epoch",
    seed=1,
    report_to="none",
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset["train"],
    eval_dataset=lm_dataset["test"],
    data_collator=data_collator,
)

# Train!
trainer.train()

Using amp half precision backend
/usr/local/lib/python3.8/dist-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set no_deprecation_warning=True to disable this warning
warnings.warn(
***** Running training *****
Num examples = 10
Num Epochs = 4
Instantaneous batch size per device = 32
Total train batch size (w. parallel, distributed & accumulation) = 32
Total optimization steps = 4

[4/4 00:01, Epoch 4/4]
Epoch Training Loss Validation Loss
1 No log 4.768126
2 No log 2.765447
3 No log 4.451561
4 No log 2.892946

***** Running Evaluation *****
Num examples = 5
Batch size = 8
Saving model checkpoint to /path/to/results/phase02/demo/checkpoint-2
Configuration saved in /path/to/results/phase02/demo/checkpoint-2/config.json
Model weights saved in /path/to/results/phase02/demo/checkpoint-2/pytorch_model.bin
***** Running Evaluation *****
Num examples = 5
Batch size = 8
***** Running Evaluation *****
Num examples = 5
Batch size = 8
Saving model checkpoint to /path/to/results/phase02/demo/checkpoint-4
Configuration saved in /path/to/results/phase02/demo/checkpoint-4/config.json
Model weights saved in /path/to/results/phase02/demo/checkpoint-4/pytorch_model.bin
***** Running Evaluation *****
Num examples = 5
Batch size = 8

Training completed. Do not forget to share your model on huggingface.co/models =)


[8]:

TrainOutput(global_step=4, training_loss=4.260141849517822, metrics={'train_runtime': 1.4005, 'train_samples_per_second': 28.562, 'train_steps_per_second': 2.856, 'total_flos': 1326218065920.0, 'train_loss': 4.260141849517822, 'epoch': 4.0})