Model Performance Comparison

| Setup | F1 Score (%) | Recall (%) | Precision (%) | Examples/sec | Configuration |
|---|---|---|---|---|---|
| 1a-modernbert-base | 86.0 | 90.3 | 82.2 | 118 | transformers pipeline, batch size 128 |
| 1b-modernbert-large | 89.2 | 91.8 | 86.8 | 87 | transformers pipeline, batch size 128 |
| 2-DSPy-25-threads-Llama-3.2-3B-Instruct | 80.7 | 87.9 | 74.6 | 4 | DSPy, 25 threads, vLLM OpenAI server with default settings |
| 3-Llama-3.2-3B-Instruct-LoRA | 93.1 | 92.0 | 94.2 | 152 | vLLM, default settings |

*Speed of setup 2 is limited by DSPy. Speeds similar to setup 3 can be achieved with efficient batching.
Hugging Face recently released ModernBERT (Warner et al. 2024), an updated version of the BERT language model (Devlin 2018) that backports many improvements from LLM research to the classic 2018 architecture. In contrast to LLMs, ModernBERT is an encoder-only model fitted with a task-specific head that outputs class probabilities for structured NLP tasks rather than tokens.
While LLMs with their decoder-only architecture were originally designed for text generation, they have also been used for structured NLP tasks like text classification. They are imbued with a large amount of general knowledge and excel at zero-shot and few-shot learning. Through the proliferation of the LLM ecosystem they are also widely available via APIs and familiar to many developers.
Here, I compare ModernBERT to Meta’s Llama 3.2-3B (Grattafiori et al. 2024) on a text classification task along four dimensions: accuracy, speed, cost, and ease of use. Text classification is a simple yet very common and important task in NLP pipelines. It may also be coupled with text generation in a chatbot, for example for intent classification or as a guardrail to prevent undesirable responses.
Task: Adverse event classification
During my work in market research for pharmaceutical companies, I frequently have to monitor data for adverse events. An adverse event is any undesirable medical event that occurs during or after treatment with a drug. Examples include side effects, lack of efficacy, and overdoses. It is of utmost importance to identify adverse events and report them to the pharmaceutical company that produces the drug. This task is labor intensive, so naturally I’m interested in automating it. I’ll use the ADE-Benchmark Corpus (Gurulingappa et al. 2012) as an example dataset. It contains 23,516 English sentences from medical texts describing effects of drugs. Each sentence is classified as 1: adverse drug reaction or 0: no adverse drug reaction. This represents a subtask of the broader task of adverse event monitoring.
| Resource | Link |
|---|---|
| 💻 Python code & readme | GitHub |
| 📊 Experiment results | Weights & Biases project |
| 📝 Dataset: ADE-Benchmark Corpus | Hugging Face Hub |
All training and inference runs on a single A10G GPU hosted on Modal, which costs $1.10/hour. A Modal account is required to run the code; the free tier ($30 of free credits per month) is sufficient for this experiment.
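To give a sense of the setup, here is a minimal sketch of how such a GPU function can be declared with Modal. The app name, image contents, and function body are my own placeholders; the real code is in the GitHub repo.

```python
import modal

# Container image with the core libraries used in the experiments.
image = modal.Image.debian_slim().pip_install("torch", "transformers", "datasets")

app = modal.App("ade-classification")  # hypothetical app name

@app.function(gpu="A10G", image=image, timeout=60 * 60)
def train():
    # Training or inference code runs here, on a single cloud-hosted A10G.
    import torch
    print(torch.cuda.get_device_name(0))

@app.local_entrypoint()
def main():
    train.remote()  # launch the remote GPU job with: modal run this_file.py
```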
Experiment setup
The diagram below illustrates three experiment setups: fine-tuning ModernBERT, few-shot learning with Llama 3.2-3B, and fine-tuning Llama 3.2-3B.
Dataset preparation
The dataset on Hugging Face consists of 23,516 sentences. After removing duplicate sentences, 20,896 unique examples remain. The class distribution is uneven, with more examples of texts without an adverse event. To balance the classes, I subsample the negative examples down to 4,271 cases. Balanced classes prevent the models from overfitting to the majority class and let us compare the models using a simple accuracy metric.
Then, the dataset is split into 60% training, 20% validation and 20% test sets. The validation set is used to tune hyperparameters and implement early stopping. Splits are stratified by class to ensure a 50:50 split between positive and negative examples in each split. The final example count is shown below (a code sketch of this preparation follows the table):
| Split | Class | Examples |
|---|---|---|
| Training | Adverse Event | 2,562 |
| Training | No Adverse Event | 2,562 |
| Validation | Adverse Event | 855 |
| Validation | No Adverse Event | 855 |
| Test | Adverse Event | 854 |
| Test | No Adverse Event | 854 |
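As a rough sketch of this preparation, assuming the corpus is loaded from the Hugging Face Hub with `text` and `label` columns (the dataset id, column names, and random seed below are assumptions; the exact steps live in the GitHub repo):

```python
import pandas as pd
from datasets import load_dataset
from sklearn.model_selection import train_test_split

# Load the classification subset of the ADE corpus and drop duplicate sentences.
df = load_dataset("ade_corpus_v2", "Ade_corpus_v2_classification", split="train").to_pandas()
df = df.drop_duplicates(subset="text")  # 23,516 -> 20,896 unique examples

# Balance the classes by subsampling the (more frequent) negative examples.
pos = df[df["label"] == 1]
neg = df[df["label"] == 0].sample(n=len(pos), random_state=42)
balanced = pd.concat([pos, neg])

# 60/20/20 train/validation/test split, stratified by class.
train_df, rest = train_test_split(balanced, test_size=0.4, stratify=balanced["label"], random_state=42)
val_df, test_df = train_test_split(rest, test_size=0.5, stratify=rest["label"], random_state=42)
```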
Model selection
I’m comparing ModernBERT-base and ModernBERT-large as the encoder-only models against Llama 3.2-3B-Instruct as the LLM.
| Model | Architecture | Parameters | Size at FP32 |
|---|---|---|---|
| ModernBERT-base | Encoder-only: outputs a probability distribution over classes | 149M | ~0.6GB |
| ModernBERT-large | Encoder-only: outputs a probability distribution over classes | 395M | ~1.6GB |
| Llama 3.2-3B | Decoder-only: outputs text | 3B | ~12GB |
For inference, roughly 1.5 to 2x the model size in memory is needed to hold the weights plus the attention (KV) cache, layer activations, and other intermediate results. The A10G GPU used for this experiment has 24GB of memory, so both models fit. The memory footprint can be roughly halved with FP16, or quartered with INT8 precision, which is common for inference.
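As a back-of-the-envelope check of these numbers (the 2x overhead factor is the rough rule of thumb from above, not a measurement):

```python
def inference_memory_gb(n_params: float, bytes_per_param: int = 4, overhead: float = 2.0) -> float:
    """Rough GPU memory estimate: weights * precision * overhead for cache and activations."""
    return n_params * bytes_per_param * overhead / 1e9

print(inference_memory_gb(149e6))                   # ModernBERT-base, FP32:  ~1.2 GB
print(inference_memory_gb(395e6))                   # ModernBERT-large, FP32: ~3.2 GB
print(inference_memory_gb(3e9))                     # Llama 3.2-3B, FP32:     ~24 GB
print(inference_memory_gb(3e9, bytes_per_param=2))  # Llama 3.2-3B, FP16:     ~12 GB
```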
Setup 1: Fine-tuning ModernBERT
I’m using the transformers library to fine-tune ModernBERT base and large on the training set. Schmid (2024) from Hugging Face wrote a helpful guide which I adapted for use on Modal. The models are optimized on binary cross-entropy loss for 5 epochs. Training took about 2 minutes for ModernBERT-base and 3.5 minutes for ModernBERT-large.
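A condensed sketch of that fine-tune, reusing the `train_df`/`val_df` frames from the preparation sketch above (hyperparameters other than the 5 epochs are placeholders, not the exact values from the guide):

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True)

train_ds = Dataset.from_pandas(train_df).map(tokenize, batched=True)
val_ds = Dataset.from_pandas(val_df).map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="modernbert-ade",
    num_train_epochs=5,              # 5 epochs, as used in this experiment
    per_device_train_batch_size=32,  # placeholder batch size
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds,
                  eval_dataset=val_ds, processing_class=tokenizer)
trainer.train()
```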
Setup 2: Few-shot learning with Llama 3.2-3B and DSPy
I’m using DSPy (Khattab et al. 2023) to automatically select an optimal set of examples for few-shot learning. That’s a more objective approach than manual prompting and usually results in equally good or better accuracy. In my first trials, DSPy didn’t manage to write a suitable system prompt as it didn’t understand the adverse drug reaction task from examples alone. So I added the prompt: “Determine if the following sentence is about adverse drug reactions:” to the examples. This increased the accuracy by about 15 percentage points.
DSPy settings:
- 20 few-shot examples plus 5 bootstrapped (AI-generated) examples
- Optimized for accuracy using MIPROv2 (minibatch size 50, minibatch full eval steps 10, num trials 3)
- 25 threads for calls to the LLM, which is hosted using FastAPI and vLLM on Modal
The optimized predictor is available as a JSON file in the Weights & Biases project.
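The DSPy program itself boils down to a typed signature plus the MIPROv2 optimizer. The sketch below mirrors the settings above, but the exact argument placement varies between DSPy versions, so treat it as an outline rather than the repo's code; the server URL and field names are placeholders.

```python
import dspy
from dspy.teleprompt import MIPROv2

# Point DSPy at the vLLM OpenAI-compatible server (URL is a placeholder).
lm = dspy.LM("openai/meta-llama/Llama-3.2-3B-Instruct",
             api_base="http://localhost:8000/v1", api_key="EMPTY")
dspy.configure(lm=lm)

class AdeClassification(dspy.Signature):
    """Determine if the following sentence is about adverse drug reactions:"""
    sentence: str = dspy.InputField()
    is_adverse_event: bool = dspy.OutputField()

program = dspy.Predict(AdeClassification)

# trainset: a list of dspy.Example(sentence=..., is_adverse_event=...).with_inputs("sentence")
def accuracy(example, prediction, trace=None):
    return example.is_adverse_event == prediction.is_adverse_event

optimizer = MIPROv2(metric=accuracy, num_threads=25)
optimized = optimizer.compile(
    program,
    trainset=trainset,        # built from the training split
    max_labeled_demos=20,     # 20 few-shot examples
    max_bootstrapped_demos=5, # 5 bootstrapped examples
    num_trials=3,
    minibatch_size=50,
    minibatch_full_eval_steps=10,
)
optimized.save("optimized_predictor.json")
```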
Setup 3: Fine-tuning Llama 3.2-3B
I’m using the torchtune library with a fine-tuning configuration to train a LoRA adapter on the training set. The adapter targets the attention and feed-forward layers of the model and consists of a small set of extra weights applied on top of the base model at inference time. LoRA training costs less than full fine-tuning of all weights, but may result in lower accuracy. The LoRA settings used for training are available in the W&B project and in the torchtune training config file. Training took about 8 minutes on the A10G.
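Once the adapter is trained, it can be loaded into vLLM for inference. A sketch, with the adapter path as a placeholder and a plain prompt standing in for the full Llama chat template used during fine-tuning:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Base model with LoRA support enabled; the adapter path is a placeholder.
llm = LLM(model="meta-llama/Llama-3.2-3B-Instruct", enable_lora=True)
params = SamplingParams(temperature=0.0, max_tokens=4)  # greedy decoding, short label output

prompts = [
    "Determine if the following sentence is about adverse drug reactions. Answer 1 or 0.\n"
    "Sentence: The patient developed severe nausea after the first dose."
]
outputs = llm.generate(prompts, params,
                       lora_request=LoRARequest("ade_lora", 1, "/path/to/lora_adapter"))
print(outputs[0].outputs[0].text)
```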
Results
Accuracy and speed
ModernBERT-base and ModernBERT-large performed similarly. Large has a 3.2 percentage point advantage in F1 score but at the cost of 27% slower inference. The few-shot approach doesn’t need nearly as much training data, but also results in a less accurate model. It also ran slowly, and this didn’t change when I increased the number of threads used by DSPy to communicate with the vLLM server. I suspect it’s due to inefficient batch inference code in DSPy. Higher speeds could be achieved with a more efficient batching approach.
The clear winner of the experiment is the LoRA fine-tuned Llama 3.2-3B. It’s the most accurate and the fastest. This is largely down to vLLM being extremely well optimized, including CUDA graph capture ahead of inference time. If that roughly 30-second preparation time is included, its throughput drops to 42 examples per second.
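That figure is roughly consistent with measuring throughput over the 1,708-example test set (my assumption about how it was computed):

```python
test_examples = 854 + 854              # test split size
generation_time = test_examples / 152  # ~11 s at the steady-state rate
total_time = generation_time + 30      # plus ~30 s of CUDA graph capture / warm-up
print(test_examples / total_time)      # ~41 examples/sec, close to the reported 42
```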
Cost and effort
All setups can be trained for under one dollar and in less than 15 minutes, so the cost differences are negligible. What matters more is the time spent setting up training and inference. The transformers library and the tutorial made it very easy to fine-tune ModernBERT and run inference. A major plus is that, thanks to its small size, ModernBERT can also run on a CPU at good speed. DSPy was more involved because it also required setting up a vLLM server; this step is easier with a managed service like Fireworks AI. Fine-tuning Llama 3.2-3B was the most involved, as it required formatting the data in a chat format and working through the detailed configuration of the torchtune library and vLLM. Still, it only took a few hours, and this step is also easier with a managed service.
Discussion
Implications for NLP
Fine-tuning vs prompt-based approaches
Fine-tuning continues to outperform purely prompt-based approaches, even when those are optimized with automated prompt engineering. If you have enough examples to fine-tune on, it’s a good idea to do so. Still, the recall achieved by the few-shot approach is impressive and can serve as a strong baseline and starting point when developing text classification systems.
Model size and architecture
When fine-tuning, model size is a key factor. Here, ModernBERT did well and is a strong choice for text classification and other structured NLP tasks. ModernBERT-large offers a modest accuracy improvement in exchange for slower inference. However, Llama 3.2-3B with a fine-tuned LoRA adapter outperformed both ModernBERT variants in accuracy in this experiment. Its decoder-only architecture is, in theory, less suited to structured tasks. Did it win by sheer size? It would be interesting to see what a ModernBERT-3B or -8B model would achieve. For the related task of sentiment analysis, Zhou et al. (2024) found that gains from scaling decoder-only models level off at around 8 billion parameters.
Processing speed
Processing speed is highly sensitive to the hardware and inference setup. Thanks to vLLM’s CUDA graph capturing and other optimizations, the Llama 3.2-3B LoRA adapter ran faster than the ModernBERT models in this experiment, despite its size. Perhaps the efficiency optimizations made in LLM research could be backported to encoder-only models too, just like ModernBERT backported training techniques from LLMs back to the classic 2018 model. Note that this speed comparison was not comprehensive and is dependent on the GPU, the inference library and the exact settings used, such as batch size.
Implications for adverse event monitoring
Greater sensitivity with larger models
The primary metric for adverse event monitoring is sensitivity, as missing a true adverse event is much more costly than flagging a false positive. The results show a sensitivity (recall) of 92% for detecting adverse drug reactions in medical texts with the LoRA fine-tuned Llama 3.2-3B. This outperforms previous approaches based on a convolutional neural network (Huynh et al. 2016, 89%) and BERT sentence embeddings (Haq, Kocaman, and Talby 2022, 85%). This advance is a step towards an automated adverse event monitoring system. With larger models and more training data, sensitivity could likely be improved further.
Towards a production system
A production system for automated adverse event monitoring would need a more comprehensive approach:
- Adjustable threshold for flagging adverse events
- Flagging of complex cases for human review
- Tests and training data for other languages and text types, such as case reports, social media posts, and interview transcripts
- Tests and training data for other adverse event types, such as overdose, lack of efficacy, and use during pregnancy or breastfeeding
None of these require new breakthroughs in AI; they are doable with current technology.
Preview image generated with FLUX.1-schnell and DiffusionBee.