Challenges of Training Models on Medical Data

Techniques to tackle Class Imbalance, Multi-Task, and Dataset Size

Rishiraj Acharya
4 min read · Apr 9, 2020

Among the many problems encountered when training algorithms on medical datasets, these three are the most common:

  1. Class Imbalance challenge
  2. Multi-Task challenge
  3. Dataset Size challenge

I will share a few techniques for tackling each of these problems. Let’s take them one by one!

Class Imbalance challenge

In the real world, there are far more healthy people than diseased people, and medical datasets reflect this. Examples of the healthy and diseased classes are not equally distributed, which mirrors the prevalence, or real-world frequency, of the disease. This is not unique to medical data: in datasets for credit card fraud as well, you might see a hundred times as many normal examples as abnormal ones.

As a result, it is easy to be tricked into believing that a model is performing very well when it really isn’t. This can happen when simple metrics like accuracy_score are used. Accuracy isn’t a great metric for this kind of dataset because the labels are heavily skewed, so a neural network that just outputs…
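To make the accuracy pitfall concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the labels are made up (10 diseased cases against 1,000 healthy ones, roughly the hundred-to-one ratio mentioned earlier), and the "model" is a hypothetical baseline that always predicts the healthy class. The `accuracy` helper computes the fraction of correct predictions, the same quantity scikit-learn's `accuracy_score` returns by default.

```python
# Hypothetical labels: 1 = diseased, 0 = healthy (100 healthy cases per diseased case).
y_true = [1] * 10 + [0] * 1000

# Hypothetical baseline that ignores its input and always predicts "healthy".
y_pred = [0] * len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def sensitivity(y_true, y_pred):
    """Fraction of diseased cases (label 1) that were predicted as diseased."""
    preds_for_diseased = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(preds_for_diseased) / len(preds_for_diseased)


prevalence = sum(y_true) / len(y_true)  # share of diseased cases in the data
acc = accuracy(y_true, y_pred)
sens = sensitivity(y_true, y_pred)

print(f"prevalence:  {prevalence:.3f}")
print(f"accuracy:    {acc:.3f}")
print(f"sensitivity: {sens:.3f}")
```

On this toy data the baseline scores about 99% accuracy while detecting none of the diseased cases, so its sensitivity is zero. That gap is why accuracy alone can look impressive on skewed labels.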

Rishiraj Acharya

GDE in ML (Gen AI, Keras) | GSoC '22 at TensorFlow | TFUG Kolkata Organizer | Hugging Face Fellow | Kaggle Master | MLE at Tensorlake, Past - Dynopii, Celebal