Coursework 1:
Submitted for the partial fulfilment of the DSM150 course
By Hendrik Matthys van Rooyen
230221176
DLWP Flow: MedMNIST - OCTMNIST¶
Introduction¶
For this coursework the instructions followed were as follows: ...
The Dataset¶
In preparation for this coursework, many datasets were considered, amongst them:
- Anemia Classification
- Back-order Prediction
- Heart Attack Risk Prediction
- IMDB Recommender
- Text Generation using a corpus
Of these, the Anemia and IMDB datasets yielded simple models, with minimal opportunity to satisfy the requirements for the coursework. Both the Back-order and Heart Attack Risk datasets produced very poor models, barely outperforming the baseline models. Text Generation, although an interesting project, does not fall within the scope of the coursework and was infeasible to achieve within the coursework limitations on the available hardware.
All this considered, the OCTMNIST dataset from the MedMNIST library provides ample opportunity to explore the data and balance the classes, and leaves room for improvement on top of a basic neural network. Simple neural networks like these are not normally considered for image classification, where CNNs are the usual choice, but the DLWP book showed some success in classifying the standard MNIST dataset with one, and this dataset was also considered for that reason.
Prepare Environment¶
Load Packages¶
import medmnist
from medmnist import INFO, Evaluator
import pandas as pd
import numpy as np
from tensorflow.keras.callbacks import ModelCheckpoint
Load Dataset¶
data_flag = 'octmnist'
# data_flag = 'breastmnist'
download = True
NUM_EPOCHS = 3
BATCH_SIZE = 128
lr = 0.001
info = INFO[data_flag]
n_channels = info['n_channels']
n_classes = len(info['label'])
info
{'python_class': 'OCTMNIST', 'description': 'The OCTMNIST is based on a prior dataset of 109,309 valid optical coherence tomography (OCT) images for retinal diseases. The dataset is comprised of 4 diagnosis categories, leading to a multi-class classification task. We split the source training set with a ratio of 9:1 into training and validation set, and use its source validation set as the test set. The source images are gray-scale, and their sizes are (384−1,536)×(277−512). We center-crop the images and resize them into 1×28×28.', 'url': 'https://zenodo.org/record/6496656/files/octmnist.npz?download=1', 'MD5': 'c68d92d5b585d8d81f7112f81e2d0842', 'task': 'multi-class', 'label': {'0': 'choroidal neovascularization', '1': 'diabetic macular edema', '2': 'drusen', '3': 'normal'}, 'n_channels': 1, 'n_samples': {'train': 97477, 'val': 10832, 'test': 1000}, 'license': 'CC BY 4.0'}
DataClass = getattr(medmnist, info['python_class'])
# Load each split
train_data = DataClass(split='train', download=True)
val_data = DataClass(split='val', download=True)
test_data = DataClass(split='test', download=True)
Using downloaded and verified file: C:\Users\mvanr\.medmnist\octmnist.npz Using downloaded and verified file: C:\Users\mvanr\.medmnist\octmnist.npz Using downloaded and verified file: C:\Users\mvanr\.medmnist\octmnist.npz
Investigate Dataset¶
In investigating the dataset, we inspect samples from the data to determine whether there are any notable differences or similarities between entries that could inform the design of the network or the handling of the data. We also investigate the distribution of the entries across classes.
import matplotlib.pyplot as plt
def display_images_with_label(images, labels, info, target_label, num_images, seed=None):
    # Set the random seed if provided.
    if seed is not None:
        np.random.seed(seed)
    # Ensure the labels are numpy arrays for boolean indexing.
    labels = np.array(labels)
    images = np.array(images)
    # Find the indices of all images that match the target label.
    matching_indices = np.where(labels == target_label)[0]
    # Select a random subset of these indices.
    if len(matching_indices) >= num_images:
        selected_indices = np.random.choice(matching_indices, size=num_images, replace=False)
    else:
        selected_indices = matching_indices  # If there aren't enough, select them all.
    # Set up the plot size.
    plt.figure(figsize=(2 * num_images, 2))  # You can adjust the figure size as needed.
    # Create a subplot for each selected image.
    for i, index in enumerate(selected_indices, 1):
        ax = plt.subplot(1, num_images, i)
        plt.imshow(images[index], cmap=plt.cm.binary)
        plt.axis('off')  # Hide the axis to put more focus on the images.
        # Set the title for the first image only.
        if i == 1:
            ax.set_title(f'{info["label"][str(target_label)]}')
    plt.show()
display_images_with_label(test_data.imgs,test_data.labels,info,0,4,42)
display_images_with_label(test_data.imgs,test_data.labels,info,1,4,42)
display_images_with_label(test_data.imgs,test_data.labels,info,2,4,42)
display_images_with_label(test_data.imgs,test_data.labels,info,3,4,42)
From the samples above, a challenge that forms a larger theme in this project becomes apparent: the similarity between the drusen and the normal images.
import matplotlib.pyplot as plt
unique_labels, counts = np.unique(train_data.labels, return_counts=True)
label_names = [info['label'][str(label)] for label in unique_labels]
# Plotting the distribution of labels
plt.figure(figsize=(10, 6))
plt.bar(label_names, counts, color='skyblue')
plt.xlabel('Labels')
plt.ylabel('Frequency')
plt.title('Distribution of Labels in the Training Dataset')
plt.xticks(rotation=45)
plt.show()
The graph above shows a large imbalance in the number of entries per label, with normal having the most and drusen the fewest. This compounds the problem of the two classes looking very similar.
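The imbalance visible in the bar chart can also be expressed as numbers. The following is a minimal sketch, not part of the original notebook: `summarize_class_balance` and `toy_labels` are hypothetical names, and the toy labels only stand in for `train_data.labels`, which MedMNIST stores with shape `(N, 1)`.

```python
import numpy as np

# Hypothetical labels standing in for train_data.labels (shape (N, 1)).
toy_labels = np.array([[3], [3], [3], [3], [0], [0], [0], [1], [1], [2]])

def summarize_class_balance(labels):
    """Return classes, per-class counts, proportions, and the max/min ratio."""
    classes, counts = np.unique(np.asarray(labels).ravel(), return_counts=True)
    proportions = counts / counts.sum()
    imbalance_ratio = counts.max() / counts.min()
    return classes, counts, proportions, imbalance_ratio

classes, counts, proportions, imbalance_ratio = summarize_class_balance(toy_labels)
for c, n, p in zip(classes, counts, proportions):
    print(f"class {c}: {n} samples ({p:.1%})")
print(f"imbalance ratio (largest / smallest): {imbalance_ratio:.1f}")
```

Calling the same function on `train_data.labels` would give the exact counts behind the chart, which makes it easier to judge how much balancing is needed.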
import matplotlib.pyplot as plt
unique_labels, counts = np.unique(np.concatenate((train_data.labels, val_data.labels, test_data.labels), axis=0), return_counts=True)
label_names = [info['label'][str(label)] for label in unique_labels]
# Plotting the distribution of labels
plt.figure(figsize=(10, 6))
plt.bar(label_names, counts, color='skyblue')
plt.xlabel('Labels')
plt.ylabel('Frequency')
plt.title('Distribution of Labels in the Full Dataset')
plt.xticks(rotation=45)
plt.show()
Prepare Data¶
In preparation for training the network, we compute the class weight of each label in order to experiment with balancing the importance of each class during training.
To balance the classes, we augment the images of the classes with lower representation.
Finally, the image pixel values, which for 8-bit images range from 0 to 255, are divided by 255 in order to normalize them to between 0 and 1.
from sklearn.utils.class_weight import compute_class_weight
class_weights = compute_class_weight(
class_weight='balanced',
classes=np.unique(train_data.labels.flatten()),
y=train_data.labels.flatten()
)
class_weight_dict = dict(enumerate(class_weights / 4))
class_weight_dict
{0: 0.18194697467447138, 1: 0.596525261921081, 2: 0.7856993164818158, 3: 0.13236676009212184}
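For reference, the `'balanced'` heuristic in scikit-learn assigns each class the weight `n_samples / (n_classes * count_c)`, so rarer classes receive larger weights. The sketch below reproduces that formula with NumPy alone on hypothetical toy labels; `balanced_class_weights` and `toy_labels` are illustrative names, not part of the original notebook.

```python
import numpy as np

def balanced_class_weights(labels):
    """Compute n_samples / (n_classes * count_c) for every class c."""
    y = np.asarray(labels).ravel()
    classes, counts = np.unique(y, return_counts=True)
    weights = y.size / (classes.size * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Hypothetical labels: class 0 is six times as common as class 2.
toy_labels = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])
weights = balanced_class_weights(toy_labels)

# Dividing every weight by the number of classes, as the notebook does with
# "/ 4", rescales all weights by the same constant and keeps their ratios.
scaled = {c: w / len(weights) for c, w in weights.items()}

print(weights)
print(scaled)
```

Because the division is a uniform rescaling, the relative importance of the classes is unchanged; only the overall magnitude of the weighted loss is affected.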
from keras.preprocessing.image import ImageDataGenerator
import numpy as np
def augment_images_to_balance_classes(images, labels, batch_size, seed):
    images = images.reshape((-1, 28, 28, 1))
    # Find the label with the most entries
    (unique, counts) = np.unique(labels, return_counts=True)
    max_count = np.max(counts)
    class_indices = {label: np.where(labels == label)[0] for label in unique}
    max_label = unique[np.argmax(counts)]
    # Initialize the image data generator
    datagen = ImageDataGenerator(
        rotation_range=20,
        width_shift_range=0.2,
        height_shift_range=0.2,
        shear_range=0.2,
        zoom_range=0.2,
        horizontal_flip=True,
        fill_mode='nearest'
    )
    # Seed for reproducibility
    np.random.seed(seed)
    # Initialize lists for the balanced dataset
    balanced_images = list(images)
    balanced_labels = list(labels)
    # Augment data for classes with fewer images than the max_count
    for label, indices in class_indices.items():
        if label == max_label:
            continue  # Skip the class with the most samples
        num_to_augment = max_count - counts[unique.tolist().index(label)]
        # Augment images until the class has the same number of images as the max_count
        augmentation_factor = int(np.ceil(num_to_augment / len(indices)))
        for i in range(augmentation_factor):
            for index in indices:
                image_to_augment = images[index].reshape((1, *images[index].shape))
                iterator = datagen.flow(image_to_augment, batch_size=1, seed=seed)
                for _ in range(min(num_to_augment, batch_size)):
                    augmented_image = next(iterator)[0].astype('uint8')
                    balanced_images.append(augmented_image)
                    balanced_labels.append(label)
                    num_to_augment -= 1
                if num_to_augment <= 0:
                    break
            if num_to_augment <= 0:
                break
    return np.array(balanced_images), np.array(balanced_labels)
train_images, train_labels = augment_images_to_balance_classes(train_data.imgs, train_data.labels.flatten(), batch_size=32, seed=42)
train_images = train_images / 255.0
val_images = val_data.imgs / 255.0
val_labels = val_data.labels
test_images = test_data.imgs / 255.0
test_labels = test_data.labels
print('tensor shape:')
print('\ttraining images:', train_images.shape)
print('\ttraining labels:', train_labels.shape)
print('\tvalidation images:', val_images.shape)
print('\tvalidation labels:', val_labels.shape)
print('\ttesting images:', test_images.shape)
print('\ttesting labels:', test_labels.shape)
tensor shape:
    training images: (184104, 28, 28, 1)
    training labels: (184104,)
    validation images: (10832, 28, 28)
    validation labels: (10832, 1)
    testing images: (1000, 28, 28)
    testing labels: (1000, 1)
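The shapes above show that the augmented training images carry a trailing channel axis and flat labels, while the validation and test splits do not. If a later model expects all splits in the same layout (an assumption, since the network is not defined in this section), one possible way to align them is sketched below. `to_model_input`, `to_flat_labels`, and the toy arrays are hypothetical.

```python
import numpy as np

def to_model_input(images):
    """Scale 8-bit images to float32 in [0, 1] and ensure a trailing channel axis."""
    x = np.asarray(images).astype("float32") / 255.0
    if x.ndim == 3:  # (N, H, W) -> (N, H, W, 1)
        x = x[..., np.newaxis]
    return x

def to_flat_labels(labels):
    """Flatten labels of shape (N, 1) or (N,) to shape (N,)."""
    return np.asarray(labels).ravel()

# Hypothetical stand-ins for the raw validation and training arrays.
rng = np.random.default_rng(0)
toy_val_imgs = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)
toy_train_imgs = rng.integers(0, 256, size=(7, 28, 28, 1), dtype=np.uint8)
toy_val_labels = rng.integers(0, 4, size=(5, 1))

val_x = to_model_input(toy_val_imgs)
train_x = to_model_input(toy_train_imgs)
val_y = to_flat_labels(toy_val_labels)

print(val_x.shape, train_x.shape, val_y.shape)
```

Note that the helper expects raw 8-bit inputs; applying it to arrays that have already been divided by 255 would scale them twice.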
import matplotlib.pyplot as plt
unique_labels, counts = np.unique(train_labels, return_counts=True)
label_names = [info['label'][str(label)] for label in unique_labels]
# Plotting the distribution of labels
plt.figure(figsize=(10, 6))
plt.bar(label_names, counts, color='skyblue')
plt.xlabel('Labels')
plt.ylabel('Frequency')
plt.title('Distribution of Labels in the Training Dataset after Augmentation')
plt.xticks(rotation=45)
plt.show()
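The arithmetic behind the balancing step can be checked independently of the image generator. The sketch below is an illustration on hypothetical toy labels: for each class it computes how many augmented images are missing relative to the largest class, and how many full passes over the class would be needed if each pass produced one new image per original (the same quantity as `augmentation_factor` above). `augmentation_plan` and `toy_labels` are illustrative names.

```python
import numpy as np

def augmentation_plan(labels):
    """Return, per class, the number of missing samples and the passes needed."""
    y = np.asarray(labels).ravel()
    classes, counts = np.unique(y, return_counts=True)
    target = int(counts.max())
    plan = {}
    for c, n in zip(classes.tolist(), counts.tolist()):
        missing = target - n
        passes = int(np.ceil(missing / n)) if missing > 0 else 0
        plan[c] = {"missing": missing, "passes": passes}
    return plan

# Hypothetical labels: 6 samples of class 0, 3 of class 1, 1 of class 2.
toy_labels = np.array([0] * 6 + [1] * 3 + [2] * 1)
plan = augmentation_plan(toy_labels)
balanced_total = toy_labels.size + sum(p["missing"] for p in plan.values())

print(plan)
print("total after balancing:", balanced_total)
```

After balancing, the total equals the number of classes times the size of the largest class. This is consistent with the 184104 training images reported earlier, which divides evenly into 46026 images for each of the 4 classes.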