Cancer is a disease in which cells multiply uncontrollably and crowd out the normal cells. In biopsies, pathologists provide the histopathologic assessment of the microscopic structure of the tissue and make final diagnosis by applying visual inspection of histopathological samples under the microscope and aim to differentiate between normal, and malignant cells. Manual detection is a tedious, tiring task and most likely to comprise human error, as most parts of the cell are frequently part of irregular random and arbitrary visual angles. The goal of this project is to identify whether a tumor is benign or malignant in nature, as malignant tumors are cancerous and should be treated as soon as possible to reduce and prevent further complications.
The dataset contains 6 gzipped HDF5 files. The files contain histopathologic scans of lymph node sections in the form of multidimensional arrays of scientific or numerical data. The description of the dataset is as follows:
| File Name | Content | Size | 
|---|---|---|
| Camelyonpatch_level_2_split_train_x.h5.gz | 262144 images | 6.1 GB | 
| Camelyonpatch_level_2_split_train_y.h5.gz | 262144 labels | 21 KB | 
| Camelyonpatch_level_2_split_valid_x.h5.gz | 32768 images | 0.8 GB | 
| Camelyonpatch_level_2_split_valid_y.h5.gz | 32768 labels | 3 KB | 
| Camelyonpatch_level_2_split_test_x.h5.gz | 32768 images | 0.8 GB | 
| Camelyonpatch_level_2_split_test_y.h5.gz | 32768 labels | 3 KB | 
- The given dataset was in the form of gzipped HDF5 files. So, in order to perform dataset exploration we first unzipped the file, uploaded it on google drive, and then loaded the .h5 file into the ‘datasets.PCAM’ function of torchvision (version = 0.12) library and used the transform attribute to obtain the images in tensor format.
- Next, in order to make the obtained dataset iterable, we passed it through the dataloader function.
- The images obtained were of the dimensions 3x96x96.
- We also converted the images in tensor format to a dataframe with features columns as the pixels and target columns as labels in order to use the sklearn library models like Random Forest Classifier, LightGBM etc.
- Most of the pixels in the image are redundant and do not contribute substantially. it is required to eliminate them to avoid unnecessary computational overhead. This can be achieved by compression techniques.
- This is necessary to remove redundancy from the input data which only contributes to the computational complexity of the network without providing any significant improvements in the result.
- The compression technique implemented by us is image resizing. We resized both the dimensions to half, thereby maintaining the aspect ratio but reduced the area to 1/4th.
- Principal Component Analysis (PCA) : It is one of the most commonly used unsupervised machine learning algorithms that increases interpretability but at the same time minimizes information loss. It is a statistical procedure that uses an orthogonal transformation and converts a set of correlated variables to a set of uncorrelated variables.
- Linear Discriminant Analysis (LDA) : It is supervised dimensionality reduction technique which accounts for the intraclass and interclass variations as well to increase/maintain the separability of the classes after the dimensionality reduction. We tried to do the LDA but the colab file was crashing as the RAM was getting full.
- Sequential Feature Selection (SFS) : Attempted but didn’t adopt it as it was taking too much time to run. Reason: The time complexity factor of SFS technique is O(n!). In our data, n = 6912. Hence the n factorial (n!) of such a big value increases the time complexity to a huge extent. The run time exceeded 7 hours, which made it not possible to attempt.
| Models implemented | Accuracy | Specificity | Precision | Recall | 
|---|---|---|---|---|
| Transfer Learning | 0.85757 | 0.90867 | 0.89819 | 0.80643 | 
| Convolutional Neural Network | 0.74716 | 0.92434 | 0.88271 | 0.56982 | 
| Multi-layer Perceptron | 0.680175 | 0.86956 | 0.78983 | 0.49063 | 
| Random Forest Classifier(with PCA) | 0.69393 | 0.84991 | 0.65 | 0.85 | 
| LightGBM Classifier(with PCA) | 0.7275 | 0.7578 | 0.71 | 0.76 | 
| Support Vector Machine(with PCA) | 0.5233 | 0.0469 | 1.00 | 0.05 | 
We chose ‘specificity’ as the metric for evaluation as it denotes the chance of correctly classifying negative samples thereby maximizing the surety of positive samples not going undetected. While training the deep learning models (MLP, CNN and Transfer Learning model), the model with highest specificity is saved and it turns out to be the model with lowest validation loss. From the loss vs epoch curves for the Deep Learning frameworks(Linear and CNN), it can be observed that after a certain number of epochs the training loss is decreasing whereas the validation loss is increasing, this implies that the model started to overfit after a certain number of epochs.
From the attached ROC curve, we can see that the validation AUC is less than that of training AUC in the case of ‘Transfer Learning’ model. Hence, we can say that the Transfer learning model is not getting overfitted.From ROC curves and the evaluation table, it is quite evident that the Transfer Learning model is performing the best as the AUC/specificity is coming out to be maximum in that case. Since our evaluation criteria is specificity, we will go with the Transfer learning model, however it can be observed from the evaluation table above that the transfer learning model is outperformed over other models in terms of accuracy, precision and recall as well.
- One needs to save the weights of the model which performed the best. In our case, the weights of the best model are : Weights
- Run the following command in the terminal :
streamlit run app.py
| Name | Branch | Institute | 
|---|---|---|
| Vedant A Sontake | EE | IIT Jodhpur | 
| Debdut Saini | CS | IIT Jodhpur | 
| Suborno Biswas | EE | IIT Jodhpur | 
We referred to the following research papers and documentations:

