Hello Readers!
Welcome to yet another value prediction work! Today, we will be looking at an in-demand dataset, namely Genetic Variant Classifications.
We will look at this dataset and go for its primary objective: classifying the two lab reports and determining whether they conflict or not.
For more information, you may want to look at the kernel: Conflicting result classifications
As usual, we will be looking at the dataset with three aims: EDA, Feature Engineering and Predictions.
Exploratory Data Analysis
A natural first question is how the classes are distributed across the chromosomes in this data. For that, you can simply use:
As you can observe in the graph given below, the dataset is heavily biased towards the non-conflicting genes, with CHROM == 2 standing out as the clearest case of that bias.
Since the incidents where the genes are recorded as conflicting are comparatively few, we can expect that our classifiers won't do a great job. We can also expect most of them to take CHROM 2 as their most important feature if we don't feature hash it into something of a lower dimension.
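One way to get that distribution, sketched with pandas on a tiny made-up frame. The column names `CHROM` and `CLASS`, the file name in the comment, and the reading of `CLASS` 0/1 as non-conflicting/conflicting are all assumptions here, not taken from the kernel.

```python
import pandas as pd

# Toy stand-in for the real data; these values are made up for illustration.
# With the actual dataset you would load it instead, for example:
# df = pd.read_csv("clinvar_conflicting.csv")  # hypothetical file name
df = pd.DataFrame({
    "CHROM": ["1", "2", "2", "2", "17", "17", "X", "2"],
    "CLASS": [0, 0, 0, 1, 0, 1, 0, 0],
})

# Rows: chromosome; columns: class label
# (assumed: 0 = non-conflicting, 1 = conflicting).
counts = pd.crosstab(df["CHROM"], df["CLASS"])
print(counts)

# Overall share of each class, to put a number on the imbalance.
class_share = df["CLASS"].value_counts(normalize=True)
print(class_share)

# Grouped bar chart of the same table (needs matplotlib installed):
# counts.plot(kind="bar")
```

The cross-tabulation gives the same information as a grouped count plot, just as a table, so it is handy when you want exact numbers per chromosome.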
Once that is clear, let's go on and check out our feature selections:
Feature Engineering
Applying Feature Hashing and Binary Classification on various classes.
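To give a feel for the hashing step, here is a minimal sketch using scikit-learn's `FeatureHasher`. The sample values, the `CHROM=` prefix and the choice of 8 output columns are assumptions picked for illustration; the kernel may hash different columns with different settings.

```python
from sklearn.feature_extraction import FeatureHasher

# Hypothetical categorical values; the real column has more levels.
chrom_values = ["1", "2", "2", "17", "X", "MT"]

# Each row becomes a one-item list of "column=value" strings, which is
# hashed into a fixed number of output columns no matter how many
# distinct levels the column has.
hasher = FeatureHasher(n_features=8, input_type="string")
hashed = hasher.transform([[f"CHROM={v}"] for v in chrom_values])

# The result is a sparse matrix; densify it only because this toy is tiny.
X_chrom = hashed.toarray()
print(X_chrom.shape)  # (6, 8)
```

Equal input values always land in the same column, so the two `"2"` rows get identical encodings. Different values can collide when `n_features` is small, which is the price paid for the lower dimension.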
Machine Learning Models
I will be skipping the pure modelling part here; for that, you can visit the kernel from the link above. Instead, I will just share an interesting detail which I came across recently, namely Feature Importance.
The feature importance of the worst performer (the logistic classifier) can be shown as:
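For a logistic classifier, one common proxy for feature importance is the absolute size of each coefficient after scaling the inputs. The sketch below uses synthetic data and made-up feature names (`f0`, `f1`, `f2`), so it only illustrates the idea and does not reproduce the kernel's numbers.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only: f0 drives the label,
# f1 and f2 are pure noise.
rng = np.random.RandomState(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f0", "f1", "f2"])
y = (X["f0"] + 0.1 * rng.normal(size=200) > 0).astype(int)

# Scale first so coefficient magnitudes are comparable across features.
X_scaled = StandardScaler().fit_transform(X)
model = LogisticRegression().fit(X_scaled, y)

# Rank features by absolute coefficient size, largest first.
importance = pd.Series(np.abs(model.coef_[0]), index=X.columns)
importance = importance.sort_values(ascending=False)
print(importance)
```

On a frame with hashed columns, a single dominant bar in this ranking is a hint that the model is leaning on one feature, which is the kind of behaviour discussed for CHROM 2 above.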