
Genetic Variant Classifier: Random Forest Beats Deep Architecture (By a HUGE MARGIN)

Hello Readers!
Welcome to yet another prediction write-up! Today, we will be looking at an in-demand dataset, namely Genetic Variant Classifications.




We will look at this dataset and go after its primary objective: classifying the two lab reports and determining whether they conflict with each other.

The kernel you may want to look at for more information: Conflicting result classifications

As usual, we will be looking at the dataset with the aim of EDA, feature engineering and predictions.

Exploratory Data Analysis

One would like to see the chromosome vs. class distribution of this data. For that, you can simply use:
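A minimal sketch of such a count plot, assuming the dataset's CHROM and CLASS column names (a tiny toy frame stands in for the real ClinVar data here):

```python
# Count of conflicting (CLASS == 1) vs non-conflicting (CLASS == 0)
# submissions per chromosome, using a toy stand-in DataFrame.
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend, so this runs without a display
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "CHROM": ["1", "2", "2", "2", "3", "1", "2"],
    "CLASS": [0, 0, 0, 1, 0, 1, 0],  # 1 == conflicting lab reports
})

counts = pd.crosstab(df["CHROM"], df["CLASS"])
ax = counts.plot(kind="bar")
ax.set_xlabel("CHROM")
ax.set_ylabel("count")
plt.tight_layout()
```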




As you can observe in the graph given below, the dataset is heavily biased towards non-conflicting genes, with CHROM == 2 standing out as the clear winner.
Since incidents where the genes are recorded as conflicting are rare, we can assume that our classifiers won't do a great job, and that most of them would take CHROM 2 as their most important feature if we don't feature-hash it into something of a lower dimension.

Once that is clear, let's go on and check out our feature selections:








Feature Engineering

Applying feature hashing and binary encoding across the various categorical columns.
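A sketch of the hashing step, using scikit-learn's FeatureHasher to map a high-cardinality column like CHROM into a lower-dimensional space (the choice of n_features=8 here is an illustrative assumption, not the kernel's actual setting):

```python
# Hash chromosome labels into a fixed, low-dimensional feature space.
from sklearn.feature_extraction import FeatureHasher

chrom_values = ["1", "2", "2", "X", "MT"]
hasher = FeatureHasher(n_features=8, input_type="string")
# With input_type="string", each sample is an iterable of string tokens.
hashed = hasher.transform([[c] for c in chrom_values]).toarray()
print(hashed.shape)  # (5, 8)
```

Hashing keeps the dimensionality fixed regardless of how many distinct chromosome values appear, at the cost of occasional collisions.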


Machine Learning Models

I will be skipping the pure modelling part here; for that, you can visit the kernel from the link above. Instead, I will just share an interesting detail I came across recently, namely feature importance.

The feature importance of the worst performer (the Logistic Regression classifier) can be shown as:
From the graph above, we can easily see that the most important features for our logistic model are Inversion, Microsatellite and LOW, while the chromosomes are given very little or no importance.
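For a linear model like logistic regression, "importance" is usually read off the magnitude of the learned coefficients. A hypothetical sketch on toy data (the feature names and data here are illustrative only):

```python
# Rank features for a logistic model by absolute coefficient size.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
feature_names = ["Inversion", "Microsatellite", "LOW", "CHROM_hash_0"]
X = rng.random((200, len(feature_names)))
# Label driven mostly by the first feature, weakly by the second;
# the remaining columns are pure noise.
y = (X[:, 0] + 0.5 * X[:, 1] > 1.0).astype(int)

clf = LogisticRegression().fit(X, y)
importance = np.abs(clf.coef_[0])
ranked = sorted(zip(feature_names, importance), key=lambda t: -t[1])
```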
While the feature importance of the best performer (the Random Forest classifier) can be shown as:
Here, we can see the emphasis placed on the chromosome types.
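Tree ensembles expose this directly via the impurity-based `feature_importances_` attribute. A minimal sketch on toy data (column names are again illustrative assumptions):

```python
# Read impurity-based feature importances from a fitted random forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
feature_names = ["CHROM_hash_0", "CHROM_hash_1", "Inversion", "LOW"]
X = rng.random((300, 4))
y = (X[:, 0] > 0.5).astype(int)  # class driven by the first hashed column

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = forest.feature_importances_  # normalised, sums to 1
```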

Deep Learning Model

I tried the following deep learning model for the same prediction (although it failed, pretty badly):
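The original model's code lives in the kernel; as a stand-in, here is a hedged sketch of a comparable small dense network using scikit-learn's MLPClassifier (layer sizes and the random toy data are assumptions, not the kernel's actual architecture):

```python
# Small fully-connected network for the binary "conflicting" label.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
X = rng.random((300, 10))          # hashed / binarised features
y = rng.integers(0, 2, size=300)   # 1 == conflicting

model = MLPClassifier(hidden_layer_sizes=(64, 32),
                      max_iter=300, random_state=0)
model.fit(X, y)
acc = model.score(X, y)
```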

and, well, it came out with worse performance than the Decision Tree classifier (discussed in the kernel itself).

Conclusion

From this, we can conclude that the Random Forest classifier is our best bet, and that our deep learning model probably failed because of the extensive binary and feature transformations and the evident class imbalance.
