
Predicting App Popularity using its Size : Bad Idea

Hey Folks!

I recently came across an excellent trending dataset on Kaggle. It is named Mobile App Statistics (mainly based on the Apple App Store), and I instantly decided to try my hand at it.

Link to the data set: Apple App Store (7200 datasets)





Our objectives for this kernel are these 3:

  1. Checking out the popular applications in each genre in the Apple App Store (basically, grabbing the top charts for our data analysis)
  2. Checking the trend of an app's user experience with respect to its cost, user rating count, and the number of devices and languages it supports.
  3. Judging an app's popularity by its size and building a Random Forest Classifier to classify apps by popularity

Of these 3, I will be discussing the first and the third objective here!

Before anything else, one may notice that the size in bytes is not really a convenient unit of measure, so why not convert it into megabytes with a small helper function

and apply it using the standard "df['newcol'] = df['oldcol'].apply(funcname)" method!
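The original snippet is not reproduced here, so below is a minimal sketch of what such a conversion could look like. The column names (size_bytes, size_mb), the toy rows and the choice of 1024 * 1024 bytes per megabyte are my assumptions, not necessarily what the kernel used.

```python
import pandas as pd

# Assumption: binary megabytes. Use 1_000_000 instead for decimal megabytes.
BYTES_PER_MB = 1024 * 1024


def bytes_to_mb(size_in_bytes):
    """Convert a size in bytes to megabytes, rounded to 2 decimal places."""
    return round(size_in_bytes / BYTES_PER_MB, 2)


# Toy rows standing in for the real dataset; the column names are assumed.
df = pd.DataFrame({
    "track_name": ["App A", "App B"],
    "size_bytes": [104857600, 52428800],
})

# The "df['newcol'] = df['oldcol'].apply(funcname)" pattern from the text.
df["size_mb"] = df["size_bytes"].apply(bytes_to_mb)
print(df)
```

For a plain division like this, df["size_bytes"] / BYTES_PER_MB would do the same job without apply, but the apply pattern is handy once the helper grows more complicated.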


Working on Objective 1


My next job was to look at the top charts of each genre, so I decided to simply sort the apps by their user rating and hoped that it would work.
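Here is a rough sketch of that naive first attempt, written against a tiny made-up table. The column names (prime_genre, user_rating, rating_count_tot) and the rows are assumptions for illustration only.

```python
import pandas as pd

# Toy data; column names and values are assumed, not taken from the dataset.
apps = pd.DataFrame({
    "track_name": ["Big Game", "Tiny Game", "Big Chat", "Tiny Chat"],
    "prime_genre": ["Games", "Games", "Social", "Social"],
    "user_rating": [4.5, 5.0, 4.0, 5.0],
    "rating_count_tot": [2_000_000, 12, 1_500_000, 3],
})

# Naive chart: sort by rating alone and keep the best row of each genre.
naive_top = (
    apps.sort_values("user_rating", ascending=False)
        .groupby("prime_genre")
        .head(1)
)
print(naive_top[["prime_genre", "track_name", "user_rating", "rating_count_tot"]])
```

In this toy table the "winners" are the two apps with a perfect rating from only a handful of raters, which is hardly what one would call a top chart.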









Well, did it work? Of course not. But why? The thing to notice here is that we are simply sorting by the user rating. The popularity of an app is determined not only by its rating, but also by the number of people who have rated it or, simply, downloaded it. Keeping that in mind, how about we require a benchmark of 1 million raters for an app to be listed in the top charts?
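One way to apply that benchmark is to filter on the rating count before sorting. This is only a sketch on made-up rows; the column names are assumed and MIN_RATERS is just the 1 million figure mentioned above.

```python
import pandas as pd

MIN_RATERS = 1_000_000  # benchmark from the text

# Toy data; column names and values are assumed, not taken from the dataset.
apps = pd.DataFrame({
    "track_name": ["Big Game", "Tiny Game", "Big Chat", "Tiny Chat"],
    "prime_genre": ["Games", "Games", "Social", "Social"],
    "user_rating": [4.5, 5.0, 4.0, 5.0],
    "rating_count_tot": [2_000_000, 12, 1_500_000, 3],
})

# Keep only apps with enough raters, then take the best-rated app per genre.
popular = apps[apps["rating_count_tot"] >= MIN_RATERS]
top_charts = (
    popular.sort_values(["prime_genre", "user_rating"], ascending=[True, False])
           .groupby("prime_genre")
           .head(1)
)
print(top_charts[["prime_genre", "track_name", "user_rating", "rating_count_tot"]])
```

Genres where no app reaches the benchmark simply drop out of the result, so a lower threshold may be needed for the smaller categories.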




Working on Objective 3


Now, what I want is to grab just the 'Rating total' and 'Size in MBs' columns for our analysis. We will be focusing on categorizing popularity into 3 classes.
  • If an app has fewer than 10k user ratings, it will be rated as poor, with a label of 0.
  • If an app has at least 10k but fewer than 100k user ratings, it will be rated as average, with a label of 1.
  • If an app has 100k or more user ratings, it will be rated as popular, with a label of 2.
Now, we essentially need to refine our inputs and output labels for things to work. Here's what I used:
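The three rules above can be sketched as a small labelling function. The column names (size_mb, rating_count_tot, popularity) and the toy rows are assumptions for illustration.

```python
import pandas as pd


def popularity_label(rating_count):
    """Map a total rating count to 0 (poor), 1 (average) or 2 (popular)."""
    if rating_count < 10_000:
        return 0
    if rating_count < 100_000:
        return 1
    return 2


# Toy rows chosen to sit right on the class boundaries.
df = pd.DataFrame({
    "size_mb": [12.5, 80.0, 150.3, 300.0],
    "rating_count_tot": [950, 10_000, 99_999, 100_000],
})
df["popularity"] = df["rating_count_tot"].apply(popularity_label)

X = df[["size_mb"]]   # single input feature: app size in MB
y = df["popularity"]  # output label: 0, 1 or 2
print(df)
```

Note that X is kept as a one-column DataFrame (double brackets), since scikit-learn estimators expect a 2-D feature matrix.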




With all this being said and done, we just need to initialize our Random Forest Classifier like this!
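A minimal sketch of that step with scikit-learn could look like the following. The data is randomly generated just to make the snippet runnable, and the 25% test split, n_estimators=100 and the random seeds are my assumptions rather than the settings used in the kernel.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumed column names).
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "size_mb": rng.uniform(1, 500, n),
    "rating_count_tot": rng.integers(0, 300_000, n),
})
# 0: < 10k ratings, 1: 10k to < 100k, 2: >= 100k
df["popularity"] = np.digitize(df["rating_count_tot"], [10_000, 100_000])

X = df[["size_mb"]]
y = df["popularity"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are the true classes, columns the predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
print(cm)
print("accuracy:", clf.score(X_test, y_test))
```

Since the synthetic sizes and rating counts are independent of each other, the scores printed here say nothing about the real dataset; the point is only the shape of the pipeline.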




Conclusion

Looking at the confusion matrix, we can see that out of 1800 test inputs, the classifier has correctly classified 1379, which amounts to an accuracy of 76.61%.
However, if we look closely, you may have already noticed that this classifier works well mostly for apps of poor popularity, hitting an accuracy of 87.84% there, but is really bad at classifying popular apps (an accuracy of only 2.04%).
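For anyone wondering how such figures come out of a confusion matrix, here is a short sketch. The 1379 and 1800 are the numbers quoted above; the 3x3 matrix is entirely made up for illustration and is not the matrix from my run.

```python
import numpy as np

# Overall accuracy from the figures quoted in the post.
print(round(100 * 1379 / 1800, 2))  # 76.61

# Made-up confusion matrix (rows = true class, columns = predicted class).
cm = np.array([
    [90, 8, 2],    # true class 0 (poor)
    [30, 15, 5],   # true class 1 (average)
    [20, 10, 20],  # true class 2 (popular)
])

# Overall accuracy: correct predictions (the diagonal) over all predictions.
overall_accuracy = np.trace(cm) / cm.sum()

# Per-class accuracy (recall): diagonal entry over the row total.
per_class_recall = np.diag(cm) / cm.sum(axis=1)

print(overall_accuracy)
print(per_class_recall)
```

A high overall accuracy can hide a very weak class whenever one class dominates the data, which is exactly what happens with the poor-popularity apps here.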
So, one may safely conclude that relying solely on application size as the predicting feature can backfire really badly for any classification purpose. If you are still interested in making this work, I'd suggest considering the user rating as an input feature along with the application size.
It would also be great to experiment with how popularity depends on, and how well our model learns from, language support, device support and the development version of the app as possible input features!



Please do hit me up with suggestions and any constructive comments that you think may be helpful!
