Predicting App Popularity Using Its Size: Bad Idea

Hey Folks!

I recently came across an excellent trending dataset on Kaggle. It is named Mobile App Statistics (mainly based on the Apple App Store), and I instantly decided to try my hand at it.

Link to the data set: Apple App Store (7200 datasets)

Now, here are the 3 objectives for the kernel I developed:

Checking out the popular applications in each genre in the Apple App Store (basically, grabbing the top charts for our data analysis).
Checking the trend of an app's user experience with respect to its cost, its user rating count, and the number of devices and languages it supports.
Judging an app's popularity by its size and building a Random Forest Classifier to classify apps by popularity.

Of these 3, I will be discussing the first and the third objective here!

Before anything else, one may notice that size in bytes is not really a convenient unit of measure, so why not convert it into megabytes? I wrote a small helper function for the conversion and applied it using the standard "df['newcol'] = df['oldcol'].apply(funcname)" method!
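Here is a minimal sketch of that conversion, not the exact code from the kernel. The column name size_bytes, the new column name size_mb, and the toy rows are my assumptions for illustration, and I am treating 1 MB as 1024 * 1024 bytes (use 1,000,000 if you prefer decimal megabytes).

```python
import pandas as pd


def bytes_to_mb(size_in_bytes):
    """Convert a size in bytes to megabytes (1 MB = 1024 * 1024 bytes here)."""
    return round(size_in_bytes / (1024 * 1024), 2)


# Toy rows standing in for the real CSV; 'size_bytes' is an assumed column name.
df = pd.DataFrame({
    "track_name": ["App A", "App B", "App C"],
    "size_bytes": [104857600, 52428800, 1572864],
})

# The df['newcol'] = df['oldcol'].apply(funcname) pattern from the text.
df["size_mb"] = df["size_bytes"].apply(bytes_to_mb)
print(df)
```

For a plain division like this, df["size_bytes"] / (1024 * 1024) would also do the job without apply, but the apply pattern generalizes to any helper function.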

Working on Objective 1

My next job was to look at the top charts of each category, so I decided to simply sort the apps by user rating and hoped that it would work.

Well, did it work? Of course not. But why? The thing to notice here is that we are simply sorting by user rating. An app's popularity is determined not only by its rating but also by the number of people who have rated it or, simply, downloaded it. Keeping that in mind, how about we set a benchmark of 1 million raters for an app to be listed in the top charts?
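To make the difference concrete, here is a rough sketch of both approaches on toy data. The column names (prime_genre, user_rating, rating_count_tot), the helper name top_charts, and the app rows are assumptions for illustration, not taken from the kernel.

```python
import pandas as pd


def top_charts(df, min_ratings=1_000_000, top_n=3):
    """Top-rated apps per genre, keeping only apps with at least min_ratings ratings."""
    well_known = df[df["rating_count_tot"] >= min_ratings]
    ranked = well_known.sort_values(
        ["prime_genre", "user_rating", "rating_count_tot"],
        ascending=[True, False, False],
    )
    # head() keeps the first top_n rows of each genre in the sorted order.
    return ranked.groupby("prime_genre").head(top_n)


# Hypothetical apps: one tiny app with a perfect score and a handful of raters.
apps = pd.DataFrame({
    "track_name": ["Tiny Gem", "Big Game", "Huge Game", "Big Social"],
    "prime_genre": ["Games", "Games", "Games", "Social Networking"],
    "user_rating": [5.0, 4.5, 4.0, 3.5],
    "rating_count_tot": [12, 2_000_000, 1_500_000, 2_900_000],
})

# Naive approach: sorting by rating alone puts the barely rated app on top.
naive_top = apps.sort_values("user_rating", ascending=False).iloc[0]["track_name"]
print(naive_top)

# With the rater benchmark, the barely rated app drops out of the charts.
charts = top_charts(apps)
print(charts[["prime_genre", "track_name", "user_rating", "rating_count_tot"]])
```

The threshold is a parameter on purpose, so it is easy to try a looser benchmark if 1 million raters leaves some genres empty.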

Working on Objective 3

Now, all I want is to grab the 'Rating total' and 'Size in MBs' columns for our analysis. We will be focusing on categorizing popularity into 3 major classes:

If an app has fewer than 10k user ratings, it will be rated as poor, with a label of 0.
If an app has at least 10k but fewer than 100k user ratings, it will be rated as average, with a label of 1.
If an app has 100k or more user ratings, it will be rated as popular, with a label of 2.
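The three rules above can be sketched as one small function. The function name popularity_label is my own; the thresholds come straight from the rules.

```python
def popularity_label(rating_count):
    """Map a total user-rating count to a class: 0 = poor, 1 = average, 2 = popular."""
    if rating_count < 10_000:
        return 0
    if rating_count < 100_000:
        return 1
    return 2


# Quick look at the boundary cases.
for count in (500, 10_000, 99_999, 100_000):
    print(count, popularity_label(count))
```

It can be applied to a DataFrame column with the same apply pattern used for the size conversion.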

Now, we essentially need to prepare our input features and output labels for everything to work.
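A minimal sketch of that preparation step, assuming the columns are called size_mb and popularity (hypothetical names) and assuming a 75/25 train/test split; the toy values are made up and the kernel's actual split settings may differ.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the real data; column names are assumptions.
df = pd.DataFrame({
    "size_mb": [12.5, 250.0, 80.3, 1024.0, 45.1, 300.7, 5.2, 150.0],
    "popularity": [0, 1, 0, 2, 0, 1, 0, 2],
})

# scikit-learn expects a 2-D feature matrix, hence the double brackets.
X = df[["size_mb"]]
y = df["popularity"]

# Hold out 25% of the rows for testing (an assumed split ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape)
```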

With all this said and done, we just need to initialize and train our Random Forest Classifier!
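Here is a rough, self-contained sketch of that step using scikit-learn. The data is synthetic (random sizes, with labels drawn independently of size), and the hyperparameters (n_estimators=100, random_state=0) are my assumptions, not the settings from the kernel.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data: one feature (size in MB) and imbalanced class labels.
size_mb = rng.uniform(1, 2000, size=400).reshape(-1, 1)
labels = rng.choice([0, 1, 2], size=400, p=[0.8, 0.15, 0.05])

X_train, X_test, y_train, y_test = train_test_split(
    size_mb, labels, test_size=0.25, random_state=0
)

# Initialize and fit the classifier, then predict on the held-out rows.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
print(cm)
print(accuracy_score(y_test, y_pred))
```

Passing labels=[0, 1, 2] keeps the matrix 3x3 even if one class never shows up in the test split.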

Conclusion

Looking at the confusion matrix, we can see that out of 1800 inputs, the model correctly classified 1379, which amounts to an accuracy of 76.61%.

However, if we look closely at this case, you may have already noticed that the classifier works well mostly for apps with poor popularity, hitting an accuracy of 87.84% on that class, but it is really bad at classifying popular apps (an accuracy of only 2.04%).
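For anyone who wants to reproduce these kinds of numbers, here is a small sketch of how overall and per-class accuracy can be read off a confusion matrix (rows as true classes, columns as predicted classes). The toy matrix below is hypothetical and is NOT the matrix from the kernel; only the 1379 out of 1800 figure comes from the results above.

```python
def overall_accuracy(cm):
    """Share of all samples on the diagonal (correct predictions)."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


def per_class_accuracy(cm):
    """For each true class, the share of its samples predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(cm)]


# Hypothetical 3x3 matrix for illustration only (classes 0, 1, 2).
toy_cm = [
    [90, 8, 2],
    [20, 5, 5],
    [15, 4, 1],
]

print(round(overall_accuracy(toy_cm), 4))
print([round(a, 4) for a in per_class_accuracy(toy_cm)])

# Sanity check on the overall figure reported above: 1379 correct out of 1800.
print(round(1379 / 1800 * 100, 2))
```

Even in the toy matrix, a decent overall score hides very weak results on the rarer classes, which is the same pattern described above.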

So, one may safely conclude that relying solely on application size as the predicting feature can backfire really badly for this kind of classification. If you are still interested in making this work, though, I'd suggest considering the user rating as an input feature along with the application size.

It would be great to experiment with the popularity dependence and learning capability of our model when looking at language support, device support and development version of the app as possible input features!

For the full kernel: Random Forest Classifier Kernel

Please do hit me up with suggestions and any constructive comments that you think may be helpful!
