Datasets by Microsoft Research now available in the cloud: Microsoft announces open datasets!

Hey readers, today I bring some exciting news for all you aspiring data scientists and machine learners!

Something new has appeared on the Microsoft Research Blog:

The Microsoft Research Outreach team has worked extensively with the external research community to enable adoption of cloud-based research infrastructure over the past few years. Through this process, we experienced the ubiquity of Jim Gray’s fourth paradigm of discovery based on data-intensive science – that is, almost all research projects have a data component to them. This data deluge also demonstrated a clear need for curated and meaningful datasets in the research community, not only in computer science but also in interdisciplinary and domain sciences.


Today we are excited to launch Microsoft Research Open Data – a new data repository in the cloud dedicated to facilitating collaboration across the global research community. Microsoft Research Open Data, in a single, convenient, cloud-hosted location, offers datasets representing many years of data curation and research efforts by Microsoft that were used in published research studies.

With data growing at an exponential rate, perceived to be over 150 ZB of data available by 2025, it is now recognized that we need to prioritize bringing processing to data versus relying on data movement through Internet bandwidth that is growing at a much slower pace. We believe that there is real utility in providing an option to bring the processing to the data. Therefore, in addition to providing an option to download the data assets, users can also copy datasets directly to an Azure based Data Science virtual machine.

Now, one might think that Kaggle is still the better source of free datasets. But given Microsoft's recent acquisition of GitHub, and now this move into Kaggle's territory, one can assume that something is brewing in Satya Nadella's mind.

Let me discuss the pros and cons of the new repository.
Pros:
  • Extensive categorisation of the datasets across research fields, cross-referenced by PhD scholars.
  • Really easy imports into other Microsoft systems, such as Azure.
Cons:
  • Fewer datasets (but quality data).
  • Minimal support for local machines with no Azure cloud setup.
  • Of course, no API like Kaggle's that can download a dataset for you within seconds!
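
For readers who have not used it, here is a rough sketch of what that Kaggle convenience looks like in practice. It assumes the official `kaggle` command-line tool is installed and an API token is configured. The helper names and the dataset slug in the usage note are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def build_kaggle_download_command(dataset_slug, target_dir="data"):
    """Build the argument list for a Kaggle CLI dataset download.

    dataset_slug is expected in the 'owner/dataset-name' form that Kaggle
    uses in its dataset URLs.
    """
    if dataset_slug.count("/") != 1:
        raise ValueError("expected a slug of the form 'owner/dataset-name'")
    return [
        "kaggle", "datasets", "download",
        "-d", dataset_slug,          # which dataset to fetch
        "-p", str(Path(target_dir)), # where to put the files
        "--unzip",                   # extract the downloaded archive
    ]


def download_dataset(dataset_slug, target_dir="data"):
    """Run the download; needs the kaggle CLI and a configured API token."""
    if shutil.which("kaggle") is None:
        raise RuntimeError("the kaggle command-line tool is not installed")
    Path(target_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        build_kaggle_download_command(dataset_slug, target_dir),
        check=True,  # raise if the CLI reports a failure
    )
```

Calling `download_dataset("some-owner/some-dataset")` with a real slug in place of the placeholder would fetch and unzip the files into a local `data` folder. The Microsoft repository, as far as this post can tell, offers no comparable one-liner.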
So, in my opinion, Kaggle is still the best site for practising your machine learning techniques, but Microsoft may have plans to expand its support in the future, and that could mean trouble for the current dataset-hosting giants.

Let's see how this new initiative unfolds for both parties. Until then, keep discovering ;)

Uddeshya Singh
