May 20, 2020

Meta-Dataset: Why Less Is More for Machine Learning

Contrary to expectations, machine learning can work better with few-shot learning models
Source: Pixabay

As the world becomes more developed, the need for new technologies grows. "From scarcity comes abundance". This idea fits perfectly with the new way of applying machine learning to datasets.

Machine learning is a process described by its name: the computer has the ability to learn, whether from experience, from new input data compared with previous data, or from mistakes. It can be considered a branch of Artificial Intelligence.

Google Dataset Search

Google has a not-yet-well-known service that uses its search algorithm but delivers datasets instead of single, independent results. Google Dataset Search lets the user search in a different way: instead of literal results, the information is returned as datasets containing the most relevant material, such as statistics, business opportunities, and scientific research. The most common searches on Google Dataset Search cover education, climate, sports, and dogs. Google offers access to about 25 million datasets.

The company claims that searching with datasets is a more effective and smarter way to obtain accurate information. Beta testing started in 2018, and the public version came out in January 2020.

Much of the data in the dataset search index comes from government agencies: about 2 million U.S. government datasets are indexed and available right now. Some private organizations also make parts of their data public for specific purposes.

Anyone who knows of an interesting dataset, or has one of their own, can make it available for indexing by using a standard markup.
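In practice, the standard markup Google looks for is schema.org `Dataset` metadata embedded in the page that describes the dataset. A minimal JSON-LD sketch (all names, URLs, and values below are placeholders for illustration) might look like:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "name": "Example City Air Quality Readings",
  "description": "Hourly air quality sensor readings collected across Example City.",
  "url": "https://example.org/datasets/air-quality",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "creator": {
    "@type": "Organization",
    "name": "Example City Open Data"
  }
}
</script>
```

Once Google crawls a page containing markup like this, the dataset becomes eligible to appear in Dataset Search results.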

Few-shot learning is a concept describing the use of less data for machine learning purposes, and it gives datasets their own specific and unique role. Instead of feeding the learning model large amounts of manually introduced data, small but targeted amounts of data are provided. Few-shot learning can be applied to text as well as to images. As studies show, in few-shot image classification the model learns new information from only a few representative images.

Human reasoning vs Machine Learning 

It's very interesting to see how a machine can learn from reduced and limited amounts of data better than a person can. We as humans need more information to draw conclusions, but it has been shown that this is not exactly the case with computers. A machine's efficiency when analyzing data is, without a doubt, many times greater: it doesn't require big amounts of data to synthesize what can be learned from a small data input.

Few-shot classification is based on two stages: a training stage, where the machine learns from the data, and a classifying stage, where data that was not present during training shows up and the machine has to react to it.
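These two stages are usually organized into small "episodes". A minimal sketch in Python (the class and image names below are made up for illustration): each episode has a support set the model learns from and a query set of held-out examples it must classify.

```python
import random

def sample_episode(data_by_class, n_way=5, k_shot=1, n_query=3):
    """Sample a few-shot 'episode': a small support set the model
    learns from, and a query set it must then classify.
    data_by_class: dict mapping class name -> list of examples."""
    classes = random.sample(sorted(data_by_class), n_way)
    support, query = [], []
    for label in classes:
        examples = random.sample(data_by_class[label], k_shot + n_query)
        support += [(x, label) for x in examples[:k_shot]]
        query += [(x, label) for x in examples[k_shot:]]
    return support, query

# Toy data: 8 classes with 10 examples each.
data = {f"class_{i}": [f"img_{i}_{j}" for j in range(10)] for i in range(8)}
support, query = sample_episode(data, n_way=5, k_shot=1, n_query=3)
print(len(support), len(query))  # 5 support examples, 15 queries
```

A 5-way 1-shot episode like this gives the model exactly one labeled example per class before asking it to classify the queries.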

The most popular dataset for few-shot classification is called mini-ImageNet, a sampled subset of classes from ImageNet. The dataset contains 100 classes in total, divided into training, validation, and test splits (64, 16, and 20 classes, respectively). Recent studies show how this method allows a model to be competitive at test time by re-using features learned during training.

The image shows test tasks from mini-ImageNet, using classes unseen during training at test time.

Meta-Dataset works similarly, but not identically. There are two pillars: pre-training and meta-learning. Pre-training trains a classifier on the set of training classes. Later, test examples can be classified by nearest-neighbor comparisons: the support examples drive the prediction for each query.
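The nearest-neighbor step can be illustrated with a small sketch, assuming features have already been extracted by a frozen pre-trained network (the 2-D "features" below are toy values, not real embeddings):

```python
import numpy as np

def nearest_centroid_predict(support_feats, support_labels, query_feats):
    """Classify each query by the nearest class centroid computed from
    the support features -- a common 'pre-training + nearest neighbor'
    baseline, with the feature extractor assumed frozen."""
    classes = sorted(set(support_labels))
    labels_arr = np.array(support_labels)
    centroids = np.stack([
        support_feats[labels_arr == c].mean(axis=0) for c in classes
    ])
    # Euclidean distance from every query to every class centroid.
    dists = np.linalg.norm(query_feats[:, None, :] - centroids[None, :, :], axis=-1)
    return [classes[i] for i in dists.argmin(axis=1)]

# Toy 2-D "features": two well-separated classes.
support = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels = ["cat", "cat", "dog", "dog"]
queries = np.array([[0.2, 0.1], [4.9, 5.1]])
print(nearest_centroid_predict(support, labels, queries))  # ['cat', 'dog']
```

The key point is that no weights are updated at test time; prediction is just a comparison against the support set in feature space.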

Meta-learning works by building "training tasks". The objective is to reflect the goal of performing well on the queries of each task. Training classes are sampled randomly; some of their examples are used as support sets and others as query sets.
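One common way to score such a training task is the prototypical-network style objective: build class prototypes from the support set and compute a cross-entropy loss for the queries against distances to those prototypes. This is one choice among several, and the numbers below are toy features, not real embeddings:

```python
import numpy as np

def episode_loss(support_feats, support_labels, query_feats, query_labels):
    """Loss for one meta-learning 'training task': build class prototypes
    from the support set, then compute the cross-entropy of each query
    against negative squared distances to those prototypes."""
    classes = sorted(set(support_labels))
    labels_arr = np.array(support_labels)
    protos = np.stack([support_feats[labels_arr == c].mean(axis=0) for c in classes])
    # Negative squared distance to each prototype acts as the logit.
    logits = -((query_feats[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    targets = [classes.index(c) for c in query_labels]
    return -log_probs[np.arange(len(targets)), targets].mean()

# Toy episode: queries sit near their own class prototypes,
# so the loss should be close to zero.
support = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]])
labels = ["cat", "cat", "dog", "dog"]
queries = np.array([[0.1, 0.1], [5.1, 5.1]])
loss = episode_loss(support, labels, queries, ["cat", "dog"])
print(loss)
```

In a real meta-learner this loss would be backpropagated through the feature extractor, so the features themselves improve at solving new tasks.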

The most relevant finding from evaluating pre-training and meta-learning models on Meta-Dataset is that some models are more capable than others of exploiting additional data at test time.

The performance of different models varies depending on the number of examples available in each test task. Contrary to logical expectations, some models perform best with fewer support samples. Since we don't know how many examples will be available at test time, one has to identify a model that can leverage any number of examples.
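For a simple nearest-centroid baseline, more support examples generally do help; the challenge described above is that some learned models fail to show this improvement. A toy simulation (synthetic 2-D Gaussian "features" standing in for real embeddings) illustrates the baseline behavior one would want a chosen model to retain:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_class_data(center, n):
    """Draw n noisy 2-D 'feature' vectors around a class center."""
    return center + rng.normal(scale=1.0, size=(n, 2))

def centroid_accuracy(k_shot, n_query=200, n_trials=50):
    """Accuracy of a nearest-centroid classifier as the number of
    support examples per class (k_shot) varies."""
    centers = np.array([[0.0, 0.0], [3.0, 3.0]])
    correct = 0
    for _ in range(n_trials):
        protos = np.stack([make_class_data(c, k_shot).mean(axis=0) for c in centers])
        for label, c in enumerate(centers):
            q = make_class_data(c, n_query // 2)
            d = np.linalg.norm(q[:, None, :] - protos[None, :, :], axis=-1)
            correct += int((d.argmin(axis=1) == label).sum())
    return correct / (n_trials * n_query)

print(round(centroid_accuracy(1), 2), round(centroid_accuracy(10), 2))
```

With ten support examples per class, the prototypes are far less noisy than with one, so accuracy rises; a robust few-shot model should likewise keep improving as the support set grows.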

In conclusion, Google has launched a new search service that delivers datasets instead of the results of its original algorithm. It was in beta testing for longer than is traditionally recommended: two years. The information delivered as datasets is different from what one gets browsing the regular search engine, and many educators and students are using the platform to access new kinds of information. Based on Artificial Intelligence, machine learning is at its core. The discoveries have been surprising: we now know that with the few-shot learning way of introducing data, the machine is more efficient.

Tags: Machine Learning, Artificial Intelligence, Google
Lucas Bonder
Technical Writer
Lucas is an Entrepreneur, Web Developer, and Article Writer about Technology.