Machine Learning

Machine Learning is a way of programming computers so that they can automatically learn from data and different situations. Based on this learning, computers can make decisions to handle new situations.

Machine learning also helps us understand data sets better. For example, we can use it to segment a dataset into different groups.

We can also use it to predict trends based on existing data.

One popular use of Machine learning is to predict user behavior based on different models and past user behavior information.

Some of the examples of Machine Learning use cases are:
  • Email/Spam filtering
  • Network intrusion detection
  • Optical Character Recognition (OCR)
  • Ranking Data
  • Fraudulent Financial Transactions Detection
  • Image Recognition
Data mining is the process of discovering patterns in a data set.

We perform data mining using programming methods and algorithms. We can use it to extract useful information from a large amount of raw data, and it helps us make the data understandable and usable.

Machine Learning is a technique to make a computer learn new things without being explicitly programmed. It is based on pattern recognition, computational learning theory, and artificial intelligence.

Some of the main uses of Machine Learning are predictive analysis and classification.

The important difference here is that in data mining, we explicitly look for patterns, whereas in machine learning, the algorithm/model identifies the patterns on its own.
In machine learning, we create models to draw conclusions from data. Whenever a model becomes over-complicated in order to fit a specific set of data, this is called overfitting.

In such a scenario, the model predicts the initial data with high accuracy. But with any additional data, the model predicts with much lower accuracy. This defeats the purpose of the model.

For example, let's say we want to predict the type of a fruit based on its height, width, color, and weight. There may be some outliers in our data, like a yellow-colored apple.

If we make our model complicated, it may accurately predict such a yellow object to be an apple. But the actual data may also contain lemons. Due to overfitting, our model will start predicting lemons as apples based on their color.

One simple way to understand overfitting is that our information from past experiences can be divided into two groups.

1. Information that is relevant for prediction

2. Information that is irrelevant for prediction. It is also called noise.

The more noise there is, the more difficult it is for a model to predict correctly, because it is hard for a model to determine which part of the information should be ignored. With a robust learning algorithm, the chance of fitting the noise reduces drastically.
Overfitting occurs when the criteria used for training the model are not the same as the criteria used for judging the efficacy of the model.

Overfitting also happens when a model tries to memorize the training data instead of learning from it.

Once we come to an optimum set of parameters for the model, the overfitting stops. If we increase the number of parameters beyond the optimum level, overfitting occurs.

If our model performs better on the training set than on the test set, there is overfitting in our model. In such a scenario, we have high variance. To reduce overfitting, we can find ways to reduce this variance, as the sketch below illustrates.
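Below is a minimal sketch of this diagnosis in Python with scikit-learn (the synthetic dataset, the choice of models, and the hyperparameters are illustrative assumptions). The flip_y argument adds label noise, which gives the over-complex model something to memorize:

# Spotting overfitting via the gap between training and test accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% flipped labels, i.e. deliberate noise.
X, y = make_classification(n_samples=500, n_features=20, flip_y=0.1,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

for depth in (None, 3):  # unlimited depth vs. a deliberately simple tree
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train={tree.score(X_train, y_train):.2f}, "
          f"test={tree.score(X_test, y_test):.2f}")

The unlimited-depth tree typically scores close to 1.00 on the training set but noticeably lower on the test set (a high-variance model), while the shallow tree shows a much smaller gap.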
Some of the important ways of preventing overfitting are as follows:

a) Cross-Validation: One of the best ways to prevent overfitting is the technique of cross-validation.

In the simplest version of this technique, we divide the dataset into two populations.
One is a training population and the other is the testing population.

In cross-validation, we use the training population to build the model, and the testing population to test it. A correct model works well on both the training and the testing populations.
Further, there are many versions of cross-validation that perform the train/test split in different ways. One popular method is K-fold cross-validation, sketched below.
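Here is a minimal sketch of K-fold cross-validation in Python with scikit-learn (the library, the dataset, and the number of folds are assumptions for illustration):

# 5-fold cross-validation: each fold serves once as the testing
# population while the other four folds form the training population.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)
print(scores, scores.mean())  # per-fold scores and their average

Averaging the five scores gives a more stable estimate of model quality than a single train/test split.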


b) Collect more Data: We can also avoid overfitting by collecting more data in the first place. When we have a small dataset, the chances of overfitting increase. We can collect more data to build the correct model.

c) Reduce the number of features: In a complex model with the problem of overfitting, it is better to remove features that are not strong predictors. This helps reduce the variance as well as boost the performance of the model.
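One possible sketch of this, using univariate feature selection with scikit-learn's SelectKBest (the synthetic dataset and the choice of k are illustrative assumptions):

# Keep only the k features with the strongest univariate relationship
# to the target, dropping the weak predictors.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# 20 features, only 5 of which actually carry signal.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

selector = SelectKBest(score_func=f_classif, k=5)
X_reduced = selector.fit_transform(X, y)
print(X.shape, "->", X_reduced.shape)  # (300, 20) -> (300, 5)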

d) Stopping Early: When we train a model over many iterations, it can start memorizing the data, and overfitting occurs. To prevent this, we can stop the iterations early, halting the training process at an optimum point.
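A minimal sketch with scikit-learn's SGDClassifier, which supports early stopping directly (the hyperparameter values are illustrative assumptions):

# Training halts once the score on a held-out validation fraction
# stops improving, instead of running all max_iter iterations.
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

clf = SGDClassifier(early_stopping=True,      # watch a held-out slice
                    validation_fraction=0.1,  # 10% reserved for monitoring
                    n_iter_no_change=5,       # patience before stopping
                    max_iter=1000,
                    random_state=0)
clf.fit(X, y)
print("stopped after", clf.n_iter_, "iterations out of a possible 1000")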