Features And Labels
In Chapter 2, we'll discuss two important concepts in machine learning: features and labels. They are among the most important building blocks of machine learning. So, let's get started.
Features
What is a feature?
In simple words, features are the input variables of the dataset.
A machine learning algorithm takes input data and generates an output. Input data in tabular form is called a dataset. A dataset is a collection of rows and columns: each row represents a single data point, and each column represents a feature of that data point. So, features are the columns of the dataset. Rows are also called samples (or instances, or observations), and columns are also called features (or variables, or attributes).
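The row-and-column view can be sketched in a few lines of Python. This is only an illustration: the feature names `weight_kg` and `height_cm` and all the values are made up, and a plain list of lists stands in for whatever table format you actually use.

```python
# Hypothetical toy dataset: each row is one sample, each column is one feature.
feature_names = ["weight_kg", "height_cm"]
samples = [
    [4.0, 25.0],   # sample 0
    [30.0, 60.0],  # sample 1
    [5.5, 28.0],   # sample 2
]

n_samples = len(samples)         # number of rows
n_features = len(feature_names)  # number of columns

# Pull out a single feature (one column) across all samples.
weight_index = feature_names.index("weight_kg")
weight_column = [row[weight_index] for row in samples]

print(n_samples, n_features)
print(weight_column)
```

Reading across a row gives one observation; reading down a column gives one feature for every observation.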
Example of features
- Suppose we need to predict an animal's species based on its weight and height. So, in this case weight and height are the features of the dataset.
- Suppose we need to predict the price of a house based on its area, number of rooms, number of bathrooms, etc. So, in this case area, number of rooms, number of bathrooms, etc. are the features of the dataset.
- Suppose we need to predict the price of a car based on its mileage, engine size, number of cylinders, etc. So, in this case mileage, engine size, number of cylinders, etc. are the features of the dataset.
So we can say that features are characteristics of observations.
Measurements of features
Example 1
A thermometer performs measurements, and the results form a table of observations. The value associated with each measurement corresponds to the feature “temperature”.
Example 2
A police radar measures the speed of cars passing by. The measurements are stored on a computer in a table where each row contains the license plate number of a car and its speed. In this case, the license plate number serves as the ID (index) of the dataset, and the speed is the feature. If instead we want to predict which cars use a certain street, we can use the license plate number as a feature rather than as the index of the dataset.
Identification of Features
The example above shows that we can use the license plate number either as the index of the dataset or as a feature. The choice ultimately depends on the task we want to perform, that is, on the objective of the task. Nobody approaches a problem without background knowledge and expectations about that problem’s characteristics, so we can say that our expectations shape our measurements.
Car's License Plate | Speed |
---|---|
1234 | 60 km/h |
5678 | 70 km/h |
9012 | 80 km/h |
So which column counts as a feature depends on the task we want to perform.
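A minimal sketch of that choice, using the three rows from the table. The two tasks and all variable names are hypothetical; the point is only that the same column can play either role.

```python
# Rows from the table: (license plate, speed in km/h).
records = [("1234", 60), ("5678", 70), ("9012", 80)]

# Task A (assumed): analyse speeds. The plate only identifies the row,
# so it acts as the index and speed is the single feature.
speed_by_plate = {plate: speed for plate, speed in records}
speed_features = [[speed] for _, speed in records]

# Task B (assumed): predict which cars use the street. Now the plate
# itself carries information, so it is treated as a feature.
plate_features = [[plate] for plate, _ in records]

print(speed_by_plate["5678"])
print(speed_features)
print(plate_features)
```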
Labels
What is a label?
In simple words, labels are the output variables of the dataset.
A machine learning algorithm takes input data and generates an output. The label is that output: the value we want the model to predict.
For example
If the input data is an image of a cat, the output is the label of that image, which is “cat”.
Labels vs Features
We can understand the difference between the two with a simple example: an image dataset of animals. Height, weight, color, etc. are the features, whereas whether the animal is a cat or a dog is the label.
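In code, this distinction usually shows up as a split of each record into inputs and output. The sketch below uses made-up animal records; the column names and the conventional variable names `X` (features) and `y` (labels) are assumptions for illustration.

```python
# Hypothetical animal records: three feature columns plus one label column.
animals = [
    {"height_cm": 25, "weight_kg": 4, "color": "black", "species": "cat"},
    {"height_cm": 60, "weight_kg": 30, "color": "brown", "species": "dog"},
    {"height_cm": 28, "weight_kg": 5, "color": "white", "species": "cat"},
]

label_name = "species"

# Features: everything except the label column.
X = [{k: v for k, v in row.items() if k != label_name} for row in animals]

# Labels: only the label column, in the same row order as X.
y = [row[label_name] for row in animals]

print(X[0])
print(y)
```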
What is data labeling?
Feeding a vast amount of raw data to a machine learning model and expecting it to learn from it is not enough; we need to label the data. Labeling is the process of assigning labels to the data, and it is a very important step in machine learning. If the data is labeled properly, the model will tend to perform well; if it is labeled improperly, the model will perform poorly.
Hence, we can define it as follows: "Data labeling is the process of adding meaning to different types of datasets so that they can be properly used to train a machine learning model. Data labeling is also called data annotation (however, there is a minor difference between the two)."
Labeled Data vs. Unlabeled Data
- Labeled Data: Labeled data has predefined tags such as a name, type, or number. For example, an image is tagged as containing an apple or a banana.
- Unlabeled Data: Unlabeled data has no tags and no specified name.
- Labeled Data: Labeled data is used in supervised learning techniques.
- Unlabeled Data: Unlabeled data is used in unsupervised learning.
- Labeled Data: Labeled data is difficult to obtain.
- Unlabeled Data: Unlabeled data is easy to acquire.
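As a small illustration of the difference, the records below either carry a tag or do not. The file names and the convention of using `None` for "not yet labeled" are assumptions made for this sketch.

```python
# Hypothetical image records; "label" is None when nobody has tagged the image.
records = [
    {"image": "img_001.jpg", "label": "apple"},
    {"image": "img_002.jpg", "label": None},
    {"image": "img_003.jpg", "label": "banana"},
    {"image": "img_004.jpg", "label": None},
]

# Labeled data: usable for supervised learning.
labeled = [r for r in records if r["label"] is not None]

# Unlabeled data: would need annotation first, or an unsupervised method.
unlabeled = [r for r in records if r["label"] is None]

print(len(labeled), len(unlabeled))
```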
How Does Data Labeling Work?
Today, many machine learning models use supervised learning techniques to map an input variable to an output variable and predict its value. For supervised learning, we need a labeled dataset to train the model so that it can make accurate predictions. Data labeling typically begins with "human-in-the-loop" (HITL) participation, where humans are asked to make judgments about unlabeled data. For example, a human labeler may be asked to tag an image dataset by deciding, for each image, whether "does the image contain a cat" is true.
ML models learn from the human-provided labels in a process called model training. The trained model can then make predictions on new data or test data.
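The label-then-train-then-predict flow can be sketched end to end. The example below is a toy, not a recommended method: the data is invented, `train` and `predict` are hypothetical helper names, and a 1-nearest-neighbor classifier is used only because it fits in a few lines.

```python
import math

# Human-provided labels (hypothetical): (weight_kg, height_cm) -> species.
labeled_examples = [
    ((4.0, 25.0), "cat"),
    ((5.0, 27.0), "cat"),
    ((30.0, 60.0), "dog"),
    ((25.0, 55.0), "dog"),
]

def train(examples):
    """'Training' a 1-nearest-neighbor model just stores the labeled examples."""
    return list(examples)

def predict(model, features):
    """Return the label of the stored example closest to `features`."""
    nearest = min(model, key=lambda example: math.dist(example[0], features))
    return nearest[1]

model = train(labeled_examples)

# New, unseen data points: the model predicts their labels.
print(predict(model, (4.5, 26.0)))   # cat
print(predict(model, (28.0, 58.0)))  # dog
```

If the human labels were wrong (say, the two dogs tagged as cats), the same code would confidently predict the wrong species, which is why label quality matters so much.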