Data Collection Process for Machine Learning
What is Data Collection?
In simple words, Data Collection is the process of gathering and measuring information on variables of interest, in an established systematic fashion that enables one to answer stated research questions, test hypotheses, and evaluate outcomes.
Data collection means pooling data by scraping, capturing, recording, measuring, or observing it and loading it from multiple sources, whether manually or automatically, online or offline. Gathering a high volume of data can be the hardest part of a machine learning project, especially at scale. The data can be numeric (temperature, loan amount, customer retention rate), categorical (gender, color, highest degree earned), or even free text (think doctor’s notes or opinion surveys).
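As a rough illustration of pooling data from multiple sources, the sketch below loads records from a CSV source and a JSON source and combines them into one dataset. The field names, the sample values, and the helper names (`load_csv`, `load_json`, `pool_sources`) are assumptions made for this example; real projects would read from files, APIs, or scrapers instead of inline strings.

```python
import csv
import io
import json

# Hypothetical raw inputs, inlined so the example is self-contained.
CSV_SOURCE = """customer_id,loan_amount,highest_degree
1,12000,Bachelor
2,8500,Master
"""

JSON_SOURCE = '[{"customer_id": 3, "loan_amount": 4300.0, "highest_degree": "PhD"}]'


def load_csv(text):
    """Parse CSV text into records, converting numeric fields from strings."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        records.append({
            "customer_id": int(row["customer_id"]),        # numeric identifier
            "loan_amount": float(row["loan_amount"]),      # numeric feature
            "highest_degree": row["highest_degree"],       # categorical feature
        })
    return records


def load_json(text):
    """Parse a JSON array of objects into records with the same field types."""
    return [
        {
            "customer_id": int(item["customer_id"]),
            "loan_amount": float(item["loan_amount"]),
            "highest_degree": str(item["highest_degree"]),
        }
        for item in json.loads(text)
    ]


def pool_sources(*record_lists):
    """Concatenate records from several sources into one dataset."""
    pooled = []
    for records in record_lists:
        pooled.extend(records)
    return pooled


dataset = pool_sources(load_csv(CSV_SOURCE), load_json(JSON_SOURCE))
print(len(dataset), "records pooled")
```

Converting every source to the same field names and types at load time keeps the pooled dataset uniform, which makes later cleaning steps simpler.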
Why is Data Collection Important?
When you collect and analyze data, you can identify recurring patterns. From those patterns, you can use machine learning algorithms to build predictive models that forecast future events.
For example, you can use data to predict the number of sales you will make in the next quarter, or you can use data to predict the number of customers who will visit your store in the next month.
Companies that integrate big data-based decision-making into their existing business processes can gain a significant advantage over competitors. Instead of relying on gut feelings or subjective experiences, they can use big data to make evidence-based decisions.
Problems in Data Collection
The following are some of the problems that can arise in data collection:
- Inaccurate data - The collected data could contain errors, such as wrong measurements or mislabeled examples, so that it does not reflect the true values.
- Incomplete data - The collected data could be incomplete, with parts of it missing. That could take the form of empty values in columns or missing images for some class of prediction.
- Irrelevant data - The collected data could be irrelevant to the problem statement.
- Outliers - The collected data could have outliers. Outliers are data points that are significantly different from the rest of the data. They can skew the results of the analysis.
- Duplicate data - The collected data could have duplicate data. Duplicate data is data that is repeated in the dataset. It can skew the results of the analysis.
- Unstructured data - The collected data could be unstructured. Unstructured data is data that is not organized in a predefined manner. It can be difficult to analyze unstructured data.
- Inconsistent data - The collected data could be inconsistent. Inconsistent data records the same kind of information in conflicting ways, such as mixed units, date formats, or spellings of a category. It can skew the results of the analysis.
- Corrupted data - The collected data could be corrupted. Corrupted data has been damaged during capture, transfer, or storage, so values or files are unreadable or wrong. It can skew the results of the analysis.
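Several of the problems listed above can be detected with simple checks before any training happens. The following is a minimal sketch using only the Python standard library; the sample records, the helper names, and the choice of the 1.5 x IQR rule for outliers are assumptions made for illustration, and data-frame libraries offer comparable checks for larger datasets.

```python
import statistics
from collections import Counter

# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "age": 34, "city": "Paris"},
    {"id": 2, "age": None, "city": "paris"},   # missing age, inconsistent spelling
    {"id": 3, "age": 29, "city": "Berlin"},
    {"id": 3, "age": 29, "city": "Berlin"},    # exact duplicate of the previous row
    {"id": 4, "age": 31, "city": "Berlin"},
    {"id": 5, "age": 240, "city": "Madrid"},   # implausible age
    {"id": 6, "age": 36, "city": "Madrid"},
]


def count_missing(rows):
    """Count missing (None) values per field."""
    missing = Counter()
    for row in rows:
        for field, value in row.items():
            if value is None:
                missing[field] += 1
    return dict(missing)


def find_duplicates(rows):
    """Return the indices of rows that exactly repeat an earlier row."""
    seen = set()
    duplicates = []
    for index, row in enumerate(rows):
        key = tuple(sorted(row.items()))  # hashable fingerprint of the row
        if key in seen:
            duplicates.append(index)
        seen.add(key)
    return duplicates


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


def inconsistent_spellings(rows, field):
    """Group values that differ only by letter case."""
    variants = {}
    for row in rows:
        value = row[field]
        if value is not None:
            variants.setdefault(value.lower(), set()).add(value)
    return {key: sorted(forms) for key, forms in variants.items() if len(forms) > 1}


ages = [row["age"] for row in records if row["age"] is not None]
print("missing:", count_missing(records))
print("duplicate row indices:", find_duplicates(records))
print("outlier ages:", iqr_outliers(ages))
print("inconsistent cities:", inconsistent_spellings(records, "city"))
```

A flagged value is only a candidate problem: whether an outlier is an error or a genuine rare case still has to be decided with knowledge of the domain.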
The following are some of the techniques that can be applied to address those problems:
- Pre-cleaned, freely available datasets - There are many freely available datasets that are pre-cleaned and ready to use. You can use those datasets to train your machine learning models.
- Web crawling and scraping - You can use web crawling and scraping techniques to collect data from the web.
- Private data - ML engineers can create their own data. This is helpful when the amount of data required to train the model is small and the problem statement is too specific to generalize over an open-source dataset.
- Data augmentation - Data augmentation is a technique to artificially increase the size of a dataset by adding slightly modified copies of already existing data or newly created synthetic data.
- Data cleaning - Data cleaning is a technique to remove or replace corrupted, incomplete, irrelevant, or duplicate data.
- Custom data collection - Agencies can create or crowdsource the data for a fee.
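To make the data augmentation item above more concrete, here is a small sketch that enlarges an image dataset with slightly modified copies. It assumes grayscale images stored as nested lists of 0-255 pixel values; the function names, the noise level `sigma`, and the choice of a horizontal flip plus Gaussian noise are illustrative assumptions, not a recommended recipe.

```python
import random


def horizontal_flip(image):
    """Mirror each row of the image left to right."""
    return [list(reversed(row)) for row in image]


def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise and clip to the valid 0-255 range."""
    return [
        [min(255, max(0, round(pixel + rng.gauss(0, sigma)))) for pixel in row]
        for row in image
    ]


def augment(dataset, sigma=5.0, seed=0):
    """Return each (image, label) pair plus a flipped and a noisy copy."""
    rng = random.Random(seed)  # seeded so the augmentation is reproducible
    augmented = []
    for image, label in dataset:
        augmented.append((image, label))
        augmented.append((horizontal_flip(image), label))
        augmented.append((add_gaussian_noise(image, sigma, rng), label))
    return augmented


# Hypothetical two-row "image" with a made-up label.
dataset = [([[0, 128, 255], [10, 20, 30]], "cat")]
augmented = augment(dataset)
print(len(dataset), "->", len(augmented), "samples")
```

Each transformation must preserve the label. A horizontal flip is harmless for many object photos, but it would change the meaning of some characters or digits, so the set of transformations has to be chosen per problem.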
Data pre-processing
In the real world, raw data and images are frequently incomplete, inconsistent, error-prone, and lacking in certain behaviors or trends, so they are pre-processed into a format that the machine learning algorithm can use.
Pre-processing includes a number of techniques and actions:
- Data cleaning - In simple terms, data cleaning is the process of removing or replacing corrupted, incomplete, irrelevant, or duplicate data.
- Data imputation - Data imputation is the process of replacing missing data with substituted values. ML frameworks include methods and APIs for filling in missing data. Common techniques include imputing missing values with the mean, median, or most frequent value of the given field, or with values estimated from the k-nearest neighbors (k-NN).
- Oversampling - Oversampling is the process of increasing the number of samples in the minority class by randomly duplicating samples from the minority class.
- Data integration - Data integration is the process of combining data from multiple sources into a single dataset.
- Data normalization - Data normalization, in the sense used here, is the process of scaling individual samples to have unit norm (the term is also commonly used for rescaling features to a common range). This process can be useful if you plan to use a quadratic form such as the dot-product or any other kernel to quantify the similarity of any pair of samples.
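Three of the steps above (imputation, oversampling, and normalization) can be sketched in a few lines of plain Python. The function names, the median as the fill value, and the L2 norm as the unit norm are assumptions chosen for this example; ML libraries provide more complete versions of each step.

```python
import math
import random
import statistics


def impute_median(values):
    """Replace missing (None) entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]


def random_oversample(samples, labels, seed=0):
    """Duplicate random minority samples until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for label, group in by_class.items():
        # Draw the extra samples with replacement from the same class.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


def l2_normalize(sample):
    """Scale one sample so that its Euclidean (L2) norm equals 1."""
    norm = math.sqrt(sum(x * x for x in sample))
    if norm == 0:
        return list(sample)  # an all-zero sample cannot be scaled
    return [x / norm for x in sample]


# Hypothetical inputs.
print(impute_median([1.0, None, 3.0, 10.0]))
balanced_x, balanced_y = random_oversample(
    [[1, 2], [2, 3], [3, 4], [9, 9]], ["a", "a", "a", "b"]
)
print(balanced_y.count("a"), balanced_y.count("b"))
print(l2_normalize([3.0, 4.0]))
```

Oversampling is normally applied to the training split only; duplicating samples before splitting would let copies of the same sample appear in both the training and the evaluation data.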
Websites for Data Collection
- Kaggle - Kaggle is a platform for data science competitions and machine learning education, with a community of data scientists and a large catalog of user-contributed datasets. You can visit Kaggle to find datasets for your machine learning projects.
- UCI Machine Learning Repository - The UCI Machine Learning Repository is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms.
- Google Dataset Search - Google Dataset Search is a search engine for datasets published across the web. It is free to use.
- OpenML - OpenML is an open, collaborative machine learning platform for sharing datasets, tasks, and experiment results.
- Amazon AWS Public Datasets - Amazon AWS Public Datasets are publicly available datasets hosted on AWS.
- Google Cloud Public Datasets - Google Cloud Public Datasets is a collection of publicly available datasets hosted on Google Cloud, many of which can be queried through BigQuery.
- Microsoft Azure Public Datasets - Microsoft offers curated public datasets on Azure under the name Azure Open Datasets.
- IBM Cloud Public Datasets - IBM Cloud Public Datasets is a collection of datasets that are publicly available for use by anyone.
- Data.gov - Data.gov is the United States federal government's open data portal.
- Data.world - Data.world is a data catalog and collaboration platform that also hosts openly shared community datasets.
- Quandl - Quandl is a provider of financial and economic data, now part of Nasdaq and offered as Nasdaq Data Link. Some of its datasets are free, while others require a subscription.
- Datahub.io - Datahub.io is a platform for publishing and finding datasets.
- AWS Open Data Registry - The Registry of Open Data on AWS is a catalog that helps people discover datasets available through AWS resources.
- Open Data Network - Open Data Network is a search portal, originally from Socrata, for open data published by governments.
- Open Data Monitor - Open Data Monitor is a European project that gives an overview of open data catalogs and resources across Europe rather than hosting datasets itself.
- Open Data Institute - The Open Data Institute is a non-profit organization, founded in the UK, that promotes open data. It is not a dataset repository, but it publishes guidance on finding and using open data.
- Open Data Census - Open Data Census is a survey tool from the Open Knowledge Foundation for assessing which datasets governments publish openly. It is useful for locating official sources rather than for downloading data directly.
- Open Data Index - The Global Open Data Index, also from the Open Knowledge Foundation, ranks how openly governments publish key datasets.
- Open Data Watch - Open Data Watch is a non-profit organization that monitors open data in official statistics.
- Open Data Handbook - The Open Data Handbook is a guide from the Open Knowledge Foundation that explains the legal, social, and technical aspects of open data. It is a reference for working with open data, not a collection of datasets.