Insufficient data is one of the major setbacks in data science projects, so knowing how to collect data for any project you want to embark on is an important skill to acquire as a data scientist.
Data scientists and machine learning engineers now use modern data gathering techniques to acquire more data for training algorithms. If you're planning to embark on your first data science or machine learning project, you need to be able to source data as well.
How can you make the process easy for yourself? Let’s take a look at some modern techniques you can use to collect data.
Why You Need More Data for Your Data Science Project
Machine learning algorithms depend on data to become more accurate, precise, and predictive. These algorithms are trained using sets of data. The training process is a little like teaching a toddler an object’s name for the first time, then allowing them to identify it alone when they next see it.
Human beings need only a few examples to recognize a new object. That’s not so for a machine, as it needs hundreds or thousands of similar examples to become familiar with an object.
These examples or training objects need to come in the form of data. A dedicated machine learning algorithm then runs through that set of data, called a training set, and learns from it to become more accurate.
That means if you fail to supply enough data to train your algorithm, you might not get the right result at the end of your project because the machine doesn’t have sufficient data to learn from.
So, it’s necessary to get adequate data to improve the accuracy of your result. Let’s see some modern strategies you can use to achieve that below.
1. Scraping Data Directly From a Web Page
Web scraping is an automated way of getting data from the web. In its most basic form, web scraping may involve copying and pasting the elements on a website into a local file.
However, web scraping also involves writing special scripts or using dedicated tools to scrape data from a webpage directly. It could also involve more in-depth data collection using Application Programming Interfaces (APIs) like Serpstack.
Although some people believe that web scraping could lead to intellectual property loss, that generally only happens when people do it maliciously. Scraping publicly available information is legal in most cases and helps businesses make better decisions by gathering public information about their customers and competitors.
For instance, you might write a script to collect data from online stores to compare prices and availability. While it might be a bit more technical, you can collect raw media like audio files and images over the web as well.
Take a look at the example code below to get a glimpse of web scraping with Python’s beautifulsoup4 HTML parser library.
from bs4 import BeautifulSoup
from urllib.request import urlopen

# Replace this placeholder with the full URL of the target webpage
url = "Enter the full URL of the target webpage here"

# Download the raw HTML of the page
targetPage = urlopen(url)
htmlReader = targetPage.read().decode("utf-8")

# Parse the HTML into a searchable BeautifulSoup object
webData = BeautifulSoup(htmlReader, "html.parser")
Before running the example code, you’ll need to install the library. Create a virtual environment from your command line and install the library by running pip install beautifulsoup4.
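Once a page is parsed, you can query the BeautifulSoup object for specific elements. As a minimal sketch of the price-comparison idea mentioned earlier, the snippet below uses an inline HTML string in place of a live page, with hypothetical product markup (the class names are assumptions, not from any real store):

```python
from bs4 import BeautifulSoup

# Inline HTML standing in for a downloaded store page (hypothetical markup)
html = """
<ul>
  <li class="product"><span class="name">Keyboard</span><span class="price">$49</span></li>
  <li class="product"><span class="name">Mouse</span><span class="price">$25</span></li>
</ul>
"""

webData = BeautifulSoup(html, "html.parser")

# find_all returns every tag matching the given name and attributes
products = []
for item in webData.find_all("li", class_="product"):
    name = item.find("span", class_="name").get_text()
    price = item.find("span", class_="price").get_text()
    products.append((name, price))

print(products)  # [('Keyboard', '$49'), ('Mouse', '$25')]
```

On a real site, you would inspect the page's HTML to find the actual tag names and classes to pass to find_all.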
2. Via Web Forms
You can also leverage online forms for data collection. This is most useful when you have a target group of people you want to gather the data from.
A disadvantage of sending out web forms is that you might not collect as much data as you want. It’s pretty handy for small data science projects or tutorials, but you might run into constraints trying to reach large numbers of anonymous people.
Paid online data collection services also exist, but they are usually too expensive for individuals, unless you don't mind spending some money on the project.
There are various web forms for collecting data from people. One of them is Google Forms, which you can access by going to forms.google.com. You can use Google Forms to collect contact information, demographic data, and other personal details.
Once you create a form, all you need to do is send the link to your target audience via email, SMS, or whatever means are available.
However, Google Forms is only one example of popular web forms. There are many alternatives out there that do excellent data collection jobs as well.
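Once responses come in, Google Forms lets you export them (via its linked spreadsheet) as a CSV file, which you can then load for analysis. Here's a minimal sketch using Python's built-in csv module, with an inline sample standing in for a downloaded responses file (the column names are hypothetical):

```python
import csv
import io

# Inline sample standing in for an exported responses.csv file
raw = """Timestamp,Age,City
2024-01-02 10:15,29,Lagos
2024-01-02 11:40,34,Accra
2024-01-03 09:05,22,Nairobi
"""

# In practice: open("responses.csv", newline="") instead of io.StringIO(raw)
reader = csv.DictReader(io.StringIO(raw))
responses = list(reader)

# Simple summary of the collected responses
ages = [int(row["Age"]) for row in responses]
print(len(responses), sum(ages) / len(ages))  # 3 responses, average age of about 28.3
```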
3. Collecting Data Via Social Media
You can also collect data via social media outlets like Facebook, LinkedIn, Instagram, and Twitter. Getting data from social media is a bit more technical than the other methods, as it's typically automated and involves the use of different API tools.
Social media can be difficult to extract data from, as it is relatively unstructured and there is a vast amount of it. Properly organized, though, this type of dataset can be useful in data science projects involving sentiment analysis, market trend analysis, and online branding.
For instance, Twitter is an example of a social media data source where you can collect a large volume of datasets with its tweepy Python API package, which you can install with the pip install tweepy command.
For a basic example, the block of code for extracting the Tweets on your Twitter home timeline looks like this:
import tweepy

# Authenticate with the credentials from your Twitter developer account
myAuth = tweepy.OAuthHandler("paste consumer_key here", "paste consumer_secret key here")
myAuth.set_access_token("paste access_token here", "paste access_token_secret here")
api = tweepy.API(myAuth)

# Fetch the most recent Tweets from the home timeline and print each one
target_tweet = api.home_timeline()
for targets in target_tweet:
    print(targets.text)
You can visit the docs.tweepy.org website to access the tweepy documentation for more details on how to use it. To use Twitter’s API, you need to apply for a developer’s account by heading to the developer.twitter.com website.
Facebook is another powerful social media platform for gathering data. It uses a special API endpoint called the Facebook Graph API. This API allows developers to collect data about specific users’ behaviors on the Facebook platform. You can access the Facebook Graph API documentation at developers.facebook.com to learn more about it.
A detailed explanation of social media data collection with API is beyond the scope of this article. If you are interested in finding out more, you can check out each platform’s documentation for in-depth knowledge about them.
In addition to writing scripts that connect to an API endpoint, third-party social media data collection tools like Scraping Expert and many others are also available. However, most of these web tools come at a price.
4. Collecting Pre-Existing Datasets From Official Sources
You can collect pre-existing datasets from authoritative sources as well. This method involves visiting official data banks and downloading verified datasets from them. Unlike web scraping, it is faster and requires little or no technical knowledge.
The datasets on these types of sources are usually available in CSV, JSON, HTML, or Excel formats. Some examples of authoritative data sources are the World Bank, UNdata, and several others.
Some data sources may make current data private to prevent the public from accessing them. However, their archives are frequently available for download.
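As a minimal sketch of working with such a download, assume you've saved an indicator dataset in JSON format (the field names below are hypothetical, not from any specific data bank); Python's built-in json module is enough to load and filter it:

```python
import json

# Inline sample standing in for a dataset downloaded in JSON format
raw = """[
  {"country": "Kenya", "year": 2021, "value": 7.6},
  {"country": "Ghana", "year": 2021, "value": 5.4},
  {"country": "Kenya", "year": 2020, "value": -0.3}
]"""

records = json.loads(raw)

# Filter to a single year, as you might before feeding the data to a model
latest = [r for r in records if r["year"] == 2021]
print(len(latest))  # 2
```

The same pattern applies to CSV downloads with the csv module, or to Excel files with a third-party library.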
Official sources like these should give you a good starting point for getting different types of data to work with in your projects. There are many more out there, and careful searching will reward you with data perfect for your own data science projects.
Combine These Modern Techniques for Better Results
Data collection can be tedious when the available tools for the task are limited or hard to comprehend. While older and conventional methods still work well and are unavoidable in some cases, modern methods are faster and more reliable.
However, rather than relying on a single method, combining these modern ways of gathering data has the potential to yield better results.