By: Sofia Sayyah
We hear the term “big data” all the time. We live in the “era of big data,” companies are learning to “utilize big data,” you can “improve your business model with big data,” yada yada yada. But what is big data? Besides a buzzword, that is.
The phrase “big data” can refer either to the data itself or to the field that undertakes the task of analyzing it. We will discuss what makes data “big” later on, but for now let’s work with the second usage of the term. Big data is a field of work that deals with analyzing and finding insights in data that is too overwhelmingly large or complex for traditional data-processing software to work efficiently.
How big data became the booming field it is today
Data has been leveraged as a resource for decision-making since Ancient Egypt. However, the basis for what would become "big data" began in the 1970s, with database management and data warehousing. These strategies worked with structured data that could be stored in relational database management systems (RDBMS), and formed the foundation of data mining as we know it, using queries and statistical analysis. Still, it wasn't until the dot-com boom around 1995 that data began to accumulate at a rapidly increasing rate, as it has for the last 20 years.
The expansion of web traffic, followed by social media in the 2000s, created new opportunities for data analysis, with information such as click rates and search logs now being collected. Organizations were facing a tsunami of data that was semi-structured or unstructured (having no pre-defined internal structure, and therefore not stored in a structured database) rather than structured (tabular, and organized so that it is easily searchable in a database). Making this data meaningful required new techniques for storing, wrangling, and analyzing it.
In the last 10 years alone, we've seen a huge expansion of new types of data collection with the rise of the Internet of Things: smartphones, smart appliances, watches, home assistants, security systems, industrial and commercial sensors, and more have made data collection easy and cheap. There is a wealth of information about our behavior and preferences like never before, if we can just understand all this data.
What makes data “big”?
Now that we understand how the field of big data reached its current state, let’s take a look at what technically makes data “big.”
There are three key criteria that define big data: volume, velocity, and variety.
Volume: The quantity of generated and stored data. Big data usually means you’ll be dealing with large amounts of unstructured data. The size of the data helps to determine the potential insights to be made.
Velocity: The speed at which the data is generated and processed. Big data is often available in real time and produced continuously, requiring immediate evaluation in order to meet the challenges of growth and development. Velocity can be further divided into (1) the frequency of generation and (2) the frequency of handling, recording, and publishing.
Variety: The type and nature of the data. Big data draws from many types of data, including text, images, audio, video, and other unstructured or semi-structured data. Further, it can fill in the gaps with data fusion. Understanding the variety allows analysts to effectively process the data and create meaningful insights.
A fourth factor often considered when determining what makes data "big" is veracity.
Veracity: The data quality and the data value. The quality of captured data can vary greatly, which impacts the quality of analysis and insight. Low-quality data is even more challenging to link, clean, and transform across systems, limiting its usefulness.
Big data is a potential gold mine of information for organizations. The sheer quantity and variety of data being collected, often in real time, lead to much greater statistical power. The gains to be made from utilizing big data are substantial, and can include the following:
Improved decision-making capabilities: Organizations have access to more and more information, making for varied and deep insights. By the law of large numbers, as sample size increases, sample statistics converge to population parameters; with access to more data, and data that is more representative, organizations gain much greater statistical power in learning about their subject of interest. Where data was once lacking, there is now a constant flow of information to minimize the guesswork involved in decision-making, with far-reaching implications across an organization.
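The law of large numbers is easy to see in a quick simulation. This is a minimal sketch with a made-up metric (the true probability that a visitor clicks, assumed to be 0.5): as the sample grows, the estimate settles toward the true value.

```python
import random

random.seed(42)

# Hypothetical "population" parameter: the true probability a visitor clicks.
TRUE_MEAN = 0.5

def sample_mean(n: int) -> float:
    """Average of n simulated yes/no trials with probability TRUE_MEAN."""
    return sum(random.random() < TRUE_MEAN for _ in range(n)) / n

# Larger samples give estimates closer to the true parameter.
for n in (100, 10_000, 1_000_000):
    estimate = sample_mean(n)
    print(f"n={n:>9,}  estimate={estimate:.4f}  error={abs(estimate - TRUE_MEAN):.4f}")
```

The same logic is why a retailer with millions of transactions can estimate customer behavior far more precisely than one with a few hundred.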
Increased efficiency: Big data allows an organization to both increase productivity and decrease costs. If the right tools are being used, analysts can analyze more data faster, increasing productivity per analyst. This reduces the effort it takes for an organization to effectively make use of data. In addition to data on their customers and products, organizations can also gather internal data, allowing them to identify and target operational inefficiencies in order to eliminate them.
Greater potential applications: Now that data is being collected about nearly everything we do, the applications of this data are endless. One primary use is improving customer experience; when data is collected from interactions with customers, organizations can make their customer relationships more effective. Other applications include product development, systems maintenance, fraud detection, market disruption, and faster time-to-market, among many others.
Big data may come with big rewards, but it also comes with big challenges. Many of these challenges have to do with just how big the data is.
Data storage: The New York Stock Exchange generates roughly 1 terabyte of data each day. Facebook ingests 500+ terabytes of data each day. A jet engine can generate 10+ terabytes of data each half hour it is in flight; with hundreds of thousands of flights every day, that is petabytes of data. This leads us to the trillion-dollar question: how do we store all of that data?
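A back-of-the-envelope calculation makes the jet-engine figure concrete. This sketch uses the article's numbers plus two assumptions (an average two-hour flight and a low-end 100,000 flights per day); the result is only a rough order of magnitude.

```python
# Rough scale of the jet-engine figure above (all values approximate).
TB = 10**12  # 1 terabyte in bytes (decimal convention)
PB = 10**15  # 1 petabyte in bytes

jet_tb_per_half_hour = 10      # from the article
flight_hours = 2               # assumption: average flight length
flights_per_day = 100_000      # assumption: "hundreds of thousands", low end

half_hours_per_flight = flight_hours * 2
daily_jet_bytes = jet_tb_per_half_hour * TB * half_hours_per_flight * flights_per_day

print(f"Jet-engine data per day: ~{daily_jet_bytes / PB:,.0f} PB")  # thousands of PB
```

Even with conservative assumptions, the total lands in the thousands of petabytes per day, which is why storage is the first bottleneck.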
Data analysis: Say we’ve solved the problem of storing all this data — now we have to sift through it and make something of it. It’s a flood of information, and analysts don’t always know what they’re looking for or how to find it, particularly in data that is unstructured or semi-structured. Finding insights can be like searching for a needle in a very large, very messy haystack.
Data cleaning: Data cleaning is the process of making data uniform and removing or correcting incomplete or inaccurate data. Analysts spend anywhere from 50–80% of their time cleaning data, work that is necessary before any analysis can be done. Cleaning makes analysis easier, but unstructured or semi-structured data requires much more work to make usable, and poor data quality can further hinder the process. Such a large volume of data makes cleaning an enormous task.
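To make the cleaning steps concrete, here is a minimal sketch on a few toy customer records (the field names and records are invented for illustration). It normalizes formatting, corrects what it can, drops incomplete records, and removes duplicates, which are the kinds of operations that eat up that 50–80% of an analyst's time.

```python
# Toy records with the usual problems: stray whitespace, inconsistent case,
# a duplicate, a missing required field, and an unparseable value.
raw_records = [
    {"name": "  Ada Lovelace ", "email": "ADA@EXAMPLE.COM",   "age": "36"},
    {"name": "Ada Lovelace",    "email": "ada@example.com",   "age": "36"},   # duplicate
    {"name": "",                "email": "bob@example.com",   "age": "29"},   # missing name
    {"name": "Grace Hopper",    "email": "grace@example.com", "age": "n/a"},  # bad age
]

def clean(records):
    seen = set()
    cleaned = []
    for r in records:
        # Normalize formatting: trim whitespace, lowercase emails.
        name = r["name"].strip()
        email = r["email"].strip().lower()
        # Correct what we can: coerce age to an int, else mark it missing.
        try:
            age = int(r["age"])
        except ValueError:
            age = None
        # Drop records that are incomplete on required fields.
        if not name or not email:
            continue
        # Drop exact duplicates after normalization.
        key = (name, email)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"name": name, "email": email, "age": age})
    return cleaned

print(clean(raw_records))  # two usable records survive
```

Every rule here is a judgment call (is a missing age acceptable? is a shared email a duplicate?), which is why cleaning resists full automation and scales so painfully with volume.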
Data maintenance: Now that data has been stored, cleaned, and analyzed, what should be done with it? Do we keep it? Is it still relevant? How do we make room for the data still being collected every minute? How can we keep this data neat as we add to it and update it? All the normal problems faced in data management grow with the size of the data.
The traditional tools used to work with smaller amounts of data are inefficient in handling the same tasks on a much larger scale. Tools that are geared specifically for this challenge, such as Apache Hadoop and Spark, help to manage data that exists at such a scale (hundreds of gigabytes and beyond), and more tools are constantly in development.
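The core idea behind Hadoop (and, in a more general form, Spark) is the MapReduce model: a map phase turns each input record into key–value pairs, a shuffle groups them by key, and a reduce phase combines each group. Here is a single-process sketch of that model using word counting, the classic example; on a real cluster the map and reduce phases run in parallel across many machines.

```python
from collections import defaultdict

documents = [
    "big data big insights",
    "data at scale",
]

# Map phase: emit (key, value) pairs from each input record.
def map_phase(doc):
    for word in doc.split():
        yield (word, 1)

# Shuffle: group all emitted values by key (the framework normally does this,
# moving data between machines so each reducer sees one key's values).
groups = defaultdict(list)
for doc in documents:
    for key, value in map_phase(doc):
        groups[key].append(value)

# Reduce phase: combine each key's values into a final result.
counts = {word: sum(values) for word, values in groups.items()}
print(counts)  # word frequencies across both "documents"
```

Because each map call touches only one record and each reduce call touches only one key, the framework can split the work across hundreds of machines without the programmer writing any distribution logic, which is what makes the model work at terabyte scale.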
Healthcare: Big data is completely transforming healthcare. By putting medical records to use, doctors are able to identify illnesses with greater accuracy and treat illnesses with greater precision. This is leading a growing sector focused on personalized medicine, including wearable medical technology. Recently, we’ve also seen how predictive models are helping us to understand the spread of COVID-19.
Retail: The retail market is fast-paced, and retail establishments need to anticipate customers’ wants and provide for them. Big data is helping organizations predict trends, optimize pricing, and match their customers with the products they want.
Finance: Big data is essential to big money trading, particularly in financial models and real time stock market analysis. In addition to these applications, financial institutions are using data to detect fraudulent transactions or manage risk.
Transportation, supply chain, logistics: Organizations are improving their logistics by using big data to map journeys, inform decisions to restock, and prepare for unexpected circumstances. Additionally, sensor data collected from planes, trains, and automobiles can help uncover sources of failure and ways to improve.
Data: Now that big data is so common, an industry devoted to it is quickly growing. There are companies that are built on data, such as Pinterest or Spotify, even though their end-user product is something else. There are also organizations that deal entirely in data, such as providing data management services or machine learning insights (which is some of what Big Data at Berkeley does for our clients!).
Big data is in action in our daily lives, whether we see it or not. It is moulding the technologies we use and improving our quality of life. The changes in the healthcare industry mentioned earlier are improving patient outcomes. Our consumption of products is also changing. Amazon, Spotify, and Netflix can tell us what we want to consume before we know it ourselves. Our daily comforts are beginning to rely on big data-utilizing technologies that have integrated seamlessly into our lives without our realizing it.
With these benefits come risks, too. The rise of big data has led to data privacy, security, and misuse concerns. Many people don't realize how much data on our behavior is being collected; it's not uncommon for people to skip through the fine print of user agreements. What's more, many organizations are not keeping our data to themselves. Information on our behaviors, preferences, and demographics is constantly being collected and shared; our digital identities reach far beyond where the eye can see. Once this data has been collected, it's very difficult to take it out of circulation. This makes it hard for some to maintain their desired level of privacy, especially in the case of data breaches, where unauthorized parties gain access to private data, or when organizations misuse data in ways it wasn't intended for or that users did not agree to (read our "Data Science in Social Media" or our "Data Ethics" blogs for more detail).
Just this July, the CEOs of the "Big 4" tech companies (Amazon, Apple, Facebook, and Google) faced an antitrust congressional hearing. The focus was on the companies trying to stifle competition, and examples of how they did this often had to do with their use of big data. Amazon was accused of lying to lawmakers about tapping data from third-party sellers, and all four were accused of putting users' privacy at risk.
It's important to note that innovation prays at the altar of change: more specifically, change for the sake of change. Innovation without the guiding hand of ethics can often pose more harm to our society than good. Big data can make enormous changes to our lives, but this innovation must be wielded with care for those changes to be good.
On the organizational level: While big data might present itself as the best and only way to scale an organization and grow their success, it’s important for organizations to weigh the benefits of big data with their ability to manage the challenges. Organizations should also make sure that their big data strategies align with their overall business strategies and goals. If big data is not being managed or utilized effectively, the costs may outweigh the benefits, whether that be in legal fees or company culture.
On the individual level: Users of products and services may want to read the fine print to better understand how their data is being used, and if it is in their best interest to allow their data to be used in certain ways. They may want to limit their exposure to data misuse or security issues.
On the governmental level: Recent laws and regulations are slowly giving consumers more rights in limiting their exposure with regard to private data. The California Consumer Privacy Act allows consumers to demand to see the information a company has saved about them, as well as a list of third parties that information has been shared with. It also allows consumers to opt out of unnecessary data collection. In the European Union, "right to be forgotten" laws have been passed that allow users to ask that their personal information be made unsearchable. While the right to be forgotten faces an uphill battle in the United States, where First Amendment free speech rights are regarded very highly, the EU's laws provide one example of how to approach data privacy concerns.
The technologies that have been created and the gains that have been made in the last 20 years are greater than anyone could have imagined beforehand. Today we can't even imagine life without them. The pace of technological innovation has been rapidly increasing, and we expect it will continue to do so in the next decade as well. Understanding innovations such as big data can help guide our use of them, and make sure that we are headed in the right direction.