How data becomes knowledge, Part 2
The data lake concept has been in existence for a few years now. It
initially attracted some controversy and was labeled marketing hype. The
term data lake wasn’t part of any traditional data-storage
architecture, so vendors freely used it to mean many different things.
Terminology for data storage, such as streams, pools, reservoirs,
and clouds, is in widespread use in data science. Inevitably,
people began drawing parallels to the natural water ecosystem, so now we
have data lakes and data swamps as well.
Analogies are great for explaining concepts, but there’s always the danger
of carrying the analogy too far until it fails. Analogies also make the
terminology confusing if you’re a new entrant to the field and don’t know
what it all really means. As the data lake concept has slowly gained
acceptance, however, there have been attempts to define an architecture to
formalize the concepts.
All that said, I’m going to explain these concepts by using yet another
analogy. The sidebar shows the standard definitions of the terminology;
the analogy that follows explains them in conceptual terms. My analogy is
based on making a sandwich (in my defense, I’m writing this before lunch,
and I’m hungry). I begin the analogy at a grocery store, where most of us
get our sandwich makings.
A simple analogy
A grocery store has aisles and shelves on which employees sort and neatly
store the groceries by category. You can easily select and buy the
groceries you want. The grocery store is analogous to a database that
stores data assets in table rows and columns for easy retrieval.
The groceries the store stocks come from multiple sources and suppliers,
arrive at various times, and have different sell-by dates. Similarly, data
can come from multiple data sources at various times. Data can also become
stale, just like groceries. Like the many ingredients from the grocery
store that go into a sandwich, information is a collection of cataloged
data in a specific context. In other words, the sandwich is analogous to
information.
The whole vegetables and greens are analogous to unstructured data; the
sliced and diced vegetables and greens are analogous to structured data.
(To make this analogy work, I assume that the whole veggies are
Now, assume that your local sandwich shop selects and buys groceries from
this grocery store, cleans and washes the groceries, cuts them for use in
sandwiches, and bins them separately — just like cleaning,
structuring, and normalizing data before using it for analysis.
When you want to eat a sandwich, you head to the sandwich shop. The
sandwich shop could also have different counters where you can get a
sandwich, wraps, or salads — analogous to data marts and data
warehouses. Just like a counter is a subset of the sandwich shop, the data
mart is a subset of the data warehouse. A data mart corresponds to an
individual department, while a data warehouse corresponds to the entire
enterprise.
At the sandwich shop, you look at the menu and decide what kind of
sandwich you want; then, you order it. The sandwich maker uses the same
repetitive process to make each sandwich; indeed, you can find some
sandwiches already made and wrapped for immediate consumption. The
sandwich shop’s menu is analogous to the menu of the business intelligence (BI)
tools integrated with the data warehouse. The analytics tools also use
repetitive processes to generate reports and provide users with some
canned reports for immediate consumption.
Most people prefer to customize their sandwich, asking for changes in the
quantities of the ingredients, changing the garnishing, or omitting some
of the ingredients. Likewise, with BI tools, you can customize reports by
selecting specific data. Just like you can create your own sandwich by
specifying the ingredients to the sandwich maker, you can also create
custom analytics reports by specifying the data and algorithms in the BI
tools.
Now, imagine that you’re a food inspector and want to ensure that none of
the groceries used to prepare the sandwiches was contaminated. You also
want to ensure that the process used for food preparation, including
washing, cleaning, and dicing, was consistent and done under sanitary
conditions. In such a case, you would need to audit the processes used for
food preparation and periodically inspect the food preparation area.
Similarly, auditors need to access the raw data to verify that there has
been no contamination of the data in the data preparation process because
of transcription, cleaning, formatting, and normalizing. Unlike in the
case of the groceries in the sandwich shop, you can copy and clone data.
So, for compliance and auditing, storage of the raw data is possible.
Originally, data lake referred to the data reservoir holding raw
data as well as unstructured data such as text, images, audio, and video.
However, as mentioned, vendors have other definitions of data
lake.
Continuing the analogy, imagine a finicky consumer who’s suspicious of the
origins and freshness of the ingredients in the containers on the sandwich
counter. The consumer might also want to put vegetables or meats not
available in the sandwich shop into their sandwich. The sandwich shop is
certainly not going to allow consumers to go behind the counter to prepare
their own sandwich, so the consumer has no choice but to go to the grocery
store to buy groceries and make their sandwich in their own kitchen.
Often, professional analysts and data scientists want access to the raw
data rather than to the prepared aggregate summary data stored in the data
warehouse: They would rather get the latest data from the source to ensure its
validity and relevance. They might also want to see the arrival velocities
of the data, which could suffer from masking during the preparation
process. If analysts want to see other data not considered in the data
warehouse, they will want to access the raw databases directly. Rather
than have analysts access the source data directly, a data lake keeps
clones of the raw databases for such access needs and for sandboxing new
analytics.
Sometimes, a gourmet sandwich maker might insist on getting ingredients
farm fresh from the farmer rather than the grocery store. In that case,
that gourmet sandwich maker must duplicate the functions of the grocery
store produce buyer, which is analogous to ingesting real-time data, such
as data from an Internet of Things (IoT) device. In such a case, the data
lake must
perform extract, transform, load (ETL) functions as well for such
real-time data streams.
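To make the ETL idea concrete, here is a minimal sketch of the three steps applied to a stream of sensor messages. Everything in it is an assumption made for illustration: the message layout (`sensor`, `temp_f`), the Fahrenheit-to-Celsius transformation, and the plain Python list standing in for the data lake's storage. A real data lake would read from a message broker and write to distributed storage.

```python
import json
from typing import Iterable, Iterator, List


def extract(stream: Iterable[str]) -> Iterator[dict]:
    """Extract: parse each raw message, skipping any that are not valid JSON."""
    for message in stream:
        try:
            yield json.loads(message)
        except json.JSONDecodeError:
            continue  # a real pipeline would log or quarantine the bad message


def transform(readings: Iterable[dict]) -> Iterator[dict]:
    """Transform: convert the (assumed) Fahrenheit field to Celsius."""
    for reading in readings:
        celsius = round((reading["temp_f"] - 32) * 5 / 9, 1)
        yield {"sensor": reading["sensor"], "temp_c": celsius}


def load(records: Iterable[dict], store: List[dict]) -> int:
    """Load: append each record to the store and return how many were loaded."""
    count = 0
    for record in records:
        store.append(record)
        count += 1
    return count


# Hypothetical stream: two valid sensor messages and one malformed message.
stream = [
    '{"sensor": "s1", "temp_f": 212}',
    "not json",
    '{"sensor": "s2", "temp_f": 32}',
]
lake_store: List[dict] = []  # stand-in for the data lake's storage
loaded = load(transform(extract(stream)), lake_store)
```

Because each step is a generator, records flow through one at a time, which is the property that matters for high-velocity streams: nothing waits for the whole batch to arrive.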
Finally, imagine a seedy sandwich shop. The containers at the counter
don’t have labels. Vegetables and meats overflow into one another
willy-nilly, and even the sandwich maker is unsure exactly what type of
meat is in that last container. Customers might walk out because they
can’t be sure what kind of sandwich they’re getting. This is analogous to
a data swamp, which is a poorly maintained data lake. The data is like
mystery meat, and no one can confirm the antecedents of some of the data.
Good data is inaccessible because the data swamp doesn’t appropriately
document (or worse, wrongly documents) the metadata labels, or because
some of the data is in a format that the integrated tools can’t read or
that a query can’t retrieve.
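One way to picture the difference between a lake and a swamp is a catalog check that flags unlabeled or unreadable objects. The sketch below is illustrative only: the `CatalogEntry` fields, the `READABLE_FORMATS` set, and the sample paths are hypothetical and are not taken from any particular data lake product.

```python
from dataclasses import dataclass

# Assumed set of formats that the integrated tools can read.
READABLE_FORMATS = {"csv", "json", "parquet"}


@dataclass
class CatalogEntry:
    """Metadata label for one object in the lake (fields are illustrative)."""
    path: str
    source: str = ""
    data_format: str = ""
    description: str = ""


def find_swamp_risks(entries):
    """Return (path, problem) pairs for entries that are unlabeled or unreadable."""
    problems = []
    for entry in entries:
        if not entry.source:
            problems.append((entry.path, "unknown source"))
        if not entry.description:
            problems.append((entry.path, "missing description"))
        if entry.data_format not in READABLE_FORMATS:
            problems.append((entry.path, "unreadable format"))
    return problems


# Hypothetical catalog: one well-labeled object and one piece of "mystery meat".
catalog = [
    CatalogEntry(
        "sales/2024.csv",
        source="pos-system",
        data_format="csv",
        description="Daily sales totals",
    ),
    CatalogEntry("misc/dump.bin"),
]
risks = find_swamp_risks(catalog)
```

Here the well-labeled sales file passes cleanly, while `misc/dump.bin` is flagged three times: no known source, no description, and no readable format.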
Why do we really need data lakes?
You now know that we need data lakes for several reasons:
- As a raw data repository for compliance and audit purposes (for
example, audio and video recordings, document scans, and text and log
files)
- As a platform for data scientists and analysts to access both
structured and unstructured data for validation purposes and to
sandbox new analytics models
- As a platform to integrate real-time data from operational or
transactional systems and, increasingly, sensor data from IoT devices
The aggregate and summary data that the data warehouse provides is enough
for most BI users. The users of a data lake can be auditors, specialist
analysts, and data scientists (who are in the minority). What other
compelling reasons are there for an enterprise to choose to create a data
lake? To answer that question, it’s worthwhile to examine how the data
lake differs from a data warehouse.
What’s the difference between a data warehouse and a data lake?
Data warehouses are a mature and secure technology with a formal
architecture. They store fully processed and structured data subject to
data governance processes. Data warehouses combine data into an aggregate,
summary form for use enterprise-wide and write metadata and schema
definitions while performing the data Write operations. Data warehouses
usually have fixed configurations; they are highly structured and
therefore less flexible and agile. A cost is associated with processing
all the data before storage, and large-volume storage is relatively more
expensive.
Data lakes, in contrast, are a newer technology and have evolving
architectures. Data lakes store raw data in any form — both
structured and unstructured — and in any format, including text,
audio, video, and images. As defined, a data lake is not subject to data
governance, but experts agree that good data management is essential to
prevent a data lake from turning into a data swamp. Data lakes create
schemas during data Read operations. Data lakes are less structured and
more flexible; they offer better agility than data warehouses. No
processing is necessary until data retrieval, and data lakes use
inexpensive storage by design.
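The contrast between defining the schema at write time and at read time can be sketched in a few lines. This is a toy illustration, not how any specific warehouse or lake product works: an in-memory SQLite table stands in for the warehouse, a plain list of raw JSON lines stands in for the lake, and the event fields (`device`, `temp_c`, `note`) are invented for the example.

```python
import json
import sqlite3

# Hypothetical raw events; note the inconsistent types and the extra field.
raw_events = [
    '{"device": "pump-1", "temp_c": 71.5, "ts": "2024-01-01T00:00:00"}',
    '{"device": "pump-2", "temp_c": "68.0", "note": "manual reading"}',
]

# Schema on write (warehouse style): the table is defined up front, and each
# record is converted to fit it before storage. Fields outside the schema
# (ts, note) are dropped at this point.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE readings (device TEXT NOT NULL, temp_c REAL NOT NULL)"
)
for line in raw_events:
    event = json.loads(line)
    warehouse.execute(
        "INSERT INTO readings (device, temp_c) VALUES (?, ?)",
        (event["device"], float(event["temp_c"])),
    )

# Schema on read (lake style): the raw lines are stored untouched.
lake = list(raw_events)


def read_with_schema(lines):
    """Apply the schema only at the moment the data is read."""
    for line in lines:
        event = json.loads(line)
        yield {"device": event["device"], "temp_c": float(event["temp_c"])}


warehouse_rows = warehouse.execute(
    "SELECT device, temp_c FROM readings ORDER BY device"
).fetchall()
lake_rows = [(row["device"], row["temp_c"]) for row in read_with_schema(lake)]
```

Both paths yield the same two readings, but only the lake still holds the original `note` and `ts` fields, so a later analysis can apply a different schema to the same raw data.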
Despite their advantages, data lakes have some catching up to do regarding
security, governance, and management. But there is an elephant in the
room that is a compelling driver.
Machine learning and deep learning as drivers
One of the least discussed yet probably most compelling reasons to
adopt data lakes is the rising adoption of machine learning and deep
learning technologies for data mining and analytics. Software auditing is
a mature domain for traditional search and analytics, but it’s in its
infancy when it comes to machine learning and deep learning technologies
used for data mining and analytics.
Speech transcription, optical character recognition, image and video
recognition, and so forth, now routinely use machine learning or deep
learning technologies. Data scientists need to access the raw,
unstructured data to train these systems, to perform systems validation,
and to ensure an audit trail. Similarly, deep learning performs tasks such as
data mining to find patterns and relationships between dimensional and
Another deep learning application is to extract formerly inaccessible data
that a query cannot retrieve. Such data, called dark data, is the
subject of the next segment in this series. The advent of machine learning
and deep learning in data mining and analytics applications is a very
compelling reason to move to data lake architectures.
The benefits of data lakes
Data lakes have several benefits:
- Easy data collection and ingestion: All the data
sources in an enterprise feed into the data lake. The data lake,
therefore, becomes a seamless point of access to both structured and
unstructured data stored in either on-premises servers or cloud
servers. The entire silo-less data collection is thus easily available
for ingestion by data analytics tools. Besides, the data lake can
store data in multiple formats, such as text, audio, video, and
images, in multiple file formats. This flexibility simplifies the
integration of legacy data stores.
- Support for real-time data sources: Data lakes
support ETL functions for real-time and high-velocity data streams,
which allows the convergence of sensor data from IoT devices with
other data sources within the data lake.
- Faster data preparation: Analysts and data scientists
don’t have to spend time accessing multiple sources directly and can
search for, find, and access data much more easily, speeding the
data-preparation and reuse process. Data lakes also track and confirm
data lineage, which helps to ensure that data is trustworthy and
produces prompt BI for data-driven decision making.
- Better scalability and agility: Data lakes can take
advantage of distributed file systems for storage and are thus highly
scalable. The use of open source technologies also reduces storage
costs. Data lakes are less rigidly structured and therefore inherently
offer better flexibility, which results in better agility. Data
scientists can create sandboxes within the data lake to develop and
test new analytics models.
- Advanced analytics with artificial intelligence:
Access to raw data, the capability to create sandboxes, and the
flexibility to reconfigure make data lakes a powerful platform to
rapidly develop and use advanced analytics models. Data lakes are
ideally suited to the use of machine learning and deep learning to
perform tasks such as data mining and data analysis as well as for the
extraction of unstructured data.
The evolution of data lakes
The evolution of data lakes is more a convergence of technologies than an
evolution. Data warehouses were an evolutionary step up from their
predecessor, the relational databases, but we cannot say the same for data
lakes and data warehouses.
Data lakes bring together diverse technologies, including data
warehousing, real-time and high-velocity data streaming technologies, data
mining, deep learning, distributed storage, and other technologies. There
is a feeling, however, that data lakes have a limited user group among
professional data scientists or analysts. Another common misconception is
tying the data lake concept to a specific enabling technology such as
The data lake concept has a much greater potential than any one underlying
technology, though, and it is in the process of continuous evolution as
vendors add features and functionality. Potential areas of growth include:
- Architectural standardization and interoperability
- Data governance, management, and curation
- Holistic data security
As with most evolving technologies, competition among vendors and business
drivers pushes the barrier. It’s only a matter of time before data lakes
gain widespread acceptance among the pantheon of data-storage
The application of data lakes
Some features of data lakes make them well suited to certain applications.
This section examines two of them.
Healthcare and the life sciences
Data lakes can help resolve electronic medical record (EMR)
interoperability issues. The intention of the federal mandate for the use
of EMRs was to give physicians the ability to access patient medical
records across multiple systems and for easy transition of patient care
between providers. In practice, many of these records — both
insurance claims and clinical data — are either not interoperable
or not in the form of machine-readable data. Data lakes store records in
any format until retrieval. So, patient records might also include
handwritten doctor’s notes, medical imaging, and so on. Data lakes also
have the ability to extract and store data from real-time data streams,
resulting from the growing use of medical device telemetry and the IoT in
healthcare.
Banking and finance
The banking and finance industry typically deals with multiple data
sources. It also deals with high-velocity transaction data, from stock
markets to credit cards, and other banking transactions. Banking and
financial institutions routinely store legal and other documents for
regulatory compliance and audit requirements. Data lakes are ideal for
storing these mixed data formats and for storing legacy data digitally for
easy retrieval. Data lakes serve as an agile platform for ingesting
multiple data streams for the heavy use of analytics in this industry.
Data lakes, when designed and implemented properly, are a powerful way to
store large volumes of multiformat data without the need for silos. They
cut the time and cost of data ingestion and transformation and thus make
the data available promptly to users. They also allow the use of
lower-cost distributed storage. Data lakes have yet to mature
architecturally, and there is currently a lack of standardization between
vendor offerings. Data lakes are still evolving and adding new
functionality to improve features for access control, security, data
management, curation, and so on. The advent of machine learning and deep
learning technologies for data mining and analytics introduced the need
for a platform that provides easy access to raw data to train these
systems, for systems validation, and to ensure an audit trail. Data lakes
are an elegant answer to that need. Deep learning also enables access to
previously ingested legacy data in data lakes that is inaccessible through
standard query mechanisms. This so-called “dark data” is the subject of Part 3 of this series.