
How data becomes knowledge, Part 3: Extracting dark data

January 12, 2019
in Technology Companies

Credit: IBM

How data becomes knowledge, Part 3



Content series:

This content is Part 3 of the series: How data becomes knowledge.

Stay tuned for additional content in this series.

In my previous article, you saw how data lakes help speed up and reduce
the costs of data ingestion by allowing storage of large volumes of
multiformat data. The advent of inexpensive storage technologies makes it easier
for organizations to store enormous amounts of data cheaply.

Organizations store data for many reasons, most often for record keeping
and regulatory compliance. But there is also a tendency to hoard data that
could potentially become valuable. In the end, most companies use only a
small fraction of the data they store, because much of it becomes
inaccessible. This can happen because the storage repository doesn’t
document the metadata labels appropriately, because some of the data is in
a format the integrated tools can’t read, or because the data isn’t
retrievable through a query. (This last is especially true for data such
as scanned documents, medical imaging files, voice recordings, video, and
some forms of machine-generated data.)

This untapped data that organizations routinely store during normal
operations is called dark data. Dark data is a major limiting
factor in producing good data analysis because the quality of any data
analysis depends on the body of information accessible to the analytics
tools, both promptly and in full detail.

Gartner defines dark data as “the information assets
organizations collect, process and store during regular business
activities, but generally fail to use for other purposes.”

Companies aren’t the only ones who deal with dark data: Many everyday
examples of dark data exist. For example, I read a lot of technical papers
and journals; often during my research, I download and store PDF files or
links for later reference. These files don’t have descriptive names and
many, especially research papers, simply use a numeric document
identifier. Without descriptive information, it becomes impossible to
search for a specific article by keywords. To find a particular paper, I
might need to open and review each document until I get to the one I want
— a time-consuming and inefficient process. And, I often wind up
performing the online search all over again only to realize that I already
have the file when a download attempt results in a duplicate file error.

I could have mitigated this problem through better data governance, such as
storing the files in folders by category or adding descriptive metadata to
the file properties. However, doing this consumes time during my search
and distracts my train of thought. The result is that I wind up with a
collection of often duplicate files that I might never actually use but
hoard because they might become useful in the future. In other words, my
Downloads folder — my personal data lake — has turned into a
data swamp.
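
As a concrete illustration of that kind of lightweight governance, the
following sketch builds a small keyword catalog for a Downloads folder so
files can be found later without opening them one by one. It is a minimal
example only, assuming Python with the pypdf package; the folder location
and the search term are hypothetical.

    # Minimal sketch: catalog downloaded PDFs so they can be found by keyword
    # later, instead of opening each file in turn. Assumes the pypdf package
    # (pip install pypdf); the folder and search term are illustrative only.
    import json
    from pathlib import Path

    from pypdf import PdfReader

    DOWNLOADS = Path.home() / "Downloads"        # hypothetical location
    CATALOG = DOWNLOADS / "catalog.json"

    def describe_pdf(path: Path) -> dict:
        """Record whatever descriptive information the PDF already carries."""
        reader = PdfReader(path)
        info = reader.metadata
        title = info.title if info and info.title else path.stem
        preview = (reader.pages[0].extract_text() or "")[:500]
        return {"file": path.name, "title": title, "preview": preview}

    def build_catalog() -> None:
        entries = [describe_pdf(p) for p in sorted(DOWNLOADS.glob("*.pdf"))]
        CATALOG.write_text(json.dumps(entries, indent=2))

    def search(keyword: str) -> list:
        """Return file names whose title or first page mentions the keyword."""
        kw = keyword.lower()
        return [e["file"] for e in json.loads(CATALOG.read_text())
                if kw in e["title"].lower() or kw in e["preview"].lower()]

    if __name__ == "__main__":
        build_catalog()
        print(search("dark data"))

A few minutes spent building a catalog like this is exactly the curation
step that is easy to skip in the moment, which is how a personal data lake
turns into a data swamp.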

Another example of everyday dark data occurs with digital photography.
Digital cameras usually follow a file-naming convention of numbering
picture files sequentially, and the programs that download images to a
computer drive or the cloud typically organize them by date. However, if
you want to search for photographs of a specific location, person, or
event, you have to review the photographs manually because nothing
documents the correlation between a photograph’s creation date and the
context you are searching for. Photographs do embed metadata, but only
professional photographers tend to use this feature.
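
In fact, the metadata is usually already there: most cameras write an EXIF
block into every JPEG, and that block is what photo-organizing tools index.
The sketch below is a minimal illustration, assuming Python with the Pillow
imaging library; the file name is hypothetical, and sub-IFD details such as
GPS coordinates are omitted for brevity.

    # Minimal sketch: read the EXIF capture date and camera model that most
    # cameras embed in a JPEG. Assumes the Pillow package (pip install Pillow);
    # the file name is illustrative only.
    from PIL import Image
    from PIL.ExifTags import TAGS

    def exif_tags(path: str) -> dict:
        """Return the main-IFD EXIF metadata as a {tag name: value} dict."""
        with Image.open(path) as img:
            raw = img.getexif()
        return {TAGS.get(tag_id, tag_id): value for tag_id, value in raw.items()}

    if __name__ == "__main__":
        tags = exif_tags("IMG_0042.jpg")          # hypothetical file name
        print(tags.get("DateTime"), tags.get("Model"))

Sorting photographs by the embedded DateTime tag rather than by file name
is a small example of turning stored-but-unused metadata back into usable
data.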

Smart applications have solved both of these problems, initially by using
rules-based search and sort methods, but increasingly by using machine
learning and deep learning. Desktop search tools can scan through document
contents and find documents based on keywords, and photo-organizing tools
can recognize faces, landmarks, and features to categorize photographs
automatically.

This installment of the series discusses the factors that lead to the
creation of dark data, the steps you can take to curate and manage data
more effectively, and the methods you can use to extract and use dark data
after the fact.

Why does data go dark?

Data becomes inaccessible and unusable for many reasons, but the principal
reason is that big data is, well, big. Not just big, but
mind-bogglingly enormous. Look at a few social media statistics: In 2017,
every minute on average, Twitter users sent a half million tweets, and 4
million Facebook users clicked Like.

The 3 Vs of big data

  • Volume: Big data typically has an enormous volume,
    and processing this data is both costly and time-consuming. That’s why
    organizations tend to defer processing until doing so is necessary and
    justifiable. For example, US federal mandates to use electronic
    medical records forced healthcare organizations to digitize their
    paper records. But most of these records are in the form of scanned
    images. A doctor can easily pull up a patient record, but the data
    within that record isn’t accessible to an information retrieval and
    analysis system (a minimal OCR sketch follows this list).
  • Variety: Data also comes in a large variety of
    formats, both structured and unstructured. For example, customer
    relationship management (CRM) data typically includes email messages, social
    media messages, voice messages, video, and so on in addition to
    traditional data in a database. Formats such as audio, images, and
    video need preprocessing to extract information for storage in a
    format conducive to retrieval through query and analysis. Again, for
    reasons of cost and time, organizations tend to defer this
    preprocessing and simply store the raw data.
  • Velocity: Business transactional and operations
    systems such as stock market trades or card transactions in the
    financial industry can generate high-velocity data streams. The
    processing and structuring of such data often lags behind the data
    arrival rate. An organization often stores this data just for
    regulatory compliance and auditing. Because there’s no immediate need
    to process the data, the result is to defer processing in favor of
    storing raw data.
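
The Volume example above is typical: the scanned images exist, but the text
inside them never reaches a retrieval or analysis system. Optical character
recognition is a common first step in lighting up that kind of dark data.
The sketch below is illustrative only, assuming Python with the pytesseract
wrapper (which requires a local Tesseract OCR installation) plus Pillow; the
folder name is hypothetical, and real medical records would also need
de-identification and quality checks.

    # Minimal sketch: convert scanned pages into plain text with OCR so the
    # text can be stored in a queryable form (a search index or a database
    # column). Assumes the Tesseract engine plus the pytesseract and Pillow
    # packages; the folder name is illustrative only.
    from pathlib import Path

    import pytesseract
    from PIL import Image

    def ocr_page(image_path: Path) -> str:
        """Extract plain text from one scanned page image."""
        with Image.open(image_path) as page:
            return pytesseract.image_to_string(page)

    def ocr_folder(folder: Path) -> dict:
        """OCR every scanned page in a folder, keyed by file name."""
        return {p.name: ocr_page(p) for p in sorted(folder.glob("*.png"))}

    if __name__ == "__main__":
        for name, text in ocr_folder(Path("scanned_records")).items():
            print(name, text[:80])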

Lack of data provenance

In this case, the data is accessible but has no provenance, which makes it
unusable for analysis: the raw, unstructured data needed to establish
provenance exists but is not itself accessible. The result is dark data.

The relationship is indirect but important. Data scientists rely
on the credibility and trustworthiness of data sources to ensure that the
product of data analysis is credible and reproducible. If data doesn’t
have a provenance, it becomes unusable as a reliable source of
information. Part 2 showed that data lakes facilitate curating this provenance
by preserving the unstructured, raw data.

Poor metadata documentation

Another common reason for a data source to become unusable is the lack of
good metadata. Missing metadata leads directly to dark data because you
cannot access the data through queries. Inferior or incorrect metadata
also causes good data to become inaccessible through metadata searches.
Similarly, inconsistent metadata can split a single category into several,
based on variations in the label metadata, so that a query on any one
label returns only part of the data.
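
To make that last point concrete, here is a small, made-up illustration in
Python: three records that belong to one logical category are split three
ways by inconsistent labels, and a simple normalization step (plus a
curated alias table) pulls them back together.

    # Illustration only: inconsistent label metadata splits one logical
    # category ("New York") into several, so a query on any single spelling
    # misses most of the data. The records and alias table are made up.
    from collections import Counter

    records = [
        {"id": 1, "region": "New York"},
        {"id": 2, "region": "NY"},         # abbreviation
        {"id": 3, "region": "new york "},  # case and whitespace drift
        {"id": 4, "region": "Boston"},
    ]

    print(Counter(r["region"] for r in records))
    # Counter({'New York': 1, 'NY': 1, 'new york ': 1, 'Boston': 1})

    ALIASES = {"ny": "new york"}           # maintained as part of curation

    def normalize(label: str) -> str:
        key = label.strip().lower()
        return ALIASES.get(key, key)

    print(Counter(normalize(r["region"]) for r in records))
    # Counter({'new york': 3, 'boston': 1})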

The pitfalls and risks of dark data

Now that you have seen how data turns into dark data, it is time to examine
the pitfalls and risks associated with dark data.

Data quality

The main impact of dark data is on the quality of the data used for
analysis to extract valuable information. Dark data makes it difficult to
find and access vital information, to confirm its origins, and to obtain
essential information promptly enough to make good, data-driven decisions.
The impact on quality stems from the following factors:

  • Data accessibility: Inability to access data that is
    unstructured or in a different media format, such as images, audio, or
    video, leads to loss of access to essential information that would
    improve analysis.
  • Data accuracy: The accuracy of a data analysis rests
    on the accuracy of the input data. Accurate analysis leads to the
    extraction of qualitatively more valuable information. Hence, dark
    data has a significant impact on the accuracy of the extracted
    information and the quality of the information produced by that
    analysis.
  • Data auditability: The inability to trace the
    provenance of data can lead to its omission from analysis, thereby
    affecting the data quality. This, in turn, can lead to faulty
    data-driven decision making.

Data security

Stored data often holds sensitive information: proprietary information,
trade secrets, personal information about employees and clients such as
financial and medical records, and so on. Organizations tend to relax data
security processes when they do not know that their data store holds
sensitive information. Data breaches are on the rise, and hackers often
discover this sensitive information before the organization does, which
leads to costly liability and remedial actions.

Increased costs

Dark data leads to higher costs in two ways:

  • Data storage costs: Although data storage hardware
    costs are decreasing, the volume of stored information grows
    exponentially and can add up significantly over the long term. With
    third-party storage management solutions, the growing volume pushes
    an organization into higher subscription tiers, which in turn leads
    to spiraling costs. This added cost is for data whose worth is
    unknown, because it is dark.
  • Regulatory compliance: Businesses must follow many
    laws and regulations. Some, such as the Sarbanes-Oxley Act, drive the
    need to store business-related data; others, such as the Health
    Insurance Portability and Accountability Act and the Payment Card
    Industry Data Security Standard, require enhanced protection of
    certain sensitive stored data. All of these can increase compliance
    monitoring costs. Organizations also incur an added cost for
    monitoring and securely destroying expired data. As a result,
    organizations might continue to store dark data long after the
    regulatory period has lapsed, because neither the data’s sensitivity
    nor whether it has expired is known.

The benefits of extracting dark data

Organizations extracting dark data incur an expense and spend considerable
engineering effort, but there are many benefits to doing this.

Dark data is valuable

Dark data is valuable because it often holds information that is not
available in any other format. Therefore, organizations continue to pay
the cost of collecting and storing dark data for compliance purposes and
with hopes of exploiting the data (for that valuable information) in the
future.

Because of this value, organizations sometimes resort to human resources to
manually extract and annotate the data, and then enter it into a
relational database, even though this process is expensive, slow, and
error-prone. Deep learning technologies perform dark data extraction
faster and with much better accuracy than human beings. Dark data
extraction is less expensive and uses less engineering effort when using
these techniques and tools.

Better-quality analytics

With access to better data sources and more information, the quality of
analytics improves dramatically. Not only is the analysis based on a
larger pool of high-quality data, but the data is available for analysis
promptly. The result is faster and better data-driven decision making,
which in turn leads to business and operational success.

Reduced costs and risks

Extracting dark data leaves organizations less exposed to risks and
liability in securing sensitive information. Organizations can also
securely purge unnecessary data, thereby reducing the recurring storage
and curation costs. Regulatory compliance also becomes easier.

Dark data extraction technology is valuable

In addition to the dark data itself, dark data extraction technologies are
extremely valuable. Recent reports suggest that Apple purchased artificial
intelligence (AI) company Lattice Data for $200 million. Lattice Data
applied an AI-enabled inference engine to extract dark data.

Similarly, the Chan Zuckerberg Initiative (CZI), a philanthropic
organization founded by Facebook CEO Mark Zuckerberg, bought Meta for an undisclosed amount. Meta is an AI-powered
research search engine startup that CZI plans to make available freely.
Thus, dark data extraction technology and intellectual property developed
in-house can also be quite valuable in its own right.

Data-extraction tools

Many open source dark data extraction tools exist. This section describes some of the more successful ones.

  • DeepDive: Stanford
    University developed this open source tool, which was commercially
    supported by Lattice Data. Development is no longer active following
    Apple’s acquisition of Lattice Data in 2017.
  • Snorkel:
    Stanford University also developed this tool. Snorkel accelerates dark
    data extraction by providing tools that programmatically create the
    training datasets that extraction algorithms learn from (a minimal
    sketch follows this list).
  • Dark Vision: This app is a technology demonstrator that uses
    IBM® Watson® services to extract dark data from videos, a classic dark
    data extraction task.
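
To give a feel for Snorkel’s approach, the sketch below combines the votes
of two hand-written labeling functions into noisy training labels, which is
how Snorkel builds training sets for extraction models without manual
annotation. It is a minimal sketch that assumes Snorkel 0.9’s labeling API
and pandas; the labeling rules and the example text are made up.

    # Minimal sketch of programmatic labeling with Snorkel (0.9 API assumed):
    # labeling functions cast noisy votes on unlabeled text, and a label model
    # combines the votes into training labels for a downstream extraction
    # model. The rules and the example rows are illustrative only.
    import pandas as pd
    from snorkel.labeling import labeling_function, PandasLFApplier
    from snorkel.labeling.model import LabelModel

    RELEVANT, IRRELEVANT, ABSTAIN = 1, 0, -1

    @labeling_function()
    def lf_mentions_diagnosis(x):
        return RELEVANT if "diagnosis" in x.text.lower() else ABSTAIN

    @labeling_function()
    def lf_page_boilerplate(x):
        return IRRELEVANT if x.text.lower().startswith("page ") else ABSTAIN

    df_train = pd.DataFrame({"text": [
        "Diagnosis: type 2 diabetes, follow-up in 6 weeks",
        "Page 3 of 12",
        "Patient reports improvement since the last visit",
    ]})

    applier = PandasLFApplier(lfs=[lf_mentions_diagnosis, lf_page_boilerplate])
    L_train = applier.apply(df=df_train)     # one vote per (row, labeling fn)

    label_model = LabelModel(cardinality=2, verbose=False)
    label_model.fit(L_train, n_epochs=100)
    print(label_model.predict(L_train))      # noisy labels to train on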

Dark data: an untapped resource for better analytics

Dark data is the untapped data organizations routinely store during normal
operations. This dark data typically remains unused because it’s
inaccessible to traditional relational database tools, either because the
data is in an unstructured, unusable format (for example, document scans)
or because poor metadata descriptions do not allow efficient searching.
The quality of any data analysis depends on the body of information
accessible to the analytics tools, both promptly and in full detail. Dark
data is, therefore, a big limiting factor.

The proportion of dark data to usable data tends to be huge. For example,
in this news release, IBM estimates that 90 percent of all sensor data
collected from Internet of Things devices is never used. This dark data is
valuable, however, because it’s data that isn’t available in any other
format. Therefore, organizations continue to pay the cost of collecting
and storing it for compliance purposes in the hopes of exploiting it in
the future.

Storing and securing dark data does have associated costs and risks, some
of which exceed its value. Also, dark data can be time sensitive, and the
longer the data remains inaccessible, the more value it loses. As a
result, many organizations resort to human resources to manually extract
and annotate the data and enter it into a relational database — an
expensive, slow, and error-prone process. The advent of deep learning has
made it possible to create a new breed of intelligent data extraction and
mining tools that can extract structured data from dark data much faster and
with greater accuracy than human beings can. The technology for these tools is independently quite valuable.


