NikolaNews

What Are Data Lakes?

March 6, 2020
in Data Science
Some of you are highly organized and disciplined, and keep all your belongings arranged in order. You can easily remember where each item is and fetch it quickly. Others are disorganized, keep their belongings wherever they happen to land, and have no clue where to look when they really need something. For them, fetching an item means searching every storage space in the house. In both cases the item will eventually be found. Each approach has its own pros and cons, and we cannot say that one is always better than the other.

Let us examine this scenario more closely using the example of a library. To maintain books in a library you have two options. The first option is to catalog the books by topic, author, title, and so on, and arrange them in order on the racks. This method is almost universally followed, and it makes it easy to fetch a specific book from the racks with a single lookup, which saves search time. Storage based on catalogs and indexing is what traditional database management systems use.

The second option applies if we suppose we have unlimited computing resources, so that a search takes negligible time. Then we need not keep the books arranged in order. Just place the books on the racks as they arrive, and when someone asks for a book, search all the racks from beginning to end. The search stops when the book is located, so the time taken varies with the position of the book on the racks. With catalog-based arrangement, by contrast, the time taken to locate a book is roughly the same regardless of where it sits on the racks. However, it takes time to index the books and keep them arranged. When a borrower returns a book, it has to be placed back in its allotted location. A misplaced book may end up being declared missing, unless a full search of all the racks is carried out first.
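
The two ways of finding a book can be sketched in a few lines of Python. This is only an illustration: the shelf contents and the function names `scan_lookup` and `build_catalog` are made up for this example. A dictionary plays the role of the catalog, and a plain loop plays the role of walking the racks from start to end.

```python
# Hypothetical shelf contents, stored in arrival order (not sorted).
shelf = ["Dune", "Emma", "Ulysses", "Hamlet", "Walden"]

def scan_lookup(shelf, title):
    """Walk the rack from the start; return (position, books_checked)."""
    for checked, book in enumerate(shelf, start=1):
        if book == title:
            return checked - 1, checked
    # Book not found: every book on the rack had to be checked.
    return None, len(shelf)

def build_catalog(shelf):
    """Pay the indexing cost once: map each title to its rack position."""
    return {book: position for position, book in enumerate(shelf)}

catalog = build_catalog(shelf)

first = scan_lookup(shelf, "Dune")     # found after checking 1 book
last = scan_lookup(shelf, "Walden")    # found after checking all 5 books
indexed = catalog["Walden"]            # one lookup, wherever the book sits
```

The scan needs no preparation, but its cost grows with the position of the book. The catalog answers in one step, but it has to be rebuilt or updated whenever books are added, removed, or moved.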

So there are two parameters to consider before choosing a method for arranging books on the library racks. The first is the time taken to keep the books on the racks in order. The second is the time taken for a full search of the racks. A data warehouse keeps data in an ordered fashion using traditional databases, and structuring the data and loading it into database tables takes time and money. With the arrival of low-cost distributed file systems such as the Hadoop Distributed File System (HDFS), it has become possible to run searches over the data in parallel and speed up the fetch operation, which keeps the time taken for a full search low. In fact, the evolution of low-cost, highly reliable distributed file systems triggered the emergence of data lakes. In a data lake, we keep the data as files in directories, and files with the same name can coexist as long as they sit in different directories. Analytics jobs such as MapReduce and machine learning training are designed to work on the entire data set, which makes data lakes suitable for big data analytics and machine learning. Apache Spark is a proven distributed computing framework that can work on data stored in HDFS, so data lakes hosted on HDFS are put to efficient use by Spark applications for big data analytics and machine learning.
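
The idea of a parallel full search can be sketched on a single machine. The example below is an assumption-laden toy, not HDFS or Spark code: the `partitions` list stands in for file blocks spread over several nodes, a thread pool stands in for the cluster, and the names `scan_partition` and `parallel_search` are invented for illustration. Each worker scans its own partition (the map step) and the partial results are merged (the reduce step).

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data set split into partitions, the way a distributed
# file system splits files into blocks stored on different nodes.
partitions = [
    ["order:17", "order:42", "refund:3"],
    ["order:8", "refund:11"],
    ["refund:5", "order:99", "order:1"],
]

def scan_partition(partition, prefix):
    """Map step: scan one partition and return its matching records."""
    return [record for record in partition if record.startswith(prefix)]

def parallel_search(partitions, prefix, workers=3):
    """Scan every partition concurrently, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = pool.map(lambda p: scan_partition(p, prefix), partitions)
        # Reduce step: concatenate the per-partition matches in order.
        return [record for matches in partial for record in matches]

refunds = parallel_search(partitions, "refund:")
```

No index is built at any point. The whole data set is read on every search, and the elapsed time is kept down by dividing the scan among workers.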

A data lake is a repository of data stored in its natural format, usually as object blobs or files, and usually in a distributed file system that keeps raw copies of source system data for reliability. It is a scalable repository that lets you store all your structured and unstructured data as and when it arrives. You can store the data as is, without first structuring and indexing it. You can run many types of analytics on a data lake, from dashboards and visualizations to big data processing, real-time analytics, and machine learning, to support better decisions.
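
Storing data as is and giving it structure only when it is read (often called schema-on-read) might look like the sketch below. It is a minimal illustration under stated assumptions: a temporary local directory stands in for the distributed file system, the sources and file contents are invented, and `land_raw` and `read_records` are hypothetical helper names.

```python
import csv
import io
import json
import tempfile
from pathlib import Path

def land_raw(lake_root, source, filename, payload):
    """Store an incoming payload untouched under <lake_root>/<source>/."""
    target = Path(lake_root) / source / filename
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return target

def read_records(path):
    """Interpret a raw file only at read time, based on its extension."""
    path = Path(path)
    text = path.read_text(encoding="utf-8")
    if path.suffix == ".json":
        return json.loads(text)
    if path.suffix == ".csv":
        return list(csv.DictReader(io.StringIO(text)))
    return [{"raw": text}]  # unstructured data is kept as plain text

lake = tempfile.mkdtemp()  # stand-in for a distributed file system

# Three sources, three formats, no upfront table design or indexing.
land_raw(lake, "sales", "day1.csv", b"item,qty\npen,3\nink,1\n")
land_raw(lake, "web", "day1.json", b'[{"page": "/home", "hits": 7}]')
land_raw(lake, "support", "day1.txt", b"Customer asked about refunds.")

# Analysis time: walk the whole lake and parse each file as it is read.
all_records = {
    p.parent.name: read_records(p)
    for p in sorted(Path(lake).rglob("*"))
    if p.is_file()
}
```

Writing is cheap because nothing is transformed on arrival. The parsing cost is paid later, each time a job reads the files.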

Data lakes are an ideal workload to be deployed in the cloud, because the cloud provides performance, scalability, reliability, availability, a diverse set of analytic engines, and massive economies of scale. The top reasons customers perceived the cloud as an advantage for data lakes are better security, faster time to deployment, better availability, more frequent feature/functionality updates, more elasticity, more geographic coverage, and costs linked to actual utilization.

The crux of the story is that maintaining a data warehouse is expensive but gives quick access to specific data records. Data lakes are low-cost implementations that give slower access to specific records and are ideal for applications in which the entire data set is accessed in every processing cycle. I hope this helps you decide on a storage strategy based on your data usage scenario.

See you next time...


Credit: Data Science Central By: Janardhanan PS

© 2019 NikolaNews.com - Global Tech Updates
