Friday, February 26, 2021
  • Setup menu at Appearance » Menus and assign menu to Top Bar Navigation
Advertisement
  • AI Development
    • Artificial Intelligence
    • Machine Learning
    • Neural Networks
    • Learn to Code
  • Data
    • Blockchain
    • Big Data
    • Data Science
  • IT Security
    • Internet Privacy
    • Internet Security
  • Marketing
    • Digital Marketing
    • Marketing Technology
  • Technology Companies
  • Crypto News
No Result
View All Result
NikolaNews
  • AI Development
    • Artificial Intelligence
    • Machine Learning
    • Neural Networks
    • Learn to Code
  • Data
    • Blockchain
    • Big Data
    • Data Science
  • IT Security
    • Internet Privacy
    • Internet Security
  • Marketing
    • Digital Marketing
    • Marketing Technology
  • Technology Companies
  • Crypto News
No Result
View All Result
NikolaNews
No Result
View All Result
Home Data Science

Why Cassandra is a poor choice for an object storage metadata database

January 25, 2021
in Data Science
Why Cassandra is a poor choice for an object storage metadata database
585
SHARES
3.3k
VIEWS
Share on FacebookShare on Twitter

Cassandra is a popular, tried-and-true NoSQL database that supports key-value wide-column tables. Like any powerful tool, Cassandra has its ideal use cases – in particular, Cassandra excels at supporting write-heavy workloads, while having limitations when supporting read-heavy workloads. Cassandra’s eventual consistency model and lack of transactions, multi-table support like joins, subqueries can also limit its usefulness.

You might also like

How Machine Learning Discretely Assists Data Scientists

A Plethora of Machine Learning Articles: Part 1

AI Chatbot Platforms: The Best in the Market and Why to Consider

However, using Cassandra as a metadata database for an object storage system introduces significant complexity resulting in data integrity and performance issues at scale – particularly if one wants to use their object store as a primary storage system. Object storage needs are far simpler and different from what Cassandra is built for.

Because the implications of employing Cassandra as a object storage metadata database were not properly understood, many object storage vendors made it a foundational part of their architecture – unfortunately it keeps them from ever moving past simple archival workloads into the modern workloads that define the future of object storage (AI/ML, analytics, web/mobile applications).

Let’s explore why in a little more detail.

  1. Cassandra was never designed to manage file or object storage metadata and it is predictably weak in this regard. It is not ACID compliant. It does not have the rigidity to prevent partially successful writes, dupes, contradictions and the like. Cassandra does not support joins or foreign keys, and consequently does not offer consistency in the ACID sense. Further, there is no capacity to roll back transactions in the event of a failure.

    While Cassandra supports atomicity and isolation at the row-level, it trades transactional isolation and atomicity for high availability and fast write performance.

  2. Cassandra is categorized as an AP system in CAP. Meaning it trades Consistency for Availability and Partition tolerance. When employing Cassandra as a metadata database for an object store, you can either be fast or consistent – but not both at the same time.

    Cassandra’s tunable consistency is a compromise, not a feature. Any setting other than QUORUM or ALL means you are at risk of reading stale data. It is important to apply this consistency setting for both read and write operations in addition to the object data operations performed outside of it.

    In the object storage world, the implication is that you can be good for archival use cases (write once, read very infrequently) or you choose a different architecture.

  3. Similar to the consistency problem, durability guarantee is also a tradeoff between performance and correctness. The storage engine’s default commit log is set to sync periodically every 10 seconds. This means you will lose up to 10 seconds worth of latest updates in the event of power failure. The only reasonable way to make Cassandra durable is to use the synchronous batch mode committer which comes with a performance penalty.
  4. Cassandra’s high-availability guarantee is not suited for erasure coded object stores. With a replication factor of 3 and consistency quorum of 2, Cassandra can only tolerate a single node / drive failure within a replication group. Increasing the replication factor and quorum consistency to 5 or higher serves only to make the meta performance go from bad to worse. Unlike replication, erasure coding can tolerate multiple servers and drives failures in a distributed system. Even if you have configured the erasure code setting to 6 parity (any 6 nodes may fail) in a 16 node setup, you are still limited by the weak link, i.e Cassandra’s replication factor. The ops team is often unaware of these high-availability surprises until it is too late.
  5. Object storage systems organize the data in a tree structured hierarchical namespace. Since Cassandra does not support a hierarchical key namespace, you will have to build a tree data model on top for each directory prefix and also maintain a flat list for direct lookups without directory walk. Atomically updating multiple tables with batched commit log and full read / write quorum is slow and prone to  corruption.
  6. While objects themselves are immutable, the object storage system is mutable. When you add, remove, overwrite objects and its metadata, apply policies, collect metrics, grant session tokens and rotating credentials, the metadata is always mutating. Cassandra is not designed to handle this level of metadata mutation and definitely not for the primary storage workloads. Long term archival use cases where the objects are large (GBs in size) and infrequently accessed, will work – other use cases will not..

    The reason is that Cassandra’s log structured storage system quickly appends new writes to the end of the log file, but delays the deletes and overwrites with a tombstone marker. Vacuuming these tombstones is an expensive operation, because the actual delete operation is applied by copying the SSTables to a new table sieving the stale entries in the process. This operation has to be performed on all the nodes simultaneously. If you delay vacuuming, excessive tombstones will result in increased read latencies, memory GC pauses and failed queries. Some object storage vendors use an additional Redis database to offload Cassandra’s pressure. Using two databases to manage an object stores metadata is hardly elegant and introduces additional points of failure.  

    The biggest gotcha? You won’t see these problems until you are deep into production and it is too late.

  7. Small objects (KB to MB in size) will fill up the metadata drives dedicated to Cassandra much sooner than the data drives. Also small object workloads exacerbate Cassandra’s limitations, because they are sensitive to latency and consistency issues.  Some vendors store small objects entirely inside Cassandra to address this problem. At this point, you are merely looking at an S3 proxy on top of Cassandra database.

    This too is a bad practice.

    If you use your object store for large objects and employ erasure coding and use Cassandra as your data store for small objects and use replication – you have introduced a non-trivial SLA problem. In this approach, data is protected by different guarantees. Given that drives die all the time, the probability of serving an old object or a corrupted object goes up considerably.

    As noted above, your metadata database is now the weak link. Availability, consistency and durability guarantees are only as good as the weakest link. If the weakest link employs replication (three copies) you can only withstand one-node or one drive failure before losing data.

    A counter argument might be to replicate five copies. The result is a massive performance hit and you can still really only withstand two-node or two-drive failure.

    In using replication for small objects and erasure coding for large objects you also undermine the efficiency gains associated with EC. If you only use erasure code for large objects (likely a small percent of your overall object pool) you don’t gain much but increase your exposure considerably.

  8. Employing Cassandra as your metadata database for an object store also introduces a troublesome Java dependency. This in turn can result in bloatware and memory management issues. Cassandra taxes the JVM memory management with constant large scale metadata allocation and mutation resulting in memory exhaustion and garbage collection pauses.

The obvious takeaway is that it is a lot more complicated to operate a Cassandra cluster than a properly designed object storage system. Cassandra is built for a different purpose and object-storage meta-data is not one of them. The areas where Cassandra struggles are the areas that are core to a performant, scalable and resilient object store.

The last point is of note – object storage is a natural fit for blob data and that is why Erasure Coding is so effective and efficient. Cassandra is designed for replication. When you use that model for metadata it breaks the object store’s erasure coding advantage (or at the very least makes it brittle and prone to breakage).

Bottom line. Write your metadata atomically with your object. Never separate them.

We welcome your comments. Feel free to engage us on Twitter, on our Slack channel or by dropping us a note at [email protected].


Credit: Data Science Central By: Jonathan Symonds

Previous Post

AWS unveils fresh migration, cloud governance and machine learning partner training

Next Post

Why Should Python Be Used in Machine Learning?

Related Posts

How Machine Learning Discretely Assists Data Scientists
Data Science

How Machine Learning Discretely Assists Data Scientists

February 24, 2021
A Plethora of Machine Learning Articles: Part 1
Data Science

A Plethora of Machine Learning Articles: Part 1

February 24, 2021
What are Data Pipelines ?
Data Science

AI Chatbot Platforms: The Best in the Market and Why to Consider

February 24, 2021
Modernizing Data Dashboards. – Data Science Central
Data Science

Modernizing Data Dashboards. – Data Science Central

February 24, 2021
4 ways Cryptocurrency is Benefiting the Fintech Industry
Data Science

4 ways Cryptocurrency is Benefiting the Fintech Industry

February 23, 2021
Next Post
Why Should Python Be Used in Machine Learning?

Why Should Python Be Used in Machine Learning?

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

Recommended

Plasticity in Deep Learning: Dynamic Adaptations for AI Self-Driving Cars

Plasticity in Deep Learning: Dynamic Adaptations for AI Self-Driving Cars

January 6, 2019
Microsoft, Google Use Artificial Intelligence to Fight Hackers

Microsoft, Google Use Artificial Intelligence to Fight Hackers

January 6, 2019

Categories

  • Artificial Intelligence
  • Big Data
  • Blockchain
  • Crypto News
  • Data Science
  • Digital Marketing
  • Internet Privacy
  • Internet Security
  • Learn to Code
  • Machine Learning
  • Marketing Technology
  • Neural Networks
  • Technology Companies

Don't miss it

SolarWinds cybersecurity spending tops $3 million in Q4, sees $20 million to $25 million in 2021
Internet Security

SolarWinds cybersecurity spending tops $3 million in Q4, sees $20 million to $25 million in 2021

February 26, 2021
Chinese Hackers Using Firefox Extension to Spy On Tibetan Organizations
Internet Privacy

Chinese Hackers Using Firefox Extension to Spy On Tibetan Organizations

February 25, 2021
DataStax Astra goes serverless | ZDNet
Big Data

DataStax Astra goes serverless | ZDNet

February 25, 2021
Tesla Working on Full Self-Driving Mode, Extending AI Lead 
Artificial Intelligence

Tesla Working on Full Self-Driving Mode, Extending AI Lead 

February 25, 2021
Cloudera aims to fast track enterprise machine learning use cases with Applied ML Prototypes
Machine Learning

Cloudera aims to fast track enterprise machine learning use cases with Applied ML Prototypes

February 25, 2021
Facebook bans Myanmar military-controlled accounts from its platforms
Internet Security

Facebook bans Myanmar military-controlled accounts from its platforms

February 25, 2021
NikolaNews

NikolaNews.com is an online News Portal which aims to share news about blockchain, AI, Big Data, and Data Privacy and more!

What’s New Here?

  • SolarWinds cybersecurity spending tops $3 million in Q4, sees $20 million to $25 million in 2021 February 26, 2021
  • Chinese Hackers Using Firefox Extension to Spy On Tibetan Organizations February 25, 2021
  • DataStax Astra goes serverless | ZDNet February 25, 2021
  • Tesla Working on Full Self-Driving Mode, Extending AI Lead  February 25, 2021

Subscribe to get more!

© 2019 NikolaNews.com - Global Tech Updates

No Result
View All Result
  • AI Development
    • Artificial Intelligence
    • Machine Learning
    • Neural Networks
    • Learn to Code
  • Data
    • Blockchain
    • Big Data
    • Data Science
  • IT Security
    • Internet Privacy
    • Internet Security
  • Marketing
    • Digital Marketing
    • Marketing Technology
  • Technology Companies
  • Crypto News

© 2019 NikolaNews.com - Global Tech Updates