Two new analytics engine startups, both based in Israel, are announcing launches one day apart. Yesterday, Varada unveiled its data platform, following a beta period. And this morning, Firebolt is announcing its cloud data warehouse as a service, as well as a $37 million Series A financing round, with participation from Zeev Ventures, TLV Partners, Bessemer Venture Partners and Angular Ventures.
The two startups’ platforms target similar workloads and run exclusively on Amazon Web Services at present. But their technology approaches differ in important ways.
Varada, founded by alumni of XtremIO, takes a data virtualization approach, using the open source Presto query engine for data source connectivity and basic query services. But Varada uses a combination of caching and machine learning-based optimizations to accelerate performance. It also uses multiple types of indexes (including even Lucene indexes for text-based data) to mitigate the need for doing huge file scans. Varada says its platform automatically chooses the most effective index for each “nano block” (Varada’s name for the sub-units of columnar data storage) based on the data content and structure.
In a demo for ZDNet, Varada showed side-by-side queries between it and Amazon Athena, which also uses Presto. The two platforms ran the same queries, on the very same data files in Amazon S3. Several queries executed two orders of magnitude faster on Varada, and even the one exception to that still ran over 30x faster. While one would expect a vendor-controlled demo to show that vendor’s platform in a positive light, this was still impressive.
But beyond speed, Varada’s use of indexes means it needed to scan much less data than did Athena. And since Athena bills based on scanned data volume, this isn’t just a matter of elegance, but a truly cost-saving feature. Varada also says it offers “glass box” visibility into workload performance and cluster utilization, with the platform optimizing workloads for their customer-configured priority and budget. In addition, Varada says that, using machine learning, the platform elastically adjusts the compute and storage cluster.
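To make the billing point concrete, here is a minimal sketch of how scan-based pricing turns a smaller scan into a smaller bill. The rate of $5 per terabyte and both scan sizes are illustrative assumptions, not figures from the demo or from either vendor.

```python
# Illustrative only: the rate and scan volumes below are assumed values.
BYTES_PER_TB = 1024 ** 4
BYTES_PER_GB = 1024 ** 3


def scan_cost_usd(bytes_scanned: int, usd_per_tb: float = 5.0) -> float:
    """Cost of one query on a service that bills purely by bytes scanned."""
    return bytes_scanned / BYTES_PER_TB * usd_per_tb


# Hypothetical query: a full scan versus an index-assisted scan of the same data.
full_scan_bytes = 2 * BYTES_PER_TB      # assumed: the engine reads every file
indexed_scan_bytes = 20 * BYTES_PER_GB  # assumed: indexes skip most of the data

full_cost = scan_cost_usd(full_scan_bytes)
indexed_cost = scan_cost_usd(indexed_scan_bytes)
print(f"full scan: ${full_cost:.2f}, indexed scan: ${indexed_cost:.2f}")
```

Under these assumptions the cost falls in direct proportion to the bytes read, which is why scanning less data matters for the bill as well as for speed.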
Firebolt: Beyond the cache
Firebolt is led by Eldad Farkash, who served as co-founder and CTO at Sisense (also founded in Israel) from 2004 to 2018. Farkash’s background is influenced by the work of researchers at Centrum Wiskunde & Informatica in the Netherlands, and their MonetDB project. That database pioneered the use of columnar storage, vector processing and utilizing the CPU cache, in addition to RAM, for query acceleration; Sisense’s engine took a similar approach. Firebolt leverages the CPU cache as well, but that’s not its headline architectural feature.
Firebolt’s philosophy is that the Parquet columnar file format, now relied upon by most data lake technologies, while innovative, isn’t sufficient to support lightning-fast queries. Its combination of columnar and partitioned storage can be good for certain BI-style queries that happen to be aggregated by the column the file is partitioned upon (date for example). But when queries go outside that scope (for example, aggregating on geography or product), and can’t exploit the partitioning scheme, big file scans are required and performance can suffer significantly.
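The partitioning limitation can be sketched in a few lines. This is a toy model, not Parquet or any vendor's engine: the `DataFile` class, the file sizes and the dates are all hypothetical, and each "file" stands for one partition of a table partitioned by date.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataFile:
    partition_date: str  # value of the partition column for every row in the file
    size_mb: int


# Hypothetical table: one file per day.
FILES = [
    DataFile("2020-12-01", 500),
    DataFile("2020-12-02", 520),
    DataFile("2020-12-03", 480),
]


def mb_scanned(files: list[DataFile], date_filter: Optional[str] = None) -> int:
    """Return how many MB a query must read.

    A filter on the partition column lets the engine skip whole files.
    A predicate on any other column (geography, product) cannot prune,
    so it is modeled here as date_filter=None and reads every file.
    """
    if date_filter is None:
        return sum(f.size_mb for f in files)
    return sum(f.size_mb for f in files if f.partition_date == date_filter)


print(mb_scanned(FILES, "2020-12-02"))  # query aligned with the partitioning
print(mb_scanned(FILES))                # query on a non-partition column
```

The second call reads the whole table, which is the big-file-scan case described above.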
Firebolt’s solution to this problem is to use its own FFF file format. FFF varies its structure depending on what tier of the storage hierarchy (Amazon S3, solid state disk or CPU cache) is used. It uses new compression and encoding options and is optimized for the Firebolt query engine. Every data file is sorted by a primary key and indexed, using sparse indexes that are loaded into memory. The indexes accelerate queries when the physical sort order can’t. And in addition to these optimizations, Firebolt can utilize GPUs to accelerate certain workloads further.
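As a rough illustration of the general sparse-index technique (not Firebolt's actual file format, whose internals are not described here), the sketch below keeps one in-memory index entry per block of sorted rows and uses it to jump straight to the only block that can contain a given key. The block size, row layout and function names are assumptions made for the example.

```python
import bisect
from typing import Optional

BLOCK_SIZE = 4  # assumed number of rows per block

# Rows sorted by primary key (even keys 0..38), standing in for a sorted data file.
rows = [(key, f"row-{key}") for key in range(0, 40, 2)]

# Sparse index: only the first key of each block is kept in memory.
sparse_index = [rows[i][0] for i in range(0, len(rows), BLOCK_SIZE)]


def lookup(key: int) -> Optional[str]:
    """Find a row by key, reading at most one block instead of the whole file."""
    # Rightmost block whose first key is <= the search key.
    block = bisect.bisect_right(sparse_index, key) - 1
    if block < 0:
        return None  # key is smaller than every key in the file
    start = block * BLOCK_SIZE
    for row_key, value in rows[start:start + BLOCK_SIZE]:
        if row_key == key:
            return value
    return None


print(lookup(18))
print(lookup(19))
```

Because the index holds one entry per block rather than one per row, it stays small enough to keep in memory even when the underlying file is very large.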
The need for speed
Both Varada and Firebolt are focused on making data lakes into launchpads for fast analytics over huge volumes of data, and not just solutions for storing it. Firebolt builds a proprietary data warehouse, with its own nested data-optimized SQL syntax, over the lake. Varada uses standard data lake storage formats and a popular, open source query engine, but augments it with its own indexing technology and workload management.
In the broader data and analytics market, whether the nomenclature is “lake,” “warehouse” or “lakehouse,” the goal is the same: provide fast queries over large data volumes and control costs. Digital transformation is driving these needs, and the coronavirus pandemic has accelerated digital transformation. Most vendors now realize that balancing price, performance and ease of use should be prioritized over developing raw features. And in the case of Varada and Firebolt, the market now has two startups founded on this very notion.