At Google NEXT this week, Google is introducing its own strategy for accommodating open source platforms. Rather than compete against them with its own implementations, it is making them first-class citizens on GCP, with native integration into its own cloud management infrastructure. InfluxData, the creator of one of the most popular open source time series databases, has signed on. The move comes as time series databases are starting to emerge from the shadows. We’ll be reviewing this more deeply next week in our postmortem on the event.
As for time series data, it is not a new use case. But, with a few exceptions, databases optimized for this form of data have been rare. The explosion of machine data from sensors, mobile devices, and consumer electronics has generated a wealth of new use cases demanding time series analysis.
Among the earliest examples is KDB, a proprietary database developed by Kx that has been around for more than two decades and was originally targeted at time series analysis of stock market feeds for Wall Street firms. Streaming engines such as Amazon Kinesis Data Analytics also provide capabilities for aggregating and processing data in sliding time windows.
For many organizations, the strategy was to try their luck with SQL relational databases, an approach that was akin to fitting a square peg into a round hole. While the SQL language supports data types such as DATETIME and INTERVAL, most commercial databases lack features that optimize partitioning or indexes for time dimensions, and they were not designed for supporting sliding time windows.
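To make the idea of a sliding time window concrete, here is a minimal sketch in Python of a moving average over timestamped readings. It is an illustration only, not how any of the products mentioned here implement windowing; the function name `sliding_window_average` and the sample data are assumptions made for the example.

```python
from collections import deque
from datetime import datetime, timedelta


def sliding_window_average(points, window):
    """Yield (timestamp, mean) for each reading, averaged over the
    half-open interval (timestamp - window, timestamp].

    `points` is an iterable of (datetime, float) pairs sorted by time.
    Hypothetical helper written for this illustration.
    """
    if window <= timedelta(0):
        raise ValueError("window must be positive")
    buffer = deque()  # readings currently inside the window
    total = 0.0       # running sum of the buffered values
    for ts, value in points:
        buffer.append((ts, value))
        total += value
        # Evict readings that have slid out of the window.
        while buffer[0][0] <= ts - window:
            _, old_value = buffer.popleft()
            total -= old_value
        yield ts, total / len(buffer)


if __name__ == "__main__":
    start = datetime(2019, 4, 10, 12, 0)
    readings = [(start + timedelta(minutes=i), v)
                for i, v in enumerate([10.0, 20.0, 30.0, 40.0])]
    for ts, avg in sliding_window_average(readings, timedelta(minutes=2)):
        print(ts.isoformat(), avg)
```

Each new reading pushes the window forward and drops whatever has aged out, which is the access pattern that general-purpose relational engines were not originally designed to serve efficiently.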
And so, necessity being the mother of invention, recent years have seen an explosion of projects, many of them open source, for building time series data stores. Among the most popular is InfluxDB, an open source, purpose-built time series database that was first released into the wild back in 2013. InfluxData is the company behind it. Since then, InfluxDB has drawn a large community, with hundreds of thousands of implementations and more than 500 paying customers, and the company has attracted over $120 million in venture funding. Even Oracle, which would normally defer to its own databases, is a customer. At a recent InfluxDB customer event in New York, an Oracle speaker described how Oracle was building a performance metrics service for its public cloud using InfluxDB.
As no good deed goes unpunished, InfluxDB has drawn competition. The demand for managing and monitoring IoT, tracking the performance of cloud-based application services, and tracking user behavior has provided the impetus for new players to jump into the game. Kx, whose heritage came from Wall Street, is aiming to branch out from capital markets feeds to IoT and has just entered a partnership with H2O.ai to integrate its KDB database into H2O’s Driverless AI data science platform. Timescale, which was founded in 2015, has taken a different path, adapting PostgreSQL to appeal to the broad body of SQL developers. It adds a virtual “hypertable” atop the PostgreSQL engine that handles the partitioning problem, automatically abstracting time series data into distributed “chunks” that still appear as a single logical view. There’s also Interana, the analytics engine that is jointly sold with Microsoft for performing customer behavioral analytics with Azure Active Directory, Bing, and Office 365. Interana does not position itself as a time series database provider, but it employs one on its back end to deliver its analytics of customer behavior.
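The “hypertable” idea described above for Timescale, with time-partitioned chunks hidden behind a single logical view, can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Timescale’s implementation: the class `ChunkedSeries`, its methods, and the fixed chunk interval are all hypothetical names chosen for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


class ChunkedSeries:
    """Toy time-partitioned store: rows live in per-interval chunks,
    but callers insert and query as if there were one table."""

    def __init__(self, chunk_interval):
        if chunk_interval <= timedelta(0):
            raise ValueError("chunk_interval must be positive")
        self.chunk_interval = chunk_interval
        self.chunks = defaultdict(list)  # chunk start -> list of rows

    def _chunk_start(self, ts):
        # Align the timestamp down to the start of its chunk.
        index = (ts - EPOCH) // self.chunk_interval
        return EPOCH + index * self.chunk_interval

    def insert(self, ts, value):
        self.chunks[self._chunk_start(ts)].append((ts, value))

    def query(self, start, end):
        """Return rows with start <= ts < end, sorted by time.
        Chunks that cannot overlap the range are skipped entirely."""
        rows = []
        for chunk_start, chunk_rows in self.chunks.items():
            chunk_end = chunk_start + self.chunk_interval
            if chunk_end <= start or chunk_start >= end:
                continue  # prune: this chunk lies outside the range
            rows.extend(r for r in chunk_rows if start <= r[0] < end)
        return sorted(rows)


if __name__ == "__main__":
    series = ChunkedSeries(timedelta(days=1))
    series.insert(datetime(2019, 4, 9, 23, 0, tzinfo=timezone.utc), 1.0)
    series.insert(datetime(2019, 4, 10, 1, 0, tzinfo=timezone.utc), 2.0)
    series.insert(datetime(2019, 4, 11, 5, 0, tzinfo=timezone.utc), 3.0)
    print(series.query(datetime(2019, 4, 9, 22, 0, tzinfo=timezone.utc),
                       datetime(2019, 4, 10, 12, 0, tzinfo=timezone.utc)))
```

The point of the design is that a time-bounded query only needs to touch the chunks that overlap its range, while the caller never has to know how the data is partitioned.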
And then there’s AWS. Its entry into a market not only challenges independents but also validates that the technology is ready for prime time. AWS recently introduced Timestream, which is now in preview. It offers a modular architecture, decoupling data ingest from query and storage to support elasticity and scalability. Timestream also offers APIs and SDKs with the goal of making the platform agnostic to query language and output formatting. Amazon differentiates Timestream not only as a managed, purpose-built time series cloud database, but also by its scale and the option to pay per use. For now, the Timestream preview supports a proprietary SQL-like query language, which will be familiar to developers, but in the long run it could be opened up to other query languages.
Taking a turn
With all this mounting competition, InfluxDB is about to unleash a second version of its platform, overhauling and simplifying the overall API and introducing a new functional query language. It’s a familiar story with emerging technologies: at the v2.0 level, code and interfaces are refactored as the development team (or open source project) contemplates what it’s going to take to scale. It happened with Spark, which for its 2.0 generation changed the APIs to unify streaming and batch and introduced new libraries for machine learning.
Specifically, InfluxDB 2.0 decouples the query language from the database engine and introduces a new query language, Flux. The 2.0 version simplifies deployment of InfluxDB by putting each of the pieces of its “TICK” stack (each a separate open source project) under the same unified API. These pieces encompass data collection, the GUI, the streaming data processing engine, and the core database, with modules such as Telegraf having drawn more than 200 plug-ins to date. The 2.0 open source version should enter general release sometime early in the second half of the year, with the commercial enterprise version to follow shortly after that. However, the InfluxDB 2.0 managed cloud edition, a serverless offering that will be aimed at new accounts using the new Flux language, is targeting release by the end of the second quarter.
Clearly, making such basic architectural changes is a huge riverboat gamble for InfluxData. The challenge, of course, is keeping the community together: while migration to the new unified API should be an exercise in simplification, and one that could be rolled into existing deployments, introducing a new query language runs the risk of dividing the community. It does so at the point where the company is drawing higher profile competitors as awareness of time series databases expands dramatically. It was an experience that the Spark community survived, as the move to a new generation did not drive attrition; but the transition also occurred as alternatives to Spark processing emerged for running advanced analytic and AI models.
For InfluxDB, the decoupling of the query language from the database could provide an answer to keeping the community whole.