A few weeks ago, I covered Tableau’s release of a dataset, updated daily, with a simplified presentation of Johns Hopkins Center for Systems Science and Engineering’s (JHU’s) global COVID-19 dataset. That was an important step in democratizing the data so that people could connect to and analyze it on a self-service basis. I was one such person and performed a few simple analyses on that I shared in the post.
Also read: Tableau makes Johns Hopkins coronavirus data available for the rest of us
Meanwhile, people are hungry for more. Data enthusiasts and epidemic specialists alike want access to metrics beyond confirmed cases and deaths, as well as demographic data outside the scope of COVID-19 itself. There’s a lot of public data out there, but tracking it down, cleaning it, blending it and modeling it isn’t trivial. So now, various companies in the data space are working to address pain points and make it easier to work with this broader range of data.
OLAP for COVID
Let’s start with AtScale, the San Mateo- and Boston-based company focused on OLAP over big data in the cloud. The company is announcing today its COVID-19 Cloud OLAP Model, ready for drill-down analysis. AtScale is hosting the model on its own platform and making it available for querying, free of charge. Datasets include the Starschema: COVID-19 Epidemiological Data, which is available through Snowflake’s Data Exchange, and data from Boston Children’s Hospital’s COVIDNearYou.org. AtScale says their model is updated daily, as are the source datasets.
To gain access to the AtScale model, interested parties can request access here. AtScale will respond with an email providing login information and connection instructions. Attached to that email are fully developed Excel and Tableau workbooks based on the model (the Excel one is pictured above). Users can open those workbooks, plug in their unique user id and password, then start slicing, dicing and analyzing.
Databricks provides easy data access, launches hackathon
Meanwhile, Databricks, whose Spark-based platform serves as a workbench for data engineers and data scientists, is adding value to the COVID-19 data scene too. To begin with, Databricks has added various COVID-19 datasets to be available natively on its platform (on both the Amazon Web Services and Microsoft Azure clouds). Specifically, developers can find the data in the “/databricks-datasets/COVID/” folder built in the Databricks file system (DBFS), on either the paid service or the free Community Edition. In other words, spin up any Databricks cluster and COVID-19 data will be in its file system, automatically. The company has also created sample workbooks that demonstrate how to open up the data and analyze it — details on the datasets and links to the notebooks are provided in a blog post by Databricks’ Denny Lee.
In addition to the data availability, and in coordination with Databricks’ upcoming Spark + AI Summit virtual event, Databricks is launching a related hackathon under the banner “Data Teams Unite!” Teams who enter the hackathon will be asked to focus on COVID-19, climate change or challenges in their own communities (using open data resources made available by national, regional, state and local governments). Since Databricks’ event is a virtual one this year, and is free, the company expects a significant increase in attendance and hopes to see robust hackathon participation. Teams of up to 4 people may enter the hackathon. Three finalist teams will be selected and Databricks will make direct donations to charities of the teams’ choice; the grand prize winner also receives free training and a ticket to a future Spark + AI event. The Hackathon begins today and entries are due June 12. Judging will take place between June 15-19.
Looker and others
A slew of other companies have offerings of their own. For example, just yesterday, Looker, now part of Google Cloud, announced yesterday its COVID-19 Data Block, including LookML models, ready-to-run dashboards and Looker “explores” (which allow ad hoc slicing and dicing of the data). The Looker offering uses the COVID-19 data its parent has made available, free of charge, on its BigQuery service (details here), and is offered on a hosted instance of Looker that is also free. Data in the models is drawn from JHU, the New York Times, the COVID Tracking Project, Definitive Healthcare, the Kaiser Family Foundation, and Italy’s Dipartimento della Protezione Civile.
And there’s more. Starschema and Snowflake have teamed up to offer a data share pre-loaded with COVID-19 related data (it’s one of the data sources AtScale uses in its model). The share is available to current Snowflake customers or those with trial accounts; request access here. Yellowbrick is providing free access to its data warehouse service to aid researchers and companies actively working on a vaccine for COVID-19 (details here). MariaDB is offering healthcare, medical and academic nonprofits that are fighting COVID-19 free access to MariaDB SkySQL. Location intelligence-focused HERE Technologies is offering its Tracking Coronavirus COVID-19 site. Not enough? Even more resources can be found on data.world’s Coronavirus (COVID-19) Data Resource Hub.
There are a lot of resources out there that go well beyond CSV files. Specialists focused on the crisis have lots of choices; that should help them get to insights — and, hopefully, sound policies and effective protocols — faster. And if you’re a non-specialist and find yourself home due to lockdown, in need a project to focus on, maybe you can take advantage of all these great COVID-19 data resources as well.
Updated April 22 at 12:40pm ET to revise Databricks hackathon due date and judging period from May 29 and June 1-5, respectively to June 12 and June 15-19, respectively.