Since its launch in 2013, Databricks has relied on its ecosystem of partners, such as Fivetran, Rudderstack, and dbt, to provide tools for data preparation and loading. But now, at its annual Data + AI Summit, the company announced LakeFlow, its own data engineering solution for data ingestion, transformation and orchestration that eliminates the need for third-party tools.
With LakeFlow, Databricks users will soon be able to build their data pipelines and ingest data from databases like MySQL, Postgres, SQL Server and Oracle, as well as enterprise applications like Salesforce, Dynamics, SharePoint, Workday, NetSuite and Google Analytics.
Why the change of heart after relying on its partners for so long? Databricks co-founder and CEO Ali Ghodsi explained that when he asked his advisory board at the Databricks CIO Forum two years ago about future investments, he expected requests for more machine learning features. Instead, the audience wanted better data ingestion from various SaaS applications and databases. “Everybody in the audience said: we just want to be able to get data in from all these SaaS applications and databases into Databricks,” he said. “I literally told them: we have great partners for that. Why should we do this redundant work? You can already get that in the industry.”
As it turns out, even though building connectors and data pipelines may now feel like a commoditized business, the vast majority of Databricks customers were not actually using its ecosystem partners but were instead building their own bespoke solutions to cover edge cases and meet their security requirements.
At that point, the company started exploring what it could do in this space, which eventually led to the acquisition of the real-time data replication service Arcion last November.
Ghodsi stressed that Databricks plans to “continue to double down” on its partner ecosystem, but clearly there is a segment of the market that wants a service like this built into the platform. “This is one of those problems they just don’t want to have to deal with. They don’t want to buy another thing. They don’t want to configure another thing. They just want that data to be in Databricks,” he said.
In a way, getting data into a data warehouse or data lake should indeed be table stakes because the real value creation happens down the line. The promise of LakeFlow is that Databricks can now offer an end-to-end solution that allows enterprises to take their data from a wide variety of systems, transform and ingest it in near real-time, and then build production-ready applications on top of it.
At its core, the LakeFlow system consists of three parts. The first is LakeFlow Connect, which provides the connectors between the different data sources and the Databricks service. It’s fully integrated with Databricks’ Unity Catalog data governance solution and relies in part on technology from Arcion. Databricks also did a lot of work to enable this system to scale out quickly to very large workloads when needed. Right now, it supports SQL Server, Salesforce, Workday, ServiceNow and Google Analytics, with MySQL and Postgres following very soon.
The second part is LakeFlow Pipelines, which is essentially a version of Databricks’ existing Delta Live Tables framework for implementing data transformation and ETL in either SQL or Python. Ghodsi stressed that LakeFlow Pipelines offers a low-latency mode for data delivery and also supports incremental data processing, so that for most use cases only changes to the original data have to be synced with Databricks.
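Since LakeFlow Pipelines builds on Delta Live Tables, a rough sense of what such a transformation looks like today comes from a minimal Delta Live Tables sketch in Python; the catalog, table and column names below are illustrative placeholders, not anything Databricks has announced as part of LakeFlow.

```python
# Minimal Delta Live Tables sketch; table and column names are placeholders.
# Inside a DLT pipeline, `spark` and the `dlt` module are provided by the runtime.
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Raw orders read incrementally from an upstream table")
def orders_raw():
    # A streaming read keeps processing incremental: each pipeline update
    # picks up only new rows instead of reprocessing the full dataset.
    return spark.readStream.table("some_catalog.sales.orders")

@dlt.table(comment="Cleaned orders with a basic data-quality rule applied")
@dlt.expect_or_drop("valid_amount", "amount > 0")
def orders_clean():
    return dlt.read_stream("orders_raw").select(
        col("order_id"), col("customer_id"), col("amount"), col("order_ts")
    )
```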
The third part is LakeFlow Jobs, which is the engine that provides automated orchestration and ensures data health and delivery. “So far, we’ve talked about getting the data in, that’s Connectors. And then we said: let’s transform the data. That’s Pipelines. But what if I want to do other things? What if I want to update a dashboard? What if I want to train a machine learning model on this data? What are other actions in Databricks that I need to take? For that, Jobs is the orchestrator,” Ghodsi explained.
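The announcement doesn’t detail a LakeFlow Jobs API, but as a hedged illustration of what that kind of multi-task orchestration looks like on Databricks today, here is a sketch using the existing Databricks Python SDK’s Jobs API: it runs a pipeline update and then a downstream notebook task, with the pipeline ID, notebook path and cluster ID left as placeholders.

```python
# Hedged sketch of multi-task orchestration with the existing Databricks
# Python SDK (databricks-sdk); the IDs and paths below are placeholders.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()  # picks up credentials from the environment or config file

job = w.jobs.create(
    name="orders-refresh",
    tasks=[
        # Step 1: run an update of the ingestion/transformation pipeline.
        jobs.Task(
            task_key="refresh_pipeline",
            pipeline_task=jobs.PipelineTask(pipeline_id="<pipeline-id>"),
        ),
        # Step 2: once the data is fresh, retrain a model (or refresh a dashboard).
        jobs.Task(
            task_key="train_model",
            depends_on=[jobs.TaskDependency(task_key="refresh_pipeline")],
            notebook_task=jobs.NotebookTask(notebook_path="/Workspace/ml/train"),
            existing_cluster_id="<cluster-id>",
        ),
    ],
)
print(f"Created job {job.job_id}")
```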
Ghodsi also noted that a lot of Databricks customers are now looking to lower their costs and consolidate the number of services they pay for — a refrain I’ve been hearing from enterprises and their vendors almost daily for the last year or so. Offering an integrated service for data ingestion and transformation aligns with this trend.
Databricks is rolling out the LakeFlow service in phases. First up is LakeFlow Connect, which will soon become available as a preview; the company has set up a sign-up page for the waitlist.