Databricks AI Summit 2023 Databricks Session
The landscape of data science and AI is rapidly changing, with key industry leaders like Google leading the charge. Despite the apparent gap in resources and budgets between technology giants and startups, solutions such as Databricks are evolving to bridge this gap. This report summarizes the key points discussed at a recent conference on this theme.
Importance of Data and AI
The conference strongly emphasized that the winners in every industry will be data and AI companies. However, it was also acknowledged that many companies still struggle to handle data effectively, often because they lack resources and infrastructure.
Evolution of Data Handling
Initially, handling structured data through tools like Excel, business intelligence, and data warehousing was straightforward. As unstructured data, and data without clear semantics, became more prevalent, a more sophisticated platform was needed: one that could handle data lakes, orchestration, governance, data science, data warehousing, streaming, and business intelligence (BI).
Governance and Silos
The pitfalls of poor data governance were highlighted, noting that it can lead to flawed engineering. Data silos were identified as drivers of high operational costs. Inconsistent policies and disparate tools can reduce trust in data and inhibit cross-team productivity.
Databricks' Lakehouse
Databricks proposes a solution akin to the 'iPhone of data': the Lakehouse. It is designed to unify all uses of data into a single layer, providing one copy of the data with centralized governance. The Lakehouse concept is built on this unification, which Databricks presents as its main advantage in data management.
Open Source and Portability
Open source was not presented as an advantage in and of itself. Rather, the value of 'open' solutions lies in portability and in avoiding lock-in, both of which can benefit organizations.
Data Explosion
The conference predicted that the amount of data in circulation will keep growing, to the point of a massive explosion of data.
Cost-effectiveness
According to the session, as data scales up, operations such as ETL become more expensive on platforms like Snowflake than on Databricks, making Databricks the more cost-effective option.
Real-time Streaming
Over 50% of Databricks' customers use its real-time streaming features for critical risk profiling, highlighting the importance of a platform capable of handling such operations.
AI/ML on Lakehouse
Lakehouse AI/ML aims to provide unified data and AI under a single security and governance model, further simplifying data handling and usage.
Dolly
Databricks has introduced Dolly, which it describes as the first truly open instruction-tuned LLM and which is licensed for commercial use.
Build vs. Buy
The conference concluded with a discussion about the 'build vs. buy' dilemma, presenting a checklist that includes considerations such as an abundance of engineers, time availability, financial resources, and the need for a single cloud.
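As a rough illustration only, the checklist can be sketched as a small Python structure. The field names, and the rule that building in-house is only worth considering when every item holds, are assumptions made for this sketch; the session did not spell out how the items combine into a decision.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BuildVsBuyChecklist:
    # Hypothetical field names for the four considerations mentioned above.
    has_abundant_engineers: bool
    has_time: bool
    has_budget: bool
    single_cloud_only: bool

    def unmet(self) -> List[str]:
        """Return the names of the checklist items that are not satisfied."""
        return [name for name, ok in vars(self).items() if not ok]

    def recommendation(self) -> str:
        # Assumption: "build" is only suggested when every item is satisfied.
        return "build" if not self.unmet() else "buy"


# Example: a small team with time and a single cloud, but few engineers and little budget.
startup = BuildVsBuyChecklist(
    has_abundant_engineers=False,
    has_time=True,
    has_budget=False,
    single_cloud_only=True,
)
print(startup.recommendation())  # buy
print(startup.unmet())           # ['has_abundant_engineers', 'has_budget']
```

Listing the unmet items, rather than returning only a verdict, keeps the focus on which constraint is actually driving the decision.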
Conclusion
As the importance and volume of data continue to grow, companies must invest in scalable, unified, and cost-effective solutions to stay competitive. Innovations like Databricks' Lakehouse model provide promising avenues to address the unique challenges posed by the modern data landscape.