Databricks AI Summit 2023 Databricks Session
The landscape of data science and AI is rapidly changing, with industry leaders like Google leading the charge. Despite the apparent gap in resources and budgets between technology giants and startups, platforms such as Databricks are evolving to bridge it. This report summarizes the key points of a recent conference session on that theme.
Importance of Data and AI
The conference strongly emphasized that the winners in every industry will be data and AI companies. It also acknowledged that many companies still struggle to handle data effectively, often because they lack resources and infrastructure.
Evolution of Data Handling
Initially, handling structured data with tools like Excel, business intelligence (BI), and data warehousing was straightforward. As unstructured and semi-structured data became more prevalent, a more sophisticated platform was needed, one that could handle data lakes, orchestration, governance, data science, data warehousing, streaming, and BI.
Governance and Silos
The conference highlighted the pitfalls of poor data governance, noting that it can lead to flawed engineering. Data silos were identified as drivers of high operational costs, and inconsistent policies and disparate tools as factors that reduce trust in data and inhibit cross-team productivity.
Databricks' Lakehouse
Databricks proposes a solution it likens to the 'iPhone of data': the Lakehouse. It is designed to unify all data use cases in a single layer, providing one copy of the data with centralized governance. This unification is the core of the Lakehouse concept and the source of its advantages in data management.
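The idea of one copy of data with centralized governance can be illustrated with a small sketch. This is a toy model only, not the Databricks API: the GovernedCatalog class, the role names, and the "orders" table are all hypothetical and were not part of the session. It shows two different workloads reading the same stored rows through a single access check.

```python
from dataclasses import dataclass, field


@dataclass
class GovernedCatalog:
    """Toy catalog (hypothetical): one copy of each table, one set of access rules."""
    tables: dict = field(default_factory=dict)  # table name -> list of row dicts
    grants: dict = field(default_factory=dict)  # table name -> set of allowed roles

    def register(self, name, rows, allowed_roles):
        # Store the single copy of the data together with its access policy.
        self.tables[name] = rows
        self.grants[name] = set(allowed_roles)

    def read(self, name, role):
        # Every workload passes through the same check before touching the data.
        if role not in self.grants.get(name, set()):
            raise PermissionError(f"role {role!r} may not read {name!r}")
        return self.tables[name]


catalog = GovernedCatalog()
catalog.register(
    "orders",
    rows=[{"id": 1, "amount": 40.0}, {"id": 2, "amount": 60.0}],
    allowed_roles={"bi_analyst", "ml_engineer"},
)

# A BI-style aggregate and an ML-style feature list read the same stored rows.
bi_total = sum(r["amount"] for r in catalog.read("orders", "bi_analyst"))
ml_features = [[r["amount"]] for r in catalog.read("orders", "ml_engineer")]
```

In a siloed setup, each team would keep its own copy of "orders" with its own access rules; here both consumers share one copy and one policy, which is the property the Lakehouse pitch emphasizes.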
Open Source and Portability
Open source was not presented as advantageous in and of itself. Rather, the value of 'open' solutions lies in portability and in avoiding lock-in, both of which can benefit organizations.
Data Explosion
The conference predicted that the amount of data in circulation will not just continue to grow but grow explosively.
Cost-effectiveness
According to the conference, as data volumes scale up, certain operations such as ETL become more expensive on platforms like Snowflake than on Databricks, making Databricks the more cost-effective option.
Real-time Streaming
According to the conference, over 50% of Databricks' customers use its real-time streaming features for critical risk profiling, which underlines the importance of a platform capable of handling such operations.
AI/ML on Lakehouse
Lakehouse AI/ML aims to provide unified data and AI under a single security and governance model, further simplifying data handling and usage.
Dolly
Databricks has introduced Dolly, which it describes as the first truly open instruction-tuned LLM licensed for commercial use.
Build vs. Buy
The conference concluded with a discussion of the 'build vs. buy' dilemma, presenting a checklist of considerations: whether engineers are abundant, how much time is available, what financial resources exist, and whether a single cloud is required.
Conclusion
As the importance and volume of data continue to grow, companies must invest in scalable, unified, and cost-effective solutions to stay competitive. Innovations like Databricks' Lakehouse model provide promising avenues to address the unique challenges posed by the modern data landscape.