What is a Delta Table?
A Delta Table is a table format used in Delta Lake, an open-source storage layer that brings ACID transactions to Apache Spark and to data lake storage such as HDFS or cloud object stores. It helps manage large datasets by combining the benefits of data lakes (flexibility and scalability) with the reliability and performance of data warehouses.
To put it simply:

- A Delta Table supports ACID transactions, meaning it can handle updates, deletes, and inserts in a consistent and reliable way. This makes it more robust than a plain data lake, which usually lacks these guarantees.
- It supports versioning, so you can track changes to the data over time. This enables time travel: you can query previous versions of your data (see the sketch below).
- Optimized performance: Delta Tables speed up reads and writes by keeping rich metadata and applying optimizations such as data skipping and file compaction.
Think of a Delta Table as a highly reliable, performant, and flexible version of a regular table that works well in big data environments.
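Since the ACID and time-travel behavior is easiest to see in action, here is a minimal PySpark sketch of an update followed by a versioned read. It assumes a Databricks notebook (or any Spark session with the delta-spark package) where `spark` is predefined; the table name `events` is made up for illustration.

```python
from delta.tables import DeltaTable

# Hypothetical sample data for illustration.
df = spark.createDataFrame([(1, "pending"), (2, "pending")], ["id", "status"])

# Version 0: initial write as a Delta table.
df.write.format("delta").mode("overwrite").saveAsTable("events")

# ACID update: affected files are rewritten atomically,
# and the transaction log records a new version.
tbl = DeltaTable.forName(spark, "events")
tbl.update(condition="status = 'pending'", set={"status": "'processed'"})

# Time travel: query the table as it was before the update.
old_df = spark.sql("SELECT * FROM events VERSION AS OF 0")
old_df.show()
```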
Difference between managed and external tables?
A managed table in Databricks means Databricks controls both the data files and the metadata. An external table means Databricks only manages the metadata, while the actual data files remain in an external location (like S3, ADLS, or Blob Storage).
🗂️ Managed Tables
- Storage location: Data is stored inside Databricks' default warehouse directory (usually dbfs:/user/hive/warehouse).
- Lifecycle: When you drop a managed table, Databricks deletes both the metadata and the underlying data files.
- Use case: Best when you want Databricks to fully manage the table lifecycle and don't need to reuse the data outside Databricks.
- Creation example:

```python
df.write.saveAsTable("my_managed_table")
```
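To verify where a managed table's files actually live, Delta's DESCRIBE DETAIL command reports the storage location; a quick sketch, assuming the table created above:

```python
# DESCRIBE DETAIL returns a single row with the table's storage
# location, format, size, and other metadata.
detail = spark.sql("DESCRIBE DETAIL my_managed_table")
detail.select("location", "format").show(truncate=False)
```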
📦 External Tables
- Storage location: Data resides in an external path you specify (e.g., dbfs:/mnt/mydata/ or s3://bucket/path).
- Lifecycle: Dropping the table removes only the metadata; the underlying data files remain intact.
- Use case: Ideal when data is shared across multiple systems, or you want to retain control of the files outside Databricks.
- Creation example:

```sql
CREATE TABLE my_external_table
USING DELTA
LOCATION 'dbfs:/mnt/mydata/external_table_path';
```
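The same external table can also be created from PySpark by passing a path option to saveAsTable, which registers the metadata while leaving the files at the path you chose. A sketch, reusing the hypothetical path from the SQL example and assuming `df` is an existing DataFrame:

```python
# Supplying an explicit path makes the table external (unmanaged):
# Databricks registers the metadata but treats the files as yours.
df.write.format("delta") \
    .option("path", "dbfs:/mnt/mydata/external_table_path") \
    .saveAsTable("my_external_table")
```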
🔑 Key Differences

| Aspect | Managed Table | External Table |
|---|---|---|
| Data location | Databricks warehouse dir | External path (S3, ADLS, Blob, DBFS mount) |
| Lifecycle | Dropping table deletes data and metadata | Dropping table deletes metadata only |
| Control | Databricks controls everything | User controls data files |
| Best for | Full Databricks-managed workflows | Shared/external datasets |
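The lifecycle row above is easy to verify yourself: drop both tables and check whether the underlying files survive. A sketch, assuming the tables and the hypothetical path from the earlier examples (dbutils is available in Databricks notebooks):

```python
spark.sql("DROP TABLE my_managed_table")   # deletes metadata AND data files
spark.sql("DROP TABLE my_external_table")  # deletes metadata only

# The external table's files remain and could be re-registered later.
files = dbutils.fs.ls("dbfs:/mnt/mydata/external_table_path")
print([f.path for f in files])
```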