In the previous blog, we discussed Apache Iceberg’s basic concepts, the setup process, and how to load data. In this post, we will delve into some of Iceberg’s advanced features, including upsert functionality, schema evolution, time travel, and partitioning.
Upsert Functionality
One of Iceberg’s key features is its support for upserts. Upsert, which stands for update and insert, allows you to efficiently manage changes to your data. With Iceberg, you can perform these operations seamlessly, ensuring that your data remains accurate and up-to-date without the need for complex and time-consuming processes.
Schema Evolution
Schema evolution is another of Iceberg’s powerful features. Over time, the schema of your data may need to change due to new requirements or updates. Iceberg handles schema changes gracefully, allowing you to add, remove, or modify columns without having to rewrite your entire dataset. This flexibility ensures that your data architecture can evolve in tandem with your business needs.
Time Travel
Iceberg also provides time travel capabilities, enabling you to query historical data as it existed at any given point in time. This feature is particularly useful for debugging, auditing, and compliance purposes. By leveraging snapshots, you can easily access previous states of your data and perform analyses on how it has changed over time.
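As a minimal sketch, assuming a SparkSession `spark` already configured with an Iceberg catalog (see the setup section below) and a hypothetical table `local.db.sample`, time travel looks like this:

```python
# Inspect the table's snapshot history via its "snapshots" metadata table.
spark.sql(
    "SELECT snapshot_id, committed_at FROM local.db.sample.snapshots"
).show()

# Read the table as of a specific snapshot id (taken from the query above)...
df_snapshot = (
    spark.read.option("snapshot-id", 5754037935015704314)  # example id
    .format("iceberg")
    .load("local.db.sample")
)

# ...or as it existed at a point in time (milliseconds since the epoch).
df_as_of = (
    spark.read.option("as-of-timestamp", 1690000000000)
    .format("iceberg")
    .load("local.db.sample")
)
```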
Setting Up Iceberg on a Local Machine Using the Local Catalog Option or Hive
You can also configure Iceberg in your Spark session like this:
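Below is a minimal sketch using a file-based (Hadoop) catalog; the catalog name `local`, the warehouse path, and the runtime package version are assumptions to adapt for your environment:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-demo")
    # Pull in the Iceberg Spark runtime (version is an assumption; match your Spark).
    .config(
        "spark.jars.packages",
        "org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.4.2",
    )
    # Enable Iceberg's SQL extensions (needed for MERGE INTO, ALTER TABLE ..., etc.).
    .config(
        "spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    )
    # A file-based "hadoop" catalog named "local" backed by a local warehouse dir.
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg/warehouse")
    .getOrCreate()
)
```

For a Hive-backed catalog, you would instead set the catalog `type` to `hive` and point its `uri` property at your metastore (for example, `spark.sql.catalog.local.uri = thrift://localhost:9083`).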
We can either create the sample table using Spark SQL or write the data directly by specifying the database and table name, which will create the Iceberg table for us.
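For instance (the table name `local.db.sample` is a stand-in):

```python
# Option 1: create the table explicitly with Spark SQL.
spark.sql("""
    CREATE TABLE IF NOT EXISTS local.db.sample (
        id   BIGINT,
        name STRING
    ) USING iceberg
""")

# Option 2: write a DataFrame to a table name; Iceberg creates the table.
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
df.writeTo("local.db.sample").createOrReplace()
```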
You can see the data we have inserted. Apart from appending, you can use the overwrite method, just as you would with Delta Lake tables. Below is an example of how to read the data from an Iceberg table.
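Continuing the sketch with the same assumed table:

```python
# Append more rows to the table.
more = spark.createDataFrame([(3, "carol")], ["id", "name"])
more.writeTo("local.db.sample").append()

# Or overwrite instead of appending (dynamic overwrite; for an
# unpartitioned table this replaces all of the data).
more.writeTo("local.db.sample").overwritePartitions()

# Read the data back, via the DataFrame API or plain SQL.
spark.read.format("iceberg").load("local.db.sample").show()
spark.sql("SELECT * FROM local.db.sample").show()
```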
Handling Upserts
This Iceberg feature is similar to Delta Lake’s. You can update records in existing Iceberg tables without rewriting the complete dataset, which also makes it useful for handling CDC operations. We can take input from an incoming CSV file and merge it into the existing table without any duplication, so the table always holds a single record per primary key, and Iceberg’s ACID guarantees ensure the merge is applied atomically.
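A sketch of such an upsert, with an assumed CSV path and column names:

```python
# Stage the incoming CSV as a temporary view.
updates = (
    spark.read.option("header", "true")
    .schema("id BIGINT, name STRING")
    .csv("/tmp/incoming/updates.csv")  # hypothetical path
)
updates.createOrReplaceTempView("updates")

# Upsert: update rows with matching primary keys, insert the rest.
spark.sql("""
    MERGE INTO local.db.sample AS t
    USING updates AS s
    ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```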
Here, we can see the data once the merge operation has taken place.
Schema Evolution
Iceberg supports the following schema evolution changes:
Add – Add a new column to the Iceberg table
Drop – Remove an existing column from the table
Rename – Rename an existing column
Update – Widen the data type of an existing column (e.g., int to bigint)
Reorder – Change the order of columns in the Iceberg table
After updating the schema, there is no need to overwrite or rewrite the data. Suppose your table previously had four columns, all populated with data; if you add two more columns, you do not need to rewrite the existing data for the now six-column table, and you can still access it easily. This capability was lacking in Delta Lake but is present in Iceberg. Some key guarantees of Iceberg schema evolution:
If we add any columns, they won’t impact the existing columns.
If we delete or drop any columns, they won’t impact other columns.
Updating a column or field does not change values in any other column.
Iceberg uses unique IDs to track each column added to a table.
Let’s run some queries to update the schema and try deleting some columns.
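For example (these ALTER statements require the Iceberg SQL extensions enabled in the session configuration shown earlier):

```python
# Add: a new column, backed by a fresh unique column id.
spark.sql("ALTER TABLE local.db.sample ADD COLUMN email STRING")

# Rename: existing data is untouched because Iceberg tracks columns by id.
spark.sql("ALTER TABLE local.db.sample RENAME COLUMN name TO full_name")

# Reorder: move a column without rewriting any files.
spark.sql("ALTER TABLE local.db.sample ALTER COLUMN email FIRST")

# Drop: remove the column from the schema; data files are not rewritten.
spark.sql("ALTER TABLE local.db.sample DROP COLUMN email")
```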
After adding another column, we can still query the table without seeing any kind of error. This is how Iceberg prevents schema changes from breaking existing reads.
Partition Evolution and Sort Order Evolution
Iceberg introduced partition evolution, an option that was missing in Delta Lake. When you evolve a partition spec, the old data written with the earlier spec remains unchanged, and new data is written using the new spec in a new layout. Metadata for each partition version is kept separately. Because of this, query planning uses split planning: each partition layout plans its files separately, using the filter it derives for that specific partition layout.
Similar to partition spec, Iceberg sort order can also be updated in an existing table. When you evolve a sort order, the old data written with an earlier order remains unchanged.
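A short sketch of both, again assuming the extensions and the hypothetical table above:

```python
# Evolve the partition spec; files already written keep their old layout.
spark.sql("ALTER TABLE local.db.sample ADD PARTITION FIELD bucket(16, id)")

# A partition field can also be dropped again later.
spark.sql("ALTER TABLE local.db.sample DROP PARTITION FIELD bucket(16, id)")

# Evolve the sort order; only newly written data is ordered by it.
spark.sql("ALTER TABLE local.db.sample WRITE ORDERED BY id")
```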
Iceberg supports both Copy-On-Write (COW) and Merge-On-Read (MOR) when loading data into an Iceberg table. We can configure this either while creating the table or by altering it later, for example:
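A sketch using Iceberg’s write-mode table properties (the table name is hypothetical; note that MOR requires format version 2):

```python
# Choose the mode per operation type at table-creation time.
spark.sql("""
    CREATE TABLE IF NOT EXISTS local.db.events (
        id      BIGINT,
        payload STRING
    ) USING iceberg
    TBLPROPERTIES (
        'format-version'    = '2',
        'write.delete.mode' = 'merge-on-read',
        'write.update.mode' = 'merge-on-read',
        'write.merge.mode'  = 'merge-on-read'
    )
""")

# Or switch an existing table over with ALTER TABLE.
spark.sql("""
    ALTER TABLE local.db.events SET TBLPROPERTIES (
        'write.update.mode' = 'copy-on-write'
    )
""")
```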
Copy-On-Write (COW) – Best for tables with frequent reads, infrequent writes/updates, or large batch updates:
When your workload reads frequently but writes and updates less often, you can configure this mode on an Iceberg table. In COW, when we update or delete rows, the affected data files are rewritten into a new version, and the latest version holds the updated data. Because data is rewritten on every update or deletion, writes are slower and can become a bottleneck for large updates. As the name implies, it creates another copy of the data on write.
Reads, however, are fast: nothing needs to be merged or reconciled at read time, so queries simply scan the latest data files.
Merge-On-Read (MOR) – Best for tables with frequent writes/updates:
This is the opposite of COW: data files are not rewritten when rows are updated or deleted. Instead, Iceberg records the changes in separate delete/change files, which are merged with the original data files at read time to produce the updated state of the table.
Query engines and integrations supported: Iceberg works with a wide range of engines, including Spark, Flink, Hive, Trino, and Presto.
Conclusion
Through this research, we learned about Iceberg’s features and its compatibility with various metastores for integration. We covered the basics of configuring Iceberg both locally and on different cloud platforms, and built a working understanding of upserts, schema evolution, and partition evolution.