In the previous blog, we discussed Apache Iceberg’s basic concepts, the setup process, and how to load data. In this post, we will delve into some of Iceberg’s advanced features, including upsert functionality, schema evolution, time travel, and partitioning.
Upsert Functionality
One of Iceberg’s key features is its support for upserts. Upsert, which stands for update and insert, allows you to efficiently manage changes to your data. With Iceberg, you can perform these operations seamlessly, ensuring that your data remains accurate and up-to-date without the need for complex and time-consuming processes.
Schema Evolution
Schema evolution is another of Iceberg’s powerful features. Over time, the schema of your data may need to change due to new requirements or updates. Iceberg handles schema changes gracefully, allowing you to add, remove, or modify columns without having to rewrite your entire dataset. This flexibility ensures that your data architecture can evolve in tandem with your business needs.
Time Travel
Iceberg also provides time travel capabilities, enabling you to query historical data as it existed at any given point in time. This feature is particularly useful for debugging, auditing, and compliance purposes. By leveraging snapshots, you can easily access previous states of your data and perform analyses on how it has changed over time.
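As a minimal sketch, assuming a SparkSession `spark` already configured with an Iceberg catalog (see the setup section below) and a hypothetical table `local.db.sample`, time travel looks like this:

```python
# Inspect the table's snapshot history via its "snapshots" metadata table.
spark.sql(
    "SELECT snapshot_id, committed_at FROM local.db.sample.snapshots"
).show()

# Read the table as of a specific snapshot id (taken from the query above)...
df_snapshot = (
    spark.read.option("snapshot-id", 5754037935015704314)  # example id
    .format("iceberg")
    .load("local.db.sample")
)

# ...or as it existed at a point in time (milliseconds since the epoch).
df_as_of = (
    spark.read.option("as-of-timestamp", 1690000000000)
    .format("iceberg")
    .load("local.db.sample")
)
```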
Setting Up Iceberg on a Local Machine Using the Local Catalog Option or Hive
You can also configure Iceberg in your Spark session like this:
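Below is a minimal sketch using a file-based (Hadoop) catalog; the catalog name `local`, the warehouse path, and the runtime package version are assumptions to adapt for your environment:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-demo")
    # Pull in the Iceberg Spark runtime (version is an assumption; match your Spark).
    .config(
        "spark.jars.packages",
        "org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.4.2",
    )
    # Enable Iceberg's SQL extensions (needed for MERGE INTO, ALTER TABLE ..., etc.).
    .config(
        "spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    )
    # A file-based "hadoop" catalog named "local" backed by a local warehouse dir.
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg/warehouse")
    .getOrCreate()
)
```

For a Hive-backed catalog, you would instead set the catalog `type` to `hive` and point its `uri` property at your metastore (for example, `spark.sql.catalog.local.uri = thrift://localhost:9083`).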
We can either create the sample table using Spark SQL or write the data directly by specifying the database and table name, which will create the Iceberg table for us.
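For instance (the table name `local.db.sample` is a stand-in):

```python
# Option 1: create the table explicitly with Spark SQL.
spark.sql("""
    CREATE TABLE IF NOT EXISTS local.db.sample (
        id   BIGINT,
        name STRING
    ) USING iceberg
""")

# Option 2: write a DataFrame to a table name; Iceberg creates the table.
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
df.writeTo("local.db.sample").createOrReplace()
```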
You can see the data we have inserted. Apart from appending, you can use the overwrite method, just as you would with Delta Lake tables. Below is an example of how to read the data from an Iceberg table.
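Continuing the sketch with the same assumed table:

```python
# Append more rows to the table.
more = spark.createDataFrame([(3, "carol")], ["id", "name"])
more.writeTo("local.db.sample").append()

# Or overwrite instead of appending (dynamic overwrite; for an
# unpartitioned table this replaces all of the data).
more.writeTo("local.db.sample").overwritePartitions()

# Read the data back, via the DataFrame API or plain SQL.
spark.read.format("iceberg").load("local.db.sample").show()
spark.sql("SELECT * FROM local.db.sample").show()
```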
Handling Upserts
This Iceberg feature is similar to Delta Lake’s. You can update records in existing Iceberg tables without rewriting the complete dataset, which also makes it useful for handling CDC operations. We can take input from an incoming CSV file and merge it into the existing table without any duplication, so the table always holds a single record per primary key, and Iceberg’s ACID guarantees ensure the merge is applied atomically.
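A sketch of such an upsert, with an assumed CSV path and column names:

```python
# Stage the incoming CSV as a temporary view.
updates = (
    spark.read.option("header", "true")
    .schema("id BIGINT, name STRING")
    .csv("/tmp/incoming/updates.csv")  # hypothetical path
)
updates.createOrReplaceTempView("updates")

# Upsert: update rows with matching primary keys, insert the rest.
spark.sql("""
    MERGE INTO local.db.sample AS t
    USING updates AS s
    ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```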
Here, we can see the data once the merge operation has taken place.
Schema Evolution
Iceberg supports the following schema evolution changes:
Add – Add a new column to the Iceberg table
Drop – Remove an existing column from the table
Rename – Rename an existing column
Update – Widen the data type of an existing column (e.g., int to bigint)
Reorder – Change the order of columns in the Iceberg table
After updating the schema, there is no need to overwrite or rewrite the data. Suppose your table previously had four columns, all populated with data; if you add two more columns, you do not need to rewrite the existing data for the now six-column table, and you can still access it easily. This capability was lacking in Delta Lake but is present in Iceberg. Some key guarantees of Iceberg schema evolution:
If we add any columns, they won’t impact the existing columns.
If we delete or drop any columns, they won’t impact other columns.
Updating a column or field does not change values in any other column.
Iceberg uses unique IDs to track each column added to a table.
Let’s run some queries to update the schema and try deleting some columns.
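For example (these ALTER statements require the Iceberg SQL extensions enabled in the session configuration shown earlier):

```python
# Add: a new column, backed by a fresh unique column id.
spark.sql("ALTER TABLE local.db.sample ADD COLUMN email STRING")

# Rename: existing data is untouched because Iceberg tracks columns by id.
spark.sql("ALTER TABLE local.db.sample RENAME COLUMN name TO full_name")

# Reorder: move a column without rewriting any files.
spark.sql("ALTER TABLE local.db.sample ALTER COLUMN email FIRST")

# Drop: remove the column from the schema; data files are not rewritten.
spark.sql("ALTER TABLE local.db.sample DROP COLUMN email")
```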
After adding another column, we can still query the table without seeing any kind of error. This is how Iceberg prevents schema changes from breaking existing reads.
Partition Evolution and Sort Order Evolution
Iceberg introduced partition evolution, an option that was missing in Delta Lake. When you evolve a partition spec, the old data written with the earlier spec remains unchanged, and new data is written using the new spec in a new layout. Metadata for each partition version is kept separately. Because of this, query planning uses split planning: each partition layout plans its files separately, using the filter it derives for that specific partition layout.
Similar to partition spec, Iceberg sort order can also be updated in an existing table. When you evolve a sort order, the old data written with an earlier order remains unchanged.
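A short sketch of both, again assuming the extensions and the hypothetical table above:

```python
# Evolve the partition spec; files already written keep their old layout.
spark.sql("ALTER TABLE local.db.sample ADD PARTITION FIELD bucket(16, id)")

# A partition field can also be dropped again later.
spark.sql("ALTER TABLE local.db.sample DROP PARTITION FIELD bucket(16, id)")

# Evolve the sort order; only newly written data is ordered by it.
spark.sql("ALTER TABLE local.db.sample WRITE ORDERED BY id")
```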
Iceberg supports both Copy-On-Write (COW) and Merge-On-Read (MOR) when loading data into an Iceberg table. We can configure this either while creating the table or by altering it later, for example:
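A sketch using Iceberg’s write-mode table properties (the table name is hypothetical; note that MOR requires format version 2):

```python
# Choose the mode per operation type at table-creation time.
spark.sql("""
    CREATE TABLE IF NOT EXISTS local.db.events (
        id      BIGINT,
        payload STRING
    ) USING iceberg
    TBLPROPERTIES (
        'format-version'    = '2',
        'write.delete.mode' = 'merge-on-read',
        'write.update.mode' = 'merge-on-read',
        'write.merge.mode'  = 'merge-on-read'
    )
""")

# Or switch an existing table over with ALTER TABLE.
spark.sql("""
    ALTER TABLE local.db.events SET TBLPROPERTIES (
        'write.update.mode' = 'copy-on-write'
    )
""")
```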
Copy-On-Write (COW) – Best for tables with frequent reads, infrequent writes/updates, or large batch updates:
When your workload reads frequently but writes and updates less often, you can configure this mode on an Iceberg table. In COW, when we update or delete rows, the affected data files are rewritten into a new version, and the latest version holds the updated data. Because data is rewritten on every update or deletion, writes are slower and can become a bottleneck for large updates. As the name implies, it creates another copy of the data on write.
Reads, however, are fast: nothing needs to be merged or reconciled at read time, so queries simply scan the latest data files.
Merge-On-Read (MOR) – Best for tables with frequent writes/updates:
This is the opposite of COW: data files are not rewritten when rows are updated or deleted. Instead, Iceberg records the changes in separate delete/change files, which are merged with the original data files at read time to produce the updated state of the table.
Query engines and integrations supported: Iceberg works with a wide range of engines, including Spark, Flink, Hive, Trino, and Presto.
Conclusion
Through this research, we learned about Iceberg’s features and its compatibility with various metastores for integration. We covered the basics of configuring Iceberg both locally and on different cloud platforms, and built a working understanding of upserts, schema evolution, and partition evolution.