Apache Hive to BigQuery | Google Cloud Blog

Are you looking to migrate a large number of Hive ACID tables to BigQuery?

ACID-enabled Hive tables support transactions that accept update and delete DML operations. In this blog, we will explore migrating Hive ACID tables to BigQuery. The approach explored in this blog works for both compacted (major / minor) and non-compacted Hive tables. Let’s first understand the term ACID and how it works in Hive.

ACID stands for four traits of database transactions:

Atomicity (an operation either succeeds completely or fails; it doesn’t leave partial data)
Consistency (once an application performs an operation, the results of that operation are visible to it in every subsequent operation)
Isolation (an incomplete operation by one user doesn’t cause unexpected side effects for other users)
Durability (once an operation is complete, it will be preserved even in the face of machine or system failure)

Starting in version 0.14, Hive supports all ACID properties, which enables it to use transactions, create transactional tables, and run queries like INSERT, UPDATE, and DELETE on tables.

The files underlying a Hive ACID table are in the ORC ACID format. To support ACID features, Hive stores table data in a set of base files and all the insert, update, and delete operation data in delta files. At read time, the reader merges the base file and the delta files to present the latest data. As operations modify the table, a lot of delta files are created, and they need to be compacted to maintain adequate performance. There are two types of compaction, minor and major.
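The read-time merge described above can be illustrated with a minimal Python sketch. It is a simplified model, not Hive's actual reader: it assumes each stored event carries the ACID metadata fields (operation, original transaction, bucket, row ID, current transaction) plus the row payload, and that the event with the highest current transaction wins for a given row identity. The `Event` type and `merge_on_read` function are hypothetical names introduced only for this illustration.

```python
from typing import NamedTuple, Optional

# Operation codes as stored in Hive ACID ORC files.
INSERT, UPDATE, DELETE = 0, 1, 2


class Event(NamedTuple):
    """One stored row event (hypothetical, simplified model of an ACID ORC row)."""
    operation: int
    original_transaction: int
    bucket: int
    row_id: int
    current_transaction: int
    row: Optional[dict]  # None for delete events


def merge_on_read(base, deltas):
    """Merge base events with delta events and return the latest live rows."""
    latest = {}
    all_events = list(base) + [event for delta in deltas for event in delta]
    for event in all_events:
        # A row's identity is (original transaction, bucket, row ID).
        key = (event.original_transaction, event.bucket, event.row_id)
        seen = latest.get(key)
        if seen is None or event.current_transaction >= seen.current_transaction:
            latest[key] = event
    # Rows whose latest event is a delete are dropped from the result.
    return [latest[key].row for key in sorted(latest)
            if latest[key].operation != DELETE]


base = [
    Event(INSERT, 1, 0, 0, 1, {"id": 1, "name": "a"}),
    Event(INSERT, 1, 0, 1, 1, {"id": 2, "name": "b"}),
]
delta_update = [Event(UPDATE, 1, 0, 0, 2, {"id": 1, "name": "a2"})]
delta_delete = [Event(DELETE, 1, 0, 1, 3, None)]

current_rows = merge_on_read(base, [delta_update, delta_delete])
print(current_rows)
```

In this example the first row is replaced by its updated version and the second row disappears, so the reader sees a single row. The more delta files accumulate, the more events every read has to scan, which is why compaction matters.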

Minor compaction takes a set of existing delta files and rewrites them to a single delta file per bucket.
Major compaction takes one or more delta files and the base file for the bucket and rewrites them into a new base file per bucket. Major compaction is more expensive but is more effective.
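The difference between the two compaction types can be sketched in Python for a single bucket. This is a toy model under stated assumptions, not Hive's compactor: each file is a list of hypothetical `(row_key, transaction_id, row)` tuples, and a `row` of `None` marks a delete.

```python
def minor_compact(deltas):
    """Rewrite several delta files as one delta file, keeping every event."""
    merged = [event for delta in deltas for event in delta]
    return sorted(merged, key=lambda event: (event[0], event[1]))


def major_compact(base, deltas):
    """Fold the base file and all deltas into a new base of latest live rows."""
    latest = {}
    for key, txn, row in list(base) + minor_compact(deltas):
        if key not in latest or txn >= latest[key][0]:
            latest[key] = (txn, row)
    # Deleted rows (row is None) do not survive into the new base file.
    return [(key, txn, row)
            for key, (txn, row) in sorted(latest.items())
            if row is not None]


base = [(0, 1, "a"), (1, 1, "b")]
deltas = [[(0, 2, "a2")], [(1, 3, None)], [(2, 4, "c")]]

single_delta = minor_compact(deltas)   # still three events, now in one file
new_base = major_compact(base, deltas)  # only the current rows remain
print(single_delta)
print(new_base)
```

Minor compaction only reduces the number of files, so readers still merge the base with a delta. Major compaction rewrites the base itself, which costs more work but leaves nothing to merge afterwards.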

Organizations configure automatic compactions, but they also need to perform manual compactions when the automatic ones fail. If compaction is not performed for a long time after a failure, the result is a large number of small delta files. Running compaction on such a large number of small delta files can become a very resource-intensive operation and can itself run into failures.

Some of the issues with Hive ACID tables are:

NameNode capacity problems due to small delta files.
Table locks during compaction.
Running major compactions on Hive ACID tables is a resource-intensive operation.
Longer time taken for data replication to DR due to small files.

Benefits of migrating Hive ACID tables to BigQuery

Some of the benefits of migrating Hive ACID tables to BigQuery are:

Once data is loaded into managed BigQuery tables, BigQuery manages and optimizes the data stored in its internal storage and handles compaction. So there will not be any small-file issue like the one we have with Hive ACID tables.
The locking issue does not arise here, as the BigQuery Storage Read API is gRPC-based and highly parallelized.
As ORC files are completely self-describing, there is no dependency on Hive Metastore DDL. BigQuery has a built-in schema inference feature that can infer the schema from an ORC file and supports schema evolution without any need for tools like Apache Spark to perform schema inference.
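
To illustrate the last point, here is a minimal Python sketch that builds a `bq load` command for ORC files without passing any schema, relying on BigQuery to read the schema from the files themselves. It assumes the ORC files have already been copied to Cloud Storage; the dataset, table, and bucket names are hypothetical placeholders, and the helper `build_bq_load_command` is introduced only for this example. The sketch builds the command but does not run it.

```python
import shlex


def build_bq_load_command(table, gcs_uri):
    """Return the argument list for loading ORC files with `bq load`.

    No schema argument is included: ORC files are self-describing,
    so BigQuery takes the schema from the files.
    """
    return ["bq", "load", "--source_format=ORC", table, gcs_uri]


# Hypothetical names, for illustration only.
command = build_bq_load_command(
    "my_dataset.hive_table_staging",
    "gs://my-bucket/warehouse/my_table/*",
)
print(shlex.join(command))
```

The resulting list could be passed to `subprocess.run` on a machine where the Google Cloud CLI is installed and authenticated.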