User:Anudeepvrm/sandbox

Comparison with Traditional Databases

Hive's storage and querying operations closely resemble those of traditional databases. Although Hive uses an SQL dialect, its structure and operation differ from those of relational databases in many ways. The differences arise mainly because Hive is built on top of the Hadoop ecosystem and has to comply with the restrictions of Hadoop and MapReduce.

In traditional databases, a schema is applied to a table, and the table enforces that schema when data is loaded. This lets the database make sure that the data entered follows the representation of the table as specified by the user. This design is called schema on write^[1]. Hive, in contrast, does not verify data against the table schema when the data is loaded into a table. Instead, it checks the data at run time, when the data is read. This model is called schema on read^[1]. The two approaches have their own advantages and drawbacks. Checking data against the table schema at load time adds overhead, which is why traditional databases take longer to load data. In return, the load-time quality check ensures that the data is not corrupt, and detecting corrupt data early allows exceptions to be handled early. Because the data already conforms to the schema once the load completes, traditional databases have better query-time performance. Hive, on the other hand, can load data dynamically without any schema check, which makes the initial load fast but query-time performance comparatively slower. Hive does have an advantage when the schema is not available at load time and is instead generated later, dynamically^[1].
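The contrast between the two models can be illustrated with a small Python sketch. It is a toy model and not Hive or RDBMS code: the class names, the comma-separated row format, and the three-column schema are all hypothetical. The schema-on-read table returns None for a field that does not match the schema, which is meant to resemble a NULL surfacing at query time; treat that behaviour as an assumption of the sketch.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical schema: (column name, parser) pairs. This is not a Hive API.
Schema = Sequence[Tuple[str, Callable[[str], object]]]

SCHEMA: Schema = [("id", int), ("name", str), ("price", float)]


def parse_strict(line: str, schema: Schema) -> tuple:
    """Parse one comma-separated line, raising ValueError on any mismatch."""
    fields = line.split(",")
    if len(fields) != len(schema):
        raise ValueError(f"expected {len(schema)} fields, got {len(fields)}")
    return tuple(parser(value) for (_, parser), value in zip(schema, fields))


class SchemaOnWriteTable:
    """Validates every row at load time, so bad data is rejected early."""

    def __init__(self, schema: Schema) -> None:
        self.schema = schema
        self.rows: List[tuple] = []

    def load(self, lines: Sequence[str]) -> None:
        # Parse everything first: one bad row rejects the whole load.
        parsed = [parse_strict(line, self.schema) for line in lines]
        self.rows.extend(parsed)

    def scan(self) -> List[tuple]:
        # Rows are already typed, so a scan does no parsing work.
        return list(self.rows)


class SchemaOnReadTable:
    """Stores raw lines at load time and applies the schema only when read."""

    def __init__(self, schema: Schema) -> None:
        self.schema = schema
        self.raw_lines: List[str] = []

    def load(self, lines: Sequence[str]) -> None:
        # No validation: the load is just an append of raw text.
        self.raw_lines.extend(lines)

    def scan(self) -> List[tuple]:
        # Parsing happens on every scan; mismatched fields become None.
        result = []
        for line in self.raw_lines:
            fields = line.split(",")
            row = []
            for index, (_, parser) in enumerate(self.schema):
                try:
                    row.append(parser(fields[index]))
                except (IndexError, ValueError):
                    row.append(None)
            result.append(tuple(row))
        return result
```

Loading the lines "1,apple,0.5" and "two,pear,0.75" into SchemaOnWriteTable raises a ValueError during load, while SchemaOnReadTable accepts both lines and only reveals the bad id field when scan() is called.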

Transactions are key operations in traditional databases. A typical RDBMS supports all four properties of transactions (ACID): Atomicity, Consistency, Isolation, and Durability. Transactions were introduced in Hive 0.13 but were limited to the partition level^[2]. Only in Hive 0.14 were they extended to support complete ACID properties. The earlier limitation exists because Hadoop does not support row-level updates over specific partitions: the partitioned data is immutable, and a new table with the updated values has to be created. Hive 0.14 and later provide row-level transactions such as INSERT, DELETE and UPDATE^[3]. Enabling INSERT, UPDATE and DELETE transactions requires setting appropriate values for the configuration properties hive.support.concurrency, hive.enforce.bucketing, and hive.exec.dynamic.partition.mode^[4].
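As a rough illustration of that configuration step, the Python sketch below renders the three properties as Hive SET statements. The property names come from the text above; the values (true, true, nonstrict) are assumptions based on commonly documented settings, and the helper name render_set_statements is hypothetical. A working setup may need further settings, such as a transaction manager, that are not covered here.

```python
from typing import Dict, List

# Property names are taken from the text; the values are assumptions
# and should be checked against the Hive version in use.
ACID_PROPERTIES: Dict[str, str] = {
    "hive.support.concurrency": "true",
    "hive.enforce.bucketing": "true",
    "hive.exec.dynamic.partition.mode": "nonstrict",
}


def render_set_statements(properties: Dict[str, str]) -> List[str]:
    """Turn a property mapping into Hive 'SET key=value;' statements."""
    return [f"SET {key}={value};" for key, value in properties.items()]


if __name__ == "__main__":
    # Print the statements so they can be pasted into a Hive session.
    for statement in render_set_statements(ACID_PROPERTIES):
        print(statement)
```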

[1] White, Tom (2009-01-01). Hadoop: The Definitive Guide (1st ed.). O'Reilly Media, Inc. ISBN 0596521979.
[2] "Introduction to Hive transactions". datametica.com. Retrieved 2016-09-12.
[3] "Hive Transactions - Apache Hive - Apache Software Foundation". cwiki.apache.org. Retrieved 2016-09-12.
[4] "Configuration Properties - Apache Hive - Apache Software Foundation". cwiki.apache.org. Retrieved 2016-09-12.
