Hive

Bucketing in Hive

Bucketing is –

-> Another data organizing technique in Hive like Partitioning.

-> It is a technique for decomposing larger datasets into more manageable chunks.

-> All the same values of a bucketed column will go into same bucket.

-> We can use bucketing directly on a table but it gives the best performance result when do bucketing and partitioning side by side. We can assume it as, first we will create a partition and inside partition, the data will be stored in buckets.

Let’s understand it by example, Suppose we have a table of Employee containing Employee ID, Department and Salary. First of all to get a good performance and a better organization of data, what we can do is, we can create partition based on department. What if even after creating partition, latency is high as so many records are present. To achieve better organization of data we can do bucking in this table. inside a partition if we create bucket to break down the data, we will be achieving the best optimization. We can do bucketing on salary column inside each partition. And internally which salary will go to which bucket is decided by a hashing algorithm.

Please note that a bucket is physically a file, whereas partition is a Directory.

Let’s create one non-Partitioned table named ‘DEPT’ and load data into it from a file.

create table if not exists DEPT ( DEPT_ID int, DEPT_NAME string, EMP_NAME string, SALARY int, LOCATION string ) row format delimited fields terminated by ',' lines terminated by '\n' stored as textfile;

By default bucketing is not enabled. So enable it as below before creating a bucketed table –

Also since we will be using Dynamic partitioning insert, so set the mode to nonstrict.

Now let’s create a partitioned with bucketed table and insert data from ‘DEPT’ table into it.

create table if not exists dept_bucket (dept_id int,  emp_name string, sal int, location string) partitioned by (dept_name string) clustered by (location) into 4 buckets row format delimited fields terminated by ',' lines terminated by '\n' stored as textfile;

Please note – CLUSTERED BY clause is used to divide the table into buckets. we have created 4 buckets in each partition.

Now let’s insert the data into DEPT_BUCKET table. We are using dynamic partitioning, so the query is –

insert into table DEPT_BUCKET partition (dept_name) Select  dept_id, emp_name, salary, location, dept_name from DEPT;

Now the data is inserted and below Partitions are created on table.

In below screenshot we can observe, 5 directories one for each partitions are created and inside each directory (or partition), 4 buckets (or 4 files) are created.

If file 000002_0 contains Bangalore location data in Partition ‘dept_name=Accounts’, so in the same partition, no other bucket would contain Bangalore data.

Few Key Points
  • Bucket is physically a file.
  • We can explicitly set the number of buckets during table creation.
  • Bucketing can sometimes be more efficient when used alone. Suppose we have a datasets in which we have very unique ranges of data, like if we have 500 records in a Dept file, with 100 different department. So in this case if we do partitioning 100 directories will be created, and use of Partition will become ineffective. So instead of Partitioning we can only perform bucketing in this case. As bucket is only a file and moreover we can control number of buckets created.
  • Two Bucketed tables can join faster than any other table in Hive. Because in this case where join is being done on bucketed column, the mapper processing the left table knows exactly where the matching rows of right table are present. So mapper does not have to go anywhere in the table, and this saves a lot of time. But this works only if – Both table are bucketed on same column and both tables have same number of buckets.
  • However, bucketing doesn’t ensure that the table is properly populated. So, we need to handle Data Loading into buckets by our-self.
  • Table Sampling is possible on bucketed table. We will see what is table sampling in the next blog.

That’s all for now. Sharing is Caring.

Thank You! 🙂

2 thoughts on “Bucketing in Hive”

Leave a comment