Length of Arrays(Set), Map and String in Hive

Recently I was asked to calculate quantity of items inside each cell of a table, in face it is calculating the length of each cell. Some columns stores value in string, some are in array, some are in map, each type returns different value when they are empty.

Data Preparation

Assume we have a table that has four kinds of data:

  • each cell of the 1st row would contains data that has multiple values;
  • each cell of the 2nd row would contains only one value;
  • each cell of the 3rd row would be empty (not null);
  • each cell of the 4th row would be null.
select *
from tmp.ym_test


["item1","item2","item3"]{"key1":"value1","key2":"value2"}a sample string

Calculating Length

select size(array), size(key_value), length(string)
from tmp.ym_test




Based on the result above, the conclusion is clear that:

  1. 1st row & 2nd row:
    • length of an array or a map is the quantity of items of that value;
    • length of a string is the quantity of characters of that string, including spaces
  1. 3rd row:

    • all returns 0
  2. 4th row:

    • length of an array or a map value is -1;
    • still null for a string value

Not Over Yet

According to the last two conclusions, because of null values of array and map would be calculated as -1 by its length, if i wanted to calculate the sum of length of that row, the result would be short, because null is 0 to me but -1 to the computer; Besides that, the length of a string would be meaningless to me if it returns how many characters are there in the string, i only need to know that string is not empty, or the length of it is not 0, so just return 1 would be good for me. So modified version of the query would be:

if(size(array) == -1, 0, size(array)), 
if(size(key_value) == -1, 0, size(key_value)), 
if(length(string) is null, 0, 1)
from tmp.ym_test



This looks much better.

No Comments Yet