
Posts

Hive Complex Data Types

Array

$ vi arrayfile
1,abc,40000,a$b$c,hyd
2,def,3000,d$f,bang
3,abc,40000,a$b$c,hyd
4,def,3000,d$f,bang
5,abc,40000,a$b$c,hyd
6,def,3000,d$f,bang
7,abc,40000,a$b$c,hyd
8,def,3000,d$f,bang
9,abc,40000,a$b$c,hyd
10,def,3000,d$f$d$e$d$e$e$r$g,bang

hive> create table array_tab (id int, name string, salary bigint, sub array<string>, city string)
    > row format delimited
    > fields terminated by ','
    > collection items terminated by '$';
hive> load data local inpath '/root/arrayfile' into table array_tab;
hive> select * from array_tab;
OK
1       abc     40000   ["a","b","c"]   hyd
2       def     3000    ["d","f"]       bang
3       abc...
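To make the two delimiters in the table definition concrete, here is a minimal Java sketch, independent of Hive, that splits one line of arrayfile the same way: first on ',' for the fields, then on '$' for the items of the sub column. The class ArrayRowSketch and the method parseSub are hypothetical names used only for illustration; this is not how Hive's own row parsing is implemented.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class ArrayRowSketch {

    // Hypothetical helper: returns the items of the 4th field ("sub").
    static List<String> parseSub(String line) {
        // Mirrors "fields terminated by ','"
        String[] fields = line.split(",");
        // Mirrors "collection items terminated by '$'".
        // '$' is a regex metacharacter, so it must be quoted.
        return Arrays.asList(fields[3].split(Pattern.quote("$")));
    }

    public static void main(String[] args) {
        // First row of arrayfile from the post
        System.out.println(parseSub("1,abc,40000,a$b$c,hyd")); // prints [a, b, c]
    }
}
```

The quoting of '$' matters only in this sketch, because String.split takes a regular expression; in the Hive DDL the delimiter is given as a plain character.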

Hive UDF Examples

There are two ways to write UDFs in Hive, by extending one of these classes:

org.apache.hadoop.hive.ql.exec.UDF
org.apache.hadoop.hive.ql.udf.generic.GenericUDF

The first example below is a simple one that can be used with Hadoop primitive types. The second example is a bit more complex, as it can be used with complex types such as arrays and maps.

package hive;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class SimpleUDFExample extends UDF {
    public Text evaluate(Text input) {
        // A null input yields a null result
        if (input == null)
            return null;
        return new Text("Hello " + input.toString());
    }
}

package h...

MapJoinMemoryExhaustionException on MR job

The exception trace is as follows:

org.apache.hadoop.hive.ql.exec.mapjoin.MapJoinMemoryExhaustionException: 2013-11-20 07:30:48 Processing rows: 1700000 Hashtable size: 1699999 Memory usage: 965243784 percentage: 0.906
    at org.apache.hadoop.hive.ql.exec.mapjoin.MapJoinMemoryExhaustionHandler.checkMemoryStatus(MapJoinMemoryExhaustionHandler.java:91)
    ….

Solution: run

set hive.auto.convert.join=false;

before your query to disable local in-memory joins and force the join to be done as a distributed MapReduce phase.
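The "percentage" figure in the trace reads like a ratio of used heap to maximum heap that crossed a limit. A minimal sketch of such a check follows, under stated assumptions: the class MemoryCheckSketch, the method exceedsThreshold, the 0.90 threshold and the maximum heap size are all hypothetical values chosen for illustration and do not come from the post or from the Hive source.

```java
public class MemoryCheckSketch {

    // Hypothetical check: true when the used share of the heap is above the threshold.
    static boolean exceedsThreshold(long usedBytes, long maxBytes, double threshold) {
        double percentage = (double) usedBytes / maxBytes;
        return percentage > threshold;
    }

    public static void main(String[] args) {
        long usedBytes = 965243784L;   // "Memory usage" value from the trace
        long maxBytes = 1065390490L;   // assumed heap size, picked so the ratio is about 0.906
        double threshold = 0.90;       // assumed limit, not stated in the post

        System.out.println(exceedsThreshold(usedBytes, maxBytes, threshold));
    }
}
```

Under these assumptions the hash table built for the local in-memory join has outgrown the allowed share of the heap, which is why turning off the automatic conversion to a map join avoids the exception.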

CAP Theorem: Relational Nosql

There are three primary concerns you must balance when choosing a data management system:

Consistency means that each client always has the same view of the data (all nodes see the same data at the same time).
Availability means that all clients can always read and write (a guarantee that every request receives a response about whether it was successful or failed).
Partition tolerance means that the system works well across physical network partitions (the system continues to operate despite arbitrary message loss or failure of part of the system).

According to the CAP theorem, you can pick only two: CA, CP, or AP.

Relational systems are CA systems and typically deal with the problem of partitions through replication. NoSQL systems support horizontal scalability. To scale horizontally, you need strong network partition tolerance, which requires giving up either consistency or availability. NoSQL systems typically accomplish this by relaxing relational abilities ...

Hive: java.io.IOException: cannot find dir in pathToPartitionInfo

While running a Hive job, the exception below may arise.

Job Submission failed with exception 'java.io.IOException(cannot find dir = /user/hive/warehouse/test/city=paris/out.csv in pathToPartitionInfo: [hdfs://cdh-four:8020/user/hive/warehouse/test/city=paris])'
12/09/19 17:18:44 ERROR exec.Task: Job Submission failed with exception 'java.io.IOException(cannot find dir = /user/hive/warehouse/test/city=paris/out.csv in pathToPartitionInfo: [hdfs://cdh-four:8020/user/hive/warehouse/test/city=paris])'
java.io.IOException: cannot find dir = /user/hive/warehouse/test/city=paris/out.csv in pathToPartitionInfo: [hdfs://cdh-four:8020/user/hive/warehouse/test/city=paris]
    at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getPartitionDescFromPathRecursively(HiveFileFormatUtils.java:290)
    at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getPartitionDescFromPathRecursively(HiveFileFormatUtils.java:257)
    at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat$CombineHiveInputSp...