

Flume: ERROR: HDFSEventSink.process failed

I'm using Hadoop 2.2 as the sink destination in Flume 1.4. If you try to use the HDFS sink, you might get the exception below:

[ERROR] HDFSEventSink.process failed
Exception in thread "SinkRunner-PollingRunner-DefaultSinkProcessor" java.lang.VerifyError: class org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$RecoverLeaseRequestProto overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
        at java.lang.ClassLoader.defineClass1(Native Method)
        at java.lang.ClassLoader.defineClass(ClassLoader.java:791)
        at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)

The exception above is due to an incompatibility between the protobuf library bundled with Flume 1.4 and the newer protobuf that Hadoop 2.2 is built against. The solution I followed is to rename two jar files in /opt/ds/app/flume-1.4.0/lib:

> protobuf-java-2.4.1.jar-1
> guava-10.0.1.jar-1

Renaming these two jars prevents Flume from loading them, so the newer versions already on the Hadoop classpath are picked up instead.
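A minimal shell sketch of that rename, assuming the Flume lib directory above (the exact jar versions may differ in your distribution):

cd /opt/ds/app/flume-1.4.0/lib
# Append a suffix so Flume no longer puts these jars on its classpath;
# the protobuf/guava classes from the Hadoop installation are used instead.
mv protobuf-java-2.4.1.jar protobuf-java-2.4.1.jar-1
mv guava-10.0.1.jar guava-10.0.1.jar-1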

Flume installation step-by-step

1) Download "apache-flume-1.4.0-bin.tar.gz" 2) Gunzip and Untar the file at /opt/ds/app/flume-1.4.0 3) Change directory to /opt/ds/app/flume-1.4.0/conf 4) Optionally you can edit log directory flume.log.dir=/var/log/flume 5) Edit .bashrc export FLUME_HOME=/opt/ds/app/flume-1.4.0 export FLUME_CONF_DIR=/opt/ds/app/flume-1.4.0/conf export FLUME_CLASSPATH=$FLUME_CONF_DIR export PATH=$PATH:$FLUME_HOME/bin 6) execute from shell > source .bashrc 7) Copy jar to /opt/ds/app/flume-1.4.0/lib hadoop-auth-2.2.0.jar hadoop-common-2.2.0.jar 8) From shell > flume-ng --help

Pig step-by-step installation with integrated HCatalog

1) Download the tar file "pig-0.13.0.tar.gz".
2) Gunzip and untar the file at /opt/ds/app/pig-0.13.0.
3) Change directory to /opt/ds/app/pig-0.13.0/conf.
4) Create log4j.properties from the template file.
5) Update pig.properties for HCatalog. For example: hcat.bin=/opt/ds/app/hive-0.13.0/hcatalog/bin/hcat
6) Edit .bashrc:
export PIG_HOME=/opt/ds/app/pig-0.13.0
export PATH=$PATH:$PIG_HOME/bin
export HCAT_HOME=/opt/ds/app/hive-0.13.0/hcatalog
export PATH=$PATH:$HCAT_HOME/bin
7) It is assumed that you have already set HADOOP_HOME, JAVA_HOME, HADOOP_COMMON_LIB_NATIVE_DIR, HADOOP_OPTS and YARN_OPTS.
8) Optionally, create .pigbootup in the user home directory.
9) Execute from the user home directory: > source .bashrc
10) Execute: > pig -useHCatalog
11) Say you had created a table in Hive named "hivetesting". Try to load it with the command below to verify the installation (a couple of follow-up checks are sketched after this list):
grunt> A = LOAD 'hivetesting' USING org.apache.hcatalog.pig.HCatLoader();
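To confirm that the table is really being read through the Hive metastore, two follow-up grunt commands (a sketch; 'hivetesting' is just the example table from step 11):

grunt> DESCRIBE A;  -- the schema should be the one defined in Hive
grunt> DUMP A;      -- launches a job and prints the table rows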

Hive installation step-by-step with MySQL Metastore

1) Download Hive "apache-hive-0.13.0-bin.tar.gz".
2) Gunzip and untar at path /opt/ds/app/hive-0.13.0.
3) Edit ~/.bashrc and add the lines below:
# HIVE
export HIVE_HOME=/opt/ds/app/hive-0.13.0
export PATH=$PATH:$HIVE_HOME/bin
4) Change directory to /opt/ds/app/hive-0.13.0/conf.
5) Create hive-log4j.properties from the template.
6) Create hive-env.sh from the template. Also set:
if [ "$SERVICE" = "cli" ]; then
  if [ -z "$DEBUG" ]; then
    export HADOOP_OPTS="$HADOOP_OPTS -XX:NewRatio=12 -Xms10m -XX:MaxHeapFreeRatio=40 -XX:MinHeapFreeRatio=15 -XX:+UseParNewGC -XX:-UseGCOverheadLimit"
  else
    export HADOOP_OPTS="$HADOOP_OPTS -XX:NewRatio=12 -Xms10m -XX:MaxHeapFreeRatio=40 -XX:MinHeapFreeRatio=15 -XX:-UseGCOverheadLimit"
  fi
fi
# The heap size of the JVM started by the hive shell script can be controlled via:
# export HADOOP_HEAPSIZE="1024"
export HADOOP_CLIENT_OPTS="-Xmx${HADOO
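The excerpt above cuts off before the MySQL metastore configuration the title refers to. For orientation only, a minimal hive-site.xml sketch for a MySQL-backed metastore (host, database, user and password are placeholders, and the MySQL JDBC driver jar also has to be placed in $HIVE_HOME/lib):

<configuration>
  <!-- JDBC connection to the metastore database (placeholder values) -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hiveuser</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hivepassword</value>
  </property>
</configuration>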

Oozie Coordinator Job Scheduling frequency

Sometimes you may see a situation where the coordinator job specifies its frequency in minutes, say 60, yet the workflow jobs run far more often. On the Oozie web console you can see the 'Created Time' incrementing every few minutes while the 'Nominal Time' increments by an hour, which is the interval you actually want. The issue here is that the start date in the coordinator XML lies in the past. In this scenario Oozie submits workflows for all the intervals that were missed, starting from the start time, until it gets in sync with the current time. The nominal time is the actual time interval (hour) that the workflow is supposed to process. In such a situation you may want to set the concurrency, which decides how many actions run in parallel; the execution strategy, which can be FIFO, LIFO or LAST_ONLY; or the throttle, which decides how many jobs can sit in waiting status while one is already running. Example (a fuller coordinator sketch follows below):
<controls>
  <concurrency>1</concurrency>
  <execution>FIFO</execution>
</controls>
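For context, a stripped-down coordinator XML sketch (names, paths and dates are placeholders) showing where the frequency, start time and controls live; with a start date in the past, Oozie creates one action per elapsed interval until it catches up with the current time:

<coordinator-app name="hourly-coord" frequency="${coord:hours(1)}"
                 start="2014-01-01T00:00Z" end="2015-01-01T00:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
  <controls>
    <!-- run at most one catch-up action at a time -->
    <concurrency>1</concurrency>
    <!-- process missed intervals oldest first; LAST_ONLY would skip all but the latest -->
    <execution>FIFO</execution>
    <!-- cap how many actions may sit in WAITING -->
    <throttle>5</throttle>
  </controls>
  <action>
    <workflow>
      <app-path>${nameNode}/user/${userName}/apps/my-workflow</app-path>
    </workflow>
  </action>
</coordinator-app>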