Suppose we have an input file as follows:

$ vi source
abcd deff,12, xyzd,US
din,123,abcd,Pak

And a keyword file like this:

$ vi keyword
abc,xyz
xyz

And say we want to produce output with four columns, where:
- the first column is the original value;
- the second column lists the 1-based indexes of the fields in which keywords were removed;
- the third column is the string after the keywords are removed;
- the fourth column is the number of times keywords were removed from the original value.
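To make the four columns concrete, the row-level transformation can be sketched in plain Java. This is a standalone illustration, independent of Hive; the class and method names here are ours, not part of the final UDF:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class KeywordRemovalDemo {
    // Removes each keyword from every comma-separated field of the row,
    // recording the 1-based indexes of the affected fields and the total
    // number of removals, then joins the four columns with '$'.
    static String transform(String row, String[] keywords) {
        String[] fields = row.split(",");
        List<Integer> affected = new ArrayList<Integer>();
        long count = 0;
        for (String keyword : keywords) {
            for (int i = 0; i < fields.length; i++) {
                if (fields[i].contains(keyword)) {
                    fields[i] = fields[i].replace(keyword, "");
                    count++;
                    if (!affected.contains(i + 1)) {
                        affected.add(i + 1);
                    }
                }
            }
        }
        return row + "$" + affected + "$" + Arrays.toString(fields) + "$" + count;
    }

    public static void main(String[] args) {
        // Second sample row with keywords abc and xyz:
        System.out.println(transform("din,123,abcd,Pak", new String[]{"abc", "xyz"}));
        // prints: din,123,abcd,Pak$[3]$[din, 123, d, Pak]$1
    }
}
```

For the second sample row this yields field 3 affected once, matching the verification output shown at the end of this post.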
We will proceed in three steps: first, create the desired tables in Hive; next, write the UDF itself; and finally, execute queries to verify the output.
hive> create table source ( initial_data string );
hive> load data local inpath '/root/source' into table source;

Put the keyword file into HDFS:

$ hadoop fs -put /root/keyword hdfs://sandbox.hortonworks.com:8020/user/root/keyword

We will write a Hive UDF, "ReplaceKeyword", that produces the desired output described above with "$" as the separator. So, let us create a table in Hive delimited by "$":

hive> create table output ( initial_data string, fields_affected string, cleaned_data string, count_removed_keywords string ) row format delimited fields terminated by '$';

Once the UDF is written, we execute the SQL below to generate the desired output and write it to the HDFS location of the Hive table "output":

hive> add jar /root/hadoop-examples.jar;
hive> create temporary function rep_key as 'hive.ReplaceKeyword';
hive> insert overwrite directory '/apps/hive/warehouse/output'
    > select rep_key(initial_data, "hdfs://sandbox.hortonworks.com:8020/user/root/keyword") from source;

Now comes the important part: writing the UDF. The code below details the approach:
package hive;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;

public class ReplaceKeyword extends GenericUDF {

    private final StringObjectInspector[] elementOI = new StringObjectInspector[2];
    private static final char MY_TOKEN = '$';

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments)
            throws UDFArgumentException {
        // The function takes exactly 2 arguments: the input value and
        // the HDFS path of the keyword file.
        if (arguments.length != 2) {
            throw new UDFArgumentException("ReplaceKeyword expects 2 arguments");
        }
        // Cast the input arguments to StringObjectInspector and pre-validate.
        for (int i = 0; i < arguments.length; i++) {
            if (!(arguments[i] instanceof StringObjectInspector)) {
                throw new UDFArgumentException("Argument " + i + " must be a string");
            }
            elementOI[i] = (StringObjectInspector) arguments[i];
        }
        // The return type of our function is a string, so we provide the
        // corresponding object inspector.
        return PrimitiveObjectInspectorFactory.javaStringObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        // Get the arguments; return null if either argument is null.
        String value = elementOI[0].getPrimitiveJavaObject(arguments[0].get());
        String filePath = elementOI[1].getPrimitiveJavaObject(arguments[1].get());
        if (value == null || filePath == null) {
            return null;
        }

        StringBuffer buffer = new StringBuffer();        // to assemble the output
        List<Integer> index = new ArrayList<Integer>();  // sorted 1-based indexes of replaced fields
        long count = 0;                                  // number of replacements done
        // Tokenize the input value on commas.
        String[] valueToks = value.split(",");
        BufferedReader br = null;
        try {
            // Read the keyword file from the file system (HDFS).
            Path path = new Path(filePath);
            FileSystem fs = path.getFileSystem(new Configuration());
            br = new BufferedReader(new InputStreamReader(fs.open(path)));
            String line;
            // Read the keyword file line by line.
            while ((line = br.readLine()) != null) {
                // Tokenize keywords on the basis of commas.
                for (String keyword : line.split(",")) {
                    // Remove the keyword from every field that contains it,
                    // recording the field's 1-based index and the count.
                    for (int i = 0; i < valueToks.length; i++) {
                        if (valueToks[i].contains(keyword)) {
                            valueToks[i] = valueToks[i].replace(keyword, "");
                            count++;
                            if (!index.contains(i + 1)) {
                                index.add(i + 1);
                            }
                        }
                    }
                }
            }
            Collections.sort(index);
            return buffer.append(value).append(MY_TOKEN)
                    .append(index).append(MY_TOKEN)
                    .append(Arrays.toString(valueToks)).append(MY_TOKEN)
                    .append(count).toString();
        } catch (Exception e) {
            throw new HiveException(e);
        } finally {
            if (br != null) {
                try { br.close(); } catch (Exception ignored) { }
            }
        }
    }

    @Override
    public String getDisplayString(String[] arg0) {
        return "ReplaceKeyword " + Arrays.toString(arg0);
    }
}
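A note on the choice of "$" as the separator: the input fields themselves contain commas, so the UDF's single output string must use a delimiter that does not occur in the data; the output table's `fields terminated by '$'` clause then splits that string back into the four columns. A minimal sketch of this round trip (plain Java; the sample line is ours, modeled on the output shown below):

```java
public class SeparatorDemo {
    public static void main(String[] args) {
        // One output line as the UDF would produce it for the first sample row.
        String line = "abcd deff,12, xyzd,US$[1, 3]$[d deff, 12, d, US]$2";
        // Splitting on the '$' token recovers exactly four columns,
        // even though the first column itself contains commas.
        String[] columns = line.split("\\$");
        for (String c : columns) {
            System.out.println(c);
        }
    }
}
```

Had the separator been a comma, the first column would have been split apart; any character guaranteed absent from the data would work equally well here.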
Finally, execute the queries below to verify:

hive> dfs -ls /apps/hive/warehouse/output;
Found 1 items
-rw-r--r--   3 root hdfs         93 2015-11-24 18:04 /apps/hive/warehouse/output/000000_0

hive> select * from output;
OK
abcd deff,12, xyzd,US    [1, 3]    [d deff, 12, d, US]    2
din,123,abcd,Pak         [3]       [din, 123, d, Pak]     1
Time taken: 0.273 seconds, Fetched: 2 row(s)