- We had an on-premises Hadoop cluster (CDP) which included Kafka, HBase, HDFS, Spark, YARN, etc.
- We planned to migrate our big data jobs and data to AWS EMR, while keeping Kafka on the on-premises CDP cluster.
- After spawning EMR on AWS, we tried running a Spark job connecting to Kafka on the on-premises cluster.
- We set up all the VPC connections and opened the firewall ports between the two clusters.
- But since EMR and CDP (on-premises) had different KDC servers and principals, our attempts to connect to Kafka (on-premises) from EMR kept failing.
- Note: one can set the following JVM property to see Kerberos debug logs (for example, by adding it to the spark.executor.extraJavaOptions and driver Java options used in the spark-shell command later in this post) -
- -Dsun.security.krb5.debug=true
The two easiest options for us were -
- Set up a cross-realm Kerberos trust, so that an EMR principal can authenticate against the on-premises KDC server and use the Kafka service. Refer - https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/system-level_authentication_guide/using_trusts
- Set up a cross-realm trust using the same AD accounts and domain. Refer - https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-kerberos-cross-realm.html
So, we copied the on-premises keytab onto EMR and tried to use it to authenticate with the Kafka service. We don't recommend doing this, but we had no other option. The steps are described below.
- Update krb5.conf so that it is aware of both clusters' domains, KDC servers, etc.
[libdefaults]
default_realm = EMR.LOCAL
dns_lookup_realm = false
udp_preference_limit = 1
dns_lookup_kdc = false
rdns = true
ticket_lifetime = 24h
forwardable = true
default_tkt_enctypes = aes256-cts-hmac-sha1-96 aes128-cts-hmac-sha1-96 des3-cbc-sha1
default_tgs_enctypes = aes256-cts-hmac-sha1-96 aes128-cts-hmac-sha1-96 des3-cbc-sha1
permitted_enctypes = aes256-cts-hmac-sha1-96 aes128-cts-hmac-sha1-96 des3-cbc-sha1
[realms]
EMR.LOCAL = {
kdc = ip-90-110-43-74.ec2.internal:88
admin_server = ip-90-110-43-74.ec2.internal:749
default_domain = ec2.internal
}
CDP.INPREMISE.COM = {
kdc = cdp42.cdp.inpremise.com:88
master_kdc = cdp42.cdp.inpremise.com:88
kpasswd = cdp42.cdp.inpremise.com:464
kpasswd_server = cdp42.cdp.inpremise.com:464
}
[domain_realm]
.ec2.internal = EMR.LOCAL
ec2.internal = EMR.LOCAL
cdp42.cdp.inpremise.com = CDP.INPREMISE.COM
.cdp.inpremise.com = CDP.INPREMISE.COM
[logging]
kdc = FILE:/var/log/kerberos/krb5kdc.log
admin_server = FILE:/var/log/kerberos/kadmin.log
default = FILE:/var/log/kerberos/krb5lib.log
- Then we wrote the jaas.conf as below -
KafkaClient {
com.sun.security.auth.module.Krb5LoginModule required
doNotPrompt=true
useTicketCache=false
principal="inpremiseaccount@CDP.INPREMISE.COM"
useKeyTab=true
serviceName="kafka"
keyTab="inpremiseaccount.keytab"
renewTicket=true
storeKey=true
client=true;
};
Client {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
doNotPrompt=true
useTicketCache=false
serviceName="hbase"
keyTab="awsemraccount.keytab"
principal="awsemraccount@EMR.LOCAL"
storeKey=true
client=true;
};
Note that -
- KafkaClient holds the configuration to connect to the on-premises Kafka service running on CDP.
- Client, on the other hand, holds the configuration to connect to the EMR HBase service.
- Our target is to run a Spark job on EMR which reads data from a different Kafka cluster and saves the data to HBase on EMR (a minimal sketch of this Kafka read is shown after the spark-shell command below).
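As a side note, one can sanity-check the KafkaClient entry with a minimal JAAS login sketch like the one below. This assumes the JVM was started with the -Djava.security.auth.login.config=jaas.conf and -Djava.security.krb5.conf=krb5.conf options shown in the spark-shell command further down, and that inpremiseaccount.keytab is present in the working directory -
// Perform a programmatic JAAS login against the KafkaClient section of jaas.conf
import javax.security.auth.login.LoginContext

val lc = new LoginContext("KafkaClient") // looks up the KafkaClient entry
lc.login()                               // obtains a TGT from CDP.INPREMISE.COM using the keytab
println(lc.getSubject.getPrincipals)     // should list inpremiseaccount@CDP.INPREMISE.COM
lc.logout()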
Once the above was done, our Spark command looked like below -
INPREMISEACCOUNT_KEYTAB_PATH=<My_path>/inpremiseaccount.keytab
JAAS_PATH=<My_path>/jaas.conf
TRUSTSTORE_PATH=<My_path>/truststore.jks
KRB5_PATH=<My_path>/krb5.conf
EMRACCOUNT_KEYTAB_PATH=<My_path>/awsemraccount.keytab
spark-shell --master yarn \
--num-executors 2 \
--conf "spark.dynamicAllocation.enabled=false" \
--conf "spark.shuffle.service.enabled=false" \
--jars $mylib \
--conf spark.executor.extraJavaOptions=" -Djava.security.auth.login.config=jaas.conf -Djava.security.krb5.conf=krb5.conf" \
--driver-java-options " -Djava.security.auth.login.config=jaas.conf -Djava.security.krb5.conf=krb5.conf" \
--files "$INPREMISEACCOUNT_KEYTAB_PATH,$JAAS_PATH,$TRUSTSTORE_PATH,$KRB5_PATH" \
--conf "spark.yarn.keytab=$EMRACCOUNT_KEYTAB_PATH" \
--conf "spark.yarn.principal=awsemraccount@EMR.LOCAL"