Skip to main content

Posts

Business Value from Machine Learning Methods

Linear Regression - To make predictions for sales forecast, price optimization, marketing optimization, financial risk assessment. Logistic Regression - To predict customer churn, to predict response versus advertisement spending, predict lifetime value of customer, and to monitor how business decisions affect predicted churn rates. Naive Bayes - Build spam detector, analyze customer sentiments, or automatically categorize products, customers or competitors. K-means clustering - Useful for cost modeling and customer segmentation Hierarchical clustering - Model business processes, or to segment customers based on survey responses, hierarchical clustering will probably come in handy. K-nearest neighbor classification - Type of instance based learning. use it for text document classification, financial distress prediction modeling, and competitor analysis and classification. Principal component analysis - Dimensionality reduction method that you can use for detecting fraud, for s...

Types of Data Analytics

In the order of their complexity, they can be classified in 4 types - Descriptive Analytics - Based on Historical & Current data answer question like - "What happened?" Diagnostic Analytics - For deducing & inferring success or failure, like - "Why that happened?" or "Why it went wrong?" or "Why did we receive this growth or success?" Predictive Analytics - Based on what happened or what is happening deriving the answer to question like "What will happen?". This involves complex model-building and analysis in order to predict a future event or trend. Prescriptive Analytics - Optimize processes, structures, and systems through informed action that's based on predictive analytics - what you should do based on what will happen.

Hive - Merge large number of small files

There can be multiple ways to merge files. One such way suggested by Hive is to use - alter table <Table Name>  PARTITION <Partition Name>  CONCATENATE; But, above solution does not work directly. Because , it triggers a MR Job with only map task and no reduce task. So, in output number of files will be equal to number of mapper running. So, one can reduce number of mappers running that will eventually reduce number of files in output. Set below properties and that will cause " CONCATENATE " job to output less mappers JOBS - hive> set hive.merge.mapfiles=true; hive> set hive.merge.mapredfiles=true; hive> set hive.merge.size.per.task=1073741824; hive> set hive.merge.smallfiles.avgsize=1073741824; hive >set mapreduce.input.fileinputformat.split.maxsize=1073741824; hive> set mapred.job.reuse.jvm.num.tasks=5; Also, note that Concatenate can cause data to loss in cause of improper ORC file statistics. Refer - https://issues.a...

Spark & Hive over Spark - Performance Problems Hortonworks

I had been using Spark & Hive to Insert data in to Table. I have following table in Hive - CREATE TABLE `ds_test`(   `name` string) PARTITIONED BY (   `company` string,   `market` string,   `eventdate` string,   `processdate` string) ROW FORMAT SERDE   'org.apache.hadoop.hive.ql.io.orc.OrcSerde' STORED AS INPUTFORMAT   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' OUTPUTFORMAT   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' LOCATION   'hdfs://hdpprod/apps/hive/warehouse/ds_test' TBLPROPERTIES (   'transient_lastDdlTime'='1524769102') I was inserting data into table using Hive over SQL like below -  sqlContext.sql("INSERT OVERWRITE TABLE ds_test PARTITION(COMPANY = 'MCOM', MARKET, EVENTDATE, PROCESSDATE) Select name, MARKET, EVENTDATE, PROCESSDATE from Table1") Above method was working fine. But, we were facing performance problems -  1) We saw that application was running t...

Hive Analytics Functions - rank() vs dense_rank() vs percent_rank() vs row_umber() vs cume_dist()

RANK - Rank of each row within partition of result set. DENSE_RANK - Mostly, similar to RANK. But, there will be no gaps in ranking. PERCENT_RANK - Relative Rank of row within group of rows. ROW_NUMBER - Sequential number of row within partition of a result set. CUME_DIST - For row r, the number of rows with value lower than or equal to value of r , divided by number of rows evaluated in partition. Practice - hive> create table test (v string ) row format delimited fields terminated by ','; hive> alter table test add columns (t string); hive> load data local inpath '/root/test' overwrite into table test; test data in local looks like below - a,1 a,2 a,3 a,1 a,2 b,1 c,1 c,2 d,1 e,1 Execute below query and analyze the result - hive> select v, t, rank() over (partition by v ), dense_rank() over (partition by v  ), row_number() over (partition by v  ), percent_rank()over (partition by v  ), cume_dist() over (partition by v ...