By Sandeep Raut

A to Z of Analytics - Listing of alphabets and usage


Analytics has taken the world by storm, and it is the powerhouse behind the digital transformation happening in every industry.

Today everybody is generating tons of data: as consumers we leave digital footprints on social media, the Internet of Things generates millions of records from sensors, and mobile phones are in use from the moment we wake until we sleep. All these varieties of data formats are stored on Big Data platforms. But merely storing this data will not take us anywhere unless analytics is applied to it. Hence it is extremely important to close the loop with analytics insights.

Here is my version of A to Z for Analytics:

Artificial Intelligence: AI is the capability of a machine to imitate intelligent human behavior. BMW, Tesla, and Google are using AI for self-driving cars. AI should be used to solve tough real-world problems like climate modeling, disease analysis, and the betterment of humanity.

Boosting and Bagging: These are techniques for generating more accurate models by ensembling multiple models together. Bagging trains models in parallel on bootstrap resamples of the data and aggregates their predictions, while boosting trains models sequentially, with each new model focusing on the errors of the previous ones.
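
To make bagging concrete, here is a minimal sketch using only Python's standard library: a toy one-feature dataset and simple threshold "stumps" trained on bootstrap resamples, combined by majority vote. The data and the stump learner are purely illustrative.

```python
import random
from statistics import mean

# Toy 1-D dataset: (feature value, class label) -- purely illustrative.
data = [(1.0, 0), (1.5, 0), (2.0, 0), (2.5, 0),
        (6.0, 1), (6.5, 1), (7.0, 1), (7.5, 1)]

def train_stump(sample):
    """Fit a one-feature threshold classifier: midpoint between class means."""
    m0 = mean(x for x, y in sample if y == 0)
    m1 = mean(x for x, y in sample if y == 1)
    threshold = (m0 + m1) / 2
    return lambda x: 1 if x > threshold else 0

random.seed(42)
stumps = []
for _ in range(25):                                # 25 bagging rounds
    boot = [random.choice(data) for _ in data]     # bootstrap: sample with replacement
    if len({y for _, y in boot}) < 2:              # skip resamples missing a class
        continue
    stumps.append(train_stump(boot))

def bagged_predict(x):
    """Majority vote across all bootstrap-trained stumps."""
    votes = [stump(x) for stump in stumps]
    return 1 if sum(votes) > len(votes) / 2 else 0

print(bagged_predict(1.2), bagged_predict(7.2))    # -> 0 1
```

The vote smooths out the noise of any single resample, which is exactly why bagged ensembles such as Random Forest tend to be more stable than a lone model.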

CRISP-DM: The Cross-Industry Standard Process for Data Mining. It was developed by a consortium of companies including SPSS, Teradata, Daimler, and NCR Corporation in 1997 to bring order to the development of analytics models. The six major steps are business understanding, data understanding, data preparation, modeling, evaluation, and deployment.

Data preparation: In analytics deployments, more than 60% of the time is spent on data preparation. The usual rule is garbage in, garbage out, so it is important to cleanse and normalize the data and make it available for consumption by the model.
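
As a small illustration of what that preparation can look like, this sketch (on hypothetical records) imputes a missing value with the median and then min-max normalizes each column:

```python
from statistics import median

# Hypothetical raw records: one missing age, wildly different scales
raw = [
    {"age": 34,   "income": 58000},
    {"age": None, "income": 61000},
    {"age": 52,   "income": 90000},
    {"age": 29,   "income": 45000},
]

# Step 1: impute missing ages with the median of the observed ones
observed = [r["age"] for r in raw if r["age"] is not None]
fill = median(observed)
for r in raw:
    if r["age"] is None:
        r["age"] = fill

# Step 2: min-max normalize each column to [0, 1] so no feature dominates
for col in ("age", "income"):
    vals = [r[col] for r in raw]
    lo, hi = min(vals), max(vals)
    for r in raw:
        r[col] = (r[col] - lo) / (hi - lo)

print(raw[1])
```

Real pipelines add deduplication, outlier handling, and type fixes on top, but the shape is the same: fill the gaps, then bring everything onto a comparable scale.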

Ensembling: is the technique of combining two or more algorithms to get more robust predictions. It is like combining all the marks we obtain in exams to arrive at the final overall score. Random Forest is one such example of combining multiple decision trees.

Feature selection: Simply put, this means selecting only those features or variables from the data that really make sense, and removing non-relevant ones. This uplifts the model's accuracy.
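
One simple way to sketch this, with made-up feature names and toy data, is to rank features by the absolute Pearson correlation with the target and keep only the strongest:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical features: two informative, one pure noise
features = {
    "ad_spend": [1, 2, 3, 4, 5, 6],
    "price":    [9, 8, 7, 6, 5, 4],
    "noise":    [5, 1, 4, 2, 6, 3],
}
target = [10, 12, 15, 16, 19, 21]

ranked = sorted(features, key=lambda f: abs(pearson(features[f], target)),
                reverse=True)
selected = ranked[:2]   # keep only the two strongest features
print(selected)
```

Correlation ranking is only one of many filters; wrapper and embedded methods (stepwise selection, tree importances) follow the same idea of dropping variables that add noise rather than signal.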

Gini Coefficient: It is used to measure the predictive power of a model, typically in credit scoring tools that determine who will repay and who will default on a loan.
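
In scoring models the Gini coefficient is commonly computed as 2 × AUC − 1, where AUC is the probability that a randomly chosen positive case outscores a randomly chosen negative one. A small self-contained sketch on toy scores:

```python
def auc(scores, labels):
    """Probability that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy model scores and actual outcomes (1 = repaid, 0 = defaulted)
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0]

gini = 2 * auc(scores, labels) - 1   # Gini = 2 * AUC - 1
print(round(gini, 3))                # -> 0.778
```

A Gini of 0 means the model ranks no better than chance; 1 means it separates repayers from defaulters perfectly.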

Histogram: A graphical representation of the distribution of a set of numeric data, usually a vertical bar graph, used in the exploratory analytics and data preparation steps.
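
A quick text-mode sketch of the idea: bin toy values into fixed-width buckets and draw each bucket's count as a bar.

```python
values = [2, 3, 3, 4, 5, 5, 5, 6, 7, 8, 8, 9]   # toy numeric data
bin_width = 2

counts = {}
for v in values:
    lo = (v // bin_width) * bin_width            # left edge of the bin
    counts[lo] = counts.get(lo, 0) + 1

for lo in sorted(counts):                        # crude text histogram
    print(f"{lo}-{lo + bin_width - 1}: {'#' * counts[lo]}")
```

Plotting libraries do the same binning under the hood; inspecting the resulting shape is often the fastest way to spot skew and outliers before modeling.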

Independent Variable: The variable that is changed or controlled in a scientific experiment to test its effect on the dependent variable, like the effect of a price increase on sales.

Jubatus: An online Machine Learning library covering Classification, Regression, Recommendation (Nearest Neighbor Search), Graph Mining, Anomaly Detection, and Clustering.

KNN: The k-nearest-neighbors algorithm, used in Machine Learning for classification problems based on the distance or similarity between data points.
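
A minimal k-nearest-neighbors classifier on a toy two-feature dataset (k = 3, Euclidean distance, majority vote); the points and labels are invented for illustration:

```python
from math import dist

# Toy labeled points: (features, class) -- illustrative only
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((1.1, 1.3), "A"),
         ((5.0, 5.0), "B"), ((5.2, 4.8), "B"), ((4.9, 5.3), "B")]

def knn_predict(point, k=3):
    """Classify by majority vote among the k nearest training points."""
    nearest = sorted(train, key=lambda item: dist(point, item[0]))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)

print(knn_predict((1.0, 1.1)), knn_predict((5.1, 5.0)))   # -> A B
```

There is no training step at all: the "model" is just the stored data, which is why KNN is called a lazy learner.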

Lift Chart: Widely used in campaign targeting problems to determine which deciles of customers to target for a specific campaign. It also tells you how much response to expect from the new target base.
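
Decile lift can be sketched like this: sort (hypothetical) customers by model score, split them into ten equal groups, and divide each group's response rate by the overall rate.

```python
# (model score, responded?) for 20 hypothetical customers, best scores first
scored = sorted([(0.95, 1), (0.90, 1), (0.85, 1), (0.80, 0), (0.75, 1),
                 (0.70, 0), (0.65, 0), (0.60, 1), (0.55, 0), (0.50, 0),
                 (0.45, 0), (0.40, 1), (0.35, 0), (0.30, 0), (0.25, 0),
                 (0.20, 0), (0.15, 0), (0.10, 0), (0.05, 0), (0.02, 0)],
                reverse=True)

overall_rate = sum(r for _, r in scored) / len(scored)
decile_size = len(scored) // 10          # 2 customers per decile here
lifts = []
for d in range(10):
    chunk = scored[d * decile_size:(d + 1) * decile_size]
    rate = sum(r for _, r in chunk) / decile_size
    lifts.append(rate / overall_rate)    # lift = decile rate / overall rate

print([round(l, 1) for l in lifts])      # -> [3.3, 1.7, 1.7, 1.7, 0.0, 1.7, 0.0, 0.0, 0.0, 0.0]
```

A top-decile lift of 3.3 means those customers respond at 3.3 times the average rate, which is exactly the argument for mailing them first.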

Model: More than 50 modeling techniques, like regressions, decision trees, SVMs, GLMs, and neural networks, are present in technology platforms such as SAS Enterprise Miner, IBM SPSS, and R. They are broadly categorized into supervised and unsupervised methods for classification, clustering, and association rules.

Neural Networks: These are typically organized in layers made up of nodes and mimic learning the way the brain does. Deep Learning is an emerging field today, based on deep neural networks.

Optimization: The use of simulation techniques to identify the scenario that will produce the best results within available constraints, e.g. sale price optimization, or identifying the optimal inventory for maximum fulfillment while avoiding stockouts.
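
As a toy sketch of price optimization, assume a simple linear demand curve (entirely made up here) and search a constrained price range for the most profitable point:

```python
# Assumed linear demand curve (made up for illustration): units sold = 500 - 40 * price
UNIT_COST = 2.5

def profit(price):
    units = max(0, 500 - 40 * price)
    return (price - UNIT_COST) * units

# Constraint: price must stay between $2.00 and $12.50; search in $0.10 steps
best = max((p / 10 for p in range(20, 126)), key=profit)
print(best, profit(best))                # -> 7.5 1000.0
```

Real problems swap the grid search for linear or mixed-integer programming, but the structure is the same: an objective to maximize, evaluated only inside the feasible region.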

PMML: The Predictive Model Markup Language, an XML-based file format developed by the Data Mining Group to transfer models between various technology platforms.

Quartile: Dividing the sorted output of the model into four equal groups for further action.
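
With Python's standard library, the quartile cut points and the group assignment look like this (toy model scores; `statistics.quantiles` uses the exclusive method by default):

```python
from statistics import quantiles
from bisect import bisect_left

model_scores = [12, 15, 17, 20, 22, 25, 28, 31, 35, 40, 44, 50]  # already sorted

q1, q2, q3 = quantiles(model_scores, n=4)      # quartile cut points
print(q1, q2, q3)                              # -> 17.75 26.5 38.75

# Assign each score to a quartile group 1..4 for further action
groups = [bisect_left([q1, q2, q3], s) + 1 for s in model_scores]
print(groups)                                  # -> [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]
```

Group 4 here holds the top quarter of scores, which is typically where targeting or review effort goes first.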

R: Today universities and even corporations use R for statistical model building. It is freely available, and there are also licensed distributions like Microsoft R. More than 7,000 packages are now at the disposal of data scientists.

Sentiment Analytics: The process of determining whether information or a service provided by a business leads to positive, negative, or neutral human feelings or opinions. Consumer product companies measure sentiment around the clock and adjust their marketing strategies accordingly.
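
A heavily simplified sketch of the idea, scoring text against a tiny hand-made lexicon; production systems use large curated lexicons or trained models instead.

```python
# Tiny hand-made lexicon -- real systems use large curated lexicons or trained models
POSITIVE = {"great", "love", "excellent", "fast"}
NEGATIVE = {"slow", "broken", "terrible", "refund"}

def sentiment(text):
    """Net count of positive vs negative words decides the label."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("Great phone and excellent battery"))     # -> positive
print(sentiment("Screen arrived broken, want a refund"))  # -> negative
```

Even this crude approach shows why negation and sarcasm make sentiment hard: "not great" would still score positive here.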

Text Analytics: Used to discover and extract meaningful patterns and relationships from text collected from social media sites such as Facebook, Twitter, LinkedIn, and blogs, as well as from call center scripts.

Unsupervised Learning: Algorithms that receive only input data, with no labeled outcomes, and are expected to find patterns in it. Clustering and association algorithms like k-means and Apriori are the best examples.
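
A minimal one-dimensional k-means (Lloyd's algorithm) on toy points, using only the standard library, shows the pattern-finding idea: no labels go in, yet two clusters come out.

```python
from statistics import mean

points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]   # two obvious clusters, no labels
centroids = [0.0, 8.0]                    # deliberately poor starting guesses

for _ in range(10):                       # Lloyd's algorithm iterations
    clusters = [[], []]
    for p in points:
        nearest = min((0, 1), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)       # assign each point to its closest centroid
    # move each centroid to the mean of its assigned points
    centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]

print([round(c, 2) for c in centroids])   # -> [1.0, 5.0]
```

The same assign-then-recenter loop scales to many dimensions and many clusters; only the distance function and bookkeeping grow.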

Visualization: The method of enhanced exploratory data analysis and of showing modeling results through highly interactive statistical graphics. Any model output has to be presented to senior management in the most compelling way. Tableau, QlikView, and Spotfire are leading visualization tools.

What-If analysis: A method to simulate business scenario questions such as: what if we increased our marketing budget by 20%, what would the impact on sales be? Monte Carlo simulation is very popular here.
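
A toy Monte Carlo what-if sketch, with a completely assumed sales model (a fixed baseline revenue plus an uncertain marketing uplift), comparing the current budget with a 20% increase:

```python
import random

random.seed(1)   # reproducible runs

def simulate_sales(budget_multiplier, trials=10_000):
    """Average simulated sales under an assumed model:
    baseline 100k revenue plus an uncertain marketing uplift."""
    total = 0.0
    for _ in range(trials):
        uplift = random.gauss(0.15, 0.05)          # uncertain return on marketing
        total += 100_000 * (1 + uplift * budget_multiplier)
    return total / trials

base = simulate_sales(1.0)       # current budget
scenario = simulate_sales(1.2)   # what if the budget were 20% higher?
print(round(scenario - base))    # expected extra sales, roughly 3,000 here
```

Running thousands of random draws instead of a single point estimate is what turns a what-if question into a distribution of outcomes rather than one guess.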

What do you think should come for X, Y, and Z?

