Explanatory vs. Predictive Models in Machine Learning

Agnijit Das Gupta

Artificial Intelligence / Machine Learning

Tags:

data science

machine learning

My view of Data Analysis is that there is a continuum between explanatory models on one side and predictive models on the other. The decisions you make during the modeling process depend on your goal. Take Customer Churn as an example. You can ask why customers are leaving, or you can ask which customers are leaving. The primary goal of the first question is to explain churn, while the primary goal of the second is to predict it. These are two fundamentally different questions, and the difference has implications for the decisions you take along the way. The predictive side of Data Analysis is closely related to terms like Data Mining and Machine Learning.

SPSS & SAS

SPSS and SAS both originate from the explanatory side of Data Analysis. They were developed in an academic environment, where hypothesis testing plays a major role. As a result, they offer significantly fewer methods and techniques than R and Python. Nowadays, SAS and SPSS both have data mining tools (SAS Enterprise Miner and SPSS Modeler), but these are separate products that require extra licenses.

I have spent some time building extensive macros in SAS EG that create predictive models seamlessly and also do a decent job of explaining feature importance. While a Neural Network may do a fair job at making predictions, such models are extremely difficult to explain, let alone break down by feature importance. The macros I have built in SAS EG do precisely this job of explaining the features, in addition to producing excellent predictions.

OPEN SOURCE TOOLS: R & PYTHON

One of the major advantages of open source tools is that the community continuously improves and increases functionality. R was created by academics, who wanted their algorithms to spread as easily as possible. R has the widest range of algorithms, which makes R strong on the explanatory side and on the predictive side of Data Analysis.

Python was developed with a strong focus on (business) applications rather than from an academic or statistical standpoint. This makes Python very powerful when algorithms are used directly in applications, and it is why its statistical capabilities are focused primarily on the predictive side. Python is mostly used in Data Mining or Machine Learning applications where a data analyst doesn’t need to intervene, which is also why it is strong at analyzing images and videos. It is also the easiest language to use with Big Data frameworks like Spark. With its plethora of packages and ever-improving functionality, Python is a very accessible tool for data scientists.

MACHINE LEARNING MODELS

While procedures like Logistic Regression are very good at explaining the features used in a prediction, others like Neural Networks are not. The latter may be preferred when only prediction accuracy matters and the model does not need to be explained. Interpretation is the problem with Neural Networks: you can’t just peek inside a deep neural network to figure out how it works. A network’s reasoning is embedded in the behavior of numerous simulated neurons, arranged into dozens or even hundreds of interconnected layers. In most cases, a Product Marketing Officer is interested in knowing which factors matter most for a specific advertising project, that is, what they can concentrate on to raise response rates, rather than what the response rate or revenue will be in the upcoming year. Such questions are better answered by procedures that are easier to interpret. This is a great article about the technical and ethical consequences of the lack of explanations provided by complex AI models.
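To make the point about Logistic Regression concrete, here is a minimal sketch in plain Python of how fitted coefficients can be read as an explanation. The data set, the feature names ("discount", "weekend") and the helper names are hypothetical and exist only for illustration; a real project would normally use a statistics package rather than this hand-written gradient descent.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit a bias and one weight per feature by batch gradient descent."""
    n, n_features = len(rows), len(rows[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(rows, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - y                      # gradient of the log loss
            grad_b += err
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return bias, weights

def make_toy_data():
    """Hypothetical campaign data: features are coded +1 / -1.

    x[0] = customer received a discount, x[1] = customer signed up on a weekend.
    By construction 75% respond with a discount, 25% without, and the
    weekend flag makes no difference at all.
    """
    rows, labels = [], []
    for discount in (1, -1):
        for weekend in (1, -1):
            responders = 3 if discount == 1 else 1
            for i in range(4):
                rows.append([discount, weekend])
                labels.append(1 if i < responders else 0)
    return rows, labels

rows, labels = make_toy_data()
bias, weights = fit_logistic(rows, labels)
for name, w in zip(["discount", "weekend"], weights):
    print(f"{name:>8}: coefficient {w:+.3f}")
```

The sign and size of each coefficient is the explanation: the discount coefficient is clearly positive, while the weekend coefficient stays at essentially zero, which is the kind of statement a marketer can act on.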

Procedures like Decision Trees are very good at explaining and visualizing exactly what the decision points are (the features and their split values). However, single trees do not produce the best models. Random Forests and Boosting use Decision Trees as the basic building block and are among the best methods for building sophisticated prediction models.

Random Forests use fully grown (highly complex) Trees. Each tree is trained on a random sample drawn with replacement from the training set (a process called Bootstrapping), and each split considers only a proper subset of the features rather than the entire feature set. Bootstrapping helps when training data is limited (in many cases there is no way to get more data). The (proper) subsetting of the features has a tremendous effect on de-correlating the Trees grown in the Forest (hence the randomization), leading to a drop in Test Set error. A fresh subset of features is chosen at each split, which makes the method robust. The strategy also stops the strongest feature from being available every time a split is considered, which would otherwise make all the trees in the forest similar. The final result is obtained by averaging over all trees (for Regression problems) or by taking a majority class vote (for classification problems).
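The two sources of randomness and the final aggregation step can be sketched in a few lines of plain Python. This is only an illustration of the ideas, not a full Random Forest: the function names are hypothetical, and the square-root rule for the size of the feature subset is an assumption (a common default for classification) that the text above does not specify.

```python
import math
import random
from collections import Counter

def bootstrap_sample(n_rows, rng):
    """Draw n_rows row indices with replacement (one bootstrap sample per tree)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def candidate_features(n_features, rng):
    """Pick the fresh random feature subset considered at a single split."""
    k = max(1, int(math.sqrt(n_features)))  # assumed default, not from the text
    return rng.sample(range(n_features), k)

def majority_vote(predictions):
    """Combine the per-tree class predictions for one example (classification)."""
    return Counter(predictions).most_common(1)[0][0]

def average(predictions):
    """Combine the per-tree numeric predictions for one example (regression)."""
    return sum(predictions) / len(predictions)

rng = random.Random(42)
sample = bootstrap_sample(10, rng)      # some rows repeat, some are left out
features = candidate_features(16, rng)  # 4 of the 16 features for this split

print("bootstrap rows:", sample)
print("features tried at this split:", features)
print("forest says:", majority_vote(["churn", "stay", "churn"]))
```

Because every tree sees a different bootstrap sample and every split sees a different feature subset, no single strong feature can dominate all trees, which is what de-correlates them.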

Boosting, on the other hand, is a method where a Forest is grown from Trees which are NOT fully grown, in other words from Weak Learners, and the trees are fitted one after another rather than independently. One has to specify the number of trees to be grown. In AdaBoost, one particularly well-known implementation of Boosting, every training example starts with the same weight (1/N for N examples). At each iteration the method fits a weak learner to the weighted data and measures its error. The weights of the examples that the tree failed to classify correctly are then increased, so that the next tree concentrates on the failed examples. (Gradient Boosting variants instead fit each new tree to the residuals of the current ensemble.) Each tree also receives its own weight in the final weighted majority vote, and more accurate trees get larger weights. The method proceeds this way, improving the accuracy of the Boosted Trees, until the requested number of trees has been grown or the improvement falls below a threshold. AdaBoost typically uses Trees of depth 1, known as Decision Stumps, as the members of the Forest. Individually these are only slightly better than random guessing, but combined they learn the pattern and can perform extremely well on the test set. The method works like a feedback control mechanism (where the system learns from its errors). To address overfitting, one can use the Learning Rate hyper-parameter (lambda), with values in the range (0,1]. Very small values of lambda need more trees to converge, while larger values learn faster but are more prone to overfitting. The value can be selected iteratively by plotting the error on held-out data against values of lambda and choosing the lambda with the lowest error.
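The feedback loop is easier to see in code. Below is a minimal sketch of discrete AdaBoost with Decision Stumps in plain Python, under a few stated assumptions: labels are coded +1 / -1, the weights that get increased are attached to the training examples (as in the standard AdaBoost formulation), each stump's vote is 0.5 * ln((1 - err) / err) scaled by the learning rate, and the one-feature toy data set is hypothetical. It is not the implementation of any particular library.

```python
import math

def stump_predict(stump, x):
    """A depth-1 tree: one feature, one threshold, one direction."""
    feature, threshold, polarity = stump
    return polarity if x[feature] <= threshold else -polarity

def fit_stump(X, y, weights):
    """Return the stump with the lowest weighted error, and that error."""
    best, best_err = None, float("inf")
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            for polarity in (1, -1):
                stump = (feature, threshold, polarity)
                err = sum(w for row, label, w in zip(X, y, weights)
                          if stump_predict(stump, row) != label)
                if err < best_err:
                    best, best_err = stump, err
    return best, best_err

def adaboost(X, y, n_rounds, learning_rate=1.0):
    n = len(X)
    weights = [1.0 / n] * n                      # every example starts equal
    ensemble = []
    for _ in range(n_rounds):
        stump, err = fit_stump(X, y, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = learning_rate * 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Misclassified examples are up-weighted, correct ones down-weighted.
        weights = [w * math.exp(-alpha * label * stump_predict(stump, row))
                   for row, label, w in zip(X, y, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Weighted majority vote of all stumps."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical toy data: no single threshold separates the classes.
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, -1, -1, 1, 1]

for rounds in (1, 3):
    model = adaboost(X, y, rounds)
    correct = sum(predict(model, row) == label for row, label in zip(X, y))
    print(f"{rounds} stump(s): {correct}/{len(y)} correct")
```

On this toy data a single stump cannot get every example right, while three boosted stumps can, because each new stump is fitted to weights that emphasize what the previous ones got wrong.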

In all these methods, as we move from Logistic Regression to Decision Trees to Random Forests and Boosting, the complexity of the models increases, making it almost impossible to EXPLAIN a Boosting model to marketers or product managers. Decision Trees are easy to visualize, and Logistic Regression results can be used to demonstrate the most important factors in a customer acquisition model, so both are well received by business leaders. Random Forest and Boosting methods, on the other hand, are extremely good predictors without much scope for explanation. But there is hope: these models have functions for revealing the most important variables, although they do not show why those variables matter.
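One common way to reveal the most important variables of an otherwise opaque model is permutation importance: shuffle one feature column and measure how much the accuracy drops. The sketch below is a plain-Python illustration of that idea and is only one of several importance measures in use. The `black_box` function is a hypothetical stand-in for a trained Random Forest or Boosting model, and the data is made up.

```python
import random

def accuracy(model, X, y):
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy when a single feature column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)                  # break the link to the target
            shuffled = [row[:j] + [value] + row[j + 1:]
                        for row, value in zip(X, column)]
            drops.append(baseline - accuracy(model, shuffled, y))
        importances.append(sum(drops) / n_repeats)
    return importances

def black_box(row):
    """Hypothetical trained model: any callable mapping a row to a class."""
    return 1 if row[0] > 0.5 else 0

# Made-up data: feature 0 drives the outcome, feature 1 is irrelevant.
X = [[i / 19, ((i * 7) % 20) / 19] for i in range(20)]
y = [black_box(row) for row in X]

importances = permutation_importance(black_box, X, y)
for j, value in enumerate(importances):
    print(f"feature {j}: mean accuracy drop {value:.3f}")
```

The output ranks the features by how much the model depends on them, but, as noted above, it does not say why: the ranking tells a business leader what matters, not how the model uses it.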

USING A BALANCED APPROACH

So I use a mixed strategy. In phase one, use the simpler methods as a step in Exploratory Data Analysis and present the feature importance and the characteristics of the data to the business leaders. Then, after building competing models, use the more complicated models to build the prediction models for deployment. That way, one not only gets to understand what is happening and why, but also gets the best predictive power. In the cases I have worked on, I have rarely seen a mismatch between the explanations and the predictions from the different methods. After all, this is all math, and the method of delivery should not change the end results. Now that's a happy ending for all sides of the business!

Velotio Technologies is an outsourced software product development partner for top technology startups and enterprises. We partner with companies to design, develop, and scale their products. Our work has been featured on TechCrunch, Product Hunt and more.

We have partnered with our customers to build 90+ transformational products in areas of edge computing, customer data platforms, exascale storage, cloud-native platforms, chatbots, clinical trials, healthcare and investment banking.

Since our founding in 2016, our team has completed more than 90 projects with 220+ employees across the following areas:

Building web/mobile applications
Architecting Cloud infrastructure and Data analytics platforms
Designing AI/ML-based solutions
Intelligent Chatbots
