Highly accurate statistical models may not work in real-world contexts, as the large body of published research that has resisted replication shows.

Stephen Chen blasts the sacrifice of predictive power for model accuracy in this report from Predictive Analytics World:

This democratization of Data Science shifts the priority from explanation to production. The how-to can-do approach contrasts with the why-not-to dictums of Statistics classes (or, at least, in my time). As a result, the meaning of inference, based on triangulated evidence and reasoning, has increasingly shifted to a merely mathematical or computational problem.

What is unique to Data Science/Machine Learning is a set of purely algorithmic approaches (e.g. Genetic Algorithms, Neural Networks, Boosting and Stacking) that have been increasingly hyped because of their superior “inference”, where they can outperform other methods / humans / whatever straw man is at hand.

These algorithmic-based approaches are marketed as “learning from data”, but even that concept has been bastardized – the Bayesian approach of adjusting probabilities based on extant data has been reduced to an algorithm that essentially attempts to fit a curve to as many points as possible. In short, inference has gone from a sniper aiming to hit the target as closely as possible, to firing a shotgun hoping one pellet will land near the target via computational brute force.

Statistical inference is intimately tied to probability distributions – Gaussian, Poisson, Binomial etc. are evidence-backed probability density functions corresponding to specific event characteristics. There are application domains in which algorithmic approaches are wholly appropriate (e.g. genetic algorithms in robotics), and even necessary (neural networks and image classification) when it is difficult to operationalize probability density (and the scope of data and context are contained).

… Data Scientists applying the latest algorithmic approach in a domain with known probability densities risk sacrificing predictive power for model accuracy. They are either seduced by the latest “fad” or unaware that algorithmic-based approaches perform poorly beyond the limits of their input data range. More importantly, model accuracy should not be the sole end in itself, because accuracy and overfitting are two sides of the same coin.
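Chen's closing point can be made concrete with a small sketch (my illustration, not from the report): fit a degree-9 polynomial exactly through ten slightly perturbed points on a straight line, achieving zero training error – perfect "accuracy" – then evaluate it one step beyond the input data range, where it misses wildly.

```python
def lagrange_eval(xs, ys, x):
    """Evaluate the unique polynomial interpolating points (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Ten training points: the line y = x with a tiny alternating
# perturbation standing in for noise.
xs = list(range(10))
ys = [x + 0.1 * (-1) ** x for x in xs]

# "Accuracy": the degree-9 fit passes through every training point.
train_err = max(abs(lagrange_eval(xs, ys, x) - y) for x, y in zip(xs, ys))

# One step beyond the input range the fit collapses: the underlying
# line gives y = 10, but the polynomial swings to roughly -92.
extrap = lagrange_eval(xs, ys, 10)
print(train_err, extrap)
```

A model that hits every training point is maximally "accurate" and maximally overfit at the same time – accuracy and overfitting as two sides of the same coin – while a plain least-squares line (the "sniper" with the right distributional assumptions) would extrapolate to roughly 10.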