Log Transformation: How Can It Transform ML Model Performance?
This blog covers different aspects of feature scaling, with a major focus on log transformation, along with a real example to demonstrate the outcomes.
Introduction to Feature Scaling/Transformation:
Feature Scaling brings the features in a standardized or fixed range.
“Imagine comparing binge-watching a web series with watching a movie. Can you? You simply cannot!”
We need to bring all the features onto the same scale/unit before we feed the data to ML models; otherwise, features with large values will be given higher importance.
Moreover, unscaled features can result in unstable learning, because with large weights a small change in input can produce a large change in output, which may also cause over-fitting.
Different Scaling/Transformation Methods:
Below are the most commonly used scaling/transformation methods available in Python (a detailed discussion of each is not covered here, to keep this blog's length in check):
- Standard Scaler
- Min-Max Scaler
- Robust Scaler
- Log Transformation
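As a quick sketch of how these methods differ in practice, the snippet below applies each one to the same column using scikit-learn. The feature values are made up for illustration, with one extreme value to mimic a skewed column:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Hypothetical skewed feature (e.g. prices), with one extreme value
X = np.array([[10.0], [12.0], [15.0], [20.0], [1000.0]])

print(StandardScaler().fit_transform(X).ravel())  # zero mean, unit variance
print(MinMaxScaler().fit_transform(X).ravel())    # rescaled into [0, 1]
print(RobustScaler().fit_transform(X).ravel())    # median/IQR based, outlier-resistant
print(np.log1p(X).ravel())                        # log(1 + x) compresses the tail
```

Note how the single extreme value squashes the Min-Max output of the other rows towards 0, while the robust and log versions are far less affected by it.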
Log Transformation, when & how?
Log transformation is very familiar to data scientists when it comes to highly skewed distributions. A highly skewed feature can be an issue for ML models (except tree-based models), as the tail region acts like outliers (“Outliers in ML! The name is enough..”).
When to apply?
We can use log transformation when we have a highly skewed distribution and want to pull the tail (the large values) back towards the bulk of the data, bringing the distribution closer to normal.
For example, log₁₀(1000) is 3 and log₁₀(100) is 2: values that differ by a factor of 10 end up just 1 apart. That is how it compresses the tail of a highly skewed distribution.
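To see this compression numerically, here is a small sketch on a synthetic right-skewed (log-normal) sample; the distribution and its parameters are made up purely for illustration:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)
# Log-normal draws are heavily right-skewed, like prices or incomes
x = rng.lognormal(mean=3.0, sigma=1.0, size=10_000)

# log1p = log(1 + x): safe even if the data contains zeros
print(f"skewness before: {skew(x):.2f}")
print(f"skewness after log1p: {skew(np.log1p(x)):.2f}")
```

The skewness drops from a large positive value to nearly zero, i.e. the transformed feature is close to normally distributed.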
The real use case of Log Transformation:
The use case is based on a regression problem where the target feature was the price of a product. The attained accuracy on test data was approx. 90% (wait for the real learning).
I looked at the distribution of the response variable and it was highly skewed (“Revisiting EDA is often a good idea”). I applied log transformation to the response variable and rebuilt the model. Accuracy got a further boost of 4%. This is what log transformation is capable of. One caveat: after training on a log-transformed target, predictions come out on the log scale, so remember to invert the transformation before reporting them.
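The original model and dataset are not shown here, so the sketch below reproduces the idea on synthetic data: a regression target generated with a log-linear relationship (a hypothetical stand-in for skewed prices), scored with R² once on the raw target and once after log-transforming the target and inverting predictions with expm1. All names and numbers are illustrative assumptions, not the blog's actual experiment:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic skewed target: exponential of a linear signal plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
y = np.exp(1.0 + X @ np.array([0.5, 0.3, 0.2]) + rng.normal(scale=0.2, size=2000))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Baseline: fit directly on the raw, skewed target
raw_r2 = r2_score(y_te, LinearRegression().fit(X_tr, y_tr).predict(X_te))

# Log-transform the target, fit, then invert predictions with expm1
model = LinearRegression().fit(X_tr, np.log1p(y_tr))
log_r2 = r2_score(y_te, np.expm1(model.predict(X_te)))

print(f"R^2 on raw target: {raw_r2:.3f}")
print(f"R^2 on log target: {log_r2:.3f}")
```

Because the synthetic relationship is log-linear by construction, the log-target model scores noticeably higher, mirroring the boost described above.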
Hope the blog was helpful, thanks for your time! :)