Demystifying Log Transformations: A Key to Faster Convergence in Optimization Algorithms

Aakash Goel
4 min read · Apr 17, 2024

Index of Article

  • Statement
  • Question
  • Answer
  • Some Supporting Terminologies, knowledge

Statement: A log transformation can make a data distribution less skewed, which can help optimization algorithms converge faster.

Question: How does it help an optimization algorithm converge faster?


Answer: When we apply a log transformation to right-skewed, positive data, we compress the larger values more than the smaller ones. This essentially “pulls in” the tail of the distribution, making it less spread out and often closer to a symmetric, bell-shaped distribution. Here’s how this helps optimization algorithms converge faster:

  1. Equalizes the Scale: Skewed data often have a wide range of values, with some values much larger than others. This inequality in scale can make it difficult for optimization algorithms to navigate the search space efficiently. By compressing the larger values with a log transformation, we reduce this inequality and make the scale of the data more uniform.
  2. Smooths the Gradient: Optimization algorithms, like gradient descent, work by iteratively updating parameters in the direction that reduces a loss function. With skewed data, the loss surface can be steep in some directions and shallow in others, so a single learning rate is too large for the steep parts and too small for the shallow ones, which makes optimization slow and inefficient. When we transform the data to reduce skewness, we often make the gradient smoother and more consistent across the entire range of values. This smoother gradient lets the algorithm move steadily towards the optimal solution without overshooting in steep regions, crawling through shallow ones, or oscillating around the minimum.
  3. Reduces the Impact of Outliers: Outliers can “distract” an optimization algorithm and slow down convergence. A log transformation pulls extreme values closer to the rest of the data, which reduces their influence and can help the algorithm converge more quickly.
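The effect of scale on convergence can be sketched with a small experiment. This is a minimal illustration, not a benchmark: it assumes synthetic log-normal data, a one-feature linear regression trained with plain gradient descent, and a target that is linear in whichever version of the feature is used. The helper name gd_steps and all constants are hypothetical choices made for this example.

```python
import numpy as np

def gd_steps(x, y, rel_tol=1e-6, max_steps=200_000):
    """Run gradient descent on 0.5 * MSE.

    Returns (number of gradient evaluations, condition number of the Hessian).
    """
    X = np.column_stack([np.ones_like(x), x])   # intercept column + one feature
    H = X.T @ X / len(y)                        # Hessian of the loss (constant for least squares)
    lr = 1.0 / np.linalg.eigvalsh(H).max()      # step size 1 / largest eigenvalue (stable choice)
    w = np.zeros(2)
    grad0 = np.linalg.norm(X.T @ (X @ w - y) / len(y))
    for step in range(1, max_steps + 1):
        grad = X.T @ (X @ w - y) / len(y)
        # Stop once the gradient has shrunk by a fixed factor relative to the start.
        if np.linalg.norm(grad) < rel_tol * grad0:
            return step, np.linalg.cond(H)
        w -= lr * grad
    return max_steps, np.linalg.cond(H)

rng = np.random.default_rng(0)
x_raw = rng.lognormal(mean=0.0, sigma=1.5, size=2000)   # right-skewed, strictly positive
x_log = np.log(x_raw)                                   # log-transformed feature
noise = rng.normal(scale=0.1, size=x_raw.size)

# Assumption: the target is linear in the feature that the model actually sees.
steps_raw, cond_raw = gd_steps(x_raw, 2.0 + 3.0 * x_raw + noise)
steps_log, cond_log = gd_steps(x_log, 2.0 + 3.0 * x_log + noise)

print(f"raw feature: {steps_raw} steps, condition number {cond_raw:.1f}")
print(f"log feature: {steps_log} steps, condition number {cond_log:.1f}")
```

Under these assumptions the raw feature should need many more steps than the log feature. The condition number of the Hessian is one way to quantify the "steep in some directions, shallow in others" picture: the larger it is, the slower plain gradient descent tends to be.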

3.A.1 In gradient descent, the direction and magnitude of the gradient at each step determine how much the parameters are updated. Outliers, which are data points significantly different from the majority of the data, can have a disproportionately large impact on the gradient calculation due to their extreme values. This can lead to large and erratic updates in parameter values, causing the optimization process to oscillate or diverge rather than converging smoothly towards the optimal solution.

3.A.2 Outliers can also slow down the convergence of gradient descent by pulling the algorithm towards parameter values that fit the extreme points rather than the bulk of the data. With a non-convex loss function, the algorithm may additionally get “stuck” in local minima or plateau regions shaped by the outliers. Either way, this can prolong the optimization process and require more iterations to reach a satisfactory solution.
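The disproportionate pull of an outlier on the gradient can be sketched as follows. This is only an illustration under simple assumptions: the model predicts a single constant c, the loss is mean squared error, and the gradient is evaluated at the starting point c = 0. The salary figures and the helper name outlier_gradient_share are hypothetical.

```python
import numpy as np

def outlier_gradient_share(y, c=0.0):
    """Fraction of the total MSE gradient magnitude (w.r.t. a constant
    prediction c) that comes from the largest value in y."""
    per_sample = 2.0 * (c - y) / y.size    # d/dc of mean((c - y)^2), split by sample
    magnitudes = np.abs(per_sample)
    return magnitudes[np.argmax(y)] / magnitudes.sum()

# Hypothetical salaries in thousands: five typical values and one outlier.
salaries = np.array([30.0, 35.0, 40.0, 45.0, 50.0, 1000.0])

share_raw = outlier_gradient_share(salaries)
share_log = outlier_gradient_share(np.log(salaries))

print(f"outlier share of gradient, raw data: {share_raw:.2f}")
print(f"outlier share of gradient, log data: {share_log:.2f}")
```

In this toy setup the single outlier accounts for roughly 83% of the gradient on the raw data and roughly 27% after the log transformation, so the first update is no longer dictated almost entirely by one data point.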

Some Supporting Terminologies, knowledge

  • Skewed data — It means that the values in your dataset are not evenly distributed. Instead, they’re concentrated more towards one end, while the other end has a few extreme values that pull the average in their direction.
  • Example of Skewed Data — Imagine you have a bunch of numbers representing, let’s say, the salaries of people in a company. Now, picture these numbers plotted on a graph, with the number of people on the y-axis and the salary amounts on the x-axis. If most people in the company earn around the same salary, the graph would look like a tall, narrow mountain, with the peak representing the most common salary. But what if there are a few people who earn exceptionally high salaries? In that case, the graph would look more like a mountain with a long tail stretching out to the right. This is what “skewed” data means.
  • Reducing skewness — Trying to make the distribution of values more balanced and less influenced by these extreme values. This can help us better understand and analyze the data, as well as make certain statistical techniques and machine learning models work more effectively.
  • Log Transformation — Mathematical operation that changes each number in a dataset to the logarithm of that number, usually with base 10 or base e (natural logarithm).

Original Data: 1, 10, 100, 1000, 10000
After Log Transform (base 10): 0, 1, 2, 3, 4

In the original data, the difference between 1 and 10 is 9, and the difference between 1000 and 10000 is 9000. So, the differences are not the same. They are much larger for larger numbers.

In the transformed data:

  • The difference between log10(1) and log10(10) is 1.
  • The difference between log10(1000) and log10(10000) is also 1.
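In other words, equal ratios in the original data (each value is 10 times the previous one) become equal differences after the transformation. The numbers above can be checked with a short NumPy sketch. The skewness helper is a hypothetical name for the standard moment-based formula, and the sketch assumes strictly positive inputs, since the logarithm is undefined for zero and negative values.

```python
import numpy as np

def skewness(values):
    """Moment-based skewness: mean cubed deviation / cubed standard deviation."""
    centered = values - values.mean()
    return (centered ** 3).mean() / values.std() ** 3

original = np.array([1.0, 10.0, 100.0, 1000.0, 10000.0])
transformed = np.log10(original)            # approximately [0, 1, 2, 3, 4]

gaps_original = np.diff(original)           # gaps grow with the values: 9, 90, 900, 9000
gaps_transformed = np.diff(transformed)     # every gap is approximately 1

print("gaps before:", gaps_original)
print("gaps after: ", gaps_transformed)
print("skewness before:", round(float(skewness(original)), 2))
print("skewness after: ", round(float(skewness(transformed)), 2))
```

The skewness of the original values is clearly positive (a long right tail), while the transformed values are evenly spaced and therefore have a skewness of about zero. For data that contains zeros, np.log1p, which computes log(1 + x), is a common alternative.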
