A search for friendly functions
Note: The following code was written in Python and pulled from various Jupyter notebooks. Inline comments have been removed to make the article easier to read.
Skewed data is tricky and common. It is often desirable to transform skewed data and to convert it into values between 0 and 1.
Standard functions used for such transformations include normalization, the sigmoid, the log, the cube root, and the hyperbolic tangent. It all depends on what you are trying to achieve.
Here is an example of a skewed column, taken from an 8.9-million-row Amazon book review dataset (Julian McAuley, UCSD). df.Helpful_Votes gives the total number of helpful (versus unhelpful) votes each book review received.
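Before the transformations, a quick note on setup. The original notebooks are not shown, so the snippet below is only a minimal sketch of how the data frame might be loaded; the file name amazon_book_reviews.csv is a hypothetical placeholder, and only the column df.Helpful_Votes comes from the article. The imports are assumed by every snippet that follows.

import numpy as np
import pandas as pd

# Hypothetical path -- substitute the location of the McAuley review data.
df = pd.read_csv('amazon_book_reviews.csv')

# The column used throughout: total helpful votes per book review.
df.Helpful_Votes.head()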
df.Helpful_Votes, original data

In:
df.Helpful_Votes.describe()

Out:
count    4.756338e+06
mean     5.625667e+00
std      2.663631e+01
min      0.000000e+00
25%      1.000000e+00
50%      2.000000e+00
75%      4.000000e+00
max      2.331100e+04
0, 1, 2, 4, 23311. That's a big jump!
Outlier removal is an option, but not one I want to use here. My ultimate goal is to build a machine learning algorithm to predict whether a given review is helpful, so the reviews with the most helpful votes are essential.
I'll use the functions listed above to transform the data, explaining the pros and cons along the way. As with most of data science, there is no single right function; it depends on the data and the goals of the analyst.
Normalization converts every data point to a decimal between 0 and 1. If the minimum is 0, simply divide each point by the maximum.
If the minimum is not 0, subtract the minimum from each point and divide by the difference between the maximum and the minimum.
The following function covers both cases.
Normalization function

In:
def normalize(column):
    upper = column.max()
    lower = column.min()
    y = (column - lower) / (upper - lower)
    return y

helpful_normalized = normalize(df.Helpful_Votes)
helpful_normalized.describe()

Out:
count    4.756338e+06
mean     2.413310e-04
std      1.142650e-03
min      0.000000e+00
25%      4.289820e-05
50%      8.579641e-05
75%      1.715928e-04
max      1.000000e+00
After normalization, the data is just as skewed as it was before. If the goal is simply to turn the data into points between 0 and 1, then normalization is the way to go. Otherwise, normalization must be used in conjunction with other functions.
Next, the sigmoid function. If you've never seen the sigmoid before, it's worth looking at a picture.
It's a wonderfully smooth curve that guarantees a 0 to 1 range. Let's see how it works on df.Helpful_Votes.
Sigmoid function

In:
def sigmoid(x):
    e = np.exp(1)
    y = 1 / (1 + e**(-x))
    return y

helpful_sigmoid = sigmoid(df.Helpful_Votes)
helpful_sigmoid.describe()

Out:
count    4.756338e+06
mean     8.237590e-01
std      1.598215e-01
min      5.000000e-01
25%      7.310586e-01
50%      8.807971e-01
75%      9.820138e-01
max      1.000000e+00
A definite improvement. The new data falls between 0 and 1 as expected, but the minimum is 0.5. This makes sense from the graph, since the vote counts are never negative.
Another point to consider is spread. Here the 75th percentile is within 0.2 of the 100th percentile, a gap comparable to the other quartiles. But in the original data, the 75th percentile is more than 20,000 away from the 100th percentile and nowhere near any other quartile. In this respect, the sigmoid has distorted the data.
The sigmoid function can be modified to improve the results; one possible tweak is sketched below, but let's also explore other options.
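As one illustration of such a modification (my own sketch, not part of the original analysis), the input can be shifted and scaled before the sigmoid is applied, which spreads the low vote counts over more of the 0-to-1 range. The constants c and s below are arbitrary values chosen only to show the idea.

# Shifted and scaled sigmoid -- c and s are hypothetical tuning constants.
def scaled_sigmoid(x, c=2, s=3):
    return 1 / (1 + np.exp(-(x - c) / s))

helpful_scaled_sigmoid = scaled_sigmoid(df.Helpful_Votes)
helpful_scaled_sigmoid.describe()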
Next, logarithms. They are an excellent option for making data less skewed. When taking logs in Python, the default base is e (the natural log).
Log function

In:
helpful_log = np.log(df.Helpful_Votes)
helpful_log.describe()

Out:
RuntimeWarning: divide by zero encountered in log
Uh oh. Division by zero! How did that happen? Maybe a picture will clarify.
Graph of the log function
Ah, right. The domain of the log is strictly greater than 0. There is a vertical asymptote along the y-axis: as x approaches 0, y approaches negative infinity. In other words, 0 is excluded from the domain.
Many of my data points are 0 because many reviews received no helpful votes. As a quick fix, I can add 1 to every data point. This works nicely since the log of 1 is 0, and the ordering is preserved because every point is increased by the same amount.
Log function + 1

In:
helpful_log = np.log(df.Helpful_Votes + 1)
helpful_log.describe()

Out:
count    4.756338e+06
mean     1.230977e+00
std      9.189495e-01
min      0.000000e+00
25%      6.931472e-01
50%      1.098612e+00
75%      1.609438e+00
max      1.005672e+01
Name: Helpful_Votes, dtype: float64
Great. The new range is 0 to 10 and the quartiles reflect the original data.
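As an aside, NumPy ships a function that performs this log-of-one-plus transformation directly, so the manual +1 can be skipped; the result should match the block above.

# np.log1p(x) computes log(1 + x) and is more accurate for tiny x.
helpful_log1p = np.log1p(df.Helpful_Votes)
helpful_log1p.describe()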
Time to normalize.
Log function + 1, normalized

In:
helpful_log_normalized = normalize(helpful_log)
helpful_log_normalized.describe()

Out:
count    4.756338e+06
mean     1.224034e-01
std      9.137663e-02
min      0.000000e+00
25%      6.892376e-02
50%      1.092416e-01
75%      1.600360e-01
max      1.000000e+00
Name: Helpful_Votes, dtype: float64
That looks very reasonable. Log plus normalization is an excellent way to transform skewed data. In my case, however, there is one significant drawback.
Why transform the data to values between 0 and 1 at all? Usually because percentages and probabilities are in play. In my case, I want the median to land near 50%, and after log plus normalization the median is about 0.1.
For very large numbers, fractional exponents are worth trying as a means of transformation. Consider the cube root.
Cube root

In:
helpful_cube_root = df.Helpful_Votes**(1/3)
helpful_cube_root.describe()

Out:
count    4.756338e+06
mean     1.321149e+00
std      8.024150e-01
min      0.000000e+00
25%      1.000000e+00
50%      1.259921e+00
75%      1.587401e+00
max      2.856628e+01
This is very similar to log, but the range is larger here, from 0 to 28.
Cube root, normalized

In:
helpful_cube_root_normalized = normalize(helpful_cube_root)
helpful_cube_root_normalized.describe()

Out:
count    4.756338e+06
mean     4.624857e-02
std      2.808959e-02
min      0.000000e+00
25%      3.500631e-02
50%      4.410519e-02
75%      5.556906e-02
max      1.000000e+00
As expected, after normalization the same problem appears: the median is now 0.04, a far cry from 50%.
I like playing with numbers, so I experimented with other exponents. Interesting results came from 1/log_max, where log_max is the log of the maximum value. (The log of 23,311 is about 10.06.)
Log max root

In:
log_max = np.log(df.Helpful_Votes.max())
helpful_log_max_root = df.Helpful_Votes**(1/log_max)
helpful_log_max_root.describe()

Out:
count    4.756338e+06
mean     9.824853e-01
std      3.712224e-01
min      0.000000e+00
25%      1.000000e+00
50%      1.071355e+00
75%      1.147801e+00
max      2.718282e+00
A range of 0 to 2.7 is very attractive.
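The 2.7 is no coincidence: raising the maximum to the power 1/log_max sends it to exactly e, since max**(1/ln(max)) = e. A quick check:

# For any x > 1, x**(1/ln(x)) = e, so the transformed maximum is about 2.718282.
df.Helpful_Votes.max() ** (1 / np.log(df.Helpful_Votes.max()))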
Log max root, normalized

In:
helpful_log_max_root_normalized = normalize(helpful_log_max_root)
helpful_log_max_root_normalized.describe()

Out:
count    4.756338e+06
mean     3.614362e-01
std      1.365651e-01
min      0.000000e+00
25%      3.678794e-01
50%      3.941294e-01
75%      4.222525e-01
max      1.000000e+00
This looks promising, but now the middle 50% of the data is packed tightly between roughly 0.37 and 0.42, even though those values are far more spread out in the original data. The overall distribution is more appealing, but it still falls short.
It's time for the final standard function, the hyperbolic tangent. Let's start with a graph.
It looks very similar to the sigmoid. One key difference is the range: the hyperbolic tangent runs from -1 to 1, while the sigmoid runs from 0 to 1.
Since df.Helpful_Votes is non-negative in my data, the output will fall between 0 and 1.
Hyperbolic tangent

In:
helpful_hyperbolic_tangent = np.tanh(df.Helpful_Votes)
helpful_hyperbolic_tangent.describe()

Out:
count    4.756338e+06
mean     7.953343e-01
std      3.033794e-01
min      0.000000e+00
25%      7.615942e-01
50%      9.640276e-01
75%      9.993293e-01
max      1.000000e+00
There is no need for normalization, but there is an obvious problem.
The hyperbolic tangent distorts the data even more than the sigmoid. There is about a 0.001 gap between the 75th and 100th percentiles, far smaller than the gaps between the other quartiles. In the original data, that gap is 23,307, thousands of times larger than the gaps between the other quartiles.
It's not even close.
Percentiles offer another option. Any data point can be ranked by its percentile, and pandas provides a handy built-in method, .rank, to handle this.
The general idea is that each point is assigned the value of its percentile. Since my data contains many points tied at a small number of helpful votes, there are different ways to assign percentiles within those tied groups. I like the 'min' method, where every tied data point receives the percentile of the first member of the group. The default method is 'average', where every tied data point takes the mean percentile of the group. A small comparison of the two methods follows below.
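To make the difference between the two tie-handling methods concrete, here is a tiny sketch on a made-up series (the values are hypothetical, not drawn from the review data). The full transformation comes right after.

# Toy example: four reviews, two of them tied at 1 helpful vote.
toy = pd.Series([0, 1, 1, 5])

toy.rank(method='min')      # ties take the lowest rank in the group: 1, 2, 2, 4
toy.rank(method='average')  # the default: ties take the mean rank: 1, 2.5, 2.5, 4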
Percentile linearization

In:
size = len(df.Helpful_Votes) - 1
helpful_percentile_linearization = df.Helpful_Votes.rank(method='min').apply(lambda x: (x - 1) / size)
helpful_percentile_linearization.describe()

Out:
count    4.756338e+06
mean     4.111921e-01
std      3.351097e-01
min      0.000000e+00
25%      1.133505e-01
50%      4.719447e-01
75%      7.059382e-01
max      1.000000e+00
Fascinating. Percentile linearization is essentially a ranking system. If the gap between the first and second data points should count the same as the gap between any other pair of adjacent points, percentile linearization is the way to go.
On the other hand, percentile linearization erases critical information about the skew. Here the transformed data looks skewed only because there are hundreds of thousands of 1- and 2-vote reviews, not because a handful of reviews received an incredibly high number of votes.
Since I'm essentially dealing with a ranking system, I ended up combining percentile linearization with my own piecewise linearization function. Logs also played a role.
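The piecewise function itself isn't reproduced here, so the snippet below is only a hypothetical illustration of the general idea: keep the percentile rank for the crowded low vote counts and switch to a compressed log scale above a cutoff. The cutoff of 4 (the original 75th percentile) is an assumption made purely for the example.

# Hypothetical piecewise transformation -- not the actual function used in the project.
def piecewise_transform(votes, cutoff=4):
    pct = votes.rank(method='min').apply(lambda x: (x - 1) / (len(votes) - 1))
    logged = np.log(votes + 1) / np.log(votes.max() + 1)
    # Percentile rank at or below the cutoff, scaled log above it.
    return pct.where(votes <= cutoff, logged)

helpful_piecewise = piecewise_transform(df.Helpful_Votes)
helpful_piecewise.describe()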
Whatever your choice or style, don't limit yourself to the standard options. The mathematical toolbox is extremely rich, and its intersection with data science deserves more recognition and flexibility.
I'd love to hear about other functions people use to transform skewed data. I've heard of the Box-Cox transformation, although I haven't studied it in detail.
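For anyone curious, Box-Cox is available in SciPy. A minimal sketch, assuming the votes are shifted by 1 first because Box-Cox requires strictly positive input:

from scipy import stats

# Box-Cox needs strictly positive values, hence the +1 shift.
# With no lambda supplied, boxcox also returns the lambda it estimated.
helpful_boxcox, best_lambda = stats.boxcox(df.Helpful_Votes + 1)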
For more details on Wade's Amazon book review project, visit his GitHub page.