Transforming Skewed Data (2023)

A survey of friendly functions

Note: The following code was written in Python and pulled from various Jupyter notebooks. Inline comments have been removed to make the article easier to read.

Skewed data is cumbersome and common. It is often desirable to transform skewed data and convert it to values between 0 and 1.

Standard functions used for such conversions include normalization, sigmoid, log, cube root, and hyperbolic tangent. It all depends on what you want to achieve.

Here is an example of a skewed column that I generated from an 8.9-million-row Amazon book review dataset (Julian McAuley, UCSD). df.Helpful_Votes gives the total number of helpful votes (as opposed to unhelpful votes) that each book review received.

df.Helpful_Votes original data

count    4.756338e+06
mean     5.625667e+00
std      2.663631e+01
min      0.000000e+00
25%      1.000000e+00
50%      2.000000e+00
75%      4.000000e+00
max      2.331100e+04

0, 1, 2, 4, 23311. That's a big jump!

Removing outliers is an option, but not one I want to use here. My ultimate goal is to build a machine learning algorithm that predicts whether a given review is helpful, so the reviews with the most helpful votes are a must.

I'll use the functions listed above to transform the data, explaining the pros and cons along the way. As in most data science, there is no one right function. It depends on the data and the goal of the analyst.

Normalization converts all data points to decimals between 0 and 1. If the minimum is 0, just divide each point by the maximum.

If the minimum is not 0, subtract the minimum from each point, and then divide by the difference between the maximum and the minimum.

The following function covers both cases.

Normalization function

def normalize(column):
    """Min-max scale a column to the range 0 to 1."""
    top = column.max()
    bottom = column.min()
    y = (column - bottom) / (top - bottom)
    return y

helpful_normalized = normalize(df.Helpful_Votes)

count    4.756338e+06
mean     2.413310e-04
std      1.142650e-03
min      0.000000e+00
25%      4.289820e-05
50%      8.579641e-05
75%      1.715928e-04
max      1.000000e+00

After normalization, the data is just as skewed as it was before. If the goal is simply to turn the data into points between 0 and 1, then normalization is the way to go. Otherwise, normalization must be used in conjunction with other functions.
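The claim that normalization leaves the skew untouched can be checked on a small example. The sketch below uses a made-up toy series (not the review data) and the sample skewness reported by pandas, which does not change when the data is shifted and rescaled by a positive factor.

```python
import pandas as pd

def normalize(column):
    """Min-max scale a pandas Series to the range 0 to 1."""
    top = column.max()
    bottom = column.min()
    return (column - bottom) / (top - bottom)

# Toy stand-in for a skewed vote column (made-up values, not the review data).
votes = pd.Series([0, 0, 1, 1, 2, 2, 4, 7, 50, 1000], dtype=float)
scaled = normalize(votes)

# Sample skewness before and after min-max scaling.
skew_before = votes.skew()
skew_after = scaled.skew()
print(skew_before, skew_after)
```

The two printed values agree up to floating-point error: the scale of the data changes, but its shape does not.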

Then the sigmoid function. It's worth seeing a picture if you've never seen the sigmoid before.

[Figure: graph of the sigmoid function]

It's a wonderfully smooth curve that guarantees a 0 to 1 range. Let's see how it works on df.Helpful_Votes.

import numpy as np

def sigmoid(x):
    """Map x to the range 0 to 1 with the logistic curve."""
    e = np.exp(1)
    y = 1 / (1 + e**(-x))
    return y

helpful_sigmoid = sigmoid(df.Helpful_Votes)

count    4.756338e+06
mean     8.237590e-01
std      1.598215e-01
min      5.000000e-01
25%      7.310586e-01
50%      8.807971e-01
75%      9.820138e-01
max      1.000000e+00

A definite improvement. The new data lies between 0 and 1 as expected, but the minimum is 0.5. This makes sense when looking at the graph, since there are no negative values in the vote count.

Another point to consider is the spread. Here the 75th percentile is within 0.2 of the 100th percentile, comparable to the gaps between the other quartiles. But in the original data, the 75th percentile is more than 20,000 away from the 100th percentile, nowhere near the other quartiles. In this case, the data has been distorted.

The sigmoid function can be modified to improve results, but let's explore other options first.
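The article does not show which modification is meant. One possibility, offered purely as an assumption, is to divide the input by a scale factor before applying the sigmoid, which stretches the curve so that small vote counts are not pushed toward 1 so quickly. The function name scaled_sigmoid and the scale value 4.0 are hypothetical choices for illustration.

```python
import numpy as np

def scaled_sigmoid(x, scale=1.0):
    """Sigmoid of x / scale; a larger scale flattens the curve."""
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float) / scale))

# Min, quartiles, and max of the vote counts from the summary above.
votes = np.array([0, 1, 2, 4, 23311])

plain = scaled_sigmoid(votes)           # scale=1 is the standard sigmoid
stretched = scaled_sigmoid(votes, 4.0)  # hypothetical scale, for illustration only

print(plain)
print(stretched)
```

With the larger scale, the 75th percentile (4 votes) lands near 0.73 instead of 0.98, although the minimum still sits at 0.5.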

Next up, logarithms, an excellent choice for making data less skewed. When using log in Python, the default base is usually e.

Log function

helpful_log = np.log(df.Helpful_Votes)
RuntimeWarning: divide by zero encountered in log

Whoa! Divide by zero! How did this happen? Perhaps a graph will clarify.

[Figure: graph of the log function]

Oh yeah. The domain of log is strictly greater than 0. There is a vertical asymptote along the y-axis: as x approaches 0, y approaches negative infinity. In other words, 0 is excluded from the domain.

Many of my data points are 0 because many reviews did not receive any helpful votes. For a quick fix, I can add 1 to each data point. This works well since the log of 1 is 0. Also, the same spread is retained, since all points are increased by 1.

Log function + 1

helpful_log = np.log(df.Helpful_Votes + 1)

count    4.756338e+06
mean     1.230977e+00
std      9.189495e-01
min      0.000000e+00
25%      6.931472e-01
50%      1.098612e+00
75%      1.609438e+00
max      1.005672e+01
Name: Helpful_Votes, dtype: float64

Great. The new range is 0 to 10 and the quartiles reflect the original data.
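As a side note, NumPy ships a function for exactly this shifted logarithm, np.log1p, which computes log(1 + x). The sketch below compares the two on the minimum, quartiles, and maximum from the original summary rather than on the full column.

```python
import numpy as np

# Min, quartiles, and max of the original vote counts.
votes = np.array([0, 1, 2, 4, 23311])

shifted_log = np.log(votes + 1)  # the "add 1, then log" fix
builtin = np.log1p(votes)        # NumPy's log(1 + x)

print(shifted_log)
print(np.allclose(shifted_log, builtin))
```

Both give 0 for reviews with no votes and about 10.06 for the most-voted review.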

Time to normalize.

Log function + 1, normalized

helpful_log_normalized = normalize(helpful_log)

count    4.756338e+06
mean     1.224034e-01
std      9.137663e-02
min      0.000000e+00
25%      6.892376e-02
50%      1.092416e-01
75%      1.600360e-01
max      1.000000e+00
Name: Helpful_Votes, dtype: float64

That looks very reasonable. Log plus normalization is an excellent way to transform skewed data, provided it is acceptable for the results to remain skewed. In this case, however, there is one major drawback.

Why transform the data into values between 0 and 1 in the first place? Typically so that percentages and probabilities can come into play. In my case, I want the median to be around 50%. After this normalization, the median is 0.1.

For very large numbers, fractional exponents can be tried as a means of transformation. Consider the cube root.

Cube root

helpful_cube_root = df.Helpful_Votes**(1/3)

count    4.756338e+06
mean     1.321149e+00
std      8.024150e-01
min      0.000000e+00
25%      1.000000e+00
50%      1.259921e+00
75%      1.587401e+00
max      2.856628e+01

This is very similar to log, but the range is larger here, from 0 to 28.

Cube root, normalized

helpful_cube_root_normalized = normalize(helpful_cube_root)

count    4.756338e+06
mean     4.624857e-02
std      2.808959e-02
min      0.000000e+00
25%      3.500631e-02
50%      4.410519e-02
75%      5.556906e-02
max      1.000000e+00

As expected, the new data is more problematic after normalization. Now the median is 0.04, a far cry from 50%.

I like to play with numbers, so I experimented with some powers. Interesting results came from the exponent 1/log_max, where log_max is the log of the maximum value. (The log of 23,311 is 10.06.)

Log max root

log_max = np.log(df.Helpful_Votes.max())
helpful_log_max_root = df.Helpful_Votes**(1/log_max)

count    4.756338e+06
mean     9.824853e-01
std      3.712224e-01
min      0.000000e+00
25%      1.000000e+00
50%      1.071355e+00
75%      1.147801e+00
max      2.718282e+00

A range of 0 to 2.7 is very attractive.
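The upper bound of 2.718282 is no accident. For any maximum m greater than 1, m**(1/ln m) = exp(ln m / ln m) = e, so this transformation always maps the largest value to e. A quick check, using the maximum vote count from the summary above:

```python
import numpy as np

maximum = 23311              # largest vote count in the data
log_max = np.log(maximum)    # natural log of the maximum

# maximum ** (1 / ln(maximum)) = exp(ln(maximum) / ln(maximum)) = e
top = maximum ** (1 / log_max)

print(log_max)
print(top, np.e)
```

The same identity holds for any other positive maximum except 1, where the log is 0 and the exponent is undefined.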

Log max root, normalized

helpful_log_max_root_normalized = normalize(helpful_log_max_root)

count    4.756338e+06
mean     3.614362e-01
std      1.365651e-01
min      0.000000e+00
25%      3.678794e-01
50%      3.941294e-01
75%      4.222525e-01
max      1.000000e+00

That looks good, but the data is heavily clustered between the 25th and 75th percentiles, a clustering that probably extends much further. Although the overall spread is more desirable, it falls short here.

It's time for our final standard function, the hyperbolic tangent. Let's start with a chart.

[Figure: graph of the hyperbolic tangent function]

This looks very similar to the sigmoid. A key difference is range. The hyperbolic tangent ranges from -1 to 1 while the sigmoid ranges from 0 to 1.

Since the values in df.Helpful_Votes are all non-negative, my output will range from 0 to 1.

Hyperbolic tangent

helpful_hyperbolic_tangent = np.tanh(df.Helpful_Votes)

count    4.756338e+06
mean     7.953343e-01
std      3.033794e-01
min      0.000000e+00
25%      7.615942e-01
50%      9.640276e-01
75%      9.993293e-01
max      1.000000e+00

There is no need for normalization, but there is an obvious problem.

The hyperbolic tangent distorts the data more than the sigmoid. There is a difference of 0.001 between the 75th and 100th percentiles, much smaller than the gap between any other pair of quartiles. In the original data, this difference is 23,307, thousands of times larger than the difference between any other quartiles.

It's not even close.
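One way to see why the hyperbolic tangent saturates faster is the identity tanh(x) = 2 * sigmoid(2x) - 1: it is a sigmoid fed twice the input, then stretched to the range -1 to 1. The sketch below checks the identity numerically on an arbitrary grid of points.

```python
import numpy as np

def sigmoid(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + np.exp(-x))

# Arbitrary grid of inputs chosen for the check.
x = np.linspace(-5.0, 5.0, 101)

rescaled_sigmoid = 2.0 * sigmoid(2.0 * x) - 1.0
identity_holds = np.allclose(np.tanh(x), rescaled_sigmoid)

print(identity_holds)
```

Because the input is doubled, a review with 4 votes is treated by tanh the way the sigmoid would treat one with 8 votes, which is why the upper quartiles collapse toward 1 even sooner.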

Percentiles offer another option. Any data point can be ranked by its percentile, and Pandas provides a nice built-in method, .rank, to handle this.

The general idea is that each point receives the value of its percentile. Since my data has many points with the same small number of helpful votes, there are different ways to assign percentiles to these ties. I like the 'min' method, where all tied data points receive the percentile of the first member of the group. The default method is 'average', where all data points with the same value take the mean percentile of the group.
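The difference between the two tie-breaking methods is easiest to see on a small series with repeated values. The toy data below is made up for illustration. The last step rescales the 'min' ranks so that the smallest value maps to 0 and the largest to 1.

```python
import pandas as pd

# Toy vote counts with ties (made-up values).
votes = pd.Series([0, 0, 0, 1, 1, 2, 4, 10])

rank_min = votes.rank(method='min')      # ties share the lowest rank in the group
rank_avg = votes.rank(method='average')  # pandas default: ties share the mean rank

# Rescale the 'min' ranks to the range 0 to 1.
size = len(votes) - 1
pct_min = (rank_min - 1) / size

print(pd.DataFrame({'votes': votes, 'min': rank_min,
                    'average': rank_avg, 'pct_min': pct_min}))
```

With 'min', the three zero-vote reviews all get rank 1 and therefore map to exactly 0. With 'average', they would share rank 2 and the minimum would no longer be 0.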

Percentile linearization

size = len(df.Helpful_Votes) - 1
helpful_percentile_linearization = df.Helpful_Votes.rank(method='min').apply(lambda x: (x - 1) / size)

count    4.756338e+06
mean     4.111921e-01
std      3.351097e-01
min      0.000000e+00
25%      1.133505e-01
50%      4.719447e-01
75%      7.059382e-01
max      1.000000e+00

Intriguing. Percentile linearization is basically the same as a ranking system. If the difference between the first and second data points should be the same as the difference between any other two adjacent data points, percentile linearization is the way to go.

On the downside, percentile linearization erases critical signs of skew. The transformed data looks only slightly skewed, and that is because there are hundreds of thousands of reviews with 1 and 2 votes, not because there are a few reviews with an extremely high number of votes.

Since I am mainly dealing with a ranking system, I combined percentile linearization with my own piecewise linearization function. Logs also played a role.

Whatever your choice or style, never limit yourself to standard options. Mathematical resources are extremely rich, and their intersection with data science deserves more recognition and flexibility.

I'd love to hear about other functions people use to transform distorted data. I've heard of the boxcox function, although I haven't studied it in detail.
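For readers curious about Box-Cox, here is a minimal sketch of the transform itself, written from its textbook definition: (x**lambda - 1) / lambda for lambda other than 0, and the natural log for lambda equal to 0, defined only for positive x. The function name box_cox and the fixed lambda values are choices made for this illustration. Libraries such as SciPy can also estimate lambda from the data, which this sketch does not attempt.

```python
import numpy as np

def box_cox(x, lmbda):
    """Box-Cox transform for strictly positive x with a fixed lambda."""
    x = np.asarray(x, dtype=float)
    if lmbda == 0:
        return np.log(x)
    return (x ** lmbda - 1.0) / lmbda

# Min, quartiles, and max of the original vote counts.
votes = np.array([0, 1, 2, 4, 23311])
shifted = votes + 1  # Box-Cox needs positive inputs, so shift by 1 as with log

as_log = box_cox(shifted, 0)     # lambda = 0 reduces to the natural log
as_root = box_cox(shifted, 0.5)  # lambda = 0.5 is a shifted, scaled square root

print(as_log)
print(as_root)
```

Seen this way, the log and root transforms tried above are members of one family, with lambda controlling how strongly large values are compressed.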

For more details on Wade's Amazon Book Review project, visit his GitHub page, useful_comments.

Author: Rubie Ullrich

Last Updated: 07/03/2023