Last updated on May 28, 2020 by the editorial team
Author(s): Swetha Lakshmanan

Before we dive into this topic, let's start with some definitions.
"Rescaling" a vector means adding or subtracting a constant and then multiplying or dividing by a constant, as you would do to change the units of measurement of the data, for example, to convert a temperature from Celsius to Fahrenheit.
"Normalizing" a vector most often means dividing by a norm of the vector. It also often refers to rescaling by the minimum and range of the vector so that all elements lie between 0 and 1, which brings all the values of the numeric columns in the dataset to a common scale.
"Standardizing" a vector most often means subtracting a measure of location and dividing by a measure of scale. For example, if the vector contains random values with a Gaussian distribution, you might subtract the mean and divide by the standard deviation, thereby obtaining a "standard normal" random variable with mean 0 and standard deviation 1.
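To make the three definitions concrete, here is a minimal sketch that applies each of them to the same small vector. The temperatures are made-up values, and NumPy is an assumption of this sketch rather than something the definitions require.

```python
import numpy as np

celsius = np.array([0.0, 25.0, 50.0, 100.0])  # made-up temperatures

# Rescale: multiply by a constant, then add a constant (Celsius -> Fahrenheit).
fahrenheit = celsius * 9 / 5 + 32

# Normalize (min-max): subtract the minimum and divide by the range,
# so every element lies between 0 and 1.
normalized = (celsius - celsius.min()) / (celsius.max() - celsius.min())

# Normalize (by a norm): divide by the Euclidean length of the vector,
# so the result has length 1.
unit_length = celsius / np.linalg.norm(celsius)

# Standardize: subtract the mean (location) and divide by the
# standard deviation (scale).
standardized = (celsius - celsius.mean()) / celsius.std()

print(fahrenheit)
print(normalized)
print(unit_length)
print(standardized)
```

Note that the two meanings of "normalize" give different results: the min-max version always spans 0 to 1, while the norm version only guarantees a vector of length 1.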
After reading this post you will know:
- Why should you standardize/normalize/scale your data?
- How to standardize your numeric attributes to have a mean of 0 and unit variance using StandardScaler
- How to normalize your numeric attributes to the range 0 to 1 using MinMaxScaler
- How to scale your numeric attributes using RobustScaler
- When should you choose standardization or normalization?
Let's start.
Why you should standardize/normalize variables:
Standardization:
Standardizing the features so that they are centered at 0 with a standard deviation of 1 is important when we compare measurements that have different units. Variables measured on different scales do not contribute equally to the analysis and may end up creating a bias.
Example: a variable that ranges between 0 and 1000 will outweigh a variable that ranges between 0 and 1. Using these variables without standardization effectively gives the variable with the larger range a weight of 1000 in the analysis. Transforming the data to comparable scales prevents this problem. Typical data standardization procedures equalize the range and/or the variability of the data.
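The effect described in the example can be sketched with synthetic data. The two columns below are randomly generated stand-ins (not the loan data), and "share of total variance" is just one illustrative way to measure how much each variable contributes to a variance-driven analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
wide = rng.uniform(0, 1000, size=1000)   # variable on a 0-1000 scale
narrow = rng.uniform(0, 1, size=1000)    # variable on a 0-1 scale
X = np.column_stack([wide, narrow])

# Share of the total variance carried by each column before scaling.
share_raw = X.var(axis=0) / X.var(axis=0).sum()

# Standardize each column (subtract the mean, divide by the standard
# deviation), then recompute the shares.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
share_std = X_std.var(axis=0) / X_std.var(axis=0).sum()

print("share before standardization:", share_raw)
print("share after standardization: ", share_std)
```

Before standardization, almost all of the variance comes from the wide variable. Afterwards, both variables contribute equally.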
Normalization:
Similarly, the goal of normalization is to change the values of the numeric columns in the dataset to a common scale, without distorting differences in the ranges of values. For machine learning, not every dataset requires normalization. It is needed only when the features have different ranges.
For example, consider a dataset that contains two features, age and income. Age ranges from 0 to 100, while income ranges from 0 to 100,000 and above, so income is about 1,000 times larger than age. These two features are therefore in very different ranges. If we carry out further analysis, such as multivariate linear regression, income will intrinsically influence the result more because of its larger values. However, this does not necessarily mean that it is more important as a predictor. So we normalize the data to bring all the variables to the same range.
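A distance calculation shows the same problem in a small, hedged sketch. The three people and the feature ranges (age 0 to 100, income 0 to 100,000) are hypothetical values chosen only to mirror the example above.

```python
import numpy as np

# Hypothetical people: [age in years, income]
a = np.array([25.0, 50_000.0])
b = np.array([65.0, 50_500.0])   # 40 years older, almost the same income
c = np.array([26.0, 52_000.0])   # almost the same age, higher income


def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return np.sqrt(np.sum((u - v) ** 2))


# Assumed feature ranges used for min-max scaling.
lower = np.array([0.0, 0.0])
upper = np.array([100.0, 100_000.0])


def scale(x):
    """Min-max scale a feature vector with the assumed ranges."""
    return (x - lower) / (upper - lower)


d_ab_raw = euclidean(a, b)
d_ac_raw = euclidean(a, c)
d_ab_scaled = euclidean(scale(a), scale(b))
d_ac_scaled = euclidean(scale(a), scale(c))

print("raw:    a-b =", d_ab_raw, " a-c =", d_ac_raw)
print("scaled: a-b =", d_ab_scaled, " a-c =", d_ac_scaled)
```

On the raw values, person a looks closer to b, because the 40-year age gap is tiny next to the income differences. After scaling, a is closer to c, which matches the intuition that a and c are the more similar pair.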
When should you use normalization and standardization?
Normalization is a good technique to use when you do not know the distribution of your data, or when you know that the distribution is not Gaussian (a bell curve). Normalization is useful when your data has varying scales and the algorithm you are using does not make assumptions about the distribution of your data, such as k-nearest neighbors and artificial neural networks.
Standardization assumes that your data has a Gaussian (bell curve) distribution. This does not strictly have to be true, but the technique is more effective when the attribute distribution is Gaussian. Standardization is useful when your data has varying scales and the algorithm you are using does make assumptions about your data having a Gaussian distribution, such as linear regression, logistic regression, and linear discriminant analysis.
Dataset:
I have used the Lending Club Loan Dataset from Kaggle to demonstrate the examples in this article.
Import libraries:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
Importing the dataset:
Let's import three columns, loan_amnt, int_rate, and installment, and only the first 30,000 rows of the dataset (to reduce computation time).
cols = ['loan_amnt', 'int_rate', 'installment']
data = pd.read_csv('loan.csv', nrows=30000, usecols=cols)
If you import the entire dataset, there will be missing values in some columns. You can simply drop the rows with missing values using the pandas dropna method.
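As a minimal sketch of that clean-up step, the snippet below uses a tiny made-up DataFrame in place of the loan data. The column names `amount` and `int_rate` and all values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the loan data.
df = pd.DataFrame({
    "amount": [5000.0, np.nan, 12000.0, 8000.0],
    "int_rate": [10.5, 13.2, np.nan, 7.9],
})

# Count the missing values per column before cleaning.
print(df.isna().sum())

# Drop every row that contains at least one missing value.
clean = df.dropna()
print(clean.shape)
```

By default, `dropna` removes a row if any of its columns is missing, so it is worth checking how many rows survive before continuing.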
Basic analysis:
Now let's look at the basic statistical values of our data set.
data.describe()

Different variables have different ranges of values, that is, different orders of magnitude. Not only are the minimum and maximum values different, but they are also spread over areas of different widths.
Standardization (StandardScaler):
As we discussed earlier, standardization (or z-score normalization) means centering the variable at zero and scaling the variance to 1. The procedure involves subtracting the mean from each observation and then dividing by the standard deviation:
z = (x - μ) / σ
where μ is the mean (average) and σ is the standard deviation from the mean.
The result of standardization is that the features are rescaled to have the mean and standard deviation of a standard normal distribution, μ = 0 and σ = 1. The shape of the distribution itself does not change.
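The z-score calculation can be checked by hand against scikit-learn. This is a small sketch on made-up numbers, assuming the population standard deviation (NumPy's default, `ddof=0`), which is the one StandardScaler uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: two columns on very different scales.
X = np.array([[1000.0, 5.0],
              [2000.0, 10.0],
              [3000.0, 15.0],
              [10000.0, 30.0]])

mu = X.mean(axis=0)      # per-column mean
sigma = X.std(axis=0)    # per-column population standard deviation (ddof=0)
z_manual = (X - mu) / sigma

z_sklearn = StandardScaler().fit_transform(X)

print(np.allclose(z_manual, z_sklearn))
print(z_manual.mean(axis=0), z_manual.std(axis=0))
```

If you compute the standard deviation with pandas instead, keep in mind that `DataFrame.std` defaults to the sample version (`ddof=1`), so the results will differ slightly from StandardScaler.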
CODE:
scikit-learn's StandardScaler removes the mean and scales the data to unit variance. We can import the StandardScaler class from scikit-learn and apply it to our dataset.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)
Now let's check the mean and standard deviation values.
print(data_scaled.mean(axis=0))
print(data_scaled.std(axis=0))

As expected, the mean of each variable is now approximately zero and the standard deviation is 1, so all variables are now on a comparable scale.
print('Min values (loan amount, interest rate and installment): ', data_scaled.min(axis=0))
print('Max values (loan amount, interest rate and installment): ', data_scaled.max(axis=0))

However, the minimum and maximum values vary according to the initial dispersion of the variable and are strongly influenced by the presence of outliers.
Normalization (MinMaxScaler):
With this approach, the data is scaled to a fixed range, typically 0 to 1.
The cost of this bounded range, in contrast to standardization, is that we end up with smaller standard deviations, because the minimum and maximum are set by the most extreme observations. A single outlier can therefore squeeze all the other values into a narrow band, which makes MinMaxScaler sensitive to outliers.
Min-max scaling is typically done with the following equation:
X_scaled = (X - X.min) / (X.max - X.min)
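Min-max scaling subtracts the minimum and divides by the range (maximum minus minimum). The sketch below applies that to made-up values and then adds one artificial outlier to show how sensitive the result is. The helper name `min_max_scale` is hypothetical.

```python
import numpy as np


def min_max_scale(x):
    """Scale a 1-D array to the range [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())


values = np.array([10.0, 20.0, 30.0, 40.0, 50.0])   # made-up data
with_outlier = np.append(values, 1000.0)            # same data plus one outlier

scaled = min_max_scale(values)
scaled_outlier = min_max_scale(with_outlier)

print(scaled)
print(scaled_outlier)
```

Without the outlier, the five values spread evenly over 0 to 1. With it, the same five values are compressed into the bottom few percent of the range, and the outlier alone sits at 1.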

CODE:
Let's import MinMaxScaler from scikit-learn and apply it to our dataset.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
data_scaled = scaler.fit_transform(data)
Now let's check the mean and standard deviation values.
print('Means (loan amount, interest rate and installment): ', data_scaled.mean(axis=0))
print('Std (loan amount, interest rate and installment): ', data_scaled.std(axis=0))

After MinMaxScaling, the distributions are not centered at zero and the standard deviation is not 1.
print('Min (loan amount, interest rate and installment): ', data_scaled.min(axis=0))
print('Max (loan amount, interest rate and installment): ', data_scaled.max(axis=0))

However, the minimum and maximum values are now the same (0 and 1) for every variable, unlike what happens with standardization.
Robust scaling (scaling to median and quantiles):
Scaling with the median and quantiles consists of subtracting the median from all the observations and then dividing by the interquartile range (IQR). This scales the features using statistics that are robust to outliers.
The interquartile range is the difference between the 75th and 25th percentiles:
IQR = 75th percentile - 25th percentile
The equation to calculate the scaled values:
X_scaled = (X - X.median) / IQR
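The same calculation can be written out step by step. This is a sketch on made-up values that include one artificial outlier. It assumes NumPy's default (linear interpolation) percentile definition.

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 1000.0])  # made-up data, one outlier

median = np.median(x)
q25, q75 = np.percentile(x, [25, 75])
iqr = q75 - q25

# Subtract the median, then divide by the interquartile range.
x_scaled = (x - median) / iqr

print("median:", median, "IQR:", iqr)
print(x_scaled)
```

The outlier barely moves the median and the IQR, so the five typical values keep a sensible spread around zero while the outlier remains clearly identifiable as an extreme value.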
CODE:
First, import RobustScaler from scikit-learn and apply it to our dataset.
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
data_scaled = scaler.fit_transform(data)
Now check the mean and standard deviation values.
print('Means (loan amount, interest rate and installment): ', data_scaled.mean(axis=0))
print('Std (loan amount, interest rate and installment): ', data_scaled.std(axis=0))

As you can see, the distributions are not centered at zero and the standard deviation is not 1.
print('Min (loan amount, interest rate and installment): ', data_scaled.min(axis=0))
print('Max (loan amount, interest rate and installment): ', data_scaled.max(axis=0))

Also, the min and max values are not set to a specific upper and lower bound like in MinMaxScaler.
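To see the three scalers side by side, here is a hedged comparison on a synthetic column with a few extreme values. The data is randomly generated and is not the loan dataset, so the printed numbers are only illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

rng = np.random.default_rng(42)
# Synthetic column: 995 typical values plus 5 extreme ones (made-up numbers).
x = np.concatenate([rng.normal(100, 10, size=995), np.full(5, 10_000.0)])
X = x.reshape(-1, 1)   # scikit-learn scalers expect a 2-D array

results = {}
for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    name = type(scaler).__name__
    scaled = scaler.fit_transform(X).ravel()
    results[name] = scaled
    print(f"{name:>14}: mean={scaled.mean():.3f} std={scaled.std():.3f} "
          f"min={scaled.min():.3f} max={scaled.max():.3f} "
          f"median={np.median(scaled):.3f}")
```

StandardScaler fixes the mean and standard deviation, MinMaxScaler fixes the minimum and maximum, and RobustScaler fixes the median and interquartile range. Which of these you want to hold constant is a reasonable way to choose between them.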
I hope you found this article useful. Have fun with your studies!
How, When and Why Should You Normalize/Standardize/Rescale Your Data? was originally published in Towards AI - Multidisciplinary Science Journal on Medium, where people are continuing the conversation by highlighting and responding to this story.