Regression in Data Mining
Regression involves two kinds of variables: a predictor variable (whose values are known) and a response variable (whose values are to be predicted).
The two basic types of regression are:

1. Linear regression
- It is the simplest form of regression. Linear regression models the relationship between two variables by fitting a linear equation to the observed data.
- Linear regression attempts to find a mathematical relationship between the variables.
- If the fitted relationship is a straight line, the model is linear; if it is a curved line, the model is non-linear.
- There is only one independent variable, and the relationship with the dependent variable is given by a straight line:
Y = α + βX, where the model 'Y' is a linear function of 'X'.
- The value of 'Y' increases or decreases linearly as the value of 'X' changes.
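The intercept α and slope β above can be estimated with ordinary least squares. A minimal sketch, using NumPy and illustrative data values that are not from the text:

```python
import numpy as np

# Toy data: Y is roughly linear in X (assumed example values)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

# Closed-form least-squares estimates of slope (beta) and intercept (alpha)
beta = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
alpha = Y.mean() - beta * X.mean()

print(alpha, beta)  # slope close to 2, intercept close to 0
```

The predicted response for any new x is then simply `alpha + beta * x`.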

2. Multiple regression model
- Multiple linear regression is an extension of linear regression analysis.
- It uses two or more independent variables to predict an outcome and a single continuous dependent variable.
Y = a0 + a1X1 + a2X2 + ... + akXk + e
where,
'Y' is the response variable,
X1, X2, ..., Xk are the independent predictors,
'e' is the random error, and
a0, a1, a2, ..., ak are the regression coefficients.
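The coefficients a0, a1, ..., ak can be fitted by least squares using a design matrix with a leading column of ones for the intercept. A sketch with two predictors and illustrative data (generated here from Y = 1 + 2·X1 + 3·X2 so the recovered coefficients are easy to check):

```python
import numpy as np

# Toy data with two predictors X1, X2 (assumed example values)
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
Y = np.array([9.0, 8.0, 19.0, 18.0, 26.0])  # exactly 1 + 2*X1 + 3*X2

# Design matrix: column of ones (for a0) followed by the predictors
A = np.column_stack([np.ones(len(X)), X])
coeffs, *_ = np.linalg.lstsq(A, Y, rcond=None)
a0, a1, a2 = coeffs

print(a0, a1, a2)  # recovers roughly 1, 2, 3
```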
Naive Bayes Classification: Solved Example
Bike damage example: In the following table, the attributes Color, Type and Origin are given for each bike, and the class label Damaged? can be Yes or No.
| Bike No | Color | Type | Origin | Damaged? |
|---|---|---|---|---|
| 10 | Blue | Moped | Indian | Yes |
| 20 | Blue | Moped | Indian | No |
| 30 | Blue | Moped | Indian | Yes |
| 40 | Red | Moped | Indian | No |
| 50 | Red | Moped | Japanese | Yes |
| 60 | Red | Sports | Japanese | No |
| 70 | Red | Sports | Japanese | Yes |
| 80 | Red | Sports | Indian | No |
| 90 | Blue | Sports | Japanese | No |
| 100 | Blue | Moped | Japanese | Yes |
Solution:
Required formula:
P(c | x) = P(x | c) P(c) / P(x)
Where,
P(c | x) is the posterior probability of class c given predictor x.
P(c) is the prior probability of the class.
P(x | c) is the likelihood of the predictor given the class.
P(x) is the prior probability of the predictor.
It is necessary to classify the unseen sample <Blue, Indian, Sports>, which does not appear in the data set.
So the probabilities can be computed as:
P (Yes) = 5/10
P (No) = 5/10
| Color | Yes | No |
|---|---|---|
| Blue | P(Blue\|Yes) = 3/5 | P(Blue\|No) = 2/5 |
| Red | P(Red\|Yes) = 2/5 | P(Red\|No) = 3/5 |

| Type | Yes | No |
|---|---|---|
| Sports | P(Sports\|Yes) = 1/5 | P(Sports\|No) = 3/5 |
| Moped | P(Moped\|Yes) = 4/5 | P(Moped\|No) = 2/5 |

| Origin | Yes | No |
|---|---|---|
| Indian | P(Indian\|Yes) = 2/5 | P(Indian\|No) = 3/5 |
| Japanese | P(Japanese\|Yes) = 3/5 | P(Japanese\|No) = 2/5 |
So, unseen example X = <Blue, Indian, Sports>
P(X|Yes). P(Yes) = P(Blue|Yes). P(Indian|Yes). P(Sports|Yes). P(Yes)
= 3/5*2/5*1/5*5/10 = 0.024
P(X|No). P(No) = P(Blue|No). P(Indian|No). P(Sports|No).P(No)
= 2/5*3/5*3/5*5/10 = 0.072
Since 0.072 > 0.024, the example is classified as No.
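The whole calculation can be reproduced programmatically. A minimal sketch that counts priors and per-attribute likelihoods directly from the table above (the `score` helper is a name introduced here for illustration):

```python
# Rows from the table: (Color, Type, Origin, Damaged?)
data = [
    ("Blue", "Moped",  "Indian",   "Yes"),
    ("Blue", "Moped",  "Indian",   "No"),
    ("Blue", "Moped",  "Indian",   "Yes"),
    ("Red",  "Moped",  "Indian",   "No"),
    ("Red",  "Moped",  "Japanese", "Yes"),
    ("Red",  "Sports", "Japanese", "No"),
    ("Red",  "Sports", "Japanese", "Yes"),
    ("Red",  "Sports", "Indian",   "No"),
    ("Blue", "Sports", "Japanese", "No"),
    ("Blue", "Moped",  "Japanese", "Yes"),
]

def score(label, x):
    """P(x | label) * P(label) under the naive independence assumption."""
    rows = [r for r in data if r[3] == label]
    p = len(rows) / len(data)            # prior P(label)
    for i, value in enumerate(x):        # product of P(attribute_i | label)
        p *= sum(1 for r in rows if r[i] == value) / len(rows)
    return p

# Unseen sample <Blue, Indian, Sports> in (Color, Type, Origin) order
x = ("Blue", "Sports", "Indian")
print(score("Yes", x))  # 0.024
print(score("No", x))   # 0.072
```

The scores match the hand computation: 0.072 > 0.024, so the sample is classified as No.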