Monday, 30 July 2018

BigMart Sales Prediction Using Multiple Linear Regression

The BigMart Sales Prediction problem is one of the classic problems for starting a data science journey. Here I will give you a clear view of how to approach a regression problem with proper data analysis and exploration.
Retail is another industry which extensively uses analytics to optimize business processes. Tasks like product placement, inventory management, customized offers, product bundling, etc. are being smartly handled using data science techniques.


Problem Statement


The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities. Also, certain attributes of each product and store have been defined. The aim is to build a predictive model and find out the sales of each product at a particular store.

Using this model, BigMart will try to understand the properties of products and stores which play a key role in increasing sales.

Please note that the data may have missing values as some stores might not report all the data due to technical glitches. Hence, it will be required to treat them accordingly.

Here we will discuss production-style code and a process for a small toy problem like "BigMart Sales Prediction".
We will walk through the code and the process step by step:
Step 1: 
     Let us understand the problem and explore approaches that could give a better solution.

     m1: Average sales of past data.
        MSE = (e1^2 + e2^2 + ... + en^2) / n

        Limitation: the prediction error will be very high.

     m2: Area wise average sales 

        Limitation: High prediction error and high MSE value
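As a quick numeric illustration of m1 vs m2 (with made-up sales figures, not the actual BigMart data), here is a minimal sketch comparing the MSE of the two baselines:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    e = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.mean(e ** 2))

# Toy sales figures for two areas (hypothetical numbers).
sales = np.array([10.0, 12.0, 11.0, 30.0, 28.0, 32.0])
area  = np.array(["A", "A", "A", "B", "B", "B"])

# m1: predict the overall average for every product.
pred_overall = np.full_like(sales, sales.mean())

# m2: predict the average of each product's own area.
pred_area = np.array([sales[area == a].mean() for a in area])

print(mse(sales, pred_overall))   # large error: ignores area differences
print(mse(sales, pred_area))      # smaller error: uses area information
```

The area-wise average is already much better than the global average, but it still cannot use the other product and store attributes; that is what regression gives us.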

     m3: Linear Regression

         A statistical technique for predictive modelling that describes the relationship between the independent and dependent variables.

         Y = a0 + a1x1 + a2x2 + a3x3 + ... + anxn
        This equation describes many possible lines. We have to choose the best-fit line, the one that best explains the relationship between the independent variables and the dependent variable.
 

  Best fit line

There are three approaches to find the best fit line:

App1: Sum of Residuals
     Positive and negative errors may cancel each other out.

App2: Sum of the absolute value of residuals
    The absolute value prevents cancellation of errors, but it is sometimes hard to differentiate between the error values of two models.

App3: Sum of Square of residuals 
   It prevents the cancellation of errors and also gives a significant difference between the errors of two models.
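A small sketch with hypothetical residuals shows why the three approaches behave differently:

```python
import numpy as np

# Hypothetical residuals (actual - predicted) for two candidate lines.
res_a = np.array([3.0, -3.0, 2.0, -2.0])    # large errors that cancel in pairs
res_b = np.array([0.5, 0.5, -0.5, -0.5])    # small, balanced errors

# App1: plain sum -- both lines look "perfect" because signs cancel.
print(res_a.sum(), res_b.sum())                    # 0.0 0.0

# App2: sum of absolute residuals -- no cancellation, differences
# between models are visible but modest.
print(np.abs(res_a).sum(), np.abs(res_b).sum())    # 10.0 2.0

# App3: sum of squared residuals -- penalizes large errors heavily,
# separating the two models much more clearly.
print((res_a ** 2).sum(), (res_b ** 2).sum())      # 26.0 1.0
```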

                                                                                                            To be continued.......

Thursday, 9 November 2017

Predict the Profit of Startups Using Multiple Linear Regression

Here is my first blog. In it, I will show how to develop a machine learning model to predict the profit of startups. For this model, we have some prior data with the following attributes.


=> R&D Spend: Amount invested in research and development.
=> Administration: Amount invested in administration.
=> Marketing Spend: Amount invested in marketing.
=> State: The state in which the company operates.

So, based on this data, we will go through the step-by-step development of a machine learning model to predict the profit. In this post we will use Python.

Let's Start.


Step 1: 


First, we will import some Python libraries that we are going to use for our model development.


#Pandas is used for importing datasets.
#Numpy is used for mathematical functions.
#Matplotlib is a plotting library used for plotting graphs.
#Seaborn is a Python visualization library based on matplotlib.
#Warning messages are typically issued in situations where it is useful to alert the user of some condition in a program, where that condition (normally) doesn’t warrant raising an exception and terminating the program.
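The imports for this walkthrough might look like this (suppressing warnings is optional):

```python
import warnings

import numpy as np                 # mathematical functions
import pandas as pd                # importing and handling datasets
import matplotlib.pyplot as plt    # plotting graphs
import seaborn as sns              # statistical visualization built on matplotlib

warnings.filterwarnings("ignore")  # hide non-fatal warning messages
```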

Step 2 :

Import the 50 Startups dataset using the read_csv function of pandas.
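A sketch of this step; the file name `50_Startups.csv` is an assumption, and the snippet builds a tiny in-memory CSV (the first rows of the well-known 50 Startups dataset) so it is self-contained:

```python
import io
import pandas as pd

# Stand-in for the real file: a tiny CSV with the same columns.
csv_text = """R&D Spend,Administration,Marketing Spend,State,Profit
165349.2,136897.8,471784.1,New York,192261.83
162597.7,151377.59,443898.53,California,191792.06
153441.51,101145.55,407934.54,Florida,191050.39
"""

data = pd.read_csv(io.StringIO(csv_text))   # in practice: pd.read_csv("50_Startups.csv")
print(data.shape)                           # (rows, columns)
print(data.columns.tolist())
```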

Step 3 :

Check the number of null values in the data attributes and the type of each column.
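This check can be done with `isnull().sum()` and `dtypes`; here is a sketch using a small stand-in frame named `data`:

```python
import pandas as pd

# Small stand-in frame with the same columns as the 50 Startups data.
data = pd.DataFrame({
    "R&D Spend": [165349.2, 162597.7],
    "Administration": [136897.8, 151377.59],
    "Marketing Spend": [471784.1, 443898.53],
    "State": ["New York", "California"],
    "Profit": [192261.83, 191792.06],
})

print(data.isnull().sum())   # null count per column -- all zeros here
print(data.dtypes)           # float columns plus one object column (State)
```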

None of the attributes has any null values.

Step 4 :

Let's analyze the attributes one by one using data.head().
R&D Spend: Numerical values.

Administration: Numerical values.

Marketing Spend: Numerical values.

State: Categorical values.

We will not get much useful information by visualizing the numerical values, so we will focus on the categorical variable "State".
Let's see the countplot bar graph of the State variable.

Here we plot the bar graph using the countplot function of the seaborn library.
Note: in this graph, we can observe that the three categorical values are roughly equally distributed.
We can also visualize this in a pie chart.
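A sketch of both plots; the Agg backend is used so the snippet runs headlessly, and the file names are illustrative:

```python
import matplotlib
matplotlib.use("Agg")            # headless backend, safe for scripts
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Stand-in for the State column: three equally frequent categories.
data = pd.DataFrame({"State": ["New York", "California", "Florida"] * 5})

sns.countplot(x="State", data=data)          # bar graph of category counts
plt.savefig("state_countplot.png")
plt.close()

data["State"].value_counts().plot.pie(autopct="%1.1f%%")  # same counts as a pie
plt.savefig("state_pie.png")
plt.close()
```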


Let's plot the bar graph and the pie chart together in two subplots.
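The combined figure can be sketched with plt.subplots:

```python
import matplotlib
matplotlib.use("Agg")            # headless backend, safe for scripts
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

data = pd.DataFrame({"State": ["New York", "California", "Florida"] * 5})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
sns.countplot(x="State", data=data, ax=ax1)                        # bar graph on the left
data["State"].value_counts().plot.pie(ax=ax2, autopct="%1.1f%%")   # pie chart on the right
fig.savefig("state_subplots.png")
plt.close(fig)
```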

Step 5:

After visualizing the attributes of the startup dataset, we are now going to develop our model.
Here we separate the input variables from the target variable: the input variables hold the independent variables, and the target variable holds the dependent variable. Here, Profit is our dependent variable.



Step 6:
OK, now we have our input variables and target variable in x and y respectively.
Let's check the data type of x.

Now we can see that the dtype of x is 'O' (object) because of the State attribute. We can't apply a linear regression model to object data, so we need to convert it to a numeric (float) type.
We will use the dummy variable concept to replace the categorical variable.

e.g.:
  D1 (New York):   1 if the state is New York, else 0
  D2 (California): 1 if the state is California, else 0
  D3 (Florida):    1 if the state is Florida, else 0

But we will use one fewer dummy variable, i.e. only D1 and D2:
if D1 = 1 and D2 = 0, the state is New York;
if D1 = 0 and D2 = 1, it is California;
if both D1 = 0 and D2 = 0, it is Florida.

Dropping one dummy also avoids the dummy variable trap: with all three dummies the columns would be perfectly collinear, and the model can infer the third category from the other two.

So let's use dummy variables.

To do this, we are using LabelEncoder and OneHotEncoder from sklearn.preprocessing.
Now we can see that our input variable is of float type.
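The post uses LabelEncoder and OneHotEncoder; an equivalent, shorter sketch uses pandas `get_dummies` with `drop_first=True` so that one fewer dummy column is kept (`dtype=float` keeps the result numeric):

```python
import pandas as pd

x = pd.DataFrame({
    "R&D Spend": [165349.2, 162597.7, 153441.51],
    "State": ["New York", "California", "Florida"],
})

# drop_first=True keeps one fewer dummy column, as described above;
# the alphabetically first category (California) becomes the baseline.
x_encoded = pd.get_dummies(x, columns=["State"], drop_first=True, dtype=float)
print(x_encoded.columns.tolist())
# ['R&D Spend', 'State_Florida', 'State_New York']
```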

If you want to know more about this, you can ask in the comments section.
Step 7:

To develop the linear model, we need data to train our machine, and after training we also need to test the model. So we will divide our data into a training set and a testing set in the ratio 80:20.
Let's do it.
We provide more data to train the model than to test it.

Step 8:
Finally, we are going to build our model by fitting multiple linear regression to our training set.
Step 9:

Our model is ready. Now, let's test it using the testing dataset.


Here y_pred is the profit predicted by our newly developed multiple linear regression model.
Do a comparison between y_pred and y_test.
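Steps 7 to 9 end to end, sketched on synthetic data that stands in for the startups dataset (the coefficients and noise level are made up):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: profit as a linear function of spends plus noise.
X = rng.uniform(0, 200_000, size=(50, 3))     # R&D, Administration, Marketing
y = 0.8 * X[:, 0] + 0.05 * X[:, 1] + 0.2 * X[:, 2] + rng.normal(0, 5_000, 50)

# Step 7: 80:20 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Step 8: fit multiple linear regression on the training set.
model = LinearRegression().fit(X_train, y_train)

# Step 9: predict on the test set and compare with the true values.
y_pred = model.predict(X_test)
print(np.c_[y_test, y_pred][:5])    # side-by-side comparison
print(model.score(X_test, y_test))  # R^2 on held-out data
```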

Our model is developed but is not fully optimized yet.

Step 10: 

Now I am going to apply Backward Elimination to optimize our model. (I will explain this method in my next post.)



And Here is the summary of our optimized machine learning model.

You can see the P value is 0.000. It means that our model is now fully optimized at the 5% significance level. (Explanation of this optimization will come in my next post.)
Thank you.


Tuesday, 24 February 2015

Fun with Technology:

This blog belongs to everyone who has a dream and also the capacity to achieve it.
I wish you all a great start with Fun with Technology.....
