Monday, 30 July 2018

BigMart Sales Prediction Using Multiple Linear Regression

The BigMart Sales Prediction problem is one of the basic problems that can be used to start a data science journey. Here I will give you a clear view of how to approach a regression problem with proper data analysis and exploration.
Retail is an industry that uses analytics extensively to optimize business processes. Tasks like product placement, inventory management, customized offers, product bundling, etc. are being handled smartly using data science techniques.


Problem Statement


The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities. Also, certain attributes of each product and store have been defined. The aim is to build a predictive model and find out the sales of each product at a particular store.

Using this model, BigMart will try to understand the properties of products and stores which play a key role in increasing sales.

Please note that the data may have missing values as some stores might not report all the data due to technical glitches. Hence, it will be required to treat them accordingly.

Here we will discuss production-style code and process for a small toy problem like "BigMart Sales Prediction".
We will walk through the code and process step by step:
Step 1:
     Let us understand the problem and look at the possible approaches in order to find a better solution.

     m1: Average sales of past data.
        MSE = (e1^2 + e2^2 + ... + en^2) / n

        Limitation: the prediction error will be very high.
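
As a rough illustration of m1, the sketch below predicts the overall average for every item and computes the MSE of that baseline. The sales figures and the names `past_sales` and `mean_squared_error` are made up for this example and do not come from the BigMart data.

```python
# Hypothetical past sales figures (illustrative values, not BigMart data).
past_sales = [2100.0, 1850.0, 3200.0, 950.0, 2700.0]


def mean_squared_error(actual, predicted):
    """Average of the squared errors, where each error is actual - predicted."""
    errors = [a - p for a, p in zip(actual, predicted)]
    return sum(e ** 2 for e in errors) / len(errors)


# m1: use the average of past sales as the prediction for every item.
average_sales = sum(past_sales) / len(past_sales)
baseline_predictions = [average_sales] * len(past_sales)
baseline_mse = mean_squared_error(past_sales, baseline_predictions)

print("average sales:", average_sales)
print("baseline MSE:", baseline_mse)
```

Because every item gets the same prediction, items that sell far more or far less than the average contribute large squared errors, which is why this baseline tends to have a high MSE.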

     m2: Area-wise average sales

        Limitation: the prediction error, and hence the MSE, is still high.
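
A minimal sketch of m2, assuming each sales record carries an area label: group the records by area, average the sales within each group, and use that group average as the prediction. The area labels ("Tier 1", "Tier 2") and the numbers are hypothetical and only serve the example.

```python
from collections import defaultdict

# Hypothetical (area, sales) records; labels and values are illustrative only.
records = [
    ("Tier 1", 2000.0),
    ("Tier 1", 2400.0),
    ("Tier 2", 1000.0),
    ("Tier 2", 1400.0),
]

# Accumulate the total sales and the number of records per area.
totals = defaultdict(float)
counts = defaultdict(int)
for area, sales in records:
    totals[area] += sales
    counts[area] += 1

# m2: the prediction for a record is the average sales of its area.
area_average = {area: totals[area] / counts[area] for area in totals}

squared_errors = [(sales - area_average[area]) ** 2 for area, sales in records]
area_mse = sum(squared_errors) / len(squared_errors)

print("area averages:", area_average)
print("area-wise MSE:", area_mse)
```

Grouping only captures the differences between areas. Everything that varies within an area is still predicted by a single number, so the error can remain large on real data.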

     m3: Linear Regression

         A statistical technique for predictive modelling that describes the relationship between the independent variables and the dependent variable.

         Y = a1x1 + a2x2 + a3x3 + ... + anxn
        Different choices of the coefficients a1, ..., an in this equation give different lines. We have to choose the best fit line, which best explains the relationship between the independent and dependent variables.
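
To make the equation concrete, here is a small sketch that fits Y = a1x1 + a2x2 (two features, no intercept term, matching the form above) by least squares, solving the 2x2 normal equations directly. The function name `fit_two_feature_regression` and the data are assumptions made for illustration. A real project would more likely use a library routine.

```python
def fit_two_feature_regression(x1, x2, y):
    """Least-squares coefficients (a1, a2) for y ~ a1*x1 + a2*x2, no intercept."""
    # Entries of the normal equations (X^T X) a = X^T y.
    s11 = sum(u * u for u in x1)
    s12 = sum(u * v for u, v in zip(x1, x2))
    s22 = sum(v * v for v in x2)
    t1 = sum(u * t for u, t in zip(x1, y))
    t2 = sum(v * t for v, t in zip(x2, y))

    # Solve the 2x2 system with Cramer's rule.
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("features are collinear; coefficients are not unique")
    a1 = (t1 * s22 - s12 * t2) / det
    a2 = (s11 * t2 - s12 * t1) / det
    return a1, a2


# Hypothetical data generated from y = 2*x1 + 3*x2, so the fit should recover
# coefficients close to 2 and 3.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [2.0, 1.0, 4.0, 3.0]
y = [2.0 * u + 3.0 * v for u, v in zip(x1, x2)]

a1, a2 = fit_two_feature_regression(x1, x2, y)
print("a1:", a1, "a2:", a2)
```

The same idea extends to n features, where the normal equations become an n x n linear system.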
 

  Best fit line

There are three approaches to find the best fit line:

App1: Sum of residuals
     Positive and negative errors can cancel each other out, so a poor fit may still give a sum close to zero.

App2: Sum of the absolute values of residuals
    Taking the absolute value prevents the cancellation of errors, but it is sometimes very hard to differentiate between the error values of two models.

App3: Sum of squares of residuals
   It prevents the cancellation of errors and, because squaring penalizes large errors more heavily, also gives a significant difference between the errors of two models.
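
The three criteria can be compared on a toy example. The actual values and the two sets of predictions below are hypothetical and were chosen deliberately so that the first two criteria cannot separate the models while the third can.

```python
# Hypothetical actual values and predictions from two candidate lines.
actual = [10.0, 20.0, 30.0, 40.0]
predictions_a = [9.0, 21.0, 29.0, 41.0]   # four small errors
predictions_b = [8.0, 22.0, 30.0, 40.0]   # two larger errors, two exact


def residuals(actual_values, predicted_values):
    """Residual = actual - predicted, one per observation."""
    return [a - p for a, p in zip(actual_values, predicted_values)]


def summarize(res):
    """Return the three candidate criteria for a list of residuals."""
    return {
        "sum": sum(res),                          # App1
        "sum_abs": sum(abs(r) for r in res),      # App2
        "sum_sq": sum(r ** 2 for r in res),       # App3
    }


summary_a = summarize(residuals(actual, predictions_a))
summary_b = summarize(residuals(actual, predictions_b))

print("model A:", summary_a)
print("model B:", summary_b)
```

For this data, the plain sum is zero for both models even though neither fits perfectly, the absolute sums are equal, and only the sum of squares differs, with the model that makes larger individual errors scoring worse.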

                                                                                                            To be continued.......
