Thursday, 14 March 2013

Panel Data Analysis: An Inception

# This post was created as a solution to the assignments given on 13/02/2013 in the IT & Business Applications Lab, Spring Semester, VGSoM, IIT Kharagpur, Class of 2014.

Panel (data) analysis is a statistical method, widely used in social science, epidemiology, and econometrics, which deals with two-dimensional panel data. The data are usually collected over time for the same individuals, and a regression is then run over these two dimensions. A common panel data regression model looks like y_{it} = a + b x_{it} + \epsilon_{it}, where y is the dependent variable, x is the independent variable, a and b are coefficients, and i and t are indices for individuals and time. The error \epsilon_{it} is very important in this analysis: assumptions about the error term determine whether we speak of fixed effects or random effects. In a fixed effects model, \epsilon_{it} is assumed to vary non-stochastically over i or t, making the fixed effects model analogous to a dummy variable model in one dimension. In a random effects model, \epsilon_{it} is assumed to vary stochastically over i or t, which requires special treatment of the error variance matrix.
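The distinction between the models can be made concrete by splitting the error term. The following is a sketch of the usual one-way error-component notation; the symbols \mu_i and \nu_{it} are introduced here for illustration and do not appear in the assignment.

```latex
% One-way error-component decomposition of the error term
y_{it} = a + b\,x_{it} + \epsilon_{it}, \qquad \epsilon_{it} = \mu_i + \nu_{it}

% Pooled model: no individual effect
\mu_i = 0 \quad \text{for all } i

% Fixed effects: \mu_i is a fixed parameter per individual; subtracting
% individual means (the "within" transformation) removes it
y_{it} - \bar{y}_{i} = b\,(x_{it} - \bar{x}_{i}) + (\nu_{it} - \bar{\nu}_{i})

% Random effects: \mu_i is a random draw, uncorrelated with the regressor
\mu_i \sim (0, \sigma_\mu^2), \qquad \operatorname{Cov}(\mu_i, x_{it}) = 0
```

Under this notation the pooled model is the special case in which every \mu_i is zero, which is why it serves as the null hypothesis when compared against either of the other two models.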

We will be using three models for this purpose:

  • Pooled model

  • Fixed effects model

  • Random effects model 

    Assignment #1: 
    Perform panel data analysis on the "Produc" data set from the "plm" package using the three types of model, and then determine which model best suits this data set with the following functions: 
    pFtest: to choose between fixed and pooled
    plmtest: to choose between pooled and random 
    phtest: to choose between random and fixed

    Solution:
    First we load the "plm" package and the data using the following commands:
    > library(plm)
    > data(Produc, package = "plm")
    > head(Produc)
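Before fitting any model it can help to confirm the panel structure. The snippet below is a small sketch that uses plm's `pdim()` with "state" as the individual index and "year" as the time index; the variable name `dims` is only illustrative.

```r
library(plm)
data(Produc, package = "plm")

# Summarise the panel structure, using state as the individual index
# and year as the time index.
dims <- pdim(Produc, index = c("state", "year"))
print(dims)

# Number of individuals (n), time periods (T) and total observations (N).
n_states <- dims$nT$n
n_years  <- dims$nT$T
n_obs    <- dims$nT$N
is_balanced <- dims$balanced
```

A balanced panel means every state is observed in every year, so the total number of observations is simply the number of states times the number of years.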

    The columns of the data set are described below:

    - state: the state
    - year: the year
    - pcap: public capital stock
    - hwy: highway and streets
    - water: water and sewer facilities
    - util: other public buildings and structures
    - pc: private capital stock
    - gsp: gross state product
    - emp: labor input measured by the employment in non-agricultural payrolls
    - unemp: state unemployment rate

    Here we take "pcap" as the dependent variable and the other variables as independent, and first estimate "pcap" with the pooled model.

    The commands are given below:
    > pool <- plm(log(pcap) ~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) + log(emp) + log(unemp), data = Produc, model = "pooling", index = c("state", "year"))
    > summary(pool)


    Then we estimate "pcap" with the fixed effects model.
    The commands are given below:

    > fixed <- plm(log(pcap) ~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) + log(emp) + log(unemp), data = Produc, model = "within", index = c("state", "year"))
    > summary(fixed)

    Then we estimate "pcap" with the random effects model.
    The commands are given below:

    > random <- plm(log(pcap) ~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) + log(emp) + log(unemp), data = Produc, model = "random", index = c("state", "year"))
    > summary(random)
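Since the three calls differ only in the `model` argument, the formula can be written once and reused. The following is one possible sketch; the names `f`, `models` and `slopes` are illustrative and not part of the assignment.

```r
library(plm)
data(Produc, package = "plm")

# Same specification as in the commands above, written once and reused.
f <- log(pcap) ~ log(hwy) + log(water) + log(util) + log(pc) +
  log(gsp) + log(emp) + log(unemp)

# Fit the three estimators by varying only the `model` argument.
models <- lapply(
  c(pooled = "pooling", fixed = "within", random = "random"),
  function(m) plm(f, data = Produc, model = m, index = c("state", "year"))
)

# Slope coefficients side by side. The within (fixed effects) model has
# no overall intercept, so only the common slope terms are compared.
slope_names <- names(coef(models$fixed))
slopes <- sapply(models, function(m) coef(m)[slope_names])
print(round(slopes, 4))
```

Putting the slopes in one table makes it easier to see how much the estimates move once state-specific effects are taken into account.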

    Comparison
    Each comparison between models is a hypothesis test. In the first two tests, the null hypothesis supports the pooled model:
    H0 (null hypothesis): the individual- and time-specific parameters are all zero
    H1 (alternative hypothesis): at least one of the individual- or time-specific parameters is non-zero

    Pooled vs Fixed
    Null Hypothesis: Pooled model
    Alternative Hypothesis: Fixed effects model

    Command:
    > pFtest(fixed,pool)
    Result:
    data:  log(pcap) ~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) +      log(emp) + log(unemp)
    F = 56.6361, df1 = 47, df2 = 761, p-value < 2.2e-16



    From the result, we can see that the p-value is far below any conventional significance level (p < 2.2e-16), so we reject the null hypothesis: the fixed effects model is preferred over the pooled model.
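The F test for individual effects can also be viewed as a classical test of nested linear models: pooled OLS against a regression with one dummy variable per state. The sketch below, assuming the same formula as above, computes it with base R's `lm()` and `anova()` and compares the statistic with the one from `pFtest()`; all object names are illustrative.

```r
library(plm)  # used for the Produc data and for pFtest()
data(Produc, package = "plm")

f <- log(pcap) ~ log(hwy) + log(water) + log(util) + log(pc) +
  log(gsp) + log(emp) + log(unemp)

# Restricted model: one common intercept for all states (pooled OLS).
pooled_lm <- lm(f, data = Produc)

# Unrestricted model: one dummy per state (least-squares dummy variables).
lsdv_lm <- lm(update(f, . ~ . + factor(state)), data = Produc)

# Classical F test of the restriction "all state dummies are zero".
f_tab <- anova(pooled_lm, lsdv_lm)
print(f_tab)

# The same comparison through plm.
pool  <- plm(f, data = Produc, model = "pooling", index = c("state", "year"))
fixed <- plm(f, data = Produc, model = "within",  index = c("state", "year"))
f_plm <- pFtest(fixed, pool)
print(f_plm$statistic)
```

The degrees of freedom reported above follow from the panel dimensions: df1 = 47 is the number of states minus one, and df2 = 761 is the number of observations minus the number of states and the seven slope coefficients.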

    Pooled vs Random
    Null Hypothesis: Pooled model
    Alternative Hypothesis: Random effects model
    Command :
    > plmtest(pool)
    Result:
            Lagrange Multiplier Test - (Honda)
    data:  log(pcap) ~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) +      log(emp) + log(unemp)
    normal = 57.1686, p-value < 2.2e-16
    alternative hypothesis: significant effects
    Since the p-value is again below 2.2e-16, we reject the null hypothesis: the random effects model is preferred over the pooled model.

    Random vs Fixed
    Null Hypothesis: the individual effects are uncorrelated with the regressors (random effects model)
    Alternative Hypothesis: the individual effects are correlated with the regressors (fixed effects model)
    Command:
    > phtest(fixed, random)
    Result:
            Hausman Test
    data:  log(pcap) ~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) +      log(emp) + log(unemp)
    chisq = 93.546, df = 7, p-value < 2.2e-16
    alternative hypothesis: one model is inconsistent
    Since the p-value is below 2.2e-16, we reject the null hypothesis: the random effects estimates are inconsistent here, so the fixed effects model is preferred over the random effects model.
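The Hausman statistic compares the slope estimates of the two models, weighted by the difference of their covariance matrices. The sketch below rebuilds it by hand from the fitted objects, as a rough illustration of the textbook formula rather than a description of how `phtest()` is implemented; the names `d_beta`, `d_vcov` and `stat` are illustrative.

```r
library(plm)
data(Produc, package = "plm")

f <- log(pcap) ~ log(hwy) + log(water) + log(util) + log(pc) +
  log(gsp) + log(emp) + log(unemp)

fixed  <- plm(f, data = Produc, model = "within", index = c("state", "year"))
random <- plm(f, data = Produc, model = "random", index = c("state", "year"))

# Compare only the slopes: the within model has no overall intercept.
slope_names <- names(coef(fixed))

# Difference in slope estimates and in their covariance matrices.
d_beta <- coef(fixed) - coef(random)[slope_names]
d_vcov <- vcov(fixed) - vcov(random)[slope_names, slope_names]

# Quadratic form d_beta' (V_fe - V_re)^{-1} d_beta, referred to a
# chi-squared distribution with one degree of freedom per slope.
stat    <- as.numeric(t(d_beta) %*% solve(d_vcov) %*% d_beta)
df      <- length(d_beta)
p_value <- pchisq(abs(stat), df = df, lower.tail = FALSE)

print(c(statistic = stat, df = df, p.value = p_value))
print(phtest(fixed, random))
```

A large statistic means the two sets of slope estimates differ by more than sampling error would explain, which is evidence that the individual effects are correlated with the regressors.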

    Conclusion: 
    After making all the comparisons, we can see that the fixed effects model is preferred over the pooled model, the random effects model is preferred over the pooled model, and finally the fixed effects model is preferred over the random effects model.
    So we conclude that the fixed effects model is best suited for the panel data analysis of the "Produc" data set: the individual (state) effects are significant and are correlated with the regressor variables.
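The decision logic used in this conclusion can be written down as a small helper. This is a hypothetical function, not part of plm, and the 5% significance level is an assumed default.

```r
# Hypothetical helper: pick a model from the p-values of the three tests.
#   p_f       : p-value of pFtest  (pooled vs fixed)
#   p_lm      : p-value of plmtest (pooled vs random)
#   p_hausman : p-value of phtest  (random vs fixed)
choose_panel_model <- function(p_f, p_lm, p_hausman, alpha = 0.05) {
  fixed_beats_pooled  <- p_f  < alpha
  random_beats_pooled <- p_lm < alpha

  # Neither test rejects the pooled model.
  if (!fixed_beats_pooled && !random_beats_pooled) return("pooled")
  # Only one of the two alternatives beats the pooled model.
  if (fixed_beats_pooled && !random_beats_pooled) return("fixed")
  if (!fixed_beats_pooled && random_beats_pooled) return("random")
  # Both beat the pooled model: the Hausman test decides between them.
  if (p_hausman < alpha) "fixed" else "random"
}

# All three tests in this post reject their null hypothesis.
choice <- choose_panel_model(p_f = 2.2e-16, p_lm = 2.2e-16, p_hausman = 2.2e-16)
print(choice)
```

With the p-values reported above, the helper returns "fixed", in line with the conclusion reached here.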