Thursday, 14 March 2013

Panel data Analysis: An Inception

#this post is created as a solution for assignments given on 13/02/2013 in IT & Business Applications Lab, Spring Semester, VGSoM, IIT Kharagpur Class of 2014.

Panel (data) analysis is a statistical method, widely used in social science, epidemiology, and econometrics, which deals with two-dimensional panel data. The data are usually collected over time for the same individuals, and a regression is then run over these two dimensions. A common panel data regression model looks like y_{it}=a+bx_{it}+\epsilon_{it}, where y is the dependent variable, x is the independent variable, a and b are coefficients, and i and t are indices for individuals and time. The error \epsilon_{it} is very important in this analysis: assumptions about the error term determine whether we speak of fixed effects or random effects. In a fixed effects model, \epsilon_{it} is assumed to vary non-stochastically over i or t, making the fixed effects model analogous to a dummy variable model in one dimension. In a random effects model, \epsilon_{it} is assumed to vary stochastically over i or t.
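To make the "dummy variable" analogy concrete, here is a minimal base-R sketch on simulated data (the panel below is hypothetical and not part of the assignment). It estimates the slope three ways: pooled OLS, a regression with one dummy per individual, and the "within" form where each individual's mean is subtracted first. The last two should agree.

```r
# Simulated panel: 5 individuals observed over 8 periods (hypothetical data).
set.seed(1)
n_id <- 5
n_t  <- 8
id    <- factor(rep(seq_len(n_id), each = n_t))
alpha <- rep(rnorm(n_id, sd = 2), each = n_t)   # individual-specific intercepts
x     <- rnorm(n_id * n_t) + alpha              # regressor correlated with the effects
y     <- 1 + 0.5 * x + alpha + rnorm(n_id * n_t, sd = 0.1)

# Pooled OLS ignores the individual effects.
b_pooled <- coef(lm(y ~ x))["x"]

# Dummy-variable form of the fixed effects model: one intercept per individual.
b_lsdv <- coef(lm(y ~ x + id))["x"]

# "Within" form: subtract each individual's mean from y and x, then regress.
y_w <- y - ave(y, id)
x_w <- x - ave(x, id)
b_within <- coef(lm(y_w ~ x_w - 1))["x_w"]

print(c(pooled = unname(b_pooled), lsdv = unname(b_lsdv), within = unname(b_within)))
```

The true slope in this simulation is 0.5, so the dummy-variable and within estimates should land close to it, while the pooled estimate is pulled away by the ignored individual effects.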

We will be using three models for this purpose:

  • Pooled model

  • Fixed effects model

  • Random effects model 

    Assignment #1: 
    Run a panel data analysis on the "Produc" data set using the "plm" package with all three types of model, then determine which model suits this data set best by using the following functions: 
    pFtest: for choosing between fixed and pooled
    plmtest: for choosing between pooled and random
    phtest: for choosing between random and fixed

    Solution: Commands used are:
    First we load the package and the data using the following commands:
    > library(plm)
    > data(Produc , package ="plm")
    > head(Produc)
    A snapshot of the commands and result is given below:

    The columns of the data set are described below:

    - state: the state
    - year: the year
    - pcap: public capital stock
    - hwy: highway and streets
    - water: water and sewer facilities
    - util: other public buildings and structures
    - pc: private capital stock
    - gsp: gross state product
    - emp: labor input measured by employment in non-agricultural payrolls
    - unemp: state unemployment rate

    Here, we take "pcap" as the dependent variable and the other variables as independent, and first estimate "pcap" with the pooled model.

    Commands and a snapshot of the result are given below:
    > pool <- plm(log(pcap)~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) + log(emp) + log(unemp) , data =Produc, model=("pooling"), index = c("state","year"))
    > summary(pool)


    Then we estimate "pcap" using the fixed effects model.
    Commands and a snapshot of the result are given below:

    > fixed <- plm(log(pcap)~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) + log(emp) + log(unemp) , data =Produc, model=("within"), index = c("state","year"))
    > summary(fixed)

    Then we estimate "pcap" using the random effects model.
    Commands and a snapshot of the result are given below:

    > random <- plm(log(pcap)~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) + log(emp) + log(unemp) , data =Produc, model=("random"), index = c("state","year"))
    > summary(random)

    Comparison
    Each comparison between the models is a hypothesis test. In the first two tests the null hypothesis supports the pooled model:
    H0: Null Hypothesis: the individual and time effects are all zero
    H1: Alternate Hypothesis: at least one of the individual or time effects is non-zero

    Pooled vs Fixed
    Null Hypothesis: Pooled Model
    Alternate Hypothesis: Fixed Effects Model

    Command:
    > pFtest(fixed,pool)
    Result:
    data:  log(pcap) ~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) +      log(emp) + log(unemp)
    F = 56.6361, df1 = 47, df2 = 761, p-value < 2.2e-16



    From the result, we can see that the p-value is negligible, so we reject the null hypothesis in favour of the alternate hypothesis: the Fixed Effects Model is preferred over the Pooled Model.
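The degrees of freedom in this F test can be checked by hand. The sketch below assumes the panel dimensions documented for Produc in plm (48 states observed over the 17 years 1970-1986) and the 7 regressors in our formula; the F statistic is copied from the output above.

```r
# Panel dimensions of Produc as documented in plm: 48 states, 17 years (1970-1986).
n_states <- 48
n_years  <- 17
k        <- 7                      # slope coefficients in the formula above

n_obs <- n_states * n_years        # 816 observations
df1   <- n_states - 1              # restrictions tested: all state effects equal
df2   <- n_obs - n_states - k      # residual df of the fixed effects model

# Upper-tail probability of the reported F statistic.
p_value <- pf(56.6361, df1, df2, lower.tail = FALSE)
print(c(df1 = df1, df2 = df2))
print(p_value < .Machine$double.eps)   # R prints "< 2.2e-16" below this threshold
```

This reproduces df1 = 47 and df2 = 761 from the result, and shows that "p-value < 2.2e-16" simply means the p-value is below the machine precision R uses when printing.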

    Pooled vs Random
    Null Hypothesis: Pooled Model
    Alternate Hypothesis: Random Effects Model
    Command :
    > plmtest(pool)
    Result:
            Lagrange Multiplier Test - (Honda)
    data:  log(pcap) ~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) +      log(emp) + log(unemp)
    normal = 57.1686, p-value < 2.2e-16
    alternative hypothesis: significant effects
    Since the p-value is negligible, we reject the null hypothesis in favour of the alternate hypothesis: the Random Effects Model is preferred over the Pooled Model.

    Random vs Fixed
    Null Hypothesis: the individual effects are uncorrelated with the regressors (Random Effects Model)
    Alternate Hypothesis: Fixed Effects Model
    Command:
     > phtest(fixed,random)
    Result:
            Hausman Test
    data:  log(pcap) ~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) +      log(emp) + log(unemp)
    chisq = 93.546, df = 7, p-value < 2.2e-16
    alternative hypothesis: one model is inconsistent
    Since the p-value is negligible, we reject the null hypothesis in favour of the alternate hypothesis: the Fixed Effects Model is preferred over the Random Effects Model.
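As a rough cross-check, the p-values of the last two tests can be recomputed from the reported statistics with base R. This is only a sketch: the statistics are copied from the outputs above, and the Honda statistic is treated as a one-sided standard normal value, which is an assumption about how plmtest reports it.

```r
# Statistics copied from the test output above.
honda_z   <- 57.1686   # plmtest: asymptotically standard normal under H0
hausman_x <- 93.546    # phtest: chi-squared with 7 degrees of freedom under H0

p_honda   <- pnorm(honda_z, lower.tail = FALSE)              # one-sided (assumed)
p_hausman <- pchisq(hausman_x, df = 7, lower.tail = FALSE)

print(p_honda)
print(p_hausman)
print(c(p_honda, p_hausman) < .Machine$double.eps)
```

Both values fall below the 2.2e-16 printing threshold, in line with the outputs above.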

    Conclusion: 
    After making all the comparisons, we can see that the Fixed Effects Model is preferred over the Pooled Model, the Random Effects Model is preferred over the Pooled Model, and finally the Fixed Effects Model is preferred over the Random Effects Model.
    So, we come to the conclusion that the Fixed Effects Model is best suited for the panel data analysis of the "Produc" data set: the individual (state) effects are significant and are correlated with the regressor variables.

Wednesday, 13 February 2013

Interpreting the Historical Volatility of Market

#this post is created as a solution for assignments given on 13/02/2013 in IT & Business Applications Lab, Spring Semester, VGSoM, IIT Kharagpur Class of 2014.

Assignment #1: Create log returns data (use the closing price Nifty data from 01.01.2012 to 01.01.2013) and calculate the historical volatility.

 

 

Solution: Commands used are:
readData<-read.csv(file.choose() , header=T)
closePrice<-readData[,5] # reading the closing price column
closePrice.ts<-ts(closePrice , frequency=252)  # making a time series
varLag<- lag(closePrice.ts , k=-1) # stock price for time (t-1)
logNum<- log(closePrice.ts) - log(varLag) # log(P_t) - log(P_(t-1))
logReturns<-logNum # the difference of logs is already the log return, so no further division is needed

Snapshot of commands and result is given below:



Now, we calculate the historical volatility as follows:
annualFactor<-sqrt(252) # 252 trading days in a year
histVolatility<-sd(logReturns)*annualFactor
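For readers without the NSE file, here is a minimal self-contained sketch of the same calculation on simulated prices (the price series is hypothetical). It takes the log return as the difference of consecutive log prices and annualises the daily standard deviation with 252 trading days.

```r
# Simulated daily closing prices for one trading year (hypothetical data;
# the NSE file from the post is not reproduced here).
set.seed(42)
closePrice <- 5000 * exp(cumsum(rnorm(252, mean = 0, sd = 0.01)))

# Log return for day t: log(P_t) - log(P_{t-1}).
logReturns <- diff(log(closePrice))

# Annualise the daily standard deviation, assuming 252 trading days.
histVolatility <- sd(logReturns) * sqrt(252)
print(histVolatility)
```

Since the simulated daily returns have a standard deviation of 0.01, the annualised figure should come out near 0.01 * sqrt(252), i.e. roughly 16%.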

Snapshot of commands and result is given below:



Assignment #2: Create an ACF plot for the log returns data calculated previously and interpret the findings. Also run an ADF test and interpret the findings.

Soln -:

# The following command is used to create the ACF plot

acf(logReturns)

Snapshot of commands and result is given below:




Graphical Interpretation
- The two horizontal dotted lines mark the confidence bounds for the hypothesis of zero autocorrelation (95% by default).
- As all the correlation bars (vertical lines) beyond lag 0 lie inside the two blue dotted lines, the returns show no significant autocorrelation, which suggests that the returns data is "Stationary" in nature.
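The same reading can be done numerically. The sketch below uses simulated white noise as a stand-in for the log returns (an assumption, since the NSE data is not included here), computes the bounds that acf() draws, and lists any lags that fall outside them.

```r
set.seed(7)
x <- rnorm(250)                           # stand-in for the log returns (hypothetical)

a     <- acf(x, lag.max = 20, plot = FALSE)
bound <- qnorm(0.975) / sqrt(length(x))   # the dotted lines drawn by acf()

# Lag 0 is always 1, so drop it before comparing with the bounds.
rho     <- a$acf[-1]
outside <- which(abs(rho) > bound)
print(bound)
print(outside)                            # lags with significant autocorrelation, if any
```

With a 95% band, about one lag in twenty can stray outside the lines purely by chance, so a single small excursion is not strong evidence of autocorrelation.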

Using the ADF test

Command used (adf.test is assumed here to come from the "tseries" package):
library(tseries)
adf.test(logReturns)
We get the following result:


 
To interpret the result, we set up the hypotheses:
Null Hypothesis -: The returns data is not stationary
Alternative Hypothesis -: The returns data is stationary

The test gives p-value = 0.01, which is less than the 0.05 threshold of the 5% significance level, so the Null Hypothesis is rejected.

Result -: the given data is stationary in nature

Thursday, 7 February 2013

Exploring Returns and Logit

#this post is created as a solution for assignments given on 05/02/2013 in IT & Business Applications Lab, Spring Semester, VGSoM, IIT Kharagpur Class of 2014.

Assignment #1: Find the returns of NSE data covering more than 6 months, taking the 10th data point as the start and the 95th data point as the end.

Solution: 

Steps:

1. Extract the data (from 5/7/2011 to 7/2/2013) in a separate .csv file.
2. Find the Returns applying Time series and Lag
3. Draw the plot.

Commands used:
z<-read.csv(file.choose(),header=T)
Closedata<-z$Close
Close.ts<-ts(Closedata,deltat=1/252) # daily series, 252 trading days a year
znew.ts<-ts(Close.ts[10:95]) # keep the 10th to 95th data points
znew.diff<-diff(znew.ts) # P_t - P_(t-1)
zdenominator<-lag(znew.ts,k=-1) # P_(t-1); the argument is a lowercase k
Returndata<-znew.diff/zdenominator
plot(Returndata,main="Return data for the 10th to 95th day of the downloaded NSE data")
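To see what the diff and lag steps compute, here is a small sketch with four made-up prices. It calculates the simple return once with plain vectors and once with time series objects, and the two should match.

```r
# Hypothetical closing prices.
price <- c(100, 110, 99, 108.9)

# Simple return for day t: (P_t - P_{t-1}) / P_{t-1}.
simpleReturn <- diff(price) / head(price, -1)

# The same calculation with ts objects, mirroring diff() and lag(k = -1).
price.ts <- ts(price)
tsReturn <- diff(price.ts) / lag(price.ts, k = -1)

print(simpleReturn)
print(tsReturn)
```

Note that R is case-sensitive: lag() takes a lowercase k, and an uppercase K is passed along unused, so the series would be shifted the wrong way without any error message.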


Snapshot of Result


Assignment #2: Data for observations 1-700 is available. Predict the outcome for observations 701-850 using GLM estimation with logit analysis.

Solution:

Commands:

z<-read.csv(file.choose(),header=T)
z1<-z[1:700,1:9] # training rows, including the "default" column
head(z1)
z1$ed<-factor(z1$ed)
z1.est<-glm(default ~ age + ed + employ + address + income + debtinc + creddebt + othedebt, data=z1, family="binomial")
summary(z1.est)
forecast<-z[701:850,1:8] # rows to predict, predictors only
forecast$ed<-factor(forecast$ed, levels=levels(z1$ed)) # same factor levels as in training
forecast$probability<-predict(z1.est,newdata=forecast,type="response") # predicted probability of default
head(forecast)
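The same train-then-predict workflow can be tried without the loan file. The sketch below simulates a data set (the columns, coefficients, and the 0.5 cut-off are all assumptions made for illustration), fits the logit on the first 700 rows, and predicts probabilities for rows 701-850.

```r
set.seed(123)
n <- 850
# Hypothetical stand-in for the loan data: two predictors and a 0/1 outcome.
d <- data.frame(income = rnorm(n, 50, 15), debtinc = rnorm(n, 10, 5))
p_true <- plogis(-1 - 0.03 * d$income + 0.25 * d$debtinc)
d$default <- rbinom(n, 1, p_true)

train   <- d[1:700, ]
holdout <- d[701:850, c("income", "debtinc")]   # outcome withheld

fit <- glm(default ~ income + debtinc, data = train, family = binomial)

# type = "response" returns probabilities; the default returns log-odds.
holdout$probability <- predict(fit, newdata = holdout, type = "response")
holdout$predicted   <- as.integer(holdout$probability > 0.5)   # 0.5 cut-off is a choice
head(holdout)
```

The cut-off used to turn probabilities into 0/1 predictions is a judgment call; 0.5 is only a common starting point.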





 

Tuesday, 22 January 2013

Exploring Regression and ANOVA

#this post is created as a solution for assignments given on 22/01/2013 in IT & Business Applications Lab, Spring Semester, VGSoM, IIT Kharagpur Class of 2014.

Assignment #1: We believe that, for one kind of car, groove (independent variable) affects mileage (dependent variable). We have to fit 'lm' and comment on its applicability. 

Solution: 

Steps:

1. Extract the data in a separate .csv file.
2. Assign Groove and Mileage to separate variables and apply regression on them.
3. Find the residuals and draw the Q-Q plot.
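The steps above can be sketched as follows. The groove and mileage values are simulated (the course file is not reproduced here), so only the workflow carries over, not the numbers.

```r
set.seed(10)
# Hypothetical groove depth and mileage values; the course file is not reproduced here.
groove  <- seq(400, 100, length.out = 30)
mileage <- 40 - 0.1 * groove + rnorm(30, sd = 1)

fit <- lm(mileage ~ groove)
res <- resid(fit)

# With an intercept, OLS residuals sum to zero by construction.
print(sum(res))

# Q-Q data without drawing; drop plot.it = FALSE to see the plot, then add qqline(res).
qq <- qqnorm(res, plot.it = FALSE)
print(cor(qq$x, qq$y))   # close to 1 when the residuals look normal
```

A plot of res against groove should show a shapeless cloud around zero when a straight line fits well; a visible curve or funnel is the usual sign that 'lm' is not appropriate.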
Results:


plot(k1,res):

 
Q-Q Norm(res):

 
Q-Q Line(res):



We can see that the residuals in the generated plot are not scattered randomly, so a linear model is not applicable in this case.

Assignment #2:
Using data of alpha and pluto, find the following:
1. First find the linear regression:


2. Calculate the residuals:
3. plot(p1,res1):
4. Standard residual:


5. Q-Q Norm(res1):
6. Q-Q Line(res1):


Assignment #3:
Justify Null Hypothesis using ANOVA:
Answer:



We found from the result that p = 0.687.
At the 5% significance level (95% confidence), p > 0.05,
so we cannot reject the Null Hypothesis.
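For reference, here is a minimal sketch of how such a p-value can be pulled out of an ANOVA in R. The three groups are simulated from the same distribution (hypothetical data, not the assignment's), so the null hypothesis of equal means is true by construction.

```r
set.seed(5)
# Three hypothetical groups drawn from the same distribution, so H0 is true.
d <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 20)),
  value = rnorm(60, mean = 10, sd = 2)
)

fit <- aov(value ~ group, data = d)
p   <- summary(fit)[[1]][["Pr(>F)"]][1]   # p-value of the group effect

print(p)
if (p > 0.05) {
  print("Fail to reject H0 at the 5% level")
} else {
  print("Reject H0 at the 5% level")
}
```

A p-value above 0.05 means the data give no evidence against equal group means; it does not prove that the means are equal.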

Tuesday, 15 January 2013

The Matrix Revolutionised!!!

#this post is created as a solution for assignment for IT & Business Applications Lab, Spring Semester, VGSoM, IIT Kharagpur Class of 2014.

As our learning progressed, we explored some more features of R (matrix representation, regression analysis, normal distribution).
Based on our learning, we are submitting the following assignments:
Assignment # 1
a. We have to create two matrices.
b. We have to select the highlighted columns.
c. We have to use the "cbind" command to join those two columns and create a new matrix.
The solution is given below: 
Sol -:
Matrix 1 assignment and generation
> mat1<-c(1:10)
> dim(mat1)<-c(2,5)
Matrix 2 assignment and generation
> mat2<-c(11:16)
> dim(mat2)<-c(2,3)
Taking the 3rd column from matrix 1 and the 2nd column from matrix 2, we use the cbind (column binding) and rbind (row binding) functions as shown -
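Since the snapshot is not reproduced here, the binding step can be sketched as below. The matrices are built with matrix(), which gives the same layout as assigning dim() to a vector; the names byCol and byRow are ours.

```r
mat1 <- matrix(1:10, nrow = 2)    # same layout as c(1:10) with dim c(2, 5)
mat2 <- matrix(11:16, nrow = 2)   # same layout as c(11:16) with dim c(2, 3)

byCol <- cbind(mat1[, 3], mat2[, 2])   # the selected columns side by side
byRow <- rbind(mat1[, 3], mat2[, 2])   # the selected columns stacked as rows
print(byCol)
print(byRow)
```

R fills matrices column by column, so the 3rd column of mat1 is (5, 6) and the 2nd column of mat2 is (13, 14).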



Assignment # 2
We have to Multiply 2 matrices
Sol -:
Command to multiply 2 matrices
> multip <- z1 %*% z2
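A small worked example of %*%, with made-up matrices A and B (z1 and z2 from the command above are not shown in this post, so they are not reused here). It also shows that mat1 (2 x 5) and mat2 (2 x 3) from Assignment 1 cannot be multiplied directly, but the transpose of mat1 can be multiplied with mat2.

```r
A <- matrix(1:4, nrow = 2)
B <- matrix(5:8, nrow = 2)
AB <- A %*% B          # matrix product
print(AB)
print(A * B)           # element-wise product, for contrast

mat1 <- matrix(1:10, nrow = 2)
mat2 <- matrix(11:16, nrow = 2)
P <- t(mat1) %*% mat2  # (5 x 2) times (2 x 3) gives a 5 x 3 matrix
print(dim(P))
```

The number of columns of the left matrix must equal the number of rows of the right one; otherwise R stops with a "non-conformable arguments" error.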


Assignment #3-:
1. To download NSE data dated from 1st Dec, 2012 to 31st Dec, 2012 in the form of a .csv file.
2. To find the regression between the high price and the opening share price and calculate the residuals.
Soln -:
Command for finding the regression:
> reg1<-lm(HighPrice ~ OpenPrice , data = NSEData)
The above arguments are explained below:
NSEData - Object holding the historical data from the file
HighPrice - Dependent variable
OpenPrice - Independent variable
The snapshot of the data collected is given below:

 The Residuals calculated are given below:

Assignment # 4
We have to generate and plot a normal distribution, with an arbitrary mean and standard deviation.
Soln -:
To compute the normal density curve, the function used is -:
dnorm(x, mean, sd)
where x is the vector of points at which the density is evaluated
mean is the mean vector
sd - standard deviation
(To draw normally distributed random numbers instead, the function is rnorm(n, mean, sd), where n is the number of observations.)
The commands run are given below:
We have got the following normal distribution curve for the chosen mean and standard deviation:
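Since the snapshot of the commands is not reproduced here, below is a minimal sketch of one way to do it. The mean of 50 and standard deviation of 10 are arbitrary choices of ours: dnorm evaluates the density on a grid of points, while rnorm draws random numbers from the same distribution.

```r
mu    <- 50     # arbitrary mean
sigma <- 10     # arbitrary standard deviation

# Density curve: evaluate dnorm on a grid covering four standard deviations each side.
x    <- seq(mu - 4 * sigma, mu + 4 * sigma, length.out = 401)
dens <- dnorm(x, mean = mu, sd = sigma)
print(x[which.max(dens)])            # the curve peaks at the mean

# Random draws from the same distribution.
set.seed(99)
draws <- rnorm(1000, mean = mu, sd = sigma)
print(c(mean = mean(draws), sd = sd(draws)))
```

Calling plot(x, dens, type = "l") draws the bell curve, and hist(draws) shows how closely the random sample follows it.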


Tuesday, 8 January 2013

The exordium

Journey Begins.....

R, or rather the R statistical package, is very simply put the open source equivalent of SAS. R can do pretty much everything SAS can do in terms of statistical analysis, and there are some pretty cool things R can do which SAS can't. Say someone wants to build a predictive model using logistic regression: well, R can do it; ARIMA model, yes; decision trees, yes; association rule mining, yes; etc. Many of R's standard functions are written in R itself, which makes it easy for users to follow the algorithmic choices made. It is applied in insurance, finance, marketing, etc.

 In a nutshell, R is here to stay and to grow.

The R project for Statistical Computing
Assignment 1: Draw a histogram after concatenating 3 data points.
Soln : 
Commands used are as under -:
> x<-c(1,2,3)
> plot(x, type = "h")

Assignment 2: Drawing a line graph with points and naming the graph and the axis.  

Soln : We gathered the data from National Stock Exchange web site. Let z be the variable that contains data from the .csv file selected. Reading from the csv file is done as under -:   

> z<-read.csv(file.choose(), header=T)

This command prompts the user to select the data file from the saved location. 

Let zcol1 be the variable that contains the contents of column 3 of the data.

The following commands were used.
> zcol1<-z[,3]
> plot(zcol1 , type="b" , main="NSE Graph" , xlab="Time" , ylab="indices")

Assignment 3: Merge two columns from the table obtained. Create a scatter plot by using share HIGH and LOW values from the NSE Historical data as obtained from the .csv file.
Soln: HIGH values as obtained in the previous question
> zcol1<-z[,3]
LOW values are in column 4 from the csv file
> zcol2<-z[,4]
To plot the scatter plot 
> plot(zcol1,zcol2)


Assignment 4 :
To find the volatility of the merged values obtained from the NSE historical data and obtain the range for the same.
Soln :-
For this, we require the maximum value among the HIGH values and the minimum value among the LOW values.
We merge both columns into one vector variable 'y' to get the HIGH and LOW values together.
> y<-c(zcol1,zcol2)
> summary(y)
 will give the min and the max value as under -:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   4888    5660    5723    5758    5884    6021

> range(y)
will give the desired range of volatility
[1] 4888.20 6020.75
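To turn the two extremes into a single number, the spread between them can be computed as below. This is a small sketch using the values printed above; expressing the spread as a percentage of the low is our own addition.

```r
# Extremes reported by range(y) above.
r <- c(4888.20, 6020.75)
spread <- diff(r)            # max minus min
print(spread)
print(spread / r[1] * 100)   # spread as a percentage of the low
```

On the actual data, diff(range(y)) gives the same spread in one step.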