Acquaintance in the world of R

Thursday, 14 March 2013

Panel data Analysis: An Inception

#this post is created as a solution for assignments given on 13/02/2013 in IT & Business Applications Lab, Spring Semester, VGSoM, IIT Kharagpur Class of 2014.

Panel (data) analysis is a statistical method, widely used in social science, epidemiology, and econometrics, which deals with two-dimensional panel data. The data are usually collected over time for the same individuals, and a regression is then run over these two dimensions. A common panel data regression model looks like $y_{it}=a+bx_{it}+\epsilon_{it}$, where $y$ is the dependent variable, $x$ is the independent variable, $a$ and $b$ are coefficients, and $i$ and $t$ are indices for individuals and time. The error $\epsilon_{it}$ is very important in this analysis: assumptions about the error term determine whether we speak of fixed effects or random effects. In a fixed effects model, $\epsilon_{it}$ is assumed to vary non-stochastically over $i$ or $t$, making the fixed effects model analogous to a dummy variable model in one dimension. In a random effects model, $\epsilon_{it}$ is assumed to vary stochastically over $i$ or $t$.
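
To make the "dummy variable" remark concrete, here is a minimal base R sketch on simulated data (the panel, its sizes and its coefficients are all made up for illustration and are not from the assignment). It suggests that a regression with one dummy per individual gives the same slope as a regression on data demeaned within each individual, while pooled OLS can be biased when the individual effect is correlated with the regressor.

```r
# Simulated balanced panel: 5 individuals observed over 8 periods (hypothetical data).
set.seed(1)
n_id <- 5; n_t <- 8
id <- factor(rep(seq_len(n_id), each = n_t))
alpha <- rep(c(-2, -1, 0, 1, 2), each = n_t)   # individual-specific intercepts
x <- rnorm(n_id * n_t) + alpha                 # regressor correlated with the effect
y <- 1 + 0.5 * x + alpha + rnorm(n_id * n_t, sd = 0.3)

# Pooled OLS ignores the individual effects.
b_pooled <- coef(lm(y ~ x))["x"]

# Fixed effects as a dummy-variable regression (one dummy per individual).
b_lsdv <- coef(lm(y ~ x + id))["x"]

# Equivalent "within" estimator: demean y and x inside each individual.
y_w <- y - ave(y, id)
x_w <- x - ave(x, id)
b_within <- coef(lm(y_w ~ x_w - 1))["x_w"]

print(c(pooled = unname(b_pooled), lsdv = unname(b_lsdv), within = unname(b_within)))
```

The last two numbers should agree up to rounding, which is why the fixed effects model in plm is requested with model = "within".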

We will be using three models for this purpose:

Pooled model
Fixed effects model
Random effects model

Assignment #1:
Do panel data analysis on the data set "Produc" from the package "plm" using the three types of model, and then determine which model is best for this data set by using the following functions:
pFtest : for deciding between fixed and pooled
plmtest : for deciding between pooled and random
phtest : for deciding between random and fixed

Solution: Commands used are:

First we load the data by using following commands:
> data(Produc , package ="plm")
> head(Produc)

Snapshot of commands and result is given below:

The columns of the data set are described below. It contains the following variables:

- state : the state

- year : the year

- pcap: public capital stock

- hwy : highway and streets

- water: water and sewer facilities

- util: other public buildings and structures

- pc: private capital stock

- gsp: gross state products

- emp: labor input measured by the employment in non-agricultural payrolls

- unemp: state unemployment rate

Here, we assume that "pcap" is the dependent variable and the other variables are independent, so we first try to estimate "pcap" using the pooled model.

Commands and snapshot of the result are given below:
> pool <- plm(log(pcap)~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) + log(emp) + log(unemp) , data =Produc, model=("pooling"), index = c("state","year"))
> summary(pool)

Then we try to estimate "pcap" using the fixed effects model.
Commands and snapshot of the result are given below:

> fixed <- plm(log(pcap)~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) + log(emp) + log(unemp) , data =Produc, model=("within"), index = c("state","year"))
> summary(fixed)

Then we try to estimate "pcap" using the random effects model.
Commands and snapshot of the result are given below:

> random <- plm(log(pcap)~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) + log(emp) + log(unemp) , data =Produc, model=("random"), index = c("state","year"))
> summary(random)

Comparison

Each comparison between the models is a hypothesis test. For the first two tests, the null hypothesis supports the pooled model:

H0: Null Hypothesis: the individual- and time-specific parameters are all zero

H1: Alternate Hypothesis: at least one of the individual- or time-specific parameters is non-zero

Pooled vs Fixed

Null Hypothesis: Pooled Model

Alternate Hypothesis: Fixed Effects Model

Command:

> pFtest(fixed,pool)

Result:
data: log(pcap) ~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) +      log(emp) + log(unemp)
F = 56.6361, df1 = 47, df2 = 761, p-value < 2.2e-16

From the result, we can see that the p-value is negligible, so we reject the Null Hypothesis in favour of the Alternate Hypothesis: the Fixed Effects Model is preferred over the Pooled Model.
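
The degrees of freedom in the result are consistent with an ordinary F test of the individual dummies: df1 = 47 is 48 states minus 1, and df2 = 761 is 816 observations minus 48 intercepts minus 7 slopes. As a rough sketch of that calculation (on simulated data with hypothetical sizes, using only base R and not the plm internals), the statistic can be built by hand from the two residual sums of squares:

```r
# Hypothetical balanced panel: 6 individuals, 10 periods, one regressor.
set.seed(2)
n_id <- 6; n_t <- 10
id <- factor(rep(seq_len(n_id), each = n_t))
alpha <- rep(rnorm(n_id, sd = 2), each = n_t)   # individual effects
x <- rnorm(n_id * n_t)
y <- 2 + 0.7 * x + alpha + rnorm(n_id * n_t)

pooled_fit <- lm(y ~ x)        # restricted model: one common intercept
fixed_fit  <- lm(y ~ x + id)   # unrestricted model: one intercept per individual

rss_p <- sum(resid(pooled_fit)^2)
rss_f <- sum(resid(fixed_fit)^2)
df1 <- n_id - 1                      # number of restrictions tested
df2 <- n_id * n_t - n_id - 1         # residual df of the unrestricted model
f_stat <- ((rss_p - rss_f) / df1) / (rss_f / df2)
p_val <- pf(f_stat, df1, df2, lower.tail = FALSE)
print(c(F = f_stat, df1 = df1, df2 = df2, p.value = p_val))
```

A large F (small p-value) means the individual intercepts explain a significant share of the variation that the pooled model leaves in the residuals.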

Pooled vs Random

Null Hypothesis: Pooled Model

Alternate Hypothesis: Random Effects Model

Command :

> plmtest(pool)

Result:

        Lagrange Multiplier Test - (Honda)

data: log(pcap) ~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) +      log(emp) + log(unemp)

normal = 57.1686, p-value < 2.2e-16
alternative hypothesis: significant effects

Since the p-value is negligible, we reject the Null Hypothesis in favour of the Alternate Hypothesis: the Random Effects Model is preferred over the Pooled Model.

Random vs Fixed

Null Hypothesis: No correlation between the individual effects and the regressors (Random Effects Model)

Alternate Hypothesis: Fixed Effects Model

Command:

> phtest(fixed,random)

Result:

        Hausman Test

data: log(pcap) ~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) +      log(emp) + log(unemp)

chisq = 93.546, df = 7, p-value < 2.2e-16
alternative hypothesis: one model is inconsistent

Since the p-value is negligible, we reject the Null Hypothesis in favour of the Alternate Hypothesis: the Fixed Effects Model is preferred over the Random Effects Model.

Conclusion:

So after making all the comparisons we can see that the Fixed Effects Model is preferred over the Pooled Model, the Random Effects Model is preferred over the Pooled Model, and finally the Fixed Effects Model is preferred over the Random Effects Model.

So, we come to the conclusion that the Fixed Effects Model is best suited for the panel data analysis of the "Produc" data set: the individual effects are significant and appear to be correlated with the regressors.

Wednesday, 13 February 2013

Interpreting the Historical Volatility of Market

#this post is created as a solution for assignments given on 13/02/2013 in IT & Business Applications Lab, Spring Semester, VGSoM, IIT Kharagpur Class of 2014.

Assignment #1: 1) create log of returns data (use Closing Price Nifty data, from 01.01.2012 to 01.01.2013) and calculate historical volatility.

Solution: Commands used are:

readData <- read.csv(file.choose(), header = TRUE)
closePrice <- readData[, 5]                       # read the closing price column
closePrice.ts <- ts(closePrice, frequency = 252)  # make a time series (252 trading days per year)
varLag <- lag(closePrice.ts, k = -1)              # stock price at time (t-1)
logReturns <- log(closePrice.ts) - log(varLag)    # log return: log(P_t) - log(P_(t-1))

Snapshot of commands and result is given below:

Now, we calculate Historical volatility as follows:
annualFactor <- sqrt(252)                         # annualise using 252 trading days
histVolatility <- sd(logReturns) * annualFactor
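
Since the Nifty file itself is not included in the post, the same calculation can be tried on a made-up price series. This is only a sketch: the prices are simulated, and 252 trading days per year is the usual convention rather than something the data dictates.

```r
set.seed(3)
# Hypothetical daily closing prices (about one trading year).
closePrice <- 5000 * exp(cumsum(rnorm(250, mean = 0, sd = 0.01)))

logReturns <- diff(log(closePrice))        # log(P_t) - log(P_(t-1))
dailyVol <- sd(logReturns)                 # volatility per trading day
histVolatility <- dailyVol * sqrt(252)     # annualised historical volatility
print(c(daily = dailyVol, annualised = histVolatility))
```

Note that diff(log(closePrice)) is the same quantity as log(P_t / P_(t-1)), so no further division is needed once the logs have been differenced.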

Snapshot of commands and result is given below:

Assignment #2: Create an ACF plot for the log returns data calculated previously and interpret the findings. Also do an ADF test and interpret the findings.

Soln -:

The following command is used to create the ACF plot:

acf(logReturns)

Snapshot of commands and result is given below:

Graphical Interpretation
- The two horizontal dotted lines mark the approximate 95% confidence bounds (the default) around zero autocorrelation.
- As all the correlation spikes (vertical lines) at non-zero lags lie inside those two blue dotted lines, there is no significant autocorrelation in the returns, which is consistent with the returns data being "Stationary" in nature.
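
The visual check can also be done numerically. The sketch below uses a simulated stand-in for the log returns (an assumption, since the original data is not available here) and recomputes the approximate 95% bounds that the ACF plot draws, then lists any lags that fall outside them.

```r
set.seed(4)
logReturns <- rnorm(250, sd = 0.01)            # stand-in for the log returns series

acfRes <- acf(logReturns, plot = FALSE)        # compute the ACF without drawing it
bound <- qnorm(0.975) / sqrt(acfRes$n.used)    # approx. 95% limits used by the plot
acfVals <- acfRes$acf[-1]                      # drop lag 0, which is always 1
outside <- which(abs(acfVals) > bound)         # lags with significant autocorrelation
print(bound)
print(outside)
```

Even for pure noise, about one lag in twenty is expected to cross the bounds by chance, so a single small excursion is not strong evidence of autocorrelation.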

Using the ADF test

The command used (adf.test comes from the "tseries" package) is:
adf.test(logReturns)
We get the following result:

To interpret the result, we construct the hypotheses:
Null Hypothesis -: The returns data is not stationary (it has a unit root)
Alternative Hypothesis -: The returns data is stationary

As the test gives p-value = 0.01, which is less than 0.05 (the 5% significance level), the Null Hypothesis is rejected.

Result -: the given data is stationary in nature

Thursday, 7 February 2013

Exploring Returns and Logit

#this post is created as a solution for assignments given on 05/02/2013 in IT & Business Applications Lab, Spring Semester, VGSoM, IIT Kharagpur Class of 2014.

Assignment #1: Find the returns for NSE data covering more than 6 months, taking the 10th data point as the start and the 95th data point as the end.

Solution:

Steps:

1. Extract the data (from 5/7/2011 to 7/2/2013) in a separate .csv file.
2. Find the Returns applying Time series and Lag
3. Draw the plot.

Commands used:
z <- read.csv(file.choose(), header = TRUE)
Closedata <- z$Close                      # closing prices
Close.ts <- ts(Closedata, deltat = 1/252) # daily series, 252 trading days per year
znew.ts <- ts(Close.ts[10:95])            # 10th to 95th data points
znew.diff <- diff(znew.ts)                # P_t - P_(t-1)
zdenominator <- lag(znew.ts, k = -1)      # P_(t-1); note the lower-case k
Returndata <- znew.diff / zdenominator    # simple return
plot(Returndata, main = "Return data for 10th to 95th day of NSE data downloaded")
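
As a quick sanity check on the lag-based calculation, the sketch below compares it with the plain vector formula (P_t - P_(t-1)) / P_(t-1) on simulated prices. The price series is hypothetical; only the index range 10 to 95 is taken from the assignment.

```r
set.seed(5)
Closedata <- 5000 + cumsum(rnorm(120, sd = 20))   # hypothetical closing prices
znew <- Closedata[10:95]                          # 10th to 95th data points

# Simple return with plain vectors: (P_t - P_(t-1)) / P_(t-1)
returnsVec <- diff(znew) / head(znew, -1)

# Same calculation with ts objects; lag() shifts the time index and the
# division is carried out only where both series overlap in time.
znew.ts <- ts(znew)
returnsTs <- diff(znew.ts) / lag(znew.ts, k = -1)
print(head(cbind(vector = returnsVec, ts = as.numeric(returnsTs))))
```

Both columns should match. The argument name matters: lag() takes a lower-case k, and a misspelt argument is silently ignored so the default shift is used instead.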

Snapshot of Result

Assignment #2: Observations 1-700 are available. Predict the outcome for observations 701-850, using GLM estimation with LOGIT analysis.

Solution:

Commands:

z<-read.csv(file.choose(),header=T)

z1<-z[1:700,1:9]

head(z1)

z1$ed<-factor(z1$ed)

z1.est<-glm(default ~ age + ed + employ + address + income + debtinc + creddebt + othedebt, data=z1, family ="binomial")

summary(z1.est)

forecast<-z[701:850,1:8]

forecast$ed<-factor(forecast$ed)

forecast$probability<-predict(z1.est,newdata=forecast,type="response")

head(forecast)
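
The original data file is not part of the post, so the sketch below repeats the same train-then-predict pattern on simulated data. The two predictors, their coefficients and the 0.5 cut-off are assumptions chosen for illustration only.

```r
set.seed(6)
n <- 850
# Hypothetical stand-in data; only two of the predictors are imitated here.
dat <- data.frame(income = rnorm(n, 50, 15), debtinc = rnorm(n, 10, 5))
p <- plogis(-3 + 0.25 * dat$debtinc - 0.01 * dat$income)   # true default probability
dat$default <- rbinom(n, 1, p)

train <- dat[1:700, ]                               # rows with a known outcome
test  <- dat[701:850, c("income", "debtinc")]       # rows to be predicted

fit <- glm(default ~ income + debtinc, data = train, family = binomial)
test$probability <- predict(fit, newdata = test, type = "response")
test$predicted <- as.integer(test$probability > 0.5)  # 0.5 cut-off is an assumption
head(test)
```

With type = "response" the predictions are probabilities between 0 and 1; without it, predict() returns values on the log-odds scale.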

Tuesday, 22 January 2013

Exploring Regression and ANOVA

#this post is created as a solution for assignments given on 22/01/2013 in IT & Business Applications Lab, Spring Semester, VGSoM, IIT Kharagpur Class of 2014.

Assignment #1: We believe that, for one kind of car, groove (independent variable) is impacting mileage (dependent variable). We have to fit 'lm' and comment on the applicability of 'lm'.

Solution:

Steps:

1. Extract the data in a separate .csv file.
2. Assign Groove and Mileage to separate variables and apply regression on them.
3. Find the residuals and draw the Q-Q plot.
Results:

plot(k1,res):

Q-Q Norm(res):

Q-Q Line(res):

We can see that the residuals in the generated plot are not randomly scattered around zero but follow a visible pattern, so a linear model is not appropriate in this case.
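
Because the commands for this assignment appear only as snapshots, here is a hedged sketch of the same diagnostic steps. The groove and mileage values are simulated with a deliberately curved relationship, so they are not the class data.

```r
set.seed(7)
# Hypothetical data with a curved relationship, so a straight line fits poorly.
groove <- seq(0, 10, length.out = 40)
mileage <- 30 - 0.4 * (groove - 5)^2 + rnorm(40, sd = 0.5)

fit <- lm(mileage ~ groove)   # straight-line fit
res <- resid(fit)             # residuals of the fit

plot(groove, res, main = "Residuals vs groove")   # a pattern here signals misfit
abline(h = 0, lty = 2)
qqnorm(res)                                       # normal Q-Q plot of the residuals
qqline(res)                                       # reference line for the Q-Q plot
```

If the linear model were adequate, the residual plot would look like a structureless cloud around the dashed zero line.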
Assignment #2:
Using data of alpha and pluto, find the following:
1. First find the linear regression:

2. Calculate the residuals:

3. plot(p1,res1):

4. Standard residual:

5. Q-Q Norm(res1):

6. Q-Q Line(res1):

Assignment #3:
Justify Null Hypothesis using ANOVA:
Answer:

We found from the result that p = 0.687.
At the 5% significance level (95% confidence), p > 0.05.
So, we fail to reject the Null Hypothesis.
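
The ANOVA commands are not shown in the post, so the following is only a sketch of how such a p-value can be obtained and read. The three groups are simulated from the same distribution, which is an assumption made for illustration.

```r
set.seed(8)
# Hypothetical data: three groups drawn from the same distribution.
score <- rnorm(30, mean = 50, sd = 5)
group <- factor(rep(c("A", "B", "C"), each = 10))

fit <- aov(score ~ group)                       # one-way ANOVA
pValue <- summary(fit)[[1]][["Pr(>F)"]][1]      # p-value of the group effect
alpha <- 0.05                                   # 5% significance level
if (pValue > alpha) {
  message("Fail to reject H0: no evidence that the group means differ")
} else {
  message("Reject H0: at least one group mean differs")
}
```

Failing to reject the null hypothesis does not prove that the means are equal; it only says the data gives no significant evidence against it.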

Tuesday, 15 January 2013

The Matrix Revolutionised!!!

#this post is created as a solution for assignment for IT & Business Applications Lab, Spring Semester, VGSoM, IIT Kharagpur Class of 2014.

As our learning progressed, we explored some more features of R (matrix representation, regression analysis, normal distribution).
Based on our learning, we are submitting the following assignments:

Assignment # 1

a. We have to create two matrices.

b. We have to select highlighted columns.

c. We have to use "cbind" command to join those two columns and create a new matrix.

The solution is given below:

Sol -:

Matrix 1 assignment and generation
> mat1<-c(1:10)
> dim(mat1)<-c(2,5)
Matrix 2 assignment and generation
> mat2<-c(11:16)
> dim(mat2)<-c(2,3)
Taking 3rd column from matrix1 and 2nd column from matrix 2, we use the cbind(for column binding) and rbind(for row binding) functions as shown -
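
A minimal sketch of that step, reusing the two matrices defined above (the names byCol and byRow are made up for this example):

```r
mat1 <- 1:10;  dim(mat1) <- c(2, 5)   # 2 x 5 matrix, filled column by column
mat2 <- 11:16; dim(mat2) <- c(2, 3)   # 2 x 3 matrix

byCol <- cbind(mat1[, 3], mat2[, 2])  # selected columns side by side (2 x 2)
byRow <- rbind(mat1[, 3], mat2[, 2])  # selected columns stacked as rows (2 x 2)
print(byCol)
print(byRow)
```

Here the 3rd column of mat1 is (5, 6) and the 2nd column of mat2 is (13, 14), so byCol has those as its columns and byRow is its transpose.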

Assignment # 2
We have to Multiply 2 matrices
Sol -:
Command to multiply 2 matrices
> multip <- z1 %*% z2
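
The post does not show how z1 and z2 were defined, so the matrices below are hypothetical. The sketch only illustrates that %*% needs the number of columns of the first matrix to equal the number of rows of the second.

```r
z1 <- matrix(1:6, nrow = 2)    # 2 x 3 matrix (hypothetical values)
z2 <- matrix(1:6, nrow = 3)    # 3 x 2 matrix (hypothetical values)

multip <- z1 %*% z2            # matrix product, result is 2 x 2
print(multip)
```

Using * instead of %*% would attempt an element-wise product, which fails here because the two matrices have different shapes.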

Assignment #3-:
1. To download NSE data dated from 1st Dec, 2012 to 31st Dec, 2012 in the form of a .csv file.
2. To find the regression between the high price and the opening share price and calculate the residuals.
Soln -:
Command for finding the Regression :
> reg1<-lm(HighPrice ~ OpenPrice , data = NSEData)
The above arguments are explained below:
NSEData - Object holding the historical data read from the file
HighPrice - Dependent variable
OpenPrice - Independent variable
The snapshot of the data collected is given below:

The Residuals calculated are given below:

Assignment # 4
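We have to generate and plot a Normal distribution, with an arbitrary mean and standard deviation.
Soln -:
To generate normally distributed random numbers, the function used is -:
rnorm(N, mean, sd)
where N is the number of observations
mean is the mean (or a vector of means)
sd is the standard deviation
The density curve itself is obtained with dnorm(x, mean, sd), where x is the vector of points at which the density is evaluated.
The commands run are given below: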

We have got the following normal distribution curve for the taken mean and standard deviation:
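
Since the commands survive only as a snapshot, here is a small sketch of one way to do it. The mean of 10 and standard deviation of 2 are arbitrary choices for this example.

```r
set.seed(9)
mu <- 10; sigma <- 2                          # arbitrary mean and standard deviation
draws <- rnorm(1000, mean = mu, sd = sigma)   # random sample from the distribution

# Histogram of the sample on the density scale, with the theoretical curve on top.
hist(draws, freq = FALSE, main = "Normal sample with density curve")
curve(dnorm(x, mean = mu, sd = sigma), add = TRUE)
```

rnorm() produces the random sample, while dnorm() evaluates the bell-shaped density, so the two functions play different roles in the plot.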

Tuesday, 8 January 2013

The exordium

Journey Begins.....

R, or rather the R statistical package, is very simply put the open source equivalent of SAS. R can do pretty much everything SAS can do in terms of statistical analysis, and there are some pretty cool things R can do which SAS can't. Say someone wants to build a predictive model using logistic regression: R can do it. ARIMA models, yes; decision trees, yes; association rule mining, yes; etc. Many of R's standard functions are written in R itself, which makes it easy for users to follow the algorithmic choices made. It is applied in insurance, finance, marketing, etc.

In a nutshell, R is here to stay and to grow.

The R project for Statistical Computing

Assignment 1: Draw a histogram after concatenating 3 data points.

Soln :

Commands used are as under -:

> x<-c(1,2,3)

> plot(x, type = "h")

Assignment 2: Drawing a line graph with points and naming the graph and the axis.

Soln : We gathered the data from National Stock Exchange web site. Let z be the variable that contains data from the .csv file selected. Reading from the csv file is done as under -:

> z<-read.csv(file.choose(), header=T)

This command prompts the user to select the data file from the saved location.

Let zcol1 be the variable that contains the contents of column 3 of the CSV data.

The following commands were used:
> zcol1<-z[,3]
> plot(zcol1 , type="b" , main="NSE Graph" , xlab="Time" , ylab="indices")

Assignment 3: Merge two columns from the table obtained. Create a scatter plot by using share HIGH and LOW values from the NSE Historical data as obtained from the .csv file.

Soln : HIGH values, as obtained in the previous question

> zcol1<-z[,3]

LOW values are in column 4 from the csv file

> zcol2<-z[,4]

To plot the scatter plot

> plot(zcol1,zcol2)

Assignment 4 :

To find the volatility between the merged values obtained from NSE historical data and obtain the range for the same.

Soln :-

For this, we would require the maximum value amongst the HIGH values and the minimum values amongst the LOW values.

Merging both the columns into one vector variable 'y' to get the HIGH and LOW values together.

> y<-c(zcol1,zcol2)

> summary(y)

will give the min and the max value as under -:

Min. 1st Qu. Median Mean 3rd Qu. Max.
4888 5660 5723 5758 5884 6021

> range(y)
will give the desired range of volatility
[1] 4888.20 6020.75
