R - Notes
These are my notes from studying R, kept mainly as a reference for myself.
A few pointers for anyone preparing for or studying R:
- Take a quick look at your statistical math basics before proceeding
- Before applying any formula to your base data, try to understand what the formula is and how it was derived (this will make the results easier to understand)
- Use it in tandem with Data Analysis in Excel
- Refer to the cheat sheets available on https://www.rstudio.com/resources/cheatsheets/
- Keep a separate workspace for each module
- There are best practices that can be incorporated while programming in R
- Try and jot notes when and where one can...
- Refer to existing data-sets embedded in R before jumping into a data.gov file
- Refer to R programs written already in Azure ML
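*rnorm() by default has mean 0 and standard deviation 1 (so variance 1)
*head() shows the first 6 elements or rows by default; the printed precision comes from the digits option (default 7)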
*Default settings in R can be modified with the options() function
example:
options(digits = 15)
#will display up to 15 significant digits
#digits must lie between 0 and 22; a larger value gives an error:
#Error in options(digits = 30) :
#  invalid 'digits' parameter, allowed 0...22
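A small sketch of the options() behaviour described above. The variable names `old` and `res` are my own; the previous setting is saved and restored so the session is left unchanged.

```r
# options() returns the previous values, so they can be restored later.
old <- options(digits = 15)
print(pi)                       # pi printed with up to 15 significant digits
current <- getOption("digits")  # 15

# An out-of-range value raises an error; try() lets the script continue.
res <- try(options(digits = 30), silent = TRUE)
failed <- inherits(res, "try-error")  # TRUE

options(old)                    # restore the previous setting
```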
#Infinity Operations
Inf/0 --> Inf
Inf * 0 --> NaN
Inf + 0 + (0/0) --> NaN
Inf + 0 --> Inf
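The arithmetic can be checked directly at the console. A minimal sketch (variable names are mine); note that Inf * 0 is undefined and evaluates to NaN, as does 0/0.

```r
a <- Inf / 0    # Inf
b <- Inf * 0    # NaN: infinity times zero is undefined
c0 <- 0 / 0     # NaN
d <- Inf + 0    # Inf
e <- Inf - Inf  # NaN

# is.nan() and is.finite() are the safe way to test for these values.
is.nan(b)       # TRUE
is.finite(d)    # FALSE
```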
*ls() lists all the variables (objects) stored in the current R workspace at a given point in time
*rm() removes objects from that list
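A quick sketch of ls() and rm() together. The object names `tmp_x` and `tmp_y` are placeholders chosen for this example.

```r
tmp_x <- 1:3
tmp_y <- "hello"

before <- "tmp_x" %in% ls()  # TRUE: tmp_x is in the workspace
rm(tmp_x)                    # remove it
after <- "tmp_x" %in% ls()   # FALSE: tmp_x is gone
still <- "tmp_y" %in% ls()   # TRUE: tmp_y is untouched
```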
*To look up the documentation for a function, type ? followed by the function name:
?c
?runif
?max
*Base R has no rand(); random numbers come from functions such as runif() and rnorm()
*Functions and data structures
sin()
integrate()
plot()
paste()
*Again, single-valued functions and multi-valued functions
*A special vector is called a factor
gl() --> generate factor levels
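To make the factor idea concrete, here is a small sketch with made-up labels. factor() builds a factor from existing values, while gl() generates one from a pattern (n levels, each repeated k times).

```r
# A factor stores integer codes plus a set of levels.
sizes <- factor(c("small", "large", "small", "medium"),
                levels = c("small", "medium", "large"))
levels(sizes)               # "small" "medium" "large"
codes <- as.integer(sizes)  # 1 3 1 2

# gl(n, k): n levels, each repeated k times in a row.
g <- gl(2, 3, labels = c("ctrl", "treat"))
g                           # ctrl ctrl ctrl treat treat treat
```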
*Creating a function in R
test <- function(x)
{
x = 5   # x is overwritten here, so test() always returns 50
return (x*x+(x^2))
}
*for loop in R
*lapply() vs sapply()
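A side-by-side sketch of the three approaches on a toy list (the data and names are mine). The main difference: lapply() always returns a list, while sapply() tries to simplify the result to a vector or matrix.

```r
nums <- list(a = 1:3, b = 4:6)

# Explicit for loop: pre-allocate the result, then fill it in.
totals <- numeric(length(nums))
for (i in seq_along(nums)) {
  totals[i] <- sum(nums[[i]])
}

as_list <- lapply(nums, sum)  # a list with elements a and b
as_vec  <- sapply(nums, sum)  # a named vector with elements a and b
```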
*Binding elements
rbind() --> bind elements into a matrix row by row
cbind() --> bind elements into a matrix column by column
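A minimal sketch with two short vectors (names are mine) showing how the two functions differ in the shape of the result.

```r
a <- 1:3
b <- 4:6

by_row <- rbind(a, b)  # each vector becomes a row: 2 x 3
by_col <- cbind(a, b)  # each vector becomes a column: 3 x 2

dim(by_row)            # 2 3
dim(by_col)            # 3 2
```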
*Every vector/matrix has a data mode....
logical
numeric
character
*Can be found using mode()
*Dimensions in matrices
dim() defines the number of rows and columns in a matrix
*Can be used with dimnames(), rownames(), colnames()
*Navigating through R package libraries is really painful....
*Hmisc --> Harrell Miscellaneous... Contains many functions useful for data analysis, high-level graphics, utility operations, functions for computing sample size and power, importing and annotating datasets, imputing missing values, advanced table making, variable clustering, character string manipulation, conversion of R objects to LaTeX code, and recoding variables.
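*The R working directory is where R looks for files given by relative paths (the search path, shown by search(), is the list of attached packages)
getwd() --> get working directory
setwd() --> set working directory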
*To read in a file in table format:
testfile <- read.table("filename")
read.csv
read.csv2
read.fwf (fixed width file)
*readLines() --> reads a file line by line into a character vector
scan() --> reads the content of a file into a list or vector
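A self-contained sketch of these readers, using a temporary CSV file created on the fly. The file contents and variable names are made up for illustration.

```r
# Write a tiny CSV to a temporary file so the example is self-contained.
path <- tempfile(fileext = ".csv")
writeLines(c("id,score", "1,9.5", "2,7.25"), path)

df  <- read.csv(path)                              # header and comma separator by default
df2 <- read.table(path, header = TRUE, sep = ",")  # same file via read.table
lines <- readLines(path)                           # one string per line
vals  <- scan(path, what = character(), sep = ",", quiet = TRUE)  # flat vector of fields

unlink(path)                                       # clean up the temporary file
```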
*file() connections can create connections to files for read/write purposes
write.table(line,file="myfile",append=TRUE)
f1 <- file("myfile")
close(f1) --> close the file connection
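A minimal sketch of the open/write/close cycle for a file connection, using a temporary file and the open = "w" mode. The names `con` and `path` are mine.

```r
path <- tempfile()

con <- file(path, open = "w")          # open a connection for writing
writeLines(c("first", "second"), con)  # write two lines through it
close(con)                             # always close the connection when done

written <- readLines(path)             # read the lines back to check
unlink(path)
```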
write.table(dataFieldName,filename)
write.csv
write.csv2
base::sink --> send R output to a file
dump()
dput() --> save complicated R objects (in ASCII format)
dget() --> inverse of dput()
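A round-trip sketch for dput() and dget(): the object is written as R source text and read back. The example object is made up.

```r
obj <- list(name = "faithful", dims = c(272L, 2L))

path <- tempfile(fileext = ".R")
dput(obj, file = path)            # writes an R expression that recreates obj
restored <- dget(path)            # evaluates that expression

same <- identical(obj, restored)  # TRUE
unlink(path)
```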
*file() in conjunction with the open="w" option opens a connection for writing
R has its own internal binary object format
use save() & load() for binary format
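A sketch of the binary save/load round trip. load() restores objects under their original names and returns those names. The file path and object are placeholders.

```r
x <- c(1.5, 2.5)

path <- tempfile(fileext = ".RData")
save(x, file = path)   # write x in R's binary format
rm(x)                  # remove it from the workspace

loaded <- load(path)   # restores x and returns the names it loaded
loaded                 # "x"
unlink(path)
```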
*RODBC Package
Common Functions
odbcDriverConnect(Connection)
sqlQuery()
sqlTables()
sqlFetch()
sqlColumns()
close(Connection)
*Specify the version of the driver (TDS_Version=8.0) and which port to use (default: 1433).
Ex:
sqlFetch(conn,"Tablename")
query <- "select t1, t2 from test"
sqlQuery(conn,query)
sqlColumns(conn,"Tablename")
sqlColumns(conn,"Tablename")[c("COLUMN_NAME","TYPE_NAME")]
Check the dimensions of a table using dim()
*summary() --> gives a range of stats on the underlying vector, list, matrix
Which function should you use to display the structure of an R object?
str()
log(dataframe) to investigate the data on a log scale
Calculate Groups
tapply()
aggregate()
by()
attach()
detach()
Convert counts to relative frequencies using prop.table()
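A short sketch of the grouping functions on the built-in mtcars data set (my choice of data; the notes do not name one). It computes mean mpg per cylinder count two ways and turns a count table into proportions.

```r
# Mean mpg per number of cylinders, two ways.
means <- tapply(mtcars$mpg, mtcars$cyl, mean)           # named array
agg   <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean) # data frame

# Counts per group, then relative frequencies.
counts <- table(mtcars$cyl)
props  <- prop.table(counts)
sum(props)                                              # proportions sum to 1
```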
Simulations in R
MCMC (Markov Chain Monte Carlo)
Encryption
Performance Testing
Drawback --> Uncertainty
Pseudo Random Number Generator - The Mersenne Twister
Mersenne Prime
set.seed(number)
rnorm(3)
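Why set.seed() matters: the generator is deterministic, so resetting the seed reproduces the same draws. A minimal sketch (the seed value 42 is arbitrary).

```r
set.seed(42)
first <- rnorm(3)

set.seed(42)                       # same seed again
second <- rnorm(3)

same <- identical(first, second)   # TRUE: the sequence is reproducible
```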
Uniform distribution - runif(5,min=1,max=2)
Normal distribution - rnorm(5,mean=2,sd=1)
Gamma distribution - rgamma(5,shape=2,rate=1)
Binomial distribution - rbinom(5,size=100,prob=.3)
Multinomial distribution - rmultinom(5,size=100,prob=c(.2,.4,.7))
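A sketch checking the shape of what these generators return, using the same arguments as above (the seed is arbitrary). rmultinom() returns one column per draw, and its prob vector is normalised internally, so it does not have to sum to 1.

```r
set.seed(1)

u <- runif(5, min = 1, max = 2)        # 5 values between 1 and 2
b <- rbinom(5, size = 100, prob = .3)  # 5 counts between 0 and 100

# One column per draw, one row per category.
draws <- rmultinom(5, size = 100, prob = c(.2, .4, .7))
dim(draws)                             # 3 5
colSums(draws)                         # each draw allocates all 100 trials
```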
Regression:
eruption.lm = lm(eruptions ~ waiting, data=faithful)
coeffs = coefficients(eruption.lm)
coeffs
coeffs[1]
coeffs[2]
waiting = 80 # the waiting time
duration = coeffs[1] + coeffs[2]*waiting
duration # predicted value
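The same prediction can also be obtained with predict(), which avoids indexing the coefficients by hand. A sketch on the built-in faithful data; `newdata`, `manual` and `pred` are names I chose.

```r
eruption.lm <- lm(eruptions ~ waiting, data = faithful)

# Manual prediction from the coefficients, as in the notes.
coeffs <- coefficients(eruption.lm)
manual <- coeffs[1] + coeffs[2] * 80

# predict() needs a data frame whose column names match the model's predictors.
newdata <- data.frame(waiting = 80)
pred <- predict(eruption.lm, newdata)

agree <- isTRUE(all.equal(unname(manual), unname(pred)))  # TRUE
```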
Load ggplot2 using library(ggplot2)
Compare models using ANOVA
X1 <- lm(y ~ x1 + x2 + x3 + x4, data=mydata)
Y1 <- lm(y ~ x1 + x2, data=mydata)
anova(X1, Y1)
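Since mydata above is a placeholder, here is a runnable sketch of the same comparison on the built-in mtcars data. The choice of predictors is arbitrary and only for illustration. The two models are nested, so anova() reports an F test for the extra terms.

```r
full    <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
reduced <- lm(mpg ~ wt + hp, data = mtcars)

# Row 2 tests whether the extra predictors in the full model improve the fit.
cmp <- anova(reduced, full)
cmp
```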