Intro to R

R was written by the statisticians Ross Ihaka and Robert Gentleman.

"Data" is the plural of "datum".

Questions to ask before ingesting data can be abbreviated as FLOPS:
 Format of your data?
 Location of data?
 One-time activity?
 Preprocessing required?
 Size of the dataset?

Is your data variable categorical or numerical?
Is your data normally distributed or skewed?

General Tips

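  • If you ever see a plus sign in your R console instead of the greater-than sign, R is waiting for you to finish an incomplete command (for example, an unclosed parenthesis or quote). Hit the Escape key to cancel and get back to the prompt.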
  • To get help on any function, type a question mark and then the name of the function (for example, ?sqrt).
  • To add new observations to a data frame use the rbind() function.
  • Data Structures: general examples are arrays (or lists), hash tables, structs, and sets; in R the core ones are vectors, lists, matrices, and data frames.
  • Variable Names: they are case-sensitive, don’t start them with a number, and don’t use spaces.
  • Vectors: created with the c() function, or the seq() function.
  • Data frames: created with the data.frame() function, or by reading in a csv file with the read.csv() function.
  • Observation: row of values within a data frame
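
The tips above on vectors, data frames, and rbind() can be tried together in a minimal sketch. The data frame name students, the column names Name and Score, and all values are made up for illustration.

```r
# Vectors: c() combines values, seq() generates a regular sequence
scores = c(90, 85, 72)
evens = seq(2, 10, by = 2)    # 2 4 6 8 10

# Data frame: one row per observation, one column per variable
# (Name and Score are hypothetical column names)
students = data.frame(Name = c("Ann", "Bob", "Cy"), Score = scores)

# rbind() appends a new observation; its column names must match
students = rbind(students, data.frame(Name = "Dee", Score = 88))

nrow(students)    # number of observations after the append
```

Because variable names are case-sensitive, students and Students would be two different objects.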

Reading in Files – To read in a csv file, first set the working directory (or specify full path in the read command):

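setwd("/path../") # set the working directory to the folder that contains the file
getwd() # print the current working directory to confirm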
DataFrame = read.csv("filename.csv") # OR
DataFrame = read.csv("/path../filename.csv")
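
To try this without a real file, one option is to write a small throwaway csv to a temporary location first and read it back. This is only a sketch: the columns Variable1 and Variable2 and their values are placeholders, not real data.

```r
# Create a small csv in a temporary file so the example is self-contained
tmp = tempfile(fileext = ".csv")
write.csv(data.frame(Variable1 = c(5, 150, 12), Variable2 = c(1, 0, 1)),
          tmp, row.names = FALSE)

# Read it back using the full path, so no setwd() call is needed
DataFrame = read.csv(tmp)

# file.path() joins folder and file name with the correct separator
file.path(getwd(), "filename.csv")
```

Passing row.names = FALSE to write.csv() keeps R from writing an extra column of row numbers, which would otherwise come back as an unnamed first column.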

Summarizing and Subsetting Data – To summarize a data frame, use the str() or summary() functions:

str(DataFrame)
summary(DataFrame)

To create a subset of a data frame, use the subset function:

NewDataFrame = subset(DataFrame, Variable1 <= 20 | Variable2 == 1)
NewDataFrame = subset(DataFrame, Variable1 > 100)
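NewDataFrame = subset(DataFrame, Variable1 < 10 & Variable2 != 1)

The pipe symbol means “or”, the ampersand symbol means “and”, the double-equals means “exactly equal to”, and the exclamation followed by an equals sign means “not equal to”. Use the single forms | and & inside subset(), because they compare row by row; the doubled forms || and && expect a single TRUE or FALSE on each side and are meant for if statements.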
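
A minimal sketch of these operators on a five-row data frame follows. The values are made up for illustration, and the variable names either, both, and both2 are hypothetical.

```r
DataFrame = data.frame(Variable1 = c(5, 150, 12, 8, 30),
                       Variable2 = c(1, 0, 1, 2, 1))

# "or": keep a row if at least one condition holds (rows 1, 3, 4, 5)
either = subset(DataFrame, Variable1 <= 20 | Variable2 == 1)

# "and": keep a row only if both conditions hold (row 4 only)
both = subset(DataFrame, Variable1 < 10 & Variable2 != 1)

# The same "and" filter written with bracket indexing
both2 = DataFrame[DataFrame$Variable1 < 10 & DataFrame$Variable2 != 1, ]
```

subset() lets you write column names directly, while bracket indexing needs the DataFrame$ prefix on each one.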

Basic Data Analysis – Some useful functions for computing statistics about a variable:

mean(DataFrame$Variable1)
sd(DataFrame$Variable1)
summary(DataFrame$Variable1)
which.max(DataFrame$Variable1)
which.min(DataFrame$Variable1)
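
Note that which.max() and which.min() return the position of the largest or smallest value, not the value itself. The sketch below uses made-up numbers to show how to go from that position back to the observation.

```r
DataFrame = data.frame(Variable1 = c(4, 8, 15, 16, 23, 42))

mean(DataFrame$Variable1)    # arithmetic mean: 18
sd(DataFrame$Variable1)      # sample standard deviation (n - 1 denominator)

# which.max() gives the row index of the largest value
i = which.max(DataFrame$Variable1)    # 6
DataFrame$Variable1[i]                # the value at that index: 42
DataFrame[i, , drop = FALSE]          # the whole observation as a one-row data frame
```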

Scatterplots:

plot(DataFrame$Variable1, DataFrame$Variable2)
plot(DataFrame$Variable1, DataFrame$Variable2, xlab="X-axis Label", ylab="Y-axis Label", main="Main Title of Plot")

Histograms:

hist(DataFrame$Variable1)
hist(DataFrame$Variable1, xlab="X-axis Label", ylab="Y-axis Label", main="Main Title of Plot")

Boxplots:

boxplot(DataFrame$Variable1)
boxplot(DataFrame$Variable1 ~ DataFrame$Variable2)
boxplot(DataFrame$Variable1 ~ DataFrame$Variable2, xlab="X-axis Label", ylab="Y-axis Label", main="Main Title of Plot")
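
If you want to keep a plot rather than just view it, one approach is to open a file device before plotting and close it afterwards. The sketch below assumes a pdf output is acceptable; the values and group labels are borrowed from the tapply() example in this post, and the file name out is hypothetical.

```r
DataFrame = data.frame(
  Variable1 = c(2, 4, 6, 7, 9, 11, 12, 15, 16, 17),
  Variable2 = c("a", "a", "a", "a", "b", "b", "b", "c", "c", "c")
)

out = tempfile(fileext = ".pdf")   # hypothetical output location
pdf(out)                           # open a pdf device; plots now go to the file

hist(DataFrame$Variable1, xlab = "Variable1", main = "Histogram of Variable1")

# The data argument lets the formula use bare column names
boxplot(Variable1 ~ Variable2, data = DataFrame,
        xlab = "Group", ylab = "Variable1", main = "Variable1 by group")

invisible(dev.off())               # close the device so the file is written
file.exists(out)
```

Each plotting call lands on its own page of the pdf.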

Summary Tables

Tables of counts:

table(DataFrame$Variable1)
table(DataFrame$Variable2 == 1)
table(DataFrame$Variable1, DataFrame$Variable2)

Table of summary statistics (like pivot tables in Excel):

tapply(DataFrame$Variable1, DataFrame$Variable2, mean)
tapply(DataFrame$Variable1, DataFrame$Variable2, min, na.rm=TRUE)
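
The na.rm=TRUE argument matters whenever a group contains missing values. The following sketch uses made-up values with one NA to show the difference; the names without_rm and with_rm are hypothetical.

```r
DataFrame = data.frame(Variable1 = c(10, NA, 30, 40, 50),
                       Variable2 = c(1, 1, 2, 2, 2))

table(DataFrame$Variable2)         # counts per value: two 1s and three 2s
table(DataFrame$Variable2 == 1)    # counts of FALSE (3) and TRUE (2)

# Without na.rm, any group that contains an NA returns NA
without_rm = tapply(DataFrame$Variable1, DataFrame$Variable2, min)

# Extra arguments such as na.rm = TRUE are passed on to min()
with_rm = tapply(DataFrame$Variable1, DataFrame$Variable2, min, na.rm = TRUE)
```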

External libraries – External libraries (packages) let the user add new features to the standard R environment. To install a library, use the install.packages command:

install.packages("ROCR")

This downloads and installs the library, which only needs to be done once. You still need to load it explicitly in every new R session with the library command:

library(ROCR)
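
Since installation is needed only once but loading is needed in every session, a script can check before installing. The helper below is a hypothetical sketch (ensure_package is not part of R); it is demonstrated with "stats", which ships with R, so nothing is downloaded.

```r
# Hypothetical helper: install a package only when it is not already available
ensure_package = function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)    # needs network access and a CRAN mirror
  }
  # character.only = TRUE lets library() accept the name as a string
  library(pkg, character.only = TRUE)
}

ensure_package("stats")      # already installed, so this only loads it
```

Calling ensure_package("ROCR") would work the same way, installing the package first if it is missing.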

More on tapply() using a very simple example – Say x is a vector of 10 integers: 2, 4, 6, 7, 9, 11, 12, 15, 16, 17. Say g holds the group (a, b, or c) associated with each element of x, so g is a, a, a, a, b, b, b, c, c, c.

x=c(2, 4, 6, 7, 9, 11, 12, 15, 16, 17)
g=c("a","a","a","a","b","b","b","c","c","c")

To show mean of each group we will use this tapply() formula

tapply(x, g, mean)

The result will be grouped by g

       a        b        c
 4.75000 10.66667 16.00000

To show max (or min or range) by group we will use the following tapply() formula

tapply(x, g, max)

The result will be the max of group a, group b and group c

         a  b  c
         7 12 17

Condition in tapply formula: To count the entries of x in each group g that are greater than 9, pass the condition x>9 to tapply(). Each TRUE counts as 1 and each FALSE as 0, so sum gives the count:

tapply(x>9, g, sum)

The result will count the total number of entries x in each group that are greater than 9

         a b c
         0 2 3
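
The same x and g work with other functions too. As a brief sketch, range returns two values per group, so tapply() gives back a list instead of a plain vector, and mean applied to a condition gives a proportion rather than a count. The names r, p, and n are hypothetical.

```r
x = c(2, 4, 6, 7, 9, 11, 12, 15, 16, 17)
g = c("a", "a", "a", "a", "b", "b", "b", "c", "c", "c")

# range returns c(min, max), so the result holds one pair per group
r = tapply(x, g, range)
r[["a"]]                    # 2 7

# Mean of TRUE/FALSE values is the share of entries greater than 9
p = tapply(x > 9, g, mean)  # a = 0, b = 2/3, c = 1

# length gives the number of entries in each group
n = tapply(x, g, length)    # a = 4, b = 3, c = 3
```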