R was written by statisticians **Ross Ihaka** and **Robert Gentleman**.

*Data* is the plural of *datum*.

Questions to ask before ingesting data can be abbreviated as *FLOPS*:

**F**ormat of your data?

**L**ocation of the data?

**O**ne-time activity?

**P**reprocessing required?

**S**ize of the dataset?

Is your data variable categorical or numerical?

Is your data normally distributed or skewed?
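The two questions above can be checked quickly in base R. The following is a minimal sketch on a made-up data frame (the column names `income` and `region` are hypothetical); comparing the mean with the median is only a rough heuristic for skew, not a formal test.

```r
# Hypothetical example data: one numeric and one categorical column
df <- data.frame(
  income = c(21, 23, 25, 26, 28, 30, 95),
  region = factor(c("north", "south", "south", "north", "east", "east", "north"))
)

# Numerical or categorical?
is.numeric(df$income)   # TRUE
is.factor(df$region)    # TRUE

# Rough skew check: a mean well above the median hints at right skew
mean(df$income)
median(df$income)

# Shapiro-Wilk normality test from the stats package that ships with R;
# a small p-value suggests the values are not normally distributed
p_value <- shapiro.test(df$income)$p.value
p_value
```

A histogram of the same column is usually the quickest visual check of its shape.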

**General Tips**

- If you see a plus sign (+) in your R console instead of the greater-than sign (>), R is waiting for you to finish an incomplete command. Press the Escape key (or Ctrl+C in a terminal) to cancel it and return to the prompt.
- To get help on any function, type a question mark followed by the name of the function (for example, *?sqrt*).
- To add new observations to a data frame, use the **rbind**() function.
- **Data Structure:** for example arrays (or lists), hash tables, structs, and sets.
- **Variable Names:** they are case-sensitive; don’t start them with a number, and don’t use spaces.
- **Vectors:** created with the c() function or the seq() function.
- **Data frames:** created with the data.frame() function, or by reading in a csv file with the read.csv() function.
- **Observation:** a row of values within a data frame.
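As a small illustration of the tips above, here is a sketch that builds vectors, a data frame, and then appends one observation with rbind(). The names and values (`people`, `ages`, and so on) are made up for the example.

```r
# Vectors: c() combines values, seq() generates a regular sequence
ages  <- c(25, 31, 47)
steps <- seq(0, 10, by = 5)   # 0, 5, 10

# Data frame: one row per observation, one column per variable
people <- data.frame(name = c("Ann", "Bob", "Cy"), age = ages)

# rbind() appends a new observation; its column names must match
new_row <- data.frame(name = "Dee", age = 52)
people  <- rbind(people, new_row)

nrow(people)   # 4
```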

**Reading in Files –** To read in a csv file, first set the working directory (or specify the full path in the read command):

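    setwd("/path..")   # set the working directory (pass the directory path as a string)
    getwd()            # confirm the current working directory
    DataFrame = read.csv("filename.csv")
    # OR
    DataFrame = read.csv("/path../filename.csv")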
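To try this without an existing file, the sketch below writes a tiny csv to a temporary location and reads it back with a full path. The file contents and the names `csv_path` and `scores` are invented for the example.

```r
# Write a tiny csv to a temporary file (hypothetical data)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,score", "1,90", "2,85", "3,77"), csv_path)

# Read it back using the full path, so no setwd() call is needed
scores <- read.csv(csv_path)

str(scores)   # 3 obs. of 2 variables
getwd()       # the directory that relative file names resolve against
```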

**Summarizing and Subsetting Data –** To summarize a data frame, use the **str** or **summary** functions:

    str(DataFrame)
    summary(DataFrame)

To create a subset of a data frame, use the **subset** function:

    NewDataFrame = subset(DataFrame, Variable1 <= 20 | Variable2 == 1)
    NewDataFrame = subset(DataFrame, Variable1 > 100)
    # Use the single & here: && does not work element-wise on columns
    NewDataFrame = subset(DataFrame, Variable1 < 10 & Variable2 != 1)

The pipe symbol (|) means “or”, the ampersand symbol (&) means “and”, the double-equals (==) means “exactly equal to”, and the exclamation followed by an equals sign (!=) means “not equal to”. Inside subset, use the single | and &, which compare row by row.
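A minimal sketch of how “or” and “and” behave on a small, made-up data frame. It uses the single `|` and `&` operators, which work element-wise across rows.

```r
# Hypothetical data with the same column names as the examples above
df <- data.frame(
  Variable1 = c(5, 8, 50, 150),
  Variable2 = c(1, 2, 1, 2)
)

# "or": keep rows where either condition holds
either <- subset(df, Variable1 <= 20 | Variable2 == 1)

# "and": keep rows where both conditions hold
both <- subset(df, Variable1 < 10 & Variable2 != 1)

nrow(either)   # 3
nrow(both)     # 1
```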

**Basic Data Analysis –** Some useful functions for computing statistics about a variable:

    mean(DataFrame$Variable1)
    sd(DataFrame$Variable1)
    summary(DataFrame$Variable1)
    which.max(DataFrame$Variable1)
    which.min(DataFrame$Variable1)
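One detail worth illustrating: which.max() and which.min() return the position of the extreme value, not the value itself, so they are handy for looking up the matching row. A short sketch on made-up data (`city` and `temp` are hypothetical columns):

```r
df <- data.frame(city = c("A", "B", "C"), temp = c(18.5, 24.1, 21.0))

# Position of the largest temperature, not the temperature itself
hottest <- which.max(df$temp)   # 2

# Use the position to look up other columns or the whole row
df$city[hottest]                # "B"
df[which.min(df$temp), ]        # the row for city "A"
```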

**Scatterplots:**

    plot(DataFrame$Variable1, DataFrame$Variable2)
    plot(DataFrame$Variable1, DataFrame$Variable2, xlab="X-axis Label", ylab="Y-axis Label", main="Main Title of Plot")

**Histograms:**

    hist(DataFrame$Variable1)
    hist(DataFrame$Variable1, xlab="X-axis Label", ylab="Y-axis Label", main="Main Title of Plot")

**Boxplots:**

    boxplot(DataFrame$Variable1)
    boxplot(DataFrame$Variable1 ~ DataFrame$Variable2)
    boxplot(DataFrame$Variable1 ~ DataFrame$Variable2, xlab="X-axis Label", ylab="Y-axis Label", main="Main Title of Plot")

**Summary Tables**

**Tables of counts:**

    table(DataFrame$Variable1)
    table(DataFrame$Variable2 == 1)
    table(DataFrame$Variable1, DataFrame$Variable2)

Table of summary statistics (like pivot tables in Excel):

    tapply(DataFrame$Variable1, DataFrame$Variable2, mean)
    tapply(DataFrame$Variable1, DataFrame$Variable2, min, na.rm=TRUE)
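The na.rm=TRUE argument is passed through tapply() to the summary function. The sketch below, on invented `sales` and `region` vectors, shows why it matters when a group contains a missing value.

```r
# Hypothetical data with one missing value in the "east" group
sales  <- c(10, NA, 30, 40, 50)
region <- c("east", "east", "east", "west", "west")

# Without na.rm, any group containing NA summarizes to NA
with_na <- tapply(sales, region, mean)

# Extra arguments after the function are handed on to mean()
clean <- tapply(sales, region, mean, na.rm = TRUE)

with_na   # east is NA, west is 45
clean     # east is 20, west is 45
```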

**External libraries –** External libraries (packages) let you add new features to the standard R environment. To install a library, use the install.packages command:

    install.packages("ROCR")

This downloads and installs the library. Installing is needed only once, but you still need to load the library explicitly in each R session by using the library command:

    library(ROCR)
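A common pattern is to install a library only when it is missing, then load it. The helper below is a sketch of that idea; `load_or_install` is a hypothetical name, not a built-in function, and it assumes a CRAN mirror is configured if an install is actually needed.

```r
# Hypothetical helper: install a package only if it is not already available
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # character.only = TRUE lets library() accept the name as a string
  library(pkg, character.only = TRUE)
}

# "stats" ships with R, so this call loads it without downloading anything
load_or_install("stats")
```

Calling load_or_install("ROCR") would follow the same path, downloading the library on first use.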

**More on tapply() using a very simple example –** Say x is a vector of 10 integers: 2, 4, 6, 7, 9, 11, 12, 15, 16, 17. Say g holds the group (a, b, or c) associated with each element of x, so g is a, a, a, a, b, b, b, c, c, c.

    x = c(2, 4, 6, 7, 9, 11, 12, 15, 16, 17)
    g = c("a","a","a","a","b","b","b","c","c","c")

To show the **mean** of each group, use this tapply() call:

    tapply(x, g, mean)

The result will be grouped by g

           a        b        c 
     4.75000 10.66667 16.00000 

To show the **max** (or **min** or **range**) by group, use the following tapply() call:

    tapply(x, g, max)

The result will be the max of group a, group b and group c

     a  b  c 
     7 12 17 

**Condition in tapply formula**: To count the entries of x in each group g that satisfy a condition such as x>9, apply sum to the logical vector x>9 (TRUE counts as 1 and FALSE as 0):

    tapply(x>9, g, sum)

The result is the number of entries of x in each group that are greater than 9

    a b c 
    0 2 3 
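Two variations on the same x and g may be useful. This is a sketch rather than an exhaustive list: taking the mean of a logical condition gives the share of entries that satisfy it, and tapply() also accepts a custom function.

```r
x <- c(2, 4, 6, 7, 9, 11, 12, 15, 16, 17)
g <- c("a", "a", "a", "a", "b", "b", "b", "c", "c", "c")

# Share of entries greater than 9 in each group:
# group a has none, group b has 2 of 3, group c has all
share <- tapply(x > 9, g, mean)

# Custom function: spread of each group (max minus min)
spread <- tapply(x, g, function(v) max(v) - min(v))

share
spread   # a 5, b 3, c 2
```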