Lab 3: Data Understanding and Data Preparation
Data preparation, data munging, and related tasks could fill a class of their own.
Here we will go over some of the basics of the relevant data functions in R.
One of the relevant things to know is how to get help!
    ?cbind
    help(cbind)
Vector
A single set of values in a particular order.
We can create a vector using the concatenate command (c). Let's say we want to capture the ages for 4 students.
    ages<-c(18,19,18,23)
To see this we can just type the name of the object:
    ages
To pick a specific value, we can indicate its position in square brackets, for example the second value:
    ages[2]
Or a range:
    ages[2:4]
Other Vectors
    names<-c("Sally", "Jason", "Bob", "Susy") #Text
    female<-c(TRUE, FALSE, FALSE, TRUE)
    grades<-c(20, 15, 13, 19) #25 points possible
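As a minimal sketch of the indexing described above, the following uses a hypothetical vector `heights` (not part of the lab data) to show positional, range, and negative indexing. Note that R counts positions from 1.

```r
# Hypothetical example vector (not part of the lab data)
heights <- c(160, 172, 181, 158)

second   <- heights[2]        # single element by position (R counts from 1)
last_two <- heights[3:4]      # a range of positions
no_first <- heights[-1]       # a negative index drops that position
picked   <- heights[c(1, 4)]  # arbitrary positions, given as another vector

print(second)
print(last_two)
print(no_first)
print(picked)
```

The same bracket syntax works for character and logical vectors.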
We can apply functions to vectors:
    names.length<-nchar(names) #Calc number of letters and return vector
    names.length #Prints integer vector
You can also include logic:
    names.length.gt4<-nchar(names) > 4
    names.length.gt4
1. What type of vector is returned? Describe the function.
Or:
    names.length.gt4b<-names.length > 4
    names.length.gt4b
2. What type of vector is returned? Describe the function.
Look at the selections we can make when combining vectors:
    names[female]
3. What type of vector is returned, and what content?
    names[names.length.gt4]
4. What type of vector is returned, and what content?
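The pattern of building a TRUE/FALSE vector and then using it to select elements can be sketched on separate data. The vector `fruits` below is hypothetical and chosen so that it does not overlap with the lab questions.

```r
# Hypothetical data, unrelated to the lab vectors
fruits <- c("fig", "banana", "kiwi", "cherry")

is_long <- nchar(fruits) > 4    # one TRUE/FALSE per element
long_fruits <- fruits[is_long]  # keeps only the elements where is_long is TRUE

print(class(is_long))    # "logical"
print(long_fruits)       # "banana" "cherry"
print(which(is_long))    # positions of the TRUE values: 2 4
print(sum(is_long))      # TRUE counts as 1, so this counts matches: 2
```

Selecting with a logical vector keeps the type of the vector being selected from; only the length changes.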
We can also do math on the entire vector at once. Our grades were out of 25; let's curve them by 3 points.
    curve<-grades+3
Now we can calculate a percentage:
    percent<-grades/25*100 #same as grades*4
    percent
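To make the element-wise arithmetic concrete, here is a small sketch with a hypothetical `scores` vector (again out of 25). It shows that a single number is recycled across the whole vector and that dividing by 25 and multiplying by 100 gives the same result as multiplying by 4.

```r
# Hypothetical scores out of 25 (not the lab's grades vector)
scores <- c(10, 25, 17)

bonus   <- scores + 3          # the scalar 3 is applied to every element
pct     <- scores / 25 * 100   # evaluated left to right: (scores / 25) * 100
pct_alt <- scores * 4          # algebraically the same calculation

print(bonus)
print(pct)
print(all.equal(pct, pct_alt)) # compares with a small numeric tolerance
```

`all.equal()` is used instead of `==` because floating-point results can differ in the last digits.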
We can also take the log:
    logpercent<-log(percent)
Create a matrix by combining vectors:
    mat<-cbind(ages, grades)
    mat #show entire matrix
Matrix elements can be selected with mat[row,column]:
    mat[2,1] #Row=2, Column=1
    mat[1,] #Row=1 and all columns
    mat[,1] #Column=1, all rows
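The row and column selection rules can be tried on a small hypothetical matrix `m`. The sketch also shows selecting a column by name and the `drop = FALSE` argument, which keeps the result as a matrix instead of simplifying it to a vector.

```r
# Hypothetical matrix built the same way as in the lab, from two vectors
m <- cbind(width = c(1, 2, 3), height = c(10, 20, 30))

print(dim(m))                  # number of rows, then number of columns
print(m[2, 1])                 # row 2, column 1
print(m[1, ])                  # all columns of row 1, returned as a vector
print(m[, "height"])           # a column can also be selected by name

row_as_matrix <- m[1, , drop = FALSE]  # keep the 1 x 2 matrix shape
print(dim(row_as_matrix))
```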
Now let's combine data of different types
5. All elements of a matrix have to be of the same type. View the matrix below. Do you see potential issues with forcing all data to be of type string?
    mat2<-cbind(names, ages, grades)
    mat2
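One way to explore what happens when types are mixed is to inspect the result with `typeof()`. The sketch below uses hypothetical vectors `labels` and `counts`, not the lab data.

```r
# Hypothetical vectors of two different types
labels <- c("a", "b")
counts <- c(1, 2)

mixed <- cbind(labels, counts)      # a matrix can hold only one type
print(typeof(mixed))                # "character": the numbers became strings

count_col <- mixed[, "counts"]
print(is.character(count_col))      # TRUE, so arithmetic on it would fail

# The values have to be converted back before doing any math
total <- sum(as.numeric(count_col))
print(total)
```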
A data frame is a more flexible format and one we will use for the majority of our analyses.
    df1<-data.frame(names,ages,grades) #pass the vectors directly; cbind() would first coerce them all to strings
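The flexibility of a data frame shows up in the column types. As a sketch with hypothetical vectors (`item`, `price`, `in_stock`), the example below compares building a data frame directly from the vectors with building it from a `cbind()` matrix. `stringsAsFactors = FALSE` is set explicitly so the result does not depend on the R version.

```r
# Hypothetical vectors of three different types
item     <- c("pen", "book")
price    <- c(1.5, 12)
in_stock <- c(TRUE, FALSE)

# Each vector becomes its own column and keeps its type
direct <- data.frame(item, price, in_stock, stringsAsFactors = FALSE)

# cbind() builds a character matrix first, so the types are already lost
via_cbind <- data.frame(cbind(item, price, in_stock), stringsAsFactors = FALSE)

print(sapply(direct, class))
print(sapply(via_cbind, class))
```

Only the `direct` version allows calculations such as `mean(direct$price)` without converting the column first.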
In reality, most of the time we will be working with files (you can also use the file browser).
    getwd()
    setwd("/home/analytics/MGMT6963/labs/data")
    list.files()
(if this doesn't show "batting.csv" you set the wrong working directory)
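If the working directory is in doubt, the file can be checked for directly before reading it. The following is a self-contained sketch that writes a hypothetical `example.csv` to a temporary directory instead of relying on the lab path.

```r
# Self-contained sketch: a temporary directory stands in for the lab data folder
data_dir <- tempdir()
csv_path <- file.path(data_dir, "example.csv")   # hypothetical file name
write.csv(data.frame(x = 1:3), csv_path, row.names = FALSE)

found  <- file.exists(csv_path)                     # TRUE if the file is there
listed <- "example.csv" %in% list.files(data_dir)   # same check via the listing

# A full path works no matter what the working directory is
df <- read.csv(csv_path)
print(found)
print(listed)
print(nrow(df))
```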
We don't have to specify the full path here
This is the baseball batting data.
    batting1=read.csv(file="batting.csv", header=TRUE,sep=",")
    teams1=read.csv(file="teams.csv", header=TRUE,sep=",")
    batting=read.csv(file="batting.csv", header=TRUE,sep=",", na.strings = "NULL")
    teams=read.csv(file="teams.csv", header=TRUE,sep=",", na.strings =
6. Review the data from the two commands above.
See the structure of each using the following.
Describe how the data for batting and batting1 is being processed differently and why it matters.
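The effect of the `na.strings` argument can be explored on a tiny inline CSV instead of the lab files. The data below are hypothetical; `read.csv()` reads them from a string through its `text` argument.

```r
# Hypothetical CSV content in which a missing value is written as the text NULL
csv_text <- "player,hits\nA,10\nB,NULL\nC,7"

# Without na.strings, the text NULL is treated as an ordinary string
raw <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = FALSE)

# With na.strings = "NULL", it is read as a missing value (NA)
clean <- read.csv(text = csv_text, header = TRUE, na.strings = "NULL",
                  stringsAsFactors = FALSE)

str(raw)
str(clean)
print(mean(clean$hits, na.rm = TRUE))  # math works once NA is recognized
```

Comparing the two `str()` outputs is one way to see how a single non-numeric entry changes the type of an entire column.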
Now let's view the data. This type of data is called a
    View(teams) #show data browser
    names(teams) #show the names
    dim(teams) #show the dimensions of the data frame
    head(teams, 2) #show the first 2 records
    tail(teams, 4) #show the final 4 records
    teams$yearID #show the years in the data frame
    summary(teams) #summarize all variables
    str(teams) #shows the structure of an R Object
7. Use some of the commands above on the batting data. How might you understand the structure?
Provide a list of at least 5 things that you find out.
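Two further checks that may help when exploring a data frame are the class of each column and the number of missing values per column. This is a sketch on a hypothetical data frame `toy`, not on the batting data.

```r
# Hypothetical data frame with one column of each common type
toy <- data.frame(
  team = c("A", "B", "C"),
  wins = c(90L, NA, 75L),
  era  = c(3.5, 4.1, NA),
  stringsAsFactors = FALSE
)

col_types <- sapply(toy, class)      # class of every column
na_counts <- colSums(is.na(toy))     # number of missing values per column

print(col_types)
print(na_counts)
print(c(rows = nrow(toy), cols = ncol(toy)))
```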
Notice the different types: factors, integers, numeric.
League ID (just note that this is a factor object)
This is the type of object that incorporates different
and can be translated into "dummy variables" quite easily.
    teams$lgID #show the variable and levels
Recode it as a character:
    as.character(teams$lgID) #translate factor to string and print
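As a sketch of how a factor relates to dummy variables, the example below uses a hypothetical `league` factor and the base R function `model.matrix()`, which is one of several ways to build indicator columns. Note that from R 4.0 onward `read.csv()` no longer converts strings to factors by default, so a column such as `teams$lgID` may arrive as character unless `stringsAsFactors = TRUE` is passed or `factor()` is applied.

```r
# Hypothetical factor with two levels
league <- factor(c("AL", "NL", "NL", "AL", "NL"))

print(levels(league))        # the distinct categories: "AL" "NL"
print(as.integer(league))    # the integer codes stored behind the labels
print(as.character(league))  # back to plain strings

# One 0/1 indicator column per level; "- 1" removes the intercept column
dummies <- model.matrix(~ league - 1)
print(dummies)
```

Each row of `dummies` has a 1 in exactly one column, matching that observation's level.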