R for Data Management and Cleaning
(Week-2 and 3)
Welcome to Module 2 of this R programming course. If you have not completed Module 1, please complete it first. Before heading into Module 2, let's take a short quiz on the previous week's sessions.
Module-1 Revision Quiz
Data Object Types
Integer: Whole numbers without decimals, for example 1, 7, 9, 3000. Use the suffix "L" to specify an integer (for example, 7L).
Logical: Also known as the Boolean data type. It can take only two values: TRUE or FALSE.
Numeric: Includes all real numbers, with or without decimals, for example 1, 4.5556, 5.99, 1.9.
Character: Used to store character, string, or word values in a variable. Use single or double quotes to represent a string, for example 'Hari', "shyam", 'red', "a".
Complex: Used to store complex numbers, which have a real part and an imaginary part. The suffix i marks the imaginary part, for example 34+2i.
Factor: A categorical variable with a fixed set of levels (for example 1: Male; 2: Female; 3: Other).
Note: table(obj) gives the frequency table of obj.
contrasts(obj) applies to factor variables and helps us identify the reference category.
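The data types and the note above can be tried out with a short sketch like the one below. The object names and the example values (sex_codes, sex) are made up for illustration only.

```r
# One example value of each basic data type
class(5L)        # "integer"   (note the L suffix)
class(4.5556)    # "numeric"
class(TRUE)      # "logical"
class("Hari")    # "character"
class(34+2i)     # "complex"

# Hypothetical coded variable: 1 = Male, 2 = Female, 3 = Other
sex_codes <- c(1, 2, 2, 1, 3, 2)

# Convert the numeric codes into a factor with readable labels
sex <- factor(sex_codes,
              levels = c(1, 2, 3),
              labels = c("Male", "Female", "Other"))

class(sex)       # "factor"
table(sex)       # frequency of each category
contrasts(sex)   # the row containing only zeros ("Male") is the reference category
```

With the default settings, the first level of a factor is used as the reference category.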
Data structure
String: A string is a sequence of characters. For example, "Hello world" is a string made up of the characters H, e, l, l, o, (space), w, o, r, l, d.
Vector: The basic data structure. It has a single dimension and contains elements of the same type (integer, numeric, character, logical, or factor), for example c(1, 3, 5, 8, 9).
Matrix: A two-dimensional data structure in which data of the same type are arranged into rows and columns. It can be thought of as a two-dimensional vector.
List: A list is a collection of data of similar or different types. We can use the list() function to create a list.
Array: Arrays are data structures that can store data of the same type in multiple dimensions. The only difference between vectors, matrices, and arrays is the number of dimensions:
- Vectors are one-dimensional arrays
- Matrices are two-dimensional arrays
- Arrays can have more than two dimensions
Data frame: A data frame is a two-dimensional data structure that stores data in tabular format. It has rows and columns. Rows represent observations and columns represent variables.
Note: To identify the data type or data structure of an object, use these two functions:
class(object_name)
mode(object_name)
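As a rough illustration of the two functions above, the sketch below builds one small object of each structure and inspects it. The object names (v, m, a, l, df) are arbitrary and chosen only for this example.

```r
v  <- c(1, 3, 5, 8, 9)                            # vector
m  <- matrix(1:6, nrow = 2, ncol = 3)             # matrix: 2 rows, 3 columns
a  <- array(1:24, dim = c(2, 3, 4))               # array with three dimensions
l  <- list(1, "a", TRUE)                          # list holding different types
df <- data.frame(x = 1:3, y = c("a", "b", "c"))   # data frame

class(v);  mode(v)    # "numeric" and "numeric"
class(m);  mode(m)    # class includes "matrix"; mode is "numeric"
class(a);  mode(a)    # "array" and "numeric"
class(l);  mode(l)    # "list" and "list"
class(df); mode(df)   # "data.frame" and "list"
```

class() reports how R treats the object, while mode() reports how its contents are stored, which is why a data frame has class "data.frame" but mode "list".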
Learn to create a data frame / dataset
Let's create a data frame. We can create one with the data.frame() function.
dataframe1 <- data.frame(
first_col = c(value11, value12, ...),
second_col = c(val21, val22, ...),
...
)
Here, first_col and second_col are vectors. Each vector holds values of a single data type, and all vectors must have the same length.
To create a data frame, we first create a vector for each variable. Here, I'm generating vectors of participant ID, gender, age, weight, and height.
PID<-c(1,2,3,4,5,6,7)
gender<-c("F", "M","F","F","M","M","M")
Age<-c(23,32,21,22,24,32,29)
weight<-c(67,89,45,65,59,90,56)
height<-c(1.5,1.7,1.8,1.3,1.65,1.9,1.4)
dataset<-data.frame(PID,gender,Age,weight,height)
A dataset (data frame) is indexed in dataset[row, column] format.
Select specific cell:
dataset[3,5] # This will select 3rd row 5th column value
Select specific column:
dataset[, 4] # The row position is empty, so all rows of the 4th column are kept
dataset$var1 # Selects the column named var1 from dataset
dataset[["var_name"]] # Selects all values of the column named var_name (the name is given as a quoted string)
Select specific row:
dataset[3, ] # The column position is empty, so all columns of the 3rd row are kept
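Putting the selection rules above together, here is a small sketch that rebuilds the example dataset and applies each form of indexing. The helper variable col_name is introduced only for this illustration.

```r
# Rebuild the example dataset so this block runs on its own
dataset <- data.frame(
  PID    = c(1, 2, 3, 4, 5, 6, 7),
  gender = c("F", "M", "F", "F", "M", "M", "M"),
  Age    = c(23, 32, 21, 22, 24, 32, 29),
  weight = c(67, 89, 45, 65, 59, 90, 56),
  height = c(1.5, 1.7, 1.8, 1.3, 1.65, 1.9, 1.4)
)

dataset[3, 5]          # 3rd row, 5th column: 1.8 (height of participant 3)
dataset[, 4]           # all rows of the 4th column (weight)
dataset$weight         # the same column, selected by name
dataset[["weight"]]    # the same column, name given as a quoted string
dataset[3, ]           # all columns of the 3rd row

# [[ ]] also accepts a variable that holds the column name
col_name <- "height"
dataset[[col_name]]
```

The [[ ]] form is handy when the column name is stored in a variable, which the $ form does not support.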
Import dataset in R
You can import a dataset in two ways: a) through code and b) visually.
Download the datasets and codebooks here:
Low back pain dataset (in CSV)
Low back pain dataset (in Stata)
Low back pain dataset (in SPSS)
Visual method:
To import a dataset, follow the steps below:
- Go to the top right panel, click the Import Dataset button, and select the format of the dataset (SPSS, Stata, Excel, CSV, Text, SAS).
- Add the path to the dataset and rename the dataset if needed.
- Import the dataset.
Import CSV
dataset <- read.csv(file = "location_of_file/filename.csv", header = TRUE, sep = ",")
# Alternatively, pick the file from a dialog box with file.choose()
dataset <- read.csv(file = file.choose(), header = TRUE, sep = ",")
Import excel datasets
library(readxl)
dataset <- read_excel(path = "location_of_file/filename.xlsx", sheet = "sheet1")
Import *.sav and *.dta files
library(haven)
dataset <- read_dta(file = "location_of_file/filename.dta") # Stata file
dataset <- read_sav(file = "location_of_file/filename.sav") # SPSS file
Import from google sheet
library(gsheet)
dataset<-gsheet2tbl(url="https://drive.google.com/file/d/1yhFs7ju5qPWwE-8qRMxr6O_BOt521vwv/view?usp=sharing", sheetid = "Sheet1")
Import from kobotoolbox
# install_github() comes from the devtools package
devtools::install_github("mrdwab/koboloadeR")
library(koboloadeR)
# Download a specific dataset directly from KoboToolbox
dataset <- kobo_data_downloader(
  formid = "######",
  user = "Username:*******",
  api = "https://kc.kobotoolbox.org/api/v1/",
  check = TRUE
)
Note:
- If the dataset is in your working directory, you can specify just the file name with its extension inside double quotes instead of the full path (use getwd() to check the working directory and setwd() to change it).
- When typing the path to a file, enter the drive name followed by :/ and press the Tab key, then select the right folder and file from the suggestions.
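The first point in the note can be sketched as follows. To keep the example self-contained, it writes a tiny CSV into a temporary folder first; the file name example.csv and its contents are made up for illustration.

```r
# Create a small CSV in a temporary folder (hypothetical file and data)
tmp_dir  <- tempdir()
csv_path <- file.path(tmp_dir, "example.csv")
write.csv(data.frame(PID = 1:3, Age = c(23, 32, 21)),
          file = csv_path, row.names = FALSE)

old_wd <- getwd()   # remember the current working directory
setwd(tmp_dir)      # make the temporary folder the working directory

# The file is now in the working directory, so the file name alone is enough
dataset <- read.csv(file = "example.csv", header = TRUE, sep = ",")

setwd(old_wd)       # restore the original working directory
dim(dataset)        # 3 rows and 2 columns
```

In your own project you would normally call setwd() once with the folder that holds your data, rather than a temporary folder.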
Export dataset in R
To export a dataset into different formats, you can use different functions. Let's see some examples:
Export to CSV format
write.csv(dataset, file="location_to_export/filename.csv")
# Export to .dta (Stata) and .sav (SPSS) formats; both functions come from the haven package
library(haven)
write_dta(dataset, path = "location_to_export/filename.dta", version = 14)
write_sav(dataset, path = "location_to_export/filename.sav")
# Export as R objects (.RData and .rds)
save.image(file = ".RData") # saves the whole workspace
saveRDS(dataset, file = "path to save file/file.rds") # saves a single object
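To check that an export worked, you can read the file back in. The sketch below does this for CSV and .rds using only base R; the small dataset and the file names in a temporary folder are assumptions made for this example.

```r
# Small example dataset (made up for illustration)
dataset <- data.frame(PID = 1:3,
                      gender = c("F", "M", "F"),
                      stringsAsFactors = FALSE)

# Write into a temporary folder so nothing in your project is overwritten
csv_path <- file.path(tempdir(), "dataset.csv")
rds_path <- file.path(tempdir(), "dataset.rds")

# row.names = FALSE avoids writing an extra column of row numbers
write.csv(dataset, file = csv_path, row.names = FALSE)
saveRDS(dataset, file = rds_path)

# Read both files back and compare them with the original
from_csv <- read.csv(csv_path)
from_rds <- readRDS(rds_path)

names(from_csv)               # same column names as the original
all.equal(from_rds, dataset)  # compares the .rds copy with the original
```

An .rds file keeps R-specific details such as column types, while a CSV file is plain text that other programs can open.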
Save R-scripts
You can save an R script by clicking the Save button just above the top left panel.
Save R-History
The history of commands that have already been run in the console can be saved with the savehistory() function.
savehistory(file = "file_history.Rhistory")
Working with Dataframe
Step-1: Find the dimensions of the data frame (number of rows and number of columns).
dim(dataset) # displays the number of rows and columns
Step-2: View the dataset
View(dataframe_name) # displays the data frame in a viewer
str(dataframe_name) # displays the variables with their types
glimpse(dataframe_name) # displays the variables with their types; requires library(dplyr)
Step-3: Find out the data types of specific variable
class(dataset$var_name)
mode(dataset$var_name)
str(dataset$var_name)
glimpse(dataset$var_name)
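The three steps above can be tried on the small example dataset from earlier in this module. This sketch uses only base R functions, so it runs without installing dplyr; the dataset is rebuilt here so the block works on its own.

```r
# Rebuild the example dataset
dataset <- data.frame(
  PID    = c(1, 2, 3, 4, 5, 6, 7),
  gender = c("F", "M", "F", "F", "M", "M", "M"),
  Age    = c(23, 32, 21, 22, 24, 32, 29),
  weight = c(67, 89, 45, 65, 59, 90, 56),
  height = c(1.5, 1.7, 1.8, 1.3, 1.65, 1.9, 1.4),
  stringsAsFactors = FALSE   # keep gender as character in every R version
)

# Step-1: dimensions
dim(dataset)            # 7 rows and 5 columns

# Step-2: structure of the whole dataset
str(dataset)            # lists each variable with its type

# Step-3: type of a specific variable
class(dataset$Age)      # "numeric"
mode(dataset$gender)    # "character"
str(dataset$height)     # type and first values of height
```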