Performing Data Analysis Using R
R is one of the most powerful languages for data analysis, statistical computing, and data visualization. Here’s a structured approach to performing data analysis using R.
1️⃣ Importing Data
The first step is loading the dataset. R supports multiple file formats.
(a) Read CSV File
data <- read.csv("data.csv", header = TRUE, stringsAsFactors = FALSE)
🔹 header = TRUE
→ Uses the first row as column names
🔹 stringsAsFactors = FALSE
→ Keeps strings as character vectors instead of converting them to factors (already the default since R 4.0.0)
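To try the call without a file on disk, the sketch below writes a small data frame to a temporary CSV and reads it back with the same arguments. The column names and values are made up for illustration.

```r
# Hypothetical sample data, written to a temporary file so the example is self-contained
sample_df <- data.frame(
  Name = c("Ana", "Ben", "Chi"),
  Age  = c(34, 28, 45)
)
csv_path <- tempfile(fileext = ".csv")
write.csv(sample_df, csv_path, row.names = FALSE)

# Read it back with the same arguments as the call above
csv_data <- read.csv(csv_path, header = TRUE, stringsAsFactors = FALSE)

str(csv_data)         # Name is read as character, Age as integer
class(csv_data$Name)  # "character" rather than "factor"
```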
(b) Read Excel File (readxl package)
install.packages("readxl")
library(readxl)
data <- read_excel("data.xlsx", sheet = 1)
(c) Read Data from a Database (DBI package)
install.packages(c("DBI", "RSQLite")) # RSQLite provides the SQLite driver used below
library(DBI)
conn <- dbConnect(RSQLite::SQLite(), "database.db")
data <- dbGetQuery(conn, "SELECT * FROM table_name")
dbDisconnect(conn)
2️⃣ Exploring the Data
Before analysis, check the structure and contents of the dataset.
(a) View the First & Last Few Rows
head(data) # First 6 rows
tail(data) # Last 6 rows
(b) Check Structure and Summary
str(data) # Structure of dataset
summary(data) # Summary statistics
(c) Get Column Names
colnames(data)
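A few more base R calls are often useful at this stage. The sketch below uses the built-in iris dataset as a stand-in for your own data, so the variable name df is only illustrative.

```r
# Built-in dataset used as a stand-in for `data`
df <- iris

dim(df)             # number of rows and columns
sapply(df, class)   # data type of each column
table(df$Species)   # frequency of each category in a factor column
colSums(is.na(df))  # missing values per column
```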
3️⃣ Data Cleaning & Preprocessing
Data often requires cleaning before analysis.
(a) Handling Missing Values
sum(is.na(data)) # Count missing values
# Choose one of the following strategies, not both:
data <- na.omit(data) # Option 1: remove rows with missing values
data[is.na(data)] <- 0 # Option 2: replace NA with 0
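Dropping rows and filling with a constant are two of several possible strategies. As a rough sketch, the example below counts missing values per column, keeps only complete rows, and separately fills each numeric column with its median. The data frame and its values are hypothetical, and median imputation is just one common choice, not a recommendation for every dataset.

```r
# Hypothetical data frame with a few missing values
df <- data.frame(
  Age    = c(25, NA, 40, 31),
  Income = c(50000, 62000, NA, 48000)
)

colSums(is.na(df))  # missing values per column

# Strategy: keep only rows with no missing values
complete_rows <- df[complete.cases(df), ]

# Strategy: fill each numeric column with its own median
imputed <- df
for (col in names(imputed)) {
  if (is.numeric(imputed[[col]])) {
    col_median <- median(imputed[[col]], na.rm = TRUE)
    imputed[[col]][is.na(imputed[[col]])] <- col_median
  }
}
```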
(b) Convert Data Types
data$Age <- as.numeric(data$Age) # Convert column to numeric
data$Category <- as.factor(data$Category) # Convert to categorical
(c) Remove Duplicates
data <- unique(data) # Remove duplicate rows
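Before removing duplicates it can help to see how many there are. A minimal sketch with a made-up data frame, using duplicated() from base R:

```r
# Hypothetical data frame in which the third row repeats the second
df <- data.frame(id = c(1, 2, 2, 3), value = c("a", "b", "b", "c"))

dup_flags <- duplicated(df)  # TRUE for rows that repeat an earlier row
sum(dup_flags)               # number of duplicate rows

deduped <- df[!dup_flags, ]  # keeps the first occurrence of each row
```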
4️⃣ Data Visualization
R offers built-in plotting functions in base R as well as dedicated packages such as ggplot2.
(a) Histogram (Distribution)
hist(data$Age, col = "blue", main = "Age Distribution", xlab = "Age")
(b) Scatter Plot
plot(data$Height, data$Weight, col = "red", pch = 19, main = "Height vs Weight")
(c) Boxplot (Outlier Detection)
boxplot(data$Income, main = "Income Distribution", col = "green")
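The points a boxplot draws beyond its whiskers can also be extracted as values. The sketch below uses boxplot.stats() from the grDevices package (loaded by default), which applies the same 1.5 × IQR rule as boxplot(). The income figures are invented for illustration.

```r
# Hypothetical income values with one obviously extreme entry
income <- c(42000, 45000, 47000, 50000, 52000, 55000, 250000)

stats <- boxplot.stats(income)
stats$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out    # values flagged as outliers
```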
(d) ggplot2 for Advanced Visualization
install.packages("ggplot2")
library(ggplot2)
ggplot(data, aes(x = Height, y = Weight)) +
  geom_point(color = "blue") +
  labs(title = "Height vs Weight Scatter Plot")
5️⃣ Statistical Analysis
R is widely used for statistical computations.
(a) Mean, Median, and Standard Deviation
mean(data$Age, na.rm = TRUE) # Average
median(data$Age, na.rm = TRUE) # Middle value
sd(data$Age, na.rm = TRUE) # Standard deviation
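The na.rm = TRUE argument matters because a single NA otherwise makes the result NA. A small illustration with made-up ages:

```r
# Hypothetical ages with one missing value
ages <- c(23, 35, NA, 41, 29)

mean(ages)                  # NA, because the missing value propagates
mean(ages, na.rm = TRUE)    # 32
median(ages, na.rm = TRUE)  # 32
sd(ages, na.rm = TRUE)      # standard deviation of the four observed values
```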
(b) Correlation Analysis
cor(data$Height, data$Weight, use = "complete.obs") # Pearson correlation
(c) T-Test (Compare Two Groups)
t.test(Income ~ Gender, data = data) # Gender must have exactly two levels
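For a runnable version of a two-group comparison, the sketch below uses the built-in sleep dataset in place of the Income and Gender columns, which are assumed to exist in your own data. By default t.test() runs Welch's test, which does not assume equal variances.

```r
# `sleep` ships with R: `extra` is the response, `group` has two levels
result <- t.test(extra ~ group, data = sleep)

result$estimate  # mean of each group
result$p.value   # p-value of the two-sided test
result$conf.int  # confidence interval for the difference in means
```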
(d) Linear Regression
model <- lm(Weight ~ Height, data = data)
summary(model)
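Beyond summary(), a fitted model can be queried for its coefficients and used for prediction. A minimal sketch using the built-in women dataset (columns height and weight) as a stand-in for your own data:

```r
# `women` ships with R: average heights and weights of 15 women
fit <- lm(weight ~ height, data = women)

coef(fit)               # intercept and slope
summary(fit)$r.squared  # proportion of variance explained

# Predict weight for new, illustrative height values
new_heights <- data.frame(height = c(60, 65, 70))
predictions <- predict(fit, newdata = new_heights)
predictions
```

The column name in the newdata data frame must match the predictor name used in the formula.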
6️⃣ Machine Learning in R
R supports machine learning using packages like caret, randomForest, and e1071.
(a) Train a Simple Linear Model
install.packages("caret")
library(caret)
model <- train(Weight ~ Height, data = data, method = "lm")
summary(model)
(b) Decision Tree
install.packages("rpart")
library(rpart)
tree_model <- rpart(Species ~ ., data = iris, method = "class")
plot(tree_model)
text(tree_model)
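To get a sense of how well such a tree predicts, one common approach is to hold out part of the data for testing. The sketch below assumes the rpart package is available (it is bundled with standard R distributions); the 70/30 split and the seed are arbitrary choices for illustration.

```r
library(rpart)

set.seed(42)  # arbitrary seed so the split is reproducible
train_idx <- sample(nrow(iris), size = round(0.7 * nrow(iris)))
train_set <- iris[train_idx, ]
test_set  <- iris[-train_idx, ]

# Fit on the training rows only
tree_fit <- rpart(Species ~ ., data = train_set, method = "class")

# Predict class labels for the held-out rows
predicted <- predict(tree_fit, newdata = test_set, type = "class")

confusion <- table(Predicted = predicted, Actual = test_set$Species)
accuracy  <- mean(predicted == test_set$Species)
confusion
accuracy
```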
7️⃣ Exporting Data
After analysis, you may need to save the results.
(a) Save Processed Data to CSV
write.csv(data, "cleaned_data.csv", row.names = FALSE)
(b) Save Model for Future Use
saveRDS(model, "model.rds")
Load it later using:
loaded_model <- readRDS("model.rds")
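As a quick check that a saved model survives the round trip, the sketch below fits a small model on the built-in women dataset, saves it to a temporary file, and reloads it. The temporary path is used only to keep the example self-contained.

```r
# Fit a small model on a built-in dataset
fit <- lm(weight ~ height, data = women)

# Save to a temporary file and load it back
model_path <- tempfile(fileext = ".rds")
saveRDS(fit, model_path)
restored <- readRDS(model_path)

all.equal(coef(fit), coef(restored))                  # TRUE if coefficients match
predict(restored, newdata = data.frame(height = 66))  # the restored model can predict
```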
📌 Summary
Step | Function/Package | Purpose |
---|---|---|
Import Data | read.csv(), read_excel(), DBI | Load data from files/databases |
Explore Data | head(), summary(), str() | Check dataset structure |
Clean Data | na.omit(), as.numeric(), unique() | Handle missing values, duplicates |
Visualization | plot(), hist(), ggplot2 | Graphical analysis |
Statistical Analysis | mean(), cor(), t.test(), lm() | Basic statistics & regression |
Machine Learning | caret, rpart | Predictive modeling |
Export Data | write.csv(), saveRDS() | Save results |