Hands-On Session: AI-Enhanced Statistical Analysis
Step-by-Step Exercise Using Python
For this exercise, we're going to analyze a student performance dataset to see how study and sleep habits relate to exam scores. We'll use Python with pandas for data manipulation, scikit-learn for regression, and Matplotlib and seaborn for visualization. The code is not executed here, so each step lists the output you should expect when you run it yourself.
Step 1: Data Loading
First, we'll load the data. Imagine we have a CSV file named student_performance.csv with columns like: StudentID, StudyTime, SleepTime, ExamScore.
python
import pandas as pd
# Load the dataset
df = pd.read_csv('student_performance.csv')
# Display the first few rows to check data
print(df.head())
Expected Output:
   StudentID  StudyTime  SleepTime  ExamScore
0          1         10          7         85
1          2          5          8         70
2          3         12          6         90
3          4          8          9         88
4          5          6          7         75
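If you don't have a student_performance.csv to hand, a minimal sketch like the following creates one from the five rows shown in the expected output above, so the remaining steps can be run end to end. These rows are illustrative only, and a real analysis would need far more data.

```python
import pandas as pd

# Illustrative rows copied from the expected output above (not real data)
sample = pd.DataFrame({
    'StudentID': [1, 2, 3, 4, 5],
    'StudyTime': [10, 5, 12, 8, 6],
    'SleepTime': [7, 8, 6, 9, 7],
    'ExamScore': [85, 70, 90, 88, 75],
})

# Write the file that Step 1 reads; index=False avoids an extra unnamed column
sample.to_csv('student_performance.csv', index=False)

# Read it back to confirm the round trip preserves rows and columns
df = pd.read_csv('student_performance.csv')
print(df.shape)
```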
Step 2: Data Cleaning
We'll check for missing values and handle them, then look for outliers.
python
# Check for missing data
print(df.isnull().sum())
# Suppose 'SleepTime' has some missing values; fill them with the median
df['SleepTime'] = df['SleepTime'].fillna(df['SleepTime'].median())
# Check for outliers in 'StudyTime' using Interquartile Range (IQR)
Q1 = df['StudyTime'].quantile(0.25)
Q3 = df['StudyTime'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Remove outliers, if any. Rows with a missing StudyTime are dropped too,
# because NaN fails both comparisons.
df = df[(df['StudyTime'] >= lower_bound) & (df['StudyTime'] <= upper_bound)]
# Check how many rows remain after removing outliers
print(df.shape)
Expected Output:
- Missing values count for each column.
- New shape of the dataframe after handling outliers.
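Before dropping rows, it can help to see which values the IQR rule actually flags. The sketch below wraps the same calculation in a small helper and applies it to a made-up series; the name iqr_bounds and the sample values are assumptions for illustration, not part of the dataset.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up study times; 40 is deliberately extreme
study = pd.Series([5, 6, 8, 10, 12, 40])

lower, upper = iqr_bounds(study)

# Select the values outside the fences instead of silently removing them
outliers = study[(study < lower) | (study > upper)]

print(f'Bounds: [{lower:.2f}, {upper:.2f}]')
print(f'Flagged values: {outliers.tolist()}')
```

Inspecting the flagged values first lets you decide whether they are data-entry errors or genuine observations worth keeping.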
Step 3: Basic Statistics
We'll calculate mean, median, mode, variance, and standard deviation for ExamScore.
python
# Mean
mean_score = df['ExamScore'].mean()
print(f'Mean Exam Score: {mean_score:.2f}')
# Median
median_score = df['ExamScore'].median()
print(f'Median Exam Score: {median_score:.2f}')
# Mode
mode_score = df['ExamScore'].mode()[0]
print(f'Mode of Exam Score: {mode_score:.2f}')
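# Variance (pandas returns the sample variance, ddof=1, by default)
variance_score = df['ExamScore'].var()
print(f'Variance of Exam Score: {variance_score:.2f}')
# Standard Deviation (sample standard deviation, ddof=1, by default)
std_score = df['ExamScore'].std()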
print(f'Standard Deviation of Exam Score: {std_score:.2f}')
Expected Output:
- The mean, median, and mode of ExamScore (central tendency) and its variance and standard deviation (spread).
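One detail worth checking is which variance you are reporting. pandas' var() and std() divide by n - 1 (the sample estimate) unless told otherwise. The sketch below compares the two on the five illustrative scores from Step 1.

```python
import pandas as pd

# The five illustrative scores from the Step 1 expected output
scores = pd.Series([85, 70, 90, 88, 75])

# Sample variance: divides by n - 1 (pandas default, ddof=1)
sample_var = scores.var()

# Population variance: divides by n (ddof=0)
population_var = scores.var(ddof=0)

print(f'Mean:                {scores.mean():.2f}')
print(f'Sample variance:     {sample_var:.2f}')
print(f'Population variance: {population_var:.2f}')
```

With only a handful of rows the two differ noticeably; on a large dataset the gap becomes negligible.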
Step 4: Regression Analysis
Let's see how StudyTime and SleepTime relate to ExamScore using multiple linear regression (one model with two predictors).
python
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
# Prepare data for regression
X = df[['StudyTime', 'SleepTime']]
y = df['ExamScore']
# Fit the model
model = LinearRegression().fit(X, y)
# Coefficients
print(f'Coefficient for StudyTime: {model.coef_[0]:.2f}')
print(f'Coefficient for SleepTime: {model.coef_[1]:.2f}')
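# Plot StudyTime vs. ExamScore. The model has two predictors, so draw its
# prediction along sorted StudyTime values with SleepTime held at its mean.
line_X = df[['StudyTime']].sort_values('StudyTime').assign(SleepTime=df['SleepTime'].mean())
plt.scatter(df['StudyTime'], df['ExamScore'], color='blue')
plt.plot(line_X['StudyTime'], model.predict(line_X), color='red', linewidth=2)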
plt.xlabel('Study Time (Hours)')
plt.ylabel('Exam Score')
plt.title('Study Time vs Exam Score')
plt.show()
Expected Output:
- Coefficients showing how much ExamScore changes for a one-unit increase in StudyTime or SleepTime, with the other variable held constant.
- A scatter plot with regression line showing the relationship.
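To build intuition for what the coefficients mean, here is a small sketch that fits the same kind of model to synthetic, noise-free data generated from a known rule. The rule (50 + 3 × StudyTime + 1 × SleepTime) and the values are assumptions chosen for illustration; real data will not fit this cleanly.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic predictors (made up for illustration)
toy = pd.DataFrame({
    'StudyTime': [2, 4, 6, 8, 10, 12],
    'SleepTime': [6, 9, 7, 8, 5, 7],
})

# Scores follow a known rule exactly, with no noise
toy['ExamScore'] = 50 + 3 * toy['StudyTime'] + 1 * toy['SleepTime']

X = toy[['StudyTime', 'SleepTime']]
y = toy['ExamScore']
model = LinearRegression().fit(X, y)

# The fit should recover the rule: intercept near 50, coefficients near 3 and 1
print(f'Intercept: {model.intercept_:.2f}')
print(f'Coefficient for StudyTime: {model.coef_[0]:.2f}')
print(f'Coefficient for SleepTime: {model.coef_[1]:.2f}')

# R^2 on the training data; 1.0 here only because the data has no noise
r_squared = model.score(X, y)
print(f'R^2: {r_squared:.2f}')
```

On real data, also report the intercept and R^2 alongside the coefficients so readers can judge how much of the variation in scores the model explains.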
Step 5: Visualization
Let's visualize the distribution of ExamScore.
python
import seaborn as sns
# Distribution plot for ExamScore
sns.histplot(df['ExamScore'], kde=True)
plt.title('Distribution of Exam Scores')
plt.xlabel('Exam Score')
plt.ylabel('Frequency')
plt.show()
Expected Output:
- A histogram with a kernel density estimate line, showing how exam scores are distributed.
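A histogram is easier to interpret alongside a few numbers. As a rough sketch, pandas' describe() and skew() summarize the same distribution numerically. The two series below are made up to contrast a symmetric shape with a right-skewed one.

```python
import pandas as pd

# Made-up scores: evenly spread around the middle value
symmetric = pd.Series([60, 70, 80, 90, 100])

# Made-up scores: most values low, one high value stretching the right tail
right_skewed = pd.Series([60, 62, 65, 70, 100])

# Skewness near 0 suggests symmetry; a positive value suggests a longer right tail
symmetric_skew = symmetric.skew()
right_skew = right_skewed.skew()
print(f'Symmetric skew:    {symmetric_skew:.2f}')
print(f'Right-skewed skew: {right_skew:.2f}')

# Count, mean, standard deviation, min, quartiles, and max in one call
print(right_skewed.describe())
```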
Reflection on the Exercise
This hands-on session has shown you how to load data, clean it, perform basic statistical analyses, conduct regression to understand relationships, and visualize data to communicate findings. Remember:
- The mean, median, and mode give different perspectives on your data. Which one tells the most accurate story depends on the distribution of your data.
- Regression analysis helps in predicting outcomes based on input variables, but remember, correlation does not imply causation.
- Always ensure your data storytelling reflects truth, not just the story you want to tell.
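The first point above can be seen in a few lines. In this sketch, using made-up scores and only the standard library, a single very low score pulls the mean down sharply while the median barely moves.

```python
from statistics import mean, median

# Made-up scores for illustration
scores = [70, 72, 75, 78, 80]

# The same scores plus one extreme low value
with_outlier = scores + [5]

print(f'Without outlier: mean={mean(scores):.2f}, median={median(scores):.2f}')
print(f'With outlier:    mean={mean(with_outlier):.2f}, median={median(with_outlier):.2f}')
```

Reporting only the mean here would suggest the whole class did worse, when in fact one student's result changed.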
This exercise aligns with our Christian call to seek truth, serve with integrity, and use our skills to benefit others, reflecting the stewardship of knowledge and resources God has given us.