Hands-On Session: AI-Enhanced Statistical Analysis

Step-by-Step Exercise Using Python

For this exercise, we'll analyze a student performance dataset to see how different factors relate to grades. We'll use Python with Pandas for data manipulation, scikit-learn for regression, and Matplotlib and Seaborn for visualization. The code is not executed here, so each step gives the code along with the output to expect.


Step 1: Data Loading

First, we'll load the data. Imagine we have a CSV file named student_performance.csv with columns like: StudentID, StudyTime, SleepTime, ExamScore.


python

import pandas as pd

# Load the dataset
df = pd.read_csv('student_performance.csv')

# Display the first few rows to check the data
print(df.head())


Expected Output:

   StudentID  StudyTime  SleepTime  ExamScore
0          1         10          7         85
1          2          5          8         70
2          3         12          6         90
3          4          8          9         88
4          5          6          7         75
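
Before moving on, it can help to confirm that the file really contains the columns the later steps rely on. The sketch below is one possible way to do that. The helper name load_student_data and the REQUIRED_COLUMNS list are illustrative assumptions, not part of the exercise, and the demo writes the five sample rows shown above to a temporary file so it runs without the real CSV.

```python
import os
import tempfile

import pandas as pd

# Columns the later steps rely on (taken from the description in Step 1).
REQUIRED_COLUMNS = ['StudentID', 'StudyTime', 'SleepTime', 'ExamScore']


def load_student_data(path):
    """Hypothetical helper: load the CSV and fail early if a column is missing."""
    data = pd.read_csv(path)
    missing = [col for col in REQUIRED_COLUMNS if col not in data.columns]
    if missing:
        raise ValueError(f'Missing columns: {missing}')
    return data


# Demo: write the five sample rows from the expected output to a temporary CSV.
sample_csv = (
    'StudentID,StudyTime,SleepTime,ExamScore\n'
    '1,10,7,85\n'
    '2,5,8,70\n'
    '3,12,6,90\n'
    '4,8,9,88\n'
    '5,6,7,75\n'
)
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'student_performance.csv')
    with open(sample_path, 'w') as handle:
        handle.write(sample_csv)
    sample_df = load_student_data(sample_path)

print(sample_df.shape)   # (rows, columns)
print(sample_df.dtypes)  # every column should be numeric
```

Failing early on a missing column gives a clearer error than a KeyError several steps later.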


Step 2: Data Cleaning

We'll check for missing values and handle them, then look for outliers.


python

# Check for missing data
print(df.isnull().sum())

# Suppose 'SleepTime' has some missing values; fill them with the column median
df['SleepTime'] = df['SleepTime'].fillna(df['SleepTime'].median())

# Find outlier bounds for 'StudyTime' using the interquartile range (IQR)
Q1 = df['StudyTime'].quantile(0.25)
Q3 = df['StudyTime'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Keep only rows whose 'StudyTime' falls within the bounds
df = df[(df['StudyTime'] >= lower_bound) & (df['StudyTime'] <= upper_bound)]
print(df.shape)  # Rows and columns remaining after removing outliers


Expected Output:

  • Missing values count for each column.
  • New shape of the dataframe after handling outliers.
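
To make the IQR rule concrete, here is a small worked sketch on made-up study times. The function name iqr_bounds and the sample values are assumptions chosen for illustration; the arithmetic is the same as in the step above.

```python
import pandas as pd


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) outlier bounds using the k * IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up study times in hours; 40 is deliberately extreme.
study_time = pd.Series([5, 6, 8, 10, 12, 40])
lower, upper = iqr_bounds(study_time)
outliers = study_time[(study_time < lower) | (study_time > upper)]

print(lower, upper)       # -1.0 19.0
print(outliers.tolist())  # [40]
```

Here Q1 is 6.5 and Q3 is 11.5, so the IQR is 5 and the bounds are 6.5 - 7.5 = -1.0 and 11.5 + 7.5 = 19.0. Only the 40-hour value falls outside them.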


Step 3: Basic Statistics

We'll calculate mean, median, mode, variance, and standard deviation for ExamScore.


python

# Mean
mean_score = df['ExamScore'].mean()
print(f'Mean Exam Score: {mean_score:.2f}')

# Median
median_score = df['ExamScore'].median()
print(f'Median Exam Score: {median_score:.2f}')

# Mode (mode() can return several values; [0] takes the smallest of them)
mode_score = df['ExamScore'].mode()[0]
print(f'Mode of Exam Score: {mode_score:.2f}')

# Variance (pandas computes the sample variance by default, ddof=1)
variance_score = df['ExamScore'].var()
print(f'Variance of Exam Score: {variance_score:.2f}')

# Standard deviation (sample standard deviation by default, ddof=1)
std_score = df['ExamScore'].std()
print(f'Standard Deviation of Exam Score: {std_score:.2f}')


Expected Output:

  • The mean, median, mode, variance, and standard deviation of ExamScore, which summarize the central tendency and dispersion of the scores.
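
Two details of these measures are easy to miss: pandas divides by n - 1 when computing variance unless told otherwise, and the mode is not very informative when every value is unique. The sketch below illustrates both using only the five sample scores from the expected output in Step 1, so the numbers will differ from those for the full dataset.

```python
import pandas as pd

# The five sample scores from the expected output in Step 1.
scores = pd.Series([85, 70, 90, 88, 75])

sample_var = scores.var()            # ddof=1: divides by n - 1
population_var = scores.var(ddof=0)  # divides by n

# Every score occurs once, so mode() returns all five values, sorted.
modes = scores.mode()

print(f'Mean: {scores.mean():.2f}')
print(f'Sample variance: {sample_var:.2f}')
print(f'Population variance: {population_var:.2f}')
print(f'Number of modes: {len(modes)}')
```

For these five scores the mean is 81.60, the sample variance is 75.30, and the population variance is 60.24. Because mode() returns five values here, taking mode()[0] would report 70 even though no score is more common than any other.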


Step 4: Regression Analysis

Let's see how StudyTime and SleepTime relate to ExamScore using multiple linear regression, that is, a linear model with two predictors.


python

from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

# Prepare data for regression
X = df[['StudyTime', 'SleepTime']]
y = df['ExamScore']

# Fit the model
model = LinearRegression().fit(X, y)

# Coefficients
print(f'Coefficient for StudyTime: {model.coef_[0]:.2f}')
print(f'Coefficient for SleepTime: {model.coef_[1]:.2f}')

# The model has two predictors, so draw the fitted line for StudyTime
# with SleepTime held at its mean
line_X = pd.DataFrame({
    'StudyTime': [df['StudyTime'].min(), df['StudyTime'].max()],
    'SleepTime': df['SleepTime'].mean(),
})

# Plot StudyTime vs. ExamScore with the fitted line
plt.scatter(df['StudyTime'], df['ExamScore'], color='blue')
plt.plot(line_X['StudyTime'], model.predict(line_X), color='red', linewidth=2)
plt.xlabel('Study Time (Hours)')
plt.ylabel('Exam Score')
plt.title('Study Time vs Exam Score')
plt.show()


Expected Output:

  • Coefficients showing how much ExamScore changes with a one-unit increase in StudyTime or SleepTime, with the other predictor held constant.
  • A scatter plot of StudyTime against ExamScore with the fitted line overlaid.
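
One way to build confidence in reading the coefficients is to fit the same kind of model on data whose true relationship is known. The sketch below uses synthetic data, made up purely for illustration, in which ExamScore is constructed as exactly 50 + 3 * StudyTime + 2 * SleepTime. The names toy, toy_model, and new_student are assumptions introduced here.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data, made up for illustration: ExamScore is constructed as
# exactly 50 + 3 * StudyTime + 2 * SleepTime, with no noise.
toy = pd.DataFrame({
    'StudyTime': [10, 5, 12, 8, 6],
    'SleepTime': [7, 8, 6, 9, 7],
})
toy['ExamScore'] = 50 + 3 * toy['StudyTime'] + 2 * toy['SleepTime']

X_toy = toy[['StudyTime', 'SleepTime']]
y_toy = toy['ExamScore']
toy_model = LinearRegression().fit(X_toy, y_toy)

# The fit should recover the numbers used to build the data.
print(f'Intercept: {toy_model.intercept_:.2f}')
print(f'StudyTime coefficient: {toy_model.coef_[0]:.2f}')
print(f'SleepTime coefficient: {toy_model.coef_[1]:.2f}')

# R^2 is the share of the variance in ExamScore explained by the model.
r_squared = toy_model.score(X_toy, y_toy)
print(f'R^2: {r_squared:.2f}')

# Predict for a hypothetical student: 9 hours of study, 8 hours of sleep.
new_student = pd.DataFrame({'StudyTime': [9], 'SleepTime': [8]})
predicted = toy_model.predict(new_student)[0]
print(f'Predicted score: {predicted:.2f}')
```

Because the synthetic data has no noise, the model recovers an intercept of 50 and coefficients of 3 and 2, and R^2 is 1. On real data, expect the fit to be looser and check model.score before trusting any prediction.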


Step 5: Visualization

Let's visualize the distribution of ExamScore.


python

import seaborn as sns

# Distribution plot for ExamScore
sns.histplot(df['ExamScore'], kde=True)
plt.title('Distribution of Exam Scores')
plt.xlabel('Exam Score')
plt.ylabel('Frequency')
plt.show()


Expected Output:

  • A histogram with a kernel density estimate line, showing how exam scores are distributed.
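
To see what the bars of a histogram represent, the bin counts can be computed directly. This sketch uses NumPy's histogram function on the five sample scores from Step 1, with four bins chosen by hand as an assumption; the plotting library picks its own bins automatically, so the bars in the plot will generally look different.

```python
import numpy as np

# The five sample scores from the expected output in Step 1.
scores = np.array([85, 70, 90, 88, 75])

# Four equal-width bins from 70 to 90; the last bin includes its right edge.
counts, edges = np.histogram(scores, bins=4, range=(70, 90))

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f'{left:.0f}-{right:.0f}: {count}')
```

This prints counts of 1, 1, 0, and 3 for the bins 70-75, 75-80, 80-85, and 85-90. Each bar's height in the histogram is one of these counts.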


Reflection on the Exercise

This hands-on session has shown you how to load data, clean it, perform basic statistical analyses, conduct regression to understand relationships, and visualize data to communicate findings. Remember:


  • The mean, median, and mode give different perspectives on your data. Which one tells the most accurate story depends on the distribution of your data.
  • Regression analysis helps in predicting outcomes based on input variables, but remember, correlation does not imply causation.
  • Always ensure your data storytelling reflects truth, not just the story you want to tell.
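
The first point can be seen with a few numbers. In this sketch the scores are made up for illustration; adding a single zero (say, a missed exam) moves the mean much more than the median.

```python
import pandas as pd

# Made-up scores for illustration.
typical = pd.Series([70, 75, 85, 88, 90])
with_zero = pd.concat([typical, pd.Series([0])], ignore_index=True)

print(f'Without the zero: mean={typical.mean():.1f}, median={typical.median():.1f}')
print(f'With the zero:    mean={with_zero.mean():.1f}, median={with_zero.median():.1f}')
```

The mean falls from 81.6 to 68.0 while the median falls only from 85.0 to 80.0, which is why the median is often the better summary for skewed data.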


This exercise aligns with our Christian call to seek truth, serve with integrity, and use our skills to benefit others, reflecting the stewardship of knowledge and resources God has given us.