Unit 2 Data Analysis Process

Introduction

The Data Analytics Process is a structured method to explore, analyze, and interpret data to make better decisions.

1. Define the Problem / Objective

Clearly understand what question you are trying to answer.

Example: Why are sales dropping in the last 3 months?

2. Collect the Data

Gather data from various sources like databases, websites, sensors, or surveys.

Example: Collect sales reports, customer feedback, and market trends.

3. Clean and Prepare the Data

Remove duplicates, fix missing values, and organize data for analysis.

Example: Remove entries with no price or incorrect dates.
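
The cleaning step above can be sketched in Pandas as follows. This is a minimal illustration, not a fixed recipe: the column names (order_id, price, order_date) and all values are hypothetical, and dropping rows is only one of several ways to handle missing or invalid data.

```python
import pandas as pd

# Hypothetical sales records containing a duplicate row,
# a missing price, and an impossible date.
raw = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 4],
    "price": [250.0, 120.0, 120.0, None, 80.0],
    "order_date": ["2024-01-05", "2024-01-09", "2024-01-09",
                   "2024-02-11", "2024-13-40"],
})

clean = raw.drop_duplicates()               # remove exact duplicate rows
clean = clean.dropna(subset=["price"])      # drop entries with no price

# Parse the dates; values that cannot be parsed become NaT (missing)
clean = clean.assign(
    order_date=pd.to_datetime(clean["order_date"], errors="coerce")
)
clean = clean.dropna(subset=["order_date"]).reset_index(drop=True)

print(clean)
```

Only the rows that pass all three checks remain in clean.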

4. Analyze the Data

Use statistical methods and tools (such as Python, R, or Excel) to find patterns and insights.

Example: Find which product categories have low sales and in which regions.
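
One way to answer a question like this is to group the data and sort the totals. The sketch below assumes a hypothetical table with region, category, and units_sold columns; the figures are made up for illustration.

```python
import pandas as pd

# Hypothetical sales figures; names and values are illustrative only
sales = pd.DataFrame({
    "region": ["North", "North", "South", "South", "East", "East"],
    "category": ["Toys", "Books", "Toys", "Books", "Toys", "Books"],
    "units_sold": [120, 80, 30, 95, 110, 20],
})

# Total units per (category, region) pair, smallest first
totals = (
    sales.groupby(["category", "region"], as_index=False)["units_sold"]
    .sum()
    .sort_values("units_sold")
)

lowest = totals.head(2)  # the two weakest category/region combinations
print(lowest)
```

Here the two weakest combinations are Books in the East and Toys in the South.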

5. Interpret and Visualize Results

Create charts, graphs, and dashboards to explain findings in a clear way.

Example: Use a bar chart to show the drop in sales per region.

6. Make Decisions / Take Action

Use the insights to improve business strategies, operations, or performance.

Example: Increase marketing in low-performing areas or offer discounts on slow-selling items.

Notes: Data Analytics = Ask → Gather → Clean → Analyze → Visualize → Act

It’s all about turning raw data into smart decisions.

Knowledge Check in Data Science

To check your knowledge in data analytics, you can evaluate your understanding and skills through the following methods:

1. Concept Understanding

Test your knowledge of key topics like:

Types of data (structured/unstructured)
Data analytics process
Descriptive, diagnostic, predictive, and prescriptive analytics
Basic statistics (mean, median, standard deviation)

Example Question:
What is the difference between descriptive and predictive analytics?

2. Tools and Skills

Check your practical knowledge of tools like:

Excel (formulas, pivot tables)
SQL (queries to retrieve data)
Python or R (data handling with Pandas/NumPy)
Power BI or Tableau (creating dashboards)

Example Task:
Use Excel to create a dashboard showing sales trends by region.

3. Hands-on Projects

Practice with small datasets to solve real-world problems.

Example Activity:
Analyze a CSV file to find which product had the highest returns.
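
A possible approach to this activity is sketched below. The CSV content is embedded as a string so the example is self-contained; the column names (product, units_sold, units_returned) and numbers are assumptions, and with a real file you would pass its path to pd.read_csv instead.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file
csv_text = """product,units_sold,units_returned
Headphones,200,12
Keyboard,150,30
Mouse,300,18
Keyboard,100,15
"""

returns = pd.read_csv(io.StringIO(csv_text))

# Add up returns per product, then pick the product with the largest total
returns_by_product = returns.groupby("product")["units_returned"].sum()
top_product = returns_by_product.idxmax()

print(top_product, returns_by_product.max())
```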

Exploratory Data Analysis (EDA) – In Brief

Exploratory Data Analysis (EDA) is the process of examining and visualizing data to understand its structure, patterns, and key features before applying any models or making decisions.

Purpose of EDA:

Identify patterns, relationships, and trends in the data
Detect missing values, outliers, or errors
Get a basic idea of how data is distributed
Choose the right analysis or model for further processing

Common EDA Techniques:

Technique	Purpose	Example Tool
Summary Statistics	Mean, median, mode, standard deviation	describe() and mode() in Pandas
Data Visualization	Plot graphs for insights	Matplotlib, Seaborn
Correlation Analysis	Find relationships between variables	corr() in Pandas
Value Counts	Frequency of categorical values	value_counts() in Pandas
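
The non-graphical techniques in the table can be tried on a small DataFrame. The sketch below uses a hypothetical student dataset (hours_studied, marks, grade); note that describe() summarizes only the numeric columns by default and does not report the mode, which is why mode() is called separately.

```python
import pandas as pd

# Hypothetical student data for illustration
students = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "marks": [35, 42, 50, 58, 61, 70],
    "grade": ["C", "C", "B", "B", "B", "A"],
})

summary = students.describe()                    # count, mean, std, min, quartiles, max
grade_mode = students["grade"].mode()            # most frequent value(s)
corr_matrix = students[["hours_studied", "marks"]].corr()
grade_counts = students["grade"].value_counts()  # frequency of each category

print(summary)
print(corr_matrix)
print(grade_counts)
```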

Example:

You have a dataset of student marks.

Use histograms to see score distribution
Use box plots to spot outliers
Use scatter plots to check relationships (e.g., hours studied vs. marks scored)
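
A box plot conventionally marks as outliers the points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. The same rule can be applied numerically, as in this sketch; the marks are hypothetical, and 1.5 is the usual default rather than a fixed requirement.

```python
import pandas as pd

# Hypothetical student marks with one suspiciously low value
marks = pd.Series([52, 55, 58, 60, 61, 63, 65, 67, 70, 5])

q1 = marks.quantile(0.25)   # first quartile
q3 = marks.quantile(0.75)   # third quartile
iqr = q3 - q1               # interquartile range

# Bounds beyond which a box plot would draw a point as an outlier
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

outliers = marks[(marks < lower_bound) | (marks > upper_bound)]
print(outliers.tolist())
```

To see the same result graphically, seaborn's boxplot function can be called on the series.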

Notes: EDA helps you understand your data deeply before applying machine learning or making business decisions.

Types of Exploratory Data Analysis

A. Quantitative Analysis Technique

B. Graphical Analysis Technique

Quantitative Data Analysis

Quantitative Data Analysis is the process of analyzing numerical data (data that can be measured or counted) using statistical techniques to uncover patterns, relationships, and trends.

Key Features of Quantitative Data:

Expressed in numbers (e.g., marks, sales, age)
Can be analyzed using mean, median, standard deviation, correlation, etc.
Often displayed with charts like histograms, scatter plots, line graphs

Example Use-Case:

Suppose we have data on students’ hours studied and exam scores. We want to analyze the relationship between them.

Python Program for Quantitative Data Analysis

# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Sample dataset
data = {
    'Hours_Studied': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'Exam_Score': [35, 40, 45, 50, 55, 65, 70, 75, 80]
}

# Create DataFrame
df = pd.DataFrame(data)

# 1. View basic statistics
print("Summary Statistics:\n", df.describe())

# 2. Calculate correlation
correlation = df['Hours_Studied'].corr(df['Exam_Score'])
print("\nCorrelation between hours studied and score:", correlation)

# 3. Plot the data
plt.figure(figsize=(8, 5))
sns.scatterplot(x='Hours_Studied', y='Exam_Score', data=df)
plt.title('Hours Studied vs Exam Score')
plt.xlabel('Hours Studied')
plt.ylabel('Exam Score')
plt.grid(True)
plt.show()

Graphical Data Analytics

Graphical Analysis is a method of visualizing data using charts and graphs to identify trends, patterns, outliers, and relationships.

Below are the most commonly used graphical techniques, each with a short Python example.

Histogram

Purpose:

Shows the distribution of a single numeric variable.

import seaborn as sns
import matplotlib.pyplot as plt

# Test scores to plot
data = [55, 60, 61, 62, 65, 65, 66, 68, 70, 75, 80, 85, 90, 95]

# Histogram with 5 bins and a smoothed density curve (kde)
sns.histplot(data, bins=5, kde=True)
plt.title("Histogram of Test Scores")
plt.xlabel("Score")
plt.ylabel("Frequency")
plt.show()

Scatter Plot

Purpose:

Shows the relationship between two numeric variables.

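import seaborn as sns
import matplotlib.pyplot as plt

# Load the built-in iris sample dataset (downloaded on first use)
df = sns.load_dataset("iris")

# One point per flower, colored by species
sns.scatterplot(x='sepal_length', y='sepal_width', hue='species', data=df)
plt.title("Sepal Length vs Width")
plt.show()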

Bar Chart

Purpose:

Compares categorical variables or grouped data.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'Department': ['IT', 'HR', 'Sales', 'Marketing'],
    'Employees': [40, 15, 25, 30]
})

# One bar per department
df.plot(kind='bar', x='Department', y='Employees', legend=False)
plt.title("Number of Employees by Department")
plt.ylabel("Employees")
plt.show()

Pie Chart

Purpose:

Displays the percentage or proportion of parts to a whole.

import matplotlib.pyplot as plt

labels = ['Python', 'Java', 'C++', 'JavaScript']
sizes = [40, 25, 20, 15]

# autopct prints each slice's share as a percentage
plt.pie(sizes, labels=labels, autopct='%1.1f%%')
plt.title("Programming Language Usage")
plt.show()

Line Chart

Purpose:

Shows trends over time.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'Month': ['Jan', 'Feb', 'Mar', 'Apr'],
    'Revenue': [1000, 1500, 1300, 1700]
})

# Line chart with a marker at each month
plt.plot(df['Month'], df['Revenue'], marker='o')
plt.title("Monthly Revenue")
plt.xlabel("Month")
plt.ylabel("Revenue in USD")
plt.grid(True)
plt.show()

Summary Table:

Technique	Best For	Python Tool
Histogram	Data distribution	seaborn, matplotlib
Box Plot	Outliers, spread, quartiles	seaborn
Scatter Plot	Relationship between variables	seaborn, matplotlib
Bar Chart	Categorical comparison	pandas, matplotlib
Pie Chart	Part-to-whole visualization	matplotlib
Line Chart	Trend over time	matplotlib, pandas

Data Analytics: Conclusion and Prediction

In data analytics, the final goal is to extract meaningful insights from data that can help in making informed decisions. Two important outcomes are:

Conclusion (Descriptive Analytics)

What is it?

A conclusion summarizes what the data tells us after analysis. It answers:

“What happened?” or “What is happening?”

Purpose:

Identify trends and patterns
Discover relationships or differences
Highlight key findings

Example:

After analyzing sales data for 12 months:

“Sales increased by 20% in the second half of the year, with the highest revenue in December.”

Techniques Used:

Charts & visualizations
Descriptive statistics (mean, median)
Correlation and comparison
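
A conclusion like the sales example above comes from simple descriptive calculations. The sketch below uses hypothetical monthly figures, chosen only so that the result mirrors that example; the numbers are not real data.

```python
import pandas as pd

# Hypothetical monthly sales for one year (illustrative values only)
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
sales = pd.Series(
    [100, 100, 100, 100, 100, 100, 110, 115, 115, 120, 120, 140],
    index=months,
)

first_half = sales.iloc[:6].sum()    # January to June
second_half = sales.iloc[6:].sum()   # July to December

# Percentage change from the first half to the second half
growth_pct = (second_half - first_half) / first_half * 100
peak_month = sales.idxmax()          # month with the highest sales

print(f"Second-half change: {growth_pct:.0f}%, peak month: {peak_month}")
```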

Prediction (Predictive Analytics)

What is it?

A prediction uses past data and mathematical models to forecast future outcomes. It answers:

“What is likely to happen next?”

Purpose:

Estimate future values (e.g., sales, stock prices, user behavior)
Help in planning and decision-making

Example:

Using student attendance and study hours to predict:

“This student has a 90% chance of scoring above 75% in the exam.”

Techniques Used:

Machine Learning models (Linear Regression, Decision Trees, etc.)
Time Series forecasting
Predictive modeling libraries like scikit-learn
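
As a minimal sketch of a predictive model, the example below fits a straight line to the hours-studied and exam-score data used earlier in this unit, then estimates the score for 10 hours of study. It uses NumPy's polyfit instead of scikit-learn to keep the example dependency-light; the 10-hour input is an assumption for illustration, and a line fitted to nine points is a rough estimate, not a reliable forecast.

```python
import numpy as np

# Data from the quantitative analysis example
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)
scores = np.array([35, 40, 45, 50, 55, 65, 70, 75, 80], dtype=float)

# Fit a degree-1 polynomial (a straight line) by least squares
slope, intercept = np.polyfit(hours, scores, deg=1)

# Predict the score for a new, unseen input (hypothetical 10 hours)
new_hours = 10.0
predicted_score = slope * new_hours + intercept

print(f"score = {slope:.2f} * hours + {intercept:.2f}")
print(f"Predicted score for {new_hours:.0f} hours: {predicted_score:.1f}")
```

The slope estimates how many marks each additional hour of study is associated with in this sample.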

Conclusion vs. Prediction – Quick Comparison

Feature	Conclusion	Prediction
Based on	Existing data	A model built from existing data, applied to new or future cases
Answers	What happened	What is likely to happen
Examples	“Most sales happened in June”	“Sales will rise 10% next quarter”
Tools	Summary stats, EDA, visuals	Regression, ML models, forecasting

By Sandip Kumar Singh