Unit 2 Data Analysis Process

Introduction

The Data Analytics Process is a structured method to explore, analyze, and interpret data to make better decisions.


1. Define the Problem / Objective

Clearly understand what question you are trying to answer.

Example: Why have sales been dropping over the last 3 months?


2. Collect the Data

Gather data from various sources like databases, websites, sensors, or surveys.

Example: Collect sales reports, customer feedback, and market trends.


3. Clean and Prepare the Data

Remove duplicates, fix missing values, and organize data for analysis.

Example: Remove entries with no price or incorrect dates.
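
The cleaning step above can be sketched in Pandas. This is a minimal illustration, not a complete cleaning pipeline: the column names (order_id, price, order_date) and all values are hypothetical.

```python
import pandas as pd

# Hypothetical sales records; column names and values are illustrative only.
raw = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 4],
    "price": [19.99, 5.50, 5.50, None, 12.00],
    "order_date": ["2024-01-05", "2024-01-06", "2024-01-06",
                   "2024-01-07", "not a date"],
})

# 1. Remove exact duplicate rows.
clean = raw.drop_duplicates()

# 2. Drop entries with no price.
clean = clean.dropna(subset=["price"])

# 3. Parse dates; values that cannot be parsed become NaT and are dropped.
clean = clean.assign(
    order_date=pd.to_datetime(clean["order_date"], errors="coerce")
)
clean = clean.dropna(subset=["order_date"]).reset_index(drop=True)

print(clean)
```

Dropping rows is only one option. Depending on the question, missing values may instead be filled in or flagged for review.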


4. Analyze the Data

Use statistical tools and programming (like Python, Excel, or R) to find patterns and insights.

Example: Find which product categories have low sales and in which regions.
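
One way to approach this example is a grouped sum in Pandas. The sketch below uses made-up data, and the threshold of 500 is an assumption chosen only to illustrate the idea of flagging low sales.

```python
import pandas as pd

# Hypothetical sales data; categories, regions, and amounts are made up.
sales = pd.DataFrame({
    "category": ["Toys", "Toys", "Books", "Books", "Games", "Games"],
    "region": ["North", "South", "North", "South", "North", "South"],
    "amount": [1200, 300, 800, 950, 150, 1100],
})

# Total sales for each category/region pair, lowest first.
totals = (
    sales.groupby(["category", "region"], as_index=False)["amount"]
    .sum()
    .sort_values("amount")
    .reset_index(drop=True)
)

# Flag pairs below an assumed threshold.
LOW_SALES_THRESHOLD = 500
low = totals[totals["amount"] < LOW_SALES_THRESHOLD]
print(low)
```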


5. Interpret and Visualize Results

Create charts, graphs, and dashboards to explain findings in a clear way.

Example: Use a bar chart to show the drop in sales per region.


6. Make Decisions / Take Action

Use the insights to improve business strategies, operations, or performance.

Example: Increase marketing in low-performing areas or offer discounts on slow-selling items.

Notes: Data Analytics = Ask → Gather → Clean → Analyze → Visualize → Act

It’s all about turning raw data into smart decisions.

Knowledge Check in Data Science

To check your knowledge in data analytics, you can evaluate your understanding and skills through the following methods:


1. Concept Understanding

Test your knowledge of key topics like:

  • Types of data (structured/unstructured)
  • Data analytics process
  • Descriptive, diagnostic, predictive, and prescriptive analytics
  • Basic statistics (mean, median, standard deviation)

Example Question:
What is the difference between descriptive and predictive analytics?


2. Tools and Skills

Check your practical knowledge of tools like:

  • Excel (formulas, pivot tables)
  • SQL (queries to retrieve data)
  • Python or R (data handling with Pandas/Numpy)
  • Power BI or Tableau (creating dashboards)

Example Task:
Use Excel to create a dashboard showing sales trends by region.


3. Hands-on Projects

Practice with small datasets to solve real-world problems.

Example Activity:
Analyze a CSV file to find which product had the highest returns.
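
A possible starting point for this activity is sketched below. The CSV content is a hypothetical stand-in held in a string so the example runs on its own; the column names are assumptions.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk; the columns and rows are hypothetical.
csv_text = """product,units_sold,units_returned
Laptop,120,6
Headphones,300,45
Keyboard,150,9
Monitor,80,12
"""

# With a real file, pass its path to pd.read_csv instead of a StringIO buffer.
df = pd.read_csv(io.StringIO(csv_text))

# Total returns per product (a product could appear on several rows).
returns = df.groupby("product")["units_returned"].sum()

top_product = returns.idxmax()
print(top_product, returns.max())
```

Comparing returns as a share of units sold, rather than as raw counts, may give a fairer picture when products sell in very different volumes.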

Exploratory Data Analysis (EDA) – In Brief

Exploratory Data Analysis (EDA) is the process of examining and visualizing data to understand its structure, patterns, and key features before applying any models or making decisions.


Purpose of EDA:

  • Identify patterns, relationships, and trends in the data
  • Detect missing values, outliers, or errors
  • Get a basic idea of how data is distributed
  • Choose the right analysis or model for further processing

Common EDA Techniques:

Technique | Purpose | Example Tool
Summary Statistics | Mean, median, mode, standard deviation | describe() and mode() in Pandas
Data Visualization | Plot graphs for insights | Matplotlib, Seaborn
Correlation Analysis | Find relationships between variables | corr() in Pandas
Value Counts | Frequency of categorical values | value_counts() in Pandas
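
The techniques in the table can be tried on a small DataFrame. The student records below are hypothetical and serve only to show each call.

```python
import pandas as pd

# Hypothetical student records, used only for illustration.
students = pd.DataFrame({
    "hours_studied": [2, 4, 6, 8, 10],
    "marks": [40, 50, 65, 70, 90],
    "grade": ["C", "C", "B", "B", "A"],
})

# Summary statistics for the numeric columns (count, mean, std, quartiles).
summary = students[["hours_studied", "marks"]].describe()

# Pearson correlation between the two numeric columns.
corr = students["hours_studied"].corr(students["marks"])

# Frequency of each categorical value.
grade_counts = students["grade"].value_counts()

print(summary)
print("Correlation:", round(corr, 3))
print(grade_counts)
```

Note that describe() reports the median as the 50% row and does not report the mode for numeric columns; Series.mode() covers that case.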

Example:

You have a dataset of student marks.

  • Use histograms to see score distribution
  • Use box plots to spot outliers
  • Use scatter plots to check relationships (e.g., hours studied vs. marks scored)
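
The outlier check that a box plot performs visually can also be done numerically. The sketch below applies the conventional 1.5 × IQR rule to hypothetical marks; the values 5 and 100 are planted as unusual scores.

```python
import pandas as pd

# Hypothetical student marks; 5 and 100 are planted as unusual values.
marks = pd.Series([5, 52, 55, 58, 60, 61, 63, 65, 68, 70, 100])

q1 = marks.quantile(0.25)
q3 = marks.quantile(0.75)
iqr = q3 - q1

# Conventional fences: points beyond 1.5 * IQR from the quartiles are flagged.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = marks[(marks < lower) | (marks > upper)]
print(outliers.tolist())
```

A flagged value is not necessarily an error. It is a prompt to look at that record more closely.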

Notes: EDA helps you understand your data deeply before applying any machine learning model or making business decisions.

Types of Exploratory Data Analysis

A. Quantitative Analysis Technique

B. Graphical Analysis Technique

Quantitative Data Analysis

Quantitative Data Analysis is the process of analyzing numerical data (data that can be measured or counted) using statistical techniques to uncover patterns, relationships, and trends.


Key Features of Quantitative Data:

  • Expressed in numbers (e.g., marks, sales, age)
  • Can be analyzed using mean, median, standard deviation, correlation, etc.
  • Often displayed with charts like histograms, scatter plots, line graphs

Example Use-Case:

Suppose we have data on students’ hours studied and exam scores. We want to analyze the relationship between them.


Python Program for Quantitative Data Analysis

# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Sample dataset
data = {
    'Hours_Studied': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'Exam_Score': [35, 40, 45, 50, 55, 65, 70, 75, 80]
}

# Create DataFrame
df = pd.DataFrame(data)

# 1. View basic statistics
print("Summary Statistics:\n", df.describe())

# 2. Calculate correlation
correlation = df['Hours_Studied'].corr(df['Exam_Score'])
print("\nCorrelation between hours studied and score:", correlation)

# 3. Plot the data
plt.figure(figsize=(8, 5))
sns.scatterplot(x='Hours_Studied', y='Exam_Score', data=df)
plt.title('Hours Studied vs Exam Score')
plt.xlabel('Hours Studied')
plt.ylabel('Exam Score')
plt.grid(True)
plt.show()

Graphical Data Analytics

Graphical Analysis is a method of visualizing data using charts and graphs to identify trends, patterns, outliers, and relationships.

Below are the most commonly used graphical techniques, each with a Python example.

Histogram

Shows the distribution of a single numeric variable.

import seaborn as sns
import matplotlib.pyplot as plt

data = [55, 60, 61, 62, 65, 65, 66, 68, 70, 75, 80, 85, 90, 95]

# kde=True overlays a smoothed density curve on the bars
sns.histplot(data, bins=5, kde=True)
plt.title("Histogram of Test Scores")
plt.xlabel("Score")
plt.ylabel("Frequency")
plt.show()

Scatter Plot

Purpose:

Shows the relationship between two numeric variables.

import seaborn as sns
import matplotlib.pyplot as plt

# load_dataset fetches the sample data online the first time it is called
df = sns.load_dataset("iris")

sns.scatterplot(x='sepal_length', y='sepal_width', hue='species', data=df)
plt.title("Sepal Length vs Width")
plt.show()

Bar Chart

Purpose:

Compares categorical variables or grouped data.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'Department': ['IT', 'HR', 'Sales', 'Marketing'],
    'Employees': [40, 15, 25, 30]
})

# One bar per department; the legend is redundant with a single series
df.plot(kind='bar', x='Department', y='Employees', legend=False)
plt.title("Number of Employees by Department")
plt.ylabel("Employees")
plt.show()

Pie Chart

Purpose:

Displays the percentage or proportion of parts to a whole.

import matplotlib.pyplot as plt

labels = ['Python', 'Java', 'C++', 'JavaScript']
sizes = [40, 25, 20, 15]

# autopct prints each slice's percentage with one decimal place
plt.pie(sizes, labels=labels, autopct='%1.1f%%')
plt.title("Programming Language Usage")
plt.show()

Line Chart

Purpose:

Shows trends over time.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'Month': ['Jan', 'Feb', 'Mar', 'Apr'],
    'Revenue': [1000, 1500, 1300, 1700]
})

# marker='o' draws a dot at each monthly data point
plt.plot(df['Month'], df['Revenue'], marker='o')
plt.title("Monthly Revenue")
plt.xlabel("Month")
plt.ylabel("Revenue in USD")
plt.grid(True)
plt.show()

Summary Table:

Technique | Best For | Python Tool
Histogram | Data distribution | seaborn, matplotlib
Box Plot | Outliers, spread, quartiles | seaborn
Scatter Plot | Relationship between variables | seaborn, matplotlib
Bar Chart | Categorical comparison | pandas, matplotlib
Pie Chart | Part-to-whole visualization | matplotlib
Line Chart | Trend over time | matplotlib, pandas
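
The summary table lists the box plot, which has no example of its own in this section. A minimal sketch follows, using matplotlib directly rather than seaborn to keep it short. The scores reuse the histogram data, with one hypothetical low score (20) added so an outlier is visible.

```python
import matplotlib.pyplot as plt

# Histogram scores plus one hypothetical low score (20) to act as an outlier.
scores = [20, 55, 60, 61, 62, 65, 65, 66, 68, 70, 75, 80, 85, 90, 95]

fig, ax = plt.subplots(figsize=(4, 5))

# The box spans the quartiles, the line inside it marks the median, and
# points beyond the whiskers (1.5 * IQR by default) are drawn individually.
box = ax.boxplot(scores)
ax.set_title("Box Plot of Test Scores")
ax.set_ylabel("Score")

# Call plt.show() to display the figure in an interactive session.
```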


Data Analytics: Conclusion and Prediction

In data analytics, the final goal is to extract meaningful insights from data that can help in making informed decisions. Two important outcomes are:


Conclusion (Descriptive Analytics)

What is it?

A conclusion summarizes what the data tells us after analysis. It answers:

“What happened?” or “What is happening?”

Purpose:

  • Identify trends and patterns
  • Discover relationships or differences
  • Highlight key findings

 

Example:

After analyzing sales data for 12 months:

“Sales increased by 20% in the second half of the year, with the highest revenue in December.”

Techniques Used:

  • Charts & visualizations
  • Descriptive statistics (mean, median)
  • Correlation and comparison
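
A conclusion like the one in the example can be derived with a few descriptive calculations. The monthly figures below are hypothetical and were chosen so that the result matches the example sentence.

```python
import pandas as pd

# Hypothetical monthly sales, chosen so the second half is 20% above the first.
sales = pd.Series(
    [100, 100, 100, 100, 100, 100, 110, 110, 115, 120, 125, 140],
    index=["Jan", "Feb", "Mar", "Apr", "May", "Jun",
           "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
)

first_half = sales.iloc[:6].sum()
second_half = sales.iloc[6:].sum()
pct_change = (second_half - first_half) / first_half * 100
best_month = sales.idxmax()

print(f"Sales changed by {pct_change:.0f}% in the second half; "
      f"the highest revenue was in {best_month}.")
```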

Prediction (Predictive Analytics)

What is it?

A prediction uses past data and mathematical models to forecast future outcomes. It answers:

“What is likely to happen next?”

Purpose:

  • Estimate future values (e.g., sales, stock prices, user behavior)
  • Help in planning and decision-making

Example:

Using student attendance and study hours to predict:

“This student has a 90% chance of scoring above 75% in the exam.”

Techniques Used:

  • Machine Learning models (Linear Regression, Decision Trees, etc.)
  • Time Series forecasting
  • Predictive modeling libraries like scikit-learn
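
As a small illustration of prediction, the sketch below fits a straight line to the hours-studied data from the quantitative analysis example. It uses numpy.polyfit rather than scikit-learn to keep dependencies small. The 10-hour student is a hypothetical input.

```python
import numpy as np

# Data from the hours-studied example earlier in this unit.
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)
scores = np.array([35, 40, 45, 50, 55, 65, 70, 75, 80], dtype=float)

# Least-squares fit of a line: score is approximately slope * hours + intercept.
slope, intercept = np.polyfit(hours, scores, deg=1)

# Forecast for a hypothetical student who studies 10 hours.
new_hours = 10
predicted = slope * new_hours + intercept

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(f"Predicted score for {new_hours} hours: {predicted:.1f}")
```

Ten hours lies just outside the observed range, so this forecast is an extrapolation and should be treated with more caution than a prediction within the range of the data.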

Conclusion vs. Prediction – Quick Comparison

Feature | Conclusion | Prediction
Based on | Existing data | Existing data plus a model applied to new inputs
Answers | What happened | What is likely to happen
Examples | “Most sales happened in June” | “Sales will rise 10% next quarter”
Tools | Summary stats, EDA, visuals | Regression, ML models, forecasting
