Introduction
The Data Analytics Process is a structured method to explore, analyze, and interpret data to make better decisions.
1. Define the Problem / Objective
Clearly understand what question you are trying to answer.
Example: Why have sales dropped over the last 3 months?
2. Collect the Data
Gather data from various sources like databases, websites, sensors, or surveys.
Example: Collect sales reports, customer feedback, and market trends.
3. Clean and Prepare the Data
Remove duplicates, fix missing values, and organize data for analysis.
Example: Remove entries with no price or incorrect dates.
4. Analyze the Data
Use statistical tools and programming (like Python, Excel, or R) to find patterns and insights.
Example: Find which product categories have low sales and in which regions.
5. Interpret and Visualize Results
Create charts, graphs, and dashboards to explain findings in a clear way.
Example: Use a bar chart to show the drop in sales per region.
6. Make Decisions / Take Action
Use the insights to improve business strategies, operations, or performance.
Example: Increase marketing in low-performing areas or offer discounts on slow-selling items.
Notes: Data Analytics = Ask → Gather → Clean → Analyze → Visualize → Act
It’s all about turning raw data into smart decisions.
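The Clean and Analyze steps can be sketched in pandas as follows. This is only an illustration: the table, its column names (order_id, region, price, date), and every value in it are invented for the example, not taken from a real dataset.

```python
import pandas as pd

# Hypothetical raw sales records (invented for illustration).
raw = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 4, 5],
    "region":   ["North", "South", "South", "North", "East", "East"],
    "price":    [120.0, 80.0, 80.0, None, 45.0, 60.0],
    "date":     ["2024-01-15", "2024-02-03", "2024-02-03",
                 "2024-02-20", "not a date", "2024-03-11"],
})

# Clean: drop exact duplicate rows, then rows with no price.
clean = raw.drop_duplicates().dropna(subset=["price"])

# Clean: parse dates; values that cannot be parsed become NaT and are dropped.
clean = clean.assign(date=pd.to_datetime(clean["date"], errors="coerce"))
clean = clean.dropna(subset=["date"])

# Analyze: total sales per region, lowest first.
sales_by_region = clean.groupby("region")["price"].sum().sort_values()
print(sales_by_region)
```

Sorting the totals in ascending order puts the weakest region first, which is the kind of finding the Visualize and Act steps would then build on.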
Knowledge Check in Data Science
To check your knowledge in data analytics, you can evaluate your understanding and skills through the following methods:
1. Concept Understanding
Test your knowledge of key topics like:
- Types of data (structured/unstructured)
- Data analytics process
- Descriptive, diagnostic, predictive, and prescriptive analytics
- Basic statistics (mean, median, standard deviation)
Example Question:
What is the difference between descriptive and predictive analytics?
2. Tools and Skills
Check your practical knowledge of tools like:
- Excel (formulas, pivot tables)
- SQL (queries to retrieve data)
- Python or R (data handling with pandas/NumPy)
- Power BI or Tableau (creating dashboards)
Example Task:
Use Excel to create a dashboard showing sales trends by region.
3. Hands-on Projects
Practice with small datasets to solve real-world problems.
Example Activity:
Analyze a CSV file to find which product had the highest returns.
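One possible way to approach this activity is sketched below. The CSV content and its column names (product, units_sold, units_returned) are hypothetical; with a real file, its path would be passed to pd.read_csv instead of the in-memory text.

```python
import io
import pandas as pd

# Hypothetical CSV content, invented for illustration.
csv_text = """product,units_sold,units_returned
Keyboard,200,12
Mouse,350,21
Monitor,120,18
Headset,150,15
"""

returns = pd.read_csv(io.StringIO(csv_text))

# Return rate = returned units as a share of units sold.
returns["return_rate"] = returns["units_returned"] / returns["units_sold"]

# Product with the most returned units, and product with the highest return rate.
most_returned = returns.loc[returns["units_returned"].idxmax(), "product"]
highest_rate = returns.loc[returns["return_rate"].idxmax(), "product"]

print("Most units returned:", most_returned)
print("Highest return rate:", highest_rate)
```

The two answers can differ: a best-selling product may collect the most returns in absolute terms while a lower-volume product has the worse return rate, so it helps to decide which meaning of "highest returns" the question intends.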
Exploratory Data Analysis (EDA) – In Brief
Exploratory Data Analysis (EDA) is the process of examining and visualizing data to understand its structure, patterns, and key features before applying any models or making decisions.
Purpose of EDA:
- Identify patterns, relationships, and trends in the data
- Detect missing values, outliers, or errors
- Get a basic idea of how data is distributed
- Choose the right analysis or model for further processing
Common EDA Techniques:
Technique | Purpose | Example Tool |
Summary Statistics | Mean, median, standard deviation, quartiles (mode via mode()) | DataFrame.describe() in pandas |
Data Visualization | Plot graphs for insights | Matplotlib, Seaborn |
Correlation Analysis | Find relationships between variables | DataFrame.corr() in pandas |
Value Counts | Frequency of categorical values | value_counts() in pandas |
Example:
You have a dataset of student marks.
- Use histograms to see score distribution
- Use box plots to spot outliers
- Use scatter plots to check relationships (e.g., hours studied vs. marks scored)
Notes: EDA helps you understand your data deeply before applying any machine learning or business decisions.
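As a rough sketch of these EDA checks on a student-marks dataset, the code below uses invented data and hypothetical column names (hours_studied, marks, grade_band). The outlier check applies the common 1.5 × IQR rule, which is also the default rule behind box-plot whiskers.

```python
import pandas as pd

# Hypothetical student data, invented for illustration.
students = pd.DataFrame({
    "hours_studied": [2, 3, 4, 5, 6, 7, 8, 9],
    "marks":         [40, 45, 52, 58, 63, 70, 76, 5],
    "grade_band":    ["C", "C", "B", "B", "B", "A", "A", "F"],
})

# Summary statistics: count, mean, std, min, quartiles, max.
print(students["marks"].describe())

# Frequency of each category.
print(students["grade_band"].value_counts())

# Missing values per column.
print(students.isna().sum())

# Flag outliers with the 1.5 * IQR rule.
q1, q3 = students["marks"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = students[(students["marks"] < q1 - 1.5 * iqr) |
                    (students["marks"] > q3 + 1.5 * iqr)]
print(outliers)
```

A flagged row is a prompt to investigate, not proof of an error: a very low mark after many hours of study could be a data-entry mistake or a genuine result.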
Types of Exploratory Data Analysis
A. Quantitative Analysis Techniques
B. Graphical Analysis Techniques
Quantitative Data Analysis
Quantitative Data Analysis is the process of analyzing numerical data (data that can be measured or counted) using statistical techniques to uncover patterns, relationships, and trends.
Key Features of Quantitative Data:
- Expressed in numbers (e.g., marks, sales, age)
- Can be analyzed using mean, median, standard deviation, correlation, etc.
- Often displayed with charts like histograms, scatter plots, line graphs
Example Use-Case:
Suppose we have data on students’ hours studied and exam scores. We want to analyze the relationship between them.
Python Program for Quantitative Data Analysis
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Sample dataset
data = {
    'Hours_Studied': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'Exam_Score': [35, 40, 45, 50, 55, 65, 70, 75, 80]
}
# Create DataFrame
df = pd.DataFrame(data)
# 1. View basic statistics
print("Summary Statistics:\n", df.describe())
# 2. Calculate the Pearson correlation between the two columns
correlation = df['Hours_Studied'].corr(df['Exam_Score'])
print("\nCorrelation between hours studied and score:", correlation)
# 3. Plot the data
plt.figure(figsize=(8, 5))
sns.scatterplot(x='Hours_Studied', y='Exam_Score', data=df)
plt.title('Hours Studied vs Exam Score')
plt.xlabel('Hours Studied')
plt.ylabel('Exam Score')
plt.grid(True)
plt.show()
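To show what the correlation value measures, the sketch below recomputes the Pearson coefficient by hand for the same hours and scores and cross-checks it against pandas. The least-squares line fitted with np.polyfit is an extra step added here for illustration; it is not part of the program above.

```python
import numpy as np
import pandas as pd

hours = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
scores = np.array([35, 40, 45, 50, 55, 65, 70, 75, 80])

# Deviations of each value from its mean.
dx = hours - hours.mean()
dy = scores - scores.mean()

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Cross-check against the pandas implementation.
r_pandas = pd.Series(hours).corr(pd.Series(scores))

# Least-squares line: score is approximately slope * hours + intercept.
slope, intercept = np.polyfit(hours, scores, deg=1)

print("Manual r:", r_manual)
print("pandas r:", r_pandas)
print("Fitted line: score = %.2f * hours + %.2f" % (slope, intercept))
```

For this sample the coefficient comes out at roughly 0.997, a very strong positive linear relationship. Correlation alone does not establish that studying longer causes higher scores.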
Graphical Data Analytics
Graphical Analysis is a method of visualizing data using charts and graphs to identify trends, patterns, outliers, and relationships.
Below are the most commonly used graphical techniques, each with a Python example:
Histogram
Purpose:
Shows the distribution of a single numeric variable.
import seaborn as sns
import matplotlib.pyplot as plt
data = [55, 60, 61, 62, 65, 65, 66, 68, 70, 75, 80, 85, 90, 95]
sns.histplot(data, bins=5, kde=True)
plt.title("Histogram of Test Scores")
plt.xlabel("Score")
plt.ylabel("Frequency")
plt.show()
Scatter Plot
Purpose:
Shows the relationship between two numeric variables.
import seaborn as sns
import matplotlib.pyplot as plt
# load_dataset fetches the iris sample data online on first use
df = sns.load_dataset("iris")
sns.scatterplot(x='sepal_length', y='sepal_width', hue='species', data=df)
plt.title("Sepal Length vs Width")
plt.show()
Bar Chart
Purpose:
Compares categorical variables or grouped data.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({
    'Department': ['IT', 'HR', 'Sales', 'Marketing'],
    'Employees': [40, 15, 25, 30]
})
df.plot(kind='bar', x='Department', y='Employees', legend=False)
plt.title("Number of Employees by Department")
plt.ylabel("Employees")
plt.show()
Pie Chart
Purpose:
Displays the percentage or proportion of parts to a whole.
import matplotlib.pyplot as plt
labels = ['Python', 'Java', 'C++', 'JavaScript']
sizes = [40, 25, 20, 15]
# autopct prints each slice's percentage with one decimal place
plt.pie(sizes, labels=labels, autopct='%1.1f%%')
plt.title("Programming Language Usage")
plt.show()
Line Chart
Purpose:
Shows trends over time.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({
    'Month': ['Jan', 'Feb', 'Mar', 'Apr'],
    'Revenue': [1000, 1500, 1300, 1700]
})
plt.plot(df['Month'], df['Revenue'], marker='o')
plt.title("Monthly Revenue")
plt.xlabel("Month")
plt.ylabel("Revenue in USD")
plt.grid(True)
plt.show()
Summary Table:
Technique | Best For | Python Tool |
Histogram | Data distribution | seaborn, matplotlib |
Box Plot | Outliers, spread, quartiles | seaborn |
Scatter Plot | Relationship between variables | seaborn, matplotlib |
Bar Chart | Categorical comparison | pandas, matplotlib |
Pie Chart | Part-to-whole visualization | matplotlib |
Line Chart | Trend over time | matplotlib, pandas |
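The table lists the box plot, so a minimal sketch in the same style as the other examples may help. It reuses the test scores from the histogram example and adds one hypothetical low score (20) so that an outlier appears; the quartile arithmetic shows the 1.5 × IQR rule that box-plot whiskers use by default.

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Histogram example scores plus one hypothetical low score (20).
scores = [20, 55, 60, 61, 62, 65, 65, 66, 68, 70, 75, 80, 85, 90, 95]

# Quartiles and the 1.5 * IQR fences used to mark outliers.
q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [s for s in scores if s < lower or s > upper]
print("Outliers:", outliers)

# The box spans Q1 to Q3; points beyond the whiskers are drawn individually.
sns.boxplot(x=scores)
plt.title("Box Plot of Test Scores")
plt.xlabel("Score")
plt.show()
```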
Data Analytics: Conclusion and Prediction
In data analytics, the final goal is to extract meaningful insights from data that can help in making informed decisions. Two important outcomes are:
Conclusion (Descriptive Analytics)
What is it?
A conclusion summarizes what the data tells us after analysis. It answers:
“What happened?” or “What is happening?”
Purpose:
- Identify trends and patterns
- Discover relationships or differences
- Highlight key findings
Example:
After analyzing sales data for 12 months:
“Sales increased by 20% in the second half of the year, with the highest revenue in December.”
Techniques Used:
- Charts & visualizations
- Descriptive statistics (mean, median)
- Correlation and comparison
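A conclusion of this kind can be derived with a few lines of descriptive code. The monthly figures below are hypothetical, chosen only so that the result matches the example statement above.

```python
import pandas as pd

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Hypothetical monthly sales (in thousands), invented for illustration.
sales = pd.Series([90, 95, 100, 100, 105, 110,
                   105, 110, 115, 120, 125, 145], index=months)

# Compare the two halves of the year.
first_half = sales.iloc[:6].sum()
second_half = sales.iloc[6:].sum()
growth_pct = (second_half - first_half) / first_half * 100

# Month with the highest revenue.
best_month = sales.idxmax()

print(f"Second-half sales changed by {growth_pct:.0f}% versus the first half.")
print(f"Highest revenue month: {best_month}")
```

Nothing here forecasts anything: the code only summarizes data that already exists, which is what separates a conclusion from a prediction.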
Prediction (Predictive Analytics)
What is it?
A prediction uses past data and mathematical models to forecast future outcomes. It answers:
“What is likely to happen next?”
Purpose:
- Estimate future values (e.g., sales, stock prices, user behavior)
- Help in planning and decision-making
Example:
Using student attendance and study hours to predict:
“This student has a 90% chance of scoring above 75% in the exam.”
Techniques Used:
- Machine Learning models (Linear Regression, Decision Trees, etc.)
- Time Series forecasting
- Predictive modeling libraries like scikit-learn
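As a minimal sketch of predictive modeling with scikit-learn, the code below fits a linear regression to the hours-studied and exam-score sample from the quantitative analysis example and forecasts the score for a hypothetical student who studies 10 hours. A real project would also hold out test data and evaluate the model before trusting its forecasts.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Sample data from the quantitative analysis example.
# scikit-learn expects features as a 2-D array: one row per sample.
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9]).reshape(-1, 1)
scores = np.array([35, 40, 45, 50, 55, 65, 70, 75, 80])

# Fit a straight line: score = coef * hours + intercept.
model = LinearRegression()
model.fit(hours, scores)

# Forecast for a hypothetical student who studies 10 hours.
predicted = model.predict(np.array([[10]]))[0]

print("Slope:", model.coef_[0])
print("Intercept:", model.intercept_)
print(f"Predicted score for 10 hours: {predicted:.1f}")
```

Ten hours lies just outside the observed range of 1 to 9 hours, so this is an extrapolation and should be read with more caution than a prediction inside that range.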
Conclusion vs. Prediction – Quick Comparison
Feature | Conclusion | Prediction |
Based on | Existing data | Existing data plus a model's inference about unseen or future cases |
Answers | What happened | What is likely to happen |
Examples | “Most sales happened in June” | “Sales will rise 10% next quarter” |
Tools | Summary stats, EDA, visuals | Regression, ML models, forecasting |
Unit 3 Feature Generation and