Big Data Analytics | Karan Goel

Course Overview & Reflection

Overall, this is a solid subject with no major negatives. I found some parts repetitive, mainly because I had already covered similar concepts in a previous Machine Learning course.

What I Like

Exposure to a wider range of algorithms
Deeper understanding of how big data techniques actually work
Reinforced core data science concepts with real-world context

What I Don’t Like

Some lectures felt repetitive
I skipped a few later sessions because they overlapped heavily with prior knowledge

Introduction to Big Data

Sources of Big Data

Big Data is generated from a wide variety of sources, including:

Mobile sensors
Social media platforms
Smart grids
Video rendering systems
Medical imaging
Genetic data
Surveillance cameras
Geophysical data

Understanding Big Data: The “Vs”

Big Data isn’t just about size. It’s commonly described using several key characteristics:

Velocity – Speed at which data is generated and processed
Volume – Amount of data produced
Variety – Structured, semi-structured, and unstructured data
Value – Business value extracted from data
Veracity – Accuracy and reliability of data
Variability – Inconsistency and fluctuation in data flows

Structures of Big Data

Unstructured Data

No predefined schema
Examples: emails, videos, social media posts, satellite images

Quasi-Structured Data

Partial organization, often with tags or metadata
Examples: web server logs, network logs

Semi-Structured Data

Has structure, but not a relational (table-based) one
Examples: XML, JSON, email metadata

Structured Data

Fixed schema, rows and columns
Examples: relational databases, SQL tables, spreadsheets

Business Intelligence vs Data Science

Business Intelligence (BI) focuses on reporting and analyzing historical data (what happened)
Data Science focuses on predicting future outcomes and trends (what is likely to happen)

Lecture 1

Data Lifecycle in Big Data Projects

Key Stages of the Data Lifecycle

Discovery
- Domain understanding
- Define success/failure criteria
- Stakeholder interviews
- Initial hypotheses
Data Source Identification
- Identify data across departments and warehouses
Data Preparation
- Sandbox setup
- ETLT (Extract, Transform, Load, Transform)
- Data conditioning (noise, missing values, outliers)
- Visualization (scatter plots, histograms, heatmaps)
- Scaling and normalization (e.g., Z-normalization)
Model Planning
- Select candidate models
- Variable selection
- Tool and language choice
Model Building
- Train/test split and validation
Communication of Results
- Compare against success criteria
- Reporting insights
Operationalization
- Deployment
- Pilot projects
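
To make the scaling step under Data Preparation concrete, here is a minimal sketch of Z-normalization using only the Python standard library. The function name z_normalize and the ages column are my own illustrative choices, and I am assuming the population standard deviation is used; the lecture did not prescribe a specific implementation.

```python
from statistics import mean, pstdev

def z_normalize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column has no spread; return zeros instead of dividing by zero.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

ages = [22, 25, 31, 40, 52]  # hypothetical column, not from the course
scaled = z_normalize(ages)
print(scaled)
```

After this step every feature is on the same scale, which matters for distance-based methods such as the clustering algorithms covered later.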

Common Mistakes

Rushing into analysis
Insufficient planning

Key Roles

Business User
Project Sponsor
Project Manager
BI Analyst
Database Administrator
Data Engineer
Data Scientist

Lecture 2

Data Exploration & Preparation

Data Analytics Lifecycle

Discovery
- Domain learning
- Stakeholder interviews
- Resource and goal definition
- Hypothesis formulation (H0, H1)
Data Preparation
- ETLT
- Data conditioning
- Formatting and visualization
Model Planning
- Variable selection
- Candidate model identification
Model Building
- Training, validation, testing
Communicating Results
- Findings, limitations, recommendations
Operationalization
- Deployment
- Monitoring
- Training users
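
The training, validation, and testing split in the Model Building stage can be sketched roughly as follows. The function name, the 60/20/20 proportions, and the fixed seed are assumptions for illustration only, not values given in the lecture.

```python
import random

def train_val_test_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle a copy of rows and cut it into train, validation, and test parts."""
    if not 0 <= val_frac + test_frac < 1:
        raise ValueError("val_frac + test_frac must be in [0, 1)")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# Hypothetical dataset of 100 row ids.
train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))
```

The model is fitted on the training part, tuned on the validation part, and the test part is kept aside until the very end so the final score is not biased by tuning.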

Objectives of Data Exploration

Understand data structure
Assess data quality
Identify patterns and relationships
Formulate hypotheses

Statistical Tools

Mean, median, variance, standard deviation
Correlation and covariance
Hypothesis testing
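
As a quick illustration of the descriptive tools listed above, this sketch computes them with the standard library. The hours and scores data are made up, and covariance and correlation are written out by hand (sample versions, dividing by n - 1) so the formulas are visible.

```python
from statistics import mean, median, variance, stdev

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / (stdev(xs) * stdev(ys))

hours = [1, 2, 3, 4, 5]        # hypothetical study hours
scores = [52, 55, 61, 70, 72]  # hypothetical exam scores

print("mean:", mean(scores), "median:", median(scores))
print("variance:", variance(scores), "std dev:", stdev(scores))
print("covariance:", covariance(hours, scores))
print("correlation:", correlation(hours, scores))
```

Covariance depends on the units of both variables, while correlation is unit-free and always falls between -1 and 1, which is why correlation is usually the one reported.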

Common Tests

Two-sample t-test
Welch’s t-test
Wilcoxon Rank-Sum Test
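
To show what one of these tests actually computes, here is a minimal sketch of the Welch's t-test statistic and its approximate degrees of freedom (the Welch-Satterthwaite formula). The function name and both samples are hypothetical. Turning the statistic into a p-value needs the t-distribution, which is not in the standard library; in practice a library call such as scipy.stats.ttest_ind with equal_var=False would be the usual route.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Return (t statistic, approximate degrees of freedom) for Welch's t-test."""
    # Squared standard error of each sample mean; variances are NOT pooled.
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

group_a = [5.1, 4.9, 5.6, 5.8, 6.0]  # hypothetical measurements
group_b = [4.2, 4.0, 4.8, 4.4, 4.1]
t_stat, dof = welch_t(group_a, group_b)
print(t_stat, dof)
```

Unlike the pooled two-sample t-test, Welch's version does not assume the two groups share the same variance, which makes it the safer default when that assumption is in doubt.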

Error Types

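Type I Error (probability α) – False positive: rejecting a null hypothesis that is actually true
Type II Error (probability β) – False negative: failing to reject a null hypothesis that is actually false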
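
A small sketch of how the two error types can be counted when the ground truth is known, for example in a simulation. The function name and the two lists are invented for illustration.

```python
def count_errors(h0_true, rejected):
    """Count Type I and Type II errors across a batch of test decisions."""
    pairs = list(zip(h0_true, rejected))
    type_1 = sum(1 for t, r in pairs if t and r)          # true H0 was rejected
    type_2 = sum(1 for t, r in pairs if not t and not r)  # false H0 was kept
    return type_1, type_2

# Hypothetical outcomes: is H0 really true, and did the test reject it?
h0_true  = [True, True, True, False, False, False]
rejected = [False, True, False, True, False, True]
type_1, type_2 = count_errors(h0_true, rejected)
print(type_1, type_2)
```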

Lecture 3

Hypothesis Testing & Clustering

Hypothesis Testing Techniques

Student’s t-test
Welch’s t-test
Wilcoxon Rank-Sum Test
ANOVA
- One-way ANOVA
- Two-way ANOVA

ANOVA Limitations

Normality assumptions
Sensitivity to outliers
Requires post-hoc tests (e.g., Tukey's HSD) to find out which groups actually differ
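
For reference, the F statistic behind one-way ANOVA can be sketched in a few lines: it is the between-group mean square divided by the within-group mean square. The function name and the three groups are hypothetical, and the sketch stops at the statistic itself (no p-value and no post-hoc test).

```python
from statistics import mean

def one_way_anova_f(*groups):
    """Return the F statistic: between-group over within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual values around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Three hypothetical treatment groups.
f_stat = one_way_anova_f([1, 2, 3], [2, 3, 4], [5, 6, 7])
print(f_stat)
```

A large F only says that at least one group mean differs from the others, which is why a post-hoc test is still needed afterwards.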

Clustering Algorithms

K-Means

Hard clustering
Sensitive to noise and initialization
Used in image processing and compression
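
Here is a minimal sketch of K-Means, assuming the standard Lloyd's algorithm with random initialization (the notes do not say which variant was taught). The example clusters hypothetical grayscale pixel intensities into two levels, which is the basic idea behind using K-Means for image compression.

```python
import random
from math import dist

def k_means(points, k, iterations=20, seed=0):
    """Lloyd's algorithm: alternate nearest-centroid assignment and centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random start, hence the sensitivity to initialization
    labels = [0] * len(points)
    for _ in range(iterations):
        # Hard assignment: each point belongs to exactly one (the nearest) centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids

# Hypothetical pixel intensities (0-255), given as 1-D points.
pixels = [(12,), (15,), (20,), (200,), (210,), (220,)]
labels, centroids = k_means(pixels, k=2)
print(labels, centroids)
```

Replacing every pixel with its centroid value means only k distinct intensities have to be stored. A single extreme value would drag a centroid towards it, which is the noise sensitivity mentioned above.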

DBSCAN

Density-based clustering
Handles noise well
Struggles with varying densities
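
A minimal DBSCAN sketch, using a brute-force neighbor search for readability. The parameter names eps and min_pts follow common usage, and the points are made up. Noise points get the label -1.

```python
from math import dist

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # Brute-force region query; the result includes point i itself.
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:  # j is a core point, so keep expanding
                queue.extend(j_neighbors)
        cluster += 1
    return labels

# Two dense hypothetical groups plus one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

Because eps and min_pts are global, a single setting cannot suit both a tight cluster and a loose one, which is where the trouble with varying densities comes from.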

Self-Organizing Maps (SOM)

Neural-network-based clustering
Useful for visualization
Not ideal for sparse or very large datasets
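
A very small sketch of a SOM, assuming a 1-D chain of units, a Gaussian neighborhood, and linearly decaying learning rate and neighborhood width. All of these choices, the parameter names, and the data are my own assumptions for illustration; real SOMs usually use a 2-D grid.

```python
import random
from math import dist, exp

def train_som(data, n_units=5, epochs=50, lr0=0.5, sigma0=2.0, seed=0):
    """Train a 1-D self-organizing map: a chain of units, each with a weight vector."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [tuple(rng.random() for _ in range(dim)) for _ in range(n_units)]
    for epoch in range(epochs):
        frac = 1 - epoch / epochs
        lr = lr0 * frac                  # learning rate decays over time
        sigma = max(sigma0 * frac, 0.1)  # neighborhood shrinks over time
        for x in data:
            # Best matching unit: the unit whose weights are closest to the input.
            bmu = min(range(n_units), key=lambda u: dist(x, weights[u]))
            for u in range(n_units):
                # Units near the BMU on the chain are pulled more strongly.
                h = exp(-((u - bmu) ** 2) / (2 * sigma ** 2))
                weights[u] = tuple(w + lr * h * (xi - w) for w, xi in zip(weights[u], x))
    return weights

# Hypothetical 2-D data scaled to [0, 1].
data = [(0.1, 0.2), (0.15, 0.25), (0.8, 0.9), (0.85, 0.8)]
weights = train_som(data)
print(weights)
```

Since neighboring units are updated together, units that sit next to each other on the chain end up representing similar inputs, and that ordering is what makes SOMs useful for visualization.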

Lecture 4

Final Thoughts

The subject provides a strong foundation in big data analytics, especially for understanding end-to-end data workflows and core analytical techniques. While some content overlaps with machine learning subjects, it’s still valuable for reinforcing concepts and seeing them applied at scale.