Big Data Analytics

Course Overview & Reflection

Overall, this is a solid subject with no major negatives. I found some parts repetitive, mainly because I had already covered similar concepts in a previous Machine Learning course.

What I Like

  • Exposure to a wider range of algorithms
  • Deeper understanding of how big data techniques actually work
  • Reinforced core data science concepts with real-world context

What I Don’t Like

  • Some lectures felt repetitive
  • I skipped a few later sessions because they overlapped heavily with prior knowledge

Introduction to Big Data

Sources of Big Data

Big Data is generated from a wide variety of sources, including:

  • Mobile sensors
  • Social media platforms
  • Smart grids
  • Video rendering systems
  • Medical imaging
  • Genetic data
  • Surveillance cameras
  • Geophysical data

Understanding Big Data: The “Vs”

Big Data isn’t just about size. It’s commonly described using several key characteristics:

  • Velocity – Speed at which data is generated and processed
  • Volume – Amount of data produced
  • Variety – Structured, semi-structured, and unstructured data
  • Value – Business value extracted from data
  • Veracity – Accuracy and reliability of data
  • Variability – Inconsistency and fluctuation in data flows

Structures of Big Data

Unstructured Data

  • No predefined schema
  • Examples: emails, videos, social media posts, satellite images

Quasi-Structured Data

  • Textual data with erratic, inconsistent formats that can be organized with some effort
  • Examples: web server logs, network logs

Semi-Structured Data

  • Self-describing structure (tags, markers) but no rigid relational schema
  • Examples: XML, JSON, email metadata

Structured Data

  • Fixed schema, rows and columns
  • Examples: relational databases, SQL tables, spreadsheets
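
As a quick illustration of this spectrum, the minimal sketch below takes semi-structured JSON records and flattens them into a structured table. It assumes Python with pandas, which is my own choice of tooling rather than anything the course prescribes.

```python
import json
import pandas as pd

# Semi-structured input: self-describing keys, but no fixed schema
# (the second record is missing "age" and adds a nested field).
raw = """
[
  {"name": "Alice", "age": 34, "city": "Melbourne"},
  {"name": "Bob", "city": "Sydney", "contact": {"email": "bob@example.com"}}
]
"""

records = json.loads(raw)

# Flattening imposes a structured, tabular schema; absent fields become NaN.
df = pd.json_normalize(records)
print(df)
```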

Business Intelligence vs Data Science

  • Business Intelligence (BI) focuses on descriptive, retrospective analysis of historical data (reports, dashboards)
  • Data Science focuses on predictive and exploratory analysis of future trends and open-ended questions

Lecture 1


Data Lifecycle in Big Data Projects

Key Stages of the Data Lifecycle

  1. Discovery
    • Domain understanding
    • Define success/failure criteria
    • Stakeholder interviews
    • Initial hypotheses
  2. Data Source Identification
    • Identify data across departments and warehouses
  3. Data Preparation
    • Sandbox setup
    • ETLT (Extract, Transform, Load, Transform)
    • Data conditioning (noise, missing values, outliers)
    • Visualization (scatter plots, histograms, heatmaps)
    • Scaling and normalization (e.g., Z-normalization; see the sketch after this list)
  4. Model Planning
    • Select candidate models
    • Variable selection
    • Tool and language choice
  5. Model Building
    • Train/test split and validation
  6. Communication of Results
    • Compare against success criteria
    • Reporting insights
  7. Operationalization
    • Deployment
    • Pilot projects
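
A minimal sketch of the data conditioning and scaling steps from stage 3, assuming Python with pandas and NumPy (my assumption; the course does not mandate specific tooling):

```python
import numpy as np
import pandas as pd

# Toy dataset with the usual conditioning problems: a missing value and an extreme outlier.
df = pd.DataFrame({
    "income": [52_000, 61_000, np.nan, 58_000, 1_000_000],
    "age":    [34, 41, 29, 38, 36],
})

# Handle missing values (here: impute with the median).
df["income"] = df["income"].fillna(df["income"].median())

# Flag outliers with a simple z-score rule. A threshold of 3 is the common
# heuristic on realistic sample sizes; 1.5 is used here only because the toy
# sample is tiny.
z = (df["income"] - df["income"].mean()) / df["income"].std()
df = df[z.abs() <= 1.5]

# Z-normalization: rescale each column to mean 0 and standard deviation 1.
df_scaled = (df - df.mean()) / df.std()
print(df_scaled.describe())
```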

Common Mistakes

  • Rushing into analysis
  • Insufficient planning

Key Roles

  • Business User
  • Project Sponsor
  • Project Manager
  • BI Analyst
  • Database Administrator
  • Data Engineer
  • Data Scientist

Lecture 2


Data Exploration & Preparation

Data Analytics Lifecycle

  1. Discovery
    • Domain learning
    • Stakeholder interviews
    • Resource and goal definition
    • Hypothesis formulation (H0, H1)
  2. Data Preparation
    • ETLT
    • Data conditioning
    • Formatting and visualization
  3. Model Planning
    • Variable selection
    • Candidate model identification
  4. Model Building
    • Training, validation, testing (see the sketch after this list)
  5. Communicating Results
    • Findings, limitations, recommendations
  6. Operationalization
    • Deployment
    • Monitoring
    • Training users
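
For the model building stage, here is a minimal train/validation/test split sketch using scikit-learn (my library choice; the course does not mandate one):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix and labels.
X = np.random.rand(1000, 5)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

# First hold out a test set, then split the remainder into train and validation.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 600 / 200 / 200
```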

Objectives of Data Exploration

  • Understand data structure
  • Assess data quality
  • Identify patterns and relationships
  • Formulate hypotheses

Statistical Tools

  • Mean, median, variance, standard deviation
  • Correlation and covariance
  • Hypothesis testing
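
As a quick illustration, the descriptive statistics above are one-liners in pandas (again my own choice of tooling, not something the course specifies):

```python
import pandas as pd

df = pd.DataFrame({
    "hours_studied": [2, 5, 1, 7, 4, 6],
    "exam_score":    [55, 72, 48, 90, 65, 80],
})

print(df.mean())    # column means
print(df.median())  # column medians
print(df.var())     # sample variance
print(df.std())     # sample standard deviation
print(df.corr())    # pairwise (Pearson) correlation
print(df.cov())     # pairwise covariance
```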

Common Tests

  • Two-sample t-test
  • Welch’s t-test
  • Wilcoxon Rank-Sum Test
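
A minimal sketch of the three tests with scipy.stats (library choice is mine); each p-value is compared against a significance level α, which controls the Type I error rate discussed below:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(loc=10.0, scale=2.0, size=40)  # group A
b = rng.normal(loc=11.0, scale=3.0, size=35)  # group B, different variance

# Student's two-sample t-test (assumes equal variances).
t, p = stats.ttest_ind(a, b, equal_var=True)

# Welch's t-test (does not assume equal variances).
t_w, p_w = stats.ttest_ind(a, b, equal_var=False)

# Wilcoxon rank-sum test (non-parametric alternative).
u, p_r = stats.ranksums(a, b)

alpha = 0.05  # acceptable Type I error rate
print(p < alpha, p_w < alpha, p_r < alpha)
```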

Error Types

  • Type I Error (α) – False positive: rejecting a null hypothesis that is actually true
  • Type II Error (β) – False negative: failing to reject a null hypothesis that is actually false

Lecture 3


Hypothesis Testing & Clustering

Hypothesis Testing Techniques

  • Student’s t-test
  • Welch’s t-test
  • Wilcoxon Rank-Sum Test
  • ANOVA
    • One-way ANOVA
    • Two-way ANOVA

ANOVA Limitations

  • Normality assumptions
  • Sensitivity to outliers
  • Requires post-hoc tests to locate which groups differ (e.g., Tukey's HSD)
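
A minimal one-way ANOVA sketch with scipy, followed by a Tukey HSD post-hoc test from statsmodels (both libraries are my assumption; the course does not prescribe tooling):

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(1)
g1 = rng.normal(10.0, 2.0, 30)
g2 = rng.normal(12.0, 2.0, 30)
g3 = rng.normal(10.5, 2.0, 30)

# One-way ANOVA: is at least one group mean different?
f, p = stats.f_oneway(g1, g2, g3)
print(f"F={f:.2f}, p={p:.4f}")

# ANOVA does not say which groups differ; Tukey's HSD locates the differences.
values = np.concatenate([g1, g2, g3])
groups = ["g1"] * 30 + ["g2"] * 30 + ["g3"] * 30
print(pairwise_tukeyhsd(values, groups, alpha=0.05))
```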

Clustering Algorithms

K-Means

  • Hard clustering
  • Sensitive to noise and initialization
  • Used in image processing and compression
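
A minimal K-Means sketch with scikit-learn (my library choice); the n_init restarts are one standard way of coping with the sensitivity to initialization noted above:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three well-separated Gaussian blobs.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

# Multiple restarts with different initial centroids mitigate bad initializations.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # learned centroids
print(km.labels_[:10])      # hard cluster assignment per point
```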

DBSCAN

  • Density-based clustering
  • Handles noise well
  • Struggles with varying densities
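
A minimal DBSCAN sketch, again assuming scikit-learn; any points that fall outside a dense region are labelled -1 rather than forced into a cluster:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: a shape K-Means handles poorly.
X, _ = make_moons(n_samples=300, noise=0.08, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_

print(set(labels))                        # cluster ids; -1 marks noise points
print(np.sum(labels == -1), "noise points")
```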

Self-Organizing Maps (SOM)

  • Neural-network-based clustering
  • Useful for visualization
  • Not ideal for sparse or very large datasets
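
There is no SOM implementation in scikit-learn; the sketch below assumes the third-party minisom package (my assumption, not something the course prescribes):

```python
import numpy as np
from minisom import MiniSom  # pip install minisom

rng = np.random.default_rng(2)
X = rng.random((500, 4))  # 500 samples, 4 features

# A 10x10 grid of neurons; each neuron holds a 4-dimensional weight vector.
som = MiniSom(10, 10, 4, sigma=1.0, learning_rate=0.5, random_seed=2)
som.random_weights_init(X)
som.train_random(X, 1000)  # 1000 training iterations

# Each sample maps to its best-matching unit (a 2-D grid coordinate),
# which is what makes SOMs useful for visualization.
print(som.winner(X[0]))
```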

Lecture 4


Final Thoughts

The subject provides a strong foundation in big data analytics, especially for understanding end-to-end data workflows and core analytical techniques. While some content overlaps with machine learning subjects, it’s still valuable for reinforcing concepts and seeing them applied at scale.



