Big Data Analytics
Course Overview & Reflection
Overall, this is a solid subject with no major negatives. I found some parts repetitive, mainly because I had already covered similar concepts in a previous Machine Learning course.
What I Like
- Exposure to a wider range of algorithms
- Deeper understanding of how big data techniques actually work
- Reinforced core data science concepts with real-world context
What I Don’t Like
- Some lectures felt repetitive
- I skipped a few later sessions because they overlapped heavily with prior knowledge
Introduction to Big Data
Sources of Big Data
Big Data is generated from a wide variety of sources, including:
- Mobile sensors
- Social media platforms
- Smart grids
- Video rendering systems
- Medical imaging
- Genetic data
- Surveillance cameras
- Geophysical data
Understanding Big Data: The “Vs”
Big Data isn’t just about size. It’s commonly described using several key characteristics:
- Velocity – Speed at which data is generated and processed
- Volume – Amount of data produced
- Variety – Structured, semi-structured, and unstructured data
- Value – Business value extracted from data
- Veracity – Accuracy and reliability of data
- Variability – Inconsistency and fluctuation in data flows
Structures of Big Data
Unstructured Data
- No predefined schema
- Examples: emails, videos, social media posts, satellite images
Quasi-Structured Data
- Partial organization, often with tags or metadata
- Examples: web server logs, network logs
Semi-Structured Data
- Has structure but is not relational
- Examples: XML, JSON, email metadata
Structured Data
- Fixed schema, rows and columns
- Examples: relational databases, SQL tables, spreadsheets
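To make the categories more concrete, here is a minimal sketch that reads the same hypothetical page-view event in three of these forms. The field names, the log layout, and the regular expression are illustrative assumptions, not taken from any particular system.

```python
import csv
import io
import json
import re

# Structured: fixed schema, every row has the same columns.
csv_text = "user_id,page,status\n42,/home,200\n"
structured = list(csv.DictReader(io.StringIO(csv_text)))[0]

# Semi-structured: self-describing keys and nesting, but no relational schema.
json_text = '{"user_id": 42, "page": "/home", "response": {"status": 200}}'
semi_structured = json.loads(json_text)

# Quasi-structured: a web server log line. The layout is regular enough to
# parse, but only with extra effort (here, a hand-written pattern).
log_line = '10.0.0.7 - - [01/Jan/2024:10:00:00 +0000] "GET /home HTTP/1.1" 200 512'
pattern = re.compile(
    r'^(?P<ip>\S+) .* "(?P<method>\S+) (?P<page>\S+) [^"]*" (?P<status>\d{3}) '
)
match = pattern.match(log_line)
quasi_structured = match.groupdict() if match else {}

print(structured)
print(semi_structured)
print(quasi_structured)
```

Unstructured data (video, free text, images) has no equivalent one-step parse, which is why it is left out of this sketch.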
Business Intelligence vs Data Science
- Business Intelligence (BI) focuses on analyzing historical data to report what has already happened
- Data Science focuses on prediction and future trends, using models to estimate what is likely to happen next
Data Lifecycle in Big Data Projects
Key Stages of the Data Lifecycle
- Discovery
  - Domain understanding
  - Define success/failure criteria
  - Stakeholder interviews
  - Initial hypotheses
- Data Source Identification
  - Identify data across departments and warehouses
- Data Preparation
  - Sandbox setup
  - ETLT (Extract, Transform, Load, Transform)
  - Data conditioning (noise, missing values, outliers)
  - Visualization (scatter plots, histograms, heatmaps)
  - Scaling and normalization (e.g., Z-normalization)
- Model Planning
  - Select candidate models
  - Variable selection
  - Tool and language choice
- Model Building
  - Train/test split and validation
- Communication of Results
  - Compare against success criteria
  - Reporting insights
- Operationalization
  - Deployment
  - Pilot projects
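The data conditioning and Z-normalization steps listed under Data Preparation can be sketched as follows. This is a minimal illustration that assumes missing readings are marked as None and are filled with the median of the observed values; median imputation is my own choice here, not something the course prescribes, and the readings are made up.

```python
from statistics import mean, median, pstdev

def condition(values):
    """Fill missing readings (None) with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def z_normalize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no spread, so map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

raw = [10.0, None, 12.0, 14.0, None, 16.0]  # hypothetical sensor readings
clean = condition(raw)
scaled = z_normalize(clean)
print(clean)
print([round(z, 3) for z in scaled])
```

After scaling, every feature is expressed in standard deviations from its own mean, which keeps features with large raw ranges from dominating distance-based methods.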
Common Mistakes
- Rushing into analysis
- Insufficient planning
Key Roles
- Business User
- Project Sponsor
- Project Manager
- BI Analyst
- Database Administrator
- Data Engineer
- Data Scientist
Data Exploration & Preparation
Data Analytics Lifecycle
- Discovery
  - Domain learning
  - Stakeholder interviews
  - Resource and goal definition
  - Hypothesis formulation (H0, H1)
- Data Preparation
  - ETLT
  - Data conditioning
  - Formatting and visualization
- Model Planning
  - Variable selection
  - Candidate model identification
- Model Building
  - Training, validation, testing
- Communicating Results
  - Findings, limitations, recommendations
- Operationalization
  - Deployment
  - Monitoring
  - Training users
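For the training, validation, and testing step under Model Building, a simple shuffled split might look like the sketch below. The 60/20/20 proportions, the fixed seed, and the function name split_dataset are assumptions for illustration; the right proportions depend on how much data is available.

```python
import random

def split_dataset(rows, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle a copy of rows and cut it into train, validation, and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains
    return train, val, test

rows = list(range(100))  # stand-in for 100 records
train, val, test = split_dataset(rows)
print(len(train), len(val), len(test))
```

The model is fitted on the training part, tuned on the validation part, and scored once on the test part, so the final score is not biased by tuning decisions.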
Objectives of Data Exploration
- Understand data structure
- Assess data quality
- Identify patterns and relationships
- Formulate hypotheses
Statistical Tools
- Mean, median, variance, standard deviation
- Correlation and covariance
- Hypothesis testing
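As a quick illustration of the descriptive statistics above, the sketch below uses Python's standard statistics module and computes covariance and Pearson correlation by hand. The study-hours data is invented for the example.

```python
from statistics import mean, median, stdev, variance

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / (stdev(xs) * stdev(ys))

hours = [1.0, 2.0, 3.0, 4.0, 5.0]        # hypothetical study hours
scores = [52.0, 55.0, 61.0, 64.0, 68.0]  # hypothetical exam scores

print("mean:", mean(scores), "median:", median(scores))
print("variance:", variance(scores), "std dev:", stdev(scores))
print("covariance:", covariance(hours, scores))
print("correlation:", correlation(hours, scores))
```

Covariance depends on the units of both variables, while correlation is unit-free and always falls between -1 and 1, which makes it easier to compare across variable pairs.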
Common Tests
- Two-sample t-test
- Welch’s t-test
- Wilcoxon Rank-Sum Test
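The difference between the pooled two-sample t-test and Welch's version is easiest to see in the formulas for the test statistic and degrees of freedom. The sketch below computes both by hand on invented data; it stops short of a p-value, which needs the t distribution (a statistics library such as SciPy provides that), and it does not cover the rank-sum test.

```python
from math import sqrt
from statistics import mean, variance

def student_t(a, b):
    """Two-sample t statistic with a pooled variance (assumes equal variances)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2

def welch_t(a, b):
    """Welch's t statistic with Welch-Satterthwaite degrees of freedom."""
    na, nb = len(a), len(b)
    sa, sb = variance(a) / na, variance(b) / nb
    t = (mean(a) - mean(b)) / sqrt(sa + sb)
    df = (sa + sb) ** 2 / (sa ** 2 / (na - 1) + sb ** 2 / (nb - 1))
    return t, df

group_a = [5.1, 4.9, 5.6, 5.8, 6.0]            # hypothetical measurements
group_b = [4.1, 4.5, 3.9, 6.5, 2.8, 5.0, 3.7]  # noisier second group

print("Student:", student_t(group_a, group_b))
print("Welch:  ", welch_t(group_a, group_b))
```

When the two groups have equal sizes the two statistics coincide, but Welch's degrees of freedom shrink as the variances diverge, which makes the test more conservative.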
Error Types
- Type I Error (α) – False positive
- Type II Error (β) – False negative
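Both error rates can be estimated by simulation. The sketch below is a rough illustration, assuming a two-sided z-test on samples from a normal distribution with known standard deviation 1; the sample size, effect size of 0.5, trial count, and seed are all arbitrary choices of mine.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

def rejection_rate(true_mean, alpha=0.05, n=30, trials=2000, seed=1):
    """Fraction of two-sided z-tests of H0: mean = 0 that reject H0."""
    rng = random.Random(seed)
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        z = mean(sample) * sqrt(n)  # sigma is known to be 1
        if abs(z) > critical:
            rejections += 1
    return rejections / trials

# H0 is true: every rejection is a false positive (Type I error).
alpha_hat = rejection_rate(true_mean=0.0)

# H0 is false: every non-rejection is a false negative (Type II error).
beta_hat = 1.0 - rejection_rate(true_mean=0.5)

print("estimated Type I rate:", alpha_hat)
print("estimated Type II rate:", beta_hat)
```

The estimated Type I rate should land near the chosen α, while the Type II rate depends on the effect size and sample size: a larger n or a larger true difference lowers β.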
Hypothesis Testing & Clustering
Hypothesis Testing Techniques
- Student’s t-test
- Welch’s t-test
- Wilcoxon Rank-Sum Test
- ANOVA
  - One-way ANOVA
  - Two-way ANOVA
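One-way ANOVA boils down to comparing the variance between group means with the variance inside the groups. A minimal sketch of the F statistic, using three small made-up groups:

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between, df_within = k - 1, n - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

groups = [
    [1.0, 2.0, 3.0],  # hypothetical treatment A
    [2.0, 3.0, 4.0],  # hypothetical treatment B
    [5.0, 6.0, 7.0],  # hypothetical treatment C
]
f_stat, df_b, df_w = one_way_anova_f(groups)
print(f_stat, df_b, df_w)
```

A large F suggests that at least one group mean differs from the others. Turning F into a p-value requires the F distribution, which is not in the standard library.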
ANOVA Limitations
- Normality assumptions
- Sensitivity to outliers
- Only shows that some group means differ, so post-hoc tests (e.g., Tukey’s HSD) are needed to identify which ones
Clustering Algorithms
K-Means
- Hard clustering
- Sensitive to noise and initialization
- Used in image processing and compression
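A bare-bones version of K-Means (Lloyd's algorithm) fits in a few lines. In this sketch the initial centroids are passed in explicitly, which is my own simplification so that the sensitivity to initialization is easy to experiment with; the points are invented.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, initial_centroids, iterations=20):
    """Lloyd's algorithm; returns (centroids, labels)."""
    centroids = [tuple(c) for c in initial_centroids]
    k = len(centroids)
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins exactly one cluster (hard clustering).
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, labels

points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, labels = kmeans(points, initial_centroids=[(1.0, 1.0), (8.0, 8.0)])
print(centroids)
print(labels)
```

Because every centroid is a plain mean, a single far-away outlier can drag a centroid off its cluster, which is the noise sensitivity mentioned above.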
DBSCAN
- Density-based clustering
- Handles noise well
- Struggles with varying densities
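The way DBSCAN separates dense regions from noise can be seen in a small from-scratch sketch. The parameter names eps and min_pts follow common usage, the values and points are made up, and the brute-force neighbor search is only suitable for tiny datasets.

```python
from math import dist

NOISE = -1

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    def neighbors(i):
        # Brute-force search; the point itself counts as its own neighbor.
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may still be claimed later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:  # core point: keep expanding
                queue.extend(j_neighbors)
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]  # the last point sits alone between the two groups
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

Since a single eps applies to the whole dataset, a value that suits a dense cluster will tend to mark a sparser cluster as noise, which is the varying-density weakness noted above.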
Self-Organizing Maps (SOM)
- Neural-network-based clustering
- Useful for visualization
- Not ideal for sparse or very large datasets
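For intuition, here is a very small one-dimensional SOM written from scratch. It is a rough sketch under several assumptions of mine: four units on a line, a Gaussian neighborhood, linearly decaying learning rate and radius, and toy data scaled to the range 0 to 1. Real SOMs usually use a 2-D grid and far more units.

```python
import math
import random

def best_matching_unit(weights, x):
    """Index of the unit whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda u: math.dist(weights[u], x))

def train_som(data, n_units=4, epochs=50, lr0=0.5, radius0=2.0, seed=0):
    """Train a 1-D self-organizing map; returns the unit weight vectors."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    for epoch in range(epochs):
        # Learning rate and neighborhood radius both shrink over time.
        frac = 1.0 - epoch / epochs
        lr = lr0 * frac
        radius = max(radius0 * frac, 0.5)
        for x in rng.sample(data, len(data)):  # visit samples in random order
            bmu = best_matching_unit(weights, x)
            for u, w in enumerate(weights):
                # Units near the best-matching unit on the grid move more.
                influence = math.exp(-((u - bmu) ** 2) / (2 * radius ** 2))
                for d in range(dim):
                    w[d] += lr * influence * (x[d] - w[d])
    return weights

data = [(0.10, 0.10), (0.15, 0.10), (0.10, 0.20),
        (0.90, 0.90), (0.85, 0.90), (0.90, 0.80)]
weights = train_som(data)
for x in data:
    print(x, "-> unit", best_matching_unit(weights, x))
```

Each input is mapped to a unit on the grid, and neighboring units end up with similar weights, which is what makes the map useful for visualizing high-dimensional data.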
Final Thoughts
This subject provides a strong foundation in big data analytics, especially for understanding end-to-end data workflows and core analytical techniques. While some content overlaps with machine learning subjects, it’s still valuable for reinforcing concepts and seeing them applied at scale.