Lab 02

Big Data & Hadoop Processing

MapReduce & Analytics

2026
Hadoop · MapReduce · HDFS

📋 Lab Overview

This lab demonstrates an end-to-end Big Data processing pipeline built on the Apache Hadoop ecosystem. It covers starting the cluster and checking its HDFS DataNodes, running distributed MapReduce jobs on large-scale datasets, and visualizing the output, which represents water quality violations.

01

Hadoop Cluster Startup & DataNodes

The first step in our distributed architecture was starting the Hadoop cluster daemons.

Terminal Console — Start-All

start-all.cmd
Hadoop terminal output showing startup of cluster

Once the cluster has started, verifying the active DataNodes and the HDFS capacity is a crucial health check.

This step covered configuring the NameNode and confirming that the DataNodes were active, since the reliability of the HDFS storage layer depends on them.

Terminal Console — DataNode Health

hadoop dfsadmin -report
Hadoop terminal output showing active datanodes and HDFS capacity
02

MapReduce Job Execution

Processing the data involved writing Map functions to filter input streams and Reduce functions to aggregate violation counts across different states and zones.

Mapper Example (Conceptual)
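import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ViolationMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] columns = value.toString().split(","); // naive CSV split; quoted commas are not handled
        if (columns.length > 4 && columns[4].equals("VIOLATION")) { // skip short or malformed rows
            context.write(new Text(columns[1]), new IntWritable(1)); // emit (grouping key, 1)
        }
    }
}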
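
The mapper above only shows the emit side. As a rough illustration of how the emitted pairs are then grouped and summed, here is a minimal plain-Java sketch of the map, shuffle, and reduce flow. It does not use the Hadoop API and is not the lab's actual reducer; the class name `ViolationCountSketch`, the column layout (`id,state,zone,parameter,status`), and the sample rows are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the map -> shuffle -> reduce flow; no Hadoop classes are used.
final class ViolationCountSketch {

    // Map phase: emit (columns[1], 1) for every row whose fifth column is "VIOLATION".
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            String[] columns = line.split(",");
            if (columns.length > 4 && columns[4].equals("VIOLATION")) {
                emitted.add(Map.entry(columns[1], 1));
            }
        }
        return emitted;
    }

    // Shuffle + reduce phase: group the emitted pairs by key and sum their values.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> emitted) {
        Map<String, Integer> totals = new TreeMap<>(); // TreeMap keeps keys sorted
        for (Map.Entry<String, Integer> pair : emitted) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Hypothetical sample rows (assumed layout: id,state,zone,parameter,status).
        List<String> rows = List.of(
            "1,Kerala,South,BOD,VIOLATION",
            "2,Kerala,South,pH,OK",
            "3,Punjab,North,BOD,VIOLATION",
            "4,Kerala,South,DO,VIOLATION",
            "malformed row");
        System.out.println(reduce(map(rows))); // prints {Kerala=2, Punjab=1}
    }
}
```

In a real job, Hadoop performs the grouping between the two phases itself, so a reducer would only contain the summing loop for one key at a time.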

MapReduce Job Monitor Overview

Hadoop Resource Manager UI
Hadoop UI overview showing running MapReduce Jobs and their status

Terminal Console — MapReduce Job Analytics

Terminal Job Exec 1
MapReduce Job Console Output 1
Terminal Job Exec 2
MapReduce Job Console Output 2
03

Output & Visualizations

After the distributed processing completed, the aggregated output was queried using Hive and visualized via a charting pipeline to represent the findings intuitively.
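
To make the hand-off from aggregation to charting concrete, the following sketch ranks aggregated counts for a bar chart. It assumes the counts are available as `key<TAB>count` text lines, which is the default layout of Hadoop's text output; the lab itself queried the results through Hive, so this is an illustrative alternative and not the pipeline that was used. The class name `ChartDataSketch` and the sample values are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch: turn "key<TAB>count" lines into chart-ready bars, highest count first.
final class ChartDataSketch {

    // Parse each line into a (label, count) pair and sort by descending count,
    // breaking ties alphabetically by label.
    static List<Map.Entry<String, Integer>> rank(List<String> outputLines) {
        List<Map.Entry<String, Integer>> bars = new ArrayList<>();
        for (String line : outputLines) {
            String[] parts = line.split("\t");
            if (parts.length != 2) {
                continue; // skip blank or malformed lines
            }
            // A non-numeric count would throw NumberFormatException here.
            bars.add(Map.entry(parts[0], Integer.parseInt(parts[1].trim())));
        }
        bars.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return bars;
    }

    public static void main(String[] args) {
        // Hypothetical aggregated counts, not results from the lab.
        List<String> lines = List.of("Kerala\t2", "Punjab\t5", "", "Assam\t5");
        for (Map.Entry<String, Integer> bar : rank(lines)) {
            // Render a crude text bar: one '#' per counted violation.
            System.out.println(bar.getKey() + " " + "#".repeat(bar.getValue()));
        }
    }
}
```

Sorting before plotting keeps the largest categories at the start of the chart, which is usually easier to read than the alphabetical order in which reducers write their keys.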

Visualization — Water Quality Violations

Violations Chart
Bar chart output of violations data aggregated by the Hadoop pipeline

Visualization — BOD Levels (Biochemical Oxygen Demand)

BOD Chart
Chart showing BOD levels across datasets
DOC

Lab Documentation

Read or download the complete lab report for Big Data & Hadoop Processing below.


Open PDF in New Tab