Lab 02 — Big Data & Hadoop

📋 Lab Overview

This lab walks through a Big Data processing pipeline built on the Apache Hadoop ecosystem. It covers bringing up the HDFS DataNodes, running distributed MapReduce jobs over a water quality dataset, and visualizing the resulting violation data.

Hadoop Cluster Startup & DataNodes

The first step in our distributed architecture was starting the Hadoop cluster daemons.

Terminal Console — Start-All

start-all.cmd

Hadoop terminal output showing startup of cluster

Once the daemons are running, the next step is to verify the active DataNodes and the HDFS capacity to confirm that the cluster is healthy.

Cluster setup also included configuring the NameNode and confirming that the DataNodes were active, since the reliability of the HDFS storage layer depends on them.

Terminal Console — DataNode Health

hdfs dfsadmin -report

Hadoop terminal output showing active datanodes and HDFS capacity

MapReduce Job Execution

Processing the data involved writing Map functions to filter input streams and Reduce functions to aggregate violation counts across different states and zones.

Mapper Example (Conceptual)

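import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ViolationMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] columns = value.toString().split(",");
        // Skip short rows; emit (columns[1], 1) for each violation record.
        if (columns.length > 4 && columns[4].equals("VIOLATION")) {
            context.write(new Text(columns[1]), new IntWritable(1));
        }
    }
}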
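
The map and reduce steps described above can be tried out locally without a cluster. The following is a minimal sketch in plain Java, not the lab's actual job: it applies the same filter as ViolationMapper and then sums the emitted 1s per key, which is what a summing reducer would do. The class name LocalViolationCount, the method names, and the sample rows are hypothetical, and the real dataset's column layout may differ.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LocalViolationCount {

    // Map step: return the key (column 1) for rows whose column 4 is "VIOLATION",
    // mirroring the filter in ViolationMapper. Returns null for skipped rows.
    public static String mapRow(String line) {
        String[] columns = line.split(",");
        if (columns.length > 4 && columns[4].equals("VIOLATION")) {
            return columns[1];
        }
        return null;
    }

    // Reduce step: add up one count per emitted key. The TreeMap keeps keys
    // sorted, similar to the sorted key order a reducer receives.
    public static Map<String, Integer> countViolations(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String key = mapRow(line);
            if (key != null) {
                counts.merge(key, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample rows for illustration only.
        List<String> rows = List.of(
            "1,StateA,Zone1,2020,VIOLATION",
            "2,StateA,Zone2,2020,OK",
            "3,StateB,Zone1,2020,VIOLATION",
            "4,StateA,Zone1,2021,VIOLATION",
            "5,malformed");
        System.out.println(countViolations(rows)); // prints {StateA=2, StateB=1}
    }
}
```

In a real Hadoop job the grouping by key happens in the shuffle phase between the mapper and the reducer; here the map's merge call stands in for both the shuffle and the sum.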

MapReduce Job Monitor Overview

Hadoop Resource Manager UI

Hadoop UI overview showing running MapReduce Jobs and their status

Terminal Console — MapReduce Job Analytics

Terminal Job Exec 1

Terminal Job Exec 2

Output & Visualizations

After the distributed processing completed, the aggregated output was queried using Hive and passed to a charting pipeline for visualization.
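
As a rough illustration of the step from aggregated output to a chart, the sketch below parses lines in the key, tab, count layout that Hadoop's TextOutputFormat writes by default, sorts them by count, and renders a text bar chart. It is an assumption-laden stand-in rather than the lab's charting pipeline: the lab's Hive query and charting tool are not shown here, and the class name ViolationBarChart, its methods, and the sample values are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ViolationBarChart {

    // Parse "key<TAB>count" lines and sort them by count, largest first.
    // Lines that do not have exactly two tab-separated fields are skipped.
    public static List<Map.Entry<String, Integer>> parse(List<String> lines) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>();
        for (String line : lines) {
            String[] parts = line.split("\t");
            if (parts.length == 2) {
                entries.add(Map.entry(parts[0], Integer.parseInt(parts[1].trim())));
            }
        }
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        return entries;
    }

    // Render one text bar per entry; each '#' stands for one violation.
    public static String render(List<Map.Entry<String, Integer>> entries) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> e : entries) {
            sb.append(String.format("%-8s %s %d\n",
                e.getKey(), "#".repeat(e.getValue()), e.getValue()));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical aggregated output for illustration only.
        List<String> output = List.of("StateA\t2", "StateB\t5");
        System.out.print(render(parse(output)));
    }
}
```

Sorting before rendering puts the keys with the most violations first, which is usually the ordering a bar chart of counts is meant to convey.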

Visualization — Water Quality Violations

Bar chart output of violations data aggregated by the Hadoop pipeline

Visualization — BOD Levels (Biochemical Oxygen Demand)

Chart showing BOD levels across datasets

Lab Documentation

The complete lab report for Big Data & Hadoop Processing is available as a PDF.