Building a Data Intensive Web Application with Cloudera, Hadoop, Hive, Pig, and EC2

Friday, October 2nd, 2009

This tutorial shows you how to use Amazon EC2 and Cloudera’s Distribution for Hadoop to run batch jobs for a data-intensive web application. It walks through the following data processing steps:

* Configure and launch a Hadoop cluster on Amazon EC2 using the Cloudera tools
* Load Wikipedia log data into Hadoop from Amazon Elastic Block Store (EBS) snapshots and Amazon S3
* Run simple Pig and Hive commands on the log data
* Write a MapReduce job to clean the raw data and aggregate it to a daily level (page_title, date, count)
* Write a Hive query that finds trending Wikipedia articles by calling a custom mapper script
* Join the trend data in Hive with a table of Wikipedia page IDs
* Export the trend query results to S3 as a tab-delimited text file for use in our web application’s MySQL database
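
The clean-and-aggregate step in the list above can be illustrated outside Hadoop with a minimal Python sketch. It is not the tutorial's actual MapReduce job. It assumes each raw line has four whitespace-separated fields (project, page title, hourly count, bytes) and that the date is supplied by the caller; the function names `map_line`, `reduce_counts`, and `daily_counts` are hypothetical.

```python
from collections import defaultdict


def map_line(line, date):
    """Mapper: return ((page_title, date), count) for a valid English row, else None.

    Assumed input format (not confirmed by the tutorial summary):
    "<project> <page_title> <hourly_count> <bytes>"
    """
    parts = line.split()
    if len(parts) != 4:
        return None  # drop malformed rows
    project, title, count, _size = parts
    if project != "en" or not count.isdigit():
        return None  # keep only English Wikipedia rows with a numeric count
    return (title, date), int(count)


def reduce_counts(pairs):
    """Reducer: sum counts per (page_title, date) key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return [(title, date, total) for (title, date), total in sorted(totals.items())]


def daily_counts(hourly_lines, date):
    """Run the map and reduce phases in memory over one day's hourly lines."""
    mapped = (map_line(line, date) for line in hourly_lines)
    return reduce_counts(pair for pair in mapped if pair is not None)
```

Calling `daily_counts(lines, "2009-10-01")` returns a sorted list of `(page_title, date, count)` tuples, the same shape as the daily table described above. A real job would distribute the two phases across the cluster rather than run them in one process.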

Apache Log Analysis using Pig

Thursday, October 1st, 2009

Analyze your Apache logs using Pig and Amazon Elastic MapReduce. The Pig script generates the following reports:

* Total bytes transferred per hour
* A list of the top 50 IP addresses by traffic per hour
* A list of the top 50 external referrers
* The top 50 search terms in referrals from Bing and Google

You can modify the Pig script to generate additional information.
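
To make the first report concrete, here is a minimal Python sketch of the "total bytes transferred per hour" computation. It is not the Pig script from the tutorial. It assumes log lines in the Apache common/combined log format, and the names `LOG_PATTERN` and `bytes_per_hour` are hypothetical.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumed layout: ip ident user [timestamp] "request" status size ...
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def bytes_per_hour(lines):
    """Sum response sizes per hour, keyed by 'YYYY-MM-DD HH:00' in the log's own timezone."""
    totals = defaultdict(int)
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None or match.group("size") == "-":
            continue  # skip unparseable lines and responses with no body
        stamp = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
        totals[stamp.strftime("%Y-%m-%d %H:00")] += int(match.group("size"))
    return dict(totals)
```

The other reports follow the same pattern: group by a different key (IP address, referrer, or search term) and keep the 50 largest groups.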