Building a Data-Intensive Web Application with Cloudera, Hadoop, Hive, Pig, and EC2

http://www.cloudera.com/hadoop-data-intensive-application-tutorial

This tutorial will show you how to use Amazon EC2 and Cloudera’s Distribution for Hadoop to run batch jobs for a data-intensive web application. During the tutorial, we will perform the following data processing steps:

* Configure and launch a Hadoop cluster on Amazon EC2 using the Cloudera tools
* Load Wikipedia log data into Hadoop from Amazon Elastic Block Store (EBS) snapshots and Amazon S3
* Run simple Pig and Hive commands on the log data (a Hive sketch of this step follows the list)
* Write a MapReduce job to clean the raw data and aggregate it to a daily level (page_title, date, count); a streaming mapper sketch follows the list
* Write a Hive query that finds trending Wikipedia articles by calling a custom mapper script (see the TRANSFORM sketch below)
* Join the trend data in Hive with a table of Wikipedia page IDs
* Export the trend query results to S3 as a tab-delimited text file for use in our web application’s MySQL database (the last sketch below combines the join and the export)
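
To make the Pig and Hive step concrete, here is a minimal Hive sketch. The table name, columns, and location are assumptions based on the published format of the hourly Wikipedia pagecount files (space-delimited project code, page title, view count, and bytes served); the Pig equivalent would walk through LOAD, FILTER, GROUP, and FOREACH instead.

```sql
-- Assumed schema for the hourly Wikipedia pagecount files
-- (space-delimited: project, page_title, views, bytes).
CREATE EXTERNAL TABLE raw_pagecounts (
  project    STRING,
  page_title STRING,
  views      BIGINT,
  bytes      BIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LOCATION '/user/hadoop/pagecounts';  -- assumed HDFS path

-- A first sanity check: the ten most-viewed English Wikipedia
-- pages in this batch of logs.
SELECT page_title, SUM(views) AS total_views
FROM raw_pagecounts
WHERE project = 'en'
GROUP BY page_title
ORDER BY total_views DESC
LIMIT 10;
```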
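
For the cleaning-and-aggregation job, one natural approach is Hadoop Streaming: a small Python mapper that normalizes each hourly record, paired with a reducer that sums the counts per (page_title, date) key. The sketch below is an assumption about how such a mapper might look, not the tutorial’s actual code; the English-only filter and the file-name parsing are hypothetical choices.

```python
#!/usr/bin/env python
"""Hypothetical Hadoop Streaming mapper for the cleaning step.

Reads hourly pagecount lines (project, page_title, views, bytes)
and emits page_title<TAB>date<TAB>views; a reducer that sums the
last field per (page_title, date) key yields the daily aggregates.
"""
import os
import sys

# Hadoop Streaming exposes the current split's file path as an
# environment variable (map_input_file in older releases,
# mapreduce_map_input_file in newer ones); the pagecount file names
# embed the date, e.g. pagecounts-20090430-230000.gz.
path = os.environ.get('mapreduce_map_input_file',
                      os.environ.get('map_input_file', ''))
name = os.path.basename(path)
date = name.split('-')[1] if name.startswith('pagecounts-') else 'unknown'

for line in sys.stdin:
    fields = line.strip().split(' ')
    if len(fields) != 4:
        continue          # drop malformed lines
    project, page_title, views, _bytes = fields
    if project != 'en':
        continue          # keep English Wikipedia only (assumed filter)
    if not views.isdigit():
        continue          # drop non-numeric counts
    print('\t'.join((page_title, date, views)))
```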
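
The trending step can invoke a custom script from HiveQL through the TRANSFORM clause. All names here are hypothetical: daily_counts(page_title, dt, views) stands in for a table holding the daily aggregates produced above, and trend_mapper.py for the custom script, which is assumed to read a page’s date-ordered rows and emit one page_title, trend_score pair per page.

```sql
-- Hypothetical names throughout: daily_counts, trends, trend_mapper.py.
ADD FILE trend_mapper.py;

CREATE TABLE trends (page_title STRING, trend_score DOUBLE);

-- The DISTRIBUTE BY / SORT BY subquery makes Hive run the script on
-- the reduce side, so each copy of it sees all of a page's rows
-- together, in date order.
FROM (
  SELECT page_title, dt, views
  FROM daily_counts
  DISTRIBUTE BY page_title
  SORT BY page_title, dt
) grouped
INSERT OVERWRITE TABLE trends
SELECT TRANSFORM (page_title, dt, views)
  USING 'python trend_mapper.py'
  AS (page_title STRING, trend_score DOUBLE);
```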
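
Finally, the join and the S3 export can be combined into one statement. page_lookup(page_id, page_title) and the bucket path are placeholders; note that attaching a ROW FORMAT clause to INSERT OVERWRITE DIRECTORY (to get the tab-delimited output that MySQL’s LOAD DATA INFILE can ingest) requires Hive 0.11 or later, while older releases wrote Ctrl-A-delimited text.

```sql
-- page_lookup and the destination bucket are placeholders.
INSERT OVERWRITE DIRECTORY 's3n://my-bucket/wikipedia-trends/'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
SELECT p.page_id, t.page_title, t.trend_score
FROM trends t
JOIN page_lookup p ON (t.page_title = p.page_title);
```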
