Archive for October, 2009

Building a Data Intensive Web Application with Cloudera, Hadoop, Hive, Pig, and EC2

Friday, October 2nd, 2009

This tutorial will show you how to use Amazon EC2 and Cloudera’s Distribution for Hadoop to run batch jobs for a data-intensive web application. During the tutorial, we will perform the following data processing steps:

* Configure and launch a Hadoop cluster on Amazon EC2 using the Cloudera tools
* Load Wikipedia log data into Hadoop from Amazon Elastic Block Store (EBS) snapshots and Amazon S3
* Run simple Pig and Hive commands on the log data
* Write a MapReduce job to clean the raw data and aggregate it to a daily level (page_title, date, count)
* Write a Hive query that finds trending Wikipedia articles by calling a custom mapper script
* Join the trend data in Hive with a table of Wikipedia page IDs
* Export the trend query results to S3 as a tab-delimited text file for use in our web application’s MySQL database
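
To make the clean-and-aggregate step concrete, here is a minimal Python sketch of rolling raw records up to (page_title, date, count). It is not the tutorial's MapReduce job: the tab-separated input layout (`YYYYMMDD-HH`, page title, count) and the function names `map_record` and `aggregate_daily` are assumptions made for illustration only.

```python
from collections import defaultdict
from datetime import datetime


def map_record(line):
    """Map step: parse one raw line into ((page_title, date), count).

    Assumed (hypothetical) input layout: "YYYYMMDD-HH<TAB>page_title<TAB>count".
    Returns None for malformed lines so they are dropped (the "clean" part).
    """
    parts = line.strip().split("\t")
    if len(parts) != 3:
        return None
    timestamp, page_title, count = parts
    try:
        day = datetime.strptime(timestamp, "%Y%m%d-%H").date().isoformat()
        return (page_title, day), int(count)
    except ValueError:
        return None


def aggregate_daily(lines):
    """Reduce step: sum counts per (page_title, date) key."""
    totals = defaultdict(int)
    for line in lines:
        record = map_record(line)
        if record is not None:
            key, count = record
            totals[key] += count
    # Emit sorted (page_title, date, count) rows.
    return sorted((title, day, total) for (title, day), total in totals.items())
```

In a real Hadoop job the grouping by key would be done by the framework's shuffle phase rather than an in-memory dictionary; the sketch only shows the shape of the transformation.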

Amazon EC2 Core Docs

Friday, October 2nd, 2009

EC2 API reference
Dev Guide
User Guide

Creating a New Image for EC2 by Rebundling a Running Instance

Friday, October 2nd, 2009

There are two primary ways to create an image for EC2:

1. Create an EC2 image from scratch. This process lets you control every detail of what goes into the image and is the easiest way to automate image creation.
2. Rebundle a running EC2 instance into a new image. This approach is the topic of the rest of this article.


Friday, October 2nd, 2009

Pantheon means ‘a group of gods’. Linux, Apache, MySQL, Drupal, Varnish, Hudson, Aegir: these open source projects are technology titans for web development. The Pantheon project packages them together seamlessly in machine instances instantly available on Amazon EC2.

site twitter

Apache LogAnalysis using Pig

Thursday, October 1st, 2009

Analyze your Apache logs using Pig and Amazon Elastic MapReduce. The Pig script generates reports such as:

* Total bytes transferred per hour
* A list of the top 50 IP addresses by traffic per hour
* A list of the top 50 external referrers
* The top 50 search terms in referrals from Bing and Google

You can modify the Pig script to generate additional information.
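
For readers who want to see the logic of the first two reports outside Pig, the following is a rough Python sketch that computes total bytes per hour and the top IP addresses by bytes per hour. It is not the Pig script from the article: the regular expression assumes Apache common/combined log format, and the name `hourly_traffic` is hypothetical.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime

# Assumes Apache common/combined log format; other LogFormat settings need a different pattern.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)


def hourly_traffic(lines, top_n=50):
    """Return (bytes per hour, top_n IPs by bytes for each hour)."""
    bytes_per_hour = Counter()
    bytes_per_ip = defaultdict(Counter)
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that do not look like access-log entries
        try:
            when = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
        except ValueError:
            continue  # skip entries with an unparseable timestamp
        # Hours are bucketed in the timezone recorded in the log line.
        hour = when.strftime("%Y-%m-%d %H:00")
        # Apache writes "-" when no response body was sent.
        size = 0 if match.group("bytes") == "-" else int(match.group("bytes"))
        bytes_per_hour[hour] += size
        bytes_per_ip[hour][match.group("ip")] += size
    top_ips = {hour: counts.most_common(top_n) for hour, counts in bytes_per_ip.items()}
    return dict(bytes_per_hour), top_ips
```

The referrer and search-term reports would follow the same pattern, with the referrer field extracted from the combined log format and counted instead of the client IP.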

Amazon’s EC2 Generating 220M+ Annually

Thursday, October 1st, 2009

How Big is Amazon’s EC2?
Big. 40,000 servers. I have independently confirmed this with at least two sources close to EC2. Obviously, I can’t reveal them, but they are personally known to me and reliable. The first source gave me the 40,000 number and the second confirmed that the number is close. At most, we’re talking +/- 10,000 servers, so within 25%, but I’m guessing I’m very close. More like +/- 5K. Regardless, within 25% is more than close enough for us to get a pretty good gauge. For our purposes today we’ll go with the 40K number.
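
The error margins quoted above can be checked with a couple of lines of Python. This is only an arithmetic sketch; the helper name `relative_margin` is made up for illustration.

```python
def relative_margin(estimate, margin):
    """Express an absolute +/- margin as a percentage of the estimate."""
    return 100.0 * margin / estimate


print(relative_margin(40_000, 10_000))  # 25.0  (the "within 25%" worst case)
print(relative_margin(40_000, 5_000))   # 12.5  (the "+/- 5K" guess)
```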