... more stuff at php-app-engine.com

Archive for the ‘HadoopAndPig’ Category

HADOOP FOR THE LONE ANALYST, WHY AND HOW

Friday, January 15th, 2010

http://blog.tech.stylefeeder.com/2010/01/14/hadoop-for-the-lone-analyst/

Here at StyleFeeder, we spend a lot of time figuring out what our users are doing, and trying to figure out what they want. One of the tools we have brought to bear on these questions is Hadoop. Among the technical tools these days, Hadoop is like the prettiest girl in school, and it’s easy to think you should be bringing her to every conceivable dance. You shouldn’t: there are plenty of problems that Hadoop can’t solve, or for which there are better tools. But there are some problem spaces where it excels: web analytics and preparation for search, to name two. This post is informed by our use of it for web analytics.

This is a long piece, but I figured we might as well get this all up in one place. To skip straight past the blather and into the HOWTO, go here.

Cloud9: Getting started with EC2

Wednesday, January 13th, 2010

http://www.umiacs.umd.edu/~jimmylin/cloud9/docs/content/start-EC2.html

This tutorial will get you started with Cloud9 on Amazon’s EC2 (running the simple word count demo). For a gentler introduction to Hadoop, or if you don’t feel like experimenting with EC2, try my tutorial on getting started with Cloud9 in standalone mode. This tutorial assumes you’ve already downloaded Cloud9 and gotten it set up. Otherwise, see my tutorial on that.

Michael G. Noll – Hadoop and Python

Wednesday, January 13th, 2010

http://www.michael-noll.com/wiki/Hadoop


* Writing An Hadoop MapReduce Program In Python
* Running Hadoop On Ubuntu Linux (Single-Node Cluster)
* Running Hadoop On Ubuntu Linux (Multi-Node Cluster)
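As a rough illustration of what a MapReduce word count in Python looks like, here is a minimal sketch. It is not taken from the tutorials listed above. The function names (`mapper`, `reducer`, `run_local`) are assumptions made for this example, and the sort step only imitates, in a single process, the shuffle that Hadoop would perform between the two phases.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Emit one (word, 1) pair per whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield word, 1


def reducer(pairs):
    """Sum the counts of consecutive pairs that share a key.

    Like a streaming reducer, this relies on its input being sorted by key.
    """
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def run_local(lines):
    """Simulate map -> sort (shuffle) -> reduce in a single process."""
    return dict(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    sample = ["foo bar foo", "bar baz"]
    for word, total in sorted(run_local(sample).items()):
        print(f"{word}\t{total}")
```

On a real cluster the mapper and reducer would typically be separate scripts reading standard input and writing tab-separated lines to standard output, with the framework handling the sort.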

Hadoop and HBase in production

Tuesday, January 12th, 2010

http://blog.readpath.com/2009/12/28/hadoop-and-hbase-in-production/

The personalized content scoring features of ReadPath depend on having a good measurement of term frequencies. To support this, there is a dictionary of all of the terms used in the content database along with their frequencies. The initial implementation of the dictionary wasn’t scaling properly, so it was converted to a Map/Reduce job that stores data in HBase. The dictionary processing went from a system that was having trouble keeping up with the incoming stream of content (ReadPath adds ~1,500 new items / minute) to one that could completely rebuild a dictionary from 250 million content items in under 3 hours (this equates to ~1,400,000 items / minute).
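The throughput figures quoted above can be checked with a few lines of arithmetic. This is only a sketch: the helper name `items_per_minute` is made up for the example, and because the rebuild finished in "under 3 hours", the computed rate is a lower bound.

```python
def items_per_minute(total_items, hours):
    """Average processing rate, in items per minute."""
    return total_items / (hours * 60)


# Figures quoted in the excerpt.
rebuild_rate = items_per_minute(250_000_000, 3)  # full dictionary rebuild
incoming_rate = 1_500                            # new items arriving per minute

# How many times faster the rebuild runs than new content arrives.
headroom = rebuild_rate / incoming_rate

if __name__ == "__main__":
    print(f"rebuild rate: {rebuild_rate:,.0f} items/minute")
    print(f"headroom over incoming stream: {headroom:,.0f}x")
```

The result, roughly 1.39 million items per minute, agrees with the rounded ~1,400,000 in the excerpt and is on the order of 900 times the incoming rate.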

One of the main things keeping me from pulling the trigger on porting to HBase was concern about data loss. In my first day of playing with HBase, I had a bad server take out the .META. table, resulting in complete loss of the HBase tables. I pulled that server and haven’t had any data loss since, but I have also made good use of the HBase Exporter Map/Reduce job, which dumps the contents of your tables to HDFS. The dump can then be easily restored if for some reason the HBase tables become corrupted. These backup and restore techniques are actually much easier than the standard systems used for MySQL at the scale that ReadPath had gotten to.

Re: Analyzing MySQL slow query logs using Pig + Hadoop

Monday, January 11th, 2010

http://www.mail-archive.com/pig-user@hadoop.apache.org/msg01633.html

A word of warning regarding that blog post — it’s written to explain
things, not to show how one would run them in production. So it’s a
bit verbose and does silly things like calling out to awk. Don’t take
it as a style guide :-) .

Someone recently commented that it’s way too long for the job it does,
so I shrunk it — here’s the equivalent, but more terse version:

If you need to write one, please look into
http://hadoop.apache.org/pig/docs/r0.5.0/udf.html.
It has some sample UDFs and usage.

HadoopHackDay was a major hit

Monday, January 11th, 2010

http://www.jonathanboutelle.com/mt/archives/2010/01/hadoophackday_w.html

Hadoop is very resource-intensive! We started out using 1-node clusters to run our jobs against small subsets of data. Very quickly, teams started upgrading to 5-node clusters because of how long they were having to wait for results. Final runs against full data sets were powered by 10-node clusters of “medium” EC2 servers. You have no choice but to use cloud computing for these kinds of jobs, because it seems to me that production use could easily require hundreds of nodes, and no one would want to buy that many servers for machines that they only use one hour a day.
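The "one hour a day" argument can be made concrete with a small utilization calculation. This is an illustrative sketch only: the 100-node cluster size is an assumption taken from the low end of the "hundreds of nodes" remark, and no prices are modeled.

```python
HOURS_PER_DAY = 24


def utilization(hours_used_per_day):
    """Fraction of the day that a machine is actually busy."""
    return hours_used_per_day / HOURS_PER_DAY


def node_hours(nodes, hours_per_day):
    """Total node-hours consumed per day."""
    return nodes * hours_per_day


# Assumed scenario: a 100-node cluster needed for one hour each day.
owned = node_hours(100, HOURS_PER_DAY)  # owned servers are powered all day
rented = node_hours(100, 1)             # rented nodes exist only while the job runs

if __name__ == "__main__":
    print(f"utilization of owned servers: {utilization(1):.1%}")
    print(f"node-hours per day, owned vs. rented: {owned} vs. {rented}")
```

Under these assumptions the owned machines sit idle for about 96% of the day, which is the gap that renting by the hour closes.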

Hadoop, Pig, and Twitter (NoSQL East 2009)

Friday, January 8th, 2010

http://www.slideshare.net/kevinweil/hadoop-pig-and-twitter-nosql-east-2009

A talk on the use of Hadoop and Pig inside Twitter, focusing on the flexibility and simplicity of Pig, and the benefits of that for solving real-world big data problems.

Problem Solving with Apache Hadoop & Pig

Friday, January 1st, 2010

http://www.slideshare.net/hadoop/practical-problem-solving-with-apache-hadoop-pig

Practical Problem Solving with Hadoop and Pig, by Milind Bhandarkar (milindb@yahoo-inc.com)

piglet

Wednesday, December 30th, 2009

http://github.com/iconara/piglet

Piglet is a DSL for writing Pig Latin scripts in Ruby:

a = load 'input'
b = a.group :c
store b, 'output'

The code above will be translated to the following Pig Latin:

relation_2 = LOAD 'input';
relation_1 = GROUP relation_2 BY c;
STORE relation_1 INTO 'output';

The aim is to support most of Pig Latin, but currently there are some limitations.

Building Hadoop Clusters On Linux In EC2

Tuesday, December 22nd, 2009

http://www.higherpass.com/linux/Tutorials/Building-Hadoop-Clusters-On-Linux-In-Ec2/

Learn to build and use multi-node Hadoop clusters running in Amazon EC2. This article assumes some background knowledge, starting with a basic knowledge of Hadoop. If you haven’t used Hadoop before, you probably want to read the Intro To Hadoop article first.

http://www.higherpass.com/java/Tutorials/Building-Hadoop-Mapreduce-Jobs-In-Java/

Hadoop is a parallel job processing framework from the Apache Software Foundation. The Hadoop framework is written in Java and supports jar files for job execution. This tutorial covers building a MapReduce job in Java. The dataset used is the 2000 US Census, available as an EBS volume snapshot on Amazon EC2. The census dataset is extremely large, and only a small part of it will be explained.

A Benchmark for Hive, PIG and Hadoop

Tuesday, December 22nd, 2009

http://issues.apache.org/jira/secure/attachment/12413737/hive_benchmark_2009-07-12.pdf

pigpy

Sunday, December 20th, 2009

http://code.google.com/p/pigpy/

pypig – a python tool to manage Pig reports

Pig provides an amazing set of tools to create complex relational processes on top of Hadoop, but it has a few missing pieces:

* Looping constructs for easily creating multiple similar reports
* Caching of intermediate calculations
* Data management and cleanup code
* Easy testing for report correctness

pypig is an attempt to fill in these holes by providing a python module that knows how to talk to a Hadoop cluster and can create and manage complex report structures.
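To make the first missing piece concrete, here is a hedged sketch of how a host language can supply the loop that Pig Latin lacks, by generating one script per report from a template. It does not use or imitate the tool's actual API. The template, the `build_reports` helper, the field names, and the `logs/` and `reports/` paths are all hypothetical.

```python
from string import Template

# Hypothetical report template; the schema and paths are assumptions.
PIG_TEMPLATE = Template("""\
raw = LOAD '$input_path' AS (user:chararray, url:chararray);
grouped = GROUP raw BY $field;
counts = FOREACH grouped GENERATE group, COUNT(raw);
STORE counts INTO '$output_path';
""")


def build_reports(day, fields):
    """Return one Pig Latin script per grouping field for the given day."""
    scripts = {}
    for field in fields:
        scripts[field] = PIG_TEMPLATE.substitute(
            input_path=f"logs/{day}",
            field=field,
            output_path=f"reports/by_{field}/{day}",
        )
    return scripts


if __name__ == "__main__":
    for name, script in build_reports("2009-12-20", ["user", "url"]).items():
        print(f"-- report grouped by {name}")
        print(script)
```

Each generated script could then be written to a file and submitted to Pig. Caching, cleanup, and correctness testing would need further logic that this sketch does not attempt.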