Archive for the ‘HadoopAndPig’ Category
Friday, January 15th, 2010
http://blog.tech.stylefeeder.com/2010/01/14/hadoop-for-the-lone-analyst/
Here at StyleFeeder, we spend a lot of time figuring out what our users are doing, and trying to figure out what they want. One of the tools we have brought to bear on these questions is Hadoop. Among the technical tools these days, Hadoop is like the prettiest girl in school, and it’s easy to think you should be bringing her to every conceivable dance. You shouldn’t: there are plenty of problems that Hadoop can’t solve, or for which there are better tools. But there are some problem spaces where it excels: web analytics and preparation for search, to name two. This post is informed by our use of it for web analytics.
This is a long piece, but I figured we might as well get this all up in one place. To skip straight past the blather and into the HOWTO, go here.
Posted in HadoopAndPig | 1 Comment »
Wednesday, January 13th, 2010
http://www.umiacs.umd.edu/~jimmylin/cloud9/docs/content/start-EC2.html
This tutorial will get you started with Cloud9 on Amazon’s EC2 (running the simple word count demo). For a gentler introduction to Hadoop, or if you don’t feel like experimenting with EC2, try my tutorial on getting started with Cloud9 in standalone mode. This tutorial assumes you’ve already downloaded Cloud9 and gotten it set up. Otherwise, see my tutorial on that.
Posted in HadoopAndPig, HowTos | No Comments »
Wednesday, January 13th, 2010
http://www.michael-noll.com/wiki/Hadoop
* Writing An Hadoop MapReduce Program In Python
* Running Hadoop On Ubuntu Linux (Single-Node Cluster)
* Running Hadoop On Ubuntu Linux (Multi-Node Cluster)
Posted in HadoopAndPig, HowTos | No Comments »
Tuesday, January 12th, 2010
http://blog.readpath.com/2009/12/28/hadoop-and-hbase-in-production/
The personalized content scoring features of ReadPath depend on having a good measurement of term frequencies. So to support this, there is a dictionary of all of the terms used in the content database along with their frequencies. The initial implementation of the dictionary wasn’t scaling properly so it was converted to a Map/Reduce job that stores data in HBase. The dictionary processing went from a system that was having trouble keeping up with the incoming stream of content (ReadPath adds ~1,500 new items / minute) to one that could completely rebuild a dictionary from 250 million content items in under 3 hours (this equates to ~1,400,000 items / minute).
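The dictionary job described above is essentially a term count expressed as map and reduce steps. A minimal sketch in plain Python, assuming a simple lowercase tokenizer and in-memory data (the function names are hypothetical, and nothing here touches Hadoop or HBase), might look like this:

```python
import re
from collections import defaultdict

# Assumed tokenizer: lowercase runs of letters, digits and apostrophes.
TOKEN = re.compile(r"[a-z0-9']+")


def map_terms(item_text):
    """Map step: emit (term, 1) for every term in one content item."""
    for term in TOKEN.findall(item_text.lower()):
        yield term, 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each term."""
    totals = defaultdict(int)
    for term, count in pairs:
        totals[term] += count
    return dict(totals)


def build_dictionary(items):
    """Run the map step over every item, then reduce to term frequencies."""
    pairs = (pair for text in items for pair in map_terms(text))
    return reduce_counts(pairs)
```

Calling build_dictionary on a list of strings returns a dict of term frequencies. In a real job the reduce output would be written to a store such as HBase rather than returned. As a sanity check on the quoted rate, 250,000,000 items in 180 minutes is roughly 1,390,000 items per minute.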
One of the main items that was keeping me from pulling the trigger on porting to HBase was concern about data loss. In my first day of playing with HBase, I had a bad server take out the .META. table, resulting in complete loss of the HBase tables. I pulled that server and haven’t had any data loss since, but have also made good use of the HBase Exporter Map/Reduce job that will dump the contents of your tables to HDFS. This can then be easily restored if for some reason the HBase tables become corrupted. These backup and restore techniques are actually much easier than the standard systems used for MySQL at the scale that ReadPath had gotten to.
Posted in HadoopAndPig, noSQL | No Comments »
Monday, January 11th, 2010
http://www.mail-archive.com/pig-user@hadoop.apache.org/msg01633.html
A word of warning regarding that blog post — it’s written to explain
things, not to show how one would run them in production. So it’s a
bit verbose and does silly things like calling out to awk. Don’t take
it as a style guide.
Someone recently commented that it’s way too long for the job it does,
so I shrunk it — here’s the equivalent, but more terse version:
More
If you need to write one, please look into
http://hadoop.apache.org/pig/docs/r0.5.0/udf.html.
It has some sample UDFs and usage.
Posted in HadoopAndPig | No Comments »
Monday, January 11th, 2010
http://www.jonathanboutelle.com/mt/archives/2010/01/hadoophackday_w.html
- Hadoop is very resource-intensive! We started out using 1-node clusters to run our jobs against small subsets of data. Very quickly teams started upgrading to 5-node clusters due to the amount of time they were having to wait for results. Final runs against full data sets were powered by 10-node clusters of “medium” EC2 servers. You have no choice but to use cloud computing for these kinds of jobs, because it seems to me that production use could easily require hundreds of nodes, and no one would want to buy that many servers for machines that they only use one hour a day.
Posted in HadoopAndPig, Money, Performance | No Comments »
Friday, January 8th, 2010
http://www.slideshare.net/kevinweil/hadoop-pig-and-twitter-nosql-east-2009
A talk on the use of Hadoop and Pig inside Twitter, focusing on the flexibility and simplicity of Pig, and the benefits of that for solving real-world big data problems.
Posted in HadoopAndPig | No Comments »
Friday, January 1st, 2010
http://www.slideshare.net/hadoop/practical-problem-solving-with-apache-hadoop-pig
Practical Problem Solving with Hadoop and Pig Milind Bhandarkar (milindb@yahoo-inc.com)
Posted in HadoopAndPig | No Comments »
Wednesday, December 30th, 2009
http://github.com/iconara/piglet
Piglet is a DSL for writing Pig Latin scripts in Ruby:
a = load 'input'
b = a.group :c
store b, 'output'
The code above will be translated to the following Pig Latin:
relation_2 = LOAD 'input';
relation_1 = GROUP relation_2 BY c;
STORE relation_1 INTO 'output';
The aim is to support most of Pig Latin, but currently there are some limitations.
Posted in HadoopAndPig, Ruby Kids | No Comments »
Tuesday, December 22nd, 2009
http://www.higherpass.com/linux/Tutorials/Building-Hadoop-Clusters-On-Linux-In-Ec2/
Learn to build and use multi-node Hadoop clusters running in Amazon EC2. This article assumes some background, starting with a basic knowledge of Hadoop. If you haven’t used Hadoop before, you probably want to read the Intro To Hadoop article first.
http://www.higherpass.com/java/Tutorials/Building-Hadoop-Mapreduce-Jobs-In-Java/
Hadoop is a parallel job processing framework from the Apache Software Foundation. The Hadoop framework is written in Java and supports jar files for job execution. This tutorial is going to cover building a MapReduce job in Java. The dataset being used will be the 2000 US Census, available as an EBS volume snapshot on Amazon EC2. The census dataset is extremely large, and only a small part of the overall dataset will be explained.
Posted in HadoopAndPig, HowTos | 1 Comment »
Tuesday, December 22nd, 2009
Posted in HadoopAndPig, Performance | No Comments »
Sunday, December 20th, 2009
http://code.google.com/p/pigpy/
pigpy – a Python tool to manage Pig reports
Pig provides an amazing set of tools to create complex relational processes on top of Hadoop, but it has a few missing pieces:
* Looping constructs for easily creating multiple similar reports
* Caching of intermediate calculations
* Data management and cleanup code
* Easy testing for report correctness
pigpy is an attempt to fill in these holes by providing a Python module that knows how to talk to a Hadoop cluster and can create and manage complex report structures.
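To illustrate the first missing piece, looping over similar reports, here is a rough sketch that generates one Pig Latin script per grouping field from a template. It does not use the project's own API; the template, the field names, and the make_reports helper are all made up for illustration.

```python
from string import Template

# Hypothetical report template; relation names and the schema are assumptions.
REPORT = Template(
    "logs = LOAD '$input' AS (user:chararray, country:chararray, browser:chararray);\n"
    "grouped = GROUP logs BY $field;\n"
    "counts = FOREACH grouped GENERATE group, COUNT(logs);\n"
    "STORE counts INTO '$output';\n"
)


def make_reports(fields, input_path, output_root):
    """Return a dict mapping each grouping field to its Pig Latin script."""
    scripts = {}
    for field in fields:
        scripts[field] = REPORT.substitute(
            input=input_path,
            field=field,
            output=f"{output_root}/by_{field}",
        )
    return scripts
```

Each generated script differs only in its GROUP BY field and output path, which is the kind of repetition that plain Pig Latin makes you write out by hand.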
Posted in HadoopAndPig, Open Source Projects | No Comments »
Wednesday, December 16th, 2009
http://www.larsgeorge.com/2009/10/hive-vs-pig.html
While I was looking at Hive and Pig for processing large amounts of data without the need to write MapReduce code I found that there is no easy way to compare them against each other without reading into both in greater detail.
In this post I am trying to give you a 10,000ft view of both and compare some of the more prominent and interesting features. The following table – which is discussed below – compares what I deemed to be such features:
Posted in HadoopAndPig | No Comments »
Wednesday, December 9th, 2009
http://mrflip.github.com/wukong/index.html
Wukong: Hadoop made so easy a Chimpanzee could run it.
Treat your dataset like a
* stream of lines when it’s efficient to process by lines
* stream of field arrays when it’s efficient to deal directly with fields
* stream of lightweight objects when it’s efficient to deal with objects
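The three views listed above can be pictured as layered generators. This is a plain Python sketch of the idea, not Wukong's own Ruby API; the Record type and the tab separator are assumptions.

```python
from collections import namedtuple

# Hypothetical two-field record used only for this illustration.
Record = namedtuple("Record", ["user", "text"])


def as_lines(raw):
    """View the dataset as a stream of non-empty lines."""
    for line in raw.splitlines():
        if line:
            yield line


def as_fields(raw, sep="\t"):
    """View the dataset as a stream of field arrays."""
    for line in as_lines(raw):
        yield line.split(sep)


def as_objects(raw, sep="\t"):
    """View the dataset as a stream of lightweight objects."""
    for fields in as_fields(raw, sep):
        yield Record(*fields)
```

Each view builds on the one before it, so a job can stop at whichever level is cheapest for the work it needs to do.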
Wukong is friends with Hadoop the elephant, Pig the query language, and the cat on your command line.
Send Wukong questions to the Infinite Monkeywrench mailing list
Posted in HadoopAndPig, Open Source Projects | No Comments »
Wednesday, December 9th, 2009
http://www.scribd.com/doc/23844299/Map-Reduce-Hadoop-Pig
A Hadoop, MapReduce and Pig summary
PowerPoint, 24 pages
Posted in HadoopAndPig | No Comments »
Tuesday, December 8th, 2009
https://issues.apache.org/jira/browse/PIG-200
To benchmark Pig performance, we need to have a TPC-H-like large data set plus a script collection. This is used to compare different Pig releases, and Pig against other systems (e.g. Pig + Hadoop vs. Hadoop only).
Here is the wiki for small tests: http://wiki.apache.org/pig/PigPerformance
I am currently running long-running Pig scripts over datasets on the order of tens of TBs. The next step is hundreds of TBs.
Posted in HadoopAndPig, Performance | No Comments »
Monday, December 7th, 2009
Posted in HadoopAndPig | No Comments »
Wednesday, December 2nd, 2009
http://www.cascading.org/
Cascading is a feature rich API for defining and executing complex, scale-free, and fault tolerant data processing workflows on a Hadoop cluster.
The processing API lets the developer quickly assemble complex distributed processes without having to “think” in MapReduce, and efficiently schedule them based on their dependencies and other available metadata. Simple data processing applications are supported as well, since complex jobs tend to start simple.
Cascading is Open Source and dual licensed under the GPL and OEM/Commercial Licenses. OEM/Commercial Licenses and Developer Support can be obtained through Concurrent, Inc.
Posted in Big Guys, HadoopAndPig, Open Source Projects | No Comments »
Wednesday, December 2nd, 2009
http://developeraspirations.wordpress.com/2009/11/30/pig-frustrations/
My desire to implement better scalability through pre-processing reports via the Grid has led me to Pig. Unfortunately, while Pig does remove some of the difficulties of writing for Hadoop (you no longer have to write all of the map-reduce jobs yourself in Java), it has many limitations.
Posted in HadoopAndPig | No Comments »
Monday, November 30th, 2009
http://www.cloudera.com/blog/2009/06/17/analyzing-apache-logs-with-pig/
In this blog post, we will use Pig to examine the download logs recorded on our server, demonstrating several features that are often glossed over in introductory Pig tutorials—parameter substitution in PigLatin scripts, Pig Streaming, and the use of custom loaders and user-defined functions (UDFs). It’s worth mentioning here that, as of last week, the Cloudera Distribution for Hadoop includes a package for Pig version 0.2 for both Red Hat and Ubuntu, as promised in an earlier post. It’s as simple as apt-get install pig or yum install hadoop-pig.
Posted in HadoopAndPig | No Comments »
Thursday, November 26th, 2009
http://blog.tonybain.com/tony_bain/2009/11/analytics-at-twitter.html
Twitter, like many Web 2.0 apps, started life as a MySQL-based RDBMS application. Today, Twitter is still using MySQL for much of their online operational functionality (although this is likely to change in the near future – think distributed), but on the analytics side of things Twitter has spent the last 6 months moving away from running SQL queries against MySQL data marts. This was because their need for timely data was becoming a struggle with MySQL, particularly when dealing with very large data volumes and complicated queries. For Web 2.0 companies, the ability to understand, quantify and make timely predictions from user behavior is very much their lifeblood. When Kevin arrived at Twitter 6 months ago he was tasked with changing the way Twitter analyzed their data. Now the bulk of their analytics is executed using a Hadoop platform with Pig as the “querying language”.
Posted in HadoopAndPig | No Comments »
Wednesday, November 11th, 2009
http://atbrox.com/2009/11/11/how-to-combine-elastic-mapreducehadoop-with-other-amazon-web-services/
Elastic Mapreduce’s default behavior is to read from and store to S3. When you need to access other AWS services, e.g. SQS queues or the database services SimpleDB and RDS (MySQL), the best approach from Python is to use Boto. To get Boto to work with Elastic Mapreduce you need to dynamically load boto on each mapper and reducer. Cloudera’s Jeff Hammerbacher outlined how to do that using the Hadoop Distributed Cache, and Peter Skomorroch suggested how to load Boto to access Elastic Block Store (EBS). This posting is based on those ideas and gives a detailed description of how to do it.
How to combine Elastic Mapreduce with other AWS Services
This posting shows how to load boto in an Elastic Mapreduce mapper and gives a simple example of how to use SimpleDB from the same mapper. For accessing other AWS services from Elastic Mapreduce, e.g. SQS, check out the Boto documentation (it is quite easy once the boto + EMR integration is in place).
Posted in HadoopAndPig, HowTos, Misc | No Comments »
Tuesday, November 10th, 2009
http://code.google.com/p/cloudmapreduce/
Cloud MapReduce was developed at Accenture Technology Labs by Huan Liu and Dan Orban. It is a MapReduce implementation on top of the Amazon Cloud OS.
By exploiting a cloud OS’s scalability, Cloud MapReduce achieves three primary advantages over other MapReduce implementations built on a traditional OS:
* It is faster than other implementations (e.g., 60 times faster than Hadoop in one case).
* It is more scalable because it has no single point of bottleneck.
* It is dramatically simpler with only 3,000 lines of code (e.g., two orders of magnitude simpler than Hadoop).
See details in Cloud MapReduce Technical Report.
See Command line options for details on how to specify a job run, and Pre-built AMI for how to use the pre-built AMI image to make running the job easier. A tutorial is coming soon.
Posted in HadoopAndPig, Open Source Projects | No Comments »
Wednesday, October 7th, 2009
http://atbrox.com/2009/10/07/how-to-use-c-compiled-python-for-amazons-elastic-mapreduce-hadoop/
Sometimes it can be useful to compile Python code for Amazon’s Elastic Mapreduce into C++ and then into a binary. The motivation for that could be to integrate with (existing) C or C++ code, or to increase performance for CPU-intensive mapper or reducer methods. Here follows a description of how to do that:
Based on Shedskin
Posted in HadoopAndPig, HowTos, Misc | No Comments »
Friday, October 2nd, 2009
http://www.cloudera.com/hadoop-data-intensive-application-tutorial
This tutorial will show you how to use Amazon EC2 and Cloudera’s Distribution for Hadoop to run batch jobs for a data intensive web application. During the tutorial, we will perform the following data processing steps:
* Configure and launch a Hadoop cluster on Amazon EC2 using the Cloudera tools
* Load Wikipedia log data into Hadoop from Amazon Elastic Block Store (EBS) snapshots and Amazon S3
* Run simple Pig and Hive commands on the log data
* Write a MapReduce job to clean the raw data and aggregate it to a daily level (page_title, date, count)
* Write a Hive query that finds trending Wikipedia articles by calling a custom mapper script
* Join the trend data in Hive with a table of Wikipedia page IDs
* Export the trend query results to S3 as a tab delimited text file for use in our web application’s MySQL database
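The cleaning and daily aggregation step in the list above can be sketched in plain Python. The input layout here (tab-separated page title, ISO-style timestamp, hit count) is an assumption made for illustration and is not taken from the tutorial; the function names are hypothetical as well.

```python
from collections import Counter


def map_record(line):
    """Map step: parse 'page_title<TAB>timestamp<TAB>hits' (assumed layout)."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 3 or not parts[2].isdigit():
        return None  # drop malformed rows as part of cleaning
    title, timestamp, hits = parts
    # The first 10 characters of an ISO-style timestamp are YYYY-MM-DD.
    return (title, timestamp[:10]), int(hits)


def aggregate_daily(lines):
    """Reduce step: sum hits per (page_title, date) and return sorted rows."""
    totals = Counter()
    for line in lines:
        mapped = map_record(line)
        if mapped is not None:
            key, hits = mapped
            totals[key] += hits
    return sorted((title, date, count) for (title, date), count in totals.items())
```

The output rows have the (page_title, date, count) shape named in the list, which is also a convenient shape for a tab-delimited export.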
Posted in HadoopAndPig, HowTos | No Comments »
Thursday, October 1st, 2009
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2728
Analyze your Apache logs using Pig and Amazon Elastic MapReduce.
* Total bytes transferred per hour
* A list of the top 50 IP addresses by traffic per hour
* A list of the top 50 external referrers
* The top 50 search terms in referrals from Bing and Google
You can modify the Pig script to generate additional information.
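The tutorial produces these reports with a Pig script. Purely as an illustration of the first two, here is a small Python sketch that assumes access logs in the Apache combined log format; the regular expression and the summarize helper are assumptions, not code from the tutorial.

```python
import re
from collections import Counter, defaultdict

# Assumed layout: ip ident user [dd/Mon/yyyy:HH:MM:SS zone] "request" status bytes ...
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<day>[^:]+):(?P<hour>\d{2}):[^\]]+\] '
    r'"[^"]*" \d{3} (?P<bytes>\d+|-)'
)


def summarize(lines, top_n=50):
    """Return (total bytes per hour, top IPs by bytes within each hour)."""
    bytes_by_hour = Counter()
    bytes_by_hour_ip = defaultdict(Counter)
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue  # skip lines that do not look like access log entries
        hour = f"{m['day']} {m['hour']}:00"
        size = 0 if m["bytes"] == "-" else int(m["bytes"])
        bytes_by_hour[hour] += size
        bytes_by_hour_ip[hour][m["ip"]] += size
    top_ips = {h: c.most_common(top_n) for h, c in bytes_by_hour_ip.items()}
    return dict(bytes_by_hour), top_ips
```

The same grouping keys (hour, then IP within hour) are what a Pig script would express with GROUP and ORDER, so the sketch is mainly useful for checking expected results on a small sample.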
Also:
Parsing Logs with Apache Pig and Elastic MapReduce
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2729
This tutorial shows you how to develop a simple, log parsing application using Pig and Amazon Elastic MapReduce. The tutorial walks you through using Pig interactively (via SSH) on a subset of your data, which enables you to prototype your script quickly. The tutorial then takes you through uploading the script to Amazon S3 and running on a larger set of input data.
Posted in HadoopAndPig, HowTos | No Comments »
Wednesday, September 30th, 2009
ec2cluster is a Rails web console, including a REST API, that launches temporary Beowulf clusters on Amazon EC2 for parallel processing. You upload input data and code to Amazon S3, then submit a job request including how many nodes you want in your cluster. ec2cluster will spin up and configure a private Beowulf cluster, process the data in parallel across the nodes, upload the output results to an Amazon S3 bucket, and terminate the cluster when the job completes (termination is optional). ec2cluster is like Amazon Elastic MapReduce, except it uses MPI and REST instead of Hadoop and SOAP. The source code is also free for use in both personal and commercial projects, released under the BSD license.
Posted in HadoopAndPig, Open Source Projects | No Comments »
Monday, September 28th, 2009
Bringing Big Data to the Enterprise with Apache Hadoop
Posted in Big Guys, HadoopAndPig | No Comments »