A comparison of approaches to large-scale data analysis

Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, and Michael Stonebraker
Brown University, UW-Madison, Yale University, Microsoft, and MIT

One-line Summary

MapReduce is a simple programming model for processing massive data sets on large clusters with superior fault tolerance, but some workloads still require transactional operations, so the parallel DBMS will survive as an execution platform.

Overview/Main Points

Background
- How did we get here (MapReduce, parallel DBMS, distributed DBMS, etc.)?
- What does it mean for us?
- Using a DB
  - Understand data by defining schemas
  - Write SQL queries
  - Load data and deal with problems such as type mismatches against the schema
  - Process data / queries only through the DBMS
- Why did MapReduce appear?
  - Hardware is inexpensive
  - DBMSs are expensive
  - Complex computations at Google
    - "stitch satellite images"
    - "generate inverted index"
    - "process road segments"
  - Various data sources with different and complex types
- "MapReduce borrows many key ideas from parallel database systems including the use of partitioned data sets and the use of hashing to redistribute records with identical key values to the same node for subsequent processing."
- Parallel DBMS
  - Poor fault tolerance: if one node fails, the query fails.
  - Not "elastic": hard to add or remove nodes in a running system; scales to hundreds of machines in a shared-nothing configuration.
  - Transactions
  - Efficient due to the relational data model
    - Reduces I/O by 1) requesting only the relevant tables using the data schema; 2) compression
    - Indexes
    - Query optimization
  - Higher-level language (SQL)
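
The hash-based redistribution mentioned in the quote above can be illustrated with a minimal sketch. This is not code from the paper or from Hadoop; the function names `partition` and `redistribute`, the choice of CRC32 as the hash, and the sample records are all assumptions made for illustration.

```python
import zlib
from collections import defaultdict


def partition(key: str, num_nodes: int) -> int:
    """Pick a destination node for a key using a deterministic hash.

    CRC32 is an arbitrary choice here; any stable hash would do.
    """
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def redistribute(records, num_nodes):
    """Group (key, value) records by the node that their key hashes to."""
    nodes = defaultdict(list)
    for key, value in records:
        nodes[partition(key, num_nodes)].append((key, value))
    return nodes


# Hypothetical sample records: (key, value) pairs.
records = [("a.com", 1), ("b.com", 2), ("a.com", 3), ("c.com", 4), ("b.com", 5)]
nodes = redistribute(records, num_nodes=4)

for node_id in sorted(nodes):
    print(node_id, nodes[node_id])
```

Because the node is a pure function of the key, every record with the same key lands on the same node, which is what lets a later step (a reduce, or a join or aggregation in a parallel DBMS) run locally on that node.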
Trends between MapReduce and DB community
- Hadoop moves towards Parallel DBMS
  - Hive
- Parallel DBMS moves towards MapReduce
  - elasticity: Amazon cloud DB
  - M/R
Streaming DBMS
- Data feeds flow through fixed standard queries
- Application: Twitter
- Twitter Storm
  - no schema
  - user defined programming
  - elastic
  - fault-tolerant
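
As a rough illustration of data flowing through a fixed query, the sketch below applies one unchanging filter-and-count query to records as they arrive. It is plain Python and does not use the Storm API; `standing_query`, the record fields `user` and `text`, and the sample feed are hypothetical.

```python
from collections import Counter


def standing_query(stream, predicate, key_fn):
    """Run one fixed query over an unbounded stream of records.

    For each arriving record that satisfies `predicate`, increment the
    running count for its key and emit the updated (key, count) pair.
    """
    counts = Counter()
    for record in stream:
        if predicate(record):
            key = key_fn(record)
            counts[key] += 1
            yield key, counts[key]


# Hypothetical feed of schema-less records (plain dicts).
feed = [
    {"user": "alice", "text": "loving #db"},
    {"user": "bob", "text": "lunch"},
    {"user": "alice", "text": "more #db"},
    {"user": "bob", "text": "#db too"},
]

updates = list(
    standing_query(feed, lambda r: "#db" in r["text"], lambda r: r["user"])
)
print(updates)
```

The query is defined once and the data moves past it, which is the reverse of a conventional DBMS where the data is stored and queries come and go.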
Experiment setup
- Platforms
  - 100-node Hadoop cluster, version 0.19.0, Java 6
  - DBMS-X: shared-nothing row store, hash-partitioned, stored and indexed
  - Vertica: column store
- Grep Task
  - Find 3-byte pattern in 100-byte record; 1 match per 10,000 records
  - Data set
    - 10-byte unique key, 90-byte value
    - 1TB spread across 25, 50, or 100 nodes
    - 10 billion records
- Analysis Task: simple web processing schema
  - 600k HTML Documents (6GB per node)
  - 155 million UserVisits records (20GB per node)
  - 18 million Rankings records (1GB per node)
- Aggregation Task: simple query to find adRevenue by IP prefix (scaleup)
  - SELECT SUBSTR(sourceIP, 1, 7), SUM(adRevenue) FROM UserVisits GROUP BY SUBSTR(sourceIP, 1, 7)
- UDF Task: one iteration of simplified PageRank
- Paper reports selection (w/ index) and join tasks: parallel DBMSs outperform Hadoop
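
For comparison with the SQL form of the aggregation task, here is a minimal single-process sketch of how the same computation (ad revenue summed per 7-character source-IP prefix) might be phrased as map and reduce functions. It is not the benchmark code from the paper; `map_visit`, `reduce_revenue`, `run`, and the sample visits are assumptions for illustration, and the shuffle is simulated with an in-memory dictionary.

```python
from collections import defaultdict


def map_visit(visit):
    """Map: emit (IP prefix, revenue); the slice mirrors SUBSTR(sourceIP, 1, 7)."""
    source_ip, ad_revenue = visit
    yield source_ip[:7], ad_revenue


def reduce_revenue(prefix, revenues):
    """Reduce: sum all revenue values seen for one prefix."""
    return prefix, sum(revenues)


def run(visits):
    """Simulate map, shuffle (group by key), and reduce in one process."""
    groups = defaultdict(list)
    for visit in visits:
        for key, value in map_visit(visit):
            groups[key].append(value)
    return dict(reduce_revenue(k, vs) for k, vs in groups.items())


# Hypothetical sample data: (sourceIP, adRevenue) pairs.
visits = [("158.112.27.3", 10), ("158.112.90.1", 5), ("201.14.5.77", 2)]
totals = run(visits)
print(totals)
```

The SQL version states the grouping and the sum in one declarative line, while the MapReduce version spells out the same two steps as user code, which is the trade-off between a higher-level language and user-defined programming noted earlier.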

Relevance

Flaws