Summary of "Bryan Cantrill from Joyent on Manta: internet-facing object storage facility that features compute"

Key Technological Concepts and Product Features

1. Historical Context of File Systems and Storage

Early file systems (1970s-1980s) were designed for single large disks.
As disk sizes plateaued, RAID (Redundant Array of Inexpensive Disks) was introduced to combine multiple smaller disks.
Traditional volume managers “lie” to the file system by presenting multiple disks as a single one, which causes manageability, performance, and consistency problems.
This split between file system and volume manager added complexity and hurt reliability, especially with logical volume managers (LVMs) such as VxVM and IBM LVM.

2. ZFS as a Revolutionary File System

Developed at Sun Microsystems to bridge the gap between file systems and volume management through tight integration.
Key features:
- Copy-on-write design.
- Always consistent on disk (no fsck needed).
- Checksumming of all blocks.
- Self-healing of mirrored data.
- Transactional rollbacks.
Designed with future technologies in mind, such as non-volatile main memory.
Battle-tested in enterprise storage environments, where it was hardened against complex failure modes, including firmware bugs.
Adoption has been limited by licensing and business factors despite its technical advantages.
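
The checksumming and self-healing features listed above can be illustrated with a toy model. The sketch below is not ZFS code: the `Mirror` class, its method names, and the choice of SHA-256 are assumptions made purely for illustration. It keeps a checksum apart from each block's data, verifies every copy on read, and repairs a bad copy from a good one.

```python
import hashlib


class Mirror:
    """Toy two-way mirror with per-block checksums (illustrative, not ZFS)."""

    def __init__(self, copies=2):
        # Each "disk" maps a block id to that block's bytes.
        self.disks = [{} for _ in range(copies)]
        # Checksums are kept apart from the data they describe.
        self.checksums = {}

    def write(self, block_id, data):
        self.checksums[block_id] = hashlib.sha256(data).hexdigest()
        for disk in self.disks:
            disk[block_id] = data

    def read(self, block_id):
        expected = self.checksums[block_id]
        good = None
        bad_disks = []
        for disk in self.disks:
            data = disk.get(block_id)
            if data is not None and hashlib.sha256(data).hexdigest() == expected:
                if good is None:
                    good = data
            else:
                bad_disks.append(disk)
        if good is None:
            raise IOError(f"block {block_id}: no copy matches its checksum")
        # Self-heal: overwrite every bad copy with the verified one.
        for disk in bad_disks:
            disk[block_id] = good
        return good


mirror = Mirror()
mirror.write(7, b"hello manta")
mirror.disks[0][7] = b"hellX manta"  # simulate silent corruption on one disk
recovered = mirror.read(7)           # detects the bad copy and repairs it
```

A plain mirror without checksums could not tell which of the two differing copies is the correct one; the independent checksum is what makes the repair possible.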

3. Virtualization Models

Hardware virtualization: The traditional method, dating back to the IBM System/360 era, which presents virtual hardware so that legacy software can run unmodified.
Platform-level virtualization: High-level and restrictive (e.g., Google App Engine).
OS-level virtualization (containers/zones): Lightweight, efficient model where multiple isolated environments share a single OS kernel.
Joyent’s SmartOS uses zones for OS-level virtualization, which the talk presents as robustly isolated and flexible to allocate and resize.

4. Manta: Object Storage with Compute

Joyent’s internet-facing object storage service built on ZFS and OS-level virtualization (zones).
Supports true hierarchical storage, unlike Amazon S3’s flat key-value model.
Enables users to run compute jobs directly where the data lives (compute-to-data model).
Jobs run inside zones on the same physical nodes as the data, avoiding expensive data movement.
Users can log into zones interactively (via mlogin) to debug or run arbitrary programs on stored objects.
Supports streaming of large objects, imposes no hard limit on object size, and handles big data workloads efficiently.
Enables parallelism by creating multiple zones to process many objects simultaneously.
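
To make the difference between a hierarchical namespace and a flat key space concrete, here is a small sketch. Both stores are hypothetical in-memory models, not the Manta or S3 APIs, and the paths are invented. In the flat model a "directory listing" is emulated by scanning keys for a prefix, while the hierarchical model treats directories as first-class entries that hold their children.

```python
# Flat model: one dictionary mapping full keys to values.
flat = {
    "logs/2013/06/a.log": b"",
    "logs/2013/06/b.log": b"",
    "logs/2013/07/c.log": b"",
    "images/logo.png": b"",
}


def flat_list(store, prefix):
    """Emulate a directory listing by scanning every key for a prefix."""
    children = set()
    for key in store:
        if key.startswith(prefix):
            # Keep only the path component directly under the prefix.
            children.add(key[len(prefix):].split("/", 1)[0])
    return sorted(children)


# Hierarchical model: directories are real entries that hold their children.
tree = {
    "logs": {
        "2013": {
            "06": {"a.log": b"", "b.log": b""},
            "07": {"c.log": b""},
        }
    },
    "images": {"logo.png": b""},
}


def tree_list(root, path):
    """Walk to the directory, then list its children directly."""
    node = root
    for part in path.strip("/").split("/"):
        node = node[part]
    return sorted(node)


flat_result = flat_list(flat, "logs/2013/")
tree_result = tree_list(tree, "logs/2013")
```

Both calls return the same names here; the difference is that the flat version has to touch every key in the store, whereas the hierarchical version only visits the directories along the path.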

5. Unix Philosophy and Manta

Leverages the Unix philosophy of simple tools that do one thing well and can be composed via pipelines.
Contrasts with heavyweight big data frameworks (e.g., Hadoop), which the talk argues abandon that simplicity.
Enables running classic Unix-style data processing pipelines (e.g., word count) directly on stored objects in parallel.
Reduces complexity and infrastructure overhead compared to traditional MapReduce clusters.
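
The word-count pipeline mentioned above has a map/reduce shape: one map task per object, followed by a reduce that merges the partial counts. The sketch below imitates that shape locally in Python. It does not call Manta; the object names and contents are made up, and a thread pool stands in for the per-object zones.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import re

# Hypothetical objects: name -> text content.
objects = {
    "/demo/stor/a.txt": "the quick brown fox",
    "/demo/stor/b.txt": "the lazy dog and the fox",
}


def map_phase(text):
    # Per-object step, similar in spirit to:
    #   tr A-Z a-z | tr -cs a-z '\n' | sort | uniq -c
    return Counter(re.findall(r"[a-z]+", text.lower()))


def reduce_phase(partials):
    """Merge the per-object counts into one total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Run one map task per object in parallel, then reduce the results.
with ThreadPoolExecutor() as pool:
    partial_counts = list(pool.map(map_phase, objects.values()))

totals = reduce_phase(partial_counts)
top = totals.most_common(2)
```

Because each map task only needs its own object, adding more objects adds more parallel tasks without changing the code; only the reduce step sees all of the partial results.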

6. Billing and Pricing Model

Storage is priced to be competitive with Amazon S3.
Compute charged by the second (unlike traditional cloud providers that often bill by the hour).
Billing implemented using Manta itself, running jobs to process usage logs and generate invoices.
Pricing is transparent and usage can be inspected via JSON logs.
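
The effect of per-second billing can be shown with a little arithmetic over a usage log. The JSON field names and the rate below are hypothetical, chosen only for illustration; they are not Joyent's actual log format or prices.

```python
import json
import math

# Hypothetical usage records, one JSON document per line.
usage_log = """\
{"job": "wordcount", "seconds": 95}
{"job": "thumbnails", "seconds": 3700}
"""

RATE_PER_HOUR = 0.36  # hypothetical price in dollars per compute hour


def cost_per_second(seconds):
    """Charge for exactly the seconds used."""
    return seconds * RATE_PER_HOUR / 3600


def cost_per_hour(seconds):
    """Hourly billing rounds every job up to a whole hour."""
    return math.ceil(seconds / 3600) * RATE_PER_HOUR


records = [json.loads(line) for line in usage_log.splitlines() if line]
by_second = sum(cost_per_second(r["seconds"]) for r in records)
by_hour = sum(cost_per_hour(r["seconds"]) for r in records)
```

With these made-up numbers the two jobs use 3,795 seconds in total. Per-second billing charges for just over one hour, while hourly rounding charges for three hours (one for the short job, two for the job that runs slightly over an hour).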

7. CAP Trade-offs in Manta

For writes, Manta chooses consistency and partition tolerance (CP): a write is acknowledged only after it is stored in multiple data centers.
For reads, Manta chooses availability and partition tolerance (AP), allowing reads from any available data center.
Contrasts with S3 as described in the talk, which could return success before replication completed, leading to eventual-consistency issues.
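
The write and read policies described above can be sketched with a toy replicated store. The `DataCenter` and `ReplicatedStore` classes, the two-copy requirement, and the error type are assumptions for illustration, not Manta's implementation.

```python
class DataCenter:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.objects = {}


class ReplicatedStore:
    def __init__(self, datacenters, copies=2):
        self.datacenters = datacenters
        self.copies = copies

    def put(self, key, value):
        """Acknowledge only if `copies` data centers can store the object."""
        targets = [dc for dc in self.datacenters if dc.up][: self.copies]
        if len(targets) < self.copies:
            # Choosing consistency: refuse the write rather than under-replicate.
            raise RuntimeError("not enough data centers reachable; write rejected")
        for dc in targets:
            dc.objects[key] = value
        return [dc.name for dc in targets]

    def get(self, key):
        """Serve the read from any reachable data center holding the object."""
        for dc in self.datacenters:
            if dc.up and key in dc.objects:
                return dc.objects[key]
        raise KeyError(key)


east, west, south = DataCenter("east"), DataCenter("west"), DataCenter("south")
store = ReplicatedStore([east, west, south])

stored_in = store.put("/demo/stor/report.txt", b"q2 numbers")
east.up = False                              # one data center becomes unreachable
value = store.get("/demo/stor/report.txt")   # still readable from another copy

west.up = False                              # only one data center is left
try:
    store.put("/demo/stor/new.txt", b"data")
    write_rejected = False
except RuntimeError:
    write_rejected = True
```

The asymmetry is the point of the sketch: a read succeeds as long as any one copy is reachable, while a write fails as soon as the required number of copies cannot be guaranteed.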

8. Use Cases and Applications

Log processing is a natural fit.
Video analytics and scientific computing (e.g., bioinformatics, proteomics) are actively supported and growing.
Enables interactive debugging of large core dumps and crash dumps, stored indefinitely for analysis.
Encourages creativity by providing a general-purpose compute platform co-located with data.

9. Limitations and Constraints

The default resource allocation per job is 1 GB of RAM and 8 GB of storage; larger allocations can be requested and are billed accordingly.
Large reducers may require more RAM; parallelism is achieved by splitting data into multiple objects.
Designed for workloads where compute-to-data is efficient; very long-running compute jobs are better suited for traditional infrastructure-as-a-service (IaaS).

Guides, Tutorials, and Demonstrations

Live demonstrations of Manta commands such as mls, mget, mfind, and mjob create showing:
- Listing hierarchical directories and objects.
- Retrieving ancient Unix man pages stored as objects.
- Running parallel jobs on multiple objects (e.g., word frequency count).
- Logging into a zone running on the same machine as the data for interactive debugging.
Explanation of building data processing pipelines on Manta using Unix-style command-line tools.
Discussion on interpreting billing logs and usage data stored and processed within Manta itself.

Main Speakers / Sources

Bryan Cantrill: Former Sun Microsystems engineer and co-creator of DTrace, at Joyent at the time of the talk.
Mark Cavage: Senior engineer formerly at Amazon Web Services (AWS), contributed key insights to Manta’s design.
Other Joyent Engineers: Including Josh PL (creator of mlogin) and Dave P (lead engineer on Manta).

Summary

The talk covers the evolution of storage technology, architectural innovations behind ZFS, the rationale for OS-level virtualization, and the design and capabilities of Manta as a next-generation object storage system. Manta integrates compute directly with data, enabling new paradigms in big data processing and cloud infrastructure.