Distributed File Systems

Introduction

General Characteristics of Distributed File Systems

Basis to evaluate and compare

Network transparency: same access operation as local files

Location transparency: file name should not reveal its location

Location independence: file name should not be changed when its physical location changes

User mobility: access to file from anywhere

Fault tolerance

Scalability

File mobility: move files from one place to another in a running system

Design Considerations

Name space

Uniform name space: each client uses the same path name to access a file

Non-uniform: each client mounts the file system to arbitrary location

Stateful or stateless operation

Stateful: server maintains information about client operations such as open mode, file offset, etc

Stateless: the client must send self-contained requests to the server

Stateful is fast but consistency/crash recovery is difficult

Semantics of sharing

Unix semantics: changes made to a file by one client are visible to another client on its next read/write

Session semantics: consistency at the granularity of open/close (see the sketch at the end of this list)

Even weaker semantics: periodically enforce consistency

Remote access methods: issues about how actively server should participate in the file operation (such as notification of cache inconsistency, ...)

Just an agent to do what client requests?
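
To make the semantics difference concrete, here is a minimal toy model (not from any real system; all class and method names are invented) in which a session-semantics client works on a private copy and publishes it only at close:

    # Toy model contrasting Unix semantics and session semantics.
    class Server:
        def __init__(self):
            self.data = {}                      # path -> bytes (the shared copy)

    class UnixClient:
        """Every read/write goes straight to the shared copy."""
        def __init__(self, server): self.srv = server
        def write(self, path, data): self.srv.data[path] = data
        def read(self, path): return self.srv.data.get(path, b"")

    class SessionClient:
        """Works on a private copy; changes are published only at close()."""
        def __init__(self, server): self.srv, self.copies = server, {}
        def open(self, path): self.copies[path] = self.srv.data.get(path, b"")
        def write(self, path, data): self.copies[path] = data
        def read(self, path): return self.copies[path]
        def close(self, path): self.srv.data[path] = self.copies.pop(path)

    srv = Server()
    u1, u2 = UnixClient(srv), UnixClient(srv)
    u1.write("/f", b"v1")
    print(u2.read("/f"))          # b"v1" -- visible on the very next read

    s1, s2 = SessionClient(srv), SessionClient(srv)
    s1.open("/f"); s2.open("/f")
    s1.write("/f", b"v2")
    print(s2.read("/f"))          # b"v1" -- s2 still sees its session's copy
    s1.close("/f"); s2.open("/f")
    print(s2.read("/f"))          # b"v2" -- visible only after close + re-open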

NFS

Introduction

Client-server model

Use RPC: Operates in synchronous mode (see RPC note)

Application in client machine issues a file operation ->

NFS client sends the corresponding req. ->

The client application is blocked ->

Server handles the req. and replies ->

Client receives the response ->

Client application is unblocked

User Perspective

Server can export the entire file system or a subtree

Client mounts the exported file system (or subtree) or any subtree of the file system. Two types of mount operation:

Hard-mount: the client keeps retrying the request until it gets a response

Soft-mount: the client gives up after a while and returns an error
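
A rough sketch of the behavioural difference (illustrative only; send_request, the timeout, and the retry count are made-up placeholders):

    import time

    class Timeout(Exception): pass

    def send_request(req, timeout):
        """Placeholder for one RPC attempt; raises Timeout if no reply arrives."""
        raise Timeout()                        # pretend the server is down

    def hard_mount_call(req, timeout=1.0):
        while True:                            # retry forever until a reply arrives
            try:
                return send_request(req, timeout)
            except Timeout:
                time.sleep(timeout)            # back off, then try again

    def soft_mount_call(req, timeout=1.0, retries=3):
        for _ in range(retries):               # give up after a few attempts
            try:
                return send_request(req, timeout)
            except Timeout:
                time.sleep(timeout)
        return "EIO"                           # return an error to the application

    print(soft_mount_call("READ"))             # prints EIO after ~3 seconds
    # hard_mount_call("READ") would loop until the server answers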

Design Goals

Simple recovery mechanism

Transparent access to remote files

Unix file system semantics

Transport-independent

NFS Components

NFS Protocol: GETATTR, SETATTR, LOOKUP, READLINK, READ, WRITE, CREATE, REMOVE, RENAME, LINK, SYMLINK, MKDIR, RMDIR, READDIR, STATFS

Mount Protocol: MNT, DUMP, UMNT, UMNTALL, EXPORT

RPC: each request is sent as an RPC packet

XDR (External Data Representation): a machine-independent data encoding method (see the sketch at the end of this list)

Daemon processes

server: nfsd daemons & mountd

client: biod
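
As a sketch of what XDR encoding looks like on the wire (it relies only on the standard rules that items are big-endian and padded to 4-byte boundaries; the helper names are mine):

    import struct

    def xdr_uint(n):
        # XDR encodes integers as 4 bytes, big-endian
        return struct.pack(">I", n)

    def xdr_string(s):
        # XDR strings: 4-byte length, then the bytes, zero-padded to a 4-byte boundary
        data = s.encode()
        pad = (4 - len(data) % 4) % 4
        return xdr_uint(len(data)) + data + b"\x00" * pad

    print(xdr_uint(42).hex())          # 0000002a
    print(xdr_string("abcde").hex())   # length 5, the bytes, 3 bytes of zero padding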

Statelessness

No open/close method

Every request has full information such as file offset, number of bytes, ...

Easy crash recovery: nothing needs to be done

Major problem of statelessness:

Each update req. from a client is a self-contained, independent update to the file system

-> The server must commit every modification to stable storage--apply the update to disk--before replying to the request

<- A system crash can lose data even on a local file system, but in that case users are aware of the crash and the possibility of data loss. This is not true for a distributed file system--for example, a server crash-and-reboot is indistinguishable from a slow server

The Protocol Suite

External Data Representation (XDR)

Remote Procedure Calls (RPC)

NFS uses Sun RPC -> Sun RPC supports only synchronous communication -> synchronous nature of NFS

RPC message

xid: unique id of the message

direction: request or reply

rpc-version (req. only)

prog & vers (req. only): service program and its version to call

proc (req.only): function within the service program

reply_stat & accept_stat (reply only)

authentication info

procedure specific arguments (req. only)

procedure specific return values (reply only)
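
A sketch of how the request-side fields above are laid out in an ONC RPC call message, using AUTH_NONE credentials and arbitrary example values (the NFS program/version/procedure constants follow the standard assignments, but treat the details as illustrative):

    import struct

    def u32(n): return struct.pack(">I", n)      # XDR unsigned int: 4 bytes, big-endian

    CALL, RPC_VERSION = 0, 2
    NFS_PROGRAM, NFS_VERSION, NFSPROC_GETATTR = 100003, 2, 1
    AUTH_NONE = 0

    def rpc_call_header(xid, prog, vers, proc):
        return b"".join([
            u32(xid),                # xid: unique id, used to match the reply
            u32(CALL),               # direction: 0 = call (request)
            u32(RPC_VERSION),        # rpc-version (always 2)
            u32(prog), u32(vers),    # prog & vers: which service program to call
            u32(proc),               # proc: function within the service program
            u32(AUTH_NONE), u32(0),  # credential: flavor + zero-length body
            u32(AUTH_NONE), u32(0),  # verifier:  flavor + zero-length body
        ])                           # procedure-specific arguments follow

    hdr = rpc_call_header(xid=0x1234, prog=NFS_PROGRAM, vers=NFS_VERSION,
                          proc=NFSPROC_GETATTR)
    print(len(hdr), hdr.hex())       # 40 bytes of header before the arguments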

NFS Implementation

Control Flow

Application in client machine issues a file operation with vnode ->

The corresponding vnode virtual function in turn calls file-system-specific code, in this case the NFS client function ->

NFS client sends the corresponding req. ->

The client application is blocked

Server gets the req. & maps it to the local vnode ->

Server invokes local vnode operation ->

Server replies ->

Client receives the response ->

At client side, remaining vnode operations are executed ->

Client application is resumed

File Handles

When a client requests LOOKUP, CREATE, or MKDIR, the server replies with a unique file handle, whose components are usually:

unique file system id + inode number + generation number

The client uses the file handle for subsequent requests
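
A toy illustration of how such a handle can be built and later rejected when the generation number no longer matches (all names and numbers are invented):

    from collections import namedtuple

    # Opaque-to-the-client handle the server hands back from LOOKUP/CREATE/MKDIR
    FileHandle = namedtuple("FileHandle", "fsid inode generation")

    class Inode:
        def __init__(self, number, generation):
            self.number, self.generation = number, generation

    table = {7: Inode(7, generation=3)}     # toy inode table for fsid 1

    def lookup(name_to_inode, name):
        ino = name_to_inode[name]
        return FileHandle(fsid=1, inode=ino.number, generation=ino.generation)

    def nfs_getattr(fh):
        ino = table.get(fh.inode)
        # The generation number catches handles that outlived their file:
        # if the inode was freed and reused, the generations no longer match
        if ino is None or ino.generation != fh.generation:
            return "ESTALE"
        return {"inode": ino.number}

    fh = lookup({"/export/a": table[7]}, "/export/a")
    print(nfs_getattr(fh))                  # {'inode': 7}
    table[7] = Inode(7, generation=4)       # file deleted, inode number reused
    print(nfs_getattr(fh))                  # ESTALE -- the old handle is now invalid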

Mount Operation

Application calls mount ->

new vfs structure allocated -> nfs_mount() called:

Client sends an RPC req. to mountd with the name of the dir ->

mountd accepts the req. & checks permissions ->

mountd replies with the file handle of the dir ->

Client adds server addr to vfs structure ->

Client allocates vnode and rnode (NFS version of inode) for the root dir of the mounted file system

Pathname Lookup

Control flow

Application calls open, creat, or stat ->

NFS client sends a LOOKUP, CREATE, or MKDIR req. ->

Server returns file handle

Single pathname may cause several pathname lookup RPCs

Why not send the full pathname in a single pathname lookup RPC?

The client would need to figure out where the NFS boundary (mount point) falls in a long pathname

The server would need to understand Unix pathnames -> hurts OS independence
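
A toy loop showing why a single pathname turns into several LOOKUP RPCs--one per component, each taking the parent directory's file handle (the namespace and handle strings are made up):

    def lookup_rpc(server, dir_fh, name):
        """One LOOKUP request: (directory handle, component name) -> child handle."""
        return server[(dir_fh, name)]

    def resolve(server, root_fh, path):
        fh = root_fh
        for component in path.strip("/").split("/"):
            fh = lookup_rpc(server, fh, component)   # one RPC per component
        return fh

    # made-up server namespace: (parent handle, name) -> child handle
    server = {("fh-root", "usr"): "fh-usr",
              ("fh-usr", "local"): "fh-local",
              ("fh-local", "bin"): "fh-bin"}

    print(resolve(server, "fh-root", "/usr/local/bin"))   # 3 LOOKUPs -> fh-bin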

UNIX Semantics

Open File Permissions

In Unix, for example:

A process opens a file with read-write mode

Later the owner of the file changes the file to read-only

The process can still read from and write to the file until it closes the file

In NFS, the process can only read from the file, because each read/write request is checked against the current file permissions

Deletion of Open Files

In Unix, if 

A process opens a file

Another process deletes the file later

The former can still access the file

In NFS, the former will get a stale file handle error after the deletion

Reads and Writes

In Unix, each read/write is atomic: vnode and inode are locked

In NFS, a single read/write may span several RPC calls (the max data size per RPC packet is 8KB)

Modern NFS provides the Network Lock Manager (NLM) to lock an entire file or a portion of it. But these are advisory locks!
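
For illustration, an advisory lock request through the usual fcntl interface (on an NFS mount such a request would be forwarded to the lock manager); the path is made up, and nothing stops a process that never asks for the lock:

    import fcntl, os

    path = "/tmp/demo.lock"                    # made-up path; could be on an NFS mount
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)

    # Lock bytes 0-99 exclusively. This is only advisory: another process can
    # still read() or write() the file unless it also calls lockf() and blocks.
    fcntl.lockf(fd, fcntl.LOCK_EX, 100, 0, os.SEEK_SET)

    os.write(fd, b"protected region")
    fcntl.lockf(fd, fcntl.LOCK_UN, 100, 0, os.SEEK_SET)
    os.close(fd)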

NFS Performance

Performance Bottlenecks

Any NFS request that modifies the file system (data or metadata) is extremely slow, because every update must be committed to stable storage

"ls -l" ends up issuing a bunch of RPC calls (a stat--i.e., GETATTR--for each file in the dir). On a local file system, the inode cache would absorb these lookups

RPC retransmissions can aggravate network congestion

Client-Side Caching

Client caches both file blocks and attributes

Cache coherence problem

Attributes: an expiration (60 sec) mechanism is used

File blocks

if the cached attributes have not expired, check the file's last modification time against the time at which the block was cached
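
A minimal sketch of that check, with an invented CachedFile structure and a stubbed GETATTR call (the 60-second lifetime follows the note above; everything else is assumption):

    import time

    ATTR_TTL = 60                                  # attribute cache lifetime (sec)

    class CachedFile:
        def __init__(self, attrs, data):
            self.attrs = attrs                     # includes the server's mtime
            self.attrs_fetched_at = time.time()
            self.data = data
            self.data_cached_mtime = attrs["mtime"]

        def read(self, getattr_rpc):
            # 1. refresh attributes if the cached copy has expired
            if time.time() - self.attrs_fetched_at > ATTR_TTL:
                self.attrs = getattr_rpc()
                self.attrs_fetched_at = time.time()
            # 2. cached blocks are valid only if the file has not been
            #    modified since they were cached
            if self.attrs["mtime"] != self.data_cached_mtime:
                return None                        # caller must re-READ from the server
            return self.data

    cf = CachedFile({"mtime": 100}, b"cached block")
    print(cf.read(lambda: {"mtime": 100}))         # b'cached block'
    cf.attrs_fetched_at -= 120                     # pretend 2 minutes passed
    print(cf.read(lambda: {"mtime": 101}))         # None -> must refetch the data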

Deferral of Writes

Client side: deferring writes is acceptable because the user knows about a crash of the client machine (just as with a local file system)

Asynchronous write (= issue the req. but do not wait for the reply) for full blocks

Delayed write (= do not issue the WRITE req. until close or the next 30-sec sync) for partial blocks

Server side: use non-volatile (battery-backed) RAM so that not every update has to be committed to disk immediately
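
A client-side sketch of the two policies (the block size, method names, and flush-at-close behaviour are simplifications of my own):

    BLOCK = 8192                     # assumed block size for the sketch

    class WriteCache:
        def __init__(self):
            self.dirty = []          # partial blocks waiting for close / 30-sec sync

        def write(self, offset, data, async_write_rpc):
            if len(data) == BLOCK:
                # full block: asynchronous write -- send the WRITE now,
                # but do not wait for the reply
                async_write_rpc(offset, data)
            else:
                # partial block: delayed write -- keep it dirty and send it
                # at close() or at the next periodic (e.g. 30-sec) sync
                self.dirty.append((offset, data))

        def close(self, write_rpc):
            for offset, data in self.dirty:
                write_rpc(offset, data)
            self.dirty.clear()

    wc = WriteCache()
    wc.write(0, b"x" * BLOCK, lambda o, d: print("async WRITE", o, len(d)))
    wc.write(BLOCK, b"tail", lambda o, d: None)      # stays dirty
    wc.close(lambda o, d: print("flush WRITE", o, len(d)))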

Retransmissions Cache

Nonidempotent req. can cause problems, for example:

REMOVE req.

the file removed

the reply gets lost

REMOVE retransmitted

an error ("no such file") is returned even though the remove actually succeeded

Handling retransmissions also hurts server performance a lot

=> Retransmission cache introduced at the server side

All recent reqs. are cached with a state (in progress or done)

Duplicates of a req in the in-progress state are ignored

Duplicates of a req in the done state:

recent duplicates are ignored

non-recent idempotent req - process it again

non-recent non-idempotent req - reply OK or process it again, depending on the last modification time of the file
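
A simplified sketch of such a cache keyed by xid (the "recent" window, the entry layout, and the handling of old non-idempotent duplicates are deliberately simplified):

    import time

    RECENT = 5                                       # "recent" window, in seconds

    class RetransmissionCache:
        def __init__(self):
            self.entries = {}                        # xid -> (state, reply, finished_at)

        def handle(self, xid, idempotent, execute):
            entry = self.entries.get(xid)
            if entry:
                state, reply, finished_at = entry
                if state == "in progress":
                    return None                      # duplicate of a pending req: ignore
                if time.time() - finished_at < RECENT:
                    return None                      # recent duplicate of a done req: ignore
                if not idempotent:
                    # old non-idempotent duplicate: just reply OK again
                    # (a real server would also consult the file's modification time)
                    return reply
                # old idempotent duplicate falls through and is processed again
            self.entries[xid] = ("in progress", None, None)
            reply = execute()
            self.entries[xid] = ("done", reply, time.time())
            return reply

    rc = RetransmissionCache()
    print(rc.handle(1, idempotent=False, execute=lambda: "OK (file removed)"))
    print(rc.handle(1, idempotent=False, execute=lambda: "ENOENT"))   # duplicate: ignored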

NFS Security

NFS Access Control

The server specifies, in an export list, the machines that can mount the file system or a subdirectory

-> Anyone on a machine listed in the export list can mount it

(In the case of AUTH_UNIX authentication) NFS requests are access-controlled based on uid and gid

-> requires a flat (uid, gid) space across all the machines in an NFS domain

If I have root privilege on my workstation, I can impersonate John by creating a user with John's uid on my workstation

Also NFS requests are not encrypted

UID Remapping

Map the same (uid, gid) from different machines to different (uid, gid)s on the server

In the appendix of the Kerberos paper, they suggest using Kerberos to set up this remapping

When a user mounts a remote NFS file system, she submits her Kerberos ticket

Mount daemon decrypts the ticket and finds a mapping from (client-ip, uid at client machine) to (uid at server)

Later on, client uses the uid at the server side

When the user does umount or logs out, the mapping gets deleted

This scheme is still vulnerable to eavesdropping while the user logs in

No mainstream NFS implementation uses UID remapping; they use secure RPC with AUTH_DES or AUTH_KERB instead

Root Remapping

Remap root from client machine to nobody
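
A toy remapping table combining both ideas (the uid values and the "unknown maps to nobody" rule are my own illustration, not the scheme from the paper):

    NOBODY = 65534                                  # conventional uid for "nobody"

    class UidMap:
        """Toy per-mount remapping table: (client ip, client uid) -> server uid."""
        def __init__(self):
            self.table = {}

        def add(self, client_ip, client_uid, server_uid):    # set up at mount time
            self.table[(client_ip, client_uid)] = server_uid

        def remove_mount(self, client_ip):                    # torn down at umount/logout
            self.table = {k: v for k, v in self.table.items() if k[0] != client_ip}

        def map(self, client_ip, client_uid):
            if client_uid == 0:
                return NOBODY                       # root remapping: root -> nobody
            return self.table.get((client_ip, client_uid), NOBODY)

    m = UidMap()
    m.add("10.0.0.5", 1001, 532)       # established after the Kerberos ticket checks out
    print(m.map("10.0.0.5", 1001))     # 532
    print(m.map("10.0.0.5", 0))        # 65534 -- root is remapped to nobody
    print(m.map("10.0.0.9", 1001))     # 65534 -- unknown client, no mapping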

AFS

Design Goals and Characteristics

Scalable to thousands of users

Unix compatible - Unix binaries should work on an AFS client without modification

Uniform name space

User mobility

Location independence - a file's name should not change when the file is moved to another machine

Fault-tolerant - a server or network error should be confined as much as possible so that AFS service can still be provided

Performance should be comparable to a time-sharing system

Scalable Architecture

Issues on network configuration and load of servers

Clustered network configuration

A cluster is composed of one server and multiple clients

Each server mainly serves clients in its cluster

Clients can still be served by a server in another cluster, at reduced speed

Network can be dynamically reconfigured

Load of servers

Aggressive caching at the client side to reduce network and server load

Storage and Name Space Organization

The collection of AFS servers together holds the shared files

Files are organized in volumes

One more level of indirection on top of a disk partition

Lots of freedom

volumes can be freely moved among partitions and machines without changing the volume name

volume-based backup

Ex. volume per user

Uniform name space

Each file is referenced by fid:

volume id

inode number within the volume

generation number - the same inode number is reused over time, with the generation number incremented each time

Volume location DB

mapping between volume id and the location of the volume (see the sketch at the end of this subsection)

replicated to each server

Clients cache files and may also keep purely local files, for performance reasons or convenience
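
A tiny sketch of the indirection that the fid and the volume location DB provide (volume ids and server names are invented):

    from collections import namedtuple

    # AFS-style fid: where the file lives is *not* part of the name
    Fid = namedtuple("Fid", "volume_id vnode generation")

    # volume location DB, replicated to every server: volume id -> server
    vldb = {17: "serverA", 42: "serverB"}

    def locate(fid):
        return vldb[fid.volume_id]

    f = Fid(volume_id=17, vnode=93, generation=2)
    print(locate(f))            # serverA

    # moving the volume to another machine only updates the VLDB;
    # the fid (and hence every path name) stays the same
    vldb[17] = "serverC"
    print(locate(f))            # serverC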

Session Semantics

AFS only guarantees session semantics for data - consistency at open and close

AFS provides stronger semantics for meta data (file attributes)

Metadata updates propagate to the server immediately

Implementation

Caching and Consistency

The client caches the whole file (or 64KB chunks, if the file is bigger than 64KB) on its local disk

Client caches file attributes in memory

The client serves reads and writes from the cache

Modified data is flushed back to the server when the file is closed

Coherence

When a client fetches a file from the server, the server gives out a callback--a certificate that the cached copy is valid

If another client modifies the file and sends the update to the server, the server notifies the first client that its certificate is broken

What if a network failure occurs and the server's breaking of the certificate is not delivered to the client?

Client periodically probes the server for each callback
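
A toy model of callback promises and callback breaking (class names and the bookkeeping are invented; the periodic probing mentioned above is left out for brevity):

    class AfsServer:
        def __init__(self):
            self.data = {}                       # fid -> contents
            self.callbacks = {}                  # fid -> set of clients holding a promise

        def fetch(self, fid, client):
            # handing out the file comes with a callback promise to this client
            self.callbacks.setdefault(fid, set()).add(client)
            return self.data.get(fid, b"")

        def store(self, fid, data, writer):
            self.data[fid] = data
            # break the callback of every *other* client caching this file
            for c in self.callbacks.get(fid, set()) - {writer}:
                c.break_callback(fid)
            self.callbacks[fid] = {writer}

    class AfsClient:
        def __init__(self, name): self.name, self.valid = name, set()
        def open(self, srv, fid):
            data = srv.fetch(fid, self); self.valid.add(fid); return data
        def break_callback(self, fid):
            self.valid.discard(fid)              # cached copy may no longer be used

    srv = AfsServer()
    a, b = AfsClient("a"), AfsClient("b")
    a.open(srv, "fid1"); b.open(srv, "fid1")
    srv.store("fid1", b"new", writer=a)          # a closes an updated copy
    print("fid1" in b.valid)                     # False -- b's callback was broken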

Pathname Lookup

Pathname lookup--translation of a pathname into an fid--is done by the client

A directory is just another file, so if the client does not have a directory that appears in the pathname, it fetches the directory from the server and proceeds

Contact the server to get the location of the volume, if necessary

The server is presented only with fids, which makes server operations very fast

Security

The AFS server does not trust clients; hence it uses Kerberos authentication

Also, there is an ACL for each directory, but not for individual files

To access a file, a user must pass both the AFS ACL check and the Unix permission check

AFS Shortcomings

The server is fairly lightly loaded, but the client becomes heavily loaded

Session semantics are not good enough

Higher chance of the close call failing, since that is when modified data is written back to the server

Programmers are lax about checking the return status of close

Spritely NFS

An attempt to make NFS stateful
Recognizes the performance problem of statelessness & write-through
Recognizes that NFS does not guarantee consistency