A distributed file system is a method of storing and accessing files based on a client/server architecture. One or more central servers store files that can be accessed, with proper authorization rights, by any number of remote clients in the network. Much as an operating system organizes files in a hierarchical file management system, the distributed system uses a uniform naming convention and a mapping scheme to keep track of where files are located. When a client retrieves a file from the server, the file appears as a normal file on the client machine, and the user can work with it in the same way as if it were stored locally on the workstation. When the user finishes working with the file, it is returned over the network to the server, which stores the now-altered file for later retrieval. Distributed file systems are advantageous because they make it easier to distribute documents to multiple clients and because they provide centralized storage, so that client machines need not use their own resources to store files. NFS from Sun Microsystems and DFS from Microsoft are examples of distributed file systems.

General Terms: design, reliability, performance, measurement

Keywords: clustered storage, data storage, fault tolerance, scalability
During much of the 1980s, people who wanted to share files used a method known as sneakernet: files were shared by copying them onto floppy disks, physically carrying the disks to another computer, and copying them again. As computer networks evolved in the 1980s, it became evident that the old file systems had many limitations that made them unsuitable for multiuser environments. At first, many users turned to FTP to share files. Although this method avoided the time-consuming physical movement of removable media, files still had to be copied twice: once from the source computer onto a server, and a second time from the server onto the destination computer. Additionally, users had to know the network addresses of every computer involved in the file-sharing process. As computer companies tried to address these shortcomings, entirely new systems such as Sun Microsystems' Network File System (NFS) were developed, and new features such as file locking were added to existing file systems. The new systems such as NFS were not replacements for the old file systems, but an additional layer between the disk file system and the user processes.
Summary of possible features of a distributed file system:
A. Concurrent updates: The file systems of the 1970s were developed for centralized computer systems, where data was accessed by only one user at a time. When multiple users and processes needed to access files at the same time, the notion of locking was introduced. There are two kinds of locks: read and write. Since two read operations do not conflict, an attempt to set a read lock on an object that already carries a read lock always succeeds, but an attempt to set a write lock on an object with a read or write lock does not. There are two philosophies concerning locks: mandatory locking and advisory locking. In mandatory systems, operations actually fail if a conflicting lock is set; in advisory systems, it is up to the programs to decide whether the lock is honored.

B. Transparency: An early stated goal for some of the new distributed file systems was to present the same interface as the old file systems to processes. This service is known as access transparency and has the advantage that existing stand-alone systems can be migrated to networks relatively easily and old programs need not be modified. Location transparency makes it possible for clients to access the file system without knowing where the physical files are located; for example, paths can be retained even when servers are physically moved or replaced. Scaling transparency is the ability to grow the system incrementally without having to change the structure of the system or its applications. The larger the system, the more important scaling transparency becomes.

C. Distributed storage and replication: One possibility in distributed systems is to spread the physical storage of data across many locations. The main advantage of replication is increased dependability: multiple copies increase the availability of the data by making it possible to share the load, and they enhance reliability by enabling clients to access other locations when one has failed. One important requirement for replication is replication transparency: users should not need to be aware that their data is stored in multiple locations. Another use of distributed storage is the possibility of disconnecting a computer from the network while continuing to work on some files; when the computer is reconnected, the system automatically propagates the necessary changes to the other locations.

D. Security: In a multiuser environment, a desirable feature is to allow access to data only to authorized users. Users must be authenticated, and only requests for files to which a user has access rights should be granted. Most distributed file systems implement some kind of access control list. Additional protection mechanisms may include signing protocol messages with digital signatures and encrypting the data.
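The read/write lock compatibility rules described in (A) can be sketched as a small in-memory lock table. This is a minimal illustration, not any real DFS lock server; the LockManager class and its API are invented here, and it models advisory semantics (callers must consult it voluntarily):

```python
class LockManager:
    """Toy read/write lock table: two read locks are compatible,
    but a write lock conflicts with any other lock."""

    def __init__(self):
        self.locks = {}  # object name -> list of (owner, mode)

    def acquire(self, obj, owner, mode):
        """mode is 'read' or 'write'. Returns True on success."""
        held = self.locks.setdefault(obj, [])
        for other_owner, other_mode in held:
            if other_owner == owner:
                continue  # an owner's own locks never conflict
            # Read-read is the only compatible pair.
            if mode == "write" or other_mode == "write":
                return False
        held.append((owner, mode))
        return True

    def release(self, obj, owner):
        self.locks[obj] = [(o, m) for o, m in self.locks.get(obj, [])
                           if o != owner]


mgr = LockManager()
assert mgr.acquire("f", "alice", "read")       # read lock succeeds
assert mgr.acquire("f", "bob", "read")         # second read is compatible
assert not mgr.acquire("f", "carol", "write")  # write conflicts with reads
mgr.release("f", "alice")
mgr.release("f", "bob")
assert mgr.acquire("f", "carol", "write")      # now the write succeeds
```

In a mandatory-locking system, the file system itself would reject conflicting operations; with advisory locks as above, a program that never calls the manager can still write the file.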
We begin by examining the basic abstraction realized by file systems, and proceed to develop a taxonomy of issues in their design. Section 2.2 then traces the origin and development of distributed file systems until the middle of the current decade, when the systems described in Section 3 came into use. A sizeable body of empirical data on file usage properties is available to us today. Section 2.3 summarizes these observations and shows how they have influenced the design of distributed file systems.
2.1. Basic Issues
Permanent storage is a fundamental abstraction in computing. It consists of a named set of objects that come into existence by explicit creation, are immune to temporary failures of the system, and persist until explicitly destroyed. The naming structure, the characteristics of the objects, and the set of operations associated with them characterize a specific refinement of the basic abstraction. A file system is one such refinement.

From the perspective of file system design, computing models can be classified into four levels. The set of design issues at any level subsumes those at lower levels. Consequently, the implementation of a file system for a higher level has to be more sophisticated than one that is adequate for a lower level.

At the lowest level, exemplified by IBM PC-DOS and the Apple Macintosh, one user at a single site performs computations via a single process. A file system for this model must address four key issues: the naming structure of the file system, the application programming interface, the mapping of the file system abstraction onto physical storage media, and the integrity of the file system across power, hardware, media, and software failures.

The next level, exemplified by OS/2, involves a single user computing with multiple processes at one site. Concurrency control is now an important consideration at the programming interface and in the implementation of the file system. The survey by Bernstein and Goodman treats this issue in depth.

The classic timesharing model, where multiple users share data and resources, constitutes the third level of the taxonomy. Mechanisms to specify and enforce security now become important. UNIX is the archetype of a timesharing file system.

Distributed file systems constitute the highest level of the taxonomy. Here multiple users who are physically dispersed in a network of autonomous computers share in the use of a common file system.
A useful way to view such a system is to think of it as a distributed implementation of the timesharing file system abstraction. The challenge is in realizing this abstraction in an efficient, secure, and robust manner. In addition, the issues of file location and availability assume significance. The simplest approach to file location is to embed location information in names. Examples of this approach can be found in the Newcastle Connection, Cedar, and VAX/VMS. But the static binding of name to location makes it inconvenient to move files between sites. It also requires users to remember machine names, a difficult feat in a large distributed environment. A better approach is location transparency, where the name of a file is devoid of location information and an explicit file location mechanism dynamically maps file names to storage sites. Availability is of special significance because the usage site of data can be different from its storage site; hence failure modes are substantially more complex in a distributed environment. Replication, the basic technique used to achieve high availability, introduces complications of its own: since multiple copies of a file are present, changes have to be propagated to all the replicas, and such propagation has to be done in a consistent and efficient manner.
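The contrast between embedding location in names and using a dynamic mapping can be sketched as follows. All names, paths, and the LocationService API below are invented for illustration; a real location mechanism would be a replicated network service, not an in-process dictionary:

```python
class LocationService:
    """Toy dynamic name-to-site map illustrating location transparency:
    clients use location-free names, and only the map knows the sites."""

    def __init__(self):
        self.map = {}  # location-free file name -> storage site

    def register(self, name, site):
        self.map[name] = site

    def locate(self, name):
        return self.map[name]


svc = LocationService()
svc.register("/dfs/report.txt", "serverA")
assert svc.locate("/dfs/report.txt") == "serverA"

# Moving the file between sites changes only the mapping; the name
# seen by clients (and embedded in programs) stays the same. With
# location embedded in names (e.g. "serverA:/report.txt"), the move
# would instead break every stored path.
svc.register("/dfs/report.txt", "serverB")
assert svc.locate("/dfs/report.txt") == "serverB"
```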
User-initiated file transfer was the earliest form of remote file access. Although inconvenient and limited in functionality, it served as an important mechanism for sharing data in the early days of distributed computing. IFS on the Alto personal computers and the Datanet file repository on the Arpanet exemplify this approach. A major step in the evolution of distributed file systems was the recognition that access to remote files could be made to resemble access to local files. This property, called network transparency, implies that any operation that can be performed on a local file may also be performed on a remote file. The extent to which an actual implementation meets this ideal is an important measure of quality. The Newcastle Connection and Cocanet are two early examples of systems that provided network transparency. In both cases the name of the remote site was a prefix of a remote file name. The decade from 1975 to 1985 saw a profusion of experimental file systems. Svobodova examines many of these in her comparative survey. Systems such as Felix, XDFS, Alpine, Swallow, and Amoeba [45, 46] explored the issues of atomic transactions and concurrency control on remote files. The Cambridge file system and the CMU-CFS file system examined how the naming structure of a distributed file system could be separated from its function as a permanent storage repository. The latter also addressed access control, caching, and transparent file migration onto archival media. Cedar was the first file system to practically demonstrate the viability of caching entire files. Many of its design decisions were motivated by its intended application as a base for program development. Locus [56, 88] was a landmark system in two important ways. First, it identified location transparency as an important design criterion. Second, it proposed replication, along with a mechanism for detecting inconsistency, to achieve high availability.
Locus also provided support for atomic transactions on files and generalized the notion of transparent remote access to all aspects of the operating system. Weighted voting, an alternative way of using replication for availability, was demonstrated in Violet [21, 22]. The rapid decline of CPU and memory costs motivated research on workstations without local disks or other permanent storage media. In such a system, a disk server exports a low-level interface that emulates local disk operations. Diskless operation has been successfully demonstrated in systems such as V and RVD. Lazowska et al. present an in-depth analysis of the performance of diskless workstations. Since diskless operation impacts autonomy, scalability, availability, and security, it has to be viewed as a fundamental design constraint. It remains to be seen whether these considerations, together with continuing improvements in disk technology, will eventually outweigh the cost benefits of diskless operation. Distributed file systems are in widespread use today. Section 3 describes the most prominent of these systems. Each major vendor now supports a distributed file system, and users often view it as an indispensable component. But the process of evolution is far from complete. As elaborated in Section 5, the next decade is likely to see significant improvements in the functionality, usability, and performance of distributed file systems.
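The quorum-intersection idea behind weighted voting, mentioned above in connection with Violet, can be sketched as a configuration check. Site names and vote weights below are invented; the two conditions checked are the classic ones from Gifford's scheme: the read quorum r and write quorum w must satisfy r + w > total votes (every read overlaps every write) and 2w > total votes (no two writes proceed concurrently):

```python
def quorum_ok(votes, r, w):
    """Return True if (r, w) is a valid weighted-voting configuration:
    read/write quorums overlap, and write quorums mutually overlap."""
    total = sum(votes.values())
    return r + w > total and 2 * w > total


# Hypothetical replica sites with assigned vote weights.
votes = {"siteA": 2, "siteB": 1, "siteC": 1}   # total = 4 votes

assert quorum_ok(votes, r=2, w=3)      # 2 + 3 > 4 and 6 > 4: valid
assert not quorum_ok(votes, r=1, w=2)  # 1 + 2 = 3, quorums need not overlap
```

Raising r makes reads more expensive but lets w shrink, and vice versa, which is how weighted voting trades read availability against write availability.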
2.3. Empirical Observations
A substantial amount of empirical investigation in the classic scientific mold has been done on file systems. The results of this work have been used to guide high-level design as well as to determine values of system parameters. For example, data on file sizes has been used in the efficient mapping of files to disk storage blocks. Information on the frequency of different file operations and the degree of read and write-sharing of files has influenced the design of caching algorithms. Type-specific file reference information has been useful in file placement and in the design of replication mechanisms. Empirical work on file systems involves many practical difficulties. The instrumentation usually requires modifications to the operating system. In addition, it has to impact system performance minimally. The total volume of data generated is usually large, and needs to be stored and processed efficiently. In addition to the difficulty of collecting data, there are two basic concerns about its interpretation. Generality is one of these concerns. How specific are the observations to the system being observed? Data of widespread applicability is obviously of most value. Independent investigations have been made of a variety of academic and research environments. The systems examined include IBM MVS [60, 79, 76], DEC PDP-10 [66, 67], and UNIX [50, 17, 18, 39]. Although these studies differ in their details, there is substantial overlap in the set of issues they investigate. Further, their results do not exhibit any serious contradictions. We thus have confidence in our understanding of file system characteristics in academic and research environments. Unfortunately there is little publicly available information from other kinds of environments. The second concern relates to the interdependency of design and empirical observations. Are the observed properties an artifact of existing system design or are they intrinsic? 
Little is known about the influence of system design on file properties, although the existence of such influence is undeniable. For example, in a design that uses whole-file transfer, there is a substantial disincentive to the creation of very large files. In the long run this may affect the observed file size distribution. It is therefore important to revalidate our understanding of file properties as new systems are built and existing systems mature. Studies of file systems fall into two broad categories. Early studies [60, 79, 76, 66] were based on static analysis, using one or more snapshots of a file system; the data from these studies is unweighted by frequency of file usage. Later studies [67, 50, 17, 18, 39] are based on dynamic analysis, using continuous monitoring of a file system; these data are weighted by frequency of file usage. Although these studies have all been done on timesharing file systems, their results are assumed to hold for distributed file systems. This is based on the premise that user behavior and programming environment characteristics are the primary factors influencing file properties. A further assumption is that neither of these factors changes significantly in moving to a distributed environment. No studies have yet been done to validate these assumptions. The most consistent observation in all the studies is the skewing of file sizes toward the low end. In other words, most files are small, typically in the neighborhood of 10 kilobytes. Another common observation is that read operations on files are much more frequent than write operations. Random accessing of a file is rare. A typical application program sequentially reads an entire file into its address space and then performs nonsequential processing on the in-memory data. A related observation is that a file is usually read in its entirety once it has been opened. Averaged over all the files in a system, data appears to be highly mutable.
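The static (snapshot) style of analysis described above can be sketched in a few lines: walk a file tree once and summarize the size distribution. The function name and the choice of root directory are arbitrary here, and a real study would sample many machines over time; a median far below the mean is the skew toward small files that the studies report:

```python
import os
import statistics

def file_size_stats(root):
    """One unweighted snapshot of file sizes under root, in the spirit
    of the static-analysis studies (a sketch, not a rigorous study)."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                sizes.append(os.path.getsize(os.path.join(dirpath, name)))
            except OSError:
                pass  # skip files that vanish or are unreadable
    return sizes

sizes = file_size_stats("/usr/lib")  # arbitrary example root
if sizes:
    print("files:", len(sizes))
    print("median size:", statistics.median(sizes))
    print("mean size:", round(statistics.mean(sizes)))
```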
The functional lifetime of a file, defined as the time interval between the most recent read and the most recent write, is skewed toward the low end. In other words, data in files tends to be overwritten often. Although the mean functional lifetime is small, the tail of the distribution is long, indicating the existence of files with long-lived data. Most files are read and written by one user. When users share a file, it is usually the case that only one of them modifies it. Fine-granularity read-write sharing of files is rare. It is important to emphasize that these observations derive from research or academic environments; an environment with large collaborative projects or one that makes extensive use of databases may show substantially greater write-sharing of data. File references show substantial temporal locality of reference: if a file is referenced, there is a high probability it will be referenced again in the near future, and over short periods of time the set of referenced files is a very small subset of all files. The characteristics described above apply to the file population as a whole. If one were to focus on files of a specific type, their properties may differ significantly. For example, system programs tend to be stable and rarely modified, so the average functional lifetime of system programs is much larger than the average over all files. Temporary files, on the other hand, show substantially shorter lifetimes.