CS239-1 Homework Assignment #2 - April 19, 2007

Homework is due at the beginning of class, Thursday, April 26, 2007

(26 points) The Conquest file system actually consisted of two sub-file systems. There was a file system that held all files' metadata and the complete data for all files smaller than 1 Mbyte in persistent RAM, and a simple disk file system that stored the data for all files larger than 1 Mbyte in a format optimized for sequential access to the data. The assumption was that large files would be accessed almost entirely sequentially, which could be done very efficiently from this file system. Assume metadata operations take a characteristically different time than small file accesses, due to a somewhat different path through the file system code.

We wish to design a model for a generator to create a workload for the Conquest file system. We have obtained a file system trace and below characterize it by the accesses to metadata, the accesses to small files, the sequential accesses to large files (defined as either accessing the first byte of a large file, or the byte immediately after the last byte of the file that was accessed), and the non-sequential accesses to large files (defined as any other pattern of access to a large file than that defined as sequential access). The table below shows how often each type of operation was performed and how often each other operation immediately followed it.

M = metadata (100,000 total operations) SF = small file (500,000 total operations) SLF = sequential access to large files (225,000 total operations) NLF = non-sequential access to large files (50,000 total operations)

Next operation operation M SF SLF NLF M 31,500 65,000 3,000 500 SF 68,000 419,000 10,000 2,000 SLF 400 5,000 208,000 11,600 NLF 99 10,000 4,000 35,900

a) Create a model for this access pattern based on the probability of each kind of operation occurring. (6 points)

b) Design a Markov model for this access pattern. Show both the chart and the state transition diagram. (10 points)

c) Is a Markov model a better model for this workload, or should we use the purely probabilistic model? Why? Is there any piece of information about the system that it would be helpful to know in making this decision? (10 points)

(30 points, 3 points for each part) For each of the following cases, what sort of workload would you suggest? Describe the type of workload (trace, benchmark, generator, live test) and the kinds of operations that would be included in that workload.

A. A laptop computer being considered for use by a group of salesmen.

B. A machine intended as the web server for a moderate sized (but rapidly growing) Internet commerce site.

C. A distributed system that detects the spread of worms by analyzing the number of connections from unique sites initiated per minute.

D. A router to handle traffic internally for a large company. The router is expected to be connected to 2-3 incoming lines at T3 speeds or greater, and will fan out traffic to around two dozen internal local area networks.

E. Determining which of several alternate implementations of TCP will provide the best service for a set of machines that move large scientific data sets across the Internet.

F. A browser plugin that provides users with a color-coded indication of the probability that a particular web site is being used for phishing.

G. A virus checking program designed to scan an entire disk.

H. A virus checking program that looks for viruses in email attachments.

I. A new cryptographic algorithm that purports to be faster than AES.

J. Deciding whether a company will achieve reasonable performance if it upgrades from Windows XP to Vista.

(30 points) For each of the following situations, suggest the parameter or parameters that should be used to characterize the workload. Indicate whether you can characterize these parameters as just averages, averages plus variation, histograms for each parameter, or multiparameter histograms. Don't worry too much about the other elements of the experiment (such as topology of test systems or measurement issues), except to the extent they touch on the above issues.

A. A multipath routing algorithm finds several paths from source to destination in a network by proportionally dividing traffic based the load on the paths, with a goal of reducing the average delay for a packet. Background traffic in the network is considered by another mechanism. How should we represent the workload for the source-to-destination traffic? (5 points)

B. A distributed hash table implementation of DNS is to be tested in a wide-area environment for both reliability and speed. What parameters should we choose and how should we characterize each? (5 points)

C. A new piece of buffer management code has been added to Linux, with the goal of reducing the average time to allocate buffers for both file system operations and network sends and receives. (5 points)

D. A new algorithm for deciding when to spin down the disk in a laptop computer has been proposed.

i). What parameters would you choose to test whether this algorithm saves more battery power than the existing algorithm? How would you characterize them? (5 points)

ii). What parameters would you choose to test whether this algorithm has a different effect on the performance of applications on the machine? How would you characterize them? (5 points)

iii). Describe why you should use the same or different parameters for these two cases. (5 points)

(14 points) A network intrusion detection appliance observes packets in a local area network and analyzes them to discover possible attacks. The system is designed to run on a typical Ethernet that is connected to the Internet through a firewall and that hosts several dozen machines. How would you design a workload to test this system:

1). If you were evaluating the system for an article to appear in an industry magazine? (7 points)

2). If you were evaluating the system to be installed in your own office? (7 points)