IMPORTANT: This assignment may be done in pairs, and you are strongly encouraged to do so. Late assignments are penalized at 33.333% per day, pro-rated. Late assignments are not accepted after 2 days. In addition, your program must compile without warnings or errors, and you are not permitted to change compiler options or add directives in the code to disable warnings.
In this assignment you will be implementing a distributed transaction management system using two-phase commit. You will also use vector timestamps and logging so that you can visualize the key steps in a transaction. See assignment 2 for a description of how to handle vector timestamps and use ShiViz. You are allowed to reuse your vector clock implementation from A2.
You are required to write 3 programs as part of this assignment:
Recall that both the workers and the transaction manager keep log files to record their decisions. In addition, the worker records transaction information for the objects it modifies. Consequently, one of the things you will want to do is build a transaction logging system. To simplify marking, you will implement a roll-back transaction system where the "real object" is modified and the old values are recorded in the log file. The sample code illustrates how to name and create a transaction log file. Although each worker and the transaction manager will write to their own log file, they must use a common transaction subsystem implementation (i.e. you are to have only one transaction implementation regardless of whether it is used by a worker or the transaction manager). The precise information stored in the log file is a design decision, but keep in mind that a log file could have multiple sets of transaction-related records in it and that you must also deal with the case when a worker or transaction manager crashes and then comes back to life. In such a situation you have to ensure that the vector clock of the restarted process starts where the clock of the crashed one left off. (Hint: Store the vector clock at the very start of the log file. Each time you change the vector clock, seek to the start of the file and write the clocks out or, better yet, map the start of the file into memory using mmap, then update the memory and sync it. You might also want to open the file a second time in append mode so that you write the other transaction records to the end of the file.)
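The mmap hint can be sketched roughly as follows. This is only an illustration under stated assumptions: the names struct log_header, struct clock_entry, log_header_map and clock_tick are hypothetical, the header layout is a design decision left to you, and a node_id of 0 is assumed to mark an unused slot (0 is never a valid port number).

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define MAX_NODES 10 /* 1 transaction manager + at most 9 workers */

/* Hypothetical header layout kept at offset 0 of the log file. */
struct clock_entry { uint32_t node_id; uint32_t time; };
struct log_header { struct clock_entry clock[MAX_NODES]; };

/* Open (or create) the log file and map its header into memory.
 * A newly created file is extended with zero bytes, so every slot starts
 * out unused. Returns NULL on failure. */
struct log_header *log_header_map(const char *path, int *fd_out)
{
    assert(path != NULL && fd_out != NULL);
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) < 0 ||
        (st.st_size < (off_t)sizeof(struct log_header) &&
         ftruncate(fd, sizeof(struct log_header)) < 0)) {
        close(fd);
        return NULL;
    }

    void *p = mmap(NULL, sizeof(struct log_header), PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return NULL;
    }
    *fd_out = fd;
    return p;
}

/* Increment the entry for node_id (claiming the first unused slot if the
 * node is not present yet) and force the header out to disk.
 * Returns 0 on success, -1 if the table is full or msync fails. */
int clock_tick(struct log_header *h, uint32_t node_id)
{
    assert(h != NULL && node_id != 0);
    for (int i = 0; i < MAX_NODES; i++) {
        if (h->clock[i].node_id == node_id || h->clock[i].node_id == 0) {
            h->clock[i].node_id = node_id;
            h->clock[i].time++;
            return msync(h, sizeof *h, MS_SYNC);
        }
    }
    return -1;
}
```

After a restart, calling log_header_map on the same file gives back the clock values that were last synced. The transaction records themselves could then be written through a second descriptor opened with open(path, O_WRONLY | O_APPEND), as the hint suggests.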
The transaction manager is responsible for coordinating one or more simultaneous transactions. For this assignment you may assume that there are at most 4 transactions in progress at once and that the total number of workers, across all transactions, is no more than 9. (This implies that there will be no more than 10 vector clocks. As in assignment 2, you can use the port number as the node number. As described below, the worker will have two UDP ports; you should probably use the command port as the node ID for the vector clocks.) The transaction manager takes a single argument, the port it is to listen on and send from. As in assignment 2, you are to use UDP as the transport protocol and do not have to confirm whether a packet is received.
To simulate the modification of objects involved in a transaction, the worker maintains state information representing two integer objects, A and B, and a string identifier. Since the objects have to be durable, this state information is stored in a file and needs to be updated each time an object's value is changed. The format of the state information can be found in the tworker.h file. The IDstring can be anything you want. The fields A and B will be updated by commands, and the vectorClock and lastUpdateTime fields are changed every time either A or B is updated. Keep this in mind when you record information in the transaction log, as both the vectorClock and lastUpdateTime will need to revert to the values they had at the start of the transaction. The type and format of the messages exchanged between the workers and the transaction manager are left as a design decision. However, if a message requires a response, you are to use a timeout value of 10 seconds. After 10 seconds, the entity waiting for a response can assume the other side has crashed and perform whatever the appropriate action is at that point. If a worker is ever in the "uncertain" state and needs to get a decision from the coordinator, it should wait 30 seconds for a decision after it has sent its vote that it is prepared to commit, and then ask for the decision once every 10 seconds until it gets a result.
The worker program takes a single argument, the UDP port number it will listen on for the commands described below. Note that you will need to use a second UDP port to communicate with the transaction manager. You should probably let the system select this port number when appropriate. There is one important thing to remember: if the worker crashes, then when it comes back up it must use this same port to interact with the transaction manager. This means that the worker will need to record the port number; it also means that the transaction manager needs to keep track of the contact information for each worker. A suggestion would be to record, at the start of the transaction log file along with the vector clocks, the port number to use for the transaction manager/worker interactions. When a worker is restarted after a crash, it must be restarted with the same command port number on the command line that it was originally started with. Note that since the workers don't maintain the contact information for the other workers, when they restart in the uncertain state they will have to poll the transaction manager until a response is received. This could be a long time if the manager is down.
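Letting the system select the second port, and reusing the recorded port after a crash, might look like the sketch below. The function name open_manager_socket is hypothetical; binding to port 0 asks the system for a free port, and getsockname reports which port was actually assigned so that it can be recorded.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create the UDP socket used to talk to the transaction manager.
 * wanted_port == 0 : first start, let the system choose a free port.
 * wanted_port != 0 : restart after a crash, reuse the recorded port.
 * On success returns the socket and stores the bound port in *port_out;
 * returns -1 on failure. */
int open_manager_socket(uint16_t wanted_port, uint16_t *port_out)
{
    assert(port_out != NULL);
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    if (sock < 0)
        return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(wanted_port);

    socklen_t len = sizeof addr;
    if (bind(sock, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        getsockname(sock, (struct sockaddr *)&addr, &len) < 0) {
        close(sock);
        return -1;
    }
    *port_out = ntohs(addr.sin_port); /* the value to record in the log file */
    return sock;
}
```

On a fresh start the worker would call open_manager_socket(0, &port) and record port; on a restart it would read the recorded value and pass it as wanted_port.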
A big challenge with testing the transaction management system is demonstrating that it works. The purpose of the cmd program is to help simplify this task. This program interprets the arguments supplied on the command line and sends a "command" to the identified worker. The worker then performs the action or actions specified by the command. These commands can be used to simulate various types of interactions between the transaction manager and the workers. Only your worker process needs to accept and respond to these commands. You can assume that the UDP packet is delivered and acted on by the worker. There is no response message from the worker to the cmd program. The commands are as follows:
A definition of an object store is provided in the file tworker.h. The example code in tworker.c illustrates how to create the backing store for this object and how to modify it. You are not allowed to change this structure, how the name of the file it is stored in is determined, or where in the file it is stored. Every time you change one of the 3 objects, you must update the lastUpdateTime field and the vector clock as part of the change. The program dumpObject has been provided to print the contents of one of these object stores.
You will probably want to develop some scripts for testing your implementation. You should add these scripts to your repo and commit them.
As indicated earlier you need to log events to a ShiViz log file for debugging and visualization purposes. Each "node" is to keep its own event log file such that they can be combined and used by ShiViz. Clearly you will want to log the sending and receiving of messages between transaction manager and workers. You will probably also want the workers to log events that result in changes to the objects. Key events like the starting, stopping (crashing), and/or restarting of the worker or transaction manager are also good events to log.
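As a rough illustration of producing one event entry, the sketch below formats an event description on one line followed by a line with the host name and a JSON-style vector clock. This two-line layout, the "N<port>" host naming, and the names format_event and struct clock_entry are all assumptions made for the example; use whatever log format assignment 2 specified.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_NODES 10 /* 1 transaction manager + at most 9 workers */

/* Hypothetical in-memory vector clock entry; node IDs are port numbers. */
struct clock_entry { unsigned node_id; unsigned time; };

/* Format one event as "<event>\nN<self> {"N<id>": <time>, ...}\n" into out.
 * Returns the number of characters written (excluding the terminating NUL),
 * or -1 if the entry does not fit in cap bytes. */
int format_event(char *out, size_t cap, unsigned self, const char *event,
                 const struct clock_entry *clock, int n)
{
    assert(out != NULL && event != NULL && n >= 0 && n <= MAX_NODES);
    int used = snprintf(out, cap, "%s\nN%u {", event, self);

    for (int i = 0; i < n && used >= 0 && (size_t)used < cap; i++) {
        int w = snprintf(out + used, cap - (size_t)used, "%s\"N%u\": %u",
                         i > 0 ? ", " : "", clock[i].node_id, clock[i].time);
        if (w < 0)
            return -1;
        used += w;
    }
    if (used < 0 || (size_t)used >= cap)
        return -1; /* truncated before the clock could be closed */

    int w = snprintf(out + used, cap - (size_t)used, "}\n");
    if (w < 0 || (size_t)(used + w) >= cap)
        return -1;
    return used + w;
}
```

The resulting buffer could be appended to the node's event log with fputs followed by fflush, so that an entry is not lost if the process is crashed shortly afterwards.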
As part of this assignment you are to submit two example log files that you submitted to ShiViz. (They are to be named ShiViz-Log.dat1 and ShiViz-log.dat2.) One of the runs will show a normal transaction completion with no errors, and the other must show a situation where a worker fails at some point but the transaction commits. In addition to the coordinator, there must be at least 4 workers in the system. In the file ShiViz-Report.txt, provide a brief description of each scenario highlighting key events (e.g. when something crashed, when a recovery starts, when a recovery decision is reached, messages involved in the recovery, etc.). Be clear as to which file each description applies to. The write-up is to be in clear, grammatically correct English (make sure to run a spell-checker on the submission) of between 200 and 300 words. To give you a sense of size, this paragraph is about 90 words.
All work is to be handed in via stash. Do not, under any circumstances, hand in object code, executable files, or any other form of binary file, and that includes Word or PDF documents. Make sure you hand in: