White Paper: Fault-Tolerant Control of a Distributed Database System

An optimal control policy based on state information is considered for a distributed database system subject to server failures. Fault tolerance is made possible by the partitioned architecture of the system and the data redundancy within it. Control actions include restoring lost data sets on a single server from redundant data sets on the remaining servers, routing queries to intact servers, and overhauling the entire system for renewal. Control policies are determined by solving Markov decision problems with cost criteria that penalize system unavailability and slow query response.

January 27, 2011

The first objective of this paper is to justify that the control policy applied in the aforementioned study is optimal in a well-defined sense. To that end, a Markov decision problem is formulated, and the solution that minimizes the total expected discounted cost is sought. For illustration, a simple problem that disregards the query states is set up, for which the previously developed policy is confirmed to be optimal.
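
To make the formulation concrete, the following is a minimal sketch of a discounted-cost Markov decision problem solved by value iteration. It is not the model from the paper: the state (the number of failed servers), the three actions (continue routing, restore one server, overhaul), the failure probability, and every cost figure are hypothetical values chosen only for illustration. Query states are disregarded, as in the simple problem described above.

```python
import numpy as np

# Hypothetical action labels; the paper's exact action set may differ.
CONTINUE, RESTORE, OVERHAUL = 0, 1, 2


def build_mdp(n_servers=3, p_fail=0.1, c_slow=1.0, c_down=20.0,
              c_restore=3.0, c_overhaul=15.0):
    """Build transition tensor P[a, s, s'] and cost matrix C[a, s].

    State s is the number of failed servers; s == n_servers means the
    system is unavailable. All parameter values are assumptions.
    """
    n_states = n_servers + 1
    P = np.zeros((3, n_states, n_states))
    C = np.zeros((3, n_states))
    for s in range(n_states):
        # Running cost: slow response grows with failures; full outage is worst.
        running = c_down if s == n_servers else c_slow * s
        worse = min(s + 1, n_servers)

        # Continue: route queries to intact servers; one more may fail.
        P[CONTINUE, s, worse] += p_fail
        P[CONTINUE, s, s] += 1.0 - p_fail
        C[CONTINUE, s] = running

        # Restore: rebuild one failed server from redundant data.
        if s == 0:
            P[RESTORE, s] = P[CONTINUE, s]  # nothing to restore
        else:
            P[RESTORE, s, s - 1] += 1.0 - p_fail
            P[RESTORE, s, s] += p_fail      # a new failure offsets the repair
        C[RESTORE, s] = running + c_restore

        # Overhaul: renew the entire system.
        P[OVERHAUL, s, 0] = 1.0
        C[OVERHAUL, s] = running + c_overhaul
    return P, C


def value_iteration(P, C, beta=0.95, tol=1e-9, max_iter=10_000):
    """Minimize total expected discounted cost; return (V, policy)."""
    V = np.zeros(P.shape[1])
    for _ in range(max_iter):
        Q = C + beta * (P @ V)  # Q[a, s]: cost of action a in state s
        V_new = Q.min(axis=0)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    Q = C + beta * (P @ V)
    return V, Q.argmin(axis=0)


if __name__ == "__main__":
    P, C = build_mdp()
    V, policy = value_iteration(P, C)
    names = ["continue", "restore", "overhaul"]
    for s, a in enumerate(policy):
        print(f"failed servers = {s}: {names[a]} (cost-to-go {V[s]:.2f})")
```

Because the discount factor is below one, the Bellman operator is a contraction and the iteration converges to the unique optimal cost-to-go. The policy it returns depends entirely on the assumed costs and failure probability, so the sketch illustrates the solution method rather than the paper's result.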
