January 04, 2010 12:00 AM

Designing Reliable Workflow Services

Handling service exceptions using .NET 4 Workflow Services and Windows Application Server Extensions
DevProConnections
InstantDoc ID #124937

Exceptions happen. Workflow Services, implemented using .NET 4 and hosted in the Windows Application Server Extensions (code-named "Dublin"), offer three options for dealing with exceptions: a workflow can stop running immediately, shut down gracefully, or roll back to a last-known good state. This article goes beyond the structured exception handling offered by the TryCatch activity to demonstrate what can be achieved by a combination of workflow activities, workflow design, and Dublin configuration.

What’s Reliable?
Reliability entails handling or recovering from exceptions generated within the workflow logic as well as faults generated by interactions with systems external to the workflow. When the exception is caused by the internal logic, it might be acceptable to simply discard any work performed thus far and return to a last-known good state on the assumption that retrying the operation may result in success. However, with external systems you will typically want to clean up the work already done in order to return the entire system to its previous good state. For example, a graceful cleanup may entail deleting inserted records in a database, making calls to external services that "undo" completed work, or otherwise restoring the external system to the state it was in before the action occurred (e.g., if a file was created, delete it).
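The file example can be sketched in a few lines. This is a minimal illustration in Python of the cleanup idea only, not Workflow Services code; the names run_with_cleanup, create_file, and failing_step are hypothetical. Each step that touches an external system returns an undo action, and when a later step fails, the collected undo actions run in reverse order.

```python
import os
import tempfile


def run_with_cleanup(steps):
    """Run steps in order; each step returns an undo callable (or None).

    If a step raises, run the undo callables collected so far in reverse
    order (best effort), then re-raise the original error.
    """
    undo_stack = []
    try:
        for step in steps:
            undo = step()
            if undo is not None:
                undo_stack.append(undo)
    except Exception:
        for undo in reversed(undo_stack):
            try:
                undo()
            except Exception:
                pass  # best effort: a failed undo must not mask the original error
        raise


def create_file(path):
    """Hypothetical step: create a file and return the action that deletes it."""
    def step():
        with open(path, "w") as handle:
            handle.write("work in progress")
        return lambda: os.remove(path)
    return step


def failing_step():
    """Hypothetical step standing in for a call to an unavailable system."""
    raise RuntimeError("external system unavailable")


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "order.txt")
    try:
        run_with_cleanup([create_file(path), failing_step])
    except RuntimeError:
        print("file still exists:", os.path.exists(path))
```

The undo actions are deliberately best effort: if one of them fails, the remaining ones still run and the original error is the one reported.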

Upon encountering an exception, the workflow service host runtime can treat the error as fatal and simply quit the instance, resulting in a terminated workflow. If the error might be recoverable (e.g., by retrying the operation), the in-memory instance of the workflow can be discarded, and the workflow is considered aborted. The runtime will automatically resume execution of the aborted workflow from the point at which it was last persisted. Alternatively, an aborted workflow instance can be suspended, remaining persisted until a user explicitly resumes the instance through the IIS Manager. Sometimes the error is not recoverable, but additional logic is required to return the system to a consistent state; a canceled workflow provides the opportunity to execute this cleanup logic. On top of these options, workflow also supports the notion of compensation, which allows one to define best-effort “undo” logic that runs automatically when the workflow encounters an unhandled exception.

Configuring Reliable Workflows in Dublin
Through the IIS Manager, a running workflow service instance that is configured for persistence in Dublin can be suspended, terminated, or canceled. One way to access these options within the IIS Manager is to select an application, double-click the Dashboard feature, and then click the Persisted WF Instances feature hyperlink. Right-clicking an item in the listing for a running workflow instance will produce a menu similar to that shown in Figure 1. By selecting terminate, suspend, or cancel, the user is able to force the workflow instance to take the selected action.

Dublin can be configured to automatically take one of these actions for any workflow instance that throws an unhandled exception. Configuring Dublin to react in this way is the cornerstone of reliable workflow services and is performed while enabling an application for persistence. Within IIS Manager, right-click an application and choose .NET 4 WCF and WF, Configuration, then select the Workflow Persistence tab. On that tab, check the Enable Application Persistence check box to enable the Advanced… button. Clicking this button displays the dialog shown in Figure 2. Take notice of the Action on unhandled exception drop-down, as this will play a critical role in reliability.

Figure 3 summarizes the states (cancel, terminate, or abort) that a workflow instance can enter in response to some form of error condition. There are three kinds of trigger: a user performing an action through IIS Manager as described previously (UI), an exception that has gone unhandled and bubbled all the way to the top (unhandled exception), or a particular activity. Cancellation allows cleanup logic to be defined, whereas an aborted workflow allows the instance to be restarted from a last-known good checkpoint, either automatically or, if the instance was suspended, manually. A canceled workflow also allows for logic to undo work previously done in the face of an unhandled exception via a process known as implicit compensation. A terminated workflow is one that halts execution immediately and does not allow for cleanup or returning to a previously known state. Let's consider the ramifications to workflow service reliability from entering each of these states.
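Before looking at each state in turn, the differences can be written down as a small lookup. This is only a sketch in Python of how this article describes the three states; it is not derived from Figure 3 and does not correspond to any Dublin setting. The function suggest_outcome and its two flags are hypothetical, and its ordering (retry first when both flags are set) is an assumption.

```python
from enum import Enum


class Outcome(Enum):
    ABORTED = "aborted"
    CANCELED = "canceled"
    TERMINATED = "terminated"


# What each state permits, as described in the article's prose.
PERMITS = {
    Outcome.ABORTED: {"cleanup_logic": False, "restart_from_checkpoint": True},
    Outcome.CANCELED: {"cleanup_logic": True, "restart_from_checkpoint": False},
    Outcome.TERMINATED: {"cleanup_logic": False, "restart_from_checkpoint": False},
}


def suggest_outcome(retry_may_succeed, needs_cleanup):
    """Hypothetical rule of thumb for picking a state for an unhandled error.

    Assumption: when both flags are true, a retry is attempted first.
    """
    if retry_may_succeed:
        return Outcome.ABORTED
    if needs_cleanup:
        return Outcome.CANCELED
    return Outcome.TERMINATED
```

For example, a call that failed because a remote service was briefly unreachable maps to ABORTED, while a non-recoverable error that left records behind in a database maps to CANCELED.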

Aborting Workflows
A workflow can be aborted either manually through IIS Manager or automatically by setting the Action on unhandled exception option in the Advanced Persistence Settings to Abandon, as described previously. Note that in Dublin the terminology is abandon, though .NET 4 Workflow uses the term abort. Abort effectively discards any state changes to the workflow instance that have occurred in memory and resets its current state to what is stored in the persistence database. Therefore, creating a reliable workflow service using abort amounts to establishing checkpoints that update the persistence store. This can be done most easily by placing Persist activities at the desired locations, but can also be enabled on SendReply activities by setting their PersistBeforeSend property to true. Figure 4 shows a sample workflow that puts a persistence “checkpoint” between the receive operations. In calling DoMoreWork, an exception will be thrown which will remain unhandled. The Action on unhandled exception in this case is set to Abandon. The runtime will act as if DoMoreWork was never called, and re-schedule it so it can be called again.
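The checkpoint behavior can be modeled in a few lines. The following Python sketch is a toy simulation, not the .NET 4 runtime or its API: WorkflowInstance, PERSIST, and the step functions are hypothetical stand-ins, and the caller invokes run() a second time where the real runtime would resume the instance itself. It shows that work done after the last checkpoint is discarded on abort and that the failed step is scheduled again.

```python
import copy

PERSIST = object()  # marker for a checkpoint, standing in for a Persist activity


class WorkflowInstance:
    """Toy instance: in-memory state plus a persisted copy of it."""

    def __init__(self, steps):
        self.steps = steps
        self.state = {"next_step": 0, "completed": []}
        self.store = copy.deepcopy(self.state)  # stands in for the persistence database

    def abort(self):
        # Discard in-memory changes and reload the last persisted state.
        self.state = copy.deepcopy(self.store)

    def run(self):
        while self.state["next_step"] < len(self.steps):
            step = self.steps[self.state["next_step"]]
            try:
                if step is not PERSIST:
                    step(self.state)
                    self.state["completed"].append(step.__name__)
            except Exception:
                self.abort()
                return "aborted"
            self.state["next_step"] += 1
            if step is PERSIST:
                # Checkpoint taken after advancing, so a resume starts at the next step.
                self.store = copy.deepcopy(self.state)
        return "completed"


def do_work(state):
    state["order"] = "received"


def make_do_more_work(failures):
    """Build a step that fails the given number of times, then succeeds."""
    remaining = {"count": failures}

    def do_more_work(state):
        state["order"] = "half-updated"  # in-memory change made before the failure
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise RuntimeError("simulated unhandled exception")
        state["order"] = "processed"

    return do_more_work
```

With the steps [do_work, PERSIST, make_do_more_work(failures=1)], the first run() ends as aborted with the state exactly as it was at the checkpoint, and the second run() re-executes only do_more_work.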

Suppose that instead of using Throw, we have an activity that calls out to another service, and that service is simply unreachable because the network is down. Abort provides a simple way to retry that call by invoking DoMoreWork again once network connectivity is restored. If we had chosen Abandon and suspend instead, the persisted workflow instance would have to be manually resumed via IIS Manager before DoMoreWork could be called again. This extra step can be useful to exert more precise control over when workflows resume their execution.
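The difference between the two settings can be illustrated with a small simulation. This Python sketch is a conceptual model only; PersistedInstance, make_service_call, and the suspend_on_abandon flag are hypothetical names, and resume() stands in for the manual resume performed in IIS Manager.

```python
def make_service_call(network):
    """Build a call that fails while the (simulated) network is down."""
    def call_service():
        if not network["up"]:
            raise ConnectionError("network is down")
    return call_service


class PersistedInstance:
    """Toy instance that is either retried freely or held until resumed."""

    def __init__(self, call_service, suspend_on_abandon):
        self.call_service = call_service
        self.suspend_on_abandon = suspend_on_abandon
        self.status = "running"

    def execute(self):
        if self.status == "suspended":
            return self.status  # nothing runs until someone resumes the instance
        try:
            self.call_service()
            self.status = "completed"
        except ConnectionError:
            # Abandon: stay eligible for another attempt.
            # Abandon and suspend: wait for an explicit resume.
            self.status = "suspended" if self.suspend_on_abandon else "running"
        return self.status

    def resume(self):
        if self.status == "suspended":
            self.status = "running"
```

In this model, an instance created with suspend_on_abandon=False completes on the next execute() after the network returns, while one created with suspend_on_abandon=True stays suspended until resume() is called.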

 
