Serverless Computing: Look Ma, no Guarantees

This blog explores failure behavior in context of AWS Lambda functions being invoked by (transactional) MySQL triggers.

In Short: No Guarantees

There is no transactional coordination between an ongoing database transaction during trigger execution and a AWS Lambda functions invoked by the trigger: a function failure does not necessarily cause a transaction rollback and vice versa: no transactional guarantees, aka, no consistency guarantees, are provided out of the box by the combination of MySQL triggers and AWS Lambda function execution.

It is a DIY moment.

Execution Styles

On a high level, two execution styles can be distinguished:

  • OLTP (Online Transaction Processing)
    • Client initiates the execution of functionality synchronously via an invocation
    • Client receives response of success or failure at the end of the invocation in the same synchronous invocation thread
  • Event (asynchronous execution from client viewpoint)
    • Client initiates the execution of functionality asynchronously by submitting a request, and immediately receives the response that the request was accepted (or not).
    • The client does not get any result of the invocation itself in the initial request submission
    • Server continues the execution, and possibly uses events to coordinate various execution steps
    • Client submits an identifier with the request (later on implemented as ‘correlation_id’) that can be used by the client to check on the asynchronous execution state
    • Any success or failure will be known later in an asynchronous operation: the client polls for the invocation outcome using the ‘correlation_id’, or it will be notified of the outcome via a callback invocation

In the following I focus exclusively on event style execution and all discussion as well as implementation is in context of this style. Retrieving the result or status of asynchronous invocations is done via polling, not callback.

Example Logic

In context of this blog, the client initiates the invocation by inserting a row into a table. The insert initiates a trigger, that in turn calls an AWS Lambda function. The function execution result is placed into the row. If an error happens, error information is recorded in the row instead of a result. The client can retrieve the result, the status or any error that took place by querying the table. In order to know which row to look for, the client can insert a ‘correlation_id’ and use that as primary key to determine the row it is looking for.

The table is defined as follows:

CREATE TABLE fib.fibonacci (
  fib INTEGER,
  value INTEGER,
  correlation_id CHAR(36),
  success BOOLEAN,
  error_sqlstate CHAR(5),
  error_msg TEXT,
  error_errno INTEGER,
  PRIMARY KEY (correlation_id)
);

The columns represent the following data

  • fib. This column is inserted by a client stating the Fibonacci value it wants computed.
  • value. This column will contain the Fibonacci value corresponding to ‘fib’ as it is computed by the function invoked. It is null if it is not computed or an error occurred.
  • correlation_id. This column is inserted by the client in order to distinguish the different rows (aka, asynchronous invocations).
  • success. This column is a Boolean value indicating the success of the invocation (MySQL uses 1 for true, and 0 for false).
  • error_sqlstate. This column contains the MySQL ‘errorstate’ in case an error happened.
  • error_msg. This column contains the MySQL ‘msg’ in case an error happened.
  • error_errno. This column contains the MySQL ‘errno’ in case an error happened.

For this insert

INSERT INTO fib.fibonacci VALUES ( 2, null, '22', null, null, null, null);

the result row looks like

| fib | value | correlation_id | success | error_sqlstate | error_msg | error_errno |
+-----+-------+----------------+---------+----------------+-----------+-------------+
| 2   | 1     | '22'           | 1       | NULL           | NULL      | NULL        |

The following insert causes an error

INSERT INTO fib.fibonacci VALUES (-2, null, '-22', null, null, null, null);

and the result row looks like

| fib | value | correlation_id | success | error_sqlstate | error_msg                           | error_errno |
+-----+-------+----------------+---------+----------------+-------------------------------------+-------------+ 
| -2  | NULL  | '-22'          | 0       | '45000'        | 'Signal before function invocation' | 1091        |

Failure Points Analysis: Principle Points of Failure

There are different possible points of failure during the trigger execution. Given the combination of MySQL triggers and an AWS Lambda function invocation, the various failure points are discussed next.

The trigger has the following structure (some details omitted):

CREATE TRIGGER fibcomp BEFORE INSERT
ON fibonacci
FOR EACH ROW
BEGIN
<local variable declarations>
DECLARE CONTINUE HANDLER FOR SQLEXCEPTION
  BEGIN
        GET DIAGNOSTICS CONDITION 1
        error_sqlstate = RETURNED_SQLSTATE,
        error_msg = MESSAGE_TEXT,
        error_errno = MYSQL_ERRNO;
        SET NEW.error_sqlstate = error_sqlstate;
        SET NEW.error_msg = error_msg;
        SET NEW.error_errno = error_errno;
        SET NEW.success = FALSE;
  END;
SET function_result_message = lambda_sync('arn:aws:lambda:<region:id>:function:fibonacci',
                                          CONCAT('{"fib": ', NEW.fib, '}'));

CALL json_extract_property(function_result_message, 'output', output_value_as_string, property_type);

IF (output_value_as_string IS NULL)
   THEN
        SET NEW.error_errno = "-1";
        SET NEW.error_msg = CONCAT('No output value received: ', function_result_message);
   ELSE
        SET NEW.value = output_value_as_string;
        SET NEW.success = TRUE;
   END IF;
END

Basically, when an insert takes place, the AWS Lambda function ‘fibonacci’ is called. The results is given to the recursive descent parser that I implemented for extracting the ‘output’. The output is analyzed and either inserted into the row, or if there is a failure, the error is inserted. The continue handler catches every exception and inserts detailed error information into the row.

In relation to an AWS Lambda function invocation a trigger can fail

  • Before the function execution
  • During the function execution
  • After the function execution

For any failure the following semantics applies:

  • The failures is recorded in the row so that a client can retrieve the failure details (using the ‘correlation_id’)
  • The service implementation, not client, would ideally perform any error handling and retry logic in order to get to a final resolution (of success or terminal failure). The client should not be involved in error recovery and resolution unless the server exhausted all possible error resolution possibilities.
  • In terms of MySQL, the error handler uses the ‘continue handler’ condition so that it can insert error details into the row. The trigger logic is built so that after an error the trigger execution is completed (and not continued after the failure point).
  • The error (or success) is recorded in the ‘success’ column for the client to determine the state. In case of an error ‘0’ (false) is inserted, in case of success a ‘1’ is recorded
  • For clients to check the execution status a client queries the row it is interested in by providing the ‘correlation_id’ as selection criteria

Deterministic Chaos Monkey: Failures and Outcome

One possible way to actually implement forced errors at the various failure points is to introduce a deterministic chaos monkey that creates a failure on command as follows.

A ‘Chaos Monkey’ database function is implemented (in a very simple deterministic approach at this point) causing different failures based on the various input values. The chaos monkey is invoked in 3 locations: before, during, and after the AWS Lambda function invocation.

The induced errors and their triggering input values are discussed next by failure point.

Before Function Invocation

  • fib = -1. Fatal: have trigger sleep for a minute and crash database by manually killing the OS process running the database
    • Rollback takes place after restart, aka, the insert has no effect
    • Client realizes that no row with the provided ‘correlation_id’ can be found, so it knows that a rollback took place
    • The failure test is done on a locally running MySQL database as it is not possible to crash the database intentionally on AWS
  • fib = -2. Non-fatal: signal
    • No rollback takes place and the insert is successful, but execution does not finish successfully
    • The error is recorded in the row
| fib | value | correlation_id | success | error_sqlstate | error_msg                           | error_errno |
+-----+-------+----------------+---------+----------------+-------------------------------------+-------------+ 
| -2  | NULL  | '-22'          | 0       | '45000'        | 'Signal before function invocation' | 1091        |

AWS Lambda Function Execution Failure

  • fib = -3. Implementation failure caused by a div by zero
    • The function invocation returns an incomplete stack trace with that information embedded (denoted separately below for formatting reasons)
| fib | value | correlation_id | success | error_sqlstate | error_msg | error_errno |
+-----+-------+----------------+---------+----------------+-----------+-------------+ 
| -3  | NULL  | '-33'          | 0       | 'HY000'        | <see (1)> | 1873        |

(1)
'Lambda API returned error: Unhandled. {\n
   \"errorMessage\" : \"/ by zero\",\n
 \"errorType\" : \"java.lang.ArithmeticException\",\n
   \"stackTrace\" : [\n
 \"org.blog.fibonacci.Fibonacci.errorMonkey(Fibonacci.java:190)\",\n
      \"org.blog.fibonacci.Fibonacci.handleRequest(Fibonacci.java:137)\",\n
 \"sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\",\n
      \"sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\",\n
 \"sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMetho'
  • fib = -4. Environment failure by exceeding the AWS Lambda execution time limit (the function invocation sleeps for 10 seconds with a AWS Lambda function execution time limit of 9 seconds)
    • An error message is coming back, however, it does not state that a violation of the limit happened, only that the ‘task’ timed out
| fib | value | correlation_id | success | error_sqlstate | error_msg | error_errno |
+-----+-------+----------------+---------+----------------+-----------+-------------+ 
| -4  | NULL  | '-44'          | 0       | 'HY000'        | <see (2)> | 1873        |

(2)
'Lambda API returned error: Unhandled. {\n
   \"errorMessage\" : \"2018-09-25T13:55:22.381Z 9fe96566-c0ca-11e8-bdeb-17decae46851 Task timed out after 9.01 seconds\"\n
}\n'
  • fib = -5. Environment failure by exceeding the memory limit set (a string is created of size 128MB + 1 byte with an AWS Lambda function limit set to 128MB)
    • Again, the error message coming back only states that there is an out of memory error, not that the set limit was exceeded.
| fib | value | correlation_id | success | error_sqlstate | error_msg | error_errno |
+-----+-------+----------------+---------+----------------+-----------+-------------+ 
| -5  | NULL  | '-55'          | 0       | 'HY000'        | <see (3)> | 1873        |

(3)
'Lambda API returned error: Unhandled. {\n
   \"errorMessage\" : \"Java heap space\",\n
 \"errorType\" : \"java.lang.OutOfMemoryError\",\n
   \"stackTrace\" : [\n
 \"java.util.Arrays.copyOf(Arrays.java:3332)\",\n
 \"java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)\",\n
      \"java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:649)\",\n
 \"java.lang.StringBuilder.append(StringBuilder.java:202)\",\n
 \"java.util.Formatter$FormatSpecifier.justify(Formatter.java:2926)\",\n'

Some observations:

  • All AWS Lambda function execution errors have same ‘sqlstate’ and ‘errno’
    • HY000
    • 1873
  • To distinguish the errors and to determine the reason requires parsing the error message. This is a bummer for various reasons.
  • The errors do not necessarily state the complete context for the failure, especially when limits are violated
  • A network problem was encountered, not sure why, and it was intermittent. So errors happen, and this unexpected error makes the point of this blog (and its subsequent ones) quite nicely.
| fib | value | correlation_id | success | error_sqlstate | error_msg | error_errno |
+-----+-------+----------------+---------+----------------+-----------+-------------+ 
| -4  | NULL  | '-44'          | 0       | 'HY000'        | <see (4)> | 1873        |

(4)
'Lambda API returned error: Network Connection. Unable to connect to endpoint'

I tried to find a complete list of possible AWS Lambda errors, but was not able to.

After Function Invocation

  • fib = -6. Fatal: have trigger sleep for a minute and crash database by manually killing the OS process running the database
    • Rollback took place after restart, aka, has no effect
    • Client needs to realize that by not finding a row with its known ‘correlation_id’
  • fib = -7. Non-fatal: signal
    • No rollback takes place and the insert is successful, but execution does not finish successfully
    • It is unknown if the function invocation succeeded or not. Any recovery requires checking if any function side effect took place
| fib | value | correlation_id | success | error_sqlstate | error_msg                          | error_errno |
+-----+-------+----------------+---------+----------------+------------------------------------+-------------+ 
| -7  | NULL  | '-77'          | 0       | '45000'        | 'Signal after function invocation' | 1092        |

The following is the complete table after triggering each error once:

| fib | value | correlation_id | success | error_sqlstate | error_msg                           | error_errno |
+-----+-------+----------------+---------+----------------+-------------------------------------+-------------+ 
| -2  | NULL  | '-22'          | 0       | '45000'        | 'Signal before function invocation' | 1091        |
| -3  | NULL  | '-33'          | 0       | 'HY000'        | <see (1)>                           | 1873        |
| -4  | NULL  | '-44'          | 0       | 'HY000'        | <see (2)>                           | 1873        |
| -5  | NULL  | '-55'          | 0       | 'HY000'        | <see (3)>                           | 1873        |
| -7  | NULL  | '-77'          | 0       | '45000'        | 'Signal after function invocation'  | 1092        |

Error Recovery

At this point in the discussion the errors, if they take place, are recorded. The upcoming blogs will focus on error recovery, and what has to be in place in order to actually be able to recover.

A big role plays the function type and behavior. A function that is idempotent can be simply invoked again and trigger execution can continue.

If a function is non-idempotent, it might be that the function was executed, but the result was never obtained by the trigger (aka, the return value was lost). Unless it is known if the function was executed successfully, it is impossible to recover from the error correctly and completely.

Two options for probing a function’s execution success are:

  • An idempotent check function that checks if the original function was executed successfully or not. Each regular function needs to be paired up with a check function for this approach. The check function must be able to return the result in order to recover the originally lost return
  • The function provides a idempotency token. This invocation style requires that the client generated a separate token for each invocation, and uses the same token for a repeat invocation to indicate that this is an attempt to execute the function again. The function implementation, receiving the idempotency token, understands that this invocation is a probing invocation and can return the result if it was already computed or initiate the computation.

Function Invocation Chain

As a teaser the following requires discussion next: what type of function invocations chains are assumed?

There are two dimensions:

  • One trigger calls more than one function
  • A function calls another function as part of its implementation
  • A combination of both

In each case, every function can be either idempotent or modifying state without being transactionally bound or coordinated with the trigger.

Summary

Since AWS Lambda functions and MySQL triggers are not transactionally bound, errors during execution can leave an inconsistent state behind. If consistency is required even in the presence of failures, error detection, analysis, recovery and resolution has to be designed and implemented in the triggers as well as AWS Lambda functions.

Go Serverless!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

 

Advertisements