I deleted roughly 2,000 lines of orchestration code from our payment processing service last year. Replaced it with about 200 lines of Amazon States Language JSON. The system got more reliable, not less. That’s the short version of why I think Step Functions is one of the most underappreciated services in AWS.

The longer version involves a 3am incident, a chain of Lambda functions calling each other through direct invocation, and a payment that got charged twice because nobody could tell where the workflow had actually failed.

The Lambda Chain That Broke Everything

Here’s what our payment pipeline looked like before Step Functions. We had five Lambda functions: validate the order, check inventory, process payment, update the database, and send confirmation. Each one invoked the next using the AWS SDK. Classic Lambda-to-Lambda chain.

It worked fine in dev. It worked fine under light load. Then Black Friday happened.

The inventory check Lambda started timing out under load. The payment Lambda had already fired. But the inventory Lambda’s timeout meant the upstream function got an error — and retried. Now we’d charged the customer twice, the inventory count was wrong, and the confirmation email never sent. The worst part? CloudWatch showed the first Lambda as successful. The failure was buried three functions deep, and we had no single view of what had actually happened.

I’ve written about Lambda cold starts causing latency issues, but this was a different beast entirely. This was an orchestration problem. We were using Lambda as both the compute and the coordinator, and it’s terrible at the second job.

That’s what Step Functions solves. It separates the “what to do” from the “in what order, and what if something breaks.”


What Step Functions Actually Is

Step Functions is a state machine service. You define a workflow as a series of states — tasks, choices, parallel branches, waits — using Amazon States Language (ASL), which is just JSON. AWS manages the execution, tracks state, handles retries, and gives you a visual map of every execution.

Two workflow types exist:

Standard Workflows run for up to a year, give you exactly-once execution semantics, and maintain a full execution history. You pay per state transition. These are your bread and butter for business-critical processes — order fulfillment, ETL pipelines, anything where you need an audit trail.

Express Workflows run for up to five minutes, use at-least-once semantics, and bill by execution count and duration. They’re built for high-volume, event-driven work — processing IoT telemetry, transforming streaming data, handling API Gateway requests synchronously.

The choice between them isn’t academic. I’ve seen teams burn money running high-frequency event processing on Standard workflows when Express would’ve cost a tenth as much. And I’ve seen teams use Express for workflows that need exactly-once guarantees, then wonder why they’re processing duplicates.

If you’re building event-driven patterns, Express workflows pair naturally with EventBridge architectures. For longer-running business processes that fit into broader serverless patterns, Standard is almost always the right call.


Rewriting the Payment Pipeline in ASL

Here’s a simplified version of what our payment workflow looked like after the migration. This is the core state machine definition:

{
  "Comment": "Payment processing workflow",
  "StartAt": "ValidateOrder",
  "States": {
    "ValidateOrder": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:ValidateOrder",
      "Next": "CheckInventory",
      "Retry": [
        {
          "ErrorEquals": ["States.TaskFailed"],
          "IntervalSeconds": 2,
          "MaxAttempts": 3,
          "BackoffRate": 2.0
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "Next": "HandleFailure",
          "ResultPath": "$.error"
        }
      ]
    },
    "CheckInventory": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:CheckInventory",
      "Next": "ProcessPayment",
      "Retry": [
        {
          "ErrorEquals": ["States.TaskFailed"],
          "IntervalSeconds": 2,
          "MaxAttempts": 3,
          "BackoffRate": 2.0
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "Next": "HandleFailure",
          "ResultPath": "$.error"
        }
      ]
    },
    "ProcessPayment": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:ProcessPayment",
      "Next": "ParallelPostProcessing",
      "Retry": [
        {
          "ErrorEquals": ["States.TaskFailed"],
          "IntervalSeconds": 5,
          "MaxAttempts": 2,
          "BackoffRate": 3.0
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "Next": "RefundPayment",
          "ResultPath": "$.error"
        }
      ]
    },
    "ParallelPostProcessing": {
      "Type": "Parallel",
      "End": true,
      "Branches": [
        {
          "StartAt": "UpdateDatabase",
          "States": {
            "UpdateDatabase": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:UpdateDB",
              "End": true
            }
          }
        },
        {
          "StartAt": "SendConfirmation",
          "States": {
            "SendConfirmation": {
              "Type": "Task",
              "Resource": "arn:aws:states:::sns:publish",
              "Parameters": {
                "TopicArn": "arn:aws:sns:eu-west-1:123456789012:OrderConfirmations",
                "Message.$": "$.confirmationMessage"
              },
              "End": true
            }
          }
        }
      ]
    },
    "RefundPayment": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:RefundPayment",
      "Next": "HandleFailure"
    },
    "HandleFailure": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:NotifyFailure",
      "End": true
    }
  }
}

Look at what we get for free here. If CheckInventory fails, it retries three times with exponential backoff. If it still fails, it catches the error and routes to HandleFailure — the payment never fires. If the payment succeeds but something downstream breaks, we route to RefundPayment first, then notify. Every execution is tracked. Every state transition is logged. No more guessing where things went wrong at 3am.

The Parallel state at the end runs the database update and SNS notification simultaneously. In the old Lambda chain, we did these sequentially because coordinating parallel async calls inside a Lambda was a mess. Step Functions makes it a first-class concept.


Error Handling That Actually Works

Error handling is where Step Functions earns its keep. In the Lambda chain world, you’re writing try/catch blocks in every function, building your own retry logic, and hoping you’ve covered every edge case. With Step Functions, error handling is declarative.

Three mechanisms matter:

Retry — Automatic retries with configurable intervals, backoff rates, and max attempts. You can target specific error types. I always set up retries for Lambda.ServiceException and Lambda.SdkClientException because Lambda itself can throw transient errors that have nothing to do with your code.

Catch — Fallback routing when retries are exhausted. The ResultPath parameter is crucial here — it lets you append the error information to the original input so your fallback state knows both what was being processed and what went wrong.

Timeouts — The TimeoutSeconds field on Task states prevents executions from hanging forever. I set this on every single task. There is no default timeout: if a task hangs without ever returning an error — a dead activity worker, a callback that never arrives — the execution sits there until the state machine’s own limit (up to a year for Standard). A stuck execution blocks resources, and on Express workflows it keeps accruing duration charges. Always set explicit timeouts.
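To make ResultPath concrete: suppose a state receives {"orderId": "ORD-12345", "amount": 99.99} as input (values illustrative) and its retries are exhausted by a timeout. With "ResultPath": "$.error", the fallback state receives the original input with the error object appended under the key you chose:

```json
{
  "orderId": "ORD-12345",
  "amount": 99.99,
  "error": {
    "Error": "States.Timeout",
    "Cause": "illustrative cause text from the failed task"
  }
}
```

The default, "ResultPath": "$", would replace the entire input with just the error object — the fallback state would know what went wrong but not which order it was processing.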

Here’s a pattern I use for tasks that call external APIs where I want aggressive retries on transient failures but immediate failure on validation errors:

{
  "Type": "Task",
  "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:CallExternalAPI",
  "TimeoutSeconds": 30,
  "Retry": [
    {
      "ErrorEquals": ["States.Timeout", "Lambda.ServiceException"],
      "IntervalSeconds": 3,
      "MaxAttempts": 5,
      "BackoffRate": 2.0,
      "MaxDelaySeconds": 30
    }
  ],
  "Catch": [
    {
      "ErrorEquals": ["ValidationError"],
      "Next": "RejectOrder",
      "ResultPath": "$.error"
    },
    {
      "ErrorEquals": ["States.ALL"],
      "Next": "HandleUnexpectedError",
      "ResultPath": "$.error"
    }
  ]
}

Order matters in both Retry and Catch arrays. Step Functions evaluates them top to bottom and uses the first match. Put your specific error handlers before States.ALL.


Integration Patterns: Stop Writing Glue Code

One of the things that took me too long to learn is that Step Functions can call AWS services directly — you don’t need a Lambda function in between. These are called optimized integrations, and they eliminated a bunch of our “glue” Lambdas that existed only to make an SDK call and pass the result along.

Three integration patterns:

Request Response — The default. Step Functions calls the service, gets the HTTP response back, and immediately moves to the next state — it doesn’t wait for whatever downstream job the call kicked off. Use this for SNS publishes, SQS sends, anything where you don’t need to wait for a result.

Run a Job (.sync) — Step Functions calls the service and waits for the job to complete. This is huge for Batch jobs, ECS tasks, Glue jobs, even nested Step Functions executions. You append .sync to the resource ARN and Step Functions handles the polling for you.

Wait for Callback (.waitForTaskToken) — Step Functions pauses and waits for an external system to call back with a task token. This is how you integrate human approvals, external webhooks, or any process that lives outside AWS. The state machine just… waits. For up to a year on Standard workflows.
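Here’s a minimal sketch of the callback pattern — the queue URL and state names are hypothetical. The task drops a message on SQS with the task token pulled from the context object, then pauses until something calls SendTaskSuccess or SendTaskFailure with that token:

```json
{
  "Type": "Task",
  "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
  "Parameters": {
    "QueueUrl": "https://sqs.eu-west-1.amazonaws.com/123456789012/approval-requests",
    "MessageBody": {
      "orderId.$": "$.orderId",
      "taskToken.$": "$$.Task.Token"
    }
  },
  "TimeoutSeconds": 86400,
  "Next": "ApprovalReceived"
}
```

Note that $$.Task.Token reads from the context object, not your input — a plain $.taskToken would resolve to nothing. The TimeoutSeconds (here one day) guards against a callback that never arrives.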

Here’s a direct DynamoDB integration that skips Lambda entirely:

{
  "Type": "Task",
  "Resource": "arn:aws:states:::dynamodb:putItem",
  "Parameters": {
    "TableName": "Orders",
    "Item": {
      "OrderId": {"S.$": "$.orderId"},
      "Status": {"S": "CONFIRMED"},
      "Timestamp": {"S.$": "$$.State.EnteredTime"}
    }
  },
  "ResultPath": "$.dynamoResult",
  "Next": "SendNotification"
}

No Lambda. No cold start. No runtime to maintain. Step Functions talks to DynamoDB directly. I’ve replaced dozens of single-purpose Lambda functions with direct integrations like this. Less code, fewer failure points, lower cost.

The callback pattern deserves special attention. We use it for payment provider webhooks — the state machine sends a payment request, includes the task token in the callback URL, and pauses. When the payment provider hits our webhook endpoint (an API Gateway + Lambda combo), it calls SendTaskSuccess with the token and the state machine resumes. Clean, traceable, and the state machine handles the timeout if the callback never arrives.
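The completion side is a single API call. Our webhook Lambda does this through the SDK, but the same thing via the CLI looks roughly like this — the token variable and output payload are illustrative:

```shell
# Resume the paused execution with the provider's result
aws stepfunctions send-task-success \
  --task-token "$TASK_TOKEN" \
  --task-output '{"paymentStatus": "SETTLED"}'

# Or fail the state so its Catch rules take over
aws stepfunctions send-task-failure \
  --task-token "$TASK_TOKEN" \
  --error "PaymentDeclined" \
  --cause "Issuer rejected the charge"
```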


Deploying and Managing with the CLI

I deploy state machines through CloudFormation in production, but the CLI is invaluable for development and debugging. Here’s the workflow I use:

Create a state machine:

aws stepfunctions create-state-machine \
  --name "PaymentProcessing" \
  --definition file://payment-workflow.json \
  --role-arn arn:aws:iam::123456789012:role/StepFunctionsPaymentRole \
  --type STANDARD \
  --logging-configuration '{
    "level": "ALL",
    "includeExecutionData": true,
    "destinations": [{
      "cloudWatchLogsLogGroup": {
        "logGroupArn": "arn:aws:logs:eu-west-1:123456789012:log-group:/aws/vendedlogs/states/PaymentProcessing:*"
      }
    }]
  }'

Always enable logging with level: ALL and includeExecutionData: true during development. The /aws/vendedlogs/ prefix on the log group name avoids hitting CloudWatch resource policy size limits — a gotcha that’s bitten me more than once.

Start an execution:

aws stepfunctions start-execution \
  --state-machine-arn arn:aws:states:eu-west-1:123456789012:stateMachine:PaymentProcessing \
  --input '{"orderId": "ORD-12345", "amount": 99.99, "currency": "USD"}'

Check execution status:

aws stepfunctions describe-execution \
  --execution-arn arn:aws:states:eu-west-1:123456789012:execution:PaymentProcessing:exec-id

The execution history is where the real debugging power lives:

aws stepfunctions get-execution-history \
  --execution-arn arn:aws:states:eu-west-1:123456789012:execution:PaymentProcessing:exec-id \
  --query 'events[?type==`TaskFailed`]'

That --query filter is something I use constantly. Instead of scrolling through hundreds of state transition events, I can pull just the failures, just the Lambda results, or just the state entry/exit events.
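For example, pulling just the Lambda results out of an execution — the event detail field names here come from the GetExecutionHistory API shapes, so treat this as a sketch:

```shell
aws stepfunctions get-execution-history \
  --execution-arn arn:aws:states:eu-west-1:123456789012:execution:PaymentProcessing:exec-id \
  --query 'events[?type==`LambdaFunctionSucceeded`].lambdaFunctionSucceededEventDetails.output'
```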


Standard vs Express: Making the Right Call

I’ve already touched on this, but the choice between Standard and Express workflows trips people up enough that it’s worth going deeper.

Use Standard when:

  • The workflow runs longer than five minutes
  • You need exactly-once execution (payment processing, order fulfillment)
  • You need .sync or .waitForTaskToken integration patterns
  • You need a full audit trail of every execution
  • The workflow involves human approval steps

Use Express when:

  • High throughput, short duration (API backends, stream processing)
  • At-least-once is acceptable (idempotent operations)
  • You’re processing events from EventBridge, Kinesis, or SQS
  • Cost matters more than execution guarantees

A pattern I’ve used successfully: nest Express workflows inside Standard workflows. The Standard workflow handles the overall business process with its exactly-once guarantees. Individual steps that need high throughput — like processing a batch of line items — run as Express sub-workflows. You get the reliability of Standard for the orchestration and the cost efficiency of Express for the heavy lifting.

{
  "Type": "Task",
  "Resource": "arn:aws:states:::states:startExecution.sync:2",
  "Parameters": {
    "StateMachineArn": "arn:aws:states:eu-west-1:123456789012:stateMachine:ProcessLineItems",
    "Input.$": "$.lineItems"
  },
  "Next": "FinalizeOrder"
}

The .sync:2 suffix is important — it waits for the child execution to complete and returns the output as parsed JSON rather than a string. Without the :2, you get the output as an escaped JSON string inside the Output field and have to parse it yourself.


Distributed Map: Processing at Scale

The Map state in Distributed mode changed how I think about batch processing on AWS. Before this, processing thousands of S3 objects meant writing custom orchestration — chunking the work, managing concurrency, tracking failures. Now I point a Distributed Map at an S3 bucket and Step Functions handles the rest.

It can process millions of items with up to 10,000 concurrent child executions. Each child runs as its own workflow execution — you choose Standard or Express per map, and Express is usually the right call for short, high-volume items — so you get the cost profile of Express with the orchestration of Standard. You set a MaxConcurrency to avoid overwhelming downstream services, and ToleratedFailurePercentage to decide how many items can fail before the whole batch is considered failed.
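Here’s a sketch of a Distributed Map over an S3 listing — the bucket, function, and state names are illustrative:

```json
{
  "Type": "Map",
  "ItemReader": {
    "Resource": "arn:aws:states:::s3:listObjectsV2",
    "Parameters": {
      "Bucket": "incoming-data-bucket"
    }
  },
  "ItemProcessor": {
    "ProcessorConfig": {
      "Mode": "DISTRIBUTED",
      "ExecutionType": "EXPRESS"
    },
    "StartAt": "ProcessObject",
    "States": {
      "ProcessObject": {
        "Type": "Task",
        "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:ProcessObject",
        "End": true
      }
    }
  },
  "MaxConcurrency": 500,
  "ToleratedFailurePercentage": 5,
  "End": true
}
```

Each object from the listing becomes the input of one child execution; with ToleratedFailurePercentage set to 5, the batch survives up to 5% of items failing before the Map state itself fails.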

This is particularly relevant if you’re designing scalable AWS architectures. Distributed Map lets you scale processing horizontally without managing the infrastructure for that parallelism yourself.


What I’d Do Differently

Looking back at our migration, a few things I’d change:

Start with Workflow Studio. I wrote raw ASL JSON from day one because I’m stubborn. Workflow Studio — the visual editor in the console — would’ve saved me hours of debugging misplaced commas and wrong field names. Write the first draft visually, export the ASL, then manage it in code.

Use direct integrations from the start. We migrated the Lambda chain to Step Functions but kept all the Lambdas. It took another round of refactoring to realize half of them were just making SDK calls that Step Functions could make directly.

Set up X-Ray tracing immediately. Step Functions integrates with X-Ray, and the end-to-end trace view across Lambda invocations, DynamoDB calls, and SNS publishes is incredibly useful. We didn’t enable it until we had a performance issue in production, and by then we’d wasted days guessing.

Don’t over-orchestrate. Not everything needs a state machine. If you’ve got two Lambda functions and a simple success/failure path, a direct invocation or an EventBridge rule is simpler and cheaper. Step Functions shines when you have branching logic, parallel execution, error handling complexity, or long-running processes. For simple event-driven patterns, keep it simple.

The payment pipeline that started this whole journey hasn’t had a silent failure since the migration. Every execution is visible. Every error is caught and routed. Every retry is automatic. I can look at any execution in the console and see exactly what happened, what data flowed through each state, and where things went wrong if they did.

That’s worth more than any amount of custom orchestration code.