I want to tell you about a production incident that still bothers me.
We had a payment processing system built on Lambda. Each function did one thing: validate the card, charge the customer, update the order, send the receipt, trigger fulfillment. Clean separation of concerns. Looked great on paper.
Then a Lambda timed out in the middle of the charge step. The card had been charged. The order had not been updated. The receipt never went out. Fulfillment never started. And because there was no central record of what had run, we had no way to resume from where things broke. We ended up with a manual cleanup process, a refund, and an angry customer.
The root problem was not the timeout. The root problem was that we had orchestration logic scattered across function calls, SQS queues, and environment variables. When something went wrong, we had no visibility and no way to recover cleanly.
AWS Step Functions exists to solve exactly this problem. It gives you a managed, visual, stateful orchestration layer that sits above your compute. In this article I will walk you through how Step Functions actually works, the patterns that matter in production, and the mistakes I see teams make when they first adopt it.
What Step Functions Actually Does
Step Functions is a serverless orchestration service. You define a workflow as a state machine using Amazon States Language, a JSON-based specification. Each state in the machine can invoke a Lambda function, call an AWS service directly, wait for a human approval, run a parallel branch, or retry on failure with configurable backoff.
The key thing that separates Step Functions from gluing Lambdas together with SQS is that the state machine itself is the source of truth. Every execution has a complete audit trail. You can look at any execution and see exactly which states ran, what input and output they received, when they ran, and whether they succeeded or failed. When something goes wrong you have a complete picture.
There are two workflow types and the choice matters.
Standard Workflows are designed for long-running, durable processes. They can run for up to a year. Every state transition is recorded in the execution history. You pay per state transition. This is what you want for anything involving payments, order processing, document workflows, or human approvals.
Express Workflows are designed for high-volume, short-duration workloads. They run for up to five minutes, and you pay per execution based on request count, duration, and memory rather than per state transition. Asynchronous Express Workflows have at-least-once execution semantics (synchronous ones are at-most-once). Use them for event processing pipelines where you need to handle thousands of events per second and idempotency is handled at the application level.
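The choice between the two types can be written down as a small decision rule. The sketch below is only an illustration of the criteria described above; the function name and its two parameters are hypothetical and not part of any AWS API.

```python
EXPRESS_MAX_SECONDS = 5 * 60  # Express Workflows run for at most five minutes


def choose_workflow_type(max_duration_seconds: int,
                         has_non_idempotent_side_effects: bool) -> str:
    """Return "STANDARD" or "EXPRESS" based on the rules of thumb in the text."""
    # Payments, approvals, and similar steps must not run twice by accident.
    if has_non_idempotent_side_effects:
        return "STANDARD"
    # Anything that can outlive the Express time limit needs a Standard Workflow.
    if max_duration_seconds > EXPRESS_MAX_SECONDS:
        return "STANDARD"
    return "EXPRESS"
```

The returned string matches the values accepted by the state machine `type` setting, so a helper like this could feed a deployment script.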
Your First Production State Machine
Let me walk through a real example: an e-commerce order processing workflow. This is a Standard Workflow since order processing is exactly the kind of thing you need full durability and auditability for.
{
  "Comment": "Order processing workflow",
  "StartAt": "ValidateOrder",
  "States": {
    "ValidateOrder": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789:function:validate-order",
      "Next": "CheckInventory",
      "Retry": [
        {
          "ErrorEquals": ["Lambda.ServiceException", "Lambda.TooManyRequestsException"],
          "IntervalSeconds": 2,
          "MaxAttempts": 3,
          "BackoffRate": 2
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["OrderValidationError"],
          "Next": "OrderRejected",
          "ResultPath": "$.error"
        }
      ]
    },
    "CheckInventory": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789:function:check-inventory",
      "Next": "ProcessPayment",
      "Retry": [
        {
          "ErrorEquals": ["States.TaskFailed"],
          "IntervalSeconds": 5,
          "MaxAttempts": 2,
          "BackoffRate": 1.5
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["InsufficientInventoryError"],
          "Next": "NotifyOutOfStock",
          "ResultPath": "$.error"
        }
      ]
    },
    "ProcessPayment": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789:function:process-payment",
      "Next": "FulfillmentAndNotification",
      "Retry": [
        {
          "ErrorEquals": ["Lambda.ServiceException"],
          "IntervalSeconds": 1,
          "MaxAttempts": 2,
          "BackoffRate": 2
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["PaymentDeclinedError"],
          "Next": "NotifyPaymentFailed",
          "ResultPath": "$.error"
        },
        {
          "ErrorEquals": ["States.ALL"],
          "Next": "OrderProcessingFailed",
          "ResultPath": "$.error"
        }
      ]
    },
    "FulfillmentAndNotification": {
      "Type": "Parallel",
      "Branches": [
        {
          "StartAt": "TriggerFulfillment",
          "States": {
            "TriggerFulfillment": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:us-east-1:123456789:function:trigger-fulfillment",
              "End": true
            }
          }
        },
        {
          "StartAt": "SendConfirmationEmail",
          "States": {
            "SendConfirmationEmail": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:us-east-1:123456789:function:send-email",
              "End": true
            }
          }
        }
      ],
      "Next": "OrderComplete"
    },
    "OrderComplete": {
      "Type": "Succeed"
    },
    "OrderRejected": {
      "Type": "Fail",
      "Error": "OrderRejected"
    },
    "NotifyOutOfStock": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789:function:notify-out-of-stock",
      "End": true
    },
    "NotifyPaymentFailed": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789:function:notify-payment-failed",
      "End": true
    },
    "OrderProcessingFailed": {
      "Type": "Fail",
      "Error": "ProcessingFailed"
    }
  }
}
A few things worth pointing out in this definition.
The Retry blocks on each Task state handle transient failures automatically. The configuration above retries on Lambda service exceptions with exponential backoff. You get this behavior for free without writing any retry logic in your Lambda functions themselves.
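To see what a Retry block means in wall-clock terms, it helps to compute the wait before each retry. The sketch below assumes the basic rule that each wait is the previous one multiplied by BackoffRate, and it ignores optional settings such as a maximum delay or jitter. The function name is hypothetical.

```python
def retry_delays(interval_seconds: float, max_attempts: int,
                 backoff_rate: float) -> list:
    """Seconds waited before each retry attempt, in order."""
    # The first retry waits IntervalSeconds; each later retry multiplies
    # the previous wait by BackoffRate.
    return [interval_seconds * backoff_rate ** attempt
            for attempt in range(max_attempts)]
```

For the ValidateOrder settings (IntervalSeconds 2, MaxAttempts 3, BackoffRate 2) this gives waits of 2, 4, and 8 seconds, so a persistent failure surfaces after roughly 14 seconds of waiting plus the run time of each attempt.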
The Catch blocks handle business-logic failures separately from infrastructure failures. A PaymentDeclinedError routes to a notification state. Any other error is caught by States.ALL and routes to a generic failure state. Setting ResultPath to $.error writes the error detail into the state input under the error key, alongside the original order data, instead of replacing it.
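The effect of ResultPath on a Catch handler is easy to model in a few lines. This is a simplified sketch, not the service's implementation: it only supports "$" and dotted paths such as "$.error", and the function name is hypothetical.

```python
import copy


def apply_result_path(state_input: dict, result: dict,
                      result_path: str = "$") -> dict:
    """Model how a caught error is combined with the state input."""
    # "$" is the default: the result replaces the input entirely.
    if result_path == "$":
        return result
    if not result_path.startswith("$."):
        raise ValueError("this sketch only handles '$' and '$.a.b' style paths")
    output = copy.deepcopy(state_input)  # leave the caller's input untouched
    node = output
    *parents, leaf = result_path[2:].split(".")
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = result
    return output
```

With the default path the next state sees only the error object (its Error and Cause fields). With "$.error" it sees the original order plus an error key.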
The Parallel state in FulfillmentAndNotification runs fulfillment and email simultaneously. Both branches must complete before the workflow advances to OrderComplete. If either branch fails, the entire Parallel state fails. This is often exactly the behavior you want: do not mark the order complete until both downstream systems have been notified.
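The semantics of a Parallel state can be approximated locally with a thread pool. This is only a rough model under stated assumptions: each branch is a plain Python callable (hypothetical stand-ins for the two Lambda functions), every branch receives its own copy of the input, and the output is a list of branch results in branch order. Unlike the real service, the sketch does not stop the other branches when one fails.

```python
import copy
from concurrent.futures import ThreadPoolExecutor


def run_parallel(branches: list, state_input: dict) -> list:
    """Run all branches concurrently; fail if any branch fails."""
    with ThreadPoolExecutor(max_workers=len(branches)) as pool:
        futures = [pool.submit(branch, copy.deepcopy(state_input))
                   for branch in branches]
        # result() re-raises a branch's exception, so one failing branch
        # makes the whole call fail and no combined output is produced.
        return [future.result() for future in futures]
```

The state after a Parallel state therefore receives an array, one element per branch, not a merged object.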
SDK Integrations: Stop Writing Wrapper Lambdas
One of the most common mistakes I see is writing Lambda functions whose only job is to call another AWS service. A Lambda that calls DynamoDB to write a record. A Lambda that sends an SNS message. A Lambda that starts a Glue job.
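Step Functions can call over 220 AWS services directly through its AWS SDK integrations, and it offers optimized integrations for a smaller set of commonly used services such as DynamoDB, SNS, SQS, and Glue. You can call these services directly from a state definition without a Lambda in the middle.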
Here is a state that writes directly to DynamoDB:
"SaveOrderToDynamo": {
  "Type": "Task",
  "Resource": "arn:aws:states:::dynamodb:putItem",
  "Parameters": {
    "TableName": "orders",
    "Item": {
      "orderId": { "S.$": "$.orderId" },
      "customerId": { "S.$": "$.customerId" },
      "status": { "S": "CONFIRMED" },
      "totalAmount": { "N.$": "States.Format('{}', $.totalAmount)" },
      "createdAt": { "S.$": "$$.Execution.StartTime" }
    }
  },
  "Next": "SendToSNS"
}
And a state that publishes to SNS:
"SendToSNS": {
  "Type": "Task",
  "Resource": "arn:aws:states:::sns:publish",
  "Parameters": {
    "TopicArn": "arn:aws:sns:us-east-1:123456789:order-events",
    "Message": {
      "orderId.$": "$.orderId",
      "customerId.$": "$.customerId",
      "status": "CONFIRMED"
    }
  },
  "Next": "OrderComplete"
}
The .$ suffix on a key means “treat this value as a JSONPath expression or intrinsic function and resolve it at runtime,” usually against the state input. A value that starts with $$, such as $$.Execution.StartTime, reads from the context object, which gives you metadata about the current execution. These small conveniences add up significantly when building real workflows.
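A toy resolver makes the .$ convention concrete. This sketch is an assumption-laden simplification: it handles only plain dotted paths such as $.orderId and $$.Execution.StartTime, not intrinsic functions like States.Format or array indexing, and both function names are hypothetical.

```python
def resolve_path(path: str, data: dict):
    """Follow a dotted path such as "$.a.b" through nested dicts."""
    value = data
    for key in path.split(".")[1:]:  # skip the leading "$"
        value = value[key]
    return value


def resolve_parameters(parameters: dict, state_input: dict,
                       context: dict) -> dict:
    """Replace every "key.$" entry with the value its path points to."""
    resolved = {}
    for key, value in parameters.items():
        if isinstance(value, dict):
            resolved[key] = resolve_parameters(value, state_input, context)
        elif key.endswith(".$"):
            if value.startswith("$$."):
                # "$$" paths read from the context object.
                resolved[key[:-2]] = resolve_path(value[1:], context)
            else:
                resolved[key[:-2]] = resolve_path(value, state_input)
        else:
            resolved[key] = value  # static values pass through unchanged
    return resolved
```

Running it over a Parameters block shows the payload the target service would receive: keys lose their .$ suffix and carry the resolved values.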
Removing wrapper Lambdas reduces cold starts, lowers your Lambda invocation costs, simplifies your IAM surface, and makes the workflow easier to read because every state’s purpose is self-evident.
The Wait for Callback Pattern
Some workflows cannot move forward until something external happens. A human needs to approve a refund. A third-party payment processor needs to call back. A document needs to pass a review queue.
Step Functions handles this with the waitForTaskToken integration pattern. The state machine pauses, sends a token to an external system, and resumes only when that token is returned.
Here is the state definition:
"WaitForManagerApproval": {
  "Type": "Task",
  "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
  "Parameters": {
    "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123456789/approval-queue",
    "MessageBody": {
      "taskToken.$": "$$.Task.Token",
      "orderId.$": "$.orderId",
      "amount.$": "$.totalAmount",
      "requestedBy.$": "$.customerId"
    }
  },
  "HeartbeatSeconds": 3600,
  "Next": "ProcessApprovedRefund",
  "Catch": [
    {
      "ErrorEquals": ["ApprovalRejected"],
      "Next": "NotifyRejected"
    },
    {
      "ErrorEquals": ["States.HeartbeatTimeout"],
      "Next": "EscalateApproval"
    }
  ]
}
The approval service picks up the message, presents it to a manager, and then calls back:
import json

import boto3

sfn = boto3.client("stepfunctions")


def handle_approval_decision(task_token: str, approved: bool, reason: str):
    if approved:
        # Resume the paused execution; this output becomes the state's result.
        sfn.send_task_success(
            taskToken=task_token,
            output=json.dumps({"approved": True, "approvedBy": "manager@company.com"}),
        )
    else:
        # Fail the paused state with the error name the Catch block matches on.
        sfn.send_task_failure(
            taskToken=task_token,
            error="ApprovalRejected",
            cause=reason,
        )
The HeartbeatSeconds field is important. If the external system does not call SendTaskHeartbeat or complete the task within that window, the state fails with a States.HeartbeatTimeout error. In the example above that routes to an escalation state rather than silently hanging forever. Always set a heartbeat, or at least a TimeoutSeconds, on any waitForTaskToken state.
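If an approval can legitimately take longer than the heartbeat window, the external system has to keep reporting that it is still working. A minimal sketch of such a loop follows. The function name, the `is_pending` callback, and the injected `sleep` are assumptions for illustration; `sfn` is expected to be a boto3 Step Functions client, whose `send_task_heartbeat` call takes the task token.

```python
import time
from typing import Callable


def keep_task_alive(sfn, task_token: str, is_pending: Callable[[], bool],
                    interval_seconds: float,
                    sleep: Callable[[float], None] = time.sleep) -> int:
    """Send heartbeats while the decision is pending; return how many were sent."""
    sent = 0
    while is_pending():
        # Each heartbeat resets the HeartbeatSeconds timer on the waiting state.
        sfn.send_task_heartbeat(taskToken=task_token)
        sent += 1
        sleep(interval_seconds)
    return sent
```

The interval should be comfortably shorter than HeartbeatSeconds so that one delayed call does not trigger the timeout.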
Deploying with Terraform
Defining your state machine in the console is fine for exploration. In production, everything should be in code.
resource "aws_sfn_state_machine" "order_processing" {
  name     = "order-processing-workflow"
  role_arn = aws_iam_role.step_functions_role.arn
  type     = "STANDARD"

  # Render the ASL template, injecting the Lambda ARNs for this environment.
  definition = templatefile("${path.module}/state_machine.json", {
    validate_order_arn      = aws_lambda_function.validate_order.arn
    check_inventory_arn     = aws_lambda_function.check_inventory.arn
    process_payment_arn     = aws_lambda_function.process_payment.arn
    trigger_fulfillment_arn = aws_lambda_function.trigger_fulfillment.arn
    send_email_arn          = aws_lambda_function.send_email.arn
  })

  logging_configuration {
    level                  = "ALL"
    include_execution_data = true
    log_destination        = "${aws_cloudwatch_log_group.sfn_logs.arn}:*"
  }

  tracing_configuration {
    enabled = true
  }
}

resource "aws_iam_role" "step_functions_role" {
  name = "step-functions-order-processing-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "states.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy" "sfn_policy" {
  name = "sfn-order-processing-policy"
  role = aws_iam_role.step_functions_role.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = ["lambda:InvokeFunction"]
        Resource = [
          aws_lambda_function.validate_order.arn,
          aws_lambda_function.check_inventory.arn,
          aws_lambda_function.process_payment.arn,
          aws_lambda_function.trigger_fulfillment.arn,
          aws_lambda_function.send_email.arn
        ]
      },
      {
        # Log delivery needs more than create/get; without the full set,
        # enabling logging on the state machine can fail.
        Effect = "Allow"
        Action = [
          "logs:CreateLogDelivery",
          "logs:GetLogDelivery",
          "logs:UpdateLogDelivery",
          "logs:DeleteLogDelivery",
          "logs:ListLogDeliveries",
          "logs:PutLogEvents",
          "logs:PutResourcePolicy",
          "logs:DescribeResourcePolicies",
          "logs:DescribeLogGroups"
        ]
        Resource = "*"
      },
      {
        Effect = "Allow"
        Action = [
          "xray:PutTraceSegments",
          "xray:PutTelemetryRecords",
          "xray:GetSamplingRules",
          "xray:GetSamplingTargets"
        ]
        Resource = "*"
      }
    ]
  })
}

resource "aws_cloudwatch_log_group" "sfn_logs" {
  name              = "/aws/states/order-processing"
  retention_in_days = 30
}
Using templatefile to inject Lambda ARNs into the state machine definition keeps your infrastructure code clean and makes it easy to reference the correct function ARN for each environment without hardcoding anything.
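Because the rendered definition is plain JSON, a small check in CI can catch a mistyped state name before anything is deployed. The sketch below is one possible lint, not an official validator: it only follows Next, Catch, StartAt, and Parallel branches (Choice and Map states are not covered), and the function name is hypothetical.

```python
def find_dangling_targets(definition: dict) -> list:
    """Return "state -> target" strings for transitions to undefined states."""
    states = definition["States"]
    problems = []
    if definition["StartAt"] not in states:
        problems.append(f"StartAt -> {definition['StartAt']}")
    for name, state in states.items():
        targets = [state["Next"]] if "Next" in state else []
        targets += [catcher["Next"] for catcher in state.get("Catch", [])]
        for target in targets:
            if target not in states:
                problems.append(f"{name} -> {target}")
        # Each Parallel branch is a self-contained state machine of its own.
        for branch in state.get("Branches", []):
            problems += find_dangling_targets(branch)
    return problems
```

An empty list means every transition this sketch knows about points at a state that exists. Load the rendered file with `json.load` and fail the build on a non-empty result.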
Observability in Production
Step Functions gives you three layers of observability out of the box when you configure them properly.
CloudWatch Metrics publishes execution counts, failure rates, and durations for every state machine automatically. Set alarms on ExecutionsFailed and ExecutionsTimedOut. For payment or order workflows, a single failed execution is worth an alert. For high-volume event pipelines, set a threshold based on your acceptable failure rate.
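An alarm on failed executions can be expressed as the keyword arguments for CloudWatch's PutMetricAlarm call. The sketch below builds that dictionary. The function name, the alarm name, and the SNS topic are hypothetical; the namespace AWS/States, the metric name, and the StateMachineArn dimension are what Step Functions publishes.

```python
def failed_execution_alarm(state_machine_arn: str, alarm_name: str,
                           sns_topic_arn: str, threshold: int = 1) -> dict:
    """Build put_metric_alarm arguments for the ExecutionsFailed metric."""
    return {
        "AlarmName": alarm_name,
        "Namespace": "AWS/States",
        "MetricName": "ExecutionsFailed",
        "Dimensions": [{"Name": "StateMachineArn", "Value": state_machine_arn}],
        "Statistic": "Sum",
        "Period": 60,                # evaluate one-minute buckets
        "EvaluationPeriods": 1,
        "Threshold": threshold,      # 1 means a single failure raises the alarm
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",  # no executions is not a failure
        "AlarmActions": [sns_topic_arn],
    }
```

With boto3 this would be applied as `boto3.client("cloudwatch").put_metric_alarm(**failed_execution_alarm(...))`. For a high-volume pipeline, raise the threshold to match the failure rate you can tolerate.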
CloudWatch Logs with include_execution_data = true captures the full input and output of every state transition. This is the setting that makes debugging possible. Without it, you know a state failed but not what data it received. With it, you can replay the exact scenario that caused the failure.
X-Ray tracing propagates trace context through Lambda invocations triggered by your state machine. In the AWS console, you get a service map showing exactly where time was spent across each execution. For workflows where latency matters, this is the fastest way to identify the bottleneck.
One practical tip: write a CloudWatch Logs Insights query that you can run immediately when an incident starts.
fields @timestamp, execution_arn, type, details.name, details.error, details.cause
| filter type in ["ExecutionFailed", "TaskFailed", "LambdaFunctionFailed", "TaskStateExited"]
| sort @timestamp desc
| limit 50
Save this query before you need it. Running it during an incident is much faster than clicking through individual executions.
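The same kind of query can also be run from a script, which is handy in a runbook. The sketch below assumes `logs` is a boto3 CloudWatch Logs client and uses its `start_query` and `get_query_results` calls; the function name, the polling interval, and the injected `sleep` are illustrative choices.

```python
import time


def run_insights_query(logs, log_group: str, query: str,
                       start: int, end: int,
                       poll_seconds: float = 1.0, sleep=time.sleep) -> list:
    """Start a Logs Insights query and wait for its results."""
    # start and end are epoch timestamps in seconds.
    query_id = logs.start_query(
        logGroupName=log_group,
        startTime=start,
        endTime=end,
        queryString=query,
    )["queryId"]
    while True:
        response = logs.get_query_results(queryId=query_id)
        status = response["status"]
        if status == "Complete":
            return response["results"]
        if status not in ("Scheduled", "Running"):
            raise RuntimeError(f"query ended with status {status}")
        sleep(poll_seconds)
```

Each result row is a list of field/value pairs, so a caller would typically convert rows to dictionaries before printing them.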
Common Mistakes
Not setting ResultPath on Catch handlers. By default, a Catch block replaces the entire state input with the error object. Your downstream states then receive only the error, not the original order data they need. Always use "ResultPath": "$.error" to merge the error into the existing input.
Using Express Workflows for payment processing. Asynchronous Express Workflows have at-least-once semantics, so an execution, and every state in it, can run more than once under failure conditions. For anything involving money or external side effects, use Standard Workflows, and still use idempotency keys in your Lambda functions, because a Retry block can invoke the same task again.
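An idempotency key boils down to "look up before you act." The sketch below shows the idea with an in-memory dict standing in for a durable store. In a real Lambda the store would be something like a DynamoDB table with a conditional write to handle concurrent invocations, which this sketch does not model. The function name and the key format are hypothetical.

```python
def charge_once(store: dict, idempotency_key: str, charge) -> dict:
    """Run charge() at most once per key and return the recorded result."""
    if idempotency_key in store:
        # A retry or duplicate delivery: return what happened the first time.
        return store[idempotency_key]
    result = charge()
    store[idempotency_key] = result
    return result
```

A natural key is something stable across retries of the same step, for example the order ID combined with the step name, rather than anything generated inside the function on each invocation.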
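Ignoring the execution history limit. Standard Workflow execution history is capped at 25,000 events, and a single state usually records several events. For very long-running workflows with many state transitions, you can hit this limit. If your workflow runs for days or weeks with thousands of steps, split the work into chunks and run each chunk as its own child execution, for example with a Distributed Map state or a nested execution per chunk, so that individual execution histories stay manageable.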
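The chunking itself is the simple part. A minimal sketch, assuming the work is a list of items and that each chunk will be handed to a separate child execution; the function name and chunk size are illustrative.

```python
def chunk(items: list, size: int) -> list:
    """Split items into consecutive lists of at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [items[i:i + size] for i in range(0, len(items), size)]
```

Pick the chunk size so that the number of states a chunk passes through, multiplied by the events each state records, stays well below the 25,000-event cap.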
Hardcoding ARNs in state machine definitions. Environment-specific ARNs belong in Terraform variables or SSM Parameter Store, not in your state machine JSON. The pattern shown above with templatefile keeps this clean.
Step Functions does not eliminate complexity. What it does is make complexity visible and manageable. Your business logic lives in Lambda. Your orchestration logic lives in the state machine. When something fails, you have a complete, queryable record of exactly what happened and where.
The teams that get the most value from Step Functions are the ones that resist the temptation to build orchestration logic into their Lambda functions. Keep each function focused on a single responsibility. Let the state machine handle sequencing, retries, error routing, and parallelism. The result is a system where debugging takes minutes instead of hours and where new team members can understand the full workflow by reading a single JSON file.
Enjoy the cloud.
Osama