26 March 2017
A few months ago, we started with the aim of building a Slack integration for Mingle. Technically, a Slack app requires you to expose HTTPS endpoints that Slack can call for slash commands, incoming webhooks and message buttons.
Given the need for additional endpoints and for persisting state for these Slack interactions, we did not want to add this to Mingle's existing monolithic Rails application. We knew this belonged in a separate service that could interact with Mingle through its API.
A serverless application is one whose backend is composed of stateless compute units that run on a per-request basis. The automated provisioning and maintenance of these compute units is generally handled by cloud provider services such as AWS Lambda, Azure Functions and others.
Serverless applications are designed to scale easily. Traffic bursts and increased usage are handled by API Gateway, and the number of concurrent Lambdas invoked grows proportionally. You do not need to dedicate additional resources upfront; instead you get on-demand scaling.
Serverless infrastructure is billed based on usage: API Gateway and Lambda functions have metered usage, and you pay only for the requests you handle. This is an advantage over server machines running 24/7 that may not be optimally utilised. API Gateway, for example, is currently free for the first million requests in the first 12 months and then charged at $3.50 per million API calls plus data transfer costs, which is very reasonable for our usage.
You end up writing less boilerplate code and can focus on your core functionality. For example, on AWS, APIs can be updated from Swagger config files and Lambda functions can be deployed by uploading them to an S3 bucket. AWS ensures new containers are initialized with the new code and APIs are updated within seconds.
You only need to write code for your app's core functionality. Request handling, response generation, parameter parsing, rate limiting and scaling are all taken care of by the service provider. This reduces startup time and lets you focus on what matters most to your users.
We designed the system so that there is no inbound access other than the API Gateway endpoints. API Gateway also gives you traffic management out of the box, limiting steady-state request rates to 1000 requests per second and allowing bursts of up to 2000 rps across all APIs.
The application architecture is as shown above. Requests from Slack are received at the configured API Gateway endpoints. API Gateway is an AWS service that lets you easily build scalable web APIs. Each API Gateway endpoint invokes a Lambda function, which runs in its own container. When a request is received, AWS either re-uses an existing container or spins up a new one if none is available. Scaling is done by spawning multiple containers based on the number of concurrent requests. The Lambda functions talk to a Postgres DB where the Slack integration state is stored, and fetch Mingle data using Mingle's API.
Slack makes HTTPS requests to the API Gateway endpoints with an x-www-form-urlencoded Content-Type. This is translated at the API endpoint into JSON using a body mapping template.
AWS Lambda translates this JSON into a Java object that is passed into your Lambda function as a parameter. The object returned from the Lambda function is converted into a JSON response, which API Gateway maps into an HTTP response with the appropriate status code. This is returned to Slack in response to command invocations.
In some cases the Mingle app makes direct requests to the Slack app to check integration status for different Mingle customers and their specific projects or users. These requests go to specific API Gateway endpoints, and a response is generated by a Lambda function in a similar manner.
When Mingle needs to notify a user of an event or change, it uses a different request path. Jobs on Mingle push messages to AWS SNS (Simple Notification Service), which invokes a Lambda function. The function looks up integration details in the Slack app DB and makes the necessary Slack API calls to send a notification message to the user.
You will notice that there are no dedicated servers running for the Slack application. AWS services are used to generate responses as and when requests come into the app.
We wrote custom build scripts in Ruby using Rake and the aws-sdk gem to automate the deployment. We have a similar setup for deploying Mingle. The aws-sdk gem is generally great to work with, except for a few rough edges. In our continuous deployment pipeline it takes about 8 minutes for a new commit to reach production, including the deployment time for the dev and staging environments.
AWS Lambda supports multiple languages, including Python, Java and Node.js. We use Java to implement our functions. All the Lambda function code and its dependencies are packaged into a single jar using Gradle. Each function is configured by specifying the class and the method to invoke for a request. AWS has extensive APIs to update Lambda functions; a function can run from a jar stored in S3 or from one uploaded directly. Our build scripts invoke the AWS APIs to drop the packaged Lambda jar into a dedicated S3 bucket.
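The fat-jar packaging can be sketched in Gradle roughly as below; the task name is ours, and the exact configuration name depends on your Gradle version (older versions use `configurations.runtime` instead of `runtimeClasspath`):

```groovy
// build.gradle — minimal sketch of packaging all classes and
// runtime dependencies into a single jar for Lambda
task lambdaJar(type: Jar) {
    // compiled classes and resources
    from sourceSets.main.output
    // unpack every runtime dependency into the same jar
    from {
        configurations.runtimeClasspath.collect { it.isDirectory() ? it : zipTree(it) }
    }
}
```

The resulting jar is what our build scripts push to the S3 bucket for Lambda to run from.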
The API Gateway APIs can be specified as Swagger specification files with AWS extensions. We define the configuration in YAML files that our deployment scripts translate into Swagger definitions. Custom attributes specify the integration points between the REST API and the Lambda functions. You can specify custom mappings from the HTTP request body or query parameters into the JSON data passed to the Lambda function, and add similar mapping templates from Lambda responses back to the API. The integration uses Velocity-style templates for mapping the request payload.
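A single endpoint in the generated Swagger definition looks roughly like this; the region, account ID and function name below are illustrative placeholders, and the `x-amazon-apigateway-integration` attribute is the AWS extension that wires the endpoint to a Lambda function:

```yaml
# sketch of one generated endpoint definition
paths:
  /slack/command:
    post:
      consumes:
        - application/x-www-form-urlencoded
      produces:
        - application/json
      x-amazon-apigateway-integration:
        type: aws
        httpMethod: POST
        uri: arn:aws:apigateway:us-east-1:lambda:path/2015-03-31/functions/arn:aws:lambda:us-east-1:123456789012:function:slashCommand/invocations
        requestTemplates:
          application/x-www-form-urlencoded: |
            ## Velocity body-mapping template goes here
        responses:
          default:
            statusCode: "200"
```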
API Gateway endpoints must be HTTPS, and you can set up a custom domain to point to a deployed API. A CloudFront distribution is created for each deployed API, and Server Name Indication (SNI) is used to set up the custom domain name. We uploaded our SSL certificate for the domain via the AWS console; since the domain name does not change across environments, this setup was done manually. Each deployment updates the base path mapping on the custom domain to point to the newly deployed API version.
When we started with AWS Lambda, there was no support for configuring environment variables or Java system properties on a Lambda function. This made it difficult to manage different environments and their configs (like DB credentials, keys, etc.). The workaround was to write the configuration to a file in AWS S3 and read that file each time a function executed. This introduced multiple challenges: because each Lambda execution had to fetch the config file from S3, it added latency to every request, drastically slowing down response times.
Another issue we faced was that the Lambda functions were in a private subnet in an AWS VPC, so we needed to set up a VPC endpoint for access to S3.
After we had spent considerable effort setting this up, AWS announced environment variable support for Lambdas. We then switched to environment variables, which made the setup much simpler.
API Gateway integration points support request mapping templates to map HTTP query parameters into the JSON passed to Lambdas. However, there is no out-of-the-box support for the x-www-form-urlencoded Content-Type, in which the request params are URL-encoded in the request body.
There are two ways around this: either pass the entire body into the Lambda function and parse the URL-encoded params in your function code, or define the mapping at the request-mapping layer.
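The first option, parsing inside the function, would look something like this in Java (class and method names are ours, for illustration; requires Java 10+ for the `Charset` overload of `URLDecoder.decode`):

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Sketch: the Lambda receives the raw x-www-form-urlencoded body
// as a string and parses it into key/value pairs itself.
public class FormBodyParser {
    public static Map<String, String> parse(String body) {
        Map<String, String> params = new HashMap<>();
        for (String pair : body.split("&")) {
            // split on the first '=' only, so values may contain '='
            String[] kv = pair.split("=", 2);
            String key = URLDecoder.decode(kv[0], StandardCharsets.UTF_8);
            String value = kv.length > 1
                    ? URLDecoder.decode(kv[1], StandardCharsets.UTF_8)
                    : "";
            params.put(key, value);
        }
        return params;
    }
}
```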
We ended up doing the latter, writing a custom mapping that parses the request body and converts it into a JSON structure at the integration level using Velocity-style templates.
This Stack Overflow post helped us with a solution that we further enhanced.
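The template is along these lines; this is a sketch of the widely shared Stack Overflow approach rather than our exact template, using API Gateway's built-in `$input` and `$util` variables:

```
## Body-mapping template: split the url-encoded body into
## key/value pairs and emit a JSON object.
{
#foreach( $token in $input.path('$').split('&') )
    #set( $keyVal = $token.split('=') )
    #if( $keyVal.size() >= 1 )
        #set( $key = $util.urlDecode($keyVal[0]) )
        #if( $keyVal.size() >= 2 )
            #set( $val = $util.urlDecode($keyVal[1]) )
        #else
            #set( $val = '' )
        #end
        "$key": "$util.escapeJavaScript($val)"#if( $foreach.hasNext ),#end
    #end
#end
}
```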
We also had to ensure that snake-cased request parameters were converted to camel case so that they were correctly serialised into the request object's properties; serialisation failed when properties were in snake case. Slack passes params in snake case, so this conversion was required for every request received.
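The key conversion itself is simple; shown here in Java for illustration (the equivalent rewriting can equally live in the mapping template, and the helper name is ours):

```java
// Sketch: convert snake_case parameter names to camelCase so they
// deserialise onto Java bean properties, e.g. "team_id" -> "teamId".
public class KeyCase {
    public static String toCamelCase(String snake) {
        StringBuilder out = new StringBuilder();
        boolean upperNext = false;
        for (char c : snake.toCharArray()) {
            if (c == '_') {
                // drop the underscore, capitalise the next character
                upperNext = true;
            } else {
                out.append(upperNext ? Character.toUpperCase(c) : c);
                upperNext = false;
            }
        }
        return out.toString();
    }
}
```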
AWS provides no guarantees on how containers are reused across Lambda invocations. Depending on the request count and the number of concurrent requests, one or more containers are initialized to serve requests. When a new container is initialized, a JVM has to be started on it, which comes with startup overhead: the first request can take 2500-3000 ms, while subsequent ones take around 200 ms. The impact of such cold starts on response times is felt especially when there are not many incoming requests.
We did a couple of things to improve this. First, we increased the memory allocated to the Lambdas, which helps reduce initialisation time. We also set up CloudWatch triggers to invoke the Lambda functions periodically so that containers stay warm and can be reused during periods of low traffic.
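The keep-warm pattern amounts to having the scheduled CloudWatch event carry a marker field that the handler recognises and short-circuits on. A minimal sketch, with the field name "warmup" and the handler shape both illustrative:

```java
import java.util.Map;

// Sketch: the handler returns early on scheduled warm-up pings,
// keeping the container (and its JVM) alive without doing real work.
public class WarmupAware {
    public static String handle(Map<String, Object> input) {
        if (Boolean.TRUE.equals(input.get("warmup"))) {
            return "warmed"; // ping from the CloudWatch schedule
        }
        return doRealWork(input);
    }

    private static String doRealWork(Map<String, Object> input) {
        // ... actual Slack command handling would go here ...
        return "handled";
    }
}
```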
When we first deployed the Lambda functions, we faced an issue where we were unable to make API calls to Slack from a function: there was no internet access available. After some debugging, we figured out that this was because the Lambda functions were inside a private subnet in the VPC. The solution was to add a NAT (Network Address Translation) Gateway and route the private subnet's outbound traffic through it. Requests from the Lambda functions go to the NAT Gateway, which rewrites the source IP address to its Elastic IP and forwards them through the internet gateway in the VPC.
Architecting, deploying, optimizing and maintaining this serverless system that runs our integration between Mingle and Slack was a great learning experience for the team.