Appgree is a real-time opinion aggregation application that provides an easy mechanism for large groups of people to communicate ideas as if they were a single entity. Appgree was born with the ambition of becoming a new form of communication with which any group of users, regardless of size, can speak with one voice that faithfully represents the group as a whole rather than its individual members.
Given the nature of the project, several circumstances ruled out the use of a classical datacenter because of its deployment and operating costs:
- High service scalability and availability required from day one.
- A very high potential number of users in the first months after going into production.
- An application subject to peak loads.
- Very limited development time for the first product releases (less than six months to the production release, for commercial reasons).
With these constraints in mind, it was decided to use a cloud computing solution for the service, specifically Amazon Web Services (AWS), mainly because of its market-leading position and the large number of managed services that help reduce development time.
Configuration Management: Puppet
In order to provide a mechanism for centralizing and documenting the infrastructure configuration, the entire system configuration was implemented in the form of Puppet manifests and modules. Thanks to these tools, the whole platform configuration is centralized and modeled as versionable code, allowing a simplified deployment of configuration changes without the need to access each node independently. In addition, to facilitate the parallel execution of tasks and the querying of local information from a group of instances, Mcollective was implemented as a task orchestration mechanism.
In order to make this scenario possible with minimal interaction from system administrators, a custom bootstrap was implemented so that all instances register with these management services automatically and are assigned a number of properties depending on their role in the infrastructure, so that each node or group of nodes is easily reachable through simple Mcollective filters.
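The bootstrap described above can be sketched roughly as follows. This is an illustrative sketch, not Appgree's actual code: it maps EC2 instance tags to Facter external facts (Facter picks up `*.json` files under `facts.d`), which Puppet and Mcollective can then use for role-based filtering. The tag and fact names are hypothetical.

```python
# Sketch (assumed, not Appgree's actual bootstrap): derive Facter
# external facts from EC2 instance tags so that Puppet manifests and
# Mcollective filters can select nodes by role. Tag names are hypothetical.
import json


def build_external_facts(instance_tags):
    """Map EC2 tags to Facter external facts used for node filtering."""
    return {
        "app_role": instance_tags.get("Role", "unassigned"),
        "app_env": instance_tags.get("Environment", "production"),
        "app_az": instance_tags.get("AvailabilityZone", "unknown"),
    }


def write_facts(facts, path="/etc/facter/facts.d/appgree.json"):
    # Facter treats *.json files in facts.d as external facts.
    with open(path, "w") as fh:
        json.dump(facts, fh)


# Example: a node tagged as a frontend in eu-west-1a. A filter such as
#   mco find -F app_role=frontend
# would then match it.
facts = build_external_facts({"Role": "frontend",
                              "AvailabilityZone": "eu-west-1a"})
```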
Since the application can potentially be used by massive groups of people from anywhere in the world, it is critical to design a fault-tolerant system prepared to keep running 24x7.
Subnet design and Availability Zones.
Managing subnets in an AWS cloud computing environment the way it is understood in an on-premise infrastructure loses its meaning completely because of network virtualization. In this scenario, subnet management can be reduced to a merely functional and "geographical" distribution. The following table shows the combination matrix chosen for the subnet configuration according to three criteria:
- AWS availability zone (3 in this case).
- Public/private network visibility. Only instances or services housed in public networks can have a public IP.
- Scope of use. In practice this can be simplified to subnets intended to house either EC2 instances or AWS managed services with VPC support (ELB, RDS, ElastiCache, EMR...).
This classification makes it easy to manage the routing tables per availability zone (default routes through independent NAT instances) and to know which subnets will automatically assign a public IP when an instance is started in them (EC2 public).
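The three-criteria matrix above can be expressed as a small planning sketch. The VPC CIDR, the /20 subnet size and the naming scheme below are hypothetical choices for illustration; the point is simply that 3 AZs x 2 visibilities x 2 scopes yields 12 subnets, each carved from the VPC block.

```python
# Sketch of the subnet combination matrix: one subnet per
# (availability zone, visibility, scope) tuple, carved from an assumed
# 10.0.0.0/16 VPC into /20 blocks. Names and sizes are illustrative.
import ipaddress
from itertools import product


def plan_subnets(vpc_cidr, azs, visibilities=("public", "private"),
                 scopes=("ec2", "managed")):
    """Assign one CIDR block per (AZ, visibility, scope) combination."""
    blocks = list(ipaddress.ip_network(vpc_cidr).subnets(new_prefix=20))
    plan = {}
    for i, (az, vis, scope) in enumerate(product(azs, visibilities, scopes)):
        plan[f"{vis}-{scope}-{az}"] = str(blocks[i])
    return plan


plan = plan_subnets("10.0.0.0/16",
                    ["eu-west-1a", "eu-west-1b", "eu-west-1c"])
# 3 AZs x 2 visibilities x 2 scopes = 12 subnets
```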
To restrict network traffic, it is preferable to implement security groups, a much more flexible mechanism similar to classic ACLs, with the added value of allowing restrictions based on group membership and not only on network segments and CIDR blocks.
In order to minimize the attack surface exposed to the Internet, the number of EC2 instances in public networks has been kept to a minimum and the use of public ELBs, the AWS managed load balancing service, maximized. Only the NAT instances of each availability zone and the administrative VPN are deployed in these subnets.
Data storage: RDS, S3 and SQS.
As with any other server infrastructure, data storage systems are the main bottleneck and source of structural problems, due to their heavy disk access needs and the problems derived from replication, scaling and failover tasks. To solve this part of the infrastructure and maximize service availability, several AWS managed services were adopted:
- RDS: With this service, along with provisioned-IOPS SSD volumes and a Multi-AZ configuration, all the project needs were covered:
-Service configuration within minutes.
-Automatic failover with write downtime limited to 2 minutes (easily absorbable at runtime).
-Automated vertical scaling management.
-Horizontal read scaling via read replicas (raising read availability to virtually 100%), predictable I/O throughput, automated backups, immediate monitoring with CloudWatch, etc.
- SQS: This AWS managed queueing service is used for user event ingestion, task synchronization among multiple batch execution servers and notification sending. SQS provides a highly scalable and fault-tolerant mechanism for mass storage of temporary data pending processing.
- S3: Once the data coming from the queues has been processed, the results are stored in S3 to allow massive access to them without the need for a storage infrastructure with such capabilities. Additionally, all the application static content is stored in S3 for distribution with CloudFront.
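The SQS-based ingestion pattern above can be sketched as a worker loop. This is a minimal sketch assuming boto3-style client semantics (`receive_message` with long polling, `delete_message` on success); the queue URL and handler are placeholders, not Appgree's actual code.

```python
# Minimal sketch of an SQS event-ingestion worker. The client is
# injected so the logic can be exercised without AWS credentials;
# with boto3 it would be sqs = boto3.client("sqs").
def drain_queue(sqs, queue_url, handle):
    """Long-poll the queue, process each message, delete on success."""
    processed = 0
    while True:
        resp = sqs.receive_message(QueueUrl=queue_url,
                                   MaxNumberOfMessages=10,
                                   WaitTimeSeconds=20)
        messages = resp.get("Messages", [])
        if not messages:
            break  # queue drained; a real daemon would keep looping
        for msg in messages:
            handle(msg["Body"])  # process the event payload
            sqs.delete_message(QueueUrl=queue_url,
                               ReceiptHandle=msg["ReceiptHandle"])
            processed += 1
    return processed
```

Deleting only after a successful `handle` call is what makes the queue fault-tolerant: if a worker dies mid-processing, the message reappears after its visibility timeout and another batch server picks it up.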
AutoScaling groups as Minimal Application Unit
"Everything can fail" is one of the basic principles Appgree was born with when approaching scalability problems, and that is why ELB and Autoscaling are considered the best services to successfully meet strong service availability requirements. These services not only serve as simplified mechanisms for implementing resilient and scalable clusters, but also provide a very effective self-healing tool for failed infrastructure nodes. All Appgree application layers are designed so that an Autoscaling group can be associated with an ELB health check, which can then automatically trigger the replacement of those instances that do not adequately pass the specified thresholds.
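The instance-replacement mechanism hinges on the HTTP status the health-check endpoint returns. The following is an illustrative sketch of that logic, with hypothetical check names; the source does not describe Appgree's actual checks.

```python
# Sketch of the logic behind an ELB health-check endpoint. `checks`
# maps a check name to a zero-argument callable returning True when
# that subsystem is healthy; names here are hypothetical. Any failure
# (or exception) yields a 503, and after enough failed probes the ELB
# marks the instance unhealthy, so Autoscaling replaces it.
def healthcheck_status(checks):
    """Return (HTTP status, body) for an ELB health-check request."""
    failed = []
    for name, check in checks.items():
        try:
            if not check():
                failed.append(name)
        except Exception:
            failed.append(name)
    if failed:
        return 503, "FAIL: " + ",".join(sorted(failed))
    return 200, "OK"
```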
Simplified application architecture
If we combine this philosophy with the multiple availability zones and managed storage services described above, we can reach an availability of up to 99.99%.
Internet egress SPOF: Elastic HTTP Proxy
One of the main problems when implementing an AWS VPC is the lack of a default fault tolerance mechanism for subnets: instances that need to reach services hosted outside the VPC must go through a single NAT instance.
Given the high number of services outside the VPC on which the proper functioning of Appgree depends (S3, SES, SQS...), fault-tolerant Internet access is critical. For this, an HTTP/S proxy mechanism based on ELB and Autoscaling, with EC2 instances in public subnets, has been implemented:
Elastic HTTP/S Proxy
By using this proxy, the SPOF caused by the NAT instance has been successfully removed for all AWS APIs (all based on HTTPS). Unfortunately, for certain protocols that cannot be proxied we still depend on the NAT instances of each availability zone, most notably for node monitoring.
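From a client instance in a private subnet, using the elastic proxy amounts to routing HTTP/S traffic through the internal ELB that fronts the proxy fleet. The sketch below shows this with the standard library; the ELB DNS name and port are hypothetical.

```python
# Sketch: an instance in a private subnet reaches AWS HTTPS APIs by
# pointing its HTTP/S proxy at the internal ELB in front of the proxy
# fleet. The ELB hostname and port 3128 below are assumptions.
import urllib.request

PROXY_ELB = "http://internal-proxy-elb.eu-west-1.elb.amazonaws.com:3128"


def proxied_opener(proxy_url=PROXY_ELB):
    """Build a urllib opener that routes HTTP and HTTPS via the proxy ELB."""
    handler = urllib.request.ProxyHandler({"http": proxy_url,
                                           "https": proxy_url})
    return urllib.request.build_opener(handler)


# opener = proxied_opener()
# opener.open("https://sqs.eu-west-1.amazonaws.com/")  # goes via the fleet
```

Because the proxy instances sit behind an ELB in an Autoscaling group, a failed proxy node is replaced automatically and clients keep a single stable endpoint.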
Thanks to the fault-tolerant design of the different application layers (all aimed at using AWS Autoscaling and managed services with inherent high availability and scalability capabilities), Appgree is a fully elastic application, allowing the infrastructure to grow and shrink depending on the load with a "mouse click" from the AWS console.
As an example, consider the statistics obtained during the application's early production dates, when business agreements with various media resulted in a massive adoption of Appgree, with over 350,000 user registrations in two days. Appgree needed to scale its server fleet from about 30 to more than 300 nodes in a few hours and then return to its initial state. In fact, this process was repeated periodically for several weeks because of its use in one of the most popular Spanish TV shows, always without any service loss or degradation. This also allowed Appgree to reduce the investment cost to a fraction of what a traditional datacenter infrastructure would have required.
Development environments and load tests
Thanks to modeling the infrastructure as Puppet code and the extensive use of AMIs (snapshots) for the auto-scaling implementation, the different Appgree development teams have all the tools needed for the sourcing and rapid deployment of test environments, equivalent to those available in production.
In these scenarios, restricting environments to availability hours, using burstable-CPU instance types (T2) and contracting "spot" instances allow a further reduction of the operational costs of maintaining these environments.
For latency and large-scale load testing, the same Autoscaling tools were used, this time with clusters deployed in other AWS regions, which simulate mass user access and ensure that the infrastructure will withstand the load peaks expected in the first days of the application's life. During these tests, more than 200 instances were used to simulate users and about 300 for the infrastructure under test. Again, all for a fraction of the cost that the equivalent computing power would have required in a traditional datacenter, thanks to AWS "by the hour" billing.
Monitoring and Alerts
Due to the volume and variety of the different services that make up the Appgree infrastructure, it was decided to use several different services for gathering performance metrics and logs:
- Amazon CloudWatch: All AWS services integrate immediately with this service, allowing simple gathering of metrics from the services hosted on AWS (EC2, ELB, RDS...).
- Amazon ELB+S3: All public ELBs, the entry point for application users, export their access logs to S3 for subsequent analysis.
- Elastic ELK: To provide a detailed view of the events occurring in any part of the platform, a log gathering service based on Elastic Inc. tools was implemented for the analysis of syslog, public ELB access logs and multiple log files from different applications, both internally developed and third-party.
- Ganglia: Although CloudWatch provides all the tools necessary for monitoring infrastructure performance, the deployment of a Ganglia collector and of agents on all nodes allows aggregating metrics by business concept, offers richer historical metrics and provides the ability to report metrics with a granularity of seconds.
- Nagios: To run custom health checks verifying the status of the various services hosted on the EC2 instances, it was decided to use Nagios, the standard tool for such monitoring services.
- Nagios + Ganglia: To execute performance checks with customized metrics without knowing the IPs of the instances that make up the infrastructure at any given moment, and to reduce the number of checks performed on each node, an integration of Nagios with Ganglia was chosen. Ganglia acts as a centralized repository where the entire infrastructure reports its status. In this way, with just one API call, Nagios can alert IT staff if any instance of any cluster has a problem, without having to maintain individual checks for each instance.
- Amazon SNS: Due to its immediate integration with other Amazon services, the simplicity of its API for integration with third-party applications and its subscription-based notification model, SNS is used to send notifications coming from CloudWatch, Nagios health checks and multiple internal business processes.
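The Nagios-over-Ganglia idea above can be sketched with a single query against the gmetad XML feed: one parse tells Nagios whether any host in a cluster breaches a threshold, with no per-instance check definitions. The metric, cluster and host names below are hypothetical sample data, not Appgree's.

```python
# Hedged sketch of a Nagios check driven by Ganglia's XML output
# (gmetad exposes CLUSTER/HOST/METRIC elements with NAME/VAL
# attributes). Names in the sample are illustrative.
import xml.etree.ElementTree as ET


def hosts_over_threshold(ganglia_xml, metric, threshold):
    """Return the hosts whose `metric` value exceeds `threshold`."""
    root = ET.fromstring(ganglia_xml)
    offenders = []
    for host in root.iter("HOST"):
        for m in host.iter("METRIC"):
            if m.get("NAME") == metric and float(m.get("VAL")) > threshold:
                offenders.append(host.get("NAME"))
    return offenders


SAMPLE = """<GANGLIA_XML><CLUSTER NAME="frontend">
  <HOST NAME="web-1"><METRIC NAME="load_one" VAL="0.4"/></HOST>
  <HOST NAME="web-2"><METRIC NAME="load_one" VAL="7.9"/></HOST>
</CLUSTER></GANGLIA_XML>"""
# hosts_over_threshold(SAMPLE, "load_one", 4.0) -> ["web-2"]
```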
The flexibility of contracting computing and storage resources in a cloud computing platform such as AWS, especially on a large scale, makes it a complex task to keep an inventory of every item for which AWS charges a cost.
In order to keep track of all resources, to make successful use of reserved instances and to have an in-depth view of the infrastructure costs, a series of open source tools published by Netflix have been implemented:
- ICE: This tool analyzes the detailed billing that Amazon provides to its customers and offers a web interface to quickly see which products the money is being spent on, with one-hour granularity. Additionally, every Appgree AWS resource carries a number of custom tags that allow ICE to show breakdowns by business concept and not only by AWS product.
- Janitor Monkey: As part of the Simian Army "suite" of tools, Janitor Monkey explores Appgree's AWS account for unused resources, notifying the administrators of their existence and even deleting them automatically.
- Conformity Monkey: Another Netflix "monkey", this one analyzes all Autoscaling groups and load balancers and ensures they meet a series of conformity conditions:
- the right distribution of instances between different availability zones.
- consistency between ELB and Autoscaling subnets.
- consistency on multiple configurations.
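Two of the conformity conditions above can be illustrated with a small check. This is a sketch under assumed data shapes, not the Conformity Monkey API: it compares the subnets of an Autoscaling group against its ELB's, and flags uneven instance distribution across availability zones.

```python
# Illustrative conformity check: ELB/ASG subnet consistency and AZ
# balance. The input shapes (lists of subnet IDs, a dict of per-AZ
# instance counts) are assumptions for the sketch.
def conformity_issues(asg_subnets, elb_subnets, instances_per_az):
    """Return a list of human-readable conformity violations."""
    issues = []
    if set(asg_subnets) != set(elb_subnets):
        issues.append("ELB/ASG subnet mismatch")
    counts = instances_per_az.values()
    # Allow a difference of one instance while scaling events settle.
    if counts and max(counts) - min(counts) > 1:
        issues.append("unbalanced availability zones")
    return issues
```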
One of the concerns when Appgree moved its servers to the AWS cloud was to minimize the risk of security breaches in the access to the various infrastructure services, both in its public part (E-Commerce) and in the internal applications. To do this, use has been made of a wide range of options provided by the different AWS network and security services:
- VPC: All EC2 instances live inside a VPC, an AWS service that enables provisioning a section of EC2 completely isolated from the Internet.
- Subnets: Inside the VPC, different subnets were provisioned, each one intended to house certain types of services according to their nature: public or private access, EC2 instances managed by Appgree, AWS managed services, availability zone, etc.
- Security Groups: Instead of implementing complex firewall mechanisms and network ACLs, Security Groups were used, an EC2 network management feature that allows configuring network restrictions based on the group membership of each instance and/or AWS managed service.
- Administrative NAT/VPN: As mentioned previously, a VPC is a network section completely isolated from EC2. To provide access to the Internet and to AWS services hosted outside the internal network (S3, SNS...), a NAT instance was configured for each availability zone. As mentioned before, this SPOF was solved with an elastic HTTP/S proxy. Additionally, these instances provide a VPN for VPC management access from remote locations.
- ELB: To minimize the attack surface exposed to the Internet, the number of EC2 instances with public access was reduced to the minimum possible. So much so that the proxy, NAT and VPN management instances (one per availability zone) are the only ones with public IPs across the infrastructure. All incoming traffic from application usage goes through the AWS load balancers, minimizing the attack surface and centralizing the management of SSL certificates and encryption algorithms.
Backup plans and disaster recovery.
To simplify the backup and recovery process, extensive use has been made of the services offered by Amazon Web Services:
- EC2 and EBS:
  - An AMI for every instance, ready for service recovery in case of unrecoverable errors in production instances.
  - Regular snapshots of physical data volumes.
- RDS and ElastiCache:
  - Integrated backups based on snapshots of the storage volumes.
  - Automatic instance recovery in case of serious errors that prevent restoring the service.
  - Data replication integrated in the service.
- S3 and Glacier:
  - Backup file storage via scheduled tasks from EC2 instances dedicated to data storage.
  - Lifecycle implementation for automatically archiving backups to low-cost storage services and deleting old items.
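The lifecycle policy described for S3 and Glacier can be sketched as follows. The prefix, the 30-day transition and the 365-day expiration are hypothetical values; the dictionary structure matches the S3 lifecycle configuration API, and with boto3 it would be applied via `put_bucket_lifecycle_configuration`.

```python
# Sketch of an S3 lifecycle configuration implementing the policy
# above: move backups to Glacier after N days and delete them later.
# Prefix and day counts are assumptions for illustration.
def backup_lifecycle(prefix="backups/", to_glacier_days=30,
                     expire_days=365):
    return {
        "Rules": [{
            "ID": "backup-archival",
            "Filter": {"Prefix": prefix},
            "Status": "Enabled",
            # Archive to the low-cost storage class...
            "Transitions": [{"Days": to_glacier_days,
                             "StorageClass": "GLACIER"}],
            # ...and delete old items automatically.
            "Expiration": {"Days": expire_days},
        }]
    }


# With boto3 this would be applied as (bucket name hypothetical):
# s3.put_bucket_lifecycle_configuration(
#     Bucket="appgree-backups",
#     LifecycleConfiguration=backup_lifecycle())
```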