## Tower Clustering/HA Overview

Prior to 3.1 the Ansible Tower HA solution was not a true high-availability system. In 3.1 we have rewritten this system entirely with a new focus towards
a proper highly available clustered system. In 3.2 we have extended this further to allow grouping of clustered instances into different pools/queues.

* Each instance should be able to act as an entrypoint for UI and API Access.
  This should enable Tower administrators to use load balancers in front of as many instances as they wish
  and maintain good data visibility.
* Each instance should be able to join the Tower cluster and expand its ability to execute jobs.
* Provisioning new instances should be as simple as updating the `inventory` file and re-running the setup playbook.
* Instances can be deprovisioned with a simple management command.
* Instances can be grouped into one or more Instance Groups to share resources for topical purposes.
* These instance groups should be assignable to certain resources:
  * Organizations
  * Inventories
  * Job Templates

  such that execution of jobs under those resources will favor particular queues.

It's important to point out a few existing things:

* PostgreSQL is still a standalone instance and is not clustered. We also won't manage replica configuration or,
  if the user configures standby replicas, database failover.
* All instances should be reachable from all other instances and they should be able to reach the database. It's also important
  for the hosts to have a stable address and/or hostname (depending on how you configure the Tower host).
* RabbitMQ is the cornerstone of Tower's Clustering system. A lot of our configuration requirements and behavior is dictated
  by its needs. Thus we are pretty inflexible to customization beyond what our setup playbook allows. Each Tower instance has a
  deployment of RabbitMQ that will cluster with the other instances' RabbitMQ instances.
* Existing old-style HA deployments will be transitioned automatically to the new HA system during the upgrade process to 3.1.
* Manual projects will need to be synced to all instances by the customer.

## Important Changes

* There is no concept of primary/secondary in the new Tower system. *All* systems are primary.
* Setup playbook changes to configure RabbitMQ and give hints to the type of network the hosts are on.
* The `inventory` file for Tower deployments should be saved/persisted. If new instances are to be provisioned
  the passwords and configuration options as well as host names will need to be available to the installer.

## Concepts and Configuration
### Installation and the Inventory File

The current standalone instance configuration doesn't change for a 3.1+ deploy. The inventory file does change in some important ways:

* Since there is no primary/secondary configuration those inventory groups go away and are replaced with a
  single inventory group `tower`. The customer may, *optionally*, define other groups and group instances in those groups. These groups
  should be prefixed with `rampart_`. Instances are not required to be in the `tower` group alongside other `rampart_` groups, but one
  instance *must* be present in the `tower` group. Technically `tower` is a group like any other `rampart_` group but it must always be present,
  and if a specific group is not associated with a specific resource then job execution will always fall back to the `tower` group:
```
[tower]
hostA
hostB
hostC

[rampart_east]
hostB
hostC

[rampart_west]
hostC
hostD
```

The `database` group remains for specifying an external postgres. If the database host is provisioned separately this group should be empty.
```
[tower]
hostA
hostB
hostC

[database]
hostDB
```

* It's common for customers to provision Tower instances externally but prefer to reference them by internal addressing. This is most significant
  for RabbitMQ clustering where the service isn't available at all on an external interface. For this purpose it is necessary to assign the internal
  address for RabbitMQ links as such:
```
[tower]
hostA rabbitmq_host=10.1.0.2
hostB rabbitmq_host=10.1.0.3
hostC rabbitmq_host=10.1.0.4
```

* The `redis_password` field is removed from `[all:vars]`
* There are various new fields for RabbitMQ:
  - `rabbitmq_port=5672` - RabbitMQ is installed on each instance and is not optional, it's also not possible to externalize. It is
    possible to configure what port it listens on and this setting controls that.
  - `rabbitmq_vhost=tower` - Tower configures a rabbitmq virtualhost to isolate itself. This controls that setting.
  - `rabbitmq_username=tower` and `rabbitmq_password=tower` - Each instance will be configured with these values and each instance's Tower
    instance will be configured with it also. This is similar to our other uses of usernames/passwords.
  - `rabbitmq_cookie=<somevalue>` - This value is unused in a standalone deployment but is critical for clustered deployments.
    This acts as the secret key that allows RabbitMQ cluster members to identify each other.
  - `rabbitmq_use_long_name` - RabbitMQ is pretty sensitive to what each instance is named. We are flexible enough to allow FQDNs
    (host01.example.com), short names (host01), or ip addresses (192.168.5.73). Depending on what is used to identify each host
    in the `inventory` file this value may need to be changed. For FQDNs and ip addresses this value needs to be `true`. For short
    names it should be `false`.
  - `rabbitmq_enable_manager` - Setting this to `true` will expose the RabbitMQ management web console on each instance.

The most important field to point out for variability is `rabbitmq_use_long_name`. That's something we can't detect or provide a reasonable
default for so it's important to point out when it needs to be changed. If instances are provisioned to where they reference other instances
internally and not on external addresses then `rabbitmq_use_long_name` semantics should follow the internal addressing (aka `rabbitmq_host`).

Other than `rabbitmq_use_long_name` the defaults are pretty reasonable:
```
rabbitmq_port=5672
rabbitmq_vhost=tower
rabbitmq_username=tower
rabbitmq_password=tower
rabbitmq_cookie=<somevalue>
rabbitmq_use_long_name=false
rabbitmq_enable_manager=false
```

### Provisioning and Deprovisioning Instances and Groups

* Provisioning
  Provisioning Instances after installation is supported by updating the `inventory` file and re-running the setup playbook. It's important that this file
  contain all passwords and information used when installing the cluster or other instances may be reconfigured (this could be intentional).
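
  For example, a minimal sketch of adding a new instance, assuming the bundled installer layout where `setup.sh` wraps the setup playbook and reads the adjacent `inventory` file (the hostname below is hypothetical):

```
# In `inventory`, append the new host to [tower] (and any rampart_ groups),
# keeping all existing hosts, passwords, and configuration values intact:
#
#   [tower]
#   hostA
#   hostB
#   hostC
#   hostD        <- new instance
#
# Then re-run the setup playbook from the installer directory:
$ ./setup.sh
```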
* Deprovisioning
  Tower does not automatically de-provision instances since we can't distinguish between an instance that was taken offline intentionally or due to failure.
  Instead the procedure for deprovisioning an instance is to shut it down (or stop the `ansible-tower-service`) and run the Tower deprovision command:
```
$ tower-manage deprovision-node <nodename>
```
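
  After the command completes, the cluster should no longer report the instance. One hedged way to confirm this, assuming `tower-manage` carries a `list_instances` command (as `awx-manage` does; availability may vary by version):

```
# List the instances the cluster currently knows about, grouped by queue;
# the deprovisioned instance should be absent.
$ tower-manage list_instances
```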

* Removing/Deprovisioning Instance Groups
  Tower does not automatically de-provision or remove instance groups, even though re-provisioning will often cause these to be unused. They may still
  show up in api endpoints and stats monitoring. These groups can be removed with the following command:

```
$ tower-manage unregister_queue --queuename=<name>
```

### Status and Monitoring

Tower itself reports as much status as it can via the api at `/api/v2/ping` in order to provide validation of the health
of the Cluster. This includes:

* The instance servicing the HTTP request
* The last heartbeat time of all other instances in the cluster
* The state of the Job Queue
* The RabbitMQ cluster status
* Instance Groups and Instance membership in those groups

A more detailed view of Instances and Instance Groups, including running jobs and membership
information can be seen at `/api/v2/instances/` and `/api/v2/instance_groups`.
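
As a quick sanity check the ping endpoint can be queried directly from any instance; the hostname and the response shape below are illustrative assumptions, not an authoritative schema:

```
# Query the ping endpoint (directly or via the load balancer):
$ curl -s https://tower.example.com/api/v2/ping/
# Hypothetical, abbreviated response:
# {
#   "ha": true,
#   "active_node": "hostA",
#   "instances": [
#     {"node": "hostA", "heartbeat": "...", "capacity": 100},
#     {"node": "hostB", "heartbeat": "...", "capacity": 100}
#   ],
#   "instance_groups": [
#     {"name": "tower", "capacity": 200, "instances": ["hostA", "hostB"]}
#   ]
# }
```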
|
||||
|
||||
Each Tower node is made up of several different services working collaboratively:
|
||||
### Instance Services and Failure Behavior
|
||||
|
||||
Each Tower instance is made up of several different services working collaboratively:
|
||||
|
||||
* HTTP Services - This includes the Tower application itself as well as external web services.
* Callback Receiver - Whose job it is to receive job events from running Ansible jobs.
* Celery - The worker queue that processes and runs all jobs.
* RabbitMQ - Message Broker, this is used as a signaling mechanism for Celery as well as any event data propagated to the application.
* Memcached - local caching service for the instance it lives on.

Tower is configured in such a way that if any of these services or their components fail then all services are restarted. If these fail sufficiently
often in a short span of time then the entire instance will be placed offline in an automated fashion in order to allow remediation without causing unexpected
behavior.

### Job Runtime Behavior

Ideally a regular user of Tower should not notice any semantic difference to the way jobs are run and reported. Behind the scenes it's worth
pointing out the differences in how the system behaves.

When a job is submitted from the API interface it gets pushed into the Celery queue on RabbitMQ. A single RabbitMQ instance is the responsible master for
individual queues but each Tower instance will connect to and receive jobs from that queue using a Fair scheduling algorithm. Any instance on the cluster is
just as likely to receive the work and execute the task. If an instance fails while executing jobs then the work is marked as permanently failed.
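
One hedged way to observe this from the outside (the job id and hostname are hypothetical, and the `execution_node` field is an assumption based on the v2 job serializer):

```
# Check which instance picked up job 42:
$ curl -s -u admin:password https://tower.example.com/api/v2/jobs/42/ \
    | python -c 'import json,sys; print(json.load(sys.stdin)["execution_node"])'
```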

If a cluster is divided into separate Instance Groups then the behavior is similar to the cluster as a whole. If two instances are assigned to a group then
either one is just as likely to receive a job as any other in the same group.

As Tower instances are brought online it effectively expands the work capacity of the Tower system. If those instances are also placed into Instance Groups then
they also expand that group's capacity. If an instance is performing work and it is a member of multiple groups then capacity will be reduced from all groups for
which it is a member; for example, a job running on an instance that belongs to both `tower` and a `rampart_` group consumes capacity from both.
De-provisioning an instance will remove capacity from the cluster wherever that instance was assigned.

It's important to note that not all instances are required to be provisioned with an equal capacity.

Project updates behave differently than they did before. Previously they were ordinary jobs that ran on a single instance. It's now important that
they run successfully on any instance that could potentially run a job. Projects will now sync themselves to the correct version on the instance immediately
prior to running the job.

If an Instance Group is configured but all instances in that group are offline or unavailable, any jobs that are launched targeting only that group will be stuck
in a waiting state until instances become available. Fallback or backup resources should be provisioned to handle any work that might encounter this scenario.

## Acceptance Criteria

When verifying acceptance we should ensure the following statements are true:

* Tower should install as a standalone Instance
* Tower should install in a Clustered fashion
* Instances should, optionally, be able to be grouped arbitrarily into different Instance Groups
* Capacity should be tracked at the group level and capacity impact should make sense relative to what instance a job is
  running on and what groups that instance is a member of.
* Provisioning should be supported via the setup playbook
* De-provisioning should be supported via a management command
* All jobs, inventory updates, and project updates should run successfully
* Jobs should run on the hosts at which they are targeted. If assigned implicitly or directly to groups then they should
  only run on instances in those Instance Groups.
* Project updates should manifest their data on the host that will run the job immediately prior to the job running
* Tower should be able to reasonably survive the removal of all instances in the cluster
* Tower should behave in a predictable fashion during network partitioning

## Testing Considerations

* Basic testing should be able to demonstrate parity with a standalone instance for all integration testing.
* Basic playbook testing to verify routing differences, including:
  - Basic FQDN
  - Short-name name resolution
  - ip addresses
  - /etc/hosts static routing information
* We should test behavior of large and small clusters. I would envision small clusters as 2 - 3 instances and large
  clusters as 10 - 15 instances.
* Failure testing should involve killing single instances and killing multiple instances while the cluster is performing work.
  Job failures during the time period should be predictable and not catastrophic.
* Instance downtime testing should also include recoverability testing. Killing single services and ensuring the system can
  return itself to a working state.
* Persistent failure should be tested by killing single services in such a way that the cluster instance cannot be recovered
  and ensuring that the instance is properly taken offline.
* Network partitioning failures will be important also. In order to test this:
  - Disallow a single instance from communicating with the other instances but allow it to communicate with the database
  - Break the link between instances such that it forms 2 or more groups where groupA and groupB can't communicate but all instances
    can communicate with the database.
* Crucially when network partitioning is resolved all instances should recover into a consistent state
* Upgrade Testing, verify behavior before and after are the same for the end user.
* Project Updates should be thoroughly tested for all scm types (git, svn, hg) and for manual projects.
* Setting up instance groups in two scenarios:
  a) instances are shared between groups
  b) instances are isolated to particular groups
  Organizations, Inventories, and Job Templates should be variously assigned to one or many groups and jobs should execute
  in those groups in preferential order as resources are available (see the sketch below).
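
  As one hedged example of driving that last scenario through the API (the hostname, credentials, and ids are hypothetical, and the `/api/v2/job_templates/<id>/instance_groups/` association endpoint is an assumption based on how other v2 related resources are associated):

```
# Associate instance group 2 (e.g. rampart_east) with job template 5 so that
# jobs launched from it prefer that group's instances:
$ curl -s -u admin:password -X POST \
    -H "Content-Type: application/json" \
    -d '{"id": 2}' \
    https://tower.example.com/api/v2/job_templates/5/instance_groups/
```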
## Performance Testing