Updating acceptance documentation and system docs

Matthew Jones 2017-05-10 16:15:49 -04:00
parent 5508bad97c
commit 704da9c7f2
2 changed files with 124 additions and 73 deletions


## Tower Clustering/HA Overview
Prior to 3.1 the Ansible Tower HA solution was not a true high-availability system. In 3.1 we have rewritten this system entirely with a new focus towards
a proper highly available clustered system. In 3.2 we have extended this further to allow grouping of clustered instances into different pools/queues.
* Each instance should be able to act as an entrypoint for UI and API Access.
This should enable Tower administrators to use load balancers in front of as many instances as they wish
and maintain good data visibility.
* Each instance should be able to join the Tower cluster and expand its ability to execute jobs.
* Provisioning new instances should be as simple as updating the `inventory` file and re-running the setup playbook
* Instances can be deprovisioned with a simple management command
* Instances can be grouped into one or more Instance Groups to share resources for topical purposes.
* These instance groups should be assignable to certain resources:
* Organizations
* Inventories
* Job Templates
such that execution of jobs under those resources will favor particular queues.
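As a rough sketch of what that assignment could look like through the API (the endpoint path, IDs, hostname, and credentials below are illustrative assumptions rather than something this document defines):
```
# Hypothetical example: associate instance group 2 with job template 5
$ curl -k -u admin:password -H "Content-Type: application/json" \
    -X POST -d '{"id": 2}' \
    https://tower.example.com/api/v2/job_templates/5/instance_groups/
```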
It's important to point out a few existing things:
* PostgreSQL is still a standalone instance and is not clustered. We also won't manage replica configuration or,
if the user configures standby replicas, database failover.
* All instances should be reachable from all other instances and they should be able to reach the database. It's also important
for the hosts to have a stable address and/or hostname (depending on how you configure the Tower host)
* RabbitMQ is the cornerstone of Tower's Clustering system. A lot of our configuration requirements and behavior are dictated
by its needs. Thus we are pretty inflexible to customization beyond what our setup playbook allows. Each Tower instance has a
deployment of RabbitMQ that will cluster with the other instances' RabbitMQ instances.
* Existing old-style HA deployments will be transitioned automatically to the new HA system during the upgrade process to 3.1.
* Manual projects will need to be synced to all instances by the customer
## Important Changes
* There is no concept of primary/secondary in the new Tower system. *All* systems are primary.
* The setup playbook has changed to configure RabbitMQ and to give hints about the type of network the hosts are on.
* The `inventory` file for Tower deployments should be saved/persisted. If new instances are to be provisioned
the passwords and configuration options as well as host names will need to be available to the installer.
## Concepts and Configuration
### Installation and the Inventory File
The current standalone instance configuration doesn't change for a 3.1+ deploy. The inventory file does change in some important ways:
* Since there is no primary/secondary configuration those inventory groups go away and are replaced with a
single inventory group `tower`. The customer may, *optionally*, define other groups and group instances in those groups. These groups
should be prefixed with `rampart_`. Instances are not required to be in the `tower` group alongside other `rampart_` groups, but one
instance *must* be present in the `tower` group. Technically `tower` is a group like any other `rampart_` group but it must always be present
and if a specific group is not associated with a specific resource then job execution will always fall back to the `tower` group:
```
[tower]
hostA
hostB
hostC
[rampart_east]
hostB
hostC
[rampart_west]
hostC
hostD
```
The `database` group remains for specifying an external postgres. If the database host is provisioned separately, this group should be empty.
```
[tower]
hostA
[database]
hostDB
```
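For contrast, a minimal sketch of the same inventory when the database is provisioned separately (hostnames are illustrative); the group is declared but left empty:
```
[tower]
hostA
hostB
hostC

[database]
```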
* It's common for customers to provision Tower instances externally but prefer to reference them by internal addressing. This is most significant
for RabbitMQ clustering where the service isn't available at all on an external interface. For this purpose it is necessary to assign the internal
address for RabbitMQ links as follows:
```
[tower]
hostA rabbitmq_host=10.1.0.2
hostB rabbitmq_host=10.1.0.3
hostC rabbitmq_host=10.1.0.4
```
* The `redis_password` field is removed from `[all:vars]`
* There are various new fields for RabbitMQ:
- `rabbitmq_port=5672` - RabbitMQ is installed on each instance and is not optional, it's also not possible to externalize. It is
possible to configure what port it listens on and this setting controls that.
- `rabbitmq_vhost=tower` - Tower configures a RabbitMQ virtualhost to isolate itself. This controls that setting.
- `rabbitmq_username=tower` and `rabbitmq_password=tower` - Each instance's RabbitMQ will be configured with these values and each instance's Tower
application will be configured with them as well. This is similar to our other uses of usernames/passwords.
- `rabbitmq_cookie=<somevalue>` - This value is unused in a standalone deployment but is critical for clustered deployments.
This acts as the secret key that allows RabbitMQ cluster members to identify each other.
- `rabbitmq_use_long_name` - RabbitMQ is pretty sensitive to what each instance is named. We are flexible enough to allow FQDNs
(host01.example.com), short names (host01), or ip addresses (192.168.5.73). Depending on what is used to identify each host
in the `inventory` file, this value may need to be changed. For FQDNs and ip addresses this value needs to be `true`; for short
names it should be `false`.
- `rabbitmq_enable_manager` - Setting this to `true` will expose the RabbitMQ management web console on each instance.
The most important field to point out for variability is `rabbitmq_use_long_name`. That's something we can't detect or provide a reasonable
default for, so it's important to point out when it needs to be changed. If instances are provisioned such that they reference other instances
internally rather than on external addresses, then the `rabbitmq_use_long_name` semantics should follow the internal addressing (i.e. `rabbitmq_host`).
Other than `rabbitmq_use_long_name` the defaults are pretty reasonable:
```
rabbitmq_port=5672
rabbitmq_vhost=tower
rabbitmq_username=tower
rabbitmq_password=tower
rabbitmq_cookie=<somevalue>
rabbitmq_use_long_name=false
rabbitmq_enable_manager=false
```
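As an illustration of the FQDN case above (the hostnames here are made up), an inventory that identifies instances by their fully qualified names would flip only that one value:
```
[tower]
towerhost01.example.com
towerhost02.example.com
towerhost03.example.com

[all:vars]
rabbitmq_use_long_name=true
```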
### Provisioning and Deprovisioning Instances and Groups
* Provisioning
Provisioning Instances after installation is supported by updating the `inventory` file and re-running the setup playbook. It's important that this file
contain all passwords and information used when installing the cluster, or other instances may be reconfigured (this could be intentional).
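A minimal sketch of that re-run, assuming the standard setup bundle layout and that the same `inventory` file (with the new instances added) is in place:
```
$ ./setup.sh
```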
* Deprovisioning
Tower does not automatically de-provision instances since we can't distinguish between an instance that was taken offline intentionally and one that is down due to failure.
Instead the procedure for deprovisioning an instance is to shut it down (or stop the `ansible-tower-service`) and run the Tower deprovision command:
```
$ tower-manage deprovision-node <nodename>
```
* Removing/Deprovisioning Instance Groups
Tower does not automatically de-provision or remove instance groups, even though re-provisioning will often cause these to be unused. They may still
show up in api endpoints and stats monitoring. These groups can be removed with the following command:
```
$ tower-manage unregister_queue --queuename=<name>
```
### Status and Monitoring
Tower itself reports as much status as it can via the api at `/api/v2/ping` in order to provide validation of the health
of the Cluster. This includes:
* The instance servicing the HTTP request
* The last heartbeat time of all other instances in the cluster
* The state of the Job Queue
* The RabbitMQ cluster status
* Instance Groups and Instance membership in those groups
A more detailed view of Instances and Instance Groups, including running jobs and membership
information can be seen at `/api/v2/instances/` and `/api/v2/instance_groups`.
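A quick sketch of checking that status from the command line; the hostname and credentials are placeholders, not values defined by this document:
```
$ curl -ks https://tower.example.com/api/v2/ping/
$ curl -ks -u admin:password https://tower.example.com/api/v2/instances/
$ curl -ks -u admin:password https://tower.example.com/api/v2/instance_groups/
```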
### Instance Services and Failure Behavior
Each Tower instance is made up of several different services working collaboratively:
* HTTP Services - This includes the Tower application itself as well as external web services.
* Callback Receiver - Receives job events from running Ansible jobs.
* Celery - The worker queue that processes and runs all jobs.
* RabbitMQ - The message broker, used as a signaling mechanism for Celery and for any event data propagated to the application.
* Memcached - local caching service for the instance it lives on.
Tower is configured in such a way that if any of these services or their components fail then all services are restarted. If these fail sufficiently
often in a short span of time then the entire instance will be placed offline in an automated fashion in order to allow remediation without causing unexpected
behavior.
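For manual remediation on a single instance, the bundled `ansible-tower-service` wrapper referenced in the deprovisioning section can be used; this document only mentions stopping it, so treat the exact subcommands below as assumptions:
```
$ ansible-tower-service status
$ ansible-tower-service restart
```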
### Job Runtime Behavior
Ideally a regular user of Tower should not notice any semantic difference in the way jobs are run and reported. Behind the scenes it's worth
pointing out the differences in how the system behaves.
When a job is submitted from the API interface it gets pushed into the Celery queue on RabbitMQ. A single RabbitMQ instance is the responsible master for
individual queues but each Tower instance will connect to and receive jobs from that queue using a fair scheduling algorithm. Any instance in the cluster is
just as likely to receive the work and execute the task. If an instance fails while executing jobs then the work is marked as permanently failed.
If a cluster is divided into separate Instance Groups then the behavior is similar to the cluster as a whole. If two instances are assigned to a group then
either one is just as likely to receive a job as any other in the same group.
As Tower instances are brought online, they effectively expand the work capacity of the Tower system. If those instances are also placed into Instance Groups then
they also expand that group's capacity. If an instance is performing work and is a member of multiple groups then capacity will be reduced from all groups of
which it is a member; for example, a job running on an instance that belongs to both `tower` and `rampart_east` consumes capacity in both groups for its duration.
De-provisioning an instance will remove capacity from the cluster wherever that instance was assigned.
It's important to note that not all instances are required to be provisioned with an equal capacity.
Project updates behave differently than they did before. Previously they were ordinary jobs that ran on a single instance. It's now important that
they run successfully on any instance that could potentially run a job. Projects will now sync themselves to the correct version on the instance immediately
prior to running the job.
If an Instance Group is configured but all instances in that group are offline or unavailable, any jobs that are launched targeting only that group will be stuck
in a waiting state until instances become available. Fallback or backup resources should be provisioned to handle any work that might encounter this scenario.
## Acceptance Criteria
When verifying acceptance we should ensure the following statements are true:
* Tower should install as a standalone Instance
* Tower should install in a Clustered fashion
* Instances should, optionally, be able to be grouped arbitrarily into different Instance Groups
* Capacity should be tracked at the group level and capacity impact should make sense relative to what instance a job is
running on and what groups that instance is a member of.
* Provisioning should be supported via the setup playbook
* De-provisioning should be supported via a management command
* All jobs, inventory updates, and project updates should run successfully
* Jobs should be able to run on the hosts at which they are targeted. If assigned implicitly or directly to groups, they should
only run on instances in those Instance Groups.
* Project updates should manifest their data on the host that will run the job immediately prior to the job running
* Tower should be able to reasonably survive the removal of all instances in the cluster
* Tower should behave in a predictable fashion during network partitioning
## Testing Considerations
* Basic testing should be able to demonstrate parity with a standalone instance for all integration testing.
* Basic playbook testing to verify routing differences, including:
- Basic FQDN
- Short-name name resolution
- ip addresses
- /etc/hosts static routing information
* We should test behavior of large and small clusters. I would envision small clusters as 2 - 3 instances and large
clusters as 10 - 15 instances
* Failure testing should involve killing single instances and killing multiple instances while the cluster is performing work.
Job failures during the time period should be predictable and not catastrophic.
* Instance downtime testing should also include recoverability testing. Killing single services and ensuring the system can
return itself to a working state
* Persistent failure should be tested by killing single services in such a way that the cluster instance cannot be recovered
and ensuring that the instance is properly taken offline
* Network partitioning failures will be important also. In order to test this:
- Disallow a single instance from communicating with the other instances but allow it to communicate with the database
- Break the link between instances such that it forms 2 or more groups where groupA and groupB can't communicate but all instances
can communicate with the database.
* Crucially when network partitioning is resolved all instances should recover into a consistent state
* Upgrade Testing, verify behavior before and after are the same for the end user.
* Project Updates should be thoroughly tested for all scm types (git, svn, hg) and for manual projects.
* Setting up instance groups in two scenarios:
a) instances are shared between groups
b) instances are isolated to particular groups
Organizations, Inventories, and Job Templates should be variously assigned to one or many groups and jobs should execute
in those groups in preferential order as resources are available.
## Performance Testing


### Scheduler Algorithm
* Get all non-completed jobs, `all_tasks`
* Detect finished workflow jobs
* Spawn next workflow jobs if needed
* For each pending job, starting with the oldest created job:
* If the job is not blocked and there is capacity in the instance group queue, then mark it as `waiting` and submit the job to celery.
### Job Lifecycle
| Job Status | State |
## Code Composition
The main goal of the new task manager is to run in our HA environment. This translates to making the task manager logic run on any tower node. To support this we need to remove any reliance on state between task manager schedule logic runs. We had a secondary goal in mind of designing the task manager to have limited/no access to the database for the future federation feature. This secondary requirement combined with performance needs led us to create partial models that wrap dict database model data.
### Partials
Partials wrap a subset of Django model dict data, provide a simple static query method purpose-built to support populating the task manager hash tables, keep a link back to the model they wrap so that the original Django ORM model can easily be retrieved, and can be made serializable via `<type, self.data>` since `self.data` is a `dict` of the database record.
### Blocking Logic
The blocking logic is handled by a mixture of ORM instance references and task manager local tracking data in the scheduler instance.
## Acceptance Tests
The new task manager should, basically, work like the old one.
### Task Manager Rules
* Groups of blocked tasks run in chronological order
* Tasks that are not blocked run whenever there is capacity available in the instance group they are set to run in***
* ***1 job is always allowed to run per instance group, even if there isn't enough capacity.
* Only 1 Project Update for a Project may be running
* Only 1 Inventory Update for an Inventory Source may be running
* For a related Project, only a Job xor Project Update may be running