diff --git a/docs/clustering.md b/docs/clustering.md index d8b1217fdc..244b026424 100644 --- a/docs/clustering.md +++ b/docs/clustering.md @@ -1,105 +1,91 @@ ## Tower Clustering/HA Overview -Prior to 3.1 the Ansible Tower HA solution was not a true high-availability system. In 3.1 we have rewritten this system entirely with a new focus towards -a proper highly available clustered system. In 3.2 we have extended this further to allow grouping of clustered instances into different pools/queues. +Prior to 3.1, the Ansible Tower HA solution was not a true high-availability system. This system has been entirely rewritten in 3.1 with a focus towards a proper highly-available clustered system. This has been extended further in 3.2 to allow grouping of clustered instances into different pools/queues. -* Each instance should be able to act as an entrypoint for UI and API Access. - This should enable Tower administrators to use load balancers in front of as many instances as they wish - and maintain good data visibility. +* Each instance should be able to act as an entry point for UI and API Access. + This should enable Tower administrators to use load balancers in front of as many instances as they wish and maintain good data visibility. * Each instance should be able to join the Tower cluster and expand its ability to execute jobs. -* Provisioning new instance should be as simple as updating the `inventory` file and re-running the setup playbook -* Instances can be deprovisioned with a simple management commands +* Provisioning new instance should be as simple as updating the `inventory` file and re-running the setup playbook. +* Instances can be de-provisioned with a simple management command. * Instances can be grouped into one or more Instance Groups to share resources for topical purposes. * These instance groups should be assignable to certain resources: * Organizations * Inventories * Job Templates - such that execution of jobs under those resources will favor particular queues + + ...such that execution of jobs under those resources will favor particular queues. It's important to point out a few existing things: -* PostgreSQL is still a standalone instance and is not clustered. We also won't manage replica configuration or, - if the user configures standby replicas, database failover. -* All instances should be reachable from all other instances and they should be able to reach the database. It's also important - for the hosts to have a stable address and/or hostname (depending on how you configure the Tower host) -* RabbitMQ is the cornerstone of Tower's Clustering system. A lot of our configuration requirements and behavior is dictated - by its needs. Thus we are pretty inflexible to customization beyond what our setup playbook allows. Each Tower instance has a - deployment of RabbitMQ that will cluster with the other instances' RabbitMQ instances. +* PostgreSQL is still a standalone instance and is not clustered. Replica configuration will not be managed. If the user configures standby replicas, database failover will also not be managed. +* All instances should be reachable from all other instances and they should be able to reach the database. It's also important for the hosts to have a stable address and/or hostname (depending on how you configure the Tower host). +* RabbitMQ is the cornerstone of Tower's Clustering system. A lot of AWX's configuration requirements and behavior are dictated by its needs. 
For this reason, it is generally inflexible to customize beyond what the setup playbook allows. Each AWX/Tower instance has a deployment of RabbitMQ which will cluster with the other instances' RabbitMQ instances. * Existing old-style HA deployments will be transitioned automatically to the new HA system during the upgrade process to 3.1. -* Manual projects will need to be synced to all instances by the customer +* Manual projects will need to be synced to all instances by the customer. + +Ansible Tower 3.3 adds support for container-based clusters using Openshift or Kubernetes. -Ansible Tower 3.3 adds support for container-based clusters using Openshift or Kubernetes ## Important Changes * There is no concept of primary/secondary in the new Tower system. *All* systems are primary. -* Setup playbook changes to configure rabbitmq and give hints to the type of network the hosts are on. -* The `inventory` file for Tower deployments should be saved/persisted. If new instances are to be provisioned - the passwords and configuration options as well as host names will need to be available to the installer. +* Set up playbook changes to configure RabbitMQ and give hints to the type of network the hosts are on. +* The `inventory` file for Tower deployments should be saved/persisted. If new instances are to be provisioned, the passwords and configuration options as well as host names will need to be available to the installer. + ## Concepts and Configuration ### Installation and the Inventory File -The current standalone instance configuration doesn't change for a 3.1+ deploy. The inventory file does change in some important ways: +The current standalone instance configuration doesn't change for a 3.1+ deployment. The inventory file does change in some important ways: -* Since there is no primary/secondary configuration those inventory groups go away and are replaced with a - single inventory group `tower`. The customer may, *optionally*, define other groups and group instances in those groups. These groups - should be prefixed with `instance_group_`. Instances are not required to be in the `tower` group alongside other `instance_group_` groups, but one - instance *must* be present in the `tower` group. Technically `tower` is a group like any other `instance_group_` group but it must always be present - and if a specific group is not associated with a specific resource then job execution will always fall back to the `tower` group: - ``` - [tower] - hostA - hostB - hostC - - [instance_group_east] - hostB - hostC - - [instance_group_west] - hostC - hostD - ``` - - The `database` group remains for specifying an external postgres. If the database host is provisioned seperately this group should be empty - ``` - [tower] - hostA - hostB - hostC - - [database] - hostDB - ``` +* Since there is no primary/secondary configuration, those inventory groups go away and are replaced with a single inventory group `tower`. The customer may *optionally* define other groups and group instances in those groups. These groups should be prefixed with `instance_group_`. Instances are not required to be in the `tower` group alongside other `instance_group_` groups, but one instance *must* be present in the `tower` group. 
Technically `tower` is a group like any other `instance_group_` group, but it must always be present and if a specific group is not associated with a specific resource, then job execution will always fall back to the `tower` group: -* It's common for customers to provision Tower instances externally but prefer to reference them by internal addressing. This is most significant - for RabbitMQ clustering where the service isn't available at all on an external interface. For this purpose it is necessary to assign the internal - address for RabbitMQ links as such: - ``` - [tower] - hostA rabbitmq_host=10.1.0.2 - hostB rabbitmq_host=10.1.0.3 - hostC rabbitmq_host=10.1.0.3 - ``` -* The `redis_password` field is removed from `[all:vars]` +``` +[tower] +hostA +hostB +hostC + +[instance_group_east] +hostB +hostC + +[instance_group_west] +hostC +hostD +``` + +The `database` group remains in order to specify an external Postgres. If the database host is provisioned separately, this group should be empty. + +``` +[tower] +hostA +hostB +hostC + +[database] +hostDB +``` + +* It's common for customers to provision Tower instances externally but prefer to reference them by internal addressing. This is most significant for RabbitMQ clustering, where the service isn't available at all on an external interface. Because of this, it is necessary to assign the internal address for RabbitMQ links as such: + +``` +[tower] +hostA rabbitmq_host=10.1.0.2 +hostB rabbitmq_host=10.1.0.3 +hostC rabbitmq_host=10.1.0.3 +``` + +* The `redis_password` field is removed from `[all:vars]`. * There are various new fields for RabbitMQ: - - `rabbitmq_port=5672` - RabbitMQ is installed on each instance and is not optional, it's also not possible to externalize. It is - possible to configure what port it listens on and this setting controls that. - - `rabbitmq_vhost=tower` - Tower configures a rabbitmq virtualhost to isolate itself. This controls that settings. - - `rabbitmq_username=tower` and `rabbitmq_password=tower` - Each instance will be configured with these values and each instance's Tower - instance will be configured with it also. This is similar to our other uses of usernames/passwords. - - `rabbitmq_cookie=` - This value is unused in a standalone deployment but is critical for clustered deployments. - This acts as the secret key that allows RabbitMQ cluster members to identify each other. - - `rabbitmq_use_long_names` - RabbitMQ is pretty sensitive to what each instance is named. We are flexible enough to allow FQDNs - (host01.example.com), short names (host01), or ip addresses (192.168.5.73). Depending on what is used to identify each host - in the `inventory` file this value may need to be changed. For FQDNs and ip addresses this value needs to be `true`. For short - names it should be `false` + - `rabbitmq_port=5672` - RabbitMQ is installed on each instance and is not optional, it's also not possible to externalize. It is possible to configure what port it listens on and this setting controls that. + - `rabbitmq_vhost=tower` - Tower configures a rabbitmq virtualhost to isolate itself. This controls that setting. + - `rabbitmq_username=tower` and `rabbitmq_password=tower` - Each instance will be configured with these values and each instance's Tower instance will be configured with it also. This is similar to our other uses of usernames/passwords. + - `rabbitmq_cookie=` - This value is unused in a standalone deployment but is critical for clustered deployments. 
This acts as the secret key that allows RabbitMQ cluster members to identify each other. + - `rabbitmq_use_long_names` - RabbitMQ is pretty sensitive to what each instance is named. We are flexible enough to allow FQDNs (_host01.example.com_), short names (`host01`), or IP addresses (192.168.5.73). Depending on what is used to identify each host in the `inventory` file, this value may need to be changed. For FQDNs and IP addresses, this value needs to be `true`. For short names it should be `false` - `rabbitmq_enable_manager` - Setting this to `true` will expose the RabbitMQ management web console on each instance. -The most important field to point out for variability is `rabbitmq_use_long_name`. That's something we can't detect or provide a reasonable -default for so it's important to point out when it needs to be changed. If instances are provisioned to where they reference other instances -internally and not on external addressess then `rabbitmq_use_long_name` semantics should follow the internal addressing (aka `rabbitmq_host`. +The most important field to point out for variability is `rabbitmq_use_long_name`. This cannot be detected and no reasonable default is provided for it, so it's important to point out when it needs to be changed. If instances are provisioned to where they reference other instances internally and not on external addresses then `rabbitmq_use_long_name` semantics should follow the internal addressing (aka `rabbitmq_host`). Other than `rabbitmq_use_long_name` the defaults are pretty reasonable: ``` @@ -115,17 +101,13 @@ rabbitmq_enable_manager=false ``` Recommendations and constraints: - - Do not create a group named `instance_group_tower` - - Do not name any instance the same as a group name + - Do not create a group named `instance_group_tower`. + - Do not name any instance the same as a group name. + ### Security Isolated Rampart Groups -In Tower versions 3.2+ customers may optionally define isolated groups -inside security-restricted networking zones to run jobs and ad hoc commands from. -Instances in these groups will _not_ have a full install of Tower, but will have a minimal -set of utilities used to run jobs. Isolated groups must be specified -in the inventory file prefixed with `isolated_group_`. An example inventory -file is shown below. +In Tower versions 3.2+ customers may optionally define isolated groups inside of security-restricted networking zones from which to run jobs and ad hoc commands. Instances in these groups will _not_ have a full install of Tower, but will have a minimal set of utilities used to run jobs. Isolated groups must be specified in the inventory file prefixed with `isolated_group_`. An example inventory file is shown below: ``` [tower] @@ -145,84 +127,51 @@ isolatedB controller=security ``` -In the isolated rampart model, "controller" instances interact with "isolated" -instances via a series of Ansible playbooks over SSH. At installation time, -a randomized RSA key is generated and distributed as an authorized key to all -"isolated" instances. The private half of the key is encrypted and stored -within Tower, and is used to authenticate from "controller" instances to -"isolated" instances when jobs are run. +In the isolated rampart model, "controller" instances interact with "isolated" instances via a series of Ansible playbooks over SSH. At installation time, a randomized RSA key is generated and distributed as an authorized key to all "isolated" instances. 
The private half of the key is encrypted and stored within Tower, and is used to authenticate from "controller" instances to "isolated" instances when jobs are run. When a job is scheduled to run on an "isolated" instance: -* The "controller" instance compiles metadata required to run the job and copies - it to the "isolated" instance via `rsync` (any related project or inventory - updates are run on the controller instance). This metadata includes: +* The "controller" instance compiles metadata required to run the job and copies it to the "isolated" instance via `rsync` (any related project or inventory updates are run on the controller instance). This metadata includes: - the entire SCM checkout directory for the project - a static inventory file - pexpect passwords - environment variables - - the `ansible`/`ansible-playbook` command invocation, i.e., - `bwrap ... ansible-playbook -i /path/to/inventory /path/to/playbook.yml -e ...` + - the `ansible`/`ansible-playbook` command invocation, _i.e._, `bwrap ... ansible-playbook -i /path/to/inventory /path/to/playbook.yml -e ...` -* Once the metadata has been rsynced to the isolated host, the "controller - instance" starts a process on the "isolated" instance which consumes the - metadata and starts running `ansible`/`ansible-playbook`. As the playbook - runs, job artifacts (such as stdout and job events) are written to disk on - the "isolated" instance. +* Once the metadata has been `rsync`ed to the isolated host, the "controller instance" starts a process on the "isolated" instance which consumes the metadata and starts running `ansible`/`ansible-playbook`. As the playbook runs, job artifacts (such as `stdout` and job events) are written to disk on the "isolated" instance. -* While the job runs on the "isolated" instance, the "controller" instance - periodically copies job artifacts (stdout and job events) from the "isolated" - instance using `rsync`. It consumes these until the job finishes running on the - "isolated" instance. +* While the job runs on the "isolated" instance, the "controller" instance periodically copies job artifacts (`stdout` and job events) from the "isolated" instance using `rsync`. It consumes these until the job finishes running on the "isolated" instance. -Isolated groups are architected such that they may exist inside of a VPC -with security rules that _only_ permit the instances in its `controller` -group to access them; only ingress SSH traffic from "controller" instances to -"isolated" instances is required. +Isolated groups are architected such that they may exist inside of a VPC with security rules that _only_ permit the instances in its `controller` group to access them; only ingress SSH traffic from "controller" instances to "isolated" instances is required. Recommendations for system configuration with isolated groups: - - Do not create a group named `isolated_group_tower` - - Do not put any isolated instances inside the `tower` group or other - ordinary instance groups. - - Define the `controller` variable as either a group var or as a hostvar - on all the instances in the isolated group. Please _do not_ allow - isolated instances in the same group have a different value for this - variable - the behavior in this case can not be predicted. - - Do not put an isolated instance in more than 1 isolated group. + - Do not create a group named `isolated_group_tower`. + - Do not put any isolated instances inside the `tower` group or other ordinary instance groups. 
+ - Define the `controller` variable as either a group var or as a hostvar on all the instances in the isolated group. Please _do not_ allow isolated instances in the same group to have a different value for this variable - the behavior in this case cannot be predicted. + - Do not put an isolated instance in more than one isolated group. + Isolated Instance Authentication -------------------------------- -By default - at installation time - a randomized RSA key is generated and -distributed as an authorized key to all "isolated" instances. The private half -of the key is encrypted and stored within Tower, and is used to authenticate -from "controller" instances to "isolated" instances when jobs are run. +By default - at installation time - a randomized RSA key is generated and distributed as an authorized key to all "isolated" instances. The private half of the key is encrypted and stored within Tower, and is used to authenticate from "controller" instances to "isolated" instances when jobs are run. -For users who wish to manage SSH authentication from controlling instances to -isolated instances via some system _outside_ of Tower (such as externally-managed -passwordless SSH keys), this behavior can be disabled by unsetting two Tower -API settings values: +For users who wish to manage SSH authentication from controlling instances to isolated instances via some system _outside_ of Tower (such as externally-managed passwordless SSH keys), this behavior can be disabled by unsetting two Tower API settings values: `HTTP PATCH /api/v2/settings/jobs/ {'AWX_ISOLATED_PRIVATE_KEY': '', 'AWX_ISOLATED_PUBLIC_KEY': ''}` ### Provisioning and Deprovisioning Instances and Groups -* Provisioning -Provisioning Instances after installation is supported by updating the `inventory` file and re-running the setup playbook. It's important that this file -contain all passwords and information used when installing the cluster or other instances may be reconfigured (This could be intentional) +* **Provisioning** - Provisioning Instances after installation is supported by updating the `inventory` file and re-running the setup playbook. It's important that this file contain all passwords and information used when installing the cluster, or other instances may be reconfigured (this could be intentional). -* Deprovisioning -Tower does not automatically de-provision instances since we can't distinguish between an instance that was taken offline intentionally or due to failure. -Instead the procedure for deprovisioning an instance is to shut it down (or stop the `ansible-tower-service`) and run the Tower deprovision command: +* **Deprovisioning** - Tower does not automatically de-provision instances since it cannot distinguish between an instance that was taken offline intentionally or due to failure. Instead, the procedure for deprovisioning an instance is to shut it down (or stop the `ansible-tower-service`) and run the Tower deprovision command: ``` $ awx-manage deprovision_instance --hostname= ``` -* Removing/Deprovisioning Instance Groups -Tower does not automatically de-provision or remove instance groups, even though re-provisioning will often cause these to be unused. They may still -show up in api endpoints and stats monitoring. These groups can be removed with the following command: +* **Removing/Deprovisioning Instance Groups** - Tower does not automatically de-provision or remove instance groups, even though re-provisioning will often cause these to be unused.
They may still show up in API endpoints and stats monitoring. These groups can be removed with the following command: ``` $ awx-manage unregister_queue --queuename= @@ -238,39 +187,33 @@ Once created, `Instances` can be associated with an Instance Group with: HTTP POST /api/v2/instance_groups/x/instances/ {'id': y}` ``` -An `Instance` that is added to an `InstanceGroup` will automatically reconfigure itself to listen on the group's work queue. See the following -section `Instance Group Policies` for more details. +An `Instance` that is added to an `InstanceGroup` will automatically reconfigure itself to listen on the group's work queue. See the following section `Instance Group Policies` for more details. + ### Instance Group Policies Tower `Instances` can be configured to automatically join `Instance Groups` when they come online by defining a policy. These policies are evaluated for every new Instance that comes online. -Instance Group Policies are controlled by 3 optional fields on an `Instance Group`: +Instance Group Policies are controlled by three optional fields on an `Instance Group`: -* `policy_instance_percentage`: This is a number between 0 - 100. It gaurantees that this percentage of active Tower instances will be added - to this `Instance Group`. As new instances come online, if the number of Instances in this group relative to the total number of instances - is less than the given percentage then new ones will be added until the percentage condition is satisfied. -* `policy_instance_minimum`: This policy attempts to keep at least this many `Instances` in the `Instance Group`. If the number of - available instances is lower than this minimum then all `Instances` will be placed in this `Instance Group`. +* `policy_instance_percentage`: This is a number between 0 - 100. It guarantees that this percentage of active Tower instances will be added to this `Instance Group`. As new instances come online, if the number of Instances in this group relative to the total number of instances is fewer than the given percentage, then new ones will be added until the percentage condition is satisfied. +* `policy_instance_minimum`: This policy attempts to keep at least this many `Instances` in the `Instance Group`. If the number of available instances is lower than this minimum, then all `Instances` will be placed in this `Instance Group`. * `policy_instance_list`: This is a fixed list of `Instance` names to always include in this `Instance Group`. > NOTES -* `Instances` that are assigned directly to `Instance Groups` by posting to `/api/v2/instance_groups/x/instances` or - `/api/v2/instances/x/instance_groups` are automatically added to the `policy_instance_list`. This means they are subject to the - normal caveats for `policy_instance_list` and must be manually managed. -* `policy_instance_percentage` and `policy_instance_minimum` work together. For example, if you have a `policy_instance_percentage` of - 50% and a `policy_instance_minimum` of 2 and you start 6 `Instances`. 3 of them would be assigned to the `Instance Group`. If you reduce the number - of `Instances` to 2 then both of them would be assigned to the `Instance Group` to satisfy `policy_instance_minimum`. In this way, you can set a lower - bound on the amount of available resources. -* Policies don't actively prevent `Instances` from being associated with multiple `Instance Groups` but this can effectively be achieved by making the percentages - sum to 100. 
If you have 4 `Instance Groups` assign each a percentage value of 25 and the `Instances` will be distributed among them with no overlap. +* `Instances` that are assigned directly to `Instance Groups` by posting to `/api/v2/instance_groups/x/instances` or `/api/v2/instances/x/instance_groups` are automatically added to the `policy_instance_list`. This means they are subject to the normal caveats for `policy_instance_list` and must be manually managed. + +* `policy_instance_percentage` and `policy_instance_minimum` work together. For example, if you have a `policy_instance_percentage` of 50% and a `policy_instance_minimum` of 2 and you start 6 `Instances`, 3 of them would be assigned to the `Instance Group`. If you reduce the number of `Instances` to 2 then both of them would be assigned to the `Instance Group` to satisfy `policy_instance_minimum`. In this way, you can set a lower bound on the amount of available resources. + +* Policies don't actively prevent `Instances` from being associated with multiple `Instance Groups` but this can effectively be achieved by making the percentages sum to 100. If you have 4 `Instance Groups`, assign each a percentage value of 25 and the `Instances` will be distributed among them with no overlap. + ### Manually Pinning Instances to Specific Groups If you have a special `Instance` which needs to be _exclusively_ assigned to a specific `Instance Group` but don't want it to automatically join _other_ groups via "percentage" or "minimum" policies: -1. Add the `Instance` to one or more `Instance Group`s' `policy_instance_list` +1. Add the `Instance` to one or more `Instance Group`s' `policy_instance_list`. 2. Update the `Instance`'s `managed_by_policy` property to be `False`. This will prevent the `Instance` from being automatically added to other groups based on percentage and minimum policy; it will **only** belong to the groups you've manually assigned it to: @@ -287,15 +230,15 @@ HTTP PATCH /api/v2/instances/X/ } ``` + ### Status and Monitoring -Tower itself reports as much status as it can via the api at `/api/v2/ping` in order to provide validation of the health -of the Cluster. This includes: +Tower itself reports as much status as it can via the API at `/api/v2/ping` in order to provide validation of the health of the Cluster. This includes: -* The instance servicing the HTTP request -* The last heartbeat time of all other instances in the cluster -* The RabbitMQ cluster status -* Instance Groups and Instance membership in those groups +* The instance servicing the HTTP request. +* The last heartbeat time of all other instances in the cluster. +* The RabbitMQ cluster status. +* Instance Groups and Instance membership in those groups. A more detailed view of Instances and Instance Groups, including running jobs and membership information can be seen at `/api/v2/instances/` and `/api/v2/instance_groups`. @@ -304,79 +247,57 @@ information can be seen at `/api/v2/instances/` and `/api/v2/instance_groups`. Each Tower instance is made up of several different services working collaboratively: -* HTTP Services - This includes the Tower application itself as well as external web services. -* Callback Receiver - Whose job it is to receive job events from running Ansible jobs. -* Celery - The worker queue, that processes and runs all jobs. -* RabbitMQ - Message Broker, this is used as a signaling mechanism for Celery as well as any event data propogated to the application. -* Memcached - local caching service for the instance it lives on. 
+* **HTTP Services** - This includes the Tower application itself as well as external web services. +* **Callback Receiver** - Receives job events that result from running Ansible jobs. +* **Celery** - The worker queue that processes and runs all jobs. +* **RabbitMQ** - A Message Broker, this is used as a signaling mechanism for Celery as well as any event data propagated to the application. +* **Memcached** - A local caching service for the instance it lives on. + +Tower is configured in such a way that if any of these services or their components fail, then all services are restarted. If these fail sufficiently often in a short span of time, then the entire instance will be placed offline in an automated fashion in order to allow remediation without causing unexpected behavior. -Tower is configured in such a way that if any of these services or their components fail then all services are restarted. If these fail sufficiently -often in a short span of time then the entire instance will be placed offline in an automated fashion in order to allow remediation without causing unexpected -behavior. ### Job Runtime Behavior -Ideally a regular user of Tower should not notice any semantic difference to the way jobs are run and reported. Behind the scenes its worth -pointing out the differences in how the system behaves. +Ideally a regular user of Tower should not notice any semantic difference to the way jobs are run and reported. Behind the scenes it is worth pointing out the differences in how the system behaves. -When a job is submitted from the API interface it gets pushed into the Celery queue on RabbitMQ. A single RabbitMQ instance is the responsible master for -individual queues but each Tower instance will connect to and receive jobs from that queue using a Fair scheduling algorithm. Any instance on the cluster is -just as likely to receive the work and execute the task. If a instance fails while executing jobs then the work is marked as permanently failed. +When a job is submitted from the API interface it gets pushed into the Celery queue on RabbitMQ. A single RabbitMQ instance is the responsible master for individual queues, but each Tower instance will connect to and receive jobs from that queue using a Fair scheduling algorithm. Any instance on the cluster is just as likely to receive the work and execute the task. If an instance fails while executing jobs, then the work is marked as permanently failed. -If a cluster is divided into separate Instance Groups then the behavior is similar to the cluster as a whole. If two instances are assigned to a group then -either one is just as likely to receive a job as any other in the same group. +If a cluster is divided into separate Instance Groups, then the behavior is similar to the cluster as a whole. If two instances are assigned to a group then either one is just as likely to receive a job as any other in the same group. -As Tower instances are brought online it effectively expands the work capacity of the Tower system. If those instances are also placed into Instance Groups then -they also expand that group's capacity. If an instance is performing work and it is a member of multiple groups then capacity will be reduced from all groups for -which it is a member. De-provisioning an instance will remove capacity from the cluster wherever that instance was assigned. +As Tower instances are brought online, it effectively expands the work capacity of the Tower system. 
If those instances are also placed into Instance Groups, then they also expand that group's capacity. If an instance is performing work and it is a member of multiple groups, then capacity will be reduced from all groups for which it is a member. De-provisioning an instance will remove capacity from the cluster wherever that instance was assigned. It's important to note that not all instances are required to be provisioned with an equal capacity. -Project updates behave differently than they did before. Previously they were ordinary jobs that ran on a single instance. It's now important that -they run successfully on any instance that could potentially run a job. Project's will now sync themselves to the correct version on the instance immediately -prior to running the job. -When the sync happens, it is recorded in the database as a project update with a `launch_type` of "sync" -and a `job_type` of "run". Project syncs will not change the status or version of the project, -instead, they will update the source tree only on the instance where they run. -The only exception to this behavior is when the project is in the "never updated" state -(meaning that no project updates of any type have been ran), -in which case a sync should fill in the project's initial revision and status, and subsequent -syncs should not make such changes. +Project updates behave differently than they did before. Previously they were ordinary jobs that ran on a single instance. It's now important that they run successfully on any instance that could potentially run a job. Projects will now sync themselves to the correct version on the instance immediately prior to running the job. + +When the sync happens, it is recorded in the database as a project update with a `launch_type` of "sync" and a `job_type` of "run". Project syncs will not change the status or version of the project; instead, they will update the source tree _only_ on the instance where they run. The only exception to this behavior is when the project is in the "never updated" state (meaning that no project updates of any type have been run), in which case a sync should fill in the project's initial revision and status, and subsequent syncs should not make such changes. + +If an Instance Group is configured but all instances in that group are offline or unavailable, any jobs that are launched targeting only that group will be stuck in a waiting state until instances become available. Fallback or backup resources should be provisioned to handle any work that might encounter this scenario. -If an Instance Group is configured but all instances in that group are offline or unavailable, any jobs that are launched targeting only that group will be stuck -in a waiting state until instances become available. Fallback or backup resources should be provisioned to handle any work that might encounter this scenario. #### Controlling where a particular job runs -By default, a job will be submitted to the `tower` queue, meaning that it can be -picked up by any of the workers. +By default, a job will be submitted to the `tower` queue, meaning that it can be picked up by any of the workers. + ##### How to restrict the instances a job will run on If any of the job template, inventory, -or organization has instance groups associated with them, a job ran from that job template -will not be eligible for the default behavior. 
That means that if all of the -instance associated with these 3 resources are out of capacity, the job will -remain in the `pending` state until capacity frees up. +or organization has instance groups associated with them, a job run from that job template will not be eligible for the default behavior. That means that if all of the instance associated with these three resources are out of capacity, the job will remain in the `pending` state until capacity frees up. + ##### How to set up a preferred instance group -The order of preference in determining which instance group to submit the job to -goes: +The order of preference in determining which instance group to which the job gets submitted is as follows: -1. job template -2. inventory -3. organization (by way of inventory) +1. Job Template +2. Inventory +3. Organization (by way of Inventory) -If instance groups are associated with the job template, and all of these -are at capacity, then the job will be submitted to instance groups specified -on inventory, and then organization. +To expand further: If instance groups are associated with the job template and all of them are at capacity, then the job will be submitted to instance groups specified on inventory, and then organization. + +The global `tower` group can still be associated with a resource, just like any of the custom instance groups defined in the playbook. This can be used to specify a preferred instance group on the job template or inventory, but still allow the job to be submitted to any instance if those are out of capacity. -The global `tower` group can still be associated with a resource, just like -any of the custom instance groups defined in the playbook. This can be -used to specify a preferred instance group on the job template or inventory, -but still allow the job to be submitted to any instance if those are out of -capacity. #### Instance Enable / Disable @@ -430,7 +351,7 @@ When verifying acceptance we should ensure the following statements are true a) instances are shared between groups b) instances are isolated to particular groups Organizations, Inventories, and Job Templates should be variously assigned to one or many groups and jobs should execute - in those groups in preferential order as resources are available. + in those groups in preferential order as resources are available. ## Performance Testing diff --git a/docs/inventory_plugins.md b/docs/inventory_plugins.md index a5905e7243..1d60a337cb 100644 --- a/docs/inventory_plugins.md +++ b/docs/inventory_plugins.md @@ -1,153 +1,79 @@ # Transition to Ansible Inventory Plugins -Inventory updates change from using scripts which are vendored as executable -python scripts to using dynamically-generated -YAML files which conform to the specifications of the `auto` inventory plugin -which are then parsed by their respective inventory plugin. +Inventory updates have changed from using scripts which are vendored as executable Python scripts to using dynamically-generated YAML files which conform to the specifications of the `auto` inventory plugin. These are then parsed by their respective inventory plugin. -The major organizational change is that the inventory plugins are -part of the Ansible core distribution, whereas the same logic used to -be a part of AWX source. +The major organizational change is that the inventory plugins are part of the Ansible core distribution, whereas the same logic used to be a part of AWX source. 
## Prior Background for Transition -AWX used to maintain logic that parsed `.ini` inventory file contents, -in addition to interpreting the JSON output of scripts, re-calling with -the `--host` option in the case the `_meta.hostvars` key was not provided. +AWX used to maintain logic that parsed `.ini` inventory file contents, in addition to interpreting the JSON output of scripts, re-calling with the `--host` option in cases where the `_meta.hostvars` key was not provided. ### Switch to Ansible Inventory -The CLI entry point `ansible-inventory` was introduced in Ansible 2.4. -In Tower 3.2, inventory imports began running this command -as an intermediary between the inventory and -the import's logic to save content to database. Using `ansible-inventory` -eliminates the need to maintain source-specific logic, -relying on Ansible's code instead. This also allows us to -count on a consistent data structure outputted from `ansible-inventory`. -There are many valid structures that a script can provide, but the output -from `ansible-inventory` will always be the same, -thus the AWX logic to parse the content is simplified. -This is why even scripts must be ran through the `ansible-inventory` CLI. +The CLI entry point `ansible-inventory` was introduced in Ansible 2.4. In Tower 3.2, inventory imports began running this command as an intermediary between the inventory and the import's logic to save content to the database. Using `ansible-inventory` eliminates the need to maintain source-specific logic, relying on Ansible's code instead. This also allows us to count on a consistent data structure output by `ansible-inventory`. There are many valid structures that a script can provide, but the output from `ansible-inventory` will always be the same, thus the AWX logic to parse the content is simplified. This is why even scripts must be run through the `ansible-inventory` CLI. -Along with this switchover, a backported version of -`ansible-inventory` was provided that supported Ansible versions 2.2 and 2.3. +Along with this switchover, a backported version of `ansible-inventory` was provided, which supported Ansible versions 2.2 and 2.3. ### Removal of Backport -In AWX 3.0.0 (and Tower 3.5), the backport of `ansible-inventory` -was removed, and support for using custom virtual environments was added. -This set the minimum version of Ansible necessary to run _any_ -inventory update to 2.4. +In AWX 3.0.0 (and Tower 3.5), the backport of `ansible-inventory` was removed, and support for using custom virtual environments was added. This set the minimum version of Ansible necessary to run _any_ inventory update to 2.4. ## Inventory Plugin Versioning -Beginning in Ansible 2.5, inventory sources in Ansible started migrating -away from "contrib" scripts (meaning they lived in the contrib folder) -to the inventory plugin model. +Beginning in Ansible 2.5, inventory sources in Ansible started migrating away from "contrib" scripts (meaning they lived in the contrib folder) to the inventory plugin model. -In AWX 4.0.0 (and Tower 3.5) inventory source types start to switchover -to plugins, provided that sufficient compatibility is in place for -the version of Ansible present in the local virtualenv. +In AWX 4.0.0 (and Tower 3.5), inventory source types start to switch over to plugins, provided that sufficient compatibility is in place for the version of Ansible present in the local virtualenv.
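Because the plugin-versus-script decision hinges on the Ansible version in the virtualenv that runs the update, it can help to check that version directly. The paths below are only common defaults used for illustration and will differ for custom virtualenv locations:

```
# Default AWX/Tower virtualenv (path may differ per installation)
$ /var/lib/awx/venv/ansible/bin/ansible --version

# A custom virtualenv assigned to an organization, project, or job template
$ /var/lib/awx/venv/my_custom_env/bin/ansible --version
```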
-To see what version the plugin transition will happen, see -`awx/main/models/inventory.py` and look for the source name as a -subclass of `PluginFileInjector`, and there should be an `initial_version` -which is the first version that testing deemed to have sufficient parity -in the content its inventory plugin returns. For example, `openstack` will -begin using the inventory plugin in Ansible version 2.8. -If you run an openstack inventory update with Ansible -2.7.x or lower, it will use the script. +To see what version the plugin transition will happen, see `awx/main/models/inventory.py` and look for the source name as a subclass of `PluginFileInjector`, and there should be an `initial_version` which is the first version that testing deemed to have sufficient parity in the content its inventory plugin returns. For example, `openstack` will begin using the inventory plugin in Ansible version 2.8. If you run an openstack inventory update with Ansible 2.7.x or lower, it will use the script. ### Sunsetting the scripts -Eventually, it is intended that all source types will have moved to -plugins. For any given source, after the `initial_version` for plugin use -is higher than the lowest supported Ansible version, the script can be -removed and the logic for script credential injection will also be removed. +The eventual goal is for all source types to have moved to plugins. For any given source, after the `initial_version` for plugin use is higher than the lowest supported Ansible version, the script can be removed and the logic for script credential injection will also be removed. -For example, after AWX no longer supports Ansible 2.7, the script -`awx/plugins/openstack_inventory.py` will be removed. +For example, after AWX no longer supports Ansible 2.7, the script `awx/plugins/openstack_inventory.py` will be removed. ## Changes to Expect in Imports -An effort was made to keep imports working in the exact same way after -the switchover. However, the inventory plugins are a fundamental rewrite -and many elements of default behavior has changed. These changes also -include many backward incompatible changes. Because of this, what you -get via an inventory import will be a superset of what you get from the script -but will not match the default behavior you would get from the inventory -plugin on the CLI. +An effort was made to keep imports working in the exact same way after the switchover. However, the inventory plugins are a fundamental rewrite and many elements of default behavior have changed. These changes also include many backward-incompatible changes. Because of this, what you get via an inventory import will be a superset of what you get from the script but will not match the default behavior you would get from the inventory plugin on the CLI. -Because inventory plugins add additional variables, if you downgrade Ansible, you should -turn on `overwrite` and `overwrite_vars` to get rid of stale -variables (and potentially groups) no longer returned by the import. +Due to the fact that inventory plugins add additional variables, if you downgrade Ansible, you should turn on `overwrite` and `overwrite_vars` to get rid of stale variables (and potentially groups) no longer returned by the import. ### Changes for Compatibility -Programatically-generated examples of inventory file syntax used in -updates (with dummy data) can be found in `awx/main/tests/data/inventory/scripts`, -these demonstrate the inventory file syntax used to restore old behavior -from the inventory scripts. 
+Programatically-generated examples of inventory file syntax used in updates (with dummy data) can be found in `awx/main/tests/data/inventory/scripts`, these demonstrate the inventory file syntax used to restore old behavior from the inventory scripts. -#### hostvar keys and values +#### Hostvar Keys and Values -More hostvars will appear if the inventory plugins are used. -To maintain backward compatibility, -the old names are added back where they have the same meaning as a -variable returned by the plugin. New names are not removed. +More hostvars will appear if the inventory plugins are used. To maintain backward compatibility, the old names are added back where they have the same meaning as a variable returned by the plugin. New names are not removed. A small number of hostvars will be lost because of general deprecation needs. -#### Host names +#### Host Names -In many cases, the host names will change. In all cases, accurate host -tracking will still be maintained via the host `instance_id`. -(after: https://github.com/ansible/awx/pull/3362) +In many cases, the host names will change. In all cases, accurate host tracking will still be maintained via the host `instance_id`. (after: https://github.com/ansible/awx/pull/3362) -## How do I write my own Inventory File? +## Writing Your Own Inventory File -If you do not want any of this compatibility-related functionality, then -you can add an SCM inventory source that points to your own file. -You can also apply a credential of a `managed_by_tower` type to that inventory -source that matches the credential you are using, as long as that is -not `gce` or `openstack`. +If you do not want any of this compatibility-related functionality, then you can add an SCM inventory source that points to your own file. You can also apply a credential of a `managed_by_tower` type to that inventory source that matches the credential you are using, as long as it is not `gce` or `openstack`. -All other sources provide _secrets_ via environment variables, so this -can be re-used without any problems for SCM-based inventory, and your -inventory file can be used securely to specify non-sensitive configuration -details such as the keyed_groups to provide, or hostvars to construct. +All other sources provide _secrets_ via environment variables. These can be re-used without any problems for SCM-based inventory, and your inventory file can be used securely to specify non-sensitive configuration details such as the `keyed_groups` (to provide) or hostvars (to construct). ## Notes on Technical Implementation of Injectors -For an inventory source with a given value of the `source` field that is -of the built-in sources, a credential of the corresponding -credential type is required in most cases (exception being ec2 IAM roles). -This privileged credential is obtained by the method `get_cloud_credential`. +For an inventory source with a given value of the `source` field that is of the built-in sources, a credential of the corresponding credential type is required in most cases (ec2 IAM roles are an exception). This privileged credential is obtained by the method `get_cloud_credential`. -The `inputs` for this credential constitute one source of data for running -inventory updates. The following fields from the -`InventoryUpdate` model are also data sources, including: +The `inputs` for this credential constitute one source of data for running inventory updates. 
The following fields from the `InventoryUpdate` model are also data sources: - `source_vars` - `source_regions` - `instance_filters` - `group_by` -The way these data are applied to the environment (including files and -environment vars) is highly dependent on the specific source. +The way this data is applied to the environment (including files and environment vars) is highly dependent on the specific source. -With plugins, the inventory file may reference files that contain secrets -from the credential. With scripts, typically an environment variable -will reference a filename that contains a ConfigParser format file with -parameters for the update, and possibly including fields from the credential. +With plugins, the inventory file may reference files that contain secrets from the credential. With scripts, typically an environment variable will reference a filename that contains a ConfigParser format file with parameters for the update, and possibly including fields from the credential. -Caution: Please do not put secrets from the credential into the -inventory file for the plugin. Right now there appears to be no need to do -this, and by using environment variables to specify secrets, this keeps -open the possibility of showing the inventory file contents to the user -as a latter enhancement. +**Caution:** Please do not put secrets from the credential into the inventory file for the plugin. Right now there appears to be no need to do this, and by using environment variables to specify secrets, this keeps open the possibility of showing the inventory file contents to the user as a later enhancement. -Logic for setup for inventory updates using both plugins and scripts live -inventory injector class, specific to the source type. +Logic for setting up inventory updates using both plugins and scripts lives in the inventory injector class, specific to the source type. -Any credentials which are not source-specific will use the generic -injection logic which is also used in playbook runs. +Any credentials which are not source-specific will use the generic injection logic which is also used in playbook runs. diff --git a/docs/notification_system.md b/docs/notification_system.md index 61d28033a1..e3819f44ef 100644 --- a/docs/notification_system.md +++ b/docs/notification_system.md @@ -1,32 +1,35 @@ # Notification System Overview -A Notifier is an instance of a notification type (Email, Slack, Webhook, etc) with a name, description, and a defined configuration (A few examples: Username, password, server, recipients for the Email type. Token and list of channels for Slack. Url and Headers for webhooks) +A Notification Template is an instance of a notification type (Email, Slack, Webhook, etc.) with a name, description, and a defined configuration. A few examples include: -A Notification is a manifestation of the Notifier... for example, when a job fails a notification is sent using the configuration defined by the Notifier. +* Username, password, server, recipients for the Email type. +* Token and list of channels for Slack. +* URL and Headers for webhooks. -This PR implements the Notification system as outlined in the 3.0 Notifications spec.
At a high level the typical flow is: +At a high level, the typical notification task flow is: -* User creates a Notifier at `/api/v1/notifiers` -* User assigns the notifier to any of the various objects that support it (all variants of job templates as well as organizations and projects) and at the appropriate trigger level for which they want the notification (error, success, or any). For example a user may wish to assign a particular Notifier to trigger when `Job Template 1` fails. In which case they will associate the notifier with the job template at `/api/v1/job_templates/n/notifiers_error`. +* User creates a `NotificationTemplate` at `/api/v2/notification_templates/`. +* User assigns the notification to any of the various objects that support it (all variants of Job Templates as well as organizations and projects) and at the appropriate trigger level for which they want the notification (error, success, or any). For example, a user may wish to assign a particular Notification Template to trigger when `Job Template 1` fails. -## Notifier hierarchy +## Notification Hierarchy -Notifiers assigned at certain levels will inherit notifiers defined on parent objects as such: +Notification templates assigned at certain levels will inherit notifications defined on parent objects as such: -* Job Templates will use notifiers defined on it as well as inheriting notifiers from the Project used by the Job Template and from the Organization that it is listed under (via the Project). -* Project Updates will use notifiers defined on the project and will inherit notifiers from the Organization associated with it. -* Inventory Updates will use notifiers defined on the Organization that it is listed under -* Ad-hoc commands will use notifiers defined on the Organization that the inventory is associated with +* Job Templates will use notifications defined on it as well as inheriting notifications from the Project used by the Job Template and from the Organization that it is listed under (via the Project). +* Project Updates will use notifications defined on the project and will inherit notifications from the Organization associated with it. +* Inventory Updates will use notifications defined on the Organization it is in. +* Ad-hoc commands will use notifications defined on the Organization with which that inventory is associated. ## Workflow -When a job succeeds or fails, the error or success handler will pull a list of relevant notifiers using the procedure defined above. It will then create a Notification object for each one containing relevant details about the job and then **send**s it to the destination (email addresses, slack channel(s), sms numbers, etc). These Notification objects are available as related resources on job types (jobs, inventory updates, project updates), and also at `/api/v1/notifications`. You may also see what notifications have been sent from a notifier by examining its related resources. +When a job succeeds or fails, the error or success handler will pull a list of relevant notifications using the procedure defined above. It will then create a Notification object for each one containing relevant details about the job and then **sends** it to the destination (email addresses, slack channel(s), SMS numbers, etc.). These Notification objects are available as related resources on job types (Jobs, Inventory Updates, Project Updates), and also at `/api/v2/notifications`. You may also see what notifications have been sent from a notifications by examining its related resources. 
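As an illustration of the assignment step described above, hooking an existing notification template to a job template's error trigger is a single association request in the same style as the other association endpoints in these docs; the IDs below are placeholders, and the same pattern should apply to the success and "any" triggers:

```
HTTP POST /api/v2/job_templates/N/notification_templates_error/
{'id': X}
```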
-Notifications can succeed or fail but that will not cause its associated job to succeed or fail. The status of the notification can be viewed at its detail endpoint `/api/v1/notifications/` +Notifications can succeed or fail but that will not cause its associated job to succeed or fail. The status of the notification can be viewed at its detail endpoint: `/api/v2/notifications/` -## Testing Notifiers before using them +## Testing Notifications Before Using Them + +Once a Notification Template is created, its configuration can be tested by utilizing the endpoint at `/api/v2/notification_templates//test` This will emit a test notification given the configuration defined by the notification. These test notifications will also appear in the notifications list at `/api/v2/notifications` -Once a Notifier is created its configuration can be tested by utilizing the endpoint at `/api/v1/notifiers//test` This will emit a test notification given the configuration defined by the Notifier. These test notifications will also appear in the notifications list at `/api/v1/notifications` # Notification Types @@ -34,7 +37,7 @@ The currently defined Notification Types are: * Email * Slack -* Hipchat +* HipChat * Mattermost * Rocket.Chat * Pagerduty @@ -43,179 +46,161 @@ The currently defined Notification Types are: * Webhook * Grafana -Each of these have their own configuration and behavioral semantics and testing them may need to be approached in different ways. The following sections will give as much detail as possible. +Each of these have their own configuration and behavioral semantics and testing them may need to be approached in different ways. The following sections will give as much detail as possible. + ## Email -The email notification type supports a wide variety of smtp servers and has support for ssl/tls connections and timeouts. +The email notification type supports a wide variety of SMTP servers and has support for SSL/TLS connections and timeouts. -### Testing considerations +### Testing Considerations The following should be performed for good acceptance: -* Test plain authentication -* Test SSL and TLS authentication -* Verify single and multiple recipients +* Test plain authentication. +* Test SSL and TLS authentication. +* Verify single and multiple recipients. * Verify message subject and contents are formatted sanely. They should be plaintext but readable. ### Test Service -Either setup a local smtp mail service here are some options: +Set up a local SMTP mail service. Some options are listed below: + +* Postfix service on galaxy: https://galaxy.ansible.com/debops/postfix/ +* Mailtrap has a good free plan that should provide all of the features necessary: https://mailtrap.io/ +* Another option is to use a Docker container: `docker run --network="tools_default" -p 25:25 -e maildomain=mail.example.com -e smtp_user=user:pwd --name postfix -d catatnight/postfix` -* postfix service on galaxy: https://galaxy.ansible.com/debops/postfix/ -* Mailtrap has a good free plan and should provide all of the features we need under that plan: https://mailtrap.io/ ## Slack -Slack is pretty easy to configure, it just needs a token which you can get from creating a bot in the integrations settings for the slack team. +Slack is simple to configure; it requires a token, which you can get from creating a bot in the integrations settings for the Slack team. 
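For reference, a minimal Slack notification template might be created with a request along these lines. This is a sketch: the token, channel, and organization ID are placeholders, and the exact `notification_configuration` keys should be confirmed against the API's options for the Slack type.

```
HTTP POST /api/v2/notification_templates/
{
    "name": "slack-notify-ops",
    "organization": 1,
    "notification_type": "slack",
    "notification_configuration": {
        "token": "<bot-token>",
        "channels": ["#ops"]
    }
}
```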
-### Testing considerations +### Testing Considerations The following should be performed for good acceptance: -* Test single and multiple channels and good formatting of the message. Note that slack notifications only contain the minimal information +* Test single and multiple channels and good formatting of the message. Note that slack notifications only contain the minimal information. -### Test Service - -Any user of the Ansible slack service can create a bot integration (which is how this notification is implemented). Remember to invite the bot to the channel first. - -## Hipchat - -There are several ways to integrate with hipchat. The Tower implementation uses Hipchat "Integrations". Currently you can find this at the bottom right of the main hipchat webview. From there you will select "Build your own Integration". After creating that it will list the `auth_token` that needs to be supplied to Tower. Some other relevant details on the fields accepted by Tower for the Hipchat notification type: - -* `color`: This will highlight the message as the given color. If set to something hipchat doesn't expect then the notification will generate an error, but it's pretty rad. I like green personally. -* `notify`: Selecting this will cause the bot to "notify" channel members. Normally it will just be stuck as a message in the chat channel without triggering anyone's notifications. This option will notify users of the channel respecting their existing notification settings (browser notification, email fallback, etc.) -* `message_from`: Along with the integration name itself this will put another label on the notification. I reckon this would be helpful if multiple services are using the same integration to distinguish them from each other. -* `api_url`: The url of the hipchat api service. If you create a team hosted by them it'll be something like `https://team.hipchat.com`. For a self-hosted service it'll be the http url that is accessible by Tower. - -### Testing considerations - -* Make sure all options behave as expected -* Test single and multiple channels -* Test that notification preferences are obeyed. -* Test formatting and appearance. Note that, like Slack, hipchat will use the minimal version of the notification. -* Test standalone hipchat service for parity with hosted solution - -### Test Service - -Hipchat allows you to create a team with limited users and message history for free, which is easy to set up and get started with. Hipchat contains a self-hosted server also which we should test for parity... it has a 30 day trial but there might be some other way to negotiate with them, redhat, or ansible itself: - -https://www.hipchat.com/server ## Mattermost -The mattermost notification integration uses Incoming Webhooks. A password is not required because the webhook URL itself is the secret. Webhooks must be enabled in the System Console of Mattermost. If the user wishes to allow Ansible Tower notifications to modify the Icon URL and username of the notification then they must enabled these options as well. +The Mattermost notification integration uses Incoming Webhooks. A password is not required because the webhook URL itself is the secret. Webhooks must be enabled in the System Console of Mattermost. If the user wishes to allow Ansible Tower notifications to modify the Icon URL and username of the notification, then they must enabled these options as well. In order to enable these settings in Mattermost: -1. First go to System Console > Integrations > Custom Integrations. 
Check Enable Incoming Webhooks -2. Optionally, go to System Console > Integrations > Custom Integrations. Check "Enable integrations to override usernames" and Check "Enable integrations to override profile picture icons" -3. Go to Main Menu > Integrations > Incoming Webhook. Click "Add Incoming Webhook" -4. Choose a "Display Name", "Description", and Channel. This channel will be overridden if the notification uses the `channel` option +1. Go to System Console > Integrations > Custom Integrations. Check "Enable Incoming Webhooks". +2. Optionally, go to System Console > Integrations > Custom Integrations. Check "Enable integrations to override usernames" and Check "Enable integrations to override profile picture icons". +3. Go to Main Menu > Integrations > Incoming Webhook. Click "Add Incoming Webhook". +4. Choose a "Display Name", "Description", and Channel. This channel will be overridden if the notification uses the `channel` option. -* `url`: The incoming webhook URL that was configured in Mattermost. Notifications will use this URL to POST. +* `url`: This is the incoming webhook URL that was configured in Mattermost. Notifications will use this URL to `POST`. * `username`: Optional. The username to display for the notification. -* `channel`: Optional. Override the channel to display the notification in. Mattermost incoming webhooks are tied to a channel by default, so if left blank then this will use the incoming webhook channel. Note, if the channel does not exist then the notification will error out. +* `channel`: Optional. Override the channel in which to display the notification. Mattermost incoming webhooks are tied to a channel by default, so if left blank then this will use the incoming webhook channel. Note, if the channel does not exist, then the notification will error out. * `icon_url`: Optional. A URL pointing to an icon to use for the notification. -### Testing considerations +### Testing Considerations -* Make sure all options behave as expected -* Test that all notification options are obeyed +* Make sure all options behave as expected. +* Test that all notification options are obeyed. * Test formatting and appearance. Mattermost will use the minimal version of the notification. ### Test Service -* Utilize an existing Mattermost installation or use their docker container here: `docker run --name mattermost-preview -d --publish 8065:8065 mattermost/mattermost-preview` +* Utilize an existing Mattermost installation or use their Docker container here: `docker run --name mattermost-preview -d --publish 8065:8065 mattermost/mattermost-preview` * Turn on Incoming Webhooks and optionally allow Integrations to override usernames and icons in the System Console. + ## Rocket.Chat The Rocket.Chat notification integration uses Incoming Webhooks. A password is not required because the webhook URL itself is the secret. An integration must be created in the Administration section of the Rocket.Chat settings. The following fields are available for the Rocket.Chat notification type: -* `url`: The incoming webhook URL that was configured in Rocket.Chat. Notifications will use this URL to POST. -* `username`: Optional. Change the displayed username from Rocket Cat to specified username +* `url`: The incoming webhook URL that was configured in Rocket.Chat. Notifications will use this URL to `POST`. +* `username`: Optional. Change the displayed username from Rocket.Chat to specified username. * `icon_url`: Optional. A URL pointing to an icon to use for the notification. 
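As with Mattermost, the quickest way to verify an incoming webhook outside of Tower is to `POST` to it directly. A rough sketch is below; the webhook URL is a placeholder, and the optional override keys (username, icon, channel) vary slightly between Mattermost and Rocket.Chat, so only the common `text` field is shown:

```
import requests

hook_url = "https://chat.example.com/hooks/REDACTED"   # placeholder incoming webhook URL

resp = requests.post(hook_url, json={"text": "Tower notification smoke test"})
resp.raise_for_status()   # any 2xx means the webhook accepted the payload
```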
-### Testing considerations +### Testing Considerations -* Make sure that all options behave as expected -* Test that all notification options are obeyed +* Make sure that all options behave as expected. +* Test that all notification options are obeyed. ### Test Service -* Utilize an existing Rocket.Chat installation or use their docker containers from https://rocket.chat/docs/installation/docker-containers/ +* Utilize an existing Rocket.Chat installation or use their Docker containers from https://rocket.chat/docs/installation/docker-containers/ * Create an Incoming Webhook in the Integrations section of the Administration settings ## Pagerduty -Pager duty is a fairly straightforward integration. The user will create an API Key in the pagerduty system (this will be the token that is given to Tower) and then create a "Service" which will provide an "Integration Key" that will be given to Tower also. The other options of note are: +Pagerduty is a fairly straightforward integration. The user will create an API Key in the Pagerduty system (this will be the token that is given to Tower) and then create a "Service" which will provide an "Integration Key" that will also be given to Tower. The other options of note are: -* `subdomain`: When you sign up for the pagerduty account you will get a unique subdomain to communicate with. For instance, if you signed up as "towertest" the web dashboard will be at towertest.pagerduty.com and you will give the Tower API "towertest" as the subdomain (not the full domain). -* `client_name`: This will be sent along with the alert content to the pagerduty service to help identify the service that is using the api key/service. This is helpful if multiple integrations are using the same api key and service. +* `subdomain`: When you sign up for the Pagerduty account, you will get a unique subdomain to communicate with. For instance, if you signed up as "towertest", the web dashboard will be at *towertest.pagerduty.com* and you will give the Tower API "towertest" as the subdomain (not the full domain). +* `client_name`: This will be sent along with the alert content to the Pagerduty service to help identify the service that is using the API key/service. This is helpful if multiple integrations are using the same API key and service. ### Testing considerations -* Make sure the alert lands on the pagerduty service -* Verify that the minimal information is displayed for the notification but also that the detail of the notification contains all fields. Pagerduty itself should understand the format in which we send the detail information. +* Make sure the alert lands on the Pagerduty service. +* Verify that the minimal information is displayed for the notification but also that the detail of the notification contains all fields. Pagerduty itself should understand the format in which we send the detail information. ### Test Service -Pagerduty allows you to sign up for a free trial with the service. We may also have a ansible-wide pagerduty service that we could tie into for other things. +Pagerduty allows you to sign up for a free trial with the service. + ## Twilio -Twilio service is an Voice and SMS automation service. Once you are signed in you'll need to create a phone number from which the message will be sent. You'll then define a "Messaging Service" under Programmable SMS and associate the number you created before with it. Note that you may need to verify this number or some other information before you are allowed to use it to send to any numbers. 
The Messaging Service does not need a status callback url nor does it need the ability to Process inbound messages. +Twilio is a Voice and SMS automation service. Once you are signed in, you'll need to create a phone number from which the message will be sent. You'll then define a "Messaging Service" under Programmable SMS and associate the number (the one you created for this purpose) with it. Note that you may need to verify this number or some other information before you are allowed to use it to send to any numbers. The Messaging Service does not need a status callback URL nor does it need the ability to process inbound messages. -Under your individual (or sub) account settings you will have API credentials. The Account SID and AuthToken are what will be given to Tower. There are a couple of other important fields: +Under your individual (or sub) account settings, you will have API credentials. The Account SID and AuthToken are what will be given to Tower. There are a couple of other important fields: -* `from_number`: This is the number associated with the messaging service above and must be given in the form of "+15556667777" +* `from_number`: This is the number associated with the messaging service above and must be given in the form of "+15556667777". * `to_numbers`: This will be the list of numbers to receive the SMS and should be the 10-digit phone number. -### Testing considerations +### Testing Considerations -* Test notifications with single and multiple recipients +* Test notifications with single and multiple recipients. * Verify that the minimal information is displayed for the notification. Note that this notification type does not display the full detailed notification. ### Test Service -Twilio is fairly straightforward to sign up for but I don't believe it has a free plan, a credit card will be needed to sign up for it though the charges are fairly minimal per message. +Twilio is fairly straightforward to sign up for but there may not be a free trial offered; a credit card will be needed to sign up for it though the charges are fairly minimal per message. + ## IRC -The Tower irc notification takes the form of an IRC bot that will connect, deliver its messages to channel(s) or individual user(s), and then disconnect. The Tower notification bot also supports SSL authentication. The Tower bot does not currently support Nickserv identification. If a channel or user does not exist or is not on-line then the Notification will not fail, the failure scenario is reserved specifically for connectivity. +The Tower IRC notification takes the form of an IRC bot that will connect, deliver its messages to channel(s) or individual user(s), and then disconnect. The Tower notification bot also supports SSL authentication. The Tower bot does not currently support Nickserv identification. If a channel or user does not exist or is not online, then the Notification will not fail; the failure scenario is reserved specifically for connectivity. Connectivity information is straightforward: -* `server`: The host name or address of the irc server -* `port`: The irc server port -* `nickname`: The bot's nickname once it connects to the server -* `password`: IRC servers can require a password to connect. If the server doesn't require one then this should be an empty string -* `use_ssl`: Should the bot use SSL when connecting +* `server`: The host name or address of the IRC server. +* `port`: The IRC server port. +* `nickname`: The bot's nickname once it connects to the server. 
+* `password`: IRC servers can require a password to connect. If the server doesn't require one, then this should be an empty string.
+* `use_ssl`: Whether the bot should use SSL when connecting.
+* `targets`: A list of users and/or channels to send the notification to.

### Test Considerations

-* Test both plain and SSL connectivity
+* Test both plain and SSL connectivity.
* Test single and multiples of both users and channels.

### Test Service

-There are a few modern irc servers to choose from but we should use a fairly full featured service to get good test coverage. I recommend inspircd because it is actively maintained and pretty straightforward to configure.
+There are a few modern IRC servers to choose from. [InspIRCd](http://www.inspircd.org/) is recommended because it is actively maintained and pretty straightforward to configure.
+
## Webhook

-The webhook notification type in Ansible Tower provides a simple interface to sending POSTs to a predefined web service. Tower will POST to this address using `application/json` content type with the data payload containing all relevant details in json format.
-The parameters are pretty straightforward:
+The webhook notification type in Ansible Tower provides a simple interface for sending `POST`s to a predefined web service. Tower will `POST` to this address using the `application/json` content type, with the data payload containing all relevant details in JSON format.

-* `url`: The full url that will be POSTed to
-* `headers`: Headers in json form where the keys and values are strings. For example: `{"Authentication": "988881adc9fc3655077dc2d4d757d480b5ea0e11", "MessageType": "Test"}`
+The parameters are fairly straightforward:
+
+* `url`: The full URL that will be `POST`ed to.
+* `headers`: Headers in JSON form where the keys and values are strings. For example: `{"Authentication": "988881adc9fc3655077dc2d4d757d480b5ea0e11", "MessageType": "Test"}`

### Test Considerations

* Test HTTP service and HTTPS, also specifically test HTTPS with a self signed cert.
-* Verify that the headers and payload are present and that the payload is json and the content type is specifically `application/json`
+* Verify that the headers and payload are present, that the payload is JSON, and that the content type is specifically `application/json`.

### Test Service

@@ -225,39 +210,41 @@ A very basic test can be performed by using `netcat`:

```
netcat -l 8099
```

-and then sending the request to: http://\:8099
+...and then sending the request to: *http://\:8099*

-Note that this won't respond correctly to the notification so it will yield an error. I recommend using a very basic Flask application for verifying the POST request, you can see an example of mine here:
+Note that this won't respond correctly to the notification, so it will yield an error. Using a very basic Flask application for verifying the `POST` request is recommended; you can see an example here:

https://gist.github.com/matburt/73bfbf85c2443f39d272

-This demonstrates how to define an endpoint and parse headers and json content, it doesn't show configuring Flask for HTTPS but this is also pretty straightforward: http://flask.pocoo.org/snippets/111/
+The gist demonstrates how to define an endpoint and parse headers and JSON content.
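A comparable, stripped-down receiver is sketched below. It is illustrative only: the route and port are arbitrary, and it simply echoes whatever Tower sends rather than doing anything useful with it.

```
from flask import Flask, request

app = Flask(__name__)

@app.route("/notify", methods=["POST"])
def notify():
    # Tower POSTs JSON with an application/json content type.
    print("Headers:", dict(request.headers))
    print("Payload:", request.get_json())
    return "", 204   # any 2xx response tells Tower the notification succeeded

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8099)
```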
It doesn't show how to configure Flask for HTTPS, but is fairly straightforward: +http://flask.pocoo.org/snippets/111/ ## Grafana -The Grafana notification type allows you to create Grafana annotations, Details about this feature of Grafana are available at http://docs.grafana.org/reference/annotations/. In order to allow Tower to add annotations an API Key needs to be created in Grafana. Note that the created annotations are region events with start and endtime of the associated Tower Job. The annotation description is also provided by the subject of the associated Tower Job, e.g.: +The Grafana notification type allows you to create Grafana annotations. Details about this feature of Grafana are available at http://docs.grafana.org/reference/annotations/. In order to allow Tower to add annotations, an API Key needs to be created in Grafana. Note that the created annotations are region events with start and endtime of the associated Tower Job. The annotation description is also provided by the subject of the associated Tower Job, for example: + ``` Job #1 'Ping Macbook' succeeded: https://towerhost/#/jobs/playbook/1 ``` The configurable options of the Grafana notification type are: -* `Grafana URL`: The base URL of the Grafana server (required). **Note**: the /api/annotations endpoint will be added automatically to the base Grafana URL. -* `API Key`: The Grafana API Key to authenticate (required) -* `ID of the Dashboard`: To create annotations in a specific Grafana dashboard enter the ID of the dashboard (optional). -* `ID of the Panel`: To create annotations in a specific Panel enter the ID of the panel (optional). -**Note**: If neither dashboardId nor panelId are provided then a global annotation is created and can be queried in any dashboard that adds the Grafana annotations data source. -* `Annotations tags`: List of tags to add to the annotation. One tag per line. -* `Disable SSL Verification`: Disable the verification of the ssl certificate, e.g. when using a self-signed SSL certificate for Grafana. +* `Grafana URL`: Required. The base URL of the Grafana server. **Note**: the `/api/annotations` endpoint will be added automatically to the base Grafana URL. +* `API Key`: Required. The Grafana API Key to authenticate. +* `ID of the Dashboard`: Optional. To create annotations in a specific Grafana dashboard, enter the ID of the dashboard. +* `ID of the Panel`: Optional. To create annotations in a specific Panel, enter the ID of the panel. +**Note**: If neither `dashboardId` nor `panelId` are provided, then a global annotation is created and can be queried in any dashboard that adds the Grafana annotations data source. +* `Annotations tags`: The list of tags to add to the annotation. One tag per line. +* `Disable SSL Verification`: Disable the verification of the SSL certificate, _e.g._, when using a self-signed SSL certificate for Grafana. ### Test Considerations -* Make sure that all options behave as expected -* Test that all notification options are obeyed -* e.g. Make sure the annotation gets created on the desired dashboard and/or panel and with the configured tags +* Make sure that all options behave as expected. +* Test that all notification options are obeyed. +* Make sure the annotation gets created on the desired dashboard and/or panel and with the configured tags. 
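One way to see by hand what such an annotation looks like is to create one directly against Grafana's annotations HTTP API, which is the same `/api/annotations` endpoint Tower appends to the base URL. The sketch below is illustrative only; the host, API key, IDs, and timestamps are placeholders, and the exact field handling may vary by Grafana version:

```
import time
import requests

GRAFANA = "http://grafana.example.com:3000"   # placeholder base URL
API_KEY = "REDACTED"                          # placeholder Grafana API key

now_ms = int(time.time() * 1000)
resp = requests.post(
    f"{GRAFANA}/api/annotations",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "dashboardId": 1,          # optional; omit both IDs for a global annotation
        "panelId": 2,              # optional
        "time": now_ms - 60_000,   # region start (job start)
        "timeEnd": now_ms,         # region end (job finish)
        "tags": ["tower", "test"],
        "text": "Job #1 'Ping Macbook' succeeded: https://towerhost/#/jobs/playbook/1",
    },
)
print(resp.status_code, resp.json())
```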
### Test Service

-* Utilize an existing Grafana installation or use their docker containers from http://docs.grafana.org/installation/docker/
-* Create an API Key in the Grafana configuration settings
-* (Optional) Lookup dashboardId and/or panelId if needed
-* (Optional) define tags for the annotation
+* Utilize an existing Grafana installation or use their Docker containers from http://docs.grafana.org/installation/docker/.
+* Create an API Key in the Grafana configuration settings.
+* (Optional) Look up `dashboardId` and/or `panelId` if needed.
+* (Optional) Define tags for the annotation.
diff --git a/docs/task_manager_system.md b/docs/task_manager_system.md
index 5f9cd9a09d..a28f947d5f 100644
--- a/docs/task_manager_system.md
+++ b/docs/task_manager_system.md
@@ -1,63 +1,85 @@
# Task Manager Overview

-The task manager is responsible for deciding when jobs should scheduled to run. When choosing a task to run the considerations are: (1) creation time, (2) job dependency, (3) capacity.
+The task manager is responsible for deciding when jobs should be scheduled to run. When choosing a task to run, the considerations are:
+1. Creation time
+2. Job dependencies
+3. Capacity

-Independent jobs are ran in order of creation time, earliest first. Jobs with dependencies are also ran in creation time order within the group of job dependencies. Capacity is the final consideration when deciding to release a job to be ran by the task dispatcher.
+Independent jobs are run in order of creation time, earliest first. Jobs with dependencies are also run in creation time order within the group of job dependencies. Capacity is the final consideration when deciding to release a job to be run by the task dispatcher.

## Task Manager Architecture

-The task manager has a single entry point, `Scheduler().schedule()`. The method may be called in parallel, at any time, as many times as the user wants. The `schedule()` function tries to aquire a single, global, lock using the Instance table first record in the database. If the lock cannot be aquired the method returns. The failure to aquire the lock indicates that there is another instance currently running `schedule()`.
+The task manager has a single entry point, `Scheduler().schedule()`. The method may be called in parallel, at any time, as many times as the user wants. The `schedule()` function tries to acquire a single, global lock using the first record of the Instance table in the database. If the lock cannot be acquired, the method returns. The failure to acquire the lock indicates that there is another instance currently running `schedule()`.
+
+### Hybrid Scheduler: Periodic + Event
+The `schedule()` function is run (a) periodically by a background task and (b) on job creation or completion. The task manager system would behave correctly if it ran, exclusively, via (a) or (b).
+
+`schedule()` is triggered via both mechanisms because of the following properties:
+1. It reduces the time from launch to running, resulting in a better user experience.
+2. It is a fail-safe in case we miss code paths, in the present and future, that change the scheduling considerations for which we should call `schedule()` (_e.g._, adding new nodes to Tower changes the capacity, or obscure job error handling that fails a job).
+
+Empirically, the periodic task manager has been effective in the past and will continue to be relied upon with the added event-triggered `schedule()`.
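For illustration, one way the "single global lock on the first Instance record" idea can be expressed with the Django ORM is sketched below. This is a simplified stand-in rather than AWX's actual implementation; the `Instance` import path and the `_do_scheduling_pass()` helper are assumptions.

```
from django.db import DatabaseError, transaction

from awx.main.models import Instance   # assumed import path


def schedule():
    try:
        with transaction.atomic():
            # Lock the first Instance row; nowait=True makes a concurrent caller
            # fail immediately instead of queueing behind the lock.
            Instance.objects.select_for_update(nowait=True).order_by("id").first()
            _do_scheduling_pass()
    except DatabaseError:
        # Another instance is already running schedule(); just return.
        return


def _do_scheduling_pass():
    ...  # examine pending jobs, check blocking and capacity, publish to the queue
```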
-### Hybrid Scheduler: Periodic + Event -The `schedule()` function is ran (a) periodically by a background task and (b) on job creation or completion. The task manager system would behave correctly if ran, exclusively, via (a) or (b). We chose to trigger `schedule()` via both mechanisms because of the nice properties I will now mention. (b) reduces the time from launch to running, resulting a better user experience. (a) is a fail-safe in case we miss code-paths, in the present and future, that change the 3 scheduling considerations for which we should call `schedule()` (i.e. adding new nodes to tower changes the capacity, obscure job error handling that fails a job) - Emperically, the periodic task manager has served us well in the past and we will continue to rely on it with the added event-triggered `schedule()`. - ### Scheduler Algorithm + * Get all non-completed jobs, `all_tasks` * Detect finished workflow jobs * Spawn next workflow jobs if needed - * For each pending jobs; start with oldest created job - * If job is not blocked, and there is capacity in the instance group queue, then mark the as `waiting` and submit the job to RabbitMQ. - + * For each pending job, start with the oldest created job + * If the job is not blocked, and there is capacity in the instance group queue, then mark it as `waiting` and submit the job to RabbitMQ. + + ### Job Lifecycle + | Job Status | State | |:----------:|:------------------------------------------------------------------------------------------------------------------:| -| pending | Job launched.
1. Hasn't yet been seen by the scheduler <br>2. Is blocked by another task <br>3. Not enough capacity |
+| pending | Job has been launched. <br>1. Hasn't yet been seen by the scheduler <br>2. Is blocked by another task <br>3. Not enough capacity |
| waiting | Job published to an AMQP queue.
-| running | Job running on a Tower node.
-| successful | Job finished with ansible-playbook return code 0. |
-| failed | Job finished with ansible-playbook return code other than 0. |
+| running | Job is running on a Tower node. |
+| successful | Job finished with `ansible-playbook` return code 0. |
+| failed | Job finished with `ansible-playbook` return code other than 0. |
| error | System failure. |
+
+
### Node Affinity Decider

-The Task Manager decides what exact node a job will run on. It does so by considering user-configured (1) group execution policy and (2) capacity. First, the set of groups on which a job _can_ run on is constructed (see clustering.md). The groups are traversed until a node within that group is found. The node with the largest remaining capacity that is idle is chosen first. If there are no idle nodes, then the node with the largest remaining capacity >= the job capacity requirements is chosen.
+
+The Task Manager decides which exact node a job will run on. It does so by considering user-configured group execution policy and user-configured capacity. First, the set of groups on which a job _can_ run is constructed (see the AWX document on [Clustering](https://github.com/ansible/awx/blob/devel/docs/clustering.md)). The groups are traversed until a node within that group is found. The node with the largest remaining capacity that is idle is chosen first. If there are no idle nodes, then the node with the largest remaining capacity greater than or equal to the job capacity requirements is chosen (a simplified sketch of this rule appears below, just before the Task Manager Rules).
+
## Code Composition
-
-The main goal of the new task manager is to run in our HA environment. This translates to making the task manager logic run on any tower node. To support this we need to remove any reliance on state between task manager schedule logic runs. We had a secondary goal in mind of designing the task manager to have limited/no access to the database for the future federation feature. This secondary requirement combined with performance needs led us to create partial models that wrap dict database model data.
+
+The main goal of the new task manager is to run in our HA environment. This translates to making the task manager logic run on any Tower node. To support this, we need to remove any reliance on state between task manager schedule logic runs. A secondary goal is to design the task manager to have limited/no access to the database for a future federation feature. This secondary requirement, combined with performance needs, led to the creation of partial models that wrap dict database model data.
+
### Blocking Logic
-
-The blocking logic is handled by a mixture of ORM instance references and task manager local tracking data in the scheduler instance
+
+The blocking logic is handled by a mixture of ORM instance references and task manager local tracking data in the scheduler instance.
+
## Acceptance Tests

-The new task manager should, basically, work like the old one. Old task manager features were identified and new ones discovered in the process of creating the new task manager. Rules for the new task manager behavior are iterated below. Testing should ensure that those rules are followed.
+The new task manager should, in essence, work like the old one. Old task manager features were identified while new ones were discovered in the process of creating the new task manager. Rules for the new task manager behavior are iterated below; testing should ensure that those rules are followed.
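Returning to the Node Affinity Decider above, the selection rule can be sketched roughly as follows. The `nodes`, `remaining_capacity`, and `jobs_running` attributes are hypothetical names used for illustration, not AWX's real fields:

```
def choose_node(preferred_groups, capacity_required):
    """Pick a node per the decider described above (simplified)."""
    for group in preferred_groups:                # groups in policy order
        nodes = sorted(group.nodes, key=lambda n: n.remaining_capacity, reverse=True)

        # Prefer the idle node with the most remaining capacity...
        for node in nodes:
            if node.jobs_running == 0:
                return node

        # ...otherwise the node with the most headroom that still fits the job.
        for node in nodes:
            if node.remaining_capacity >= capacity_required:
                return node

    return None   # nothing fits; the job stays pending
```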
+ ### Task Manager Rules + * Groups of blocked tasks run in chronological order -* Tasks that are not blocked run whenever there is capacity available in the instance group they are set to run in*** - * ***1 job is always allowed to run per instance group, even if there isn't enough capacity. -* Only 1 Project Updates for a Project may be running -* Only 1 Inventory Update for an Inventory Source may be running -* For a related Project, only a Job xor Project Update may be running -* For a related Inventory, only a Job xor Inventory Update(s) may be running -* Only 1 Job for a Job Template may be running** - * **allow_simultaneous feature relaxes this condition -* Only 1 System Job may be running +* Tasks that are not blocked run whenever there is capacity available in the instance group that they are set to run in (one job is always allowed to run per instance group, even if there isn't enough capacity) +* Only one Project Update for a Project may be running at a time +* Only one Inventory Update for an Inventory Source may be running at a time +* For a related Project, only a Job xor Project Update may be running at a time +* For a related Inventory, only a Job xor Inventory Update(s) may be running at a time +* Only one Job for a Job Template may be running at a time (the `allow_simultaneous` feature relaxes this condition) +* Only one System Job may be running at a time + ### Update on Launch Logic -Feature in Tower where dynamic inventory and projects associated with Job Templates may be set to invoke and update when related Job Templates are launch. Related to this feature is a cache feature on dynamic inventory updates and project updates. The rules for these two intertwined features are below. -* projects marked as update on launch should trigger a project update when a related job template is launched -* inventory sources marked as update on launch should trigger an inventory update when a related job template is launched -* spawning of project update and/or inventory updates should **not** be triggered when a related job template is launched **IF** there is an update && the last update finished successfully && the finished time puts the update within the configured cache window. -* **Note:** Update on launch spawned jobs (i.e. InventoryUpdate and ProjectUpdate) are considered dependent jobs. The `launch_type` is `dependent`. If a `dependent` jobs fails then the dependent job should also fail. -Example permutations of blocking: https://docs.google.com/a/redhat.com/document/d/1AOvKiTMSV0A2RHykHW66BZKBuaJ_l0SJ-VbMwvu-5Gk/edit?usp=sharing +This is a feature in Tower where dynamic inventory and projects associated with Job Templates may be set to invoke and update when related Job Templates are launched. Related to this feature is a cache feature on dynamic inventory updates and project updates. The rules for these two intertwined features are below: + +* Projects marked as `update on launch` should trigger a project update when a related job template is launched. +* Inventory sources marked as `update on launch` should trigger an inventory update when a related job template is launched. +* Spawning of project updates and/or inventory updates should **not** be triggered when a related job template is launched **IF** there is an update && the last update finished successfully && the finished time puts the update within the configured cache window. 
+* **Note:** `update on launch` spawned jobs (_i.e._, InventoryUpdate and ProjectUpdate) are considered dependent jobs; in other words, the `launch_type` is `dependent`. If a `dependent` job fails, then everything related to it should also fail.
+
+For example permutations of blocking, take a look at this [Task Manager Dependency Rules and Permutations](https://docs.google.com/a/redhat.com/document/d/1AOvKiTMSV0A2RHykHW66BZKBuaJ_l0SJ-VbMwvu-5Gk/edit?usp=sharing) doc.
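As a final illustration, the cache-window rule above boils down to a check along these lines; the `status`/`finished` attributes and the timeout parameter are illustrative names, not AWX's exact fields:

```
from datetime import datetime, timedelta, timezone


def needs_dependent_update(last_update, cache_timeout_seconds):
    """Return True if a new project/inventory update should be spawned."""
    if last_update is None or last_update.status != "successful":
        return True   # no usable previous update, so spawn a dependent update

    age = datetime.now(timezone.utc) - last_update.finished
    # Within the cache window: reuse the previous successful update instead.
    return age > timedelta(seconds=cache_timeout_seconds)
```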