
EC2 destroy.yml will terminate other instances if no match is found #121

Open
beargiles opened this issue Apr 13, 2023 · 7 comments
Labels: bug (Something isn't working), ec2 (Amazon EC2)

@beargiles
The EC2 driver maintains information about the instances it creates then uses that information to destroy the instance.

This usually isn't a problem when running molecule tests to completion. However, with test-driven development (TDD) you may routinely run the tests only through 'converge', do a bit more work, and then complete the molecule run to see whether the results have improved. There's a significant downside: the longer the instance is running, the greater the risk that another process will terminate it. It might be a local process, or it might be another process interacting with EC2.

When this happens (ouch), the code blocks

    # Merging defaults into a list of dicts is, it turns out, not straightforward
    platforms: >-
      {{ [platform_defaults | dict2items]
           | product(molecule_yml.platforms | map('dict2items') | list)
           | map('flatten', levels=1)
           | list
           | map('items2dict')
           | list }}

    - name: Validate discovered information
      assert:
        that: platform.vpc_id or (subnet_info.results[index].subnets | length > 0)
        quiet: true
      loop: "{{ platforms }}"
      loop_control:
        loop_var: platform
        index_var: index
        label: "{{ platform.name }}"

    - name: Destroy ephemeral EC2 instances
      ec2_instance:
        profile: "{{ item.aws_profile | default(omit) }}"
        region: "{{ item.region | default(omit) }}"
        instance_ids: "{{ instance_config | map(attribute='instance_ids') | flatten }}"
        state: absent
      loop: "{{ platforms }}"
      loop_control:
        label: "{{ item.name }}"
      register: ec2_instances_async
      async: 7200
      poll: 0

will quietly set instance_ids to an empty list instead of failing. With no explicit ids, the module falls back to matching by its other parameters, so the play will match all EC2 instances in that profile and region.

I suspect the playbook originally verified that the instance(s) were still running while building the platforms list, and that the list would have been empty if the instance(s) were no longer running. That's no longer the case.

Fortunately there's a simple fix:

    - ansible.builtin.set_fact:
        instance_ids: "{{ instance_ids | default(instance_config) | map(attribute='instance_ids') }}"
      loop: "{{ platforms }}"

    - amazon.aws.ec2_instance:
        instance_ids: '{{ instance_ids | flatten }}'
        ...
      when: instance_ids is defined and instance_ids[0] is defined

with related updates below. (This isn't the final code - just a demonstration that it's easy to check whether the list will be empty.)
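For completeness, the same idea can be expressed as a lookup-then-destroy pair. This is a sketch only, not the driver's actual code; it uses the `amazon.aws.ec2_instance_info` and `amazon.aws.ec2_instance` modules to verify the recorded ids still exist before destroying anything:

    # Sketch only: verify the recorded ids still exist before destroying anything.
    - name: Look up the instances recorded in instance_config
      amazon.aws.ec2_instance_info:
        instance_ids: "{{ instance_config | map(attribute='instance_ids') | flatten }}"
      register: existing
      when: instance_config | map(attribute='instance_ids') | flatten | length > 0

    - name: Destroy only the instances that were actually found
      amazon.aws.ec2_instance:
        instance_ids: "{{ existing.instances | map(attribute='instance_id') | list }}"
        state: absent
      when:
        - existing is not skipped
        - existing.instances | length > 0

Because `instance_ids` is built from the lookup results rather than passed through blindly, an empty `instance_config` skips both tasks instead of matching every instance in the region.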

PR to follow.

@apatard added the bug and ec2 labels on Apr 14, 2023
@apatard
Member

apatard commented Apr 14, 2023

Hm, sounds bad. Can you please submit a PR?

@beargiles
Author

Still working on it...

@beargiles
Author

I've been able to reproduce the behavior, although I don't recall exactly how I got to it.

There is no run-config.yml file. This might be an important clue when running the 'destroy' task!

instance_config.yml

{}

(I know - this doesn't look very much like yml!)

state.yml

# Molecule managed

---
converged: false
created: false
driver: null
is_parallel: false
prepared: null
run_uuid: eceff0f9-52b6-40ad-8b96-778f43a3a0a8

The really scary play is

    - name: Destroy ephemeral EC2 instances
      ec2_instance:
        profile: "{{ item.aws_profile | default(omit) }}"
        region: "{{ item.region | default(omit) }}"
        instance_ids: "{{ instance_config | map(attribute='instance_ids') | flatten }}"
        state: absent
      loop: "{{ platforms }}"
      loop_control:
        label: "{{ item.name }}"
      register: ec2_instances_async
      async: 7200
      poll: 0

because adding this play immediately before it

    - name: list of instance ids...
      debug:
        msg: "{{ instance_config | map(attribute='instance_ids') | flatten }}"
      loop: "{{ platforms }}"
      loop_control:
        label: "{{ item.name }}"

results in

TASK [list of instance ids...] *************************************************
Thursday 20 April 2023  17:09:21 -0600 (0:00:00.124)       0:00:00.476 ********
[WARNING]: Unable to find '/home/bgiles/.cache/molecule/freeipa/default/run-config.yml' in expected paths (use -vvvvv to see paths)
ok: [localhost] => (item=molecule-test-freeipa) => 
  msg: []

If I didn't immediately 'fail' I would have seen

  msg: 'Unable to terminate instances: An error occurred (OperationNotPermitted) when calling the TerminateInstances operation: The instance ''i-0e19535f5a288e6fc'' may not be terminated. Modify its ''disableApiTermination'' instance attribute and try again.'
  results_file: /home/bgiles/.ansible_async/255023698564.4155266
  started: 1
  stderr: ''
  stderr_lines: <omitted>
  stdout: ''
  stdout_lines: <omitted>
  terminate_failed:
  - i-011529bd170f8d765
  - i-0aaa746642d96f508
  - i-063bd9a60e1168700
  - i-0def08ff5a17bf1d0
  - i-016c8680ff0fb57cf
  - i-0f8848b6cb98e6a5c
  - i-06b46dddb84820d9a
  - i-0aa3b858d938d79ab
  - i-02b479461c88a51f6
  - i-008723f22d5dd9bef
  - i-0e19535f5a288e6fc
  - i-0974455a156207f96
  - i-001732bd76214758a
  - i-01cfb1b4b84002204
  - i-08ca360a86f590fdb
  - i-03df2c30f27a97bae
  - i-06ecb13be3e37f6ac
  - i-03bb783117c63a926
  - i-04b6bde567b32e0bd
  - i-044e8d0655937e5db
  - i-0266cb409a1947325
  - i-0614ac8f419da9021
  - i-06c69c1b5c021d610
  - i-07474f12b6ad658ba
  - i-032b3e77f72f97d1c
  - i-0f32470a91df12f8b
  - i-0868fe1cddc6a94c2
  - i-03f7c088088d2b68b
  - i-0812c02988d05499c
  - i-0865f1cf78a1c4db1
  terminate_success: []

There's an immediate improvement by adding just one line:

    - name: Destroy ephemeral EC2 instances
      ec2_instance:
        profile: "{{ item.aws_profile | default(omit) }}"
        region: "{{ item.region | default(omit) }}"
        instance_ids: "{{ instance_config | map(attribute='instance_ids') | flatten }}"

        # --- start of new code ---
        subnet_id: "{{ item.vpc_subnet_id }}"
        # --- end of new code ---

        state: absent
      loop: "{{ platforms }}"
      loop_control:
        label: "{{ item.name }}"
      register: ec2_instances_async
      async: 7200
      poll: 0

This only works if the molecule tests are run in a dedicated subnet... but that would be a Good Thing anyway. It could still cause problems if there's permanent infrastructure in the same subnet, e.g., a centralized log server.

It looks like this might work - but it will require changes to at least one other play.

    - name: Destroy ephemeral EC2 instances
      ec2_instance:
        profile: "{{ item.aws_profile | default(omit) }}"
        region: "{{ item.region | default(omit) }}"
        instance_ids: "{{ instance_ids }}"
        state: absent
      loop: "{{ platforms }}"
      loop_control:
        label: "{{ item.name }}"
 
      # --- start of new code ---
      vars:
        instance_ids: "{{ instance_config | map(attribute='instance_ids') | flatten }}"
      when:
        - instance_ids is defined
        - instance_ids | length > 0
      # --- end of new code ---

      register: ec2_instances_async
      async: 7200
      poll: 0

Finally, something else I've been looking at is modifying the plays that create ephemeral keys, security groups, and instances so they include one or two additional tags:

  • molecule-instance: '{{ platform.name }}' (only for keys and security groups - EC2 already has this as 'instance')
  • molecule-run-id: '{{ run_config.run_id }}'

And the 'run_config.yml' file also captures the run_id.
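As a sketch of the tagging idea (tag names as proposed above; `run_config.run_id` is assumed to be loaded from run-config.yml, the key name is illustrative, and this assumes a collection version whose `ec2_key` module supports `tags`):

    # Sketch only: tag ephemeral resources so they can be traced to this run.
    - name: Create the ephemeral key pair with traceable tags
      amazon.aws.ec2_key:
        name: "{{ platform.name }}-key"
        tags:
          molecule-instance: "{{ platform.name }}"
          molecule-run-id: "{{ run_config.run_id }}"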

There are three reasons for this:

  • it's a solid way to identify ephemeral resources in searches
  • it's a solid way to identify orphaned resources
  • it's a solid way to identify resources to destroy when there's a fatal error in the create.yml playbook

(I have a ton of orphaned keys and security groups...)

The search can still include criteria like the expected subnet, etc.

I've also tried adding any ephemeral keys and groups to the run_config.yml file.

The biggest downside is that you now need to delete the ephemeral keys if there was a failure in the create playbook. Otherwise the next run will reuse the existing key - which means it won't be automatically deleted later if you rely on just the run-id value.
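With those tags in place, finding leftovers becomes a tag search. Roughly (a sketch; `run_config.run_id` is an assumption about how the run id is exposed):

    # Sketch only: find everything tagged with this run's id.
    - name: Find instances left over from this molecule run
      amazon.aws.ec2_instance_info:
        filters:
          "tag:molecule-run-id": "{{ run_config.run_id }}"
          instance-state-name: ["running", "stopped"]
      register: leftovers

    - name: Show what would be destroyed
      ansible.builtin.debug:
        msg: "{{ leftovers.instances | map(attribute='instance_id') | list }}"

The same `tag:molecule-run-id` filter would work for keys and security groups via their respective `_info` modules.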

@beargiles
Author

I forgot to mention - I added a bit more code when adding the tags to the EC2 instance. Basically I split 'platform_generated_tags' into two.

The filtered tags include what's already present - {{ platform.name }}.

The unfiltered tags include the molecule-run-id mentioned above.

I also added the vpc-id, image-id, and key-name values to the filter since those are the default values. The first two should be safe; the third might not be with ephemeral keys. It won't change the results, but it might be a little more efficient on the AWS side.

....

Also - I used 'molecule-run-id' instead of just 'run-id' since it's possible that something else might be creating keys and security groups with just 'run-id'. Adding 'molecule-' should make it unique.

@shubham823

Please resolve this??

@beargiles
Author

Sorry I dropped this - I got pulled into other tasks... then laid off. The change set for the unfiltered tag was large enough that I knew it would take some time to prepare it for review.

In addition, the more I looked at the security issues and our needs (which potentially include S3, SQS, EKS, etc.), the more I felt a better approach for us would be a Lambda function. The molecule driver would just perform input validation and then call the Lambda function for the actual work.

This solves the security issues, since the Lambda function holds the permissions (via its IAM role), not the process calling it.

It makes it easier to create a default security group restricted to the host running the molecule tests. Molecule tests running on EC2 instances usually only see their internal IP address; but if you use a dedicated VPC for your tests and haven't set up a gateway, you would need to use the public IP address.

It solves the issue of accidental deletions since the lambda function could maintain a small database of the instances it creates and then check that list before deleting anything.

It makes it easier to add support for additional resources, since the Lambda function is written in Java (or Python, etc.) instead of Ansible.

Finally it lets you update the implementation without changing any existing tests. (Once they use the lambda function.)

The Lambda-based approach should probably be an 'aws' driver, not an 'ec2' driver, to avoid confusion and to reflect the intention to support much more than just EC2.

@danielpodwysocki
Contributor

Just a note, this looks like it has been fixed via: #170

A fix for anyone still encountering it would be to update the plugin and delete -> re-generate the destroy scenarios under molecule/ in their roles directory.
