TCP socket poisoned by agent restart #99

Open

brandondahler opened this issue Apr 20, 2022 · 3 comments

Labels: bug (Something isn't working), help wanted (Extra attention is needed)

Comments

brandondahler commented Apr 20, 2022

Problem

When using EMF in an agent-based environment with a TCP endpoint, all metrics published after an agent restart are silently dropped, provided at least one metric was published before the restart.

Reproduction

Minimal Code

import java.util.concurrent.ExecutionException;
import software.amazon.cloudwatchlogs.emf.config.Configuration;
import software.amazon.cloudwatchlogs.emf.config.EnvironmentConfigurationProvider;
import software.amazon.cloudwatchlogs.emf.environment.Environment;
import software.amazon.cloudwatchlogs.emf.environment.EnvironmentProvider;
import software.amazon.cloudwatchlogs.emf.environment.Environments;
import software.amazon.cloudwatchlogs.emf.logger.MetricsLogger;

public class Program {

  public static void main(String[] args) throws InterruptedException, ExecutionException {
    // Force the Agent environment so the TCP sink is used.
    final Configuration config = EnvironmentConfigurationProvider.getConfig();
    config.setEnvironmentOverride(Environments.Agent);

    final Environment environment = new EnvironmentProvider()
        .resolveEnvironment()
        .get();

    // First metric: flushed while the agent socket is still up, so it is delivered.
    final MetricsLogger metricsLogger = new MetricsLogger(environment);
    metricsLogger.putMetric("Test", 1.0d);
    metricsLogger.flush();

    System.out.println("Sleeping, restart agent socket now...");
    Thread.sleep(10_000);

    // Second metric: flushed after the agent was restarted; it is silently dropped.
    final MetricsLogger metricsLogger2 = new MetricsLogger(environment);
    metricsLogger2.putMetric("Test2", 1.0d);
    metricsLogger2.flush();

    environment.getSink()
        .shutdown()
        .join();
  }

}

Steps

  1. Start a fake agent listener with netcat - nc -l 127.0.0.1 25888 -k
  2. Execute the provided minimal code as a command line program in a separate terminal
  3. When "Sleeping, restart agent socket now..." is printed, Ctrl-C the netcat command and run it again - nc -l 127.0.0.1 25888 -k
  4. Wait for the provided code to finish

Expected behavior

Two metric entries are printed in the netcat terminal:

$ nc -l 127.0.0.1 25888 -k
{"_aws":{"Timestamp":1650461190933,"CloudWatchMetrics":[{"Namespace":"aws-embedded-metrics","Metrics":[{"Name":"Test","Unit":"None"}],"Dimensions":[["LogGroup","ServiceName","ServiceType"]]}],"LogGroupName":"Unknown-metrics"},"LogGroup":"Unknown-metrics","ServiceName":"Unknown","Test":1.0,"ServiceType":"Unknown"}
^C
$ nc -l 127.0.0.1 25888 -k
{"_aws":{"Timestamp":1650461201045,"CloudWatchMetrics":[{"Namespace":"aws-embedded-metrics","Metrics":[{"Name":"Test2","Unit":"None"}],"Dimensions":[["LogGroup","ServiceName","ServiceType"]]}],"LogGroupName":"Unknown-metrics"},"LogGroup":"Unknown-metrics","ServiceName":"Unknown","ServiceType":"Unknown","Test2":1.0}

Actual behavior

The second metric entry is never received by netcat:

$ nc -l 127.0.0.1 25888 -k
{"_aws":{"Timestamp":1650460771166,"CloudWatchMetrics":[{"Namespace":"aws-embedded-metrics","Metrics":[{"Name":"Test","Unit":"None"}],"Dimensions":[["LogGroup","ServiceName","ServiceType"]]}],"LogGroupName":"Unknown-metrics"},"LogGroup":"Unknown-metrics","ServiceName":"Unknown","Test":1.0,"ServiceType":"Unknown"}
^C
$ nc -l 127.0.0.1 25888 -k

Details

This appears to be caused by how the TCP communication is handled: specifically, there is no mechanism that verifies that a given log event was actually delivered to the CloudWatch agent. TCP sockets (and by extension Java TCP sockets) make no guarantees about when the bytes written to them are actually sent.

If the receiving side closes the socket in a disorderly manner, our Socket won't recognize that the connection is closed until an operation on it exhausts all of the TCP retries, which takes upwards of 15 minutes by default.
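
For illustration only, here is a minimal standalone sketch (not library code; the class name and payloads are made up) of why the drop is silent. It assumes the fake netcat agent from the reproduction steps above:

import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class SilentDropDemo {

  public static void main(String[] args) throws Exception {
    // Connect while the fake agent (nc -l 127.0.0.1 25888 -k) is running.
    final Socket socket = new Socket("127.0.0.1", 25888);
    final OutputStream out = socket.getOutputStream();

    out.write("first message\n".getBytes(StandardCharsets.UTF_8));
    out.flush();

    System.out.println("Restart the netcat listener, then press Enter...");
    System.in.read();

    // This write typically returns without an error even though the original
    // connection is gone: the bytes only reach the kernel's send buffer, and
    // the loss is surfaced (if at all) long after the call has returned.
    out.write("second message\n".getBytes(StandardCharsets.UTF_8));
    out.flush();
    System.out.println("Second write returned normally; netcat never saw it.");

    socket.close();
  }
}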

In general, the recommended way to deal with disorderly disconnects is to have the application protocol acknowledge each message sent. If reading the expected acknowledgement times out (via SO_TIMEOUT), the connection is closed on the client side and the failure is handled by the application.
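
Neither the EMF specification nor the CloudWatch agent protocol defines such an acknowledgement today, so the following is only a hypothetical sketch of the pattern (the "ACK" reply and the AckingSender class are invented for illustration):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;

// Hypothetical: assumes a protocol where the receiver replies "ACK" for every
// message. Neither EMF nor the CloudWatch agent does this today.
class AckingSender {
  private final Socket socket;
  private final OutputStream out;
  private final BufferedReader in;

  AckingSender(String host, int port, int ackTimeoutMillis) throws Exception {
    socket = new Socket(host, port);
    // If no acknowledgement arrives within the timeout, reads throw
    // SocketTimeoutException instead of blocking forever.
    socket.setSoTimeout(ackTimeoutMillis);
    out = socket.getOutputStream();
    in = new BufferedReader(
        new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8));
  }

  boolean send(String message) {
    try {
      out.write((message + "\n").getBytes(StandardCharsets.UTF_8));
      out.flush();
      // Delivery is only trusted once the peer explicitly acknowledges it.
      return "ACK".equals(in.readLine());
    } catch (SocketTimeoutException e) {
      // No acknowledgement in time: treat the connection as dead, close it,
      // and let the caller reconnect and retry the message.
      closeQuietly();
      return false;
    } catch (Exception e) {
      closeQuietly();
      return false;
    }
  }

  private void closeQuietly() {
    try { socket.close(); } catch (Exception ignored) { }
  }
}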

Workaround

To work around this, you can switch to the UDP endpoint by setting AgentEndpoint to udp://127.0.0.1:25888. This still doesn't confirm that events are delivered; however, it does not get poisoned by severed connections.
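
For example (the setter name below is an assumption: it presumes Configuration exposes an agent-endpoint setter alongside the setEnvironmentOverride call used in the reproduction):

// Option 1: set the agent endpoint through the environment (picked up at startup):
//   export AWS_EMF_AGENT_ENDPOINT=udp://127.0.0.1:25888

// Option 2: override it in code before creating the MetricsLogger
// (assumes Configuration exposes a setter for the agent endpoint):
final Configuration config = EnvironmentConfigurationProvider.getConfig();
config.setAgentEndpoint("udp://127.0.0.1:25888");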

@markkuhn markkuhn added the bug Something isn't working label Jul 26, 2022
@markkuhn markkuhn self-assigned this Jul 27, 2022
markkuhn added a commit to markkuhn/aws-embedded-metrics-java that referenced this issue Aug 2, 2022

markkuhn commented Aug 3, 2022

Thanks for the detailed report! The race condition in #100 is not ideal, but the idea of reading from the input stream seems to be the best way to detect unexpected socket closure. We do want to try to reconnect on socket closure instead of throwing an error right away.
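
A rough sketch of that detection idea (this is not the code in the pull request; names are illustrative): the agent never writes back on the connection, so a blocking read only returns when the peer closes or an error occurs, which can be used to trigger a reconnect.

import java.io.InputStream;
import java.net.Socket;

// Rough sketch only: watch a socket for closure by blocking on a read.
final class SocketCloseWatcher {

  static void watch(Socket socket, Runnable onClosed) {
    Thread watcher = new Thread(() -> {
      try {
        InputStream in = socket.getInputStream();
        // The agent never sends data, so read() only returns when the peer
        // closes the connection (-1) or an I/O error occurs.
        if (in.read() == -1) {
          onClosed.run();   // peer closed the connection
        }
      } catch (Exception e) {
        onClosed.run();     // treat I/O errors as a closed connection too
      }
    });
    watcher.setDaemon(true);
    watcher.start();
  }
}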

markkuhn commented Aug 3, 2022

After performing a couple more tests, I found that the TCP client does indeed reconnect on its own (after a minimum of 2 retries), but it still drops packets in the meantime. I've modified the PR to include a retry queue.
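
A minimal sketch of the retry-queue idea (again, not the actual PR code; names are illustrative): messages that fail to send are kept and replayed in order after reconnecting.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Predicate;

// Minimal sketch: keep messages that failed to send and replay them later.
final class RetryQueue {
  private final Deque<String> pending = new ArrayDeque<>();

  // Called when a send fails; the message is kept for a later attempt.
  void enqueue(String message) {
    pending.addLast(message);
  }

  // Called after a successful reconnect; replays pending messages in order.
  // trySend returns true when a message was sent successfully.
  void drain(Predicate<String> trySend) {
    while (!pending.isEmpty()) {
      String next = pending.peekFirst();
      if (!trySend.test(next)) {
        return;             // still failing; keep the rest for next time
      }
      pending.removeFirst();
    }
  }
}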

brandondahler (Author) commented

The race condition in #100 is not ideal

This isn't really something we can avoid -- it will always be possible for us to send a packet that the server receives while the connection is severed before any mechanism confirms the message was delivered. Dealing with this is an inherent issue with network communication in general.

We'd need to solve duplicate sends at the application layer with some idempotency mechanism (which isn't available in the EMF specification or the CloudWatch agent protocol).

markkuhn added a commit to markkuhn/aws-embedded-metrics-java that referenced this issue Aug 17, 2022
markkuhn added a commit to markkuhn/aws-embedded-metrics-java that referenced this issue Aug 19, 2022
markkuhn pushed a commit to markkuhn/aws-embedded-metrics-java that referenced this issue Aug 30, 2022
@markkuhn markkuhn added the help wanted Extra attention is needed label Nov 7, 2022