Benchmark performance for xAPI over Kafka bus to Ralph #204

Closed
Tracked by #195
bmtcril opened this issue Mar 7, 2024 · 4 comments

Comments

bmtcril commented Mar 7, 2024

Tests for various Kafka bus configurations.

Remaining tests:

  • Multiple workers with batching

bmtcril commented Apr 4, 2024

Test 1 (e9d952) - 1k rows, no batching

Test system configuration:

  • Tutor version: tutor, version 17.0.2-nightly
  • Aspects version: 0.91.0
  • Environment specifications:
  • Relevant settings
    • RUN_CLICKHOUSE: true
    • RUN_KAFKA_SERVER: true
    • RUN_RALPH: true
    • RUN_SUPERSET: true
    • RUN_VECTOR: false
    • EVENT_ROUTING_BACKEND_BATCHING_ENABLED: False
    • EVENT_BUS_BACKEND: kafka

Load generation specifications:

  • Tool - platform-plugin-aspects load_test_tracking_events management command
  • Exact scripts
    • tutor local run lms ./manage.py lms monitor_load_test_tracking --sleep_time 5 --backend kafka_bus
    • tutor local run cms ./manage.py cms load_test_tracking_events --num_events 1000 --sleep_time 0 --tags kafka 1k local novector nobatch

Data captured for results:

  • Length of run
    • Event generation: 0:00:28.648144
    • Monitoring / total run length: 0:07:26.738801
  • Sleep time: 0
  • Events: 1000
  • Raw stats attached: e9d952_stats.txt

Findings:

  • The consumer could not keep up: at the end of a 425 second run, ClickHouse was 396 seconds behind
  • The Kafka queue grew to 950 pending events before generation stopped, then took 395 seconds to catch up
  • Inserted rows per second: 2.4 - 2.6
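As a rough cross-check of these findings, the figures reported above can be turned into rates. This is only a sketch: the variable names and the `parse_duration` helper are illustrative and not part of platform-plugin-aspects, and it assumes the backlog drained at a roughly constant rate once generation stopped.

```python
from datetime import timedelta


def parse_duration(text: str) -> timedelta:
    """Parse an 'H:MM:SS.ffffff' string as printed in the run lengths above."""
    hours, minutes, seconds = text.split(":")
    return timedelta(hours=int(hours), minutes=int(minutes), seconds=float(seconds))


# Figures copied from the Test 1 results above.
events = 1000
generation = parse_duration("0:00:28.648144")
generation_rate = events / generation.total_seconds()

# Assumption: the 950 pending events drained evenly over the 395 second catch-up.
peak_backlog = 950
catch_up_seconds = 395
drain_rate = peak_backlog / catch_up_seconds

print(f"generation: {generation_rate:.1f} events/s")
print(f"drain:      {drain_rate:.1f} events/s")
```

Under that assumption the drain rate comes out at about 2.4 events per second, which lines up with the reported insert rate, while events were generated at roughly 35 per second.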

bmtcril commented Apr 4, 2024

Test 2 (2eed44) - 1k rows, batch size 10

Test system configuration:

  • Tutor version: tutor, version 17.0.2-nightly
  • Aspects version: 0.91.0
  • Environment specifications:
  • Relevant settings
    • RUN_CLICKHOUSE: true
    • RUN_KAFKA_SERVER: true
    • RUN_RALPH: true
    • RUN_SUPERSET: true
    • RUN_VECTOR: false
    • EVENT_BUS_BACKEND: kafka
    • EVENT_ROUTING_BACKEND_BATCH_SIZE: 10
    • EVENT_ROUTING_BACKEND_BATCHING_ENABLED: True

Load generation specifications:

  • Tool - platform-plugin-aspects load_test_tracking_events management command
  • Exact scripts
    • tutor local run lms ./manage.py lms monitor_load_test_tracking --sleep_time 5 --backend kafka_bus
    • tutor local run cms ./manage.py cms load_test_tracking_events --num_events 1000 --sleep_time 0 --tags kafka 1k local novector batch10

Data captured for results:

  • Length of run
    • Event generation: 0:00:32.620124
    • Monitoring / total run length: 0:01:27.334202
  • Sleep time: 0
  • Events: 1000
  • Raw stats attached: 2eed44_stats.txt

Findings:

  • The consumer was not quite able to keep up: ClickHouse lag grew to 35 seconds on a 65 second run
  • Maximum Kafka queue size was 621
  • Inserted rows per second was 14-18, with a maximum of 22

bmtcril commented Apr 4, 2024

Test 3 (708fb0) - 10k rows, batch size 100

Test system configuration:

  • Tutor version: tutor, version 17.0.2-nightly
  • Aspects version: 0.91.0
  • Environment specifications:
  • Relevant settings
    • RUN_CLICKHOUSE: true
    • RUN_KAFKA_SERVER: true
    • RUN_RALPH: true
    • RUN_SUPERSET: true
    • RUN_VECTOR: false
    • EVENT_BUS_BACKEND: kafka
    • EVENT_ROUTING_BACKEND_BATCH_SIZE: 100
    • EVENT_ROUTING_BACKEND_BATCHING_ENABLED: True

Load generation specifications:

  • Tool - platform-plugin-aspects load_test_tracking_events management command
  • Exact scripts
    • tutor local run lms ./manage.py lms monitor_load_test_tracking --sleep_time 5 --backend kafka_bus
    • tutor local run cms ./manage.py cms load_test_tracking_events --num_events 10000 --sleep_time 0 --tags kafka 10k local novector batch100

Data captured for results:

  • Length of run
    • Event generation: 0:03:28.310482
    • Monitoring / total run length: 0:03:44.309783
  • Sleep time: 0
  • Events: 10,000
  • Raw stats attached: 708fb0_stats.txt

Findings:

  • The consumer was able to keep up: ClickHouse was never more than 6 seconds behind, usually 2-3.
  • Maximum Kafka queue size was 19, which is to be expected on a batch size of 100.
  • Inserted rows per second was consistently 40 in the middle of the test, though never higher
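To compare the three runs side by side, the run lengths reported in each test can be reduced to two numbers: the event generation rate, and how long monitoring kept running after generation finished. This is a sketch only; the `Run` class and its field names are made up for illustration, and the "tail" is simply total run length minus generation time, not a measured lag.

```python
from dataclasses import dataclass


def to_seconds(text: str) -> float:
    """Convert an 'H:MM:SS.ffffff' run length to seconds."""
    hours, minutes, seconds = text.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


@dataclass
class Run:
    label: str
    events: int
    generation: str  # "Event generation" length from the test results
    total: str       # "Monitoring / total run length" from the test results

    @property
    def generation_rate(self) -> float:
        """Events generated per second."""
        return self.events / to_seconds(self.generation)

    @property
    def tail_seconds(self) -> float:
        """Time monitoring continued after generation stopped."""
        return to_seconds(self.total) - to_seconds(self.generation)


runs = [
    Run("Test 1, no batching", 1000, "0:00:28.648144", "0:07:26.738801"),
    Run("Test 2, batch size 10", 1000, "0:00:32.620124", "0:01:27.334202"),
    Run("Test 3, batch size 100", 10000, "0:03:28.310482", "0:03:44.309783"),
]

for run in runs:
    print(f"{run.label}: {run.generation_rate:.1f} events/s generated, "
          f"{run.tail_seconds:.0f} s tail after generation")
```

The tail shrinks sharply as the batch size grows, which matches the findings that only the batch size 100 run let the consumer keep pace with generation.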

bmtcril commented Aug 6, 2024

What we've found in #202 and earlier tests is that insert performance exceeds our ability to generate events up to that ~55 / sec line. Once we have more production information from partners we can determine if we should test to a higher threshold, but for now I'm closing these tasks out.

bmtcril closed this as completed Aug 6, 2024