integrate netobserv to reliability tests #798

Open · wants to merge 9 commits into base: master

Changes from 6 commits
25 changes: 25 additions & 0 deletions reliability-v2/config/qe-index.yaml
@@ -0,0 +1,25 @@
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: qe-app-registry
  namespace: openshift-marketplace
spec:
  displayName: QE Catalog
  image: quay.io/openshift-qe-optional-operators/aosqe-index:v${CLUSTER_VERSION}
  sourceType: grpc
---
apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  name: brew-registry
spec:
  repositoryDigestMirrors:
  - mirrors:
    - brew.registry.redhat.io
    source: registry.redhat.io
  - mirrors:
    - brew.registry.redhat.io
    source: registry.stage.redhat.io
  - mirrors:
    - brew.registry.redhat.io
    source: registry-proxy.engineering.redhat.com
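
The v${CLUSTER_VERSION} tag in the image field is a template placeholder, not literal YAML; it is rendered by the caller. A minimal sketch of how this file is rendered and applied, mirroring the envsubst call added to start.sh later in this diff (the channel value in the comment is illustrative):

    # Derive the minor version from the cluster channel, e.g. "stable-4.14" -> "4.14"
    CLUSTER_VERSION=$(oc get clusterversion/version -o jsonpath='{.spec.channel}' | cut -d'-' -f 2)
    export CLUSTER_VERSION
    envsubst < config/qe-index.yaml | oc apply -f -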
Author:

@qiliRedHat do you want this to be a separate config? I was thinking of having these tasks under the standard reliability tasks so that these checks always run. Of course, that also means the netobserv operator would be installed by default for all reliability runs and optionally excluded; the current implementation in start.sh is the inverse. WDYT? This was the initial idea we had discussed, so that netobserv could piggyback on your standard reliability runs.

If we install netobserv by default, we'd need to bump the standard instance types, or have a configuration that accommodates Loki's resource requirements. A couple of questions here:

  • Do you typically set up infra nodes for reliability runs? If so, what instance types are the infra nodes? If you can bring up that environment, I can check whether Loki fits on the infra nodes.
  • Would you be willing to bump the reliability-run instance types to m5.2xlarge or m5.4xlarge?

Thanks!

Contributor:

@memodi The reliability test has many test profiles; considering the cost, I don't want netobserv on all profiles by default. Usually there are no infra nodes in reliability tests, but reliability has the option to configure them with -i in start.sh; the default size is the same as the workers, and the size can be configured.
If we test it once (7 days) per release, I think we can do m5.2xlarge on one of the profiles.

@@ -29,6 +29,8 @@ reliability:
- oc get project -l purpose=reliability
- func check_nodes
- kubectl get pods -A -o wide | egrep -v "Completed|Running"
- func check_flowcollector
- func check_netobserv_pods
# Run test case as scripts. KUBECONFIG of the current user is set as env variable by reliability-v2.
#- . <path_to_script>/create-delete-pod-ensure-service.sh

@@ -148,4 +150,4 @@ reliability:
AWS_DEFAULT_REGION: us-east-2
AWS_ACCESS_KEY_ID: xxxx
AWS_SECRET_ACCESS_KEY: xxxx
CLOUD_TYPE: aws
CLOUD_TYPE: aws
50 changes: 49 additions & 1 deletion reliability-v2/start.sh
@@ -28,6 +28,8 @@ Usage: $(basename "${0}") [-p <path_to_auth_files>] [-n <folder_name> ] [-t <tim

-h : Help

-o <operator(s) to enable> : Enable optional operators by passing a comma-separated list of operators to install. Optional.

END
}
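
A hypothetical invocation using the new flag (the folder name and run duration below are illustrative):

    # Illustrative only: 7-day reliability run with the netobserv operator enabled
    ./start.sh -p <path_to_auth_files> -n netobserv-reliability -t 7d -o netobserv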

@@ -36,7 +38,7 @@ if [[ "$1" = "" ]];then
exit 1
fi

while getopts ":n:t:p:c:r:iuh" opt; do
while getopts ":n:t:p:c:r:o:iuh" opt; do
case ${opt} in
n)
folder_name=${OPTARG}
@@ -59,6 +61,10 @@ while getopts ":n:t:p:c:r:iuh" opt; do
u)
upgrade=true
;;
o)
operators=${OPTARG}
IFS=',' read -ra operatorsToInstall <<< "$OPTARG"
;;
h)
_usage
exit 1
@@ -186,6 +192,18 @@ function dhms_to_seconds {
echo "Total seconds to run is: $SECONDS_TO_RUN"
}

function setup_netobserv(){
    log "Setting up Network Observability operator"
    git clone https://github.com/openshift-qe/ocp-qe-perfscale-ci.git --branch netobserv-perf-tests
    OCPQE_PERFSCALE_DIR=$PWD/ocp-qe-perfscale-ci
    source ocp-qe-perfscale-ci/scripts/env.sh
    source ocp-qe-perfscale-ci/scripts/netobserv.sh
    deploy_lokistack
    deploy_kafka
    deploy_netobserv
    ceateFlowCollector "-p KafkaConsumerReplicas=6"
}
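
setup_netobserv assumes the deploy_* helpers and the flow collector creation function are all defined by the sourced ocp-qe-perfscale-ci scripts. A defensive sketch (not part of this PR) that fails fast if a helper is missing:

    # Sketch (assumption): verify the sourced scripts defined the expected helpers before calling them
    for fn in deploy_lokistack deploy_kafka deploy_netobserv; do
        declare -F "$fn" > /dev/null || { log "Helper $fn not found after sourcing netobserv scripts"; exit 1; }
    done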

RELIABILITY_DIR=$(cd $(dirname ${BASH_SOURCE[0]});pwd)
SECONDS_TO_RUN=0
start_log=start_$(date +"%Y%m%d_%H%M%S").log
@@ -361,6 +379,24 @@ if [[ $os == "linux" ]]; then
date_end_format=$(date --date=@$timestamp_end)
elif [[ $os == "mac" ]]; then date_end_format=$(date -j -f "%s" $timestamp_end "+%Y-%m-%d %H:%M:%S")
fi

# Configure the QE index image if optional operators need to be deployed
# and call the respective setup function for each operator
if [[ $operators ]]; then
    CLUSTER_VERSION=$(oc get clusterversion/version -o jsonpath='{.spec.channel}' | cut -d'-' -f 2)
    export CLUSTER_VERSION
    log "Setting up QE index image for optional operators"
    envsubst < config/qe-index.yaml | oc apply -f -

    for operator in "${operatorsToInstall[@]}"; do
        if [[ $operator == "netobserv" ]]; then
            setup_netobserv
        fi
    done
fi



log "info" "Reliability test will run $time_to_run. Test will end on $date_end_format. \
If you want to halt the test before that, open another terminal and 'touch halt' under reliability-v2 folder."
log "warning" "DO NOT CTRL+c or terminate this session."
@@ -422,6 +458,18 @@ fi
if [[ -z $tolerance_rate ]]; then
tolerance_rate=1
fi

# Clean up operator setup if operators were installed
if [[ $operators ]]; then
    for operator in "${operatorsToInstall[@]}"; do
        if [[ $operator == "netobserv" ]]; then
            # shellcheck source=reliability-v2/ocp-qe-perfscale-ci/scripts/netobserv.sh
            source ${OCPQE_PERFSCALE_DIR}/scripts/netobserv.sh
            nukeobserv
        fi
    done
fi
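
As a quick post-cleanup check (illustrative; the exact scope of nukeobserv comes from the external netobserv.sh), one could confirm the netobserv namespaces are gone:

    # Sketch (assumption): prints nothing once cleanup has removed both namespaces
    oc get ns netobserv netobserv-privileged --ignore-not-found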

cd $folder_name
if [ ! -f reliability.log ]; then
echo "reliability.log is not found."
39 changes: 39 additions & 0 deletions reliability-v2/tasks/Tasks.py
@@ -399,3 +399,42 @@ def shell_task(self,task,user,group_name):
        task_name=task[start_index+1:]
        self.__log_result(rc,task_name)
        return (result,rc)

    # check flowcollector status for netobserv
    def check_flowcollector(self, user):
        self.logger.info(f"[Task] User {user}: check flowcollector")
        # Check if the flowcollector is Ready
        (result, rc) = oc(
            "get flowcollector --no-headers | grep -v ' Ready'",
            self.__get_kubeconfig(user),
            ignore_log=True,
            ignore_slack=True,
        )
        if rc == 0:
            self.logger.error(f"Flowcollector is not Ready: {result}")
            slackIntegration.error(f"Flowcollector not Ready: {result}")
            rc_return = 1
        elif rc == 1 and result == "":
            self.logger.info("Flowcollector is Ready.")
            rc_return = 0
Contributor:

@memodi You can add an 'else' to cover the rest of the cases (rc == 1 and result != "", or rc != 1) and log the result to see what you get.

        return (result, rc_return)
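
A sketch of the suggested else branch (illustrative, not part of this PR), so that rc_return is always assigned and unexpected output gets logged:

        if rc == 0:
            rc_return = 1  # grep found entries that are not Ready
        elif rc == 1 and result == "":
            rc_return = 0  # nothing matched: flowcollector is Ready
        else:
            # rc == 1 with non-empty output, or any other rc (e.g. oc/kubeconfig failure)
            self.logger.warning(f"Unexpected flowcollector check result: rc={rc}, result={result}")
            rc_return = 1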

    # check netobserv pods health
    def check_netobserv_pods(self, user):
        self.logger.info(f"[Task] User {user}: check pods")
        # Check if netobserv pods are healthy
        for ns in ("netobserv", "netobserv-privileged"):
            (result, rc) = oc(
                f"get pods -n {ns} -o wide --no-headers | grep -v ' Ready'",
                self.__get_kubeconfig(user),
                ignore_log=True,
                ignore_slack=True,
            )
            if rc == 0:
                self.logger.error(f"Some pods are not Ready in {ns} ns: {result}")
                slackIntegration.error(f"Some pods are not Ready in ns {ns}: {result}")
                rc_return = 1
            elif rc == 1 and result == "":
                self.logger.info(f"Pods in ns {ns} are healthy.")
                rc_return = 0
Contributor:

Same comment as above.

        return (result, rc_return)