Running de-client --stop-channel <channel> results in KeyError #275

Closed
shreyb opened this issue Feb 12, 2021 · 4 comments

shreyb commented Feb 12, 2021

While troubleshooting decisionengine_modules issue 200, we found that running

de-client --stop-channel <channel_name>

produced a KeyError in TaskManager.decision_cycle, visible in the logs. This happened for every channel we tested (Nersc, gce, resource_request, and a dummy no-op channel called "test_channel"). Here are two examples of the errors we found, for Nersc and gce respectively:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/decisionengine/framework/taskmanager/TaskManager.py", line 284, in decision_cycle
    a_f['actions'], a_f['newfacts'], data_block_t1)
  File "/usr/lib/python3.6/site-packages/decisionengine/framework/taskmanager/TaskManager.py", line 482, in run_publishers
    publisher.worker.publish(data_block)
  File "/usr/lib/python3.6/site-packages/decisionengine_modules/graphite/publishers/generic_publisher.py", line 46, in publish
    data = data_block[self.consumes()[0]]
  File "/usr/lib/python3.6/site-packages/decisionengine/framework/dataspace/datablock.py", line 359, in __getitem__
    raise KeyError(key)
KeyError: 'Nersc_Figure_Of_Merit'
2021-02-12 11:13:50,887 - root - TaskManager - 29403 - Thread-10 - ERROR - error in decision cycle(publishers) 'GCE_Figure_Of_Merit'
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/decisionengine/framework/taskmanager/TaskManager.py", line 284, in decision_cycle
    a_f['actions'], a_f['newfacts'], data_block_t1)
  File "/usr/lib/python3.6/site-packages/decisionengine/framework/taskmanager/TaskManager.py", line 482, in run_publishers
    publisher.worker.publish(data_block)
  File "/usr/lib/python3.6/site-packages/decisionengine_modules/graphite/publishers/generic_publisher.py", line 46, in publish
    data = data_block[self.consumes()[0]]
  File "/usr/lib/python3.6/site-packages/decisionengine/framework/dataspace/datablock.py", line 359, in __getitem__
    raise KeyError(key)
KeyError: 'GCE_Figure_Of_Merit'

The de-client command returned "OK", so the failure is visible only in the logs.
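
The tracebacks above show the publisher indexing the data block directly with data_block[self.consumes()[0]], which raises when that product is absent. One way a publisher could guard against this is sketched below. This is only an illustration under stated assumptions, not the fix that was committed: FakeDataBlock and GuardedPublisher are hypothetical stand-ins and are not part of decisionengine or decisionengine_modules.

```python
import logging

logger = logging.getLogger("publisher_sketch")


class FakeDataBlock:
    """Hypothetical stand-in for a data block: raises KeyError on missing keys."""

    def __init__(self, products):
        self._products = dict(products)

    def __getitem__(self, key):
        if key not in self._products:
            raise KeyError(key)
        return self._products[key]


class GuardedPublisher:
    """Hypothetical publisher that skips publishing when its input is absent."""

    def __init__(self, consumes):
        self._consumes = list(consumes)
        self.published = []

    def consumes(self):
        return self._consumes

    def publish(self, data_block):
        key = self.consumes()[0]
        try:
            data = data_block[key]
        except KeyError:
            # Missing product (as in the tracebacks above): log and skip
            # instead of letting the exception escape the decision cycle.
            logger.warning("skipping publish: %r not in data block", key)
            return False
        self.published.append(data)
        return True
```

Returning False rather than raising keeps the missing product visible in the log as a warning without producing an error traceback; whether skipping is the right behavior depends on the publisher.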

shreyb commented Feb 12, 2021

Adding some steps to reproduce at @knoepfel's request:

On fermicloud117, this error occurs on every channel we tested (and, we believe, on every channel) when you start the decision engine and then stop any channel:

systemctl start decision-engine
de-client --status  # To verify successful startup
de-client --stop-channel Nersc # But can be any channel

Then look in /var/log/decisionengine/Nersc.log (for the above example). Tailing the log in one terminal (tail -f /var/log/decisionengine/Nersc.log) while running the commands above in another shows that the error occurs when the channel is stopped, and not before.
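
Checking the log by eye can also be scripted. The following is a minimal sketch that assumes the traceback format shown earlier in this issue, where each traceback ends in a line beginning with "KeyError"; the function name find_key_errors is hypothetical.

```python
import re

# Matches the final traceback line, e.g. "KeyError: 'Nersc_Figure_Of_Merit'",
# or a bare "KeyError" with no key attached.
KEY_ERROR_RE = re.compile(r"^KeyError(?:: (?P<key>.*))?$")


def find_key_errors(log_text):
    """Return the key of every KeyError traceback line found in log_text.

    A bare "KeyError" line (no key) is reported as None.
    """
    keys = []
    for line in log_text.splitlines():
        match = KEY_ERROR_RE.match(line.strip())
        if match:
            key = match.group("key")
            keys.append(key.strip("'\"") if key else None)
    return keys
```

To use it, read the channel log (for example /var/log/decisionengine/Nersc.log) into a string after stopping the channel and pass it to find_key_errors; an empty list means no KeyError traceback was logged.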

DmitryLitvintsev commented Feb 23, 2021

DmitryLitvintsev added a commit to HEPCloud/decisionengine_modules that referenced this issue Mar 4, 2021

StevenCTimm commented

As of the current 1.6 release, stopping a channel still throws a KeyError. Whatever was supposed to handle this exception is not working.

2021-03-19T10:56:08-0500 - root - TaskManager - 25328 - Thread-3 - ERROR - error in decision cycle(publishers)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/decisionengine/framework/taskmanager/TaskManager.py", line 289, in decision_cycle
    a_f['actions'], a_f['newfacts'], data_block_t1)
  File "/usr/lib/python3.6/site-packages/decisionengine/framework/taskmanager/TaskManager.py", line 495, in run_publishers
    publisher.worker.publish(data_block)
  File "/usr/lib/python3.6/site-packages/decisionengine_modules/htcondor/publishers/publisher.py", line 125, in publish
    dataframe = datablock.get(key)
  File "/usr/lib/python3.6/site-packages/decisionengine/framework/dataspace/datablock.py", line 277, in get
    return self.__getitem__(key, default=default)
  File "/usr/lib/python3.6/site-packages/decisionengine/framework/dataspace/datablock.py", line 371, in __getitem__
    raise KeyError
KeyError

StevenCTimm commented

With 1.6.1, there are no longer any KeyErrors at shutdown of the decision engine.

Closing.
