fit() function hangs indefinitely with no logging or debugging info #4967
Comments
I think I am currently facing the same issue. I already tried downgrading to an older version of sagemaker, but the behavior stays exactly the same, even though I verified everything stated above. Even with wait=False, execution does not continue beyond fit(). Here is the minimal working example I used for testing. When started, it uploads the source files to the specified S3 bucket, but after that it just stops at fit() without any error message:

import sagemaker
from sagemaker.estimator import Estimator
import boto3
# Configure session and IAM role
region = "eu-central-1"
session = boto3.Session(region_name=region)
sagemaker_session = sagemaker.Session(boto_session=session)
role = sagemaker.get_execution_role()
# Public PyTorch CPU Training Image (PyTorch 2.5.1, Python 3.11)
image_uri = "763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-training:2.5.1-cpu-py311-ubuntu22.04-sagemaker"
# Example test.py from the same directory
# if __name__ == "__main__":
# print("Hello from test.py!")
estimator = Estimator(
    role=role,
    image_uri=image_uri,
    instance_count=1,
    instance_type="ml.t3.xlarge",
    output_path="s3://ffg-bp/minimal-example/output",
    sagemaker_session=sagemaker_session,
    entry_point="test.py",
)
# Start the training job without waiting for completion
estimator.fit(job_name="minimal-example", wait=False, logs="All")
# Print the latest training job name to confirm the job submission
print("Job submitted. Latest training job:", estimator.latest_training_job)

And this is the console output that I receive:
Describe the bug
I'm attempting to run a simple proof-of-concept model training workflow with SageMaker in Python, and I cannot get anything to work. The estimator's fit() function just hangs: no errors, no logs in the console, no DEBUG lines. I've already validated the IAM functions, the S3 inputs, etc., and everything is fine. If I bypass the estimator and create jobs manually with boto3, it works fine (although it is very clunky because of how much code is required).
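For readers comparing the two paths, here is a rough sketch of what the manual boto3 route might look like. It is not the reporter's actual code, which is not shown in the issue. The values are copied from the "Train args" line in the debug output further down; the helper names `build_training_request` and `submit`, and the `us-east-1` region default, are assumptions made for illustration.

```python
def build_training_request(job_name, role_arn):
    """Assemble keyword arguments for the SageMaker CreateTrainingJob API.

    Values mirror the 'Train args' DEBUG line in the issue; the function
    itself is a hypothetical helper, not part of the SageMaker SDK.
    """
    output_path = "s3://pbn-sagemaker-simplemodel-test/output"
    return {
        "TrainingJobName": job_name,
        "RoleArn": role_arn,
        "AlgorithmSpecification": {
            "TrainingImage": "683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.5-1",
            "TrainingInputMode": "File",
        },
        "InputDataConfig": [
            {
                "ChannelName": "train",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": "s3://sagemaker-sample-data-us-east-1/processing/census/census-income.csv",
                        "S3DataDistributionType": "FullyReplicated",
                    }
                },
            }
        ],
        "OutputDataConfig": {"S3OutputPath": output_path},
        "ResourceConfig": {
            "InstanceType": "ml.m5.large",
            "InstanceCount": 1,
            "VolumeSizeInGB": 5,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 86400},
        # The API expects hyperparameter values as strings.
        "HyperParameters": {
            "max_depth": "5",
            "eta": "0.2",
            "objective": "reg:squarederror",
            "num_round": "10",
        },
    }


def submit(request, region="us-east-1"):
    """Send the request with the low-level client (region is an assumption)."""
    import boto3  # imported lazily so the builder above works without AWS access

    client = boto3.client("sagemaker", region_name=region)
    return client.create_training_job(**request)["TrainingJobArn"]
```

Keeping the request builder separate from the network call makes it possible to inspect or diff the payload against the SDK's "Train args" output before anything is sent.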
To reproduce
This is the python script I'm attempting to run
The last line just hangs forever, with nothing happening. No errors. No logs generated in CloudWatch. No debug lines spat out when logging level is set to debug. No jobs being generated in SageMaker. It just fails to do anything at all.
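When a call blocks silently like this, one generic way to see where the process is stuck is the standard-library `faulthandler` module, which can print every thread's stack after a timeout. The sketch below is only a diagnostic aid and not a fix; the context manager name `dump_stack_if_stuck` and the 60-second default are assumptions for illustration.

```python
import faulthandler
import sys
from contextlib import contextmanager


@contextmanager
def dump_stack_if_stuck(timeout_s=60.0, file=sys.stderr):
    """Print all thread stacks to `file` if the wrapped block runs too long.

    Hypothetical helper: `file` must be a real file object with a file
    descriptor and must stay open until the block exits.
    """
    # repeat=True prints the stacks again after every further timeout_s seconds.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=file)
    try:
        yield
    finally:
        # Disarm the watchdog once the block finishes or raises.
        faulthandler.cancel_dump_traceback_later()


# Intended use (not executed here, since it needs the reporter's estimator):
#
#     with dump_stack_if_stuck(60):
#         estimator.fit({"train": train_input})
```

The printed traceback shows which frame inside the SDK or botocore is waiting, which is usually more informative than raising the log level alone.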
Expected behavior
The code would work and a job would get created.
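To confirm independently of the estimator whether a job was ever created, the SageMaker `list_training_jobs` API can be queried by name fragment. This is a minimal sketch; the helper name `find_training_jobs` is hypothetical, and the client is passed in so that the region and credentials stay under the caller's control.

```python
def find_training_jobs(client, name_fragment, max_results=10):
    """Return (name, status) pairs for jobs whose name contains name_fragment.

    `client` is expected to be a boto3 SageMaker client, e.g.
    boto3.client("sagemaker", region_name="us-east-1").
    """
    response = client.list_training_jobs(
        NameContains=name_fragment,
        MaxResults=max_results,
    )
    return [
        (summary["TrainingJobName"], summary["TrainingJobStatus"])
        for summary in response["TrainingJobSummaries"]
    ]
```

An empty list for a fragment such as `"sagemaker-xgboost"` would suggest the hang happens before the CreateTrainingJob request reaches the service.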
Screenshots or logs
This is what the code looks like running with debug level set.
sagemaker.config INFO - Not applying SDK defaults from location: C:\ProgramData\sagemaker\sagemaker\config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: C:\Users\scott\AppData\Local\sagemaker\sagemaker\config.yaml
[12/17/24 06:26:26] INFO Loading cached SSO token for pvcts tokens.py:305
[12/17/24 06:26:28] INFO Ignoring unnecessary instance type: None. image_uris.py:528
DEBUG sagemaker_session found, preparing to emit telemetry... telemetry_logging.py:89
INFO SageMaker Python SDK will collect telemetry to help us better understand our user's needs, diagnose issues, and deliver additional features. To opt out of telemetry, please disable via TelemetryOptOut parameter in SDK defaults config. For more information, refer to https://sagemaker.readthedocs.io/en/stable/overview.html#configuring-and-using-defaults-with-the-sagemaker-python-sdk. telemetry_logging.py:90
DEBUG TelemetryOptOut flag is set to: False telemetry_logging.py:102
DEBUG Train args after processing defaults: {'input_config': [{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-sample-data-us-east-1/processing/census/census-income.csv', 'S3DataDistributionType': 'FullyReplicated'}}, 'ChannelName': 'train'}], 'role': 'arn:aws:iam::<redacted>:role/aws-reserved/sso.amazonaws.com/AWSReservedSSO_AdministratorAccess_<redacted>', 'output_config': {'S3OutputPath': 's3://pbn-sagemaker-simplemodel-test/output'}, 'resource_config': {'VolumeSizeInGB': 5, 'InstanceCount': 1, 'InstanceType': 'ml.m5.large'}, 'stop_condition': {'MaxRuntimeInSeconds': 86400}, 'vpc_config': None, 'input_mode': 'File', 'job_name': 'sagemaker-xgboost-2024-12-17-06-26-28-534', 'hyperparameters': {'max_depth': '5', 'eta': '0.2', 'objective': 'reg:squarederror', 'num_round': '10'}, 'tags': None, 'metric_definitions': None, 'experiment_config': None, 'environment': None, 'enable_network_isolation': False, 'retry_strategy': None, 'image_uri': '683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.5-1', 'debugger_hook_config': {'S3OutputPath': 's3://pbn-sagemaker-simplemodel-test/output', 'CollectionConfigurations': []}, 'profiler_config': {'S3OutputPath': 's3://pbn-sagemaker-simplemodel-test/output', 'DisableProfiler': False}} estimator.py:2513
System information
A description of your system. Please provide:
Additional context
Add any other context about the problem here.