---
title: Troubleshooting Deployment Problems
owner: Ops Manager
---
This guide provides assistance with resolving issues encountered when deploying products with <%= vars.ops_manager %>.
## <a id='retry'></a> Retrying the Deployment
An installation or update can fail for many reasons, but the system is self-healing, and can often
automatically correct or work around hardware or network faults.
Click **Install** or **Review Pending Changes**, then **Apply Changes** again, and the system may
resolve a problem on its own.
Some failures only produce generic errors like `Exited with 1`. In cases like this, where a failure
is not accompanied by useful information, click **Install** or **Review Pending Changes**, then
**Apply Changes** to retry.
When the system does provide informative evidence, review the [Common Issues](#common_issues)
section below to see if your problem is covered.
## <a id='common_issues'></a>Common Issues ##
Compare evidence that you have gathered to the descriptions below. If your issue is covered, try
the recommended remediation procedures.
### <a id='reinstall_bosh'></a>BOSH Does Not Reinstall ###
You might want to reinstall BOSH for troubleshooting purposes.
However, if <%= vars.platform_name %> does not detect any changes, BOSH does not reinstall.
To force a reinstall of BOSH, select **BOSH Director > Resource Sizes**
and change a resource value.
For example, you could increase the amount of RAM by 4 MB.
### <a id='time_out'></a>Creating Bound Missing VMs Times Out ###
This task happens immediately following package compilation, but before job assignment to agents.
For example:
<pre class="terminal">
cloud_controller/0: Timed out pinging to f690db09-876c-475e-865f-2cece06aba79 after 600 seconds (00:10:24)
</pre>
This is most likely a NATS issue with the VM in question.
To identify a NATS issue, inspect the agent log for the VM.
Since the BOSH director is unable to reach the BOSH agent, you must access the
VM using another method.
You will likely also be unable to access the VM using TCP.
In this case, access the VM using your virtualization console.
To diagnose:
1. Access the VM using your virtualization console.
1. To log in, navigate to the **Credentials** tab of the
<%= vars.app_runtime_full %> (<%= vars.app_runtime_abbr %>) tile and locate the VM in question to
find the **VM credentials**.
1. Become root.
1. Run `cd /var/vcap/bosh/log`.
1. Open the file `current`.
1. First, determine whether the BOSH agent and director have successfully
completed a handshake, represented in the logs as a "ping-pong":
<pre class="terminal">
2013-10-03_14:35:48.58456 #[608] INFO: Message: {"method"=>"ping", "arguments"=>[],
"reply_to"=>"director.f4b7df14-cb8f.19719508-e0dd-4f53-b755-58b6336058ab"}
2013-10-03_14:35:48.60182 #[608] INFO: reply_to: director.f4b7df14-cb8f.19719508-e0dd-4f53-b755-58b6336058ab:
payload: {:value=>"pong"}
</pre>
This handshake must complete for the agent to receive instructions from the director.
1. If you do not see the handshake, look for another line near the beginning of
the file, prefixed `INFO: loaded new infrastructure settings`.
For example:
<pre class="terminal">
2013-10-03_14:35:21.83222 #[608] INFO: loaded new infrastructure settings:
{"vm"=>{"name"=>"vm-4d80ede4-b0a5-4992-aea6a0386e18e", "id"=>"vm-360"},
"agent_id"=>"56aea4ef-6aa9-4c39-8019-7024ccfdde4",
"networks"=>{"default"=>{"ip"=>"192.0.2.19",
"netmask"=>"255.255.255.0", "cloud_properties"=>{"name"=>"VMNetwork"},
"default"=>["dns", "gateway"],
"dns"=>["192.0.2.2", "192.0.2.17"], "gateway"=>"192.0.2.2",
"dns_record_name"=>"0.nats.default.cf-d729343071061.microbosh",
"mac"=>"00:50:56:9b:71:67"}}, "disks"=>{"system"=>0, "ephemeral"=>1,
"persistent"=>{}}, "ntp"=>[], "blobstore"=>{"provider"=>"dav",
"options"=>{"endpoint"=>"http://192.0.2.17:25250",
"user"=>"agent", "password"=>"agent"}},
"mbus"=>"nats://nats:nats@192.0.2.17:4222",
"env"=>{"bosh"=>{"password"=>"$6$40ftQ9K4rvvC/8ADZHW0"}}}
</pre>
This is a JSON blob of key-value pairs representing the expected infrastructure
for the BOSH agent.
For this issue, the following key-value pair is the most important:
`"mbus"=>"nats://nats:nats@192.0.2.17:4222"`
This key-value pair represents where the agent expects the NATS server to be.
One diagnostic tactic is to try pinging this NATS IP address from the VM to
determine whether you are experiencing routing issues.
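For example, from the console session on the affected VM, you can check both ICMP reachability and TCP connectivity to the NATS port. This is a minimal sketch using the `mbus` values from the settings above; the IP address and port come from your own agent log, and `nc` may not be installed on every stemcell:
<pre class="terminal">
$ ping -c 3 192.0.2.17
$ nc -zv 192.0.2.17 4222
</pre>
If the ping succeeds but the TCP connection is refused or times out, look for a firewall or security group rule blocking the NATS port.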
### <a id='RSA-cert'></a>Install Exits With a Creates/Updates/Deletes App Failure or With a 403 Error ###
**Scenario 1:**
Your <%= vars.platform_name %> install exits with the following 403 error when you attempt to
log in to the Apps Manager:
<pre>
{"type": "step_finished", "id": "apps-manager.deploy"}
/home/tempest-web/tempest/web/vendor/bundle/ruby/1.9.1/gems/mechanize-2.7.2/lib/mechanize/http/agent.rb:306:in
`fetch': 403 => Net::HTTPForbidden for https://login.api.example.net/oauth/authorizeresponse_type=code&client_id=portal&redirect_uri=https%3...
-- unhandled response (Mechanize::ResponseCodeError)
</pre>
**Scenario 2:**
Your <%= vars.platform_name %> install exits with a `creates/updates/deletes an app (FAILED - 1)` error message with the following stack trace:
<pre>
1) App CRUD creates/updates/deletes an app
Failure/Error: Unable to find matching line from backtrace
CFoundry::TargetRefused:
Connection refused - connect(2)
</pre>
In either of the above scenarios, ensure that you have correctly entered your
domains in wildcard format:
1. Browse to the fully qualified domain name (FQDN) of Ops Manager.
1. Click the <%= vars.app_runtime_abbr %> tile.
1. Select **HAProxy** and click **Generate Self-Signed RSA Certificate**.
1. Enter your system and app domains in wildcard format, optionally enter any custom domains, and click **Save**.
Refer to <%= vars.app_runtime_abbr %> **Cloud Controller** for explanations of these
domain values.
<%= image_tag('rsa_cert.png') %>
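For example, wildcard entries for hypothetical system and app domains might look like the following:
```
*.system.example.com
*.apps.example.com
```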
### <a id='runtime_depend'></a>Install Fails When Gateway Instances Exceed Zero ###
If you configure the number of Gateway instances to be greater than zero for a
given product, you create a dependency on <%= vars.app_runtime_abbr %> for that product
installation.
If you attempt to install a product tile with a <%= vars.app_runtime_abbr %> dependency
before installing <%= vars.app_runtime_abbr %>, the install fails.
To change the number of Gateway instances, click the product
tile, then select **Settings > Resource sizes > INSTANCES** and change the
value next to the product Gateway job.
To remove the <%= vars.app_runtime_abbr %> dependency, change the value of this field to `0`.
### <a id='out_of_disk_space'></a>Out of Disk Space Error ###
<%= vars.platform_name %> displays an `Out of Disk Space` error if log files expand to fill
all available disk space.
If this happens, rebooting the <%= vars.platform_name %> installation VM clears the tmp
directory of these log files and resolves the error.
If users receive `Out of Disk Space` errors when trying to push apps, the Diego cells may be running out of disk capacity.
To perform a detailed analysis of disk usage by containers and host VMs in your <%= vars.platform_name %> deployment,
see [Examining GrootFS Disk Usage](../adminguide/examining_grootfs_disk.html).
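Before rebooting, you can confirm that log files in the tmp directory are what is consuming disk space on the installation VM. The following is a minimal sketch; the `ubuntu` user and the paths shown are common defaults and may differ in your deployment:
<pre class="terminal">
$ ssh ubuntu@OPS-MANAGER-FQDN
$ df -h /tmp /var
$ sudo du -sh /tmp/* 2>/dev/null | sort -h | tail
</pre>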
### <a id='vsphere_fails'></a>Installing BOSH Director Fails in vSphere ###
If the DNS information for the <%= vars.ops_manager %> VM is incorrectly specified when
deploying the <%= vars.ops_manager %> .ova file, installing BOSH Director fails at the "Installing Micro BOSH" step.
To resolve this issue, correct the DNS settings in the <%= vars.ops_manager %> Virtual
Machine properties.
### <a id='delete_om_fails'></a>Deleting a Deployment Fails ###
<%= vars.ops_manager %> displays an error message when it cannot delete your installation. This
scenario might happen if the BOSH Director cannot access the VMs or is experiencing other issues.
To manually delete your installation and all VMs, you must do the following:
1. Use your IaaS dashboard to manually delete the VMs for all installed products, with the exception of the <%= vars.ops_manager %> VM.
1. SSH into your <%= vars.ops_manager %> VM and remove the `installation.yml` file from
`/var/tempest/workspaces/default/`.
<p class="note"><strong>Note</strong>: Deleting the <code>installation.yml</code> file does not
prevent you from reinstalling a deployment. For future deploys,
<%= vars.ops_manager %> regenerates this file when you click <strong>Save</strong> on any
page in the BOSH Director tile.</p>
Your installation is now deleted.
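For reference, step 2 above typically looks like the following. This is a sketch only; the `ubuntu` user name and SSH access method depend on how your <%= vars.ops_manager %> VM was deployed:
<pre class="terminal">
$ ssh ubuntu@OPS-MANAGER-FQDN
$ sudo rm /var/tempest/workspaces/default/installation.yml
</pre>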
### <a id='elastic_runtime_fails'></a>Installing <%= vars.app_runtime_abbr %> Fails ###
The following sections describe errors that may occur when installing <%= vars.app_runtime_abbr %>
and how to resolve them.
#### <%= vars.app_runtime_abbr %> Cannot Verify App Push
If the DNS information for the <%= vars.ops_manager %> VM becomes incorrect after BOSH Director
has been installed, installing <%= vars.app_runtime_abbr %> fails at the "Verifying app push" step.
To resolve this issue, correct the DNS settings in the <%= vars.ops_manager %> Virtual
Machine properties.
#### MySQL Monitor Not Running After Update
When MySQL cannot communicate with UAA, <%= vars.ops_manager %> shows the following error:
> **Error: 'mysql_monitor/12a3b456-cd7e-8fgh-9012-345678b90ijk (0)' is not running after update. Review logs for failed jobs: replication-canary.**
If you see this error, create firewall rules that allow MySQL to reach UAA, using the
[MySQL Network Communications](https://docs.pivotal.io/ops-manager/<%= product_info['local_product_version'].to_s.sub('.','-') %>/security/networking/mysql-network-paths.html) topic as a
reference.
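After adding the firewall rules, you can check that the path from the MySQL monitor VM to UAA is open. This is a hedged sketch; the deployment name, instance name, system domain, and UAA port (commonly 8443) are placeholders that may differ in your environment:
<pre class="terminal">
$ bosh -d CF-DEPLOYMENT-NAME ssh mysql_monitor/0
$ nc -zv uaa.YOUR-SYSTEM-DOMAIN 8443
</pre>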
### <a id='ip_address_taken'></a><%= vars.ops_manager %> Hangs During MicroBOSH Install or HAProxy States "IP Address Already Taken" ###
During a <%= vars.app_runtime_abbr %> installation, you might receive the following errors:
* The <%= vars.ops_manager %> GUI shows that the installation stops at the "Setting MicroBOSH
deployment manifest" task.
* When you set the IP address for the HAProxy, the "IP Address Already Taken" message appears.
When you install the BOSH Director, you assign it an IP address. <%= vars.platform_name %>
then takes the next two consecutive IP addresses, assigns the first to MicroBOSH, and reserves the
second. For example:
```
203.0.113.1 - <%= vars.platform_name %> (User assigned)
203.0.113.2 - MicroBOSH (<%= vars.platform_name %> assigned)
203.0.113.3 - Reserved (<%= vars.platform_name %> reserved)
```
To resolve this issue, ensure that the two IP addresses immediately following the manually assigned
address are unassigned.
### <a id='pcf_slow'></a>Poor Performance ###
If you notice poor network performance in your deployment and your deployment uses a Network Address
Translation (NAT) gateway, your NAT gateway may be under-resourced.
#### <a id='troubleshoot-slow'></a>Troubleshoot
To troubleshoot the issue, set a custom firewall rule in your IaaS console to route traffic
originating from your private network directly to an S3-compatible object store. If you see
decreased average latency and improved network performance, perform the solution below to scale up
your NAT gateway.
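One way to compare before-and-after latency is to time requests to your object store endpoint from a VM on the private network. This is an illustrative sketch; the endpoint URL is a placeholder for your S3-compatible object store:
<pre class="terminal">
$ curl -s -o /dev/null -w "time_total: %{time_total}s\n" https://OBJECT-STORE-ENDPOINT/
</pre>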
#### <a id='solution-slow'></a>Scale Up Your NAT Gateway
Perform the following steps to scale up your NAT gateway:
1. Navigate to your IaaS console.
1. Spin up a new NAT gateway of a larger VM size than your previous NAT gateway.
1. Change the routes to direct traffic through the new NAT gateway.
1. Spin down the old NAT gateway.
The specific procedures will vary depending on your IaaS. Consult your IaaS documentation for more
information.
## <a id='common_problems_firewalls'></a>Common Issues Caused by Firewalls ##
This section describes various issues you might encounter when installing
<%= vars.platform_name %> in an environment that uses a strong firewall.
### <a id='DNS_res_fails'></a>DNS Resolution Fails ###
When you install <%= vars.platform_name %> in an environment that uses a strong firewall, the
firewall might block DNS resolution. To resolve this issue, refer to
[Verify Ops Manager Resolves DNS Entries Behind a Firewall](./config_firewall.html#DNS_fails) in
_Preparing Your Firewall_.
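As a quick check, you can test DNS resolution directly from the <%= vars.ops_manager %> VM. This is a minimal sketch; the hostname is a placeholder, and `nslookup` may not be present on every image, in which case `getent hosts` provides a similar check:
<pre class="terminal">
$ nslookup example.com
$ getent hosts example.com
</pre>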