12. Diagnosing System Problems and Outages¶
12.1. Using dimscli
¶
This chapter covers using dimscli
as a distributed
shell for diagnosing problems throughout a DIMS deployment.
Ansible has two primary CLI programs, ansible
and
ansible-playbook
. Both of these programs are passed a
set of hosts on which they are to operate using an
Inventory.
Note
Read about Ansible and how it is used by the DIMS project in Section ansibleplaybooks:ansiblefundamentals of ansibleplaybooks:ansibleplaybooks.
[dimsenv] dittrich@dimsdemo1:~/dims/git/python-dimscli (develop*) $ cat complete_inventory
[all]
floyd2-p.prisem.washington.edu
foswiki-int.prisem.washington.edu
git.prisem.washington.edu
hub.prisem.washington.edu
jenkins-int.prisem.washington.edu
jira-int.prisem.washington.edu
lapp-int.prisem.washington.edu
lapp.prisem.washington.edu
linda-vm1.prisem.washington.edu
rabbitmq.prisem.washington.edu
sso.prisem.washington.edu
time.prisem.washington.edu
u12-dev-svr-1.prisem.washington.edu
u12-dev-ws-1.prisem.washington.edu
wellington.prisem.washington.edu
Using this inventory, the modules command
and shell
can be used to run commands
as needed to diagnose all of these hosts at once.
[dimsenv] dittrich@dimsdemo1:~/dims/git/python-dimscli (develop*) $ dimscli ansible command --program "uptime" --inventory complete_inventory --remote-port 8422 --remote-user dittrich
+-------------------------------------+--------+-------------------------------------------------------------------------+
| Host | Status | Results |
+-------------------------------------+--------+-------------------------------------------------------------------------+
| rabbitmq.prisem.washington.edu | GOOD | 22:07:53 up 33 days, 4:32, 1 user, load average: 0.07, 0.13, 0.09 |
| wellington.prisem.washington.edu | GOOD | 22:07:57 up 159 days, 12:16, 1 user, load average: 1.16, 0.86, 0.58 |
| linda-vm1.prisem.washington.edu | GOOD | 22:07:54 up 159 days, 12:03, 1 user, load average: 0.00, 0.01, 0.05 |
| git.prisem.washington.edu | GOOD | 22:07:54 up 159 days, 12:03, 2 users, load average: 0.00, 0.01, 0.05 |
| time.prisem.washington.edu | GOOD | 22:07:55 up 33 days, 4:33, 2 users, load average: 0.01, 0.07, 0.12 |
| jenkins-int.prisem.washington.edu | GOOD | 22:07:55 up 159 days, 12:03, 1 user, load average: 0.00, 0.01, 0.05 |
| u12-dev-ws-1.prisem.washington.edu | GOOD | 22:07:56 up 159 days, 12:03, 1 user, load average: 0.00, 0.02, 0.05 |
| sso.prisem.washington.edu | GOOD | 22:07:56 up 159 days, 12:03, 1 user, load average: 0.00, 0.01, 0.05 |
| lapp-int.prisem.washington.edu | GOOD | 22:07:54 up 159 days, 12:04, 2 users, load average: 0.00, 0.01, 0.05 |
| foswiki-int.prisem.washington.edu | GOOD | 22:07:55 up 159 days, 12:04, 1 user, load average: 0.00, 0.01, 0.05 |
| u12-dev-svr-1.prisem.washington.edu | GOOD | 22:07:59 up 155 days, 14:56, 1 user, load average: 0.05, 0.08, 0.06 |
| hub.prisem.washington.edu | GOOD | 06:07:53 up 141 days, 12:19, 1 user, load average: 0.08, 0.03, 0.05 |
| floyd2-p.prisem.washington.edu | GOOD | 22:07:53 up 33 days, 4:32, 1 user, load average: 0.00, 0.01, 0.05 |
| jira-int.prisem.washington.edu | GOOD | 22:07:54 up 159 days, 12:03, 2 users, load average: 0.00, 0.01, 0.05 |
| lapp.prisem.washington.edu | GOOD | 22:07:54 up 159 days, 12:04, 2 users, load average: 0.00, 0.01, 0.05 |
+-------------------------------------+--------+-------------------------------------------------------------------------+
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 | To: dims-devops@uw.ops-trust.net
From: Jenkins <dims@eclipse.prisem.washington.edu>
Subject: [dims devops] [Jenkins] [FAILURE] jenkins-update-cifbulk-server-develop-16
Date: Thu Jan 14 20:35:21 PST 2016
Message-ID: <20160115043521.C7D5E1C004F@jenkins>
Started by an SCM change
[EnvInject] - Loading node environment variables.
Building in workspace /var/lib/jenkins/jobs/update-cifbulk-server-develop/workspace
Deleting project workspace... done
[ssh-agent] Using credentials ansible (Ansible user ssh key - root)
[ssh-agent] Looking for ssh-agent implementation...
[ssh-agent] Java/JNR ssh-agent
[ssh-agent] Started.
...
TASK: [cifbulk-server | Make config change available and restart if updating existing] ***
<rabbitmq.prisem.washington.edu> REMOTE_MODULE command . /opt/dims/envs/dimsenv/bin/activate && supervisorctl -c /etc/supervisord.conf reread #USE_SHELL
failed: [rabbitmq.prisem.washington.edu] => (item=reread) => {"changed": true, "cmd": ". /opt/dims/envs/dimsenv/bin/activate && supervisorctl -c /etc/supervisord.conf reread", "delta": "0:00:00.229614", "end": "2016-01-14 20:34:49.409784", "item": "reread", "rc": 2, "start": "2016-01-14 20:34:49.180170"}
stderr: Error: could not find config file /etc/supervisord.conf
For help, use /usr/bin/supervisorctl -h
<rabbitmq.prisem.washington.edu> REMOTE_MODULE command . /opt/dims/envs/dimsenv/bin/activate && supervisorctl -c /etc/supervisord.conf update #USE_SHELL
failed: [rabbitmq.prisem.washington.edu] => (item=update) => {"changed": true, "cmd": ". /opt/dims/envs/dimsenv/bin/activate && supervisorctl -c /etc/supervisord.conf update", "delta": "0:00:00.235882", "end": "2016-01-14 20:34:50.097224", "item": "update", "rc": 2, "start": "2016-01-14 20:34:49.861342"}
stderr: Error: could not find config file /etc/supervisord.conf
For help, use /usr/bin/supervisorctl -h
FATAL: all hosts have already failed -- aborting
PLAY RECAP ********************************************************************
to retry, use: --limit @/var/lib/jenkins/cifbulk-server-configure.retry
rabbitmq.prisem.washington.edu : ok=11 changed=4 unreachable=0 failed=1
Build step 'Execute shell' marked build as failure
[ssh-agent] Stopped.
Warning: you have no plugins providing access control for builds, so falling back to legacy behavior of permitting any downstream builds to be triggered
Finished: FAILURE
--
[[ UW/DIMS ]]: All message content remains the property of the author
and must not be forwarded or redistributed without explicit permission.
|
[dimsenv] dittrich@dimsdemo1:~/dims/git/ansible-playbooks (develop*) $ grep -r supervisord.conf
roles/supervisor-install/tasks/main.yml: template: "src=supervisord.conf.j2 dest={{ dims_supervisord_conf }} owner=root group=root"
roles/supervisor-install/tasks/main.yml: file: path=/etc/dims-supervisord.conf state=absent
roles/supervisor-install/templates/supervisor.j2:DAEMON_OPTS="-c {{ dims_supervisord_conf }} $DAEMON_OPTS"
roles/cifbulk-server/tasks/main.yml: shell: ". {{ dimsenv_activate }} && supervisorctl -c {{ dims_supervisord_conf }} {{ item }}"
roles/cifbulk-server/tasks/main.yml: shell: ". {{ dimsenv_activate }} && supervisorctl -c {{ dims_supervisord_conf }} start {{ name_base }}:"
roles/prisem-scripts-deploy/tasks/main.yml: shell: ". {{ dimsenv_activate }} && supervisorctl -c {{ dims_supervisord_conf }} restart {{ item }}:"
roles/anon-server/tasks/main.yml: shell: ". {{ dimsenv_activate }} && supervisorctl -c {{ dims_supervisord_conf }} {{ item }}"
roles/anon-server/tasks/main.yml: shell: ". {{ dimsenv_activate }} && supervisorctl -c {{ dims_supervisord_conf }} start {{ name_base }}:"
roles/consul-install/tasks/main.yml: shell: ". {{ dimsenv_activate }} && supervisorctl -c {{ dims_supervisord_conf }} remove {{ consul_basename }}"
roles/consul-install/tasks/main.yml: shell: ". {{ dimsenv_activate }} && supervisorctl -c {{ dims_supervisord_conf }} {{ item }}"
roles/consul-install/tasks/main.yml: shell: ". {{ dimsenv_activate }} && supervisorctl -c {{ dims_supervisord_conf }} start {{ consul_basename }}:"
roles/crosscor-server/tasks/main.yml: shell: ". {{ dimsenv_activate }} && supervisorctl -c {{ dims_supervisord_conf }} {{ item }}"
roles/crosscor-server/tasks/main.yml: shell: ". {{ dimsenv_activate }} && supervisorctl -c {{ dims_supervisord_conf }} start {{ name_base }}:"
group_vars/all:dims_supervisord_conf: '/etc/supervisord.conf'
[dimsenv] dittrich@dimsdemo1:~/dims/git/python-dimscli (develop*) $ dimscli ansible shell --program "find /etc -name supervisord.conf" --inventory complete_inventory --remote-port 8422 --remote-u
ser dittrich
+-------------------------------------+--------+----------------------------------+
| Host | Status | Results |
+-------------------------------------+--------+----------------------------------+
| rabbitmq.prisem.washington.edu | GOOD | /etc/supervisor/supervisord.conf |
| wellington.prisem.washington.edu | GOOD | |
| hub.prisem.washington.edu | GOOD | |
| git.prisem.washington.edu | GOOD | /etc/supervisor/supervisord.conf |
| u12-dev-ws-1.prisem.washington.edu | GOOD | |
| sso.prisem.washington.edu | GOOD | |
| jenkins-int.prisem.washington.edu | GOOD | /etc/supervisor/supervisord.conf |
| foswiki-int.prisem.washington.edu | GOOD | |
| lapp-int.prisem.washington.edu | GOOD | |
| u12-dev-svr-1.prisem.washington.edu | GOOD | /etc/supervisor/supervisord.conf |
| linda-vm1.prisem.washington.edu | GOOD | |
| lapp.prisem.washington.edu | GOOD | |
| floyd2-p.prisem.washington.edu | GOOD | |
| jira-int.prisem.washington.edu | GOOD | /etc/supervisor/supervisord.conf |
| time.prisem.washington.edu | GOOD | |
+-------------------------------------+--------+----------------------------------+
[dimsenv] dittrich@dimsdemo1:~/dims/git/python-dimscli (develop*) $ dimscli ansible shell --program "find /etc -name '*supervisor'*" --inventory complete_inventory --remote-port 8422 --remote-use
r dittrich
+-------------------------------------+--------+-------------------------------------------------+
| Host | Status | Results |
+-------------------------------------+--------+-------------------------------------------------+
| rabbitmq.prisem.washington.edu | GOOD | /etc/rc0.d/K20supervisor |
| | | /etc/rc3.d/S20supervisor |
| | | /etc/rc1.d/K20supervisor |
| | | /etc/default/supervisor |
| | | /etc/rc2.d/S20supervisor |
| | | /etc/rc6.d/K20supervisor |
| | | /etc/supervisor |
| | | /etc/supervisor/supervisord.conf.20140214204135 |
| | | /etc/supervisor/supervisord.conf.20140214200547 |
| | | /etc/supervisor/supervisord.conf.20140616162335 |
| | | /etc/supervisor/supervisord.conf.20140814132409 |
| | | /etc/supervisor/supervisord.conf.20140616162451 |
| | | /etc/supervisor/supervisord.conf.20140616162248 |
| | | /etc/supervisor/supervisord.conf.20140131230939 |
| | | /etc/supervisor/supervisord.conf.20140222154901 |
| | | /etc/supervisor/supervisord.conf.20140214194415 |
| | | /etc/supervisor/supervisord.conf.20140222155042 |
| | | /etc/supervisor/supervisord.conf.20150208174308 |
| | | /etc/supervisor/supervisord.conf.20140814132717 |
| | | /etc/supervisor/supervisord.conf.20140215134451 |
| | | /etc/supervisor/supervisord.conf.20150208174742 |
| | | /etc/supervisor/supervisord.conf.20140911193305 |
| | | /etc/supervisor/supervisord.conf.20140219200951 |
| | | /etc/supervisor/supervisord.conf.20140911202633 |
| | | /etc/supervisor/supervisord.conf |
| | | /etc/supervisor/supervisord.conf.20140222154751 |
| | | /etc/supervisor/supervisord.conf.20150208174403 |
| | | /etc/supervisor/supervisord.conf.20140814132351 |
| | | /etc/supervisor/supervisord.conf.20140814132759 |
| | | /etc/rc4.d/S20supervisor |
| | | /etc/init.d/supervisor |
| | | /etc/rc5.d/S20supervisor |
| wellington.prisem.washington.edu | GOOD | |
| linda-vm1.prisem.washington.edu | GOOD | /etc/rc0.d/K20supervisor |
| | | /etc/rc3.d/S20supervisor |
| | | /etc/rc1.d/K20supervisor |
| | | /etc/rc2.d/S20supervisor |
| | | /etc/rc6.d/K20supervisor |
| | | /etc/supervisor |
| | | /etc/rc4.d/S20supervisor |
| | | /etc/dims-supervisord.conf |
| | | /etc/init.d/supervisor |
| | | /etc/rc5.d/S20supervisor |
| git.prisem.washington.edu | GOOD | /etc/rc0.d/K20supervisor |
| | | /etc/rc3.d/S20supervisor |
| | | /etc/rc1.d/K20supervisor |
| | | /etc/default/supervisor |
| | | /etc/rc2.d/S20supervisor |
| | | /etc/rc6.d/K20supervisor |
| | | /etc/supervisor |
| | | /etc/supervisor/supervisord.conf |
| | | /etc/rc4.d/S20supervisor |
| | | /etc/init.d/supervisor |
| | | /etc/rc5.d/S20supervisor |
| time.prisem.washington.edu | GOOD | |
| jenkins-int.prisem.washington.edu | GOOD | /etc/rc0.d/K20supervisor |
| | | /etc/rc3.d/S20supervisor |
| | | /etc/rc1.d/K20supervisor |
| | | /etc/default/supervisor |
| | | /etc/rc2.d/S20supervisor |
| | | /etc/rc6.d/K20supervisor |
| | | /etc/supervisor |
| | | /etc/supervisor/supervisord.conf |
| | | /etc/rc4.d/S20supervisor |
| | | /etc/init.d/supervisor |
| | | /etc/rc5.d/S20supervisor |
| u12-dev-ws-1.prisem.washington.edu | GOOD | |
| sso.prisem.washington.edu | GOOD | |
| lapp-int.prisem.washington.edu | GOOD | |
| foswiki-int.prisem.washington.edu | GOOD | |
| u12-dev-svr-1.prisem.washington.edu | GOOD | /etc/rc2.d/S20supervisor |
| | | /etc/rc4.d/S20supervisor |
| | | /etc/init.d/supervisor |
| | | /etc/rc5.d/S20supervisor |
| | | /etc/rc3.d/S20supervisor |
| | | /etc/supervisor |
| | | /etc/supervisor/supervisord.conf |
| | | /etc/rc6.d/K20supervisor |
| | | /etc/rc1.d/K20supervisor |
| | | /etc/rc0.d/K20supervisor |
| hub.prisem.washington.edu | GOOD | |
| floyd2-p.prisem.washington.edu | GOOD | |
| jira-int.prisem.washington.edu | GOOD | /etc/rc0.d/K20supervisor |
| | | /etc/rc3.d/S20supervisor |
| | | /etc/rc1.d/K20supervisor |
| | | /etc/default/supervisor |
| | | /etc/rc2.d/S20supervisor |
| | | /etc/rc6.d/K20supervisor |
| | | /etc/supervisor |
| | | /etc/supervisor/supervisord.conf |
| | | /etc/rc4.d/S20supervisor |
| | | /etc/init.d/supervisor |
| | | /etc/rc5.d/S20supervisor |
| lapp.prisem.washington.edu | GOOD | |
+-------------------------------------+--------+-------------------------------------------------+
While the concept of putting a list of host names into a file with a label is simple
to understand, it is not very flexible or scalable. Ansible supports a concept
called a Dynamic Inventory. Rather than passing a hosts file using -i
or
--inventory
, you can pass a Python script that produces a special JSON object.
What is not very widely known is that you can also trigger creation of a
dynamic inventory within ansible
or ansible-playbook
by passing
a list for the -i
or --inventory
option. Rather than creating
a temporary file with [all]
at the top, followed by a list of
three host names, then passing that file with -i
or --inventory
, just
pass a comma-separated list instead:
[dimsenv] dittrich@dimsdemo1:~/dims/git/python-dimscli (develop*) $ dimscli ansible shell --program "find /etc -name supervisord.conf" --inventory rabbitmq.prisem.washington.edu,time.prisem.washi
ngton.edu,u12-dev-svr-1.prisem.washington.edu --remote-port 8422 --remote-user dittrich
+-------------------------------------+--------+----------------------------------+
| Host | Status | Results |
+-------------------------------------+--------+----------------------------------+
| rabbitmq.prisem.washington.edu | GOOD | /etc/supervisor/supervisord.conf |
| time.prisem.washington.edu | GOOD | |
| u12-dev-svr-1.prisem.washington.edu | GOOD | /etc/supervisor/supervisord.conf |
+-------------------------------------+--------+----------------------------------+
There is a subtle trick for passing just a single host, and that is to pass
the name with a trailing comma (,
), as seen here:
[dimsenv] dittrich@dimsdemo1:~/dims/git/python-dimscli (develop*) $ dimscli ansible shell --program "find /etc -name supervisord.conf" --inventory rabbitmq.prisem.washington.edu, --remote-port 84
22 --remote-user dittrich
+--------------------------------+--------+----------------------------------+
| Host | Status | Results |
+--------------------------------+--------+----------------------------------+
| rabbitmq.prisem.washington.edu | GOOD | /etc/supervisor/supervisord.conf |
+--------------------------------+--------+----------------------------------+
...
12.2. Debugging Vagrant¶
Vagrant has a mechanism for enabling debugging output to determine
what it is doing. That mechanism is to set an environment variable
VAGRANT_LOG=debug
before running vagrant
.
$ vagrant halt
$ VAGRANT_LOG=debug vagrant up --no-provision > /tmp/debug.log.1 2>&1
The debugging log looks like the following:
INFO global: Vagrant version: 1.8.6
INFO global: Ruby version: 2.2.5
INFO global: RubyGems version: 2.4.5.1
INFO global: VAGRANT_LOG="debug"
INFO global: VAGRANT_OLD_ENV_TMPDIR="/tmp"
INFO global: VAGRANT_OLD_ENV_COMMAND=""
INFO global: VAGRANT_OLD_ENV_LANG="en_US.UTF-8"
INFO global: VAGRANT_OLD_ENV_UNDEFINED="__undefined__"
INFO global: VAGRANT_OLD_ENV_TERM="screen-256color"
INFO global: VAGRANT_OLD_ENV_VAGRANT_LOG="debug"
. . .
INFO global: VAGRANT_INTERNAL_BUNDLERIZED="1"
INFO global: Plugins:
INFO global: - bundler = 1.12.5
INFO global: - unf_ext = 0.0.7.2
INFO global: - unf = 0.1.4
INFO global: - domain_name = 0.5.20161129
INFO global: - http-cookie = 1.0.3
INFO global: - i18n = 0.7.0
INFO global: - log4r = 1.1.10
INFO global: - micromachine = 2.0.0
INFO global: - mime-types-data = 3.2016.0521
INFO global: - mime-types = 3.1
INFO global: - net-ssh = 3.0.2
INFO global: - net-scp = 1.1.2
INFO global: - netrc = 0.11.0
INFO global: - rest-client = 2.0.0
INFO global: - vagrant-scp = 0.5.7
INFO global: - vagrant-share = 1.1.6
INFO global: - vagrant-triggers = 0.5.3
INFO global: - vagrant-vbguest = 0.13.0
. . .
INFO vagrant: `vagrant` invoked: ["up"]
DEBUG vagrant: Creating Vagrant environment
INFO environment: Environment initialized (#<Vagrant::Environment:0x00000002618e68>)
INFO environment: - cwd: /vm/run/blue14
INFO environment: Home path: /home/ansible/.vagrant.d
DEBUG environment: Effective local data path: /vm/run/blue14/.vagrant
INFO environment: Local data path: /vm/run/blue14/.vagrant
DEBUG environment: Creating: /vm/run/blue14/.vagrant
INFO environment: Running hook: environment_plugins_loaded
INFO runner: Preparing hooks for middleware sequence...
INFO runner: 3 hooks defined.
INFO runner: Running action: environment_plugins_loaded #<Vagrant::Action::Builder:0x000000025278b0>
. .
DEBUG meta: Finding driver for VirtualBox version: 5.1.10
INFO meta: Using VirtualBox driver: VagrantPlugins::ProviderVirtualBox::Driver::Version_5_1
INFO base: VBoxManage path: VBoxManage
INFO subprocess: Starting process: ["/usr/bin/VBoxManage", "showvminfo", "d1f7ffcb-3fab-4878-a77d-5fdb8d2f7fae"]
INFO subprocess: Command not in installer, restoring original environment...
DEBUG subprocess: Selecting on IO
DEBUG subprocess: stdout: Name: blue14_default_1482088614789_39851
Groups: /
Guest OS: Ubuntu (64-bit)
UUID: d1f7ffcb-3fab-4878-a77d-5fdb8d2f7fae
Config file: /home/ansible/VirtualBox VMs/blue14_default_1482088614789_39851/blue14_default_1482088614789_39851.vbox
Snapshot folder: /home/ansible/VirtualBox VMs/blue14_default_1482088614789_39851/Snapshots
Log folder: /home/ansible/VirtualBox VMs/blue14_default_1482088614789_39851/Logs
Hardware UUID: d1f7ffcb-3fab-4878-a77d-5fdb8d2f7fae
Memory size: 3072MB
Page Fusion: off
VRAM size: 32MB
CPU exec cap: 100%
. . .
Effective Paravirt. Provider: KVM
State: powered off (since 2016-10-30T20:11:22.000000000)
Monitor count: 1
3D Acceleration: off
2D Video Acceleration: off
Teleporter Enabled: off
. . .
For this debugging scenario, we are trying to add the ability to toggle whether
Vagrant brings up the Virtualbox VM with or without a GUI (i.e., “headless” or
not). The line we are concerned about here is the following line, which shows
the startvm
line used to run the Virtualbox VM:
INFO subprocess: Starting process: ["/usr/bin/VBoxManage", "startvm", "89e0e942-3b3b-4f0a-b0e4-6d0bb51fef04", "--type", "headless"]
The default for Vagrant is to start VMs in headless mode. To instead boot
with a GUI, the Vagrantfile
should contain a provisioner block with
the following setting:
config.vm.provider "virtualbox" do |v|
v.gui = true
end
Note
It is important to note that the Vagrantfile
is Ruby code,
and that the above sets a Ruby boolean to the value true
,
which is not necessarily the same as the string "true"
.
Rather than requiring that the user edit the Vagrantfile
,
it would be more convenient to support passing an environment
variable into the child process.
Using the following code snippets, we can inherit an environment variable (which is a string) and turn it into a boolean using a string comparison operation in a ternary logical expression.
# Set GUI to boolean false if environment variable GUI == 'true'
GUI = ENV['GUI'].nil? ? false : (ENV['GUI'] == 'true')
. . .
# Conditionally control whether startvm uses "--type gui"
# or "--type headless" using GUI (set earlier)
config.vm.provider "virtualbox" do |v|
v.gui = GUI
end
. . .
Now we can test the setting of the environment variable on
a vagrant
command line, again with debug logging enabled
and redirected into a second log file.
$ vagrant halt
==> default: Attempting graceful shutdown of VM...
$ vagrant destroy --force
==> default: Destroying VM and associated drives...
$ GUI=true VAGRANT_LOG=debug vagrant up --no-provision > /tmp/debug.log.2 2>&1
Now looking for the specific string in the output of both files, we can compare the results and see that we have the desired effect:
$ grep 'Starting process.*startvm' /tmp/debug.log.{1,2}
/tmp/debug.log.1: INFO subprocess: Starting process: ["/usr/bin/VBoxManage", "startvm", "89e0e942-3b3b-4f0a-b0e4-6d0bb51fef04", "--type", "headless"]
/tmp/debug.log.2: INFO subprocess: Starting process: ["/usr/bin/VBoxManage", "startvm", "3921e4e9-fdb4-4191-90b3-f7415ec0b37d", "--type", "gui"]
12.3. Other Tools for Diagnosing System Problems¶
12.3.1. smartmontools¶
Hardware makes up the physical layer of the DIMS system. Developers are currently using Dell Precision M4800 laptops to develop the software layers of DIMS.
These laptops have had multiple issues, specifically including not sleeping properly and heating up to extreme temperatures, heating up to extreme temperatures when not sitting on solid, very well ventilated surfaces, and these specific problems have led to malfunctions with the hard drives. At least one laptop has completely stopped being able to boot. Multiple other laptops have struggled during the boot up process and have had other problems that may indicate a near-term hard drive failure.
In an effort to turn a black box into less of a black box and to try to see
ahead of time if there are any indicators that may be pointing to a failure
before a failure, we are now employing the use of a tool called smartmontools
.
This package comes with two tools – smartctl
and smartd
– which
control and monitor storage systems using the Self-Monitoring, Analysis
and Reporting Technology System (SMART)
built in to a lot of modern hard drives,
including the ones on the developer laptops. When using this tool as a daemon,
it can give advanced warning of disk degradation and failure. (For more
information, see smartmontools home.
The package will be added to the list of base packages installed on all DIMS systems, and the rest of this section will be devoted to a brief introduction for how to use the tool.
Note
These instructions were taken from ubuntu smartmontools docs. If it differs on other Linux flavors (particularly Debian Jessie), new instructions will be added.
You will be using the smartctl
utility to manually monitor your drives.
First, you need to double check that your hard drive is SMART-enabled.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 | [dimsenv] mboggess@dimsdev2:it/dims-adminguide/docs/source (develop*) $ sudo smartctl -i /dev/sda
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-4.4.0-42-generic] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Seagate Laptop SSHD
Device Model: ST1000LM014-1EJ164
Serial Number: W771CY1P
LU WWN Device Id: 5 000c50 089fc94f9
Firmware Version: DEMB
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Fri Oct 14 11:08:25 2016 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
|
This output gives you information about the hard drive, including if SMART is support and enabled.
In the event that somehow SMART is available but not enabled, run
sudo smartctl -s on /dev/sda
There are several different types of tests you can run via smartctl
. A full
list is documented in the help/usage output which you can obtain by running
[dimsenv] mboggess@dimsdev2:it/dims-adminguide/docs/source (develop*) $ smartctl -h
To find an estimate of the time it will take to complete the various tests, run
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 | [dimsenv] mboggess@dimsdev2:it/dims-adminguide/docs/source (develop*) $ sudo smartctl -c /dev/sda
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-4.4.0-42-generic] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 139) seconds.
Offline data collection
capabilities: (0x73) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 191) minutes.
Conveyance self-test routine
recommended polling time: ( 3) minutes.
SCT capabilities: (0x10b5) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
|
As you can see, the long
test is rather long–191 minutes!
To run the long
test, run
[dimsenv] mboggess@dimsdev2:it/dims-adminguide/docs/source (develop*) $ sudo smartctl -t long /dev/sda
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-4.4.0-42-generic] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 191 minutes for test to complete.
Test will complete after Fri Oct 14 15:00:32 2016
Use smartctl -X to abort test.
To abort the test:
[dimsenv] mboggess@dimsdev2:it/dims-adminguide/docs/source (develop*) $ sudo smartctl -X /dev/sda
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-4.4.0-42-generic] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Abort SMART off-line mode self-test routine".
Self-testing aborted!
To get test results, for a SATA drive, run
[dimsenv] mboggess@dimsdev2:it/dims-adminguide/docs/source (develop*) $ sudo smartctl -a -d ata /dev/sda
To get test results, for an IDE drive, run
[dimsenv] mboggess@dimsdev2:it/dims-adminguide/docs/source (develop*) $ sudo smartctl -a /dev/sda
Additionally, you can run smartmontools
as a daemon, but for now, that
will be left for an admin to research and develop on their own. In the future,
this has potential to be turned into an Ansible role. Documentation from Ubuntu
on how to use smartmontools
as a daemon can be found in the daemon subsection
of the Ubuntu smartmontools
documentation.