Q: What do I need to watch out for when I'm using SE?
A: Rich and I get a lot of questions about SE on our se-feedback alias. To reduce the number of times we have to answer the same question, this month's column is a FAQ.
Where can I get a version of SE that runs on Solaris 2.6?
This is still the most common question!
A new version, SE3.0, was introduced last month. It's a major
rewrite with many new features and includes Solaris 2.5, 2.5.1, and
2.6 support for SPARC and x86 (no older releases any more) and many
more network interface types. Keep reading Unix Insider to stay up to date. The SE distribution directory is now http://www.sun.com/sun-on-net/performance/se3.
Where can I get a version of SE that runs on Solaris 2.3 or 2.4?
We don't have enough time or test systems to support many releases, but you can use the previous SE2.5.0.2
release on Solaris 2.3 through 2.5.1.
The SE2.5.0.2 distribution directory is http://www.sun.com/960601/columns/adrian/se2.5.0.2.
Why did Solaris 2.6 support take so long?
For Solaris 2.6 we had to selectively filter out the
partition, tape, and NFS data from the iostat class and
disk rule. The rule now makes sure it only has disks that contain
partitions to look at. The changes in TCP that are in 2.6 and were
backported to 2.5.1 were taken care of; SE now looks for some key
patches to see if they are installed and sets preprocessor #defines
to cope with the changes.
Why do I get the error "Fatal: member: txunderruns vanished!: Near line 201"?
This problem occurs with SE2.5.0.2 and FDDI 5.0 interfaces. There are three
possible fixes:
- Upgrade to SE3.0.
- Visit the SE2.5.0.2 download page and get the FDDI patch.
- Add the latest patch to FDDI which should reinstate the metric that went missing.
Why do I get the error "Fatal: member: txunderruns0 vanished!: Near line 255"?
This problem occurs with SE3.0 and older FDDI interface code. Update your
FDDI patch level, or try running scripts with % se -DOLD_FDDI script.se.
Why do I get the error "Fatal: member: defer vanished!: Near line 285"?
This problem occurs with Solaris 2.5 and hme interfaces. There are three
possible fixes:
- Upgrade to a later Solaris release.
- Get the hme patch for Solaris 2.5.
- As a temporary work-around, change the member "defer" to
"missing1" in the ks_hme_network structure in /opt/RICHPse/include/kstat.se
like this:
#if MINOR_VERSION >= 51
ulong defer;
#else
ulong missing1;
#endif
The next build of SE3 will figure this out automatically.
Why do I get the error "Fatal: member: framming vanished!: Near line 160"?
This problem occurs with Solaris 2.5.1 and le interfaces. The le patch for Solaris 2.5.1 corrects the spelling from framming to framing. SE3.0 tries
to detect this patch, but if you don't have the patch directory in
/var/sadm/patch it can't tell that the patch is installed. There are four
possible fixes:
- Upgrade to Solaris 2.6.
- Reinstall the le patch for Solaris 2.5.1.
- Create the directory /var/sadm/patch/103903-03 by hand.
- As a temporary work-around, run scripts using % se -DLE_PATCH script.se to force the update.
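The patch detection described above depends on a directory under /var/sadm/patch. As a rough illustration only, here is a minimal sketch of that kind of check, written in Python rather than SE. It is not SE's actual code. The function name patch_installed, and the rule that any revision at or above the requested one counts, are assumptions made for the example.

```python
import os
import tempfile

def patch_installed(patch_root, base_id, min_rev):
    """Return True if patch_root holds a directory named base_id-NN with NN >= min_rev."""
    try:
        entries = os.listdir(patch_root)
    except OSError:
        return False  # no patch directory at all, so nothing can be detected
    for name in entries:
        base, sep, rev = name.partition("-")
        if base == base_id and sep and rev.isdigit() and int(rev) >= min_rev:
            if os.path.isdir(os.path.join(patch_root, name)):
                return True
    return False

# Demonstrate against a scratch directory instead of the real /var/sadm/patch.
demo_root = tempfile.mkdtemp()
before = patch_installed(demo_root, "103903", 3)
os.mkdir(os.path.join(demo_root, "103903-03"))  # the "create it by hand" fix
after = patch_installed(demo_root, "103903", 3)
print(before, after)
```

The sketch shows why creating the directory by hand is enough to change the answer: a check like this looks only at the directory name, not at whether the patched driver is really installed.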
Why do my networks keep indicating BLACK?
You may see messages like this from virtual_adrian, or black states reported
by zoom saying "Errors seen, fix hardware or cables."
Adrian detected slow net(s): Wed Dec 3 20:29:42 1997
Problem: network failure
State Name Ipkt/s Ierr/s Opkt/s Oerr/s Coll% NoCP/s Defr/s
black le0 2.4 0.1 0.8 0.0 0.00 0.00 0.00
You can see that it is reporting 0.1 input errors/s. Over a 30-second default
period this is more than one error. Single errors generate a warning.
Multiple errors generate this message. You may have a bad cable or
misconfigured Ethernet switch. It is also possible that another system
on the network is generating bad packets that are appearing as broadcast
packets on the network, so your Solaris system picks them up. This turns
out to be very common with PCs, as there is a lot of flaky PC networking
hardware in use. The way to track it down is to use snoop to capture
packets until you find a bad one, and look at the Ethernet address that
it came from. I found packets claiming to be old-version DECnet or Novell IPX
packets that were being generated by a PC at random. The PC was on a TCP/IP-only network. If the source address looks like 8:0:20:xx:xx:xx, then it is
very likely to be a Sun system. If not, then look for another type of
hardware on that network segment. You may also want to look at the raw
network interface counters to see what kinds of errors are being picked up.
% netstat -k hme1
hme1:
ipackets 1282154 ierrors 298 opackets 2439971 oerrors 10 collisions 209327
defer 8 framing 1 crc 0 sqe 0 code_violations 0 len_errors 1
drop 0 buff 0 oflo 0 uflo 0 missed 0 tx_late_collisions 2
retry_error 8 first_collisions 0 nocarrier 0 inits 7 nocanput 0
allocbfail 0 runt 0 jabber 0 babble 0 tmd_error 0 tx_late_error 0
rx_late_error 0 slv_parity_error 0 tx_parity_error 0 rx_parity_error 0
slv_error_ack 0 tx_error_ack 0 rx_error_ack 0 tx_tag_error 0
rx_tag_error 0 eop_error 0 no_tmds 0 no_tbufs 0 no_rbufs 0
rx_late_collisions 0 rbytes 320156380 obytes 645775764 multircv 7184 multixmt 3
brdcstrcv 449782 brdcstxmt 1295 norcvbuf 0 noxmtbuf 1
The command shown is Solaris 2.6 specific; use % netstat -k | more on older releases to find the network data. The interface shown has had
a few specific errors (framing, len_errors, retry_error), but not enough to
worry about. The input errors are most probably coming from bad broadcast
packets.
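If you want to put the counters in proportion, a small script can turn the name/value pairs into numbers. The following is a minimal sketch in Python, assuming the output has the layout shown above. The names SAMPLE and parse_counters are made up for the example, and only the first two lines of the listing are included.

```python
SAMPLE = """\
hme1:
ipackets 1282154 ierrors 298 opackets 2439971 oerrors 10 collisions 209327
defer 8 framing 1 crc 0 sqe 0 code_violations 0 len_errors 1
"""

def parse_counters(text):
    """Turn 'name value name value ...' lines into a dict of integers."""
    counters = {}
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) == 1 and tokens[0].endswith(":"):
            continue  # interface header such as "hme1:"
        # Counters come in name/value pairs across each line.
        for name, value in zip(tokens[0::2], tokens[1::2]):
            if value.isdigit():
                counters[name] = int(value)
    return counters

counters = parse_counters(SAMPLE)
nonzero = {name: count for name, count in counters.items()
           if count and name in ("defer", "framing", "crc", "len_errors")}
input_error_pct = 100.0 * counters["ierrors"] / counters["ipackets"]
print(nonzero)
print("input errors: %.3f%% of input packets" % input_error_pct)
```

Here 298 input errors against about 1.28 million input packets is a small fraction of one percent, which is consistent with the "not enough to worry about" reading above.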
It's up to you to figure out how to fix network errors; all my tool did was tell
you that they were there in case you weren't looking for them. I can't
figure out all possible failure modes for you. You can increase the threshold
that SE uses to one/second using: setenv ENET_ERROR_PROBLEM 1.0 before you
run a script, and you won't see black states unless it gets very bad.
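To see what the numbers mean, it helps to convert rates into error counts over the measurement interval. This is a minimal sketch in Python, not SE's rule code. The helper names are hypothetical, the environment is passed in as a plain dictionary for the demonstration, and the "at or above the threshold" comparison is an assumption about how the rule behaves.

```python
INTERVAL = 30.0  # seconds, the default measurement period mentioned above

def errors_in_interval(rate_per_sec, interval=INTERVAL):
    """Convert an error rate into an error count over one interval."""
    return rate_per_sec * interval

def raised_threshold(env):
    """Read ENET_ERROR_PROBLEM from an environment mapping, or None if unset."""
    value = env.get("ENET_ERROR_PROBLEM")
    return float(value) if value is not None else None

def is_problem(rate_per_sec, threshold):
    """Assumed comparison: a rate at or above the threshold is a problem."""
    return rate_per_sec >= threshold

observed = 0.1                          # Ierr/s from the le0 report above
env = {"ENET_ERROR_PROBLEM": "1.0"}     # what the setenv command would supply
threshold = raised_threshold(env)

seen = errors_in_interval(observed)     # about 3 errors in 30 seconds
needed = errors_in_interval(threshold)  # 30 errors in 30 seconds
print(seen, needed, is_problem(observed, threshold))
```

At 0.1 errors per second the interface collects about three errors per 30-second interval, while a threshold of one per second corresponds to 30 errors in the same period, so the le0 example would no longer be flagged.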
How can I see disk stripes, RAID units, etc. with SE?
You need to be using the latest DiskSuite. It automatically registers I/O kstats
so that iostat shows mdXX performance. If you are using an older
release, neither iostat nor SE will see the md devices. DiskSuite 4.1
is the version that started reporting data. Veritas Volume Manager does not
generate I/O kstats, so SE cannot get at the data.
SE picks up mdXX, and the SE rules code filters out disks that don't have
partitions on them, so only the top level md entries get picked up
by tools like zoom and virtual_adrian. (This filtering is needed in 2.6
to filter out disk partition data.)
The RSM2000 is a hardware RAID controller-based system that uses dual
redundant access to the RAID unit over two SCSI buses. Since there are two
ways to get at each disk, with different controller and target numbers, a
pseudo device is used. The RAID unit presents each RAID5 or stripe as if
it were a single disk partition, but SE cannot figure out what the disk really is
and omits it entirely. We know this is a problem, but it is hard to
solve, and we haven't had time to figure it out yet.
Do my old modified scripts still work?
SE3.0 is an incompatible upgrade to the language specification. If
you have your own custom scripts they will need some changes to the
code and the APIs. Most of the changes are to make it more C-like.
The call to time() now takes an argument. You have to use time(0). The
addr() function is replaced by an ampersand (&) operator,
although there is still no real pointer support. Some of the C
interface APIs are a bit cleaner.
Many of the classes have also been upgraded. In particular the
p_iostat_class now includes full disk name and partition
information, so the path_to_inst class is no longer needed. The code
takes longer to start up, but is much more efficient at runtime when
you have a large number of disks. The set of #includes at the head
of each script is slightly different to previous releases.
Does SE3 work reliably? How many people are using it?
SE3 seems to be off to a good start. So far several hundred users have used the
notification e-mail to tell us that they have installed it, and the only
problems we have seen so far are related to patch levels as listed above.
Wrap up
Remember that you have the source to the scripts. If you find a
problem you may be able to fix it yourself. Any problems, suggested
fixes, or queries about SE should go to the
se-feedback@chessie.eng.sun.com alias only, which goes to Rich Pettit and
me and is logged for posterity.
Disclaimer -- SE is an unsupported experimental toolkit that is
designed to make it easy to rapidly generate prototypes and try out
new ideas. It is not a production quality performance management
product. Vendors of such products are welcome to use the ideas
expressed in the SE toolkit to improve their products' ability to
manage Solaris-based systems.
Acknowledgments -- Rich Pettit no longer works at Sun but has still
found time to update SE. Please pay him back for his time and effort
by taking a look at his real product, the Resolute Software RAPS -- Realtime
Application Performance System. Mike Bennett wrote the
tcp_monitor GUI and several of the TCP rules. Thanks to the TCP
group at SunSoft for feedback on the rules, which should be
considered experimental at this stage.
Resources and Related Links
Other Cockcroft columns at www.sun.com