Thursday, July 30, 2009

Getting A Handle On Dropped Packets – 4 Key Troubleshooting Tips

In today’s net-centric environment, organizations often depend on the network for voice over IP, video conferencing and webcasts. Network problems caused by packet loss can create noticeable performance issues.

Is packet loss a problem your organization is experiencing? Could it be contributing to larger problems? One thing’s for sure: almost every network experiences packet loss to one degree or another. Dropped packets can originate in almost any part of the network path, from bad cables to flaky applications. Here are a few common causes and what you can do to fix them:

1. Find the source. First, understand where the packet loss is occurring. A command-line trace tool (tracert on Windows, traceroute on Unix-like systems) helps narrow down where along the path the loss occurs. Then, learn the extent of the problem by using the “netstat -s -p tcp” command. This will display the total segments sent and total segments retransmitted. (Check out this site for more information on how to use command line tools.)

2. It's probably the cables. More often than not, it’s this simple. Check that nothing has been placed on top of the cable and that the connections are tight. Test by replacing the potentially bad cable with a known good one. Frequently, very long cables will only show excessive problems when they are processing a heavy load.

3. Check for duplex mismatches. Many systems auto-negotiate speed and duplex; however, errors do occur. Consider manually setting both ends of the link to the same known values and see if the problem goes away.

4. Unleash the routers! If your routers are overwhelmed, they will drop packets. Check for excessive utilization across each link and make sure the system overall is not saturated. Not all routers drop packets at the same traffic level. Some Cisco routers can begin dropping packets at a CPU load of 50%; on other models this may not occur until 95% or more.
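
For tip 1, the two netstat counters are most useful as a ratio. Here is a minimal Python sketch that pulls "Segments Sent" and "Segments Retransmitted" out of netstat-style text and reports the retransmission percentage. The sample text and its numbers are made up for illustration, and the labels follow the Windows output style; other operating systems word these counters differently, so treat the parsing as an assumption to adapt.

```python
import re

# Illustrative excerpt in the style of Windows "netstat -s -p tcp" output.
# The numbers are invented for this example.
SAMPLE_OUTPUT = """
TCP Statistics for IPv4

  Segments Received                   = 48211
  Segments Sent                       = 50000
  Segments Retransmitted              = 750
"""


def parse_counter(text, label):
    """Return the integer that follows 'label =' in netstat-style text."""
    match = re.search(rf"{re.escape(label)}\s*=\s*(\d+)", text)
    if match is None:
        raise ValueError(f"counter not found: {label}")
    return int(match.group(1))


def retransmit_percent(text):
    """Retransmitted segments as a percentage of segments sent."""
    sent = parse_counter(text, "Segments Sent")
    retransmitted = parse_counter(text, "Segments Retransmitted")
    if sent == 0:
        return 0.0
    return 100.0 * retransmitted / sent


print(f"{retransmit_percent(SAMPLE_OUTPUT):.2f}% of sent segments were retransmitted")
```

Because these counters are cumulative since boot, comparing two readings taken a few minutes apart gives a better picture of current loss than a single reading.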

Tip for dopplerVUE users: The Locator view lets you sort all interfaces by packet loss so you can isolate the location of dropped packets instantly.

Friday, July 24, 2009

Taking Advantage of Network Virtualization

I’ve been hearing a lot about the benefits of virtualization for network management. From improved productivity to increased efficiency, network virtualization holds much promise. However, for all the benefits that virtualization brings to the field of network management, it also brings a few challenges.

How do you keep track of your virtual assets and real ones? Do you need a special team, tools or hardware to get the job done? It turns out that the most popular virtualization system (VMware) makes this an easy job for most network management systems when configured correctly.

VMware’s Workstation product gives you three choices for network virtualization: Bridged, NAT and Host Only.

Bridged: This mode creates a virtual switch that sits between the host NIC and the VM instance. The VM instance looks like another PC on the network: it shares the host NIC resources and has an IP address assigned via DHCP or static entry. A bridged VM instance looks and feels very much like a separate server on the network. It provides full monitoring capabilities similar to that of the host.

NAT: NAT mode uses the host machine's IP address to communicate with the network. As a result, no external IP address is assigned and the VM instance is not visible to the external network. This method provides a high level of security, but does not allow you to poll the VM instance directly. Monitoring a VM in this mode requires additional specialized software and agents.

Host Only: This method sets up a network that is completely contained within the host. It has no ability to communicate with the outside world. You will not be able to see the VM instance at all.

As you can see, the methods used for setting up networking on a virtual instance will determine what an IT management application will “see” and monitor. While this example is specific to VMware, most products offer similar options.
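
One way to keep this straight in an inventory is a small lookup table. The Python sketch below summarizes the three modes as described above and filters a guest list down to the VMs an external monitoring system could poll directly. The field names and the guest names are hypothetical and are not part of any VMware API.

```python
# Summary of the three VMware Workstation networking modes as described above.
# The field names are illustrative labels, not part of any VMware API.
NETWORK_MODES = {
    "bridged":   {"own_ip_on_lan": True,  "directly_pollable": True},
    "nat":       {"own_ip_on_lan": False, "directly_pollable": False},
    "host-only": {"own_ip_on_lan": False, "directly_pollable": False},
}


def pollable_guests(inventory):
    """Return the names of guests an external monitoring system can poll directly."""
    return [name for name, mode in inventory.items()
            if NETWORK_MODES[mode]["directly_pollable"]]


# Hypothetical guest inventory: VM name -> networking mode.
inventory = {"web01": "bridged", "build02": "nat", "lab03": "host-only"}
print(pollable_guests(inventory))
```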

If you’re a dopplerVUE user you can create a group and associate the virtual devices with a physical device. dopplerVUE groups provide a single view of all alarms and performance overlays, and allow drill-down access into the performance of each virtual server.

Thursday, July 23, 2009

Free Cyber Security Training! – No catches (ok you have to be a US citizen)

FEMA has put together a nice Cyber training website that is open to all US citizens. Once you enroll, there are a variety of courses on Cyber security (listed below) for both the technical and non-technical.









I took the program on “Network Assurance,” found it to be well written, and learned quite a bit. It filled in some gaps and adjusted my knowledge of some terminology.

Friday, July 17, 2009

Where Is All This Traffic Going?

Have you ever wondered where all the network traffic is going? The standard SNMP data gives an overview of all traffic in and out of an interface, but little in the way of details regarding source/destination and protocols in use. To learn where the traffic is going and what protocols are in use, you should check out what flow-based products can provide. The most popular is NetFlow, although there are other similar products available as well (JFlow, sFlow, etc.). Each product version has a few unique attributes, but they all provide a core set of information.

So what does a flow-based product do?
It provides an answer to the question, “Where is the traffic going?” NetFlow reports the top sources and destinations (who is talking to which destination IP) by examining packet headers for source and destination addresses, ports and protocols. So, not only can you tell who is going to what server or website, but you can also tell what port and protocol is being used. This can often be used to identify popular applications and external “resources”. Here are some sample NetFlow reports:






With this report I can see the top conversations – multiple people hitting the same IP.








Here I can learn more about the type of traffic on my network.






And here I can see the top sources coming into my network. Very helpful as a supplement to your security measures.

*Note – the IPs have been changed to protect the innocent. A 192.168.x.x address is private and would not normally be an incoming source.
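
Reports like these boil down to grouping flow records by a key and summing the bytes. Here is a rough Python sketch of that idea. The record layout and the sample flows are hypothetical (the addresses come from the ranges reserved for documentation), and a real flow collector would do this over far more data.

```python
from collections import Counter

# Hypothetical flow records:
# (source IP, destination IP, destination port, protocol, bytes).
FLOWS = [
    ("192.0.2.10", "198.51.100.5", 80, "tcp", 120_000),
    ("192.0.2.11", "198.51.100.5", 80, "tcp", 80_000),
    ("192.0.2.10", "198.51.100.9", 53, "udp", 2_000),
    ("192.0.2.12", "198.51.100.5", 443, "tcp", 40_000),
]


def top_by(flows, key_func, n=3):
    """Sum bytes per key and return the n largest (key, bytes) pairs."""
    totals = Counter()
    for flow in flows:
        totals[key_func(flow)] += flow[4]
    return totals.most_common(n)


top_sources = top_by(FLOWS, lambda f: f[0])            # who is sending the most
top_destinations = top_by(FLOWS, lambda f: f[1])       # where the traffic is going
top_services = top_by(FLOWS, lambda f: (f[3], f[2]))   # which protocol and port

print(top_sources)
print(top_destinations)
print(top_services)
```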

What do you need to get started with NetFlow?
NetFlow is already installed on many Cisco routers, so make sure to check the Cisco website for your model and version or buy a NetFlow-enabled router. You’ll also need a NetFlow-enabled network management system to create quality reports. Keep in mind that monitoring flow-based data is information intensive and will use resources on the router and storage space on your network management system. One option to save storage space is to use NetFlow on demand: only enabling the monitoring when necessary to troubleshoot.

It’s not hard to configure!
Here is a sample of the steps for setting up NetFlow v5:
1) Enter global configuration mode on the router, and issue the following commands for each interface on which you want to enable NetFlow:
a) Router#configure terminal
b) Router(config)#interface {interface} {interface_number} (Example: interface FastEthernet 0/1)
c) Router(config-if)#ip route-cache flow
d) Router(config-if)#exit
2) Export NetFlow data to your NetFlow-enabled network management system (replace {ip_address} with the address of that system; 9996 is the UDP port it listens on):
a) Router>enable
b) Password:
c) Router#configure terminal
d) Router(config)#ip flow-export destination {ip_address} 9996
e) Router(config)#ip flow-export version 5
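
Once the export is running, the management system receives UDP datagrams on the configured port, each starting with a 24-byte NetFlow v5 header followed by 48-byte flow records. As a rough illustration of what the receiving side does first, here is a Python sketch that unpacks that header. The sample datagram is built by hand for the example, and a real collector would read datagrams from a UDP socket and go on to decode the records.

```python
import struct
from collections import namedtuple

# NetFlow v5 export header: 24 bytes in network byte order.
V5_HEADER = struct.Struct("!HHIIIIBBH")
V5_RECORD_SIZE = 48  # each flow record after the header is 48 bytes

Header = namedtuple(
    "Header",
    "version count sys_uptime unix_secs unix_nsecs "
    "flow_sequence engine_type engine_id sampling_interval",
)


def parse_v5_header(datagram):
    """Unpack the NetFlow v5 header and sanity-check the version field."""
    if len(datagram) < V5_HEADER.size:
        raise ValueError("datagram is shorter than a v5 header")
    header = Header(*V5_HEADER.unpack_from(datagram))
    if header.version != 5:
        raise ValueError(f"expected version 5, got {header.version}")
    return header


# Hand-built sample: a header announcing 2 records, followed by zeroed records.
sample = V5_HEADER.pack(5, 2, 1000, 1248912000, 0, 42, 0, 0, 0)
sample += bytes(2 * V5_RECORD_SIZE)

header = parse_v5_header(sample)
print(header.version, header.count, header.flow_sequence)
```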
There is much to learn about NetFlow. When you’re ready for the deep technical stuff, check out this Cisco article.



Tuesday, July 14, 2009

Network Troubleshooting: IP SLA + WMI = Better Web Services

Why is the network so slow? I’m sure you’ve never heard this complaint before :) Diagnosing the problem isn’t always easy with so many possible culprits. You can start by running down the network troubleshooting checklist:
The DNS service?
The web server?
The WAN link?

IP SLA and WMI information is critical to diagnosing potential network problems. For most Cisco devices, IP SLA can give you performance information for the connectivity layers of a net-centric service like a web application or VoIP. In Microsoft environments, WMI can do the same for the application/server/desktop layer. Combining WMI with IP SLA provides performance information about both layers, giving you an end-to-end view of your web application or other net-centered service so you can troubleshoot issues efficiently.

Using IP SLA to Assess the User Experience
IP SLA (Internet Protocol Service Level Agreements) is embedded in the Cisco IOS (Internetwork Operating System) for most Cisco routers and switches. IP SLA operations can measure delay (round trip time), jitter, packet loss, connectivity, voice quality scores, and many other key metrics for monitoring and troubleshooting network elements.

Additionally, threshold levels can be set for most metrics. When a metric crosses a threshold level, IP SLA sends an SNMP trap to the specified IP addresses.

You can configure an IP SLA HTTP operation to monitor the overall user experience for the “connectivity layer” of a web application (or any other net-centered application such as email, VoIP or videoconferencing). This operation uses a synthetic web transaction to measure the total round trip time (RTT) to perform a DNS query, establish a TCP connection to the HTTP service, and retrieve the web site’s home page. By configuring the HTTP operation on the LAN switch closest to users, the total RTT (or latency) is an accurate measure of the users’ experience (as opposed to measuring RTT from a central network management server).

Next, configure an IP SLA ICMP Echo operation to monitor RTT between the switch on the user LAN and the switch to which the web server is connected. This way, if the HTTP operation indicates the web transaction is slow or unresponsive, you can check the WAN RTT between the switches to see whether the problem is related to the WAN link or something on the web server.
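
The reasoning in the last two paragraphs can be written down as a small decision function. This Python sketch takes the latest RTT from the HTTP operation and from the ICMP Echo operation and suggests where to look first. The threshold values are hypothetical defaults chosen for illustration; you would set them from your own baseline.

```python
def locate_slowdown(http_rtt_ms, wan_rtt_ms, http_limit_ms=2000.0, wan_limit_ms=150.0):
    """Suggest where to look first, given the two IP SLA round-trip times.

    http_rtt_ms: total RTT reported by the HTTP operation on the user LAN switch.
    wan_rtt_ms:  RTT reported by the ICMP Echo operation between the two switches.
    The limits are illustrative assumptions, not Cisco defaults.
    """
    if http_rtt_ms <= http_limit_ms:
        return "ok"
    if wan_rtt_ms > wan_limit_ms:
        return "check WAN link"
    return "check web server"


print(locate_slowdown(800, 40))     # web transaction within its limit
print(locate_slowdown(5000, 400))   # slow transaction, slow WAN
print(locate_slowdown(5000, 40))    # slow transaction, healthy WAN
```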

Watching the Applications and Servers: Adding WMI
WMI (Windows Management Instrumentation) is an instrumentation tool similar to IP SLA that Microsoft has created for its products. WMI provides thousands of performance metrics for applications such as MS Exchange and MS SQL Server, as well as for server hardware and operating system components.

Microsoft has a built-in performance administration tool for monitoring WMI data for applications and servers. Using the tool you can view each server’s CPU utilization, physical memory and free disk space. Each of these subsystems is critical to the server’s performance regardless of the application running. A lack of memory or CPU cycles and low disk space are common causes of slowdowns on a server. You’ll have to go into each server to view the individual performance counters, or you can use network management software to simplify the process by collecting any of the thousands of available WMI counters from across multiple servers.

Getting the End-To-End View
An end-to-end view of the network will really help you troubleshoot network problems much faster and avoid the common complaints you often hear. To get an end-to-end view, consider network management software such as dopplerVUE that integrates fault and performance data from a variety of sources, including SNMP, syslog, WMI and IP SLA. With it, you can integrate metrics from both layers of a web service into a single end-to-end dashboard view. Using dopplerVUE’s drag-and-drop interface, you can quickly create an integrated view of both layers of the service without having to shift between tools or viewers (screenshot below).

Monday, July 13, 2009

Heading Off Trouble with Exchange Servers

I recently discussed the frequency of email failures in a June post. As a follow-up I wanted to provide some practical tips on managing Microsoft Exchange Servers to ensure the highest possible service levels for your users and head off problems before they become critical.

For Exchange servers, Microsoft's Windows Management Instrumentation (WMI) performance counters provide a simple and effective method for monitoring. If your network management solution supports WMI, you can easily leverage it to manage Exchange servers.

Monitoring Queue Size
With over one thousand WMI performance counters available for an Exchange server, you can get very sophisticated in managing your devices and processes. For most people, however, the following counters for the Information Store service can provide a good indication of overall Exchange performance.
- MSExchangeISMailbox:SendQSize
- MSExchangeISMailbox:ReceiveQSize
- MSExchangeISPublic:SendQSize
- MSExchangeISPublic:ReceiveQSize
These counters reflect the message queue sizes for each instance of the public or mailbox stores. Although brief spikes are not uncommon, all of these counters should be close to zero during normal operations. Queue sizes that do not return to nearly zero within 10 to 15 minutes indicate a potential issue with message routing or service processing; however, larger environments may have queue sizes ranging from 5 to 10 while exhibiting acceptable performance.

Another counter to consider is the MTA Work Queue Length (MSExchangeMTA:WorkQueueLength), which shows the number of queued messages being sent to or received from email servers other than Exchange Server 2003. A queue size that consistently exceeds 10 or 20 messages may indicate a problem with the MTA service.
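
The key point with queue counters is duration: a brief spike is fine, a queue that stays elevated is not. Here is a minimal Python sketch of that check over a series of polled values. The sample format (minute, queue size), the function name, and the default threshold of 10 are assumptions for illustration; the 15-minute window comes from the guideline above.

```python
def queue_stuck(samples, threshold=10, max_minutes=15):
    """Return True if the queue stayed above threshold for max_minutes or longer.

    samples: list of (minute, queue_size) pairs in time order.
    """
    elevated_since = None
    for minute, size in samples:
        if size > threshold:
            if elevated_since is None:
                elevated_since = minute
            if minute - elevated_since >= max_minutes:
                return True
        else:
            # Queue drained back below the threshold; reset the timer.
            elevated_since = None
    return False


brief_spike = [(0, 0), (5, 40), (10, 3), (15, 0)]
sustained = [(0, 12), (5, 15), (10, 18), (15, 20)]

print(queue_stuck(brief_spike))
print(queue_stuck(sustained))
```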

Monitoring Email Delivery
Additionally, six more performance counters related to email delivery can provide a more rounded view of Exchange server performance. The counter values are unique to each environment, but monitoring them over time provides a baseline for a server’s steady state performance.
- MSExchangeISMailbox:AvgDeliveryTime(s)
- MSExchangeISMailbox:MsgsSentPerMin
- MSExchangeISMailbox:MsgsDeliveredPerMin
- MSExchangeISPublic:AvgDeliveryTime(s)
- MSExchangeISPublic:MsgsSentPerMin
- MSExchangeISPublic:MsgsDeliveredPerMin

Average delivery time values should be in the range of 600 to 900 milliseconds. Values greater than 1500 milliseconds indicate a performance problem. While the number of messages sent and delivered per minute is mostly informational in nature, it provides a good indication of general performance.
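
Since these values are unique to each environment, it helps to compare new readings against your own history rather than only against fixed numbers. The Python sketch below builds a simple baseline (mean and standard deviation) from past average delivery times and flags readings well above it. The history values and the three-sigma rule are illustrative assumptions, not a Microsoft recommendation.

```python
from statistics import mean, pstdev


def baseline(history):
    """Return (mean, standard deviation) of past readings."""
    return mean(history), pstdev(history)


def is_unusual(value, history, sigmas=3.0):
    """True if value is more than `sigmas` standard deviations above the mean."""
    average, spread = baseline(history)
    return value > average + sigmas * spread


# Hypothetical average delivery times in milliseconds from earlier polls.
history_ms = [700, 720, 680, 710, 690]

print(is_unusual(730, history_ms))
print(is_unusual(1600, history_ms))
```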

Monitoring Server Performance
To effectively monitor an Exchange server, it is important to monitor the underlying server resources as well. Again, there are thousands of available performance counters, but the following counters offer a good overview of server performance and resources without swamping you in data.
- Processor:%ProcessorTime. Processor or CPU utilization, on average, should be less than 70%. Utilization greater than 85-90% for more than 30 minutes, or 90-100% for more than 10 minutes, indicates an overloaded server.
- PagingFile:%Usage. The paging file for virtual memory should be less than 75%. Excessive paging, say 85-90% for any period of time, is cause for concern.
- Memory:AvailableMBytes. Physical memory values below 20MB indicate insufficient RAM.
- LogicalDisk:%DiskTime. The amount of time a disk spends reading and writing data should be in the neighborhood of 60-70%, although brief spikes are not unusual.
- LogicalDisk:%FreeSpace. Exchange uses a lot of disk space, so overall free space should be monitored closely. The Windows and Exchange volume should have 256MB of free space; the Exchange database volume should have 1GB of free space; and the transaction log volume should have 100MB of free space.
- NetworkInterface:CurrentBandwidth. Acceptable interface bandwidth will depend on the type and size of the network, but generally speaking the average bandwidth should be 50-60% of maximum capacity.
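
A few of these guidelines translate directly into checks. Here is a small Python sketch that takes one snapshot of values (ideally averages over your polling window) and returns a list of warnings. The thresholds are taken from the list above; the dictionary keys and the sample snapshot are hypothetical, and collecting the values from WMI is left to your monitoring tool.

```python
# Minimum free space per volume in MB, as listed above (1GB taken as 1,024MB).
MIN_FREE_MB = {"system": 256, "database": 1024, "logs": 100}


def check_server(snapshot):
    """Return a list of warnings for one snapshot of counter values."""
    warnings = []
    if snapshot["cpu_percent"] >= 70:
        warnings.append("CPU utilization is at or above 70%")
    if snapshot["paging_file_percent"] >= 75:
        warnings.append("paging file usage is at or above 75%")
    if snapshot["available_mbytes"] < 20:
        warnings.append("less than 20MB of physical memory available")
    for volume, minimum in MIN_FREE_MB.items():
        if snapshot["free_mb"][volume] < minimum:
            warnings.append(f"{volume} volume has less than {minimum}MB free")
    return warnings


# Hypothetical snapshot: busy CPU and a database volume that is running low.
snapshot = {
    "cpu_percent": 82,
    "paging_file_percent": 40,
    "available_mbytes": 512,
    "free_mb": {"system": 4096, "database": 900, "logs": 2048},
}

for warning in check_server(snapshot):
    print(warning)
```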

If your network management solution doesn't support WMI, or you are looking for a proven solution, consider dopplerVUE. It provides powerful network management capabilities in an easy-to-use software package.

Friday, July 10, 2009

Monitoring Bandwidth Part 2: Examining SNMP Traffic Data

Let’s start by discussing what we really want to know about bandwidth:

1. How much is moving across any given interface?
2. Is the interface maxed out?
3. Is the device or devices beyond this one slow (or down)?

SNMP MIB-II enabled devices provide the following key metrics that will be used to derive answers to 1 & 2.

ifSpeed - The interface's current bandwidth in bits per second
ifInOctets - The total number of octets received on the interface
ifOutOctets - The total number of octets transmitted out on the interface
Source: RFC 1213

The octet metrics are simple counters that grow as traffic is passed on an interface. Using these metrics we can poll a device twice and use some “simple” math to determine the delta between the two polls. This gives us the number of octets that passed in the interval. Multiply that by 8 (an octet is 8 bits) and divide by the elapsed time in seconds to get an average bits-per-second rate. Or you could simply use a tool like dopplerVUE that does the math for you (screenshot below).
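
For anyone who wants to see the math, here is a minimal Python sketch. It assumes you already have two ifInOctets (or ifOutOctets) readings and the number of seconds between them; fetching them over SNMP is not shown. In MIB-II these are 32-bit counters, so the sketch also allows for the counter wrapping once between polls. The function names are my own.

```python
COUNTER32_MODULUS = 2 ** 32  # ifInOctets and ifOutOctets are 32-bit counters in MIB-II


def octet_delta(first, second):
    """Octets passed between two polls, allowing for a single counter wrap."""
    return (second - first) % COUNTER32_MODULUS


def average_bps(first_octets, second_octets, interval_seconds):
    """Average bits per second over the polling interval."""
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    return octet_delta(first_octets, second_octets) * 8 / interval_seconds


# Two polls taken 300 seconds apart.
print(average_bps(1_000_000, 38_500_000, 300))

# The counter wrapped past 2**32 between two polls taken 10 seconds apart.
print(average_bps(4_294_967_000, 704, 10))
```

On fast links a 32-bit counter can wrap more than once between polls, which this sketch cannot detect, so shorter polling intervals are safer there.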




* Important Tip - File sizes and interface speeds are not measured in the same units. Despite looking and sounding similar, each measurement is calculated in a different way, and confusing them is a common error. For example, network speeds are notated in bits per second, while files are normally referred to in bytes. There are 8 bits in a byte, and you also need to factor in that file notation grows in multiples of 1,024, not 1,000.

Notation examples:
Network Speed
1 Kbps = 1,000 bits per second
1 Mbps = 1,000,000 bits per second
1 Gbps = 1,000,000,000 bits per second

Data file size
1 KB = 1,024 Bytes
1 MB = 1,024 KB
1 GB = 1,024 MB
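
To see why the difference matters, here is a short Python sketch that works out how long a file would take to cross a link using the notations above. It is a best-case figure that ignores protocol overhead and competing traffic, and the function names are just for illustration.

```python
BITS_PER_BYTE = 8

FILE_UNITS = {"B": 0, "KB": 1, "MB": 2, "GB": 3}            # powers of 1,024
SPEED_UNITS = {"bps": 0, "Kbps": 1, "Mbps": 2, "Gbps": 3}   # powers of 1,000


def file_size_bits(size, unit):
    """Convert a file size to bits using 1,024-based units."""
    return size * 1024 ** FILE_UNITS[unit] * BITS_PER_BYTE


def link_speed_bps(speed, unit):
    """Convert a link speed to bits per second using 1,000-based units."""
    return speed * 1000 ** SPEED_UNITS[unit]


def transfer_seconds(size, size_unit, speed, speed_unit):
    """Best-case transfer time, ignoring overhead and other traffic."""
    return file_size_bits(size, size_unit) / link_speed_bps(speed, speed_unit)


# A 1 GB file over a 100 Mbps link.
print(f"{transfer_seconds(1, 'GB', 100, 'Mbps'):.1f} seconds")
```

A naive calculation that treats 1 GB as 1,000 megabits would guess 10 seconds, which is off by more than a factor of eight.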

Now that the amount of traffic is known you can compare this information to the ifSpeed metric to determine the percentage of the pipe that is full. You can figure out the math or let the tools do it for you (dopplerVUE screenshot below).
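
The percentage itself is a one-line calculation. Here is a minimal Python sketch, assuming you already have an average bits-per-second figure for the interface and its ifSpeed value; the sample numbers are invented.

```python
def utilization_percent(bits_per_second, if_speed_bps):
    """Percentage of the interface's capacity in use."""
    if if_speed_bps <= 0:
        raise ValueError("ifSpeed must be positive")
    return 100.0 * bits_per_second / if_speed_bps


# 45 Mbps of inbound traffic on a 100 Mbps interface.
print(f"{utilization_percent(45_000_000, 100_000_000):.1f}% utilized")
```

On a full-duplex interface, inbound and outbound traffic each get the full ifSpeed, so calculate the two directions separately rather than adding them together.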




To answer the final question, whether the traffic is causing a slowdown on the network, check the ping response time to the device and to the devices beyond it (if it is a router or switch).

There are many other items we can look at regarding traffic that indicate problems in the network. You can look for packet loss, discards and errors that are occurring (dopplerVUE screenshot below). We’ll explain why these issues occur and how to correct them in a different posting, but you should consider checking these metrics as well.

Thursday, July 2, 2009

Cisco Live…Great Training, but Not Much New Technology

Cisco Live 2009 was this week. If you didn’t attend, here are a few observations:

There were MASSIVE amounts of training sessions, high quality keynotes, but not much new in terms of technology at this event. It’s not just a show, it’s a training session. While many of the educational items cost money, not all of them do. Even some of the vendor presentations can offer you insights into the types of thinking you should be considering (see OPNET and their explanation of the types of problems that can cause application delay). Much of the free educational content can be had by getting an online account at the CiscoLive Virtual center.

Cisco recently launched a wiki providing simple content and clear explanations. It still needs to grow some, but the content is a good start for key Cisco products and technologies.

Counted only 3 green/power saving programs – Really in S.F? I expected more considering all the talk about green computing in the industry.

The B-52s and DEVO were the bands for Wednesday night – WOW!!!

If you came looking for new technologies, you would be disappointed. I didn’t see anything that wasn’t at Interop or some of the other shows earlier in the year. I think it’s more about making do with what we have and squeezing out all the value possible.

The next post will be a continuation on how to monitor bandwidth – helping you pack more through the pipes you already own.