[SCore-users-jp] Re: [SCore-users] some timeout problem?

Shinji Sumimoto s-sumi @ flab.fujitsu.co.jp
Fri Mar 25 11:06:59 JST 2005


Hi.

From: David Werner <david.werner @ iws.uni-stuttgart.de>
Subject: [SCore-users] some timeout problem?
Date: Thu, 24 Mar 2005 11:04:21 +0100
Message-ID: <20050324100421.GA3244 @ nalle.bauingenieure.uni-stuttgart.de>

david.werner> Dear List, 
david.werner> 
david.werner> We use score with the network-trunking facility over two ethernet
david.werner> network cards (100 Mbit/sec).  All network cards in the score have their
david.werner> own exclusive interrupt and are also exclusively used
david.werner> by the pm_ethernet driver.
david.werner> Every node computer has a third network card which is only used
david.werner> for the TCP/IP traffic.  We run SCore-5.6.0 on a 2.4.21 Linux kernel. 
david.werner> What I occasionally observe is the following kernel message occurring
david.werner> randomly once in a few weeks on some single nodes of the cluster: 
david.werner> 
david.werner> eth2: TX underrun, threshold adjusted.
david.werner> or
david.werner> eth1: TX underrun, threshold adjusted.

You are probably using an eepro100 NIC, aren't you?
This is not really an error, especially when it occurs only about once a week.

Here is a description of the problem:

http://www.ussg.iu.edu/hypermail/linux/kernel/0401.1/0651.html
==========================================================================
This isn't really an error, it's an indicator that the pci-bus doesn't
really keep up, then the NIC has to increase the threshold (it tries to
start sending the packet out before it's fully transferred from main
memory to the NIC, it hopes the rest of the packet will have been
transferred in time, this message indicates that it wasn't so the NIC
had to increase the threshold of how much of the packet has to have been
transferred before it starts sending it out)

This happens with the eepro100 driver as well but it doesn't tell you
about it, it just increases the threshold and goes on.
The e100 driver tells you about it _and_ it actually decreases the
threshold if there hasn't been any underruns for a while, and when it is
decreased, the threshold gets too small and you get an underrun
again....
==========================================================================
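
As a rough illustration of the behaviour described in the quoted text, here
is a toy model in Python. It is only a sketch: the function name, the
parameters and all numbers are assumptions made up for this example and are
not taken from the eepro100 or e100 driver sources.

```python
# Toy model of a NIC transmit threshold. All names and numbers are
# illustrative assumptions, not values from the eepro100/e100 drivers.

def simulate(needed_per_packet, start=8, step=8, max_threshold=200,
             quiet_limit=None):
    """Return (threshold_history, underrun_count).

    needed_per_packet[i] is the smallest threshold that would have
    avoided an underrun for packet i (a stand-in for how slowly the
    PCI bus delivered that packet).

    quiet_limit=None : never lower the threshold (raise-only policy).
    quiet_limit=N    : lower the threshold by `step` after N packets
                       in a row without an underrun.
    """
    threshold = start
    quiet = 0          # packets sent since the last underrun or decrease
    underruns = 0
    history = []
    for needed in needed_per_packet:
        if threshold < needed:
            # The NIC started sending too early: raise the threshold.
            underruns += 1
            threshold = min(threshold + step, max_threshold)
            quiet = 0
        else:
            quiet += 1
            if quiet_limit is not None and quiet >= quiet_limit:
                # Quiet for a while: try a smaller threshold again.
                threshold = max(threshold - step, start)
                quiet = 0
        history.append(threshold)
    return history, underruns


if __name__ == "__main__":
    load = [16] * 10   # constant bus delay, hypothetical units
    print(simulate(load))                 # raise-only policy
    print(simulate(load, quiet_limit=3))  # raise-and-lower policy
```

In this model, under a constant load the raise-only policy underruns once and
then stays quiet, while the raise-and-lower policy underruns again each time
it lowers the threshold. That matches the description above of why the
message can keep reappearing without anything being broken.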

david.werner> We use eth1 and eth2 for pm_ethernet.
david.werner> Today I observed that its occurrence correlated with a crash of scored 
david.werner> run by sc_watch.
david.werner> Is there something I can do to improve the stability of our
david.werner> score installation?

How many nodes are you running in scored multi-user mode? And are there any
problems with your cluster hardware? 
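
One simple way to look for a hardware pattern is to count the underrun
messages per interface on each node. The following is a minimal sketch in
Python; it assumes only the message format quoted in this thread, and the
function name is made up for this example.

```python
import re
from collections import Counter

# Matches the kernel message quoted in this thread, e.g.
# "eth2: TX underrun, threshold adjusted."
UNDERRUN = re.compile(r"\b(eth\d+): TX underrun, threshold adjusted")


def count_underruns(lines):
    """Count TX underrun messages per interface in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = UNDERRUN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    import sys
    # Read kernel log text from standard input and print one line per NIC.
    for nic, n in sorted(count_underruns(sys.stdin).items()):
        print(nic, n)
```

Saved under a name of your choice, the script can read the output of dmesg or
a saved kernel log on standard input. If one node or one interface shows far
more underruns than the others, that machine may be worth a closer look.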

If there is no problem with your cluster hardware, 
please try increasing the sc_watch timeout.

Kameyama-san, could you explain how to increase the sc_watch
timeout?

Shinji.

david.werner> Greetings,
david.werner> 	David
david.werner> _______________________________________________
david.werner> SCore-users mailing list
david.werner> SCore-users @ pccluster.org
david.werner> http://www.pccluster.org/mailman/listinfo/score-users
david.werner> 
------
Shinji Sumimoto, Fujitsu Labs


