[Mckernel-users 14] Re: Some questions about large scale tests with mckernel
Balazs Gerofi
bgerofi at riken.jp
Sun Apr 23 09:00:34 JST 2017
Hello Jeremie,
I would suggest that you use MVAPICH or Intel MPI, but let's focus on the
boot bug first. I've added a change to make sure we can see the full kmsg
when boot fails.
Could you please pull the IHK repository, retry the boot script, and send
me the new dmesg log?
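In case it helps, the sequence I have in mind is roughly this (the build
step is an assumption and may differ in your tree):

  cd ihk && git pull           # pick up the kmsg change
  make && sudo make install    # rebuild/reinstall IHK (adjust to your build)
  sudo ./mcreboot.sh           # retry the boot script
  dmesg > mckernel-boot.log    # capture the new dmesg log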
Thanks,
Balazs
On Thu, Apr 20, 2017 at 6:37 AM, FINIEL, JEREMIE <jeremie.finiel at atos.net>
wrote:
> Hello Balazs,
>
> Thank you for your quick reply.
>
>
>
> I captured a set of hwloc and mpi_hello_world executions in the attached
> ‘execution’ file.
> It contains my console output for hwloc-ls in both Linux and McKernel,
> and for mpi_hello_world with and without pinning.
>
> As you can see (Line 161), when pinning is not explicit, the command
> prints an error and then hangs.
>
> When pinning is explicitly disabled (Line 217), the error does not
> appear, but the command hangs as well.
>
>
>
> Here is my environment:
>
> 1 node
>
> CPU Xeon E5-2680, 2 sockets, 12 cores per socket, 1 thread per core
>
> uname -r: 3.10.0-327.el7.x86_64
>
> OpenMPI version: 2.0.0
>
>
>
> Best regards,
>
>
>
> Jérémie Finiel
>
>
>
>
>
>
>
> *From:* Balazs Gerofi [mailto:bgerofi at riken.jp]
> *Sent:* Thursday, April 20, 2017 6:11 AM
> *To:* FINIEL, JEREMIE; mckernel-users at pccluster.org
> *Cc:* LAFERRIERE, Christophe; WELTERLEN, BENOIT; Olivier Gruber
> *Subject:* Re: Some questions about large scale tests with mckernel
>
>
>
> Hello Jeremie,
>
>
>
> I have added the mckernel-users mailing list to CC so that we all see your
> messages, please keep it in the loop!
>
>
>
> On Wed, Apr 19, 2017 at 8:07 AM, FINIEL, JEREMIE <jeremie.finiel at atos.net>
> wrote:
>
> We had difficulty launching McKernel on a Nehalem processor (an E5540 in
> this case). When starting mcreboot.sh, we simply got the error “error:
> booting” twice. Please find the dmesg log attached.
>
>
>
> I haven't had a chance to try McKernel on Nehalem so far, but I would
> expect this is more related to your Linux kernel version; which version
> are you running? Also, I looked at the dmesg, but unfortunately the root
> cause of the failure is not visible. I will need to adjust how the kmsg
> is printed to make that part visible. Let me try to get a patch done this
> week, and I'll contact you again.
>
>
>
> We tried to launch an MPI program with OpenMPI, but we got an error from
> the hwloc pinning functions (hwloc_set_cpubind returned "Error" for bitmap "0").
>
> As described in your paper, we also tried MVAPICH2, which works perfectly.
>
> Could you let me know what specific development work you had to do to get
> MVAPICH2 working with McKernel? I'm trying to gauge how hard it would be
> to get OpenMPI working as well.
>
>
>
> Are you trying to run OpenMPI on another host? Could you let me know the
> platform and the configuration you use to boot McKernel?
>
> hwloc is generally supported; what do you get for mcexec hwloc-ls? Does
> that work? Another thing you could try is disabling binding in OpenMPI,
> just as a test.
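>
> For example, something along these lines (the process count and the
> mpi_hello_world binary are placeholders, not a verified recipe):
>
>   ./mcexec hwloc-ls
>   mpirun --bind-to none -np 2 ./mcexec ./mpi_hello_world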
>
>
>
> Now we would like to run some tests at a larger scale, so we developed a
> script to launch McKernel on each machine, but we ran into access-rights
> difficulties. mcreboot.sh must be launched with privileged access, but
> mcexec can be executed by a user only if /dev/mcos0 is accessible to that
> user. For the moment, to avoid using root for every execution, I change
> the owner of /dev/mcos0 as a workaround (see the sketch below).
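>
> For example (the target user is whoever will run mcexec; a sketch of the
> workaround, not a permanent solution):
>
>   sudo chown $USER /dev/mcos0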
>
> Furthermore, when executing “./mcexec mpirun …”, we noticed that execution
> on the other machine happens on the Linux side and not in McKernel,
> whereas “mpirun ./mcexec …” seems to do the job (see the sketch below).
> I would be interested to know how you launch your tests at large scale.
> If, by any chance, you have a template script, it would be helpful.
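>
> For reference, the form that does end up inside McKernel looks roughly
> like this (host file, process count, and program name are placeholders):
>
>   mpirun -np 2 -hostfile ./hosts ./mcexec ./mpi_hello_world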
>
>
>
> Ah, yes, we are aware of this. One way to get around it is to set your
> umask to 0002 and run the boot script with sudo (that is, not directly
> as root).
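>
> In other words, something like this (assuming mcreboot.sh needs no extra
> arguments in your setup):
>
>   umask 0002
>   sudo ./mcreboot.sh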
>
>
>
> Best,
>
> Balazs
>
>
>
> Thank you in advance.
>
>
>
> Best regards,
>
> Jérémie Finiel
>
>
>
>
>