Communication between client and server is erratic


I modified the geekinthecorner example so that it can send data continuously. I am using g++ 4.9.2.

I tried uninstalling the latest official OFED from here: http://downloads.openfabrics.org/OFED/

OFED Distribution Software Installation Menu
   1) View OFED Installation Guide
   2) Install OFED Software
   3) Show Installed Software
   4) Configure IPoIB
   5) Uninstall OFED Software
   Q) Exit
Select Option [1-5]:5
Uninstalling the previous version of OFED
Running rpm -e --allmatches libibverbs libibverbs-devel libibverbs-utils libmthca libmlx4 libcxgb3 libnes libipathverbs libibcm libibumad libibumad-devel libibmad ibacm librdmacm librdmacm-utils librdmacm-devel opensm opensm-libs dapl perftest mstflint ibutils infiniband-diags qperf infinipath-psm opensm opensm-libs libipathverbs dapl libibcm libibmad libibumad libibumad-devel libibverbs libibverbs-devel libibverbs-utils libipathverbs libmthca libmlx4 librdmacm librdmacm-devel librdmacm-utils ibacm ibutils ibutils-libs libnes infinipath-psm
Failed to uninstall the previous installation
See /tmp/OFED.22320.logs/ofed_uninstall.log
[idf@node1 OFED-1.5.4-20110726-0732]$ 
[idf@node1 OFED-1.5.4-20110726-0732]$ 

If I simply try to install it, I get this:

OFED Distribution Software Installation Menu
   1) Basic (OFED modules and basic user level libraries)
   2) HPC (OFED modules and libraries, MPI and diagnostic tools)
   3) All packages (all of Basic, HPC)
   4) Customize
   Q) Exit
Select Option [1-4]:3
Please choose an implementation of MVAPICH2:
1) OFA (IB and iWARP)
2) uDAPL
Implementation [1]: 1
Enable ROMIO support [Y/n]: 
Enable shared library support [Y/n]: 
Enable Checkpoint-Restart support [y/N]: 
Kernel 3.10.0-229.7.2.el7.x86_64 is not supported.
For the list of Supported Platforms and Operating Systems see
/mnt/gluster/Downloads/OFED-1.5.4-20110726-0732/docs/OFED_release_notes.txt
[idf@node1 OFED-1.5.4-20110726-0732]$ 
[idf@node2 Release]$ lspci | grep -i mel
02:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
[idf@node2 Release]$ 

[idf@node1 Release]$ ibv_devinfo
hca_id: mlx4_0
    transport:          InfiniBand (0)
    fw_ver:             2.7.200
    node_guid:          0025:90ff:ff1a:081c
    sys_image_guid:         0025:90ff:ff1a:081f
    vendor_id:          0x02c9
    vendor_part_id:         26428
    hw_ver:             0xB0
    board_id:           SM_2092000001000
    phys_port_cnt:          1
        port:   1
            state:          PORT_ACTIVE (4)
            max_mtu:        4096 (5)
            active_mtu:     4096 (5)
            sm_lid:         1
            port_lid:       2
            port_lmc:       0x00
            link_layer:     InfiniBand
[idf@node1 Release]$ ifconfig -a
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 2044
        inet 192.168.0.1  netmask 255.255.255.0  broadcast 192.168.0.255
        inet6 fe80::225:90ff:ff1a:71  prefixlen 64  scopeid 0x20<link>
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
        infiniband 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  txqueuelen 256  (InfiniBand)
        RX packets 5  bytes 280 (280.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 27 overruns 0  carrier 0  collisions 0

Below are the client and server. When I run this program, the client sends messages, but the number of messages it sends is erratic, and it frequently ends with the error message shown in the runs below.

Client:

#include <iostream>
#include <thread>
#include <netdb.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <rdma/rdma_cma.h>
#define TEST_NZ(x) do { if ( (x)) die("error: " #x " failed (returned non-zero)." ); } while (0)
#define TEST_Z(x)  do { if (!(x)) die("error: " #x " failed (returned zero/null)."); } while (0)
const int BUFFER_SIZE = 2048;
const int TIMEOUT_IN_MS = 500; /* ms */
struct context
{
    struct ibv_context *ctx;
    struct ibv_pd *pd;
    struct ibv_cq *cq;
    struct ibv_comp_channel *comp_channel;
    pthread_t cq_poller_thread;
};
struct connection
{
    struct rdma_cm_id *id;
    struct ibv_qp *qp;
    struct ibv_mr *recv_mr;
    struct ibv_mr *send_mr;
    char *recv_region;
    char *send_region;
    int num_completions;
};
static pthread_t msgThread;
static void die(const char *reason);
static void build_context(struct ibv_context *verbs);
static void build_qp_attr(struct ibv_qp_init_attr *qp_attr);
static void * poll_cq(void *);
static void post_receives(struct connection *conn);
static void register_memory(struct connection *conn);
static int on_addr_resolved(struct rdma_cm_id *id);
static void on_completion(struct ibv_wc *wc);
static int on_connection(void *context);
static int on_disconnect(struct rdma_cm_id *id);
static int on_event(struct rdma_cm_event *event);
static int on_route_resolved(struct rdma_cm_id *id);
static struct context *s_ctx = NULL;
#include <mutex>              // std::mutex, std::unique_lock
#include <condition_variable> // std::condition_variable
std::mutex mtx;
std::condition_variable cv;
bool ok_to_send_next_message = 1;
bool message_available()
{
    return 0 != ok_to_send_next_message;
}
int main(int argc, char **argv)
{
    struct addrinfo *addr;
    struct rdma_cm_event *event = NULL;
    struct rdma_cm_id *conn= NULL;
    struct rdma_event_channel *ec = NULL;
    if (argc != 3)
        die("usage: client <server-address> <server-port>");
    TEST_NZ(getaddrinfo(argv[1], argv[2], NULL, &addr));
    TEST_Z(ec = rdma_create_event_channel());
    TEST_NZ(rdma_create_id(ec, &conn, NULL, RDMA_PS_TCP));
    TEST_NZ(rdma_resolve_addr(conn, NULL, addr->ai_addr, TIMEOUT_IN_MS));
    freeaddrinfo(addr);
    while (0 == rdma_get_cm_event(ec, &event))
        //while (rdma_get_cm_event(ec, &event))
    {
        std::cout << "rdma_get_cm_eventn";
        struct rdma_cm_event event_copy;
        memcpy(&event_copy, event, sizeof(*event));
        rdma_ack_cm_event(event);
        if (on_event(&event_copy))
            break;
    }
    rdma_destroy_event_channel(ec);
    return 0;
}
void die(const char *reason)
{
    fprintf(stderr, "%sn", reason);
    exit(EXIT_FAILURE);
}
void build_context(struct ibv_context *verbs)
{
    if (s_ctx)
    {
        if (s_ctx->ctx != verbs)
            die("cannot handle events in more than one context.");
        return;
    }
    s_ctx = (struct context *)malloc(sizeof(struct context));
    s_ctx->ctx = verbs;
    TEST_Z(s_ctx->pd = ibv_alloc_pd(s_ctx->ctx));
    TEST_Z(s_ctx->comp_channel = ibv_create_comp_channel(s_ctx->ctx));
    TEST_Z(s_ctx->cq = ibv_create_cq(s_ctx->ctx, 100, NULL, s_ctx->comp_channel, 0)); /* cqe=100 is arbitrary */
    TEST_NZ(ibv_req_notify_cq(s_ctx->cq, 0));
    TEST_NZ(pthread_create(&s_ctx->cq_poller_thread, NULL, poll_cq, NULL));
}
void *SendMessages(void *context)
{
    static int loopcount = 0;
    while(1)
    {
        std::unique_lock<std::mutex> lck(mtx);
        cv.wait(lck, message_available);
        //std::this_thread::sleep_for(std::chrono::microseconds(50));
        ok_to_send_next_message = 0;
        struct connection *conn = (struct connection *)context;
        struct ibv_send_wr wr, *bad_wr = NULL;
        struct ibv_sge sge;
        std::cout << "looping send..." << loopcount << 'n' << std::flush;
        memset(&wr, 0, sizeof(wr));
        wr.wr_id = (uintptr_t)conn;
        wr.opcode = IBV_WR_SEND;
        wr.sg_list = &sge;
        wr.num_sge = 1;
        wr.send_flags = IBV_SEND_SIGNALED;
        sge.addr = (uintptr_t)conn->send_region;
        sge.length = BUFFER_SIZE;
        sge.lkey = conn->send_mr->lkey;
        snprintf(conn->send_region, BUFFER_SIZE, "message from active/client side with count %d", loopcount++);
        TEST_NZ(ibv_post_send(conn->qp, &wr, &bad_wr));
    }
}
void build_qp_attr(struct ibv_qp_init_attr *qp_attr)
{
    std::cout << "build_qp_attrn";
    memset(qp_attr, 0, sizeof(*qp_attr));
    qp_attr->send_cq = s_ctx->cq;
    qp_attr->recv_cq = s_ctx->cq;
    qp_attr->qp_type = IBV_QPT_RC;
    qp_attr->cap.max_send_wr = 100;
    qp_attr->cap.max_recv_wr = 100;
    qp_attr->cap.max_send_sge = 1;
    qp_attr->cap.max_recv_sge = 1;
}
void * poll_cq(void *ctx)
{
    struct ibv_cq *cq;
    struct ibv_wc wc;
    while (1)
    {
        TEST_NZ(ibv_get_cq_event(s_ctx->comp_channel, &cq, &ctx));
        ibv_ack_cq_events(cq, 1);
        TEST_NZ(ibv_req_notify_cq(cq, 0));
        int ne;
        struct ibv_wc wc;
        do
        {
            std::cout << "pollingn";
            ne = ibv_poll_cq(cq, 1, &wc);
        }
        while(ne == 0);
        on_completion(&wc);
        //if (wc.opcode == IBV_WC_SEND)
        if (wc.status == IBV_WC_SUCCESS)
        {
            {
                ok_to_send_next_message = 1;
                //while (message_available()) std::this_thread::yield();
                //std::cout << "past yieldn";
                std::unique_lock<std::mutex> lck(mtx);
                cv.notify_one();
            }
        }
    }
    return NULL;
}
void post_receives(struct connection *conn)
{
    std::cout << "post_receivesn";
    struct ibv_recv_wr wr, *bad_wr = NULL;
    struct ibv_sge sge;
    wr.wr_id = (uintptr_t)conn;
    wr.next = NULL;
    wr.sg_list = &sge;
    wr.num_sge = 1;
    sge.addr = (uintptr_t)conn->recv_region;
    sge.length = BUFFER_SIZE;
    sge.lkey = conn->recv_mr->lkey;
    TEST_NZ(ibv_post_recv(conn->qp, &wr, &bad_wr));
}
void register_memory(struct connection *conn)
{
    std::cout << "register_memoryn";
    conn->send_region = (char *)malloc(BUFFER_SIZE);
    conn->recv_region = (char *)malloc(BUFFER_SIZE);
    TEST_Z(conn->send_mr = ibv_reg_mr(
                               s_ctx->pd,
                               conn->send_region,
                               BUFFER_SIZE,
                               IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE));
    TEST_Z(conn->recv_mr = ibv_reg_mr(
                               s_ctx->pd,
                               conn->recv_region,
                               BUFFER_SIZE,
                               IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE));
}
int on_addr_resolved(struct rdma_cm_id *id)
{
    std::cout << "on_addr_resolvedn";
    struct ibv_qp_init_attr qp_attr;
    struct connection *conn;
    build_context(id->verbs);
    build_qp_attr(&qp_attr);
    TEST_NZ(rdma_create_qp(id, s_ctx->pd, &qp_attr));
    id->context = conn = (struct connection *)malloc(sizeof(struct connection));
    conn->id = id;
    conn->qp = id->qp;
    conn->num_completions = 0;
    register_memory(conn);
    post_receives(conn);
    TEST_NZ(rdma_resolve_route(id, TIMEOUT_IN_MS));
    return 0;
}
void on_completion(struct ibv_wc *wc)
{
    std::cout << "on_completionn";
    struct connection *conn = (struct connection *)(uintptr_t)wc->wr_id;
    if (wc->status != IBV_WC_SUCCESS)
    {
        //die("ton_completion: status is not IBV_WC_SUCCESS.");
        printf("ton_completion: status is not IBV_WC_SUCCESS.");
        printf("t it is %d ", wc->status);
    }
    printf("n");
    if (wc->opcode & IBV_WC_RECV)
        printf("treceived message: %sn", conn->recv_region);
    else if (wc->opcode == IBV_WC_SEND)
        printf("tsend completed successfully.n");
    else
        die("ton_completion: completion isn't a send or a receive.");
    if (5 == ++conn->num_completions)
        rdma_disconnect(conn->id);
}
int on_connection(void *context)
{
    std::cout << "on_connectionn";
    TEST_NZ(pthread_create(&msgThread, NULL, SendMessages, context));
    return 0;
}
int on_disconnect(struct rdma_cm_id *id)
{
    struct connection *conn = (struct connection *)id->context;
    printf("disconnected.n");
    rdma_destroy_qp(id);
    ibv_dereg_mr(conn->send_mr);
    ibv_dereg_mr(conn->recv_mr);
    free(conn->send_region);
    free(conn->recv_region);
    free(conn);
    rdma_destroy_id(id);
    return 1; /* exit event loop */
}
int on_route_resolved(struct rdma_cm_id *id)
{
    struct rdma_conn_param cm_params;
    printf("route resolved.n");
    memset(&cm_params, 0, sizeof(cm_params));
    TEST_NZ(rdma_connect(id, &cm_params));
    return 0;
}
int on_event(struct rdma_cm_event *event)
{
    int r = 0;
    if (event->event == RDMA_CM_EVENT_ADDR_RESOLVED)
        r = on_addr_resolved(event->id);
    else if (event->event == RDMA_CM_EVENT_ROUTE_RESOLVED)
        r = on_route_resolved(event->id);
    else if (event->event == RDMA_CM_EVENT_ESTABLISHED)
        r = on_connection(event->id->context);
    else if (event->event == RDMA_CM_EVENT_DISCONNECTED)
        r = on_disconnect(event->id);
    else
        die("on_event: unknown event.");
    return r;
}

Server:

#include <iostream>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <inttypes.h>
#include <rdma/rdma_cma.h>
#define TEST_NZ(x) do { if ( (x)) die("error: " #x " failed (returned non-zero)." ); } while (0)
#define TEST_Z(x)  do { if (!(x)) die("error: " #x " failed (returned zero/null)."); } while (0)
const int BUFFER_SIZE = 2048;
struct context
{
    struct ibv_context *ctx;
    struct ibv_pd *pd;
    struct ibv_cq *cq;
    struct ibv_comp_channel *comp_channel;
    pthread_t cq_poller_thread;
};
struct connection
{
    struct ibv_qp *qp;
    struct ibv_mr *recv_mr;
    struct ibv_mr *send_mr;
    char *recv_region;
    char *send_region;
};
static void die(const char *reason);
static void build_context(struct ibv_context *verbs);
static void build_qp_attr(struct ibv_qp_init_attr *qp_attr);
static void * poll_cq(void *);
static void post_receives(struct connection *conn);
static void register_memory(struct connection *conn);
static void on_completion(struct ibv_wc *wc);
static int on_connect_request(struct rdma_cm_id *id);
static int on_connection(void *context);
static int on_disconnect(struct rdma_cm_id *id);
static int on_event(struct rdma_cm_event *event);
static struct context *s_ctx = NULL;
int main(int argc, char **argv)
{
    struct sockaddr_in6 addr;
    struct rdma_cm_event *event = NULL;
    struct rdma_cm_id *listener = NULL;
    struct rdma_event_channel *ec = NULL;
    uint16_t port = 0;
    memset(&addr, 0, sizeof(addr));
    addr.sin6_family = AF_INET6;
    TEST_Z(ec = rdma_create_event_channel());
    TEST_NZ(rdma_create_id(ec, &listener, NULL, RDMA_PS_TCP));
    TEST_NZ(rdma_bind_addr(listener, (struct sockaddr *)&addr));
    TEST_NZ(rdma_listen(listener, 100)); /* backlog=100 is arbitrary */
    //printf("[ %"PRIu32" ]\n", *addr.sin6_addr.s6_addr32);
    port = ntohs(rdma_get_src_port(listener));
    printf("listening on port %d.\n", port);
    while (rdma_get_cm_event(ec, &event) == 0)
    {
        struct rdma_cm_event event_copy;
        memcpy(&event_copy, event, sizeof(*event));
        rdma_ack_cm_event(event);
        if (on_event(&event_copy))
            break;
    }
    rdma_destroy_id(listener);
    rdma_destroy_event_channel(ec);
    return 0;
}
void die(const char *reason)
{
    fprintf(stderr, "%sn", reason);
    exit(EXIT_FAILURE);
}
void build_context(struct ibv_context *verbs)
{
    if (s_ctx)
    {
        if (s_ctx->ctx != verbs)
            die("cannot handle events in more than one context.");
        return;
    }
    s_ctx = (struct context *)malloc(sizeof(struct context));
    s_ctx->ctx = verbs;
    TEST_Z(s_ctx->pd = ibv_alloc_pd(s_ctx->ctx));
    TEST_Z(s_ctx->comp_channel = ibv_create_comp_channel(s_ctx->ctx));
    TEST_Z(s_ctx->cq = ibv_create_cq(s_ctx->ctx, 100, NULL, s_ctx->comp_channel, 0)); /* cqe=100 is arbitrary */
    TEST_NZ(ibv_req_notify_cq(s_ctx->cq, 0));
    TEST_NZ(pthread_create(&s_ctx->cq_poller_thread, NULL, poll_cq, NULL));
}
void build_qp_attr(struct ibv_qp_init_attr *qp_attr)
{
    memset(qp_attr, 0, sizeof(*qp_attr));
    qp_attr->send_cq = s_ctx->cq;
    qp_attr->recv_cq = s_ctx->cq;
    qp_attr->qp_type = IBV_QPT_RC;
    qp_attr->cap.max_send_wr = 100;
    qp_attr->cap.max_recv_wr = 100;
    qp_attr->cap.max_send_sge = 1;
    qp_attr->cap.max_recv_sge = 1;
}
void * poll_cq(void *ctx)
{
    struct ibv_cq *cq;
    struct ibv_wc wc;
    while (1)
    {
        TEST_NZ(ibv_get_cq_event(s_ctx->comp_channel, &cq, &ctx));
        ibv_ack_cq_events(cq, 1);
        TEST_NZ(ibv_req_notify_cq(cq, 0));
        while (ibv_poll_cq(cq, 1, &wc))
        {
            std::cout << "pollingn";
            on_completion(&wc);
        }
    }
    return NULL;
}
void post_receives(struct connection *conn)
{
    std::cout << "post_receivesn";
    struct ibv_recv_wr wr, *bad_wr = NULL;
    struct ibv_sge sge;
    wr.wr_id = (uintptr_t)conn;
    wr.next = NULL;
    wr.sg_list = &sge;
    wr.num_sge = 1;
    sge.addr = (uintptr_t)conn->recv_region;
    sge.length = BUFFER_SIZE;
    sge.lkey = conn->recv_mr->lkey;
    TEST_NZ(ibv_post_recv(conn->qp, &wr, &bad_wr));
}
void register_memory(struct connection *conn)
{
    conn->send_region = (char *)malloc(BUFFER_SIZE);
    conn->recv_region = (char *)malloc(BUFFER_SIZE);
    TEST_Z(conn->send_mr = ibv_reg_mr(
                               s_ctx->pd,
                               conn->send_region,
                               BUFFER_SIZE,
                               IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE));
    TEST_Z(conn->recv_mr = ibv_reg_mr(
                               s_ctx->pd,
                               conn->recv_region,
                               BUFFER_SIZE,
                               IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE));
}
void on_completion(struct ibv_wc *wc)
{
    if (wc->status != IBV_WC_SUCCESS)
        die("on_completion: status is not IBV_WC_SUCCESS.");
    if (wc->opcode & IBV_WC_RECV)
    {
        struct connection *conn = (struct connection *)(uintptr_t)wc->wr_id;
        post_receives(conn);
        printf("received message: %sn", conn->recv_region);
    }
    else if (wc->opcode == IBV_WC_SEND)
    {
        printf("send completed successfully.n");
    }
}
int on_connect_request(struct rdma_cm_id *id)
{
    struct ibv_qp_init_attr qp_attr;
    struct rdma_conn_param cm_params;
    struct connection *conn;
    printf("received connection request.n");
    build_context(id->verbs);
    build_qp_attr(&qp_attr);
    TEST_NZ(rdma_create_qp(id, s_ctx->pd, &qp_attr));
    id->context = conn = (struct connection *)malloc(sizeof(struct connection));
    conn->qp = id->qp;
    register_memory(conn);
    post_receives(conn);
    memset(&cm_params, 0, sizeof(cm_params));
    TEST_NZ(rdma_accept(id, &cm_params));
    return 0;
}
int on_connection(void *context)
{
    struct connection *conn = (struct connection *)context;
    struct ibv_send_wr wr, *bad_wr = NULL;
    struct ibv_sge sge;
    snprintf(conn->send_region, BUFFER_SIZE, "message from passive/server side with pid %d", getpid());
    printf("connected. posting send...n");
    memset(&wr, 0, sizeof(wr));
    wr.opcode = IBV_WR_SEND;
    wr.sg_list = &sge;
    wr.num_sge = 1;
    wr.send_flags = IBV_SEND_SIGNALED;
    sge.addr = (uintptr_t)conn->send_region;
    sge.length = BUFFER_SIZE;
    sge.lkey = conn->send_mr->lkey;
    TEST_NZ(ibv_post_send(conn->qp, &wr, &bad_wr));
    return 0;
}
int on_disconnect(struct rdma_cm_id *id)
{
    struct connection *conn = (struct connection *)id->context;
    printf("peer disconnected.n");
    rdma_destroy_qp(id);
    ibv_dereg_mr(conn->send_mr);
    ibv_dereg_mr(conn->recv_mr);
    free(conn->send_region);
    free(conn->recv_region);
    free(conn);
    rdma_destroy_id(id);
    return 0;
}
int on_event(struct rdma_cm_event *event)
{
    std::cout << "on_eventn";
    int r = 0;
    if (event->event == RDMA_CM_EVENT_CONNECT_REQUEST)
        r = on_connect_request(event->id);
    else if (event->event == RDMA_CM_EVENT_ESTABLISHED)
        r = on_connection(event->id->context);
    else if (event->event == RDMA_CM_EVENT_DISCONNECTED)
        r = on_disconnect(event->id);
    else
        die("on_event: unknown event.");
    return r;
}

Here are a few runs. The number of messages sent is completely random:

[idf@node1 Release]$ ./TGKITCClient 192.168.0.1 47819
rdma_get_cm_event
on_addr_resolved
build_qp_attr
register_memory
post_receives
rdma_get_cm_event
route resolved.
rdma_get_cm_event
on_connection
looping send...0
polling
on_completion
    received message: message from passive/server side with pid 4188
polling
on_completion
    send completed successfully.
looping send...1
polling
on_completion
    send completed successfully.
^C
[idf@node1 Release]$ 

[idf@node1 Release]$ ./TGKITCClient 192.168.0.1 55148
rdma_get_cm_event
on_addr_resolved
build_qp_attr
register_memory
post_receives
rdma_get_cm_event
route resolved.
rdma_get_cm_event
on_connection
looping send...0
polling
on_completion
    received message: message from passive/server side with pid 4279
polling
on_completion
    send completed successfully.
looping send...1
polling
on_completion
    send completed successfully.
looping send...2
polling
on_completion
    send completed successfully.
looping send...3
polling
on_completion
    send completed successfully.
looping send...4
polling
on_completion
    send completed successfully.
looping send...5
polling
on_completion
    send completed successfully.
looping send...6
polling
on_completion
    send completed successfully.
looping send...7
polling
on_completion
    send completed successfully.
looping send...8
rdma_get_cm_event
disconnected.
polling
on_completion
    send completed successfully.
    on_completion: status is not IBV_WC_SUCCESS.     it is 5 [idf@node1 Release]$ 

On the server side:

on_event
peer disconnected.
on_event
received connection request.
post_receives
on_event
connected. posting send...
polling
send completed successfully.
polling
post_receives
received message: message from active/client side with count 0
polling
post_receives
received message: message from active/client side with count 1
polling
post_receives
received message: message from active/client side with count 2
polling
post_receives
received message: message from active/client side with count 3
polling
post_receives
received message: message from active/client side with count 4
polling
post_receives
received message: message from active/client side with count 5
polling
post_receives
received message: message from active/client side with count 6
polling
post_receives
received message: message from active/client side with count 7
on_event
peer disconnected.

Make sure the latest drivers and firmware are installed on the card. Beyond that, relying on the RDMA packages shipped with most OS distributions is a risky game when trying to run IB.

It is strongly recommended that applications like this use the OpenFabrics Enterprise Distribution, which provides openib, opensm, and various other useful InfiniBand-related packages for analysis, diagnostics, and network tuning. The official OFED packages can be found on the OpenFabrics site.

Based on the question, it looks like IPoIB is being used, but no specific configuration is mentioned. IPoIB is not necessarily the best way to take advantage of the hardware resources available in the IB cards.

In addition to those considerations, make sure the subnet manager is set up and configured correctly. Some switches have built-in subnet managers that can be accessed and configured through a management interface; in other cases it may make more sense to run and configure the subnet manager on one of the nodes you are using. OpenSM is a common subnet manager included with OFED distributions, and many online guides cover setting up and configuring a subnet manager for the type of network being built.
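To make that concrete, here is a minimal sketch (not from the original post; it assumes libibverbs is installed and that the first device, port 1, is the ConnectX HCA shown in the ibv_devinfo output above) that queries the port and reports whether a subnet manager has brought it to PORT_ACTIVE and assigned LIDs, which the RC connection in the code above depends on:

#include <cstdio>
#include <infiniband/verbs.h>

int main()
{
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);  /* enumerate RDMA devices */
    if (!devs || num == 0) { fprintf(stderr, "no RDMA devices found\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);    /* open the first HCA (assumption) */
    if (!ctx) { fprintf(stderr, "ibv_open_device failed\n"); return 1; }

    struct ibv_port_attr pattr;
    if (ibv_query_port(ctx, 1, &pattr))                    /* port 1, as in ibv_devinfo above */
    { fprintf(stderr, "ibv_query_port failed\n"); return 1; }

    printf("port state: %s, lid: %u, sm_lid: %u\n",
           ibv_port_state_str(pattr.state), pattr.lid, pattr.sm_lid);
    /* anything other than PORT_ACTIVE usually means the subnet manager
       is missing or misconfigured */

    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}

Compile it with g++ and link against -libverbs. If the port never reaches PORT_ACTIVE, fixing the subnet manager (for example, starting opensm on one node) is the first step before debugging the application code.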

OFED also includes various IB testing and profiling tools. ibdiagnet is a useful tool for debugging IB network issues, and there are many guides online showing different ways to use it and the other tools included in OFED.

Depending on the type of IB switch being used, there may also be network management and diagnostic tools that allow further analysis of the network. The configuration of the IB hardware and the low-level software that manages it is sometimes more critical to overall performance than the code actually being run. That said, if significant software or hardware configuration changes are made, it may be wise to recompile and relink against the relevant libraries from the correct version of OFED.