
Bug in iostat-await calculation in Metricbeat #30480

Closed
thekofimensah opened this issue Feb 18, 2022 · 13 comments
Assignees
Labels
bug Metricbeat Metricbeat Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team

Comments

@thekofimensah

thekofimensah commented Feb 18, 2022

This error persists in Metricbeat versions 7.9-7.16 and most likely earlier as well.

I'm pulling in iostat await metrics with Metricbeat and noticing that certain machines repeatedly return massive numbers (like 8081489971116.08 ms), which I'm guessing is a rollover issue in the kernel counters, probably connected to an integer overflow. I couldn't find anyone else reporting this, so I thought it was worth pointing out that somewhere in the code, subtracting two uint64s produces an unreasonable value.

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Feb 18, 2022
@thekofimensah
Author

Here's a sample (screenshot: Screen Shot 2022-02-18 at 3.48.43 PM)

@belimawr belimawr added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Feb 23, 2022
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Feb 23, 2022
@belimawr
Contributor

@thekofimensah do you face the same problem on 7.17.0 or 8.0.0?

@jlind23 jlind23 added the Metricbeat Metricbeat label Feb 23, 2022
@thekofimensah
Author

Hey @belimawr, I will eventually get Metricbeat updated on that machine, but our deployment makes on-the-fly upgrades tricky. Was there a fix for this in the newer versions? If not, I doubt it is fixed, because it has been a problem for a long time; I just checked my records, and it's been around since 7.9.x.

@fearful-symmetry
Contributor

fearful-symmetry commented Mar 1, 2022

@thekofimensah can we see:

  • A complete ES document that's returning the error
  • the output of /proc/diskstats
  • uname -a on the host

Also, the iostat metrics were removed from the system module and ported to the linux module in 7.10. I assume this is a version prior to that, based on your screenshot?

@thekofimensah
Author

We're actually using Metricbeat 7.16 for this event. 7.16 still reports iostat under the system metrics because we don't have the linux module active.

Here are the two pieces you asked for:

node:~$ uname -a
Linux node.nodes.net 5.4.0-100-generic #113-Ubuntu SMP Thu Feb 3 18:43:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
node:~$ cat  /proc/diskstats
   7       0 loop0 1670 0 6974 270 0 0 0 0 0 240 0 0 0 0 0
   7       1 loop1 79 0 5250 7 0 0 0 0 0 72 0 0 0 0 0
   7       2 loop2 2915 0 7014 253 0 0 0 0 0 268 0 0 0 0 0
   7       3 loop3 12445 0 28524 820 0 0 0 0 0 1004 0 0 0 0 0
   7       4 loop4 76 0 2186 1 0 0 0 0 0 32 0 0 0 0 0
   7       5 loop5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
   7       6 loop6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
   7       7 loop7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
   0       0 nvme2c2n1 212 0 11832 16 0 0 0 0 0 76 0 0 0 0 0
 259       1 nvme0n1 4700929 42585 2373217865 7745755 154543530 1871494743 60129413264 1170508978 0 148907668 935043132 66469 0 1313221176 140219
 259       6 nvme0n1p1 2108 0 21017 378 83 12 10723 6 0 248 0 3 0 1035784 7
 259       7 nvme0n1p2 43173 4154 4201092 9841 1503317 1495036 43444381 14463785 0 8846496 13870780 3226 0 32902472 6591
 259       8 nvme0n1p3 4655399 38431 2368982836 7735504 153040130 1869999695 60085958160 1156045185 24 147061664 921172352 63240 0 1279282920 133620
 259       2 nvme1n1 4615019 2212053 2365416717 8711006 152165376 1868124273 59928951613 3948772475 0 194368280 3701179116 67057 0 1313575336 139944
 259       3 nvme1n1p1 427 0 26729 36 45 292 13304 7 0 220 0 0 0 0 0
 259       4 nvme1n1p2 5805 3507 794426 1857 1502056 1496297 43444381 25749823 0 12888904 24938032 3226 0 32902472 8242
 259       5 nvme1n1p3 4608539 2208546 2364582706 8709082 150663275 1866627684 59885493928 3923022645 85 192870648 3676241084 63831 0 1280672864 131701
   0       0 nvme3c3n1 212 0 11832 15 0 0 0 0 0 80 0 0 0 0 0
 259      10 nvme2n1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
   9       0 md0 56235 0 4967970 0 2611637 0 43004920 0 0 0 0 3226 0 48296008 0
   9       1 md1 11510762 0 7977617242 0 7583513274 0 157187869048 0 0 0 0 97158 0 2559955784 0
 259      11 nvme3n1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
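For reference, the millisecond counters in the rows above follow the field order in the kernel's iostats documentation (major, minor, name, then per-device counters). A minimal, illustrative parser for the fields relevant here (my own sketch, not Metricbeat's actual code):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// diskStats holds a subset of the /proc/diskstats fields, per
// https://docs.kernel.org/admin-guide/iostats.html: after the device name
// come reads completed, reads merged, sectors read, ms spent reading,
// writes completed, writes merged, sectors written, ms spent writing, ...
type diskStats struct {
	Name            string
	ReadsCompleted  uint64
	ReadMs          uint64
	WritesCompleted uint64
	WriteMs         uint64
}

func parseLine(line string) (diskStats, error) {
	f := strings.Fields(line)
	if len(f) < 11 {
		return diskStats{}, fmt.Errorf("short diskstats line: %q", line)
	}
	// Parse errors ignored for brevity in this sketch.
	u := func(s string) uint64 { n, _ := strconv.ParseUint(s, 10, 64); return n }
	return diskStats{
		Name:            f[2],
		ReadsCompleted:  u(f[3]),
		ReadMs:          u(f[6]),
		WritesCompleted: u(f[7]),
		WriteMs:         u(f[10]),
	}, nil
}

func main() {
	// The nvme0n1p3 row from the dump above.
	s, _ := parseLine("259 8 nvme0n1p3 4655399 38431 2368982836 7735504 153040130 1869999695 60085958160 1156045185 24 147061664 921172352 63240 0 1279282920 133620")
	fmt.Println(s.Name, s.WriteMs) // nvme0n1p3 1156045185
}
```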

Here is a sample event (different from the one above):

{
  "_index": ".ds-metricbeat-datastream-2022.02.14-000001",
  "_type": "_doc",
  "_id": "PvclN38BGuezqsCr0mMI",
  "_version": 1,
  "_score": 1,
  "_source": {
    "tags": [
      "beats_input_raw_event"
    ],
    "@timestamp": "2022-02-26T17:48:18.671Z",
    "system": {
      "diskio": {
        "read": {
          "bytes": 619429091328,
          "count": 2371946,
          "time": 3976780
        },
        "iostat": {
          "request": {
            "avg_size": 253471.92859659425
          },
          "service_time": 1.7893129770992366,
          "read": {
            "request": {
              "merges_per_sec": 0,
              "per_sec": 0.03362881148423912
            },
            "await": 0.5,
            "per_sec": {
              "bytes": 137.74361183944345
            }
          },
          "busy": 25.618428588693362,
          "await": 2166382157300758.5,
          "write": {
            "request": {
              "merges_per_sec": 2328.9633393409804,
              "per_sec": 143.14103608266382
            },
            "await": 2166891115871720.8,
            "per_sec": {
              "bytes": 36290620.69327897
            }
          },
          "queue": {
            "avg_size": 22.848490644359554
          }
        },
        "io": {
          "ops": 0,
          "time": 85466212
        },
        "name": "nvme0n1p3",
        "write": {
          "bytes": 14978125291520,
          "count": 69605857,
          "time": 1024485
        }
      }
    },
    "ecs": {
      "version": "1.12.0"
    },
    "service": {
      "type": "system"
    },
    "@version": "1",
    "host": {
      "os": {
        "kernel": "5.4.0-100-generic",
        "codename": "focal",
        "name": "Ubuntu",
        "family": "debian",
        "type": "linux",
        "version": "20.04.3 LTS (Focal Fossa)",
        "platform": "ubuntu"
      }
    },
    "metricset": {
      "period": 60000,
      "name": "diskio"
    },
    "event": {
      "duration": 1256611,
      "module": "system",
      "dataset": "system.diskio"
    }
  }
}

Here's this machine over the past 7 days (this is only showing events with await over 1000; there's plenty of data in the normal range as well):
(screenshot: Screen Shot 2022-03-01 at 2.31.36 PM)

@fearful-symmetry
Contributor

@thekofimensah This is pretty baffling. The best explanation would be that some of the diskstats fields, particularly the millisecond write counter, are rolling over, but that seems kind of inconceivable on the time frames we're seeing. Is there any way I can see a "bad" diskstat event, and the diskstat event that came immediately before it? Most of the values are calculated deltas, so I'll need two sequential events to try to reproduce the math, at least.

Also, are there any errors from the diskstat metricset in the logs?
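For anyone following along, the delta math in question is roughly await = Δ(time spent on I/O) / Δ(operations completed) between two consecutive samples. A simplified sketch of that calculation (my own illustration with made-up numbers, not the metricset's exact code):

```go
package main

import "fmt"

// await approximates iostat's await: the change in milliseconds spent on
// I/O divided by the change in operations completed between two samples
// of the kernel counters.
func await(prevOps, curOps, prevTimeMs, curTimeMs uint64) float64 {
	dOps := curOps - prevOps
	if dOps == 0 {
		return 0 // no completed operations in this interval
	}
	return float64(curTimeMs-prevTimeMs) / float64(dOps)
}

func main() {
	// Hypothetical healthy sample pair: 500 ops completed, 400 ms spent.
	fmt.Println(await(1000, 1500, 2000, 2400)) // 0.8 ms per request
}
```

If either counter wraps between samples, the numerator or denominator underflows and the ratio explodes, which matches the bad events above.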

@thekofimensah
Author

Here are the events surrounding a "bad" event, one before and one after:

{
  "_index": ".ds-metricbeat-datastream-2022.02.14-000001",
  "_type": "_doc",
  "_id": "c5UmN38Bp-tT72bcvGhm",
  "_version": 1,
  "_score": 1,
  "_source": {
    "ingest_lag_seconds": 1,
    "ingest_time": "2022-02-26T17:49:19.845366612Z"
    "tags": [
      "beats_input_raw_event"
    ],
    "@timestamp": "2022-02-26T17:49:18.672Z",
    "system": {
      "diskio": {
        "read": {
          "bytes": 619483621376,
          "count": 2372157,
          "time": 3976943
        },
        "iostat": {
          "request": {
            "avg_size": 258878.3296388988
          },
          "service_time": 1.2384697890597067,
          "read": {
            "request": {
              "merges_per_sec": 0,
              "per_sec": 3.5380794181452333
            },
            "await": 0.7725118483412322,
            "per_sec": {
              "bytes": 914367.9644515243
            }
          },
          "busy": 5.8084867793626005,
          "await": 73.73686092241688,
          "write": {
            "request": {
              "merges_per_sec": 721.5334472170113,
              "per_sec": 43.36243305840556
            },
            "await": 79.69025522041764,
            "per_sec": {
              "bytes": 11227158.364686279
            }
          },
          "queue": {
            "avg_size": 3.376635680524843
          }
        },
        "io": {
          "ops": 0,
          "time": 85469676
        },
        "name": "nvme0n1p3",
        "write": {
          "bytes": 14978794844160,
          "count": 69608443,
          "time": 1230564
        }
      }
    },
    "ecs": {
      "version": "1.12.0"
    },
    "service": {
      "type": "system"
    },
    "@version": "1",
    "host": {
      "os": {
        "kernel": "5.4.0-100-generic",
        "codename": "focal",
        "name": "Ubuntu",
        "family": "debian",
        "type": "linux",
        "version": "20.04.3 LTS (Focal Fossa)",
        "platform": "ubuntu"
      }
    },
    "metricset": {
      "period": 60000,
      "name": "diskio"
    },
    "event": {
      "duration": 896617,
      "module": "system",
      "dataset": "system.diskio"
    }
  }
}


{
    "_index": ".ds-metricbeat-datastream-2022.02.14-000001",
    "_type": "_doc",
    "_id": "PvclN38BGuezqoCr0mMI",
    "_version": 1,
    "_score": 1,
    "_source": {
      "ingest_lag_seconds": 1,
      "ingest_time": "2022-02-26T17:48:19.848044834Z",
      "tags": [
        "beats_input_raw_event"
      ],
      "@timestamp": "2022-02-26T17:48:18.671Z",
      "system": {
        "diskio": {
          "read": {
            "bytes": 619429091328,
            "count": 2371946,
            "time": 3976780
          },
          "iostat": {
            "request": {
              "avg_size": 253471.92859659425
            },
            "service_time": 1.7893129770992366,
            "read": {
              "request": {
                "merges_per_sec": 0,
                "per_sec": 0.03362881148423912
              },
              "await": 0.5,
              "per_sec": {
                "bytes": 137.74361183944345
              }
            },
            "busy": 25.618428588693362,
            "await": 2166382157300758.5,
            "write": {
              "request": {
                "merges_per_sec": 2328.9633393409804,
                "per_sec": 143.14103608266382
              },
              "await": 2166891115871720.8,
              "per_sec": {
                "bytes": 36290620.69327897
              }
            },
            "queue": {
              "avg_size": 22.848490644359554
            }
          },
          "io": {
            "ops": 0,
            "time": 85466212
          },
          "name": "nvme0n1p3",
          "write": {
            "bytes": 14978125291520,
            "count": 69605857,
            "time": 1024485
          }
        }
      },
      "ecs": {
        "version": "1.12.0"
      },
      "service": {
        "type": "system"
      },
      "@version": "1",
      "host": {
        "os": {
          "kernel": "5.4.0-100-generic",
          "codename": "focal",
          "name": "Ubuntu",
          "family": "debian",
          "type": "linux",
          "version": "20.04.3 LTS (Focal Fossa)",
          "platform": "ubuntu"
        }
      },
      "metricset": {
        "period": 60000,
        "name": "diskio"
      },
      "event": {
        "duration": 1256611,
        "module": "system",
        "dataset": "system.diskio"
      }
    }
  }


  {
    "_index": ".ds-metricbeat-datastream-2022.02.14-000001",
    "_type": "_doc",
    "_id": "WfMkN38BGuezqoCr55ek",
    "_version": 1,
    "_score": 1,
    "_source": {
      "ingest_lag_seconds": 1,
      "ingest_time": "2022-02-26T17:47:19.844087668Z",
      "tags": [
        "beats_input_raw_event"
      ],
      "@timestamp": "2022-02-26T17:47:18.671Z",
      "system": {
        "diskio": {
          "read": {
            "bytes": 619429083136,
            "count": 2371944,
            "time": 3976779
          },
          "iostat": {
            "request": {
              "avg_size": 220628.24475473355
            },
            "service_time": 1.6211711382410996,
            "read": {
              "request": {
                "merges_per_sec": 0,
                "per_sec": 0.03352891869237217
              },
              "await": 0,
              "per_sec": {
                "bytes": 137.3344509639564
              }
            },
            "busy": 37.176865046102265,
            "await": 167.8276920827546,
            "write": {
              "request": {
                "merges_per_sec": 2939.279128248114,
                "per_sec": 229.2875104777871
              },
              "await": 167.85223367697594,
              "per_sec": {
                "bytes": 50594561.072925396
              }
            },
            "queue": {
              "avg_size": 38.07282481139983
            }
          },
          "io": {
            "ops": 195,
            "time": 85450976
          },
          "name": "nvme0n1p3",
          "write": {
            "bytes": 14975966986240,
            "count": 69597344,
            "time": 4294617635
          }
        }
      },
      "ecs": {
        "version": "1.12.0"
      },
      "service": {
        "type": "system"
      },
      "@version": "1",
      "host": {
        "os": {
          "kernel": "5.4.0-100-generic",
          "codename": "focal",
          "name": "Ubuntu",
          "family": "debian",
          "type": "linux",
          "version": "20.04.3 LTS (Focal Fossa)",
          "platform": "ubuntu"
        }
      },
      "metricset": {
        "period": 60000,
        "name": "diskio"
      },
      "event": {
        "duration": 666614,
        "module": "system",
        "dataset": "system.diskio"
      }
    }
  }

Let me see if I can find a log

@thekofimensah
Author

Based on the sample I looked at, there's nothing in the logs but debug logging isn't active.

@fearful-symmetry
Contributor

fearful-symmetry commented Mar 3, 2022

Alright, I was at least somewhat right, and the counter is rolling over/decrementing. In the first event, write.time is 4_294_617_635, and in the next it's 1_024_485. The kernel docs say that should be a 64-bit value on a 64-bit system, so I don't know why it would be rolling over at 4_294_617_635, which is a hair's breadth away from 4_294_967_295, the max value for a uint32.

EDIT: nevermind. Annoyingly, I found different kernel docs that explicitly mark those values as 32-bit: https://docs.kernel.org/admin-guide/iostats.html

Should be an easy fix, since it makes sense that some of those counters could roll over fairly frequently.
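A wraparound-aware delta along these lines would do it; this is a sketch of the general approach, not the exact patch:

```go
package main

import "fmt"

// delta32 computes the increase between two samples of a counter that the
// kernel exposes as 32 bits, tolerating a single wraparound.
func delta32(prev, cur uint64) uint64 {
	if cur >= prev {
		return cur - prev
	}
	// Counter wrapped past the uint32 max; add the modulus back.
	return (cur + 1<<32) - prev
}

func main() {
	fmt.Println(delta32(4294617635, 1024485)) // 1374146 (counter wrapped)
	fmt.Println(delta32(1000, 1500))          // 500 (no wrap)
}
```

This assumes the counter wraps at most once per collection interval, which holds comfortably at a 60s metricset period.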

@thekofimensah
Author

Perfect, you need anything else from me?

@fearful-symmetry
Contributor

@thekofimensah I don't think so, hoping to have a PR up today for a fix.

@jlind23
Collaborator

jlind23 commented Mar 8, 2022

Closing thanks to #30679

@jlind23 jlind23 closed this as completed Mar 8, 2022