Hello,
This is to report a problem we are having with prometheus_client in multiprocess mode when worker processes are restarted.
How do we observe the problem?
In our production environment it looks like this:
- master process starts and spawns a set of worker processes;
- metrics reporting is fine;
- reconfiguration is requested, and all worker processes are replaced;
- sometimes after this reconfiguration:
  - metrics reporting stops working (the HTTP endpoint returns 500);
  - logs contain errors like the ones below, which suggests that the .db files get corrupted;
  - metrics do not come back until a complete restart (and removal of the corrupted .db files).
The errors may look like this:
```
...
  File "/Users/vasiliev/.virtualenvs/metrics-issue27/lib/python2.7/site-packages/prometheus_client/core.py", line 682, in __reset
    files[file_prefix] = _MmapedDict(filename)
  File "/Users/vasiliev/.virtualenvs/metrics-issue27/lib/python2.7/site-packages/prometheus_client/core.py", line 577, in __init__
    for key, _, pos in self._read_all_values():
  File "/Users/vasiliev/.virtualenvs/metrics-issue27/lib/python2.7/site-packages/prometheus_client/core.py", line 611, in _read_all_values
    encoded = unpack_from(('%ss' % encoded_len).encode(), data, pos)[0]
error: unpack_from requires a buffer of at least 1919251561 bytes
```

```
...
  File "/Users/vasiliev/.virtualenvs/metrics-issue27/lib/python2.7/site-packages/prometheus_client/multiprocess.py", line 42, in merge
    metric_name, name, labels = json.loads(key)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 339, in loads
    return _default_decoder.decode(s)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 364, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 382, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
```
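One detail that may help diagnosis: the number in the first traceback decodes to ASCII. 1919251561 is the little-endian integer reading of the bytes b'iter', i.e. the reader is interpreting text from the file as a length field. Below is a minimal stdlib sketch of a length-prefixed format like the one the traceback suggests (a length followed by a JSON-encoded key); the names here are illustrative, not the library's actual code:

```python
import json
import struct

# Illustrative reader for a length-prefixed record: a 4-byte
# little-endian length, then `length` bytes of JSON.
def read_entry(buf, pos):
    encoded_len = struct.unpack_from('<i', buf, pos)[0]
    encoded = struct.unpack_from('%ds' % encoded_len, buf, pos + 4)[0]
    return json.loads(encoded.decode('utf-8'))

# A well-formed entry round-trips:
key = json.dumps(["requests_total", "requests_total", {}]).encode('utf-8')
entry = struct.pack('<i', len(key)) + key
print(read_entry(entry, 0))  # ['requests_total', 'requests_total', {}]

# But if the reader's offset lands on text instead of a length field --
# for example on the ASCII bytes b'iter' -- the "length" becomes huge:
print(struct.unpack_from('<i', b'iter')[0])  # 1919251561
```

This is consistent with two processes racing on the same .db file: one writes while another reads from a stale offset, so the reader lands mid-record and treats key bytes as a length.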
How to reproduce the problem?
This is very tricky to reproduce in an isolated environment, and the full setup does not fit in the description of a GitHub issue, so I have put the code and instructions for reproducing it here:
https://github.com/lonlylocly/prometheus_client_concurrency_issue
It is essentially our production script stripped down to a minimal version. It reproduces the problem roughly 50% of the time.
What do I want from this issue?
I must admit that we are at a loss and cannot figure out how this issue can be mitigated. We love Prometheus and really like the convenience of the Python client, but metrics reporting regularly breaks with this problem and we would like to eliminate it.
I would appreciate any suggestions or advice, and I am also willing to help via a PR (if we manage to figure out a workaround; at the moment I don't even know where to start).
Thank you!