Statistics for Network.Transport

Description

The statistics would obviously be implementation-dependent, so the API is likely to be rather loosely defined, with `String` keys and so on. Making statistics available is likely to be very useful to administrators, particularly when trying to optimise the network topology for a distributed application.

Environment

None

Activity

Tim Watson
February 13, 2013, 10:12 AM

This sounds great, but the GHC event log only takes a string input. I can't see any way of writing more complex data to the event log - I'll ask about this on the p-h mailing list.

Alexander Kjeldaas
February 12, 2013, 11:58 PM

TL;DR: Add a session-id and an event mask to all NT objects, and use the GHC trace event system to log events. Have a separate process that analyzes the event log for management, statistics, and tracing purposes.

I wrote a large comment on this, then scrapped it. I think I have an idea for how we can do this now. This is sort-of a mix between statistics, monitoring and tracing though:

1. Use the GHC trace event system. This is a lightweight, ring-buffer-like system where we can write events.
2. Events can be channel writes, reads, pings, polls, reconnects, etc.
3. Events that count stuff can be the tuple (number of events, this value, total value). For example, for reads: (40, 254, 123445), meaning 40 reads have happened, this read was 254 bytes, and the total number of bytes read is 123445. This makes it possible to sample events while still being able to generate histograms over read sizes, running averages, and totals.
4. Other events with different formats are OK too; I'm just mostly thinking about counting and aggregating stuff now.
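
A minimal Haskell sketch of the counting tuple from point 3, written under assumptions: the `CountEvent` type, the `nt.read` prefix, and the helper names are hypothetical and not part of Network.Transport. It uses `traceEventIO` from `Debug.Trace`, which takes a plain `String`, so the event is rendered with `show`. The sketch also shows why the tuple is sampleable: any two sampled events are enough to recover the counts and byte totals in between.

```haskell
import Data.IORef (IORef, atomicModifyIORef', newIORef)
import Debug.Trace (traceEventIO)

-- | Hypothetical counting event: (number of events, this value, total value).
data CountEvent = CountEvent
  { ceCount :: !Int  -- ^ events seen so far, including this one
  , ceValue :: !Int  -- ^ value carried by this event, e.g. bytes read
  , ceTotal :: !Int  -- ^ running total of all values so far
  } deriving (Eq, Show, Read)

-- | State before any event has happened.
zeroEvent :: CountEvent
zeroEvent = CountEvent 0 0 0

-- | Fold one more observation into the previous event.
step :: CountEvent -> Int -> CountEvent
step (CountEvent n _ t) v = CountEvent (n + 1) v (t + v)

-- | Record a read and write it to the GHC event log as a string.
-- The "nt.read" prefix is an assumption made for this sketch.
recordRead :: IORef CountEvent -> Int -> IO CountEvent
recordRead ref bytes = do
  ev <- atomicModifyIORef' ref (\old -> let new = step old bytes in (new, new))
  traceEventIO ("nt.read " ++ show ev)
  return ev

-- | Running average, computable from a single sampled event.
average :: CountEvent -> Double
average ev
  | ceCount ev == 0 = 0
  | otherwise       = fromIntegral (ceTotal ev) / fromIntegral (ceCount ev)

-- | Number of events and sum of values between two sampled events,
-- even if every event in between was dropped by the sampler.
between :: CountEvent -> CountEvent -> (Int, Int)
between earlier later =
  (ceCount later - ceCount earlier, ceTotal later - ceTotal earlier)
```

When the program is not run with the event log enabled, `traceEventIO` does nothing, so the counters can stay in place at low cost.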

5. NT should support a mutable application-level identifier that can be attached to channels, and an event log mask. The application-level identifier will be logged together with every event. This can be a string, a 64-bit int, or a pointer, depending on the performance effects (and the event system implementation, which I haven't studied). Thus the above channel read event becomes (<session-id>, <channel-id>, 40, 254, 123445). The application-level identifier can be thought of as the "session id". Its purpose is to let the application layer tie together all channel reads and writes that belong to the same application-layer context. Even disk IO can be included at a later stage. With proper tooling support, this should make it possible to trace what happens through a series of processes.
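
A sketch of point 5, assuming the 64-bit int variant of the identifier. Everything named here (`EventKind`, `Tagged`, `logEvent`, and so on) is hypothetical and only illustrates the idea; NT has no such API today. The session id and the mask are both held in `IORef`s so the application can change them while the channel is in use.

```haskell
import Data.Bits (setBit, testBit)
import Data.IORef (IORef, newIORef, readIORef, writeIORef)
import Data.Word (Word64)
import Debug.Trace (traceEventIO)

-- | Hypothetical event kinds; the constructor's position is its bit in the mask.
data EventKind = EvRead | EvWrite | EvPing | EvPoll | EvReconnect
  deriving (Eq, Show, Enum, Bounded)

type SessionId = Word64
type ChannelId = Int
type EventMask = Word64

-- | Hypothetical per-channel tagging state.
data Tagged = Tagged
  { tSession :: IORef SessionId  -- ^ mutable application-level identifier
  , tMask    :: IORef EventMask  -- ^ which event kinds get logged
  , tChannel :: ChannelId
  }

newTagged :: SessionId -> EventMask -> ChannelId -> IO Tagged
newTagged sid mask cid = do
  s <- newIORef sid
  m <- newIORef mask
  return (Tagged s m cid)

-- | Build a mask that enables exactly the given kinds.
maskOf :: [EventKind] -> EventMask
maskOf = foldr (\k m -> setBit m (fromEnum k)) 0

enabled :: EventMask -> EventKind -> Bool
enabled m k = testBit m (fromEnum k)

-- | Log (<session-id>, <channel-id>, count, value, total) if the mask allows it.
-- Returns the line that was written, or Nothing if the event was masked out.
logEvent :: Tagged -> EventKind -> (Int, Int, Int) -> IO (Maybe String)
logEvent tg kind (n, v, t) = do
  mask <- readIORef (tMask tg)
  if enabled mask kind
    then do
      sid <- readIORef (tSession tg)
      let line = show kind ++ " " ++ show (sid, tChannel tg, n, v, t)
      traceEventIO line
      return (Just line)
    else return Nothing
```

The application would call something like `writeIORef (tSession tg) newId` when a channel starts serving a different application-level context.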

6. There should be a "programmable" event processor somewhere. Its purpose is to read the event log and do things. Some possible things are:
a) Aggregate events
b) Massage events and show them on the management console on request. This doesn't need to happen unless someone accesses the management interface though.
c) Possibly forward events. For example when debugging a complex flow going through multiple processes in DH, it can be useful to forward all events with a given session-id. However, this can also be done using the query system in the next point.
d) Query events based on session-id, channel, remote processid, and nodeid.
e) Aggregate events into a large json-like structure that can be queried. This would be things like the total number of bytes read on a given channel.
f) Higher level: Create graphs of things? Interface with graphing tools.
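
A rough sketch of items (a), (d) and (e), assuming events have already been parsed out of the event log into a hypothetical `Event` record. Only session-id and channel are covered in the query; remote process id and node id would be further `Maybe` fields of the same shape. None of these names exist in NT.

```haskell
import Data.List (foldl')

-- | Hypothetical parsed event, as the event processor might see it.
data Event = Event
  { evSession :: Int
  , evChannel :: Int
  , evKind    :: String
  , evBytes   :: Int
  } deriving (Eq, Show)

-- | A query; Nothing in a field means "match anything".
data Query = Query
  { qSession :: Maybe Int
  , qChannel :: Maybe Int
  }

matches :: Query -> Event -> Bool
matches q e = ok (qSession q) (evSession e) && ok (qChannel q) (evChannel e)
  where
    ok want got = maybe True (== got) want

-- | (d) Query events by session-id and/or channel.
query :: Query -> [Event] -> [Event]
query q = filter (matches q)

-- | Add one event's bytes to a list of (channel, total) kept sorted by channel.
bump :: [(Int, Int)] -> Event -> [(Int, Int)]
bump [] e = [(evChannel e, evBytes e)]
bump ((c, t) : rest) e
  | c == evChannel e = (c, t + evBytes e) : rest
  | c <  evChannel e = (c, t) : bump rest e
  | otherwise        = (evChannel e, evBytes e) : (c, t) : rest

-- | (a)/(e) Aggregate: total number of bytes per channel.
totalsByChannel :: [Event] -> [(Int, Int)]
totalsByChannel = foldl' bump []
```

The association list keeps the sketch dependency-free; a real processor would more likely use a map type and update it incrementally as events arrive.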

7. The transitive closure of all events that have a given session-id, obtained by traversing channels from one process to another, gives a full picture of all events in a complex multi-process system. It is up to the application to define what constitutes a session, and support for hierarchical sessions can be built into this tool.
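
One possible reading of point 7, sketched under assumptions: each logged channel event records the session-id plus the sending and receiving process, and the closure is computed by a breadth-first walk over those channel hops. The `ChannelEvent` record and its fields are hypothetical.

```haskell
type ProcessName = String
type SessionId = Int

-- | Hypothetical channel event: a message tagged with a session went from
-- one process to another.
data ChannelEvent = ChannelEvent
  { chSession :: SessionId
  , chFrom    :: ProcessName
  , chTo      :: ProcessName
  } deriving (Eq, Show)

-- | Processes reachable from a starting process by following only channel
-- events that carry the given session-id (breadth-first, cycle-safe).
reachable :: SessionId -> ProcessName -> [ChannelEvent] -> [ProcessName]
reachable sid start evs = go [start] []
  where
    go [] seen = reverse seen
    go (p : todo) seen
      | p `elem` seen = go todo seen
      | otherwise     = go (todo ++ next p) (p : seen)
    next p = [chTo e | e <- evs, chSession e == sid, chFrom e == p]

-- | All events of the session that originate from a reachable process.
sessionTrace :: SessionId -> ProcessName -> [ChannelEvent] -> [ChannelEvent]
sessionTrace sid start evs =
  [e | e <- evs, chSession e == sid, chFrom e `elem` procs]
  where
    procs = reachable sid start evs
```

Hierarchical sessions are not modelled here; they would need a parent relation between session-ids on top of this.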

8. For statistics purposes, whether or not an event log is sampleable is important, as is the question of which events can be collapsed and how. For example, the session-id doesn't have to be in the "read" event; changing the session-id attached to a channel could instead be a separate event. However, this requires an event log sampler to be careful not to skip such an event if it samples a later read from the channel.
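
A sketch of the sampling concern in point 8, with hypothetical types: session changes are separate log entries, the sampler thins out reads but never drops a session change, and a later pass can still attribute each sampled read to the right session.

```haskell
type ChannelId = Int
type SessionId = Int

-- | Hypothetical log entries: the session-id is not repeated in every read.
data LogEntry
  = SessionChange ChannelId SessionId
  | ReadEv ChannelId Int Int Int  -- ^ channel, count, this value, total
  deriving (Eq, Show)

-- | Keep every n-th read, but always keep session changes.
sample :: Int -> [LogEntry] -> [LogEntry]
sample n = go (0 :: Int)
  where
    every = max 1 n  -- guard against a non-positive sampling interval
    go _ [] = []
    go i (e@(SessionChange _ _) : rest) = e : go i rest
    go i (e : rest)
      | i `mod` every == 0 = e : go (i + 1) rest
      | otherwise          = go (i + 1) rest

-- | Replay a (possibly sampled) log and pair each read's value with the
-- session that was attached to its channel at that point.
attribute :: [LogEntry] -> [(SessionId, Int)]
attribute = go []
  where
    go _ [] = []
    go cur (SessionChange c s : rest) =
      go ((c, s) : filter ((/= c) . fst) cur) rest
    go cur (ReadEv c _ v _ : rest) =
      case lookup c cur of
        Just s  -> (s, v) : go cur rest
        Nothing -> go cur rest  -- read seen before any session was attached
```

If the sampler dropped the second `SessionChange` in a log, every later read on that channel would be attributed to the stale session, which is the failure mode the point above warns about.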

Tim Watson
December 18, 2012, 1:48 PM

Comment:hyperthunk:12/18/12 01:48:16 PM:

Referenced in commit:
https://github.com/haskell-distributed/distributed-process/

Assignee

Tim Watson

Reporter

Tim Watson

External issue ID

None

OS

None

Priority

Desirable