# Erlang Tracing: more than you wanted to know

## Rough Outline

* What can be traced?
* How can trace events be specified?
* "match specifications": twisty passages, all alike
* WTF, can I just use DTrace and drink my coffee/beer/whisky in peace?
* Trace delivery mechanisms: pick one of two
* Trace event receivers: what can they do?
* Redbug vs. The World
* Examples of event receivers

## Reference docs

* See docs for `dbg:p/2` or `erlang:trace/3`
    * http://erlang.org/doc/man/dbg.html
    * http://erlang.org/doc/man/erlang.html#trace-3
* They're somewhat interchangeable.
    * Use whichever one you're comfortable with, API-wise.
    * They twiddle the same knobs inside the VM.
* I talk a lot about the Redbug app:
    * https://github.com/massemanet/eper

## What can be traced?

* Which processes?
    * A specific pid.
    * `existing`: all processes currently existing
    * `new`: all processes that will be created in the future
    * `all`: `existing` and `new`
* What events?
    * function calls: call, return
    * sending of messages
    * receiving of messages
    * process-related events: spawn, exit, register, unregister, link, unlink, getting linked, getting unlinked
    * scheduling of processes: in, out
    * process exiting
    * garbage collection activities
* Additional bells & whistles
    * microsecond timestamps on all messages
    * report arity instead of the argument list
    * set on spawn / first spawn / link / first link

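A minimal sketch of turning those knobs with `dbg` (the `erlang:trace/3` flags are analogous); `my_module:my_function` is a placeholder, not anything from this talk:

    %% Start the default console tracer, then trace calls on all current
    %% and future processes, with timestamps, reporting arity only.
    > dbg:tracer().
    > dbg:p(all, [call, timestamp, arity]).
    > dbg:tpl(my_module, my_function, x).  % 'x' also reports returns & exceptions
    %% ... watch trace output scroll by, then turn everything off:
    > dbg:stop_clear().
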
## "match specifications": twisty passages, all alike |
|
|
|
|
|
* See docs in `ets:fun2ms/1` `ms_transform:parse_transform/2` |
|
|
* http://erlang.org/doc/man/ets.html |
|
|
* http://erlang.org/doc/man/ms_transform.html |
|
|
* Because regular expressions on Erlang data **sucks**. |
|
|
* Match specs attempt to be Erlang'ish |
|
|
* And try to be quick to execute |
|
|
* But are obtuse to the 7th degree.... |
|
|
* Example match specs: |
|
|
* `[{'_',[],[{message,{return_trace}}]}]` |
|
|
* `[{{'$1','$2'},[{'>','$2',{const,3}},{is_atom,'$1'}],['$1']}]` |
|
|
* `ets:fun2ms(fun(A, B) when is_atom(A), B > 3 -> A end).` |
|
|
* Mostly useful for restricting tracing on function calls |
|
|
* After-the-fact tracing filters are good but... |
|
|
* ... generating exactly the traces that you want is **better**. |
|
|
* E.g., match `foo:bar/3` but only if 1st arg is `hellooooooo_world`. |
|
|
|
|
|
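One hedged way to express that last bullet with `dbg:fun2ms/1` (the module `foo` is hypothetical); note that the dbg flavor of `fun2ms` takes the whole argument list as a single list pattern:

    > MS = dbg:fun2ms(fun([hellooooooo_world, _, _]) -> return_trace() end).
    [{[hellooooooo_world,'_','_'],[],[{return_trace}]}]
    > dbg:tracer(), dbg:p(all, call), dbg:tpl(foo, bar, 3, MS).
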
## Trace delivery mechanisms: pick one of two

* Erlang process receiver
    * Send the trace events to any pid in the cluster
    * Easy to use
* TCP receiver
    * Send binary trace events to a TCP port listener
    * Faster & lower overhead: less likely to kill the VM in a trace event tsunami
* **IMPORTANT:** All trace events must go to the same single receiver.
* DTrace is far more flexible, alas.

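A rough sketch of the TCP flavor with stock `dbg` (Redbug wraps the same machinery); the port number 4711 is an arbitrary choice:

    %% On the node being traced: ship events to a trace port instead of a pid.
    > dbg:tracer(port, dbg:trace_port(ip, 4711)).
    > dbg:p(all, call), dbg:tp(lists, sort, 1, []).

    %% From any node that can reach it: pull the events and format them.
    > dbg:trace_client(ip, {"127.0.0.1", 4711}).
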
## WTF, can I just use DTrace and drink my coffee/beer/whisky in peace?

* I sooooo wish.......
* ... unfortunate reality hits ....
* DTrace isn't available everywhere that Erlang is
* The VM's tracing infrastructure isn't a good match with DTrace, yet.
* Pull requests are welcome!

## DANGER: Erlang tracing can crash your VM!

It is quite possible to create a tsunami of trace events.

* You can easily perturb performance with a trace event receiver at 100% CPU
* You can easily run your machine out of RAM
* You can deadlock the tracer and/or other processes.
* Some trace patterns (e.g., `return_trace`) disable tail call optimization and
  can run your VM out of memory even when the # of trace events is very small.

To be safe, do all of these things:

* Make your trace specifications as specific as possible
    * Do not use `"'_':'_'"`
    * I.e.: any module + any function name
    * If you must use `"your_module_name:'_'"`, run for a *very* short
      time, until you are confident that you will not create an event tsunami.
* Simplify your trace receiver as much as possible
* Limit the amount of time that tracing is enabled
* Read the docs!
* Redbug has all of these limits built in; that's one reason we use and recommend it.

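Redbug's built-in limits in practice: it stops itself after a message count or a time budget, whichever is hit first. A hedged example (`my_module:my_function` is a placeholder; option names per `redbug:help()`):

    %% Stop after 1000 trace messages or 10 seconds, whichever comes first.
    > redbug:start("my_module:my_function -> return", [{msgs, 1000}, {time, 10000}]).
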
## Trace event receivers: what can they do?

* Anything, really
    * The events are just data. Fold/spindle/mutilate/....
* In practice, simple is better
    * Most tracers just format each event, in the order received
    * Simple, stateless
    * For example, this is what Redbug does by default
    * But Redbug can use a custom event handler, if you wish....

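To illustrate the "events are just data" point, here is a tiny stateful receiver sketch using `dbg`'s handler-fun form (nothing Redbug-specific; the 10000-event reporting interval is arbitrary):

    %% The handler fun gets each trace event plus an accumulator; here we
    %% just count events and occasionally print one, truncated with ~P.
    > dbg:tracer(process,
          {fun(Event, N) ->
                   case N rem 10000 of
                       0 -> io:format("~p events so far; latest: ~P~n",
                                      [N, Event, 9]);
                       _ -> ok
                   end,
                   N + 1
           end, 0}).
    > dbg:p(all, call), dbg:tp(ets, lookup, 2, []).
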
## Redbug vs. The World

* Redbug's default trace event receiver is stateless and prints to the console
* Redbug has an option to write binary events to a file for offline processing (sketched below)
* Redbug can use multiple match specs at "full strength & power"
* So ... when should you not use Redbug?
    * When an online & stateful event receiver is needed

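A hedged example of the file option (the `file` option name is from memory of the eper-era Redbug; confirm with `redbug:help()` before relying on it):

    %% Write raw binary trace events to a file instead of printing them.
    > redbug:start("ets:lookup -> return", [{time, 60000}, {file, "/tmp/redbug.trc"}]).
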
## Examples of event receivers

Redbug specification:

    > redbug:start("erlang:now -> return").
    ok
    21:31:52 <timer_server> {erlang,now,[]}
    21:31:52 <timer_server> {erlang,now,0} -> {1385,555512,322417}
    21:31:53 <timer_server> {erlang,now,[]}
    21:31:53 <timer_server> {erlang,now,0} -> {1385,555513,322384}
    [...]
    quitting: msg_count

To limit the formatting depth (note the `...` markers for omitted data):

    (dev1@127.0.0.1)26> redbug:start("timer:'_' -> return", {print_depth,7}).
    ok
    21:33:41 <{erlang,apply,2}> {timer,tc,[riak_pipe_stat,produce_stats,[]]}
    21:33:41 <<0.3234.1>> {timer,now_diff,[{1385,555621,...},{1385,...}]}
    21:33:41 <<0.3234.1>> {timer,now_diff,2} -> 130
    21:33:41 <<0.3234.1>> {timer,tc,3} -> {130,
                          {riak_pipe,[{{...},...},{...}|...]}}

For Redbug help, run `redbug:help()`. For an explanation of Redbug's
`print_depth` option, see the `~P` formatting description in the `io` man page.

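For example, a depth of 7 truncates a long list roughly like this:

    > io:format("~P~n", [lists:seq(1, 100), 7]).
    [1,2,3,4,5,6|...]
    ok
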
## WARNING: Cut-and-paste code style

* These are not great examples of Erlang code style
* Close your eyes if you are sensitive to ugly code. `^_^`

## fsm_init_tracer.erl

Source: Scott Fritchie: https://gist.github.com/slfritchie/007b399675fb76f2d4f0

Report, every `Interval` seconds, the count of `init()` function calls for the following Riak KV FSM modules: `riak_kv_buckets_fsm`, `riak_kv_exchange_fsm`, `riak_kv_get_fsm`, `riak_kv_index_fsm`, `riak_kv_keys_fsm`, and `riak_kv_put_fsm`.

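The gist is the real thing; a stripped-down sketch of the same idea using raw `erlang:trace/3` (only one FSM module here, and the reporting is simplified) looks something like:

    -module(count_inits).
    -export([start/1]).

    %% Count riak_kv_get_fsm init() calls; print & reset every IntervalSecs.
    start(IntervalSecs) ->
        Tracer = spawn(fun() ->
                     {ok, _} = timer:send_interval(IntervalSecs * 1000, report),
                     loop(0)
                 end),
        erlang:trace(all, true, [call, {tracer, Tracer}]),
        erlang:trace_pattern({riak_kv_get_fsm, init, '_'}, true, [local]),
        Tracer.

    loop(Count) ->
        receive
            {trace, _Pid, call, _MFA} -> loop(Count + 1);
            report                    -> io:format("~p init() calls~n", [Count]),
                                         loop(0)
        end.
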
## timeit.erl

Source: Greg Burd: https://gist.github.com/2656289

    %% @doc Dynamically add timing to MFA. There are various types of
    %% timing.
    %%
    %% {all, Max} - time latency of all calls to MFA
    %%
    %% {sample, N, Max} - sample every N calls and stop sampling after Max
    %%
    %% {threshold, Millis, Max} - count # of calls where latency is > Millis
    %%     and count # of calls total, thus percentage of calls over threshold

Sample output:

    > timeit:timeit(timer, tc, 3, {sample, 10, 5}).
    {ok,[{matched,'dev1@127.0.0.1',1},{saved,1}]}
    [sample 1/10] <0.10771.1>:timer:tc: 0.152 ms
    [sample 2/20] <0.10796.1>:timer:tc: 0.484 ms
    [sample 3/30] <0.10825.1>:timer:tc: 0.111 ms
    [sample 4/40] <0.10849.1>:timer:tc: 0.101 ms
    [sample 5/50] <0.10875.1>:timer:tc: 194.682 ms

## fsm_latency_tracer.erl

Source: Scott Fritchie: https://gist.github.com/slfritchie/7a87269148280e4724ce

Report each `riak_kv_put_fsm` and `riak_kv_get_fsm` process that runs for more than `LatencyMS` milliseconds. Also, for various strategic API functions in the `file`, `bitcask`, and `eleveldb` modules, report latencies above the same limit.

## backend_latency_tracer.erl

Source: Scott Fritchie: https://gist.github.com/slfritchie/e7c3ad866f67d3fc8935

A more sophisticated version of `fsm_latency_tracer`. Watches more
API functions in the `bitcask`, `eleveldb`, `file`, and
`riak_kv_fs2_backend` modules. Uses the `folsom` library to generate
histograms of latencies of the `file:pread()` function, to show
outliers and multi-modal patterns that simple averages & maximums
cannot show.

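The `folsom` calls involved are roughly these (a sketch, not the gist's exact code; `pread_latency` is a made-up metric name, and the `folsom` application must already be started):

    %% Create a histogram, feed it one latency observation per traced call,
    %% then read back min/max/mean/percentiles at the end of the run.
    > folsom_metrics:new_histogram(pread_latency).
    > folsom_metrics:notify({pread_latency, 42}).   % e.g., 42 microseconds for one pread
    > folsom_metrics:get_histogram_statistics(pread_latency).
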
## accumulating_time_tracer.erl

Source: Scott Fritchie: https://gist.github.com/slfritchie/c01690d9c30e5e7b0100

Demonstrates how much of every second the `file_server_2` process is
busy. It shows pretty dramatically how Bitcask's highest latencies are
frequently caused by serialization through `file_server_2`.

## trace_large4.erl

Source: Joe Blomstedt: https://gist.github.com/jtuple/244c578773962077e215

* For calls to `riak_kv_get_fsm:calculate_objsize/2`, print the
  Bucket, Key, and object size.
* For calls to `riak_kv_eleveldb_backend:put/5`, `riak_kv_bitcask_backend:put/5`,
  and `riak_kv_memory_backend:put/5`, report if the object size is larger than `Size`.
* Stop after `Count` events have been recorded or `Time` milliseconds have elapsed.

## read_bin_trace_file.erl

Source: Scott Fritchie: https://gist.github.com/slfritchie/23121803af63d76eeca5

A small example of reading a binary-formatted Erlang trace file and doing something simple with it.

Note that Redbug can write its trace events into a binary file. This technique can be used for offline processing, e.g., to avoid high processing cost in a production environment.

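If the file is in the standard trace-port format, stock OTP can also replay it through a handler fun; a hedged sketch (the path is a placeholder):

    %% Count the events in a binary trace file; dbg passes the atom
    %% end_of_trace to the handler when the file is exhausted.
    > Handler = fun(end_of_trace, N) -> io:format("~p trace events~n", [N]);
                   (_Event, N)       -> N + 1
                end.
    > dbg:trace_client(file, "/tmp/redbug.trc", {Handler, 0}).
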
## func_args_tracer.erl and latency_histogram_tracer.erl

Source: Scott Fritchie: https://gist.github.com/slfritchie/159a8ce1f49fc03c77c6

func_args_tracer.erl:

    %% For example: what ETS tables are being called the most by ets:lookup/2?
    %% The 1st arg of ets:lookup/2 is the table name.
    %% Watch for 10 seconds.
    %%
    %% > func_args_tracer:start(ets, lookup, 2, 10, fun(Args) -> hd(Args) end).
    %%
    %% Tracer pid: <0.16102.15>, use func_args_tracer:stop() to stop
    %% Otherwise, tracing stops in 10 seconds
    %% Current date & time: {2013,9,19} {18,5,48}
    %% {started,<0.16102.15>}
    %% Total calls: 373476
    %% Call stats:
    %% [{folsom_histograms,114065},
    %% {ac_tab,69689},
    %% {ets_riak_core_ring_manager,67147},
    %% {folsom_spirals,57076},
    %% {riak_capability_ets,48862},
    %% {riak_core_node_watcher,8149},
    %% {riak_api_pb_registrations,8144},
    %% {folsom,243},
    %% {folsom_meters,43},
    %% {folsom_durations,20},
    %% {timer_tab,18},
    %% {folsom_gauges,8},
    %% {riak_core_stat_cache,5},
    %% {sys_dist,3},
    %% {inet_db,1},
    %% {21495958,1},
    %% {3145765,1},
    %% {3407910,1}]

latency_histogram_tracer.erl:

    %% For example: create a histogram of call latencies for bitcask:put/3.
    %% Watch for 10 seconds.
    %%
    %% > latency_histogram_tracer:start(bitcask, put, 3, 10).
    %%
    %% Tracer pid: <0.2108.18>, use latency_histogram_tracer:stop() to stop
    %% Otherwise, tracing stops in 10 seconds
    %% Current date & time: {2013,9,19} {18,14,13}
    %% {started,<0.2108.18>}
    %% Histogram stats:
    %% [{min,0},
    %% {max,48},
    %% {arithmetic_mean,2.765411819271055},
    %% [...]
    %% {percentile,[{50,3},{75,4},{90,5},{95,6},{99,8},{999,14}]},
    %% {histogram,[{1,13436},
    %% {2,12304},
    %% {3,10789},
    %% {4,7397},
    %% {5,4191},
    %% {6,1929},
    %% [...]
    %% {30,0},
    %% {31,0},
    %% {40,2},
    %% {50,1}]},
    %% {n,51746}]

## ktrace.erl and ktrace_fold.erl

Source: Joe Blomstedt: https://gist.github.com/jtuple/161ad7c8c278b45729df

Source: Jordan West: https://gist.github.com/jrwest/2999f8f217f1cbceca83

* Attempts to capture enough data from Erlang tracing to construct a flamegraph.
    * http://dtrace.org/blogs/brendan/2011/12/16/flame-graphs/
    * Aside/see also: https://github.com/proger/eflame
* On VM #1: `ktrace:c()`
* On VM #2, on the same machine, running Riak: `ktrace:s()`
* Then get Riak to do some get requests.
* ktrace output should appear in VM #1's console. Save it to a file.
* Run `ktrace_fold:go("/path/to/input-file", "/path/to/output-file").`
* Run `flamegraph.pl /path/to/output-file > /path/to/output.svg`

For example: http://www.snookles.com/scotttmp/vnode-get.svg ... which
suggests that only about 15% of the wall-clock time for
`riak_kv_vnode` get processing is due to the backend (Bitcask in this
case).

## Questions?