There's a nice Xilinx application note from Ken Chapman:
Multiplexer Design Techniques for Datapath Performance with Minimized Routing Resources https://www.xilinx.com/support/documentation/application_notes/xapp522-mux-design-techniques.pdf
The most intriguing part is what he calls a data selector, but which might be more descriptively termed a one-hot mux, since it's a mux where the control signals are one-hot coded:
onehotmux(data, sel) = (data[1] & sel[1]) | ... | (data[n] & sel[n])
Chapman shows how onehotmux can be implemented with the help of an FPGA's carry chains, but the explanation wasn't very clear to me. So I'll try to explain it here in programmer terms.
Low-level-oriented software hackers are familiar with bitwise tricks that exploit cascading carries where a carry cascades from a lower part of a word through a run of 1s into a higher part. You can understand Chapman's trick quite simply with this concept. Let's look at a few examples:
11111011
+ 00000001
= 011111100
11111111
+ 00000001
= 100000000
11011111
+ 00000001
= 011100000
If you look at the carry bit of the result (the 9th bit in these examples) as the output of the computation, then you can see that the output is 1 if and only if the input bits are all 1s.
Back to the one-hot mux. Here's the formula for what Chapman does the way I would write it:
onehotmux(data, sel) = ((data | ~sel) + 1)[n + 1]
The idea is that we selectively introduce a gap in the carry cascade to control the final carry.
Since it's a one-hot mux, all but one of the selector signals is 0, so at most one of the addend bits will be 0. Say the 4th bit is selected:
1111x111
+ 00000001
= x........
Here sel[3] is 1, so x = (data[3] | ~sel[3]) = data[3] and hence the carry output is data[3].
There's a nice logical interpretation since a | ~b is equivalent to the logical implication b -> a.
In Chapman's application note, he combines this with a standard FPGA trick of doing a 3:1 selection in the LUT6 before feeding into the carry chain, so you can do up to 4x3 = 12 one-hot mux selections with the resources in a single CLB.
Aside from conceptual clarity, which is an admittedly subjective quality, there's at least one objective advantage to my formulation of this trick: FPGA synthesis tools should have no problem mapping this formula onto carry chains, whereas Chapman's formulation requires hand-mapping the CLB resources, which is hard to maintain and highly vendor specific.