Can we stop lowering FC to Matmul+BatchedAdd?

GEMM is almost always implemented as Y = AX + b, with the bias add built in, so splitting FC into two nodes leaves us doing something less efficient than a single call. That holds on pretty much every HW backend, too. We could write new optimizations to re-fuse them, but why lower in the first place?
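
To make the cost concrete, here is a rough sketch in plain Python, not taken from our codebase. All function names (`matmul`, `batched_add`, `fc_lowered`, `fc_fused`) are made up for illustration, and the operand order is an assumption. The lowered form materializes the full matmul result and then walks it a second time to add the bias. The fused form seeds each accumulator with the bias and writes the output once.

```python
def matmul(a, w):
    """Plain matrix product: a is [batch x k], w is [k x n]."""
    return [[sum(a_row[p] * w[p][j] for p in range(len(w)))
             for j in range(len(w[0]))]
            for a_row in a]


def batched_add(y, bias):
    """Add the same bias row to every row of y (a second pass over the output)."""
    return [[v + b for v, b in zip(row, bias)] for row in y]


def fc_lowered(a, w, bias):
    # Two nodes: the matmul result is fully materialized, then re-read to add the bias.
    return batched_add(matmul(a, w), bias)


def fc_fused(a, w, bias):
    # One node: each accumulator starts at the bias, so the output is written once.
    out = []
    for a_row in a:
        row = []
        for j in range(len(bias)):
            acc = bias[j]
            for p in range(len(w)):
                acc += a_row[p] * w[p][j]
            row.append(acc)
        out.append(row)
    return out


# Hypothetical inputs, chosen only to show both paths agree.
a = [[1, 2], [3, 4]]
w = [[5, 6], [7, 8]]
bias = [10, 20]
assert fc_lowered(a, w, bias) == fc_fused(a, w, bias)
```

Both paths give the same values. The only difference is the intermediate buffer and the extra pass, which is exactly what a re-fusing optimization would have to undo.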