Create evaluation suite for benchmarking various models

Related to: https://github.com/uw-ssec/llmaven/issues/2

- [ ] Define our needs for an evaluation suite
- [ ] Collect the metrics we will use to benchmark various models with this evaluation suite
- [ ] Update the deepeval evaluation code to use the modular retrieval/generation code
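
To make the last item more concrete, here is a rough sketch of how an evaluation harness might sit on top of separate retrieval and generation components. It is deliberately library-agnostic and does not use deepeval's API. Every name in it (`Retriever`, `Generator`, `EvalCase`, `run_suite`, the two toy metrics, and the stub classes) is hypothetical and only illustrates the shape; the real interfaces should come from the modular retrieval/generation code.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class Retriever(Protocol):
    """Hypothetical interface for the modular retrieval step."""
    def retrieve(self, query: str) -> list[str]: ...


class Generator(Protocol):
    """Hypothetical interface for the modular generation step."""
    def generate(self, query: str, context: list[str]) -> str: ...


@dataclass
class EvalCase:
    query: str
    expected: str


# A metric maps (case, retrieved context, generated answer) to a score in [0, 1].
Metric = Callable[[EvalCase, list[str], str], float]


def exact_match(case: EvalCase, context: list[str], answer: str) -> float:
    """Toy generation metric: 1.0 if the answer equals the expected string."""
    return 1.0 if answer.strip().lower() == case.expected.strip().lower() else 0.0


def context_hit(case: EvalCase, context: list[str], answer: str) -> float:
    """Toy retrieval metric: 1.0 if any retrieved chunk contains the expected string."""
    return 1.0 if any(case.expected.lower() in c.lower() for c in context) else 0.0


def run_suite(
    retriever: Retriever,
    generator: Generator,
    cases: list[EvalCase],
    metrics: dict[str, Metric],
) -> dict[str, float]:
    """Run every case through retrieval then generation; return mean score per metric."""
    totals = {name: 0.0 for name in metrics}
    for case in cases:
        context = retriever.retrieve(case.query)
        answer = generator.generate(case.query, context)
        for name, metric in metrics.items():
            totals[name] += metric(case, context, answer)
    if not cases:
        return totals
    return {name: total / len(cases) for name, total in totals.items()}


# Stub components standing in for real retrieval/generation backends.
class KeywordRetriever:
    def __init__(self, docs: list[str]) -> None:
        self.docs = docs

    def retrieve(self, query: str) -> list[str]:
        # Return every document sharing at least one word with the query.
        words = set(query.lower().split())
        return [d for d in self.docs if words & set(d.lower().split())]


class FirstDocGenerator:
    def generate(self, query: str, context: list[str]) -> str:
        # Return the first retrieved chunk verbatim, or an empty string.
        return context[0] if context else ""
```

Because the retriever, generator, and metrics are passed in as arguments, swapping models or adding a metric should not require touching `run_suite`. The deepeval metrics could presumably be wrapped to fit the `Metric` signature, though that mapping is an assumption to verify against the deepeval docs.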