adversarial thinking applied to language models. curious to see what standardized evaluations for LLMs will look like (will they be conducted by other models?)