Evaluating large language models for abstract evaluation tasks: an empirical study | Synapse