Natural language descriptions of organismal phenotypes, a principal object of study in biology, are abundant in the biological literature. Expressing these phenotypes as logical statements using formal ontologies would enable large-scale analysis of phenotypic information from diverse systems. However, considerable human effort is required to make the semantics of phenotype descriptions amenable to machine reasoning by (a) recognizing appropriate ontological terms for entities in text and (b) stringing these terms into logical statements. Most existing Natural Language Processing tools stop at entity recognition, leaving a need for tools that can assist with both aspects of the task. The recently described Semantic CharaParser aims to meet this need. We describe the first expert-curated Gold Standard corpus for ontology-based annotation of phenotypes from the systematics literature. We use it to evaluate Semantic CharaParser's annotations and explore differences in performance between humans and machine. We use four annotation accuracy metrics that can account for both semantically identical and similar matches. We found that machine-human consistency was significantly lower than inter-curator (human-human) consistency. Surprisingly, allowing curators access to external information that was not available to Semantic CharaParser neither significantly increased the similarity of their annotations to the Gold Standard nor significantly affected inter-curator consistency. We found that the similarity of machine annotations to the Gold Standard increased after new ontology terms relevant to the input text had been added. Evaluation by the original authors of the character descriptions indicated that the Gold Standard annotations came closer to representing their intended meaning than did either the curator or machine annotations.
These findings point toward ways to better design software to augment human curators, and the Gold Standard corpus will allow training and assessment of new tools to improve phenotype annotation accuracy at scale.