Towards Robust Off-Policy Evaluation via Human Inputs