Content Validity of High-Dimensional Measurement

“Incidental data” from sources like smartphones, social media and search queries are abundant. They are collected continually and record human behaviour in natural environments. Therefore, such data can be ideal for measuring social phenomena. In fact, previous research has made successful predictions of variables like human values and personality from incidental data.

However, when the research goal is not to predict but to explain (i.e. to test a theory or estimate a causal model), as is often the case in the social sciences, at least two major challenges arise with the use of incidental data. First, in explanatory modelling, the scores constructed for latent constructs must not only correlate, but also be reliable and valid. Unfortunately, incidental data tend to suffer from measurement error because, by definition, they are not collected for the purpose of scientific research. Methods like latent variable measurement models are therefore needed to estimate and correct for this measurement error. Second, the indicators in incidental data (e.g. social media posts and clickstreams) are often high-dimensional, and most have low relevance to the target concepts. This raises the problem of high-dimensional measurement, which existing latent variable measurement models cannot handle.

The goal of Qixiang Fang's PhD project is to improve high-dimensional measurement of theoretical (latent) constructs (like human values and personality) by leveraging knowledge from the fields of statistics, machine learning and natural language processing, with a focus on content validity and text data.

The PhD project is supervised by Dr. Daniel Oberski and Dr. Dong Nguyen. It is financed by the NWO Vidi grant awarded to Dr. Daniel Oberski.
