Machine Learning and Grounded Theory Method

Machine Learning and Grounded Theory Method: Convergence, Divergence, and Combination.

Michael Muller, Shion Guha, Eric P.S. Baumer, David Mimno, and N. Sadat Shami. 2016. Machine Learning and Grounded Theory Method: Convergence, Divergence, and Combination. In Proceedings of the 19th International Conference on Supporting Group Work (GROUP '16). Association for Computing Machinery, New York, NY, USA, 3–8. DOI:https://doi.org/10.1145/2957276.2957280

Machine learning is more often associated with quantitative methods, statistics, and positivist thinking; grounded theory with qualitative methods, ethnography, and interpretivist thinking. This paper examines where machine learning methods and grounded theory, often viewed as very different approaches, converge. The authors propose that a synthesis of the two methods offers interesting approaches to both hypothesizing and methods, especially for those interested in combining "big data" with qualitative methods. This might include using large-scale quantitative analysis alongside localized or smaller-scale qualitative analysis. In other cases, one might begin with qualitative analysis and then use big data analysis to look at the larger picture.

Machine Learning Approaches

Centers on statistical models. Focuses on two types of variables: those observed in the data (features) and those not observed. The algorithm is trained to "find values for unobserved variables that best fit the values of the observed variables" (pg. 3-4). In other words, it uses observed variables to infer the values of unobserved variables (e.g., using spam/not-spam labels to infer "spamminess").
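The spam example can be made concrete with a minimal sketch. The messages, the function name word_spamminess, and the add-one smoothing are all assumptions chosen for illustration and are not taken from the paper. The observed variables are the words and the spam/not-spam labels; the unobserved variable being estimated is a per-word "spamminess" score.

```python
from collections import Counter

# Toy labelled messages, invented for illustration.
# Observed variables: the words in each message and its spam flag.
messages = [
    ("win free prize now", True),
    ("free money win", True),
    ("meeting agenda for monday", False),
    ("lunch on monday", False),
]

def word_spamminess(labelled):
    """Estimate a smoothed share of spam among messages containing each word."""
    spam_counts, ham_counts = Counter(), Counter()
    for text, is_spam in labelled:
        target = spam_counts if is_spam else ham_counts
        target.update(set(text.split()))  # count each word once per message
    vocab = set(spam_counts) | set(ham_counts)
    # Add-one smoothing keeps scores away from exactly 0 or 1.
    return {
        w: (spam_counts[w] + 1) / (spam_counts[w] + ham_counts[w] + 2)
        for w in vocab
    }

# Unobserved variable: the inferred per-word spamminess.
spamminess = word_spamminess(messages)
```

Here "free" appears only in spam messages and "monday" only in non-spam messages, so the inferred score for "free" is higher than the score for "monday".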

Types of ML Models

Supervised: Uses labelled data to predict outcomes for unseen data. Two types of observed variables: inputs and outputs. Common methods are regression, classification, and confirmatory factor analysis.

Unsupervised: Uses unlabelled data to infer relationships within the data, with no output labels to guide the model. Examples are clustering, PCA, topic modeling, and exploratory factor analysis.
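As a small illustration of the unsupervised case, the sketch below runs plain k-means clustering on unlabelled 2-D points. The points, the starting centroids, and the function name kmeans are assumptions made for the example; the paper does not prescribe a particular clustering algorithm.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means: alternate cluster assignment and centroid update."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Squared Euclidean distance from p to every centroid.
            distances = [
                sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids
            ]
            clusters[distances.index(min(distances))].append(p)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Unlabelled toy data: no outcome variable is supplied.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
```

The grouping comes from the data alone; deciding what each cluster means is left to the researcher.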

Grounded Theory Method

Focused on constructing theory from the data, from the ground up. Grounded theory is characterized by its rigorous coding process, starting with open coding and eventually formalizing open codes into axial codes, which summarize the relationships between the open codes. GT utilizes the constant comparison method, comparing data with data and data with the emerging theory. The researcher then constructs dimensions of difference between open and axial codes. The emergent theory is made up of these codes, categories, and dimensions. The researcher then uses this theory to select new research sites or participants to test the theory's weakest points (theoretical sampling and abductive logic).

Types of Grounded Theory

Given the disagreements between the original creators of GT, Glaser and Strauss, there are now multiple approaches to GT.

Glaser: Glaser developed a set of 40 coding families focused on bridging initial codes to formal theory.

Strauss: Focus is on the gradual development of open and axial codes into categories and dimensions.

Convergences

Make claims to be grounded in the data.
Begin with and return to the data.
Develop interim components of theory that describe differences across data (GT = dimensions; ML = features).
Iterative process.
Neither process is complete once the data is analyzed. Both require interpretation and theory-building.
A logic of abduction (ML = testing on unseen data to guard against overfitting).
Researcher selection (GT = coding most salient or important things; ML = feature selection).

Research Directions

Comparative Sequence Methods

Use methods sequentially.

GT > ML: Once data is coded/grouped, use ML methods to examine attributes of data.
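One possible reading of the GT > ML sequence, sketched under assumptions: the researcher's codes are treated as labels, and a nearest-centroid rule (chosen here only for brevity) summarizes how numeric attributes differ across codes. The code names, the two attributes, and the helper functions are hypothetical.

```python
from collections import defaultdict

# Hypothetical records: (GT code assigned by a researcher,
#                        [message_length, number_of_replies]).
coded = [
    ("seeking_help", [120, 5]),
    ("seeking_help", [100, 7]),
    ("announcing", [40, 0]),
    ("announcing", [60, 1]),
]

def code_centroids(records):
    """Average the attributes of all records that share a GT code."""
    groups = defaultdict(list)
    for code, features in records:
        groups[code].append(features)
    return {
        code: [sum(col) / len(col) for col in zip(*rows)]
        for code, rows in groups.items()
    }

def nearest_code(features, centroids):
    """Suggest the code whose centroid is closest to the given attributes."""
    return min(
        centroids,
        key=lambda code: sum((a - b) ** 2 for a, b in zip(features, centroids[code])),
    )

centroids = code_centroids(coded)
suggested = nearest_code([110, 4], centroids)  # an uncoded record
```

The centroids show how attributes vary by code, and the suggestion for an uncoded record is a prompt for the researcher rather than a final code.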

ML > GT: Begin with ML classification against ground-truth outcome measure, then use GT to search for emergent properties within each ML class.
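The ML > GT sequence could be sketched as follows, assuming a model has already produced a score for each record. The records, the 0.5 threshold, and the function name group_for_coding are hypothetical. Records are grouped by predicted class and ground-truth outcome so that each group can be read and coded qualitatively.

```python
from collections import defaultdict

# Hypothetical records: (text, score from an already-trained model,
#                        ground-truth outcome).
records = [
    ("thanks, that fixed it", 0.9, True),
    ("still broken after update", 0.2, False),
    ("works now but slow", 0.7, False),
    ("no reply yet", 0.4, True),
]

def group_for_coding(records, threshold=0.5):
    """Group texts by (predicted class, ground truth) for later GT coding."""
    groups = defaultdict(list)
    for text, score, truth in records:
        predicted = score >= threshold
        groups[(predicted, truth)].append(text)
    return dict(groups)

groups = group_for_coding(records)
```

Groups where the prediction and the ground truth disagree, such as groups[(True, False)], may be a useful place to start looking for emergent properties the model does not capture.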

Hybrid Iterative Methods

An integrated, iterative process using both methods.

Researchers might alternate between the iterative steps of GT and ML, evolving the grounded theory and refining the ML model in tandem. Utilizes constant comparison between GT and ML.
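A minimal sketch of what such an alternating loop might look like, assuming an arrangement that resembles active learning; this is an interpretation, not the paper's procedure. The function researcher_code is a stand-in for the human coding step, and its rule, the threshold model, and the data are invented for the demo.

```python
def fit_threshold(labelled):
    """ML step: midpoint between the mean values of the two coded groups."""
    pos = [x for x, code in labelled if code]
    neg = [x for x, code in labelled if not code]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def researcher_code(x):
    """Stand-in for the human GT coding step (hypothetical rule, demo only)."""
    return x >= 6

labelled = [(1, False), (10, True)]  # items the researcher coded first
uncoded = [2, 5, 7, 9]

history = []
while uncoded:
    threshold = fit_threshold(labelled)  # refit the model on all codes so far
    history.append(threshold)
    # The model surfaces the item it finds most ambiguous.
    item = min(uncoded, key=lambda x: abs(x - threshold))
    uncoded.remove(item)
    # The researcher codes that item, and the result feeds the next refit.
    labelled.append((item, researcher_code(item)))

final_threshold = fit_threshold(labelled)
```

Each pass compares the model's current boundary with the researcher's latest code, which is one way to read constant comparison between the two methods.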