Data is now a ubiquitous element of our lives. Voluminous amounts of data are produced across a wide spectrum of industries and applications, collectively referred to as "big data." Big data is increasingly pervasive, and experts are exploring its potential for insight in a variety of fields. In his book "Big Learning Data," Elliott Masie identifies the use of big data as the next relevant step in the field of learning, training and development.
As expected, we are seeing an explosion of technologies and applications for storing, processing and mining extremely large data sets; Hadoop and IBM Watson are notable examples. All this hype has created some trepidation in the L&D community, with many fearing that exotic new technologies and analytical techniques are required.
Granted, there can be hundreds of measures and thousands of data points, both structured and unstructured, but when you "peel back the onion," in the end, you still have data. And as we know, the field of data analytics is already replete with techniques for analyzing data. All of these techniques start with a dataset containing observations and variables (or measures). Think of a dataset as a spreadsheet, with observations represented by the rows and variables by the columns.
There are two traditional approaches to data analysis that can be adapted to big data and should be within the grasp of many in the training and development community:

- Factor Analysis, used to reduce dimensionality (the number of key variables) in big data sets, affording a more manageable analysis.
- Ensemble Modeling, used to create consistent and robust predictive models from big data sets.
The use of these analytical techniques is predicated on two basic tenets of big data analysis that should always serve as our guiding light:
- Eliminate information redundancy among the variables in big data sets.
- Don’t get fooled by random events in big data samples.
So what makes a dataset big? It may contain a lot of rows (observations), such as thousands of learners taking courses or thousands of customers rating the technical ability of a customer support engineer. A dataset may also contain a lot of columns, representing hundreds of variables or dynamic, repeated measurements over time, such as the length of time that a learner stays on a given screen of an online course. Regardless of what makes a dataset big, remember that in every case it still fits the spreadsheet metaphor of rows and columns.
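The spreadsheet metaphor can be made concrete with a few lines of code. This is a minimal, hypothetical sketch using pandas; the column names (learner IDs, quiz scores, time-on-screen) are invented for illustration:

```python
import pandas as pd

# Hypothetical learner-level dataset:
# each row is an observation, each column a variable (measure).
df = pd.DataFrame({
    "learner_id": [101, 102, 103],
    "course": ["Onboarding", "Onboarding", "Compliance"],
    "quiz_score": [88, 72, 95],
    "seconds_on_screen_3": [41.2, 88.0, 12.5],
})

print(df.shape)  # (3, 4): 3 observations (rows), 4 variables (columns)
```

A "big" dataset simply stretches this same shape: millions of rows, hundreds or thousands of columns.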
As an analytical technique, Factor Analysis strives to mitigate redundancy or correlation among the variables. If not properly accounted for, this redundancy can lead to bias and instability in predictive models. Factor Analysis seeks to account for the correlations among a large number of variables in terms of a much smaller set of unobserved variables called factors.
Think of Factor Analysis as the statistical "analog" of Affinity Mapping, often used in business planning. You can think of the variables as Post-it notes and the factors as named groupings of those notes. Because the effects of redundant data are minimized, the resulting predictive model represents the most parsimonious structure of the data.
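To make the idea tangible, here is a small sketch using scikit-learn's `FactorAnalysis`. The data is simulated: two hidden factors drive six observed survey-style variables, and the analysis recovers a compact two-column representation from the six correlated measures. The loadings and sample sizes are illustrative assumptions, not values from any real study:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Simulate 500 respondents: two latent factors drive six observed variables.
factors = rng.normal(size=(500, 2))
loadings = np.array([[0.9, 0.0], [0.8, 0.1], [0.7, 0.0],
                     [0.0, 0.9], [0.1, 0.8], [0.0, 0.7]])
observed = factors @ loadings.T + 0.3 * rng.normal(size=(500, 6))

# Reduce the six correlated columns to two factor scores.
fa = FactorAnalysis(n_components=2, random_state=0)
scores = fa.fit_transform(observed)

print(scores.shape)          # (500, 2): the reduced dataset
print(fa.components_.shape)  # (2, 6): estimated loadings, one row per factor
```

The six redundant columns collapse into two factors, which can then feed a downstream predictive model without carrying the redundancy along.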
Think of big data as more data, not necessarily better data. Remember, big data still measures many attributes of the “social context” indirectly; it is just measuring more of them. The problem becomes one of reliably separating the signal from the noise.
Noise or “spuriousness” can be a real artifact of big data in that a random event that occurs one time in 1,000 occurrences (0.1% of the time) will occur 1,000 times in a dataset of 1 million observations. Thus, correlations and patterns can manifest that may turn out to be spurious.
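Both halves of this point are easy to demonstrate with simulated data. The sketch below first confirms the arithmetic (a 1-in-1,000 event occurs about 1,000 times in a million observations), then shows how screening many pure-noise variables against a pure-noise outcome still surfaces a correlation that looks "real":

```python
import numpy as np

rng = np.random.default_rng(1)

# A 1-in-1,000 random event really does occur roughly 1,000 times
# in a million observations -- rarity alone does not imply meaning.
events = rng.random(1_000_000) < 0.001
print(events.sum())

# Screen 1,000 unrelated candidate variables against a noise outcome:
# the single best correlation can look convincing yet be entirely spurious.
y = rng.normal(size=200)
X = rng.normal(size=(200, 1000))
corrs = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(1000)])
print(round(abs(corrs).max(), 3))  # typically well above 0.2
```

The more variables you screen, the stronger the best spurious correlation tends to be, which is exactly why big data demands disciplined validation.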
Ensemble Modeling enables you to build and test predictive models on many independent samples drawn from the data. What you are looking for through the repeated sampling and modeling are variables that show up consistently across samples. The resulting models tend to identify reliable relationships and deliver greater predictive accuracy.