Validating social media data


Generating quantitative observations from unstructured data in social media is new. So, surprise, it's a field that doesn't have mature standards yet. Really, we don't even have accepted definitions yet, because mainstream marketers—who are still hearing that they need to start listening—don't know enough about social media practices to drive standardization. It's not too early for challenges to the validity of the data, however.

Social media analysis is not (usually) survey research
Because it's not widely understood, and the discussion has tended to focus on the benefits of listening, social media analysis is sometimes criticized for not following the standards of other types of research. George Silverman's comparison of online and traditional focus groups is a good example.

Justin Kirby took a different swing at social media measurement, comparing data mining to survey research:

Just look at buzz monitoring practitioners who place great stock in sentiment analysis, but have none of the usual checks and balances (such as standard deviation) that underpin data validity within traditional research. If you can't calculate any margin of error, let alone show that you're listening to a representative sample of a target market, then how can you really prove that your analysis is sorting the wheat from the chaff and contributing valuable actionable data to your client's business?
(Justin has points worth pondering in the rest of his article, so go read it. I did note that "marketers" became "advertisers" early in the article, which suggests a partial answer to his complaint.)

Traditional research is based on sampling, where tests to determine the validity of the sample data are crucial (and, typically, poorly understood). Most social media analysis vendors use automated methods to find all of the relevant posts and comments on a topic, which then feed their analytical processes.
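For contrast, here is roughly what the survey-side "checks and balances" Justin mentions look like in practice. This is a minimal sketch in Python with hypothetical numbers; the point is only that the margin-of-error formula assumes a random sample of known size, which a collect-everything approach doesn't give you.

    import math

    def margin_of_error(p, n, z=1.96):
        """Margin of error for a proportion p estimated from a simple
        random sample of size n, at roughly 95% confidence (z = 1.96)."""
        return z * math.sqrt(p * (1 - p) / n)

    # Hypothetical survey: 38% positive sentiment among 400 respondents
    print(margin_of_error(0.38, 400))  # ~0.048, i.e. plus or minus 4.8 points

    # There is no obvious analogue to this check when a vendor claims to have
    # collected *every* relevant post rather than a sample of them.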

Testing the results
I won't argue against the idea of tests to validate the data, but tests created for surveys and samples aren't necessarily relevant to new techniques. The question is, what's the right test of a "boil the ocean" methodology? Here are some of the challenges, which are different from "is the sample representative?" (a rough spot-check for a couple of them appears after the list):

  1. How much of the relevant content did the search collect (assuming the goal is 100%—if not, you're sampling)?
  2. How accurately did the system eliminate splogs and duplicate content?
  3. How accurately did the system identify relevant content?
  4. How accurate is the content scoring?
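Challenges 1 and 3 are essentially recall and precision questions, and they can be spot-checked the way search results are evaluated: pull a small audit set, label it by hand, and compare it against what the system kept. A minimal sketch, with hypothetical counts:

    def precision_recall(true_positives, false_positives, false_negatives):
        """Precision: how much of what the system kept was actually relevant.
        Recall: how much of the relevant content out there it actually found."""
        precision = true_positives / (true_positives + false_positives)
        recall = true_positives / (true_positives + false_negatives)
        return precision, recall

    # Hypothetical audit: of 200 posts the system flagged as relevant,
    # 170 really were; a separate check turned up 40 relevant posts it missed.
    p, r = precision_recall(true_positives=170, false_positives=30, false_negatives=40)
    print(f"precision={p:.2f}  recall={r:.2f}")  # precision=0.85  recall=0.81
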
Reaching perfection
Ideally, results would be reproducible, competing vendors would get identical results, and clients would be able to compare data between vendors. Theoretically, everyone is starting with the same data and using similar techniques. All that's left is standardizing the definitions of metrics and closing the gap between theory and practice. Easy, right?


2 Comments

Thanks for sharing this topic; I think it's an important one to discuss. Technically, I think social media analytics attempt to perform statistical censuses, not samples. But the problem of accuracy is just as important. Even the US Census has had response rates as low as 50% in some cases and had to struggle with error and confidence issues.

We know from "digital divide" discussions that online users aren't necessarily representative of all Americans (the Pew Internet and American Life project has numerous examples). We also know from other research that as little as one in seven participants in social media actually contribute content (See the 12/2007 Pew study on Teen use of Social Media). The end result is that what the analytics measure is a very skewed portion of the public, despite the assumption is that buzz and sentiment findings are representative of the public as a whole. Without additional research to contextualize and validate social media research, for now everyone should be warned that what they get simply may, or may not, be accurate.

Thanks, Guy. The Census comparison is helpful, since that's an established example of a survey that attempts to reach every individual.

Projecting social media research results to the general population is a tricky area, though projects that combine social media with traditional research may help. I agree that the demographics issue complicates things.

In reading the post again, I see that I didn't state the purpose of social media research, which would be a problem if this were how I started a project. Part of the fun with the critiques is that proponents of established methods tend to defend their own particular turf, while the new techniques have multiple applications that cross functional silos.

One tactic for making the data more representative is to focus on communities over blogs.

For another take on the research question, see Jim Nail's post, Is Prediction Even the Point of Social Media?

