Before you can pull insights from your data, you need data, and lately I'm hearing more concerns about data quality in social media analysis. People used to ask about the traditional tradeoff in text queries: finding relevant content while excluding the off-topic. Now I'm hearing more about social data that's intentionally tainted. If you're looking for meaning in social media data, you may have to deal with adversaries.
Yes, and you've been playing without an opponent, which is, as you may have guessed, against the rules.
— "Anton Ego," Ratatouille
Ask a company with three initials as a name how many three-letter abbreviations are in use, and you get a sense of the challenge in finding relevant content. Common words as brand names pose a similar challenge (I always like the examples of Apple and Orange, because for once you really can compare them). If people are honest and expressing their real opinions, it's hard enough to find what you're looking for.
The problem is, people aren't always honest. You also need to get rid of intentional noise in the data.
The analyst's adversaries
We've all seen online spam (sorry, Hormel, you must hate that term). Junk mail for hormones and drugs in email, junk comments on blogs, junk blogs, trashy web sites—the costs are so low that even microscopic conversion rates are profitable, so it persists. Some of that shows up in social media, which is the problem here.
At the recent Social Media Analytics Summit, Dana Jacob gave a talk on the spam that finds its way into the search results of social media analysis platforms, skewing the numbers. One tidbit that Dana shared to illustrate the challenge: If you consider all of the creative misspellings, there are 600 quintillion (6 x 10^20) ways to spell Viagra. So removing all of the spam from your data is a challenge.
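That kind of number comes from simple multiplication: if each letter can be swapped for a handful of look-alikes, the counts compound fast. Here's a toy sketch in Python with made-up substitution sets (the sets and the resulting count are illustrative, not the ones behind the 600 quintillion figure):

```python
from math import prod

# Hypothetical look-alike substitutions per letter (leetspeak, symbols,
# accented characters). Real spammers also insert punctuation and spaces,
# which inflates the count far beyond this toy example.
substitutions = {
    "v": ["v", "V", "\\/", "ν"],
    "i": ["i", "I", "1", "!", "í"],
    "a": ["a", "A", "@", "4", "á"],
    "g": ["g", "G", "9", "6"],
    "r": ["r", "R", "®"],
}

def variant_count(word: str) -> int:
    """Count distinct spellings if each letter is chosen independently
    from its substitution set (unlisted letters have a single form)."""
    return prod(len(substitutions.get(ch, [ch])) for ch in word.lower())

print(variant_count("viagra"))  # 4 * 5 * 5 * 4 * 3 * 5 = 6000
```

Even this tiny substitution table yields thousands of spellings of one word, which is why exact-match spam filters don't get far.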
Spam seems to come in two flavors, neither of which will help you understand public opinion or online coverage. One is designed to fool people, to get them to click a link. It may lead to malware or fraud, or to some sort of product for sale. The other is designed to fool search engines, with keywords and links embedded in mostly irrelevant text. It's obvious to a human reader, but the hope seems to be that some search engines will count the links in their ranking of the target site.
Gaming analytics platforms
Another presenter outlined a more direct challenge to the social media analyst when he described his approach to gaming analytics platforms with content farms and SEO tactics. He talked about exploiting weaknesses in analytics systems to plant information in them. One slide described his methods as "weaponizing information in a predictive system," which doesn't leave a lot of room for exaggeration.
He even used a real client as an example. The question is, how many others do the same thing, but discreetly? If you're looking for market intelligence in social media, do you trust your sources?
Deception in crowdsourced data
Another conversation went into the potential poisoning of the crowdsourcing well, in this case one of the crowdmapping efforts in a political conflict. If one party to the conflict entered false reports—perhaps to discredit the project or misdirect a potential response—could it be detected?
Beyond the crowdmapping context, can you detect opposition personas that post false reports in social media? It's a standard tactic in the government/political arena, but it could hit you in business, too. All you need is a motivated opponent.
Next: ideas for detecting deception
I don't mean to be all problem and no solution, but this post is already a long one. I'll share some ideas on how we might detect deception in social media in my next post. For now, I'll end with a happier observation: Sometimes, people lie in real life and get caught when they reveal the truth in social media.
Update: Part 2 is now up: Detecting Deception in Social Media
Photo by John Cooper.