May 2012 Archives

In preparing for last month's Social Media Analytics Summit, I needed a talk on the emergence of the social media analytics industry—which was tricky, since I don't usually talk about social media analytics. I didn't want to set up an elimination round of buzzword sweepstakes, arguing for this usage or that. Instead, I looked for a unifying theme, which led to a new question and three categories of social media data.

I've used a disappointment setup in my presentations for a while. "What's the best tool?" "It depends." The point is to get people thinking about what they're trying to accomplish, rather than jumping on the bandwagon for a popular tool. One of the questions I've suggested is "how do you measure social media?" There's an assumption hiding in that question, which became a limitation when I tried to update my slides. I needed a better question.

What can you do with social media data?
The key was to focus on the basic building blocks of analytics: data, analytics, and application. We tend to focus on the analytics technologies and the end-user applications, but what about the data? What if we focus on social media as a source of data? Ah, there we go.

What kind of data do social media give us to work with? Looking at the various specialists working the question, I've found three basic categories:

I'll go into each of these categories in the next few posts, but first, let's acknowledge that these are not rigid boundaries. Mixing data types and analytics lenses is definitely something to encourage, but if we want the data types to play together, we should understand what they are, first.

Next: Working with Social Media Data: Content

Photo by hugovk.

A Metaphor Is a Mixed Drink

It's Friday, and I've been writing long posts lately, so here's a simple idea: A metaphor is a mixed drink. In business, we use a lot of metaphors, some better than others.

Some are easy for beginners. They're simple and sweet, and they lose their appeal over time.

Some are difficult at first, but surprisingly good when you figure them out.

Some are old, traditional, and still on point. The classics.

Some are just outdated.

Some pack a lot of ingredients into a simple effect. All that work for so little result.

Some are gimmicky and less clever than they think. Sparkly!

Some appear simple but are capable of important subtleties.

Some look better than they are.

Some think they're metaphors but are actually similes.

The next time you're stuck in a meeting and the metaphors start to fly, you can amuse yourself by figuring out which drink a metaphor would be. It's more stealthy than shouting "bingo" after one too many clichés.

Happy weekend.

Photo by Kurman Communications, Inc.

Detecting Deception in Social Media

It's bad enough when people are wrong as they express facts and opinions on the Internet. Mistakes happen. But there's more going on. Some people are intentionally adding noise to the online world, in an attempt to mislead users and analysts. Garbage in, garbage out, so how do we catch the garbage before it becomes part of the analysis?

This post is the second in a series. The first is Can You Trust Social Media Sources? Most of my posts aren't this long; the next will be nice and short.

Catching and deleting spam and other garbage in social media data is one side of an arms race, just like email spam and computer viruses. Developers of social media analysis platforms work to eliminate spam from their results, and spammers develop new tactics to dodge the filters. As long as the incentives remain, people will find ways to game the system.

For most analysts, the main response is to pick a platform that does a decent job of catching the undesirable content. Most do some sort of machine learning to identify and filter spam, and while the results are imperfect, they're useful as a first step. The second step is to allow users to flag content as spam, and it's good if the system learns from that action. A third step is to allow users to blacklist a site altogether; once you know it's not what you're looking for, there's no need to rely on the spam-scoring engine.
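Those three steps (an automatic filter, user flagging the filter learns from, and a site-level blacklist) can be sketched in a few lines. This is a toy illustration, not any vendor's implementation; the class, the token-count scoring, and the example domains are invented for the sketch:

```python
from collections import defaultdict

class SpamFilter:
    """Toy three-layer filter: learned token scores, user flags, site blacklist."""

    def __init__(self):
        self.spam_counts = defaultdict(int)  # token -> times seen in spam
        self.ham_counts = defaultdict(int)   # token -> times seen in legitimate posts
        self.blacklist = set()               # domains the user excluded outright

    def _tokens(self, text):
        return text.lower().split()

    def train(self, text, is_spam):
        counts = self.spam_counts if is_spam else self.ham_counts
        for tok in self._tokens(text):
            counts[tok] += 1

    def flag(self, text, domain=None):
        """User marks a post as spam; the filter learns from the action."""
        self.train(text, is_spam=True)
        if domain:
            self.blacklist.add(domain)

    def is_spam(self, text, domain=None):
        # Step 3: a blacklisted site skips the scoring engine entirely.
        if domain in self.blacklist:
            return True
        # Step 1: score each token by how spammy it has looked so far.
        score = sum(self.spam_counts[t] - self.ham_counts[t] for t in self._tokens(text))
        return score > 0
```

Real platforms use proper probabilistic models rather than raw token counts, but the shape is the same: a learned first pass, a feedback loop from user flags, and a hard exclusion list for sources you've already judged.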

Evaluating questionable data
This is where I'd love to give you the magic button that reveals deceptive content. I'd like to have the Liar Liar power, too, but that's not going to happen. Instead, I have some ideas about how to think about questionable results. Most of them are in the form of questions. Some are more probabilistic than definitive, but I think they could be helpful.

  • Consider your purpose
    Your sensitivity to garbage in your data depends on what you're doing with it. If you're monitoring for customer service purposes, flag the spam and move on. If you're reporting on broad trends, you might get better results through sampling, or by focusing on high-quality sources. If you're looking for weak signals, you may not have the luxury of ignoring the low signal-to-noise ratio of a wide search. As always, match the effort to the objective.

    Some people actually need to look at spam—consider the legal department. If a link leads to a site selling counterfeit merchandise and you're in a trademark protection role, the spam is what you're looking for.

  • Consider the source (person)
    Who posted the item in question, and what do you know about them? Is the poster a known person? What do you know from the individual profile? Who does the person work for? What groups is the person connected to? Does the person typically discuss the current topic? Is the person's location consistent with the information shared?

    If you're not sure whether the poster is a person or a persona, develop a profile. A persona is like a cover identity; it can be strong or weak. Does the persona have a presence on multiple networks? Since when? Is it consistent across networks? Does it have depth, or is every post on the same topic? Who does the persona associate with online, and what do you know about them? Do the persona's connections reveal the complexity of relationship types that real people develop (school, work, family, etc.)? Do the profiles and connections give information about background that can be checked?

    For questionable sources, think about the different types of data that might reveal something through social network analysis.

    Back at the Social Media Analytics Summit, Tom Reamy described researchers' work to identify writers' political leanings from their language choices when writing about non-political topics. Can we use text analytics to add information about native language, regional differences, and subject-matter expertise to individual profiles?

  • Consider the source (site)
    Where was the data posted? What do you know about the site? Is it a known or probable pay-to-play or disinformation site? Is it a content-scraping site? Does it have information from a single contributor (such as a blog) or from many (such as a crowdsourcing site)? What else is posted to the site? Where is it hosted? Who owns it? Where are they based? What can you learn from the domain registration?

    What's the online footprint of the site? Is it linked to real people in social networks? Is it used as a source by other people? Credibility flows through networks; do known, credible (not necessarily influential) people link to it and share its content in their networks? Does it appear to have bought its followers, or are they real people?

  • Consider other sources
    If you're going to do something serious—and I'll leave the definition of serious as an exercise for the reader—don't trap yourself in a new silo for social media data. What else do you know? What do other online sources say? Does the questionable data fit with what you're getting from sources outside of social media? Are you getting similar information from credible sources, or are all of the sources for the questionable data unknown?

    A few months ago, I heard Craig Fugate, the Administrator of the (US) Federal Emergency Management Agency (FEMA), tell a story about government agencies and unofficial sources of information. The story involved a suspected tornado and unconfirmed damage reports in social media. Government agencies prefer official reports from first responders and other trained observers, so the question was how to evaluate reports in social media.

    In the case of severe weather, one answer is to compare the reports with official sources of weather data. If radar indicated a likely tornado passing over a location a few minutes before the damage reports, then you'd know something important that should help evaluate those reports. What's the analogy for your task? Is there a hard-data source that can add relevant information? Does a geospatial view add a useful dimension (as radar, the post location, and photo metadata converging on the same place would, in the example)?

  • Consider the incentives
    What does a potential adversary stand to gain by fooling you—or someone else looking at the same data—with false information? Who gains by leading you to an incorrect action? Who makes money on your decision? Who benefits from misleading other people with false information (think product reviews and propaganda)? Is questionable information in your system consistent with the aims of an interested party?

    Part of the challenge here is that false information could be intended to mislead anyone. The target could be an individual, a small group, or entire populations. Who gains? Is there a link from the source to an interested party?

  • Consider the costs
    Part of what makes spam so frustrating is the volume level—there's a lot of the stuff around. At some point, the signal-to-noise ratio gets so low that the source becomes useless, unless you can identify and eliminate the junk. In a way, all that junk adds up to a sort of denial-of-service attack at the content layer. Is there a way to deal with that?

    A denial-of-service (DoS) attack and its scaled-up variant, the distributed denial-of-service (DDoS) attack, overload the targeted web site with simultaneous requests, causing it to become unavailable to real visitors. In 2010, Amazon weathered a DDoS attack without losing service. The explanation was that their normal operation looks a lot like a DDoS attack—lots of people visiting the site simultaneously. Their system was built to handle that kind of load, so the attack failed. One answer to a DDoS attack, then, is to have the capacity to handle the load.

    The social media analysis equivalent is to process it all, so what would that look like? Would a deeper analysis of known junk and its sources help improve the identification of junk? Would it tell you something useful about the parties that post the junk?

  • Consider the consequences
    The final point is to revisit the first point. What are you trying to accomplish? What decision will you make based on the data, and what happens if the information was false? What if it was placed there to manipulate your response (even if the information itself is true)? Does the rest of the decision-making process have the safeguards to prevent costly errors?
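The severe-weather case can be made concrete. Here is a minimal sketch of the corroboration check, assuming you have radar fixes and geotagged, timestamped reports; the data shapes, coordinates, and thresholds are invented for the example:

```python
import math
from datetime import datetime, timedelta

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def corroborated(report, radar_fixes, max_km=10.0, max_minutes=30):
    """True if any radar fix put a likely tornado near the report's location,
    shortly before the report was posted."""
    for fix in radar_fixes:
        close = haversine_km(report["lat"], report["lon"],
                             fix["lat"], fix["lon"]) <= max_km
        dt = report["time"] - fix["time"]
        recent = timedelta(0) <= dt <= timedelta(minutes=max_minutes)
        if close and recent:
            return True
    return False
```

The same pattern (does the soft report line up with a hard-data source in space and time?) transfers to other domains whenever such a source exists.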

The hard problem
One way to look at this is to go through the whole process while thinking "spam." Junk results are an annoyance if you're doing day-to-day monitoring for business, and they're a problem if you're doing quantitative analysis. The technology is improving, and you have options for dealing with spam in these settings.

Some junk isn't that hard to catch, especially once a person looks at it. Gibberish blog comments are easy to identify. Names and email addresses that don't match are sort of obvious, too. Content scrapers and other low-quality sites tend to have a certain look. If you have time to look at the spam that evades your filters, you can catch a lot of it.

The real challenge comes in looking for intelligence—whether in business, finance, politics, or government—in the presence of a motivated and well-funded adversary. If someone wants to fool you—or at least keep you from using an online source—they can improve their chances by better imitating the good data surrounding their junk. The quick glance to identify spam becomes a bigger effort, with more uncertainty.

Pay-to-play blogs may have original content from professional writers, so you can't just look for poor quality. False personas may be developed over time, with extensive material to create a convincing backstory. Networks of such personas could post disinformation, along with more normal-looking content, across multiple sites. With time and resources, personas can appear solid, which is why governments are investing in them.
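Some of the persona questions from the checklist above can be turned into a crude screening pass. A sketch, under the assumption that you've already collected profile snapshots per network; the field names and red-flag rules are invented for illustration:

```python
from collections import Counter

def persona_signals(profiles, connections, recent_year):
    """Collect red flags for a suspected persona.

    profiles: per-network snapshots, e.g.
        {"network": "twitter", "name": "...", "location": "...", "joined": 2009}
    connections: relationship labels for the account's contacts,
        e.g. ["work", "school", "family", "unknown"]
    recent_year: accounts with no history before this year look thinner
    """
    flags = []
    if len(profiles) < 2:
        flags.append("single-network presence")
    if len({p["name"] for p in profiles}) > 1:
        flags.append("inconsistent names across networks")
    if len({p["location"] for p in profiles}) > 1:
        flags.append("inconsistent locations across networks")
    if profiles and min(p["joined"] for p in profiles) >= recent_year:
        flags.append("no history before this year")
    # Real people accumulate varied relationship types (school, work, family).
    if len(Counter(connections)) < 2:
        flags.append("connections lack the variety real people accumulate")
    return flags
```

The point is not these particular rules, which a well-funded adversary would clear easily, but that the checklist becomes something you can run across thousands of accounts instead of eyeballing them one at a time.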

I think some of the techniques above could help, but it's really a new arms race. The problem is that this arms race will tend to poison the social media well for anyone who wants to discuss the contested topics.

If your organization is interested in these topics, don't just read the blog. Call me. As long as this post is, it's the short version. Clients get the full story.

XKCD cartoon by Randall Munroe.

Can You Trust Social Media Sources?

Before you can pull insights from your data, you need data, but I'm hearing more concerns about data quality in social media analysis lately. Before, people asked about the traditional tradeoff in text queries: finding relevant content while excluding off-topic content. Lately, I'm hearing more about social data that's intentionally tainted. If you're looking for meaning in social media data, you may have to deal with adversaries.

Yes, and you've been playing without an opponent, which is, as you may have guessed, against the rules.
— "Anton Ego," Ratatouille

Ask a company with three initials as a name how many three-letter abbreviations are in use, and you get a sense of the challenge in finding relevant content. Common words as brand names pose a similar challenge (I always like the examples of Apple and Orange, because it's the one time you really can compare them). If people are honest and expressing their real opinions, it's hard enough to find what you're looking for.

The problem is, people aren't always honest. You also need to get rid of intentional noise in the data.

The analyst's adversaries

  • Spam
    We've all seen online spam (sorry, Hormel, you must hate that term). Junk mail for hormones and drugs in email, junk comments on blogs, junk blogs, trashy web sites—the costs are so low that even microscopic conversion rates are profitable, so it persists. Some of that shows up in social media, which is the problem here.

    At the recent Social Media Analytics Summit, Dana Jacob gave a talk on the spam that finds its way into the search results of social media analysis platforms, skewing the numbers. One tidbit that Dana shared to illustrate the challenge: If you consider all of the creative misspellings, there are 600 quintillion (6 × 10^20) ways to spell Viagra. So removing all of the spam from your data is a challenge.

    Spam seems to come in two flavors, neither of which will help you understand public opinion or online coverage. One is designed to fool people, to get them to click a link. It may lead to malware or fraud, or to some sort of product for sale. The other is designed to fool search engines with keywords and links embedded in usually irrelevant text. It's usually obvious to a human reader, but the hope seems to be that some search engines will count the links in their ranking of the target site.

  • Gaming analytics platforms
    Another presenter outlined a more direct challenge to the social media analyst when he described his methods for gaming analytics systems with content farms and SEO tactics. He talked about using weaknesses in analytics systems to plant information in them. One slide described his methods as "weaponizing information in a predictive system," which doesn't leave a lot of room for exaggeration.

    He even used a real client as an example. The question is, how many others do the same thing, but discreetly? If you're looking for market intelligence in social media, do you trust your sources?

  • Deception in crowdsourced data
    Another conversation went into the potential poisoning of the crowdsourcing well, in this case one of the crowdmapping efforts in a political conflict. If one party to the conflict entered false reports—perhaps to discredit the project or misdirect a potential response—could it be detected?

  • Sockpuppets
    Beyond the crowdmapping context, can you detect opposition personas that post false reports in social media? It's a standard tactic in the government/political arena, but it could hit you in business, too. All you need is a motivated opponent.

It's a little farther afield, but read Will Critchlow's post on online dirty tricks for more ideas on how our tools can (will) be used against us. If you work with political clients, you'll want to understand how they work. For everyone else, it's another lesson toward being an informed voter.
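One partial countermeasure to the misspelling flood Dana Jacob described is to normalize lookalike characters before matching keywords. A toy sketch; the substitution table is illustrative and nowhere near complete:

```python
import re

# Common lookalike substitutions spammers use to dodge keyword filters.
LOOKALIKES = str.maketrans(
    {"1": "i", "!": "i", "@": "a", "0": "o", "3": "e", "$": "s", "5": "s"}
)

def normalize(text):
    """Fold lookalike characters, drop separators, collapse repeated letters."""
    t = text.lower().translate(LOOKALIKES)
    t = re.sub(r"[^a-z]", "", t)        # strip dots, dashes, spaces
    return re.sub(r"(.)\1+", r"\1", t)  # collapse runs: viaagra -> viagra

def mentions(term, text):
    """True if the normalized term appears in the normalized text."""
    return normalize(term) in normalize(text)
```

This catches the lazy variants; the 600-quintillion figure is a reminder that enumeration loses, so folding the variants down to a canonical form is the only move that scales.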

Next: ideas for detecting deception
I don't mean to be all problem and no solution, but this post is already a long one. I'll share some ideas on how we might detect deception in social media in my next post. For now, I'll end with a happier observation: Sometimes, people lie in real life and get caught when they reveal the truth in social media.

Update: Part 2 is now up: Detecting Deception in Social Media

Photo by John Cooper.

About Nathan Gilliatt
