Recently in Intelligence Category

As ubiquitous surveillance increasingly becomes the norm in our society, what are the options for limiting its scope? What are the levers that we might pull? We have more choices than you might think, but their effectiveness depends on which surveillance we hope to limit.

One night last summer, I woke up with an idea that wouldn't leave me alone. I tried the old trick of writing it down so I could forget it, but more details kept coming, and after a couple of hours I had a whiteboard covered in notes for a book on surveillance in the private sector (this was pre-Snowden, and I wasn't interested in trying to research government intelligence activities). Maybe I'll even write it eventually.

The release of No Place to Hide, Glenn Greenwald's book on the Snowden story, provides the latest occasion to think about the challenges and complexity of privacy and freedom in a data-saturated world. I think the ongoing revelations have made clear that surveillance is about much more than closed-circuit cameras, stakeouts and hidden bugs. Data mining is a form of passive surveillance, working with data that has been created for other purposes.

Going wide to frame the question
As I was thinking about the many ways that we are watched, I wondered what mechanisms might be available to limit them. I wanted to be thorough, so I started with a framework to capture all of the possibilities. Here's what I came up with:

Constraints on personal data

The framework is meant to mimic a protocol stack, although the metaphor breaks down a bit in the higher layers. The lowest layers provide more robust protection, while the upper layers add nuance and acknowledge subtleties of different situations. Let's take a quick tour of the layers, starting at the bottom.

Hard constraints
The lowest layers represent hard constraints, which operate independently of judgment and decisions by surveillance operators:

  • Data existence
    If the data don't exist, they can't be used or abused. Cameras that are not installed, microphones that are not activated do not collect data. Unposted travel plans do not advertise absence; non-geotagged photos and posts are not used to track individual movements. At the individual level, countermeasures that prevent the generation of data exhaust will tend to defeat surveillance, as will the avoidance of known cameras and other active surveillance mechanisms.

  • Technical
    Data, once generated, can be protected, which is where much of the current discussion focuses. Operational security measures—strong passwords, access controls, malware prevention, and the like—provide the basics of protection. Encryption of stored data and communication links increases the difficulty—and cost—of surveillance, but this is an arms race. The effectiveness of technical barriers to surveillance depends substantially on who you're trying to keep out and the resources available to them.
Soft constraints
The upper layers represent soft constraints—those which depend on human judgment, decision-making and enforcement for their power. Each of these will tend to vary in its effectiveness by the people and organizations conducting surveillance activities.

  • Legal
    This is the second of two layers that contain most of the ongoing discussion and debate, and the default layer for those who can't join the technical discussion. The threat of enforcement may be a deterrent to some abuse. Different laws cover different actors and uses, as illustrated in the current indictment of Chinese agents for economic espionage.

  • Market
    In the private sector, there's no enforcement mechanism like market pressure—in this case, a negative reaction from disapproving customers. Companies have a strong motive to avoid activities that hurt sales and profits, and so they may be deterred from risking a perception of surveillance and data abuse. This is the layer least likely to be codified, but it has the most robust enforcement mechanism for business. In government, the equivalent constraint is political, as citizens/voters/donors/pressure groups respond to laws, policies and programs.

  • Policy
    At the organization level, policy can add limits beyond what is required by law and other obligations. Organization policy may in many cases be created in reaction to market pressure and prior hard lessons, extending the effectiveness of market pressure to limit abusive practices. In the public sector, the policy layer tends to handle the specifics of legal requirements and political pressures.

  • Ethical
    Professional and institutional ethics promise to constrain bad behavior, but the specific rules vary by industry and role, and enforcement is frequently uncertain. Still, efforts such as the Council for Big Data, Ethics, and Society are productive.

  • Personal
    Probably the weakest and certainly the least enforceable layer of all, personal values may prevent some abuse of surveillance techniques. Education and communication programs could reinforce people's sensitivity to personal privacy, but I include this layer primarily for completeness. Where surveillance operators are sensitive to personal privacy, abuses will tend not to be an issue.
Clearly, the upper layers of this framework lack some of the definitive protections of the lower layers, and they're unlikely to provide any protection from well-resourced government intelligence agencies (from multiple countries) and criminal enterprises. But surveillance (broadly construed) is also common in the private sector, where soft constraints are better than no constraints. As we consider the usefulness and desirability of the growing role of surveillance in society, we should consider all of the levers available.

One step at a time
This framework isn't meant to answer the big questions; it's about structuring an exploration of the tradeoffs we make between the utility and the costs of surveillance. Even there, this is only one of several dimensions worth considering. Surveillance happens in the private sector and government, both domestically and internationally. There's a meaningful distinction between data access and usage, and different value in different objectives. Take these dimensions and project them across the whole spectrum of active and passive techniques that we might call surveillance, and you see the scope of the topic.

Easy answers don't exist, or they're wrong. It's a complex and important topic. Maybe I should write that book.

If I write both the surveillance book and the Omniscience book (on the value that can be developed from available data), should I call them yin and yang?

Poisoning the Online Well

Garbage in, garbage out. The latest from the ongoing Snowden/Greenwald revelation is a reminder that interested parties know how to plant false information on the Internet, and that some of them are probably doing it. It has implications for anyone looking for good information online, anyone with a reputation to protect, and—potentially—for everyone invested in the online world.

The piece itself is worth a look (How Covert Agents Infiltrate the Internet to Manipulate, Deceive, and Destroy Reputations). The details are more disturbing than surprising, but as you read it, ignore the focus on the British intelligence agency GCHQ. It doesn't matter whether you trust your own government's actions, and the common distinction between a country's own citizens and everyone else is also irrelevant. The same tactics are available to every government—and any other motivated group. If they don't do this already, the newly released document provides the suggestion.

For the government intelligence guys, this is just a continuation of the second oldest profession: Get your enemy's secrets; protect your own. Deceive your enemy; avoid deception. It's a challenge when multiple entities are simultaneously trying to (a) get useful information from open sources online and (b) plant deceptive information in the same sources. I wonder how much blue-on-blue deception happens between information operations and open-source intelligence gathering, anyway.

For everyone else, this latest report should serve as a reminder of some of the risks in social media:

  1. Data quality risk
    People tell lies online—I know, but it's true. Some of the false information out there may have been placed by a motivated adversary who wants to mislead you (maybe even you, specifically). The target may be your organization, a related organization or someone who wants to work with you.

    The information you find online can be a useful source, but it's not the only source. If you're informing significant decisions, use all of your available resources, and be alert to the possibility of intentional deception.

  2. Reputation risk
    We're familiar with the concept of online reputation risk; corporate risk managers seem to think it's almost synonymous with "social media." If your business has potential exposure to government opposition (from whatever country), your risk may come from a better organized and funded source than the usual unhappy former customer.

  3. Target risk
    As people conduct their personal and political lives online, they expose themselves to snooping and more. The threats to personal privacy and freedom by government agencies have made the ongoing revelations newsworthy, but these public and semi-public channels are equally exposed to anyone who disagrees.

  4. Collateral damage risk
    Some of these information operations happen in the same online venues as normal personal use. As competing governments start viewing the online world through the cyber battlespace lens, normal users and the platforms themselves could take some damage. Off the top of my head, I'm thinking of legal, market, and technical risks, but that's probably just a start.

    It's too much to go into in a post, but companies with significant exposure to covert online tactics would be well served to chase down the implications of those tactics, and don't limit the discussion to legal exposure. Beyond the specifics of any one program, the revelations of the last year indicate the willingness of government entities in multiple countries to use environments operated by private-sector companies in ways they weren't intended to be used. The safe assumptions are that governments are doing more than we know, and so are other types of organizations.

Politically, it matters very much who is doing what to whom and why. As a practical matter, who and why don't much matter. It's enough to know that someone, somewhere is developing and using methods to use popular online tools against people and organizations they don't like. If you depend on online tools and don't have a basic literacy in the concept of cyberwar, it's time to learn, so you can recognize it if it comes to your neighborhood.

One of the great strengths of the Internet is the way it overcomes the limitations of distance. A side effect is that it also does away with the concept of a safe distance from danger.


Everyone loves a chart that answers a key question, but I particularly like the ones that make you think: Why did that happen? What changed? What are we missing? What happens next?

A spike on a chart is a big ol' why, waiting to be asked.
me, 2010

It's an old point, but a few examples came to me last week. Beyond the immediate interpretation of the numbers (e.g., big number good, small number bad), I think these patterns imply follow-up questions along the lines of "what happened here" and "why did it happen?"

  • Spike in a trend
    A sudden change means something happened. What? Why? Did the value then return to the usual range? Is the new value temporary or a new normal? Do you need to take some action as a result? The spike is the chart telling you where to look, which I suspect most people do instinctively.

  • Smooth line on a historically bumpy trend
    A bumpy trend line that grows more stable is telling you something else, but the follow-up questions are similar. Did the data source stop updating, or is the change real? Remember to watch the derivatives of your metrics, too. If the metric keeps changing but the rate becomes constant, is that real or an artifact of the data collection? What happened, why, what action in response…

  • Crossing lines
    A is now bigger than B; does it matter? Obviously, it depends on what A and B represent, but it's a good place to understand: what happened, why, what it means, how much it matters, and whether to expect it to continue. If it's a metric that people care about, expect to discuss it.
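As a rough illustration of the first and third patterns above, here's a minimal sketch of how a script might flag them automatically. The window size and z-score threshold are arbitrary choices for the example, not recommendations:

```python
def rolling_mean_std(xs):
    """Mean and (population) standard deviation of a window."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return mean, var ** 0.5

def find_spikes(series, window=5, z=3.0):
    """Indices where a point lands more than z standard deviations
    away from the trailing window -- the 'big ol' why' points."""
    hits = []
    for i in range(window, len(series)):
        mean, std = rolling_mean_std(series[i - window:i])
        if std > 0 and abs(series[i] - mean) > z * std:
            hits.append(i)
    return hits

def find_crossings(a, b):
    """Indices where series a overtakes series b, or vice versa."""
    return [i for i in range(1, len(a))
            if (a[i] - b[i]) * (a[i - 1] - b[i - 1]) < 0]
```

In practice a statistics library would do this better; the point is only that "where should I ask why?" is computable, which frees your eyes for the follow-up questions.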
Beyond the numbers
Thinking beyond the graphs, I remembered two things from conceptual diagrams that always make me curious:

  • Empty boxes in a matrix
    If the framework makes sense, its boxes should be filled in, whether it's the consultant's standard two-by-two matrix or something much larger. An empty box may represent an impossible combination—but it could be a missed challenge or opportunity. I once found $12 million in sales in an empty box, and so empty boxes always get my attention.

  • Solid lines around a space
    A clear definition says as much about what something isn't as what it is. When the definition takes the form of a diagram—an org chart, a Venn diagram, a network graph—I wonder about what's just outside the diagram. The adjacent markets and competitors from the future; the people who are near—but not in—an organization. What does the white space represent, and what does that mean to you?
These came to me as I was getting ready to attend a lecture by Kaiser Fung (which was excellent—ask him about the properties of big data). I'm sure there are many more. Without wading into technical analysis waters, what other patterns make you stop and think?

Writing at Wired UK, Paul Wright has some concerns about the use of social media monitoring in law enforcement: Meet Prism's little brother: Socmint. I'll quote a couple of sections, but you need to read the whole piece; its tone is at least as important as its content.

For the past two years a secretive unit in the Metropolitan Police has been developing the tools for blanket surveillance of the public's social media conversations,

Operating 24 hours a day, seven days a week, a staff of 17 officers in the National Domestic Extremism Unit (NDEU) has been scanning the public's tweets, YouTube videos, Facebook profiles, and anything else UK citizens post in the public online sphere.

The intelligence gathering technique—sometimes known as Social Media Intelligence or Socmint—has been used in conjunction with an alarming array of sophisticated analytical tools. [emphasis added]

Wright has a fairly alarmist—but accurate—take on something that's obvious to anyone who thinks about it: outside of a few protected spaces, what we do in social media is public, and government security and law enforcement agencies are using that data. It's the details of what they do with it that will make some people uncomfortable.

The problem is that public is gaining new depth of meaning as information moves online, and we haven't sorted the implications.

Nothing changes, but everything's changed
The new public information is persistent, searchable, and rich with analytic potential. I wrote about this last year (Why Government Monitoring Is Creepy), and it's still where I think we need to start. People seem to be expecting a sort of semi-privacy online, but the technology doesn't have that distinction. Data is either public or private, and the private space is shrinking.

The "alarming array" of tools refers to all the interesting stuff we've been talking about doing with social media data for years: text analytics, social network analysis, geospatial analysis… For business applications, we've mostly talked about analysis of aggregate data, but if you turn the lens toward profiling individuals and don't care about being intrusive, you can start to justify the concerns.

But several privacy groups and think tanks—including Big Brother Watch, Demos and Privacy International—have voiced concerns that the Met's use of Socmint lacks the proper legislative oversight to prevent abuses occurring.

It's worth noting that Wright's piece is specifically about law enforcement use of social media data, and he points to others who are concerned about overreach by law enforcement agencies. Here are the organizations mentioned, along with links to some of their relevant work:

This is the social data industry's PRISM problem: the risk that the revelations of intelligence agency practices will raise broader privacy concerns that include the business use of public social media data. They're different issues, but the interest sparked by the NSA disclosures has people thinking about privacy.

In this case, Wired makes the connection explicit with their headline, calling social media intelligence "Prism's little brother." As Wright demonstrates in his article, open-source social media monitoring raises issues, too.

Legitimate questions, too
There's more going on here than a question of perception. If invasion of online privacy gains traction as an issue, the important distinction between public and private data is only part of the issue. If we limit the topic to public data, the question becomes, what are the limits to the use of public data?

An important part of answering that question will depend on understanding why there should be limits, which goes to what is being done with the data. It's going to be worth separating the concepts of accessing the data and using it. What you do in your analysis may be even more sensitive than the data you base it on.

People are sharing more than they realize, and analysts can do more with that data than people think. As monitoring becomes pattern detection becomes predictive modeling, it becomes more likely to make people uncomfortable. Last year's pregnant daughter is this year's precrime is next year's thoughtcrime, or so the thinking goes.

Will concerns like this lead to new restrictions by governments or the companies who control the data? Will people cut back on their public sharing? Or will these concerns fade when the next topic takes the stage (squirrel!)?

What are the constraints?
The existing limits on social media monitoring and analysis boil down to this: If it is technically possible, not illegal, and potentially useful, do it (depending on your affiliations, professional ethical standards may also apply). What we're seeing is that the unrestricted use of social data has the potential to make people uncomfortable, which could have consequences for those who would use the data.

It's worth thinking about the constraints on using social data, which involves more than the ethics question. I have some thoughts, which I'll share later.

Asking a computer to make sense of everyone's written opinions is a big challenge, but it's not the last one that social media will impose on anyone who wants to analyze it. We're sharing a lot of pictures in our virtual hangouts lately, which means it's time to update the old question. Instead of "what are people saying about us," the new question is something like, "what do people's pictures tell us about what they think of us and how they use our products?"

Just as the shared images give us access to new types of information about people, their tastes, and more, emerging technologies offer the promise of helping us understand the images at scale. To the vocabulary of text analytics or natural language processing, add computer vision. As with its text-processing cousin, it's not as evolved as your eyes, but it doesn't blink, and it doesn't sleep.

Looking at the photo directly
Let's say you want to track publicly shared photos that contain your company's logo. Without image analysis, monitoring depends on keywords in posts and photo descriptions, filenames, tags, and other metadata. It's better than nothing, but it has limitations. You're going to pick up images that don't actually include your logo, and you'll miss photos that do include it but whose text never mentions it.
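To make that limitation concrete, here's a minimal sketch of metadata-only matching; the field names and the sample posts in the test are hypothetical:

```python
def metadata_match(post, brand):
    """Flag a photo post if the brand appears in any text metadata.
    This is all you can do without image analysis, so an untagged
    photo of the logo is missed, and a text-only mention of the
    brand is a false positive."""
    fields = (post.get("caption", ""),
              post.get("filename", ""),
              " ".join(post.get("tags", [])))
    return any(brand.lower() in f.lower() for f in fields)
```

A photo of your logo captioned "great day at the beach" sails right past a filter like this, while a post that merely mentions the brand name gets caught.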

If your tool can "see" product logos in photographs, you get access to a different type of information. You start to catch products and logos in the wild, where people really use them. The brand protection guys will like enhanced abilities to track counterfeits and parodies, but maybe this opens the door to a new kind of online ethnography, too.

Finding the technology
As demand picks up, you can expect the serious competitors in social media analysis to add image search capabilities. Already, Ninestars has added image recognition from a partner, and Meltwater's OculusAI acquisition suggests future capabilities with images. They won't be the last.

These companies are going at the image recognition challenge directly:

What's next?
Computer vision has lots of potential beyond spotting logos in photos. I imagine that this sort of product/logo identification will extend to video, though I'll need to talk to an expert to understand when to expect that.

And then there are people. We already have identity tagging in Facebook, and big money is going toward advancing facial recognition. I also found Real Eyes, a company that analyzes emotional responses from video, so visual analysis of faces isn't limited to identifying their owners.

The computers aren't just reading. They're starting to watch, too. Can you do something good with that?

This is one of those list posts that will grow as people point out more companies. Who'd I miss?

I sometimes summarize the opportunity of social media analysis as using computers to "read the Internet." It's not an original idea, but it is one we still haven't mastered. I've seen many tools that find relevant content and apply some level of automated analysis, but we're not about to replace the analyst. One simple question I've started to think about is, "then what happens?"

The SocialSpook 9000 reads millions of blog posts, Facebook updates, and tweets every second. It finds every relevant mention in your space, extracting the facts, opinions, and needs that you're looking for. Its sentiment analysis engine provides 120% accuracy in 38 languages, and its graphics are so well designed that whole new awards contests have been created for it to win.

In 2007, I pointed out the need to link social media monitoring to customer service, because most of the problems that people were seeing as PR problems started with unhappy customers. Since 2010, I've been thinking about another application: blending social media data with other publicly available sources to create an automated view of what's happening in the world. It turns out to be a big challenge.

My own private news channel… or command center?
We can take this in several directions. At the low end, applications such as Flipboard generate personalized media based on activity in the user's social media accounts and selected topics or sources. In the middle, we might have a more dynamic version of the social media dashboard running in the conference room or reception area. It's the web-powered news channel that always shows something you might care about.

At the high end, we're looking at a valuable—but noisy and sometimes misleading—source of crowdsourced information about events in near-real time. The obvious applications are in government: national security, law enforcement, emergency management, and disaster response agencies are looking for fast and accurate information from social media sources. I see value in corporate applications, too, for functions like security, risk management, logistics, and business continuity that need information when things happen. Preferably without hiring an army of analysts to look at dashboards on the quiet days.

Now what happens?
The challenge in using social media for real-time awareness is that the volume level becomes overwhelming just when the information becomes most valuable. Forget looking for the needle in a haystack; this is the needle in the needlestack. Faster than you can read them, more messages arrive, and they're all relevant.

Existing tools generally emphasize either handling messages individually (think customer service or community engagement) or analyzing them in aggregate (think sentiment and leading topics). For this application, we want the system to help analysts deal with the volume without losing the detail, and that's where I started asking about what happens next.

For all the systems that can notice something happening and put it on a screen, I wanted a system that can notice and pay attention. So what would that look like?

Here's an idea:

Computer Attention in Situational Awareness Applications

The inputs to this system can go beyond social media content; depending on your application, it might pick up data about natural disasters, weather, or market data. It might incorporate traditional news media, commercial intelligence services, or internal data. Its models will reflect the needs of its users, so a system that looks for, say, transportation-related incidents could be quite different from one looking for damage reports in weather emergencies.

This has a lot of moving parts, and it builds on what others have already built. The central idea is to go beyond the dashboard and think about how the system can relieve analysts of some of the burden of reading the alert queue. Step one is to consider what an analyst does with that information and how a computer could mimic that.
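As one hedged sketch of "notice and pay attention": instead of queuing every matching message for an analyst, the system could group incoming messages by watched topic and escalate a bucket only when its volume crosses a threshold, producing one alert rather than N queue items. The keyword matching and threshold here are crude stand-ins for whatever models a real system would use:

```python
from collections import Counter

def triage(messages, keywords, alert_threshold=3):
    """Bucket messages by the first watched keyword they contain,
    then escalate only the buckets that cross the threshold --
    one alert per topic instead of one per message."""
    buckets = Counter()
    for text in messages:
        for kw in keywords:
            if kw in text.lower():
                buckets[kw] += 1
                break  # count each message once
    return {kw: n for kw, n in buckets.items() if n >= alert_threshold}
```

The analyst still sees the underlying messages on demand; the system's "attention" is just the decision about what deserves theirs.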

I'm sharing some of the frameworks that have been hiding on my whiteboard. Want the long version? Email me.

It's bad enough when people are wrong as they express facts and opinions on the Internet. Mistakes happen. But there's more going on. Some people are intentionally adding noise to the online world, in an attempt to mislead users and analysts. Garbage in, garbage out, so how do we catch the garbage before it becomes part of the analysis?

This post is the second in a series. The first is Can You Trust Social Media Sources? Most of my posts aren't this long; the next will be nice and short.

Catching and deleting spam and other garbage in social media data is one side of an arms race, just like email spam and computer viruses. Developers of social media analysis platforms work to eliminate spam from their results, and spammers develop new tactics to dodge the filters. As long as the incentives remain, people will find ways to game the system.

For most analysts, the main response is to pick a platform that does a decent job of catching the undesirable content. Most do some sort of machine learning to identify and filter spam, and while the results are imperfect, they're useful as a first step. The second step is to allow users to flag content as spam, and it's good if the system learns from that action. A third step is to allow users to blacklist a site altogether; once you know it's not what you're looking for, there's no need to rely on the spam-scoring engine.
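The three steps above might be wired together roughly like this. The scoring model is a stand-in (any classifier would slot in), and the class and field names are invented for illustration:

```python
class SpamFilter:
    def __init__(self, score_fn, threshold=0.8):
        self.score_fn = score_fn   # step 1: model, post text -> spam probability
        self.threshold = threshold
        self.flagged = []          # step 2: user feedback, kept for retraining
        self.blacklist = set()     # step 3: sites you never want to see

    def is_spam(self, post):
        if post["site"] in self.blacklist:
            return True            # blacklisted sites skip the model entirely
        return self.score_fn(post["text"]) >= self.threshold

    def flag(self, post):
        """User marks a post as spam; a real system would feed this
        back into the scoring model."""
        self.flagged.append(post)

    def blacklist_site(self, site):
        self.blacklist.add(site)
```

The ordering matters: the blacklist is cheap and certain, so it runs before the probabilistic model.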

Evaluating questionable data
This is where I'd love to give you the magic button that reveals deceptive content. I'd like to have the Liar Liar power, too, but that's not going to happen. Instead, I have some ideas of how to think about questionable results. Most of them are in the form of questions. Some are more probabilistic than definitive, but I think they could be helpful.

  • Consider your purpose
    Your sensitivity to garbage in your data depends on what you're doing with it. If you're monitoring for customer service purposes, flag the spam and move on. If you're reporting on broad trends, you might get better results through sampling, or by focusing on high-quality sources. If you're looking for weak signals, you may not have the luxury of ignoring the low signal-to-noise ratio of a wide search. As always, match the effort to the objective.

    Some people actually need to look at spam—consider the legal department. If a link leads to a site selling counterfeit merchandise and you're in a trademark protection role, the spam is what you're looking for.

  • Consider the source (person)
    Who posted the item in question, and what do you know about them? Is the poster a known person? What do you know from the individual profile? Who does the person work for? What groups is the person connected to? Does the person typically discuss the current topic? Is the person's location consistent with the information shared?

    If you're not sure whether the poster is a person or a persona, develop a profile. A persona is like a cover identity; it can be strong or weak. Does the persona have a presence on multiple networks? Since when? Is it consistent across networks? Does it have depth, or is every post on the same topic? Who does the persona associate with online, and what do you know about them? Do the persona's connections reveal the complexity of relationship types that real people develop (school, work, family, etc.)? Do the profiles and connections give information about background that can be checked?

    For questionable sources, think about the different types of data that might reveal something through social network analysis.

    Back at the Social Media Analytics Summit, Tom Reamy described work by researchers to identify the political leanings of writers, based on their language choices (writing about non-political topics). Can we use text analytics to add information about native language, regional differences, and subject-matter expertise to individual profiles?

  • Consider the source (site)
    Where was the data posted? What do you know about the site? Is it a known or probable pay-to-play or disinformation site? Is it a content-scraping site? Does it have information from a single contributor (such as a blog) or from many (such as a crowdsourcing site)? What else is posted to the site? Where is it hosted? Who owns it? Where are they based? What can you learn from the domain registration?

    What's the online footprint of the site? Is it linked to real people in social networks? Is it used as a source by other people? Credibility flows through networks; do known, credible (not necessarily influential) people link to it and share its content in their networks? Does it appear to have bought its followers, or are they real people?

  • Consider other sources
    If you're going to do something serious—and I'll leave the definition of serious as an exercise for the reader—don't trap yourself in a new silo for social media data. What else do you know? What do other online sources say? Does the questionable data fit with what you're getting from sources outside of social media? Are you getting similar information from credible sources, or are all of the sources for the questionable data unknown?

    A few months ago, I heard Craig Fugate, the Administrator of the (US) Federal Emergency Management Agency (FEMA), tell a story about government agencies and unofficial sources of information. The story involved a suspected tornado and unconfirmed damage reports in social media. Government agencies prefer official reports from first responders and other trained observers, so the question was how to evaluate reports in social media.

    In the case of severe weather, one answer is to compare the reports with official sources of weather data. If radar indicated a likely tornado passing over a location a few minutes before the damage reports, then you'd know something important that should help evaluate those reports. What's the analogy for your task? Is there a hard-data source that can add relevant information? Does a geospatial view add a useful dimension (as radar, post location, and photo metadata in the same view would, in the example)?

  • Consider the incentives
    What does a potential adversary stand to gain by fooling you—or someone else looking at the same data—with false information? Who gains by leading you to an incorrect action? Who makes money on your decision? Who benefits from misleading other people with false information (think product reviews and propaganda)? Is questionable information in your system consistent with the aims of an interested party?

    Part of the challenge here is that false information could be intended to mislead anyone. The target could be an individual, a small group, or entire populations. Who gains? Is there a link from the source to an interested party?

  • Consider the costs
    Part of what makes spam so frustrating is the volume level—there's a lot of the stuff around. At some point, the signal-to-noise ratio gets so low that the source becomes useless, unless you can identify and eliminate the junk. In a way, all that junk adds up to a sort of denial-of-service attack at the content layer. Is there a way to deal with that?

    A denial-of-service (DoS) attack and its scaled-up variant, the distributed denial-of-service (DDoS) attack, overload the targeted web site with simultaneous requests, causing it to become unavailable to real visitors. In 2010, Amazon weathered a DDoS attack without losing service. The explanation was that their normal operation looks a lot like a DDoS attack—lots of people visiting the site simultaneously. Their system was built to handle that kind of load, so the attack failed. One answer to a DDoS attack, then, is to have the capacity to handle the load.

    The social media analysis equivalent is to process it all, so what would that look like? Would a deeper analysis of known junk and its sources help improve the identification of junk? Would it tell you something useful about the parties that post the junk?

  • Consider the consequences
    The final point is to revisit the first point. What are you trying to accomplish? What decision will you make based on the data, and what happens if the information turns out to be false? What if it was placed there to manipulate your response (even if the information itself is true)? Does the rest of the decision-making process have the safeguards to prevent costly errors?
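The severe-weather cross-check described above can be sketched in code. This is a hypothetical illustration, not FEMA's or any agency's actual system: it assumes you already have a radar-indicated storm track as timestamped points and a social media report with coordinates and a timestamp, and it simply asks whether the report sits close to the track, shortly after the storm passed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from math import radians, sin, cos, asin, sqrt

@dataclass
class TrackPoint:
    lat: float
    lon: float
    time: datetime

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def corroborated(report_lat, report_lon, report_time, track,
                 max_km=10.0, max_delay=timedelta(minutes=30)):
    """True if some radar track point passed near the report location
    shortly before the report was posted."""
    for p in track:
        close = haversine_km(report_lat, report_lon, p.lat, p.lon) <= max_km
        just_after = timedelta(0) <= report_time - p.time <= max_delay
        if close and just_after:
            return True
    return False
```

A report that fails the check isn't necessarily false; it just doesn't get the extra weight that agreement with hard data would give it.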
The hard problem
One way to look at this is to go through the whole process while thinking "spam." Junk results are an annoyance if you're doing day-to-day monitoring for business, and they're a problem if you're doing quantitative analysis. The technology is improving, and you have options for dealing with spam in these settings.

Some junk isn't that hard to catch, especially once a person looks at it. Gibberish blog comments are easy to identify. Names and email addresses that don't match are fairly obvious, too. Content scrapers and other low-quality sites tend to have a certain look. If you have time to look at the spam that evades your filters, you can catch a lot of it.
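Those quick human judgments can be roughed out as filters. The heuristics below are toy examples of my own, not anyone's production spam filter: a gibberish score based on how many tokens look like non-words, and a check for a display name that shares nothing with the email address.

```python
import re

def gibberish_score(text, min_word_len=2):
    """Fraction of tokens that don't look like words: vowel-free strings,
    digit/letter mixes, long consonant runs. Higher means more gibberish."""
    tokens = re.findall(r"[A-Za-z0-9]+", text.lower())
    if not tokens:
        return 1.0
    def odd(tok):
        return (
            len(tok) < min_word_len
            or not re.search(r"[aeiouy]", tok)                 # no vowels at all
            or re.search(r"[a-z]\d|\d[a-z]", tok) is not None  # letters mixed with digits
            or re.search(r"[bcdfghjklmnpqrstvwxz]{5,}", tok) is not None  # consonant run
        )
    return sum(odd(t) for t in tokens) / len(tokens)

def name_email_mismatch(name, email):
    """True if no part of the display name appears in the email's local part."""
    local = email.split("@")[0].lower()
    parts = [p for p in re.split(r"\W+", name.lower()) if p]
    return not any(p in local for p in parts)
```

Crude rules like these won't survive a motivated adversary, which is exactly the point of the next paragraphs.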

The real challenge comes in looking for intelligence—whether in business, finance, politics, or government—in the presence of a motivated and well-funded adversary. If someone wants to fool you—or at least keep you from using an online source—they can improve their chances by better imitating the good data surrounding their junk. The quick glance to identify spam becomes a bigger effort, with more uncertainty.

Pay-to-play blogs may have original content from professional writers, so you can't just look for poor quality. False personas may be developed over time, with extensive material to create a convincing backstory. Networks of such personas could post disinformation, along with more normal-looking content, across multiple sites. With time and resources, personas can appear solid, which is why governments are investing in them.

I think some of the techniques above could help, but it's really a new arms race. The problem is that this arms race will tend to poison the social media well for everyone who wants to discuss the contested topics.

If your organization is interested in these topics, don't just read the blog. Call me. As long as this post is, it's the short version. Clients get the full story.

XKCD cartoon by Randall Munroe.

FutbolBefore you can pull insights from your data, you need data, but I'm hearing more concerns about data quality in social media analysis lately. Before, people asked about the traditional tradeoff in text queries: finding relevant content while excluding off-topic content. Lately, I'm hearing more about social data that's intentionally tainted. If you're looking for meaning in social media data, you may have to deal with adversaries.

Yes, and you've been playing without an opponent, which is, as you may have guessed, against the rules.
— "Anton Ego," Ratatouille

Ask a company whose name is three initials how many three-letter abbreviations are in use, and you get a sense of the challenge of finding relevant content. Common words as brand names pose a similar challenge (I always like the examples of Apple and Orange, because it's the one time you really can compare them). Even when people are honest and expressing their real opinions, it's hard enough to find what you're looking for.
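One common workaround, sketched here rather than taken from any platform's actual query logic, is to require context terms alongside an ambiguous brand name and to exclude terms from the wrong sense. The term lists are invented for illustration.

```python
def is_brand_mention(text, brand, context_terms, exclude_terms):
    """Keep a post only if the ambiguous brand name appears with at least
    one supporting context term and no disqualifying term."""
    t = text.lower()
    if brand.lower() not in t:
        return False
    if any(term in t for term in exclude_terms):
        return False
    return any(term in t for term in context_terms)

# Invented term lists for the Apple example
CONTEXT = ["iphone", "ipad", "mac", "app store", "ios"]
EXCLUDE = ["pie", "orchard", "cider", "fruit"]
```

The tradeoff is the classic one: the context list buys precision at the cost of recall, because genuine brand mentions without a context term get dropped.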

The problem is, people aren't always honest. You also need to get rid of intentional noise in the data.

The analyst's adversaries

  • Spam
    We've all seen online spam (sorry, Hormel, you must hate that term). Junk mail for hormones and drugs in email, junk comments on blogs, junk blogs, trashy web sites—the costs are so low that even microscopic conversion rates are profitable, so it persists. Some of that shows up in social media, which is the problem here.

    At the recent Social Media Analytics Summit, Dana Jacob gave a talk on the spam that finds its way into the search results of social media analysis platforms, skewing the numbers. One tidbit that Dana shared to illustrate the challenge: if you consider all of the creative misspellings, there are 600 quintillion (6 × 10²⁰) ways to spell Viagra. So removing all of the spam from your data is a challenge.

    Spam seems to come in two flavors, neither of which will help you understand public opinion or online coverage. One is designed to fool people, to get them to click a link. It may lead to malware or fraud, or to some sort of product for sale. The other is designed to fool search engines with keywords and links embedded in usually irrelevant text. It's usually obvious to a human reader, but the hope seems to be that some search engines will count the links in their ranking of the target site.

  • Gaming analytics platforms
    Another presenter outlined a more direct challenge to the social media analyst when he described his system to game analytics systems with content farms and SEO tactics. He talked about using weaknesses in analytics systems to plant information in them. One slide described his methods as "weaponizing information in a predictive system," which doesn't leave a lot of room for exaggeration.

    He even used a real client as an example. The question is, how many others do the same thing, but discreetly? If you're looking for market intelligence in social media, do you trust your sources?

  • Deception in crowdsourced data
    Another conversation went into the potential poisoning of the crowdsourcing well, in this case one of the crowdmapping efforts in a political conflict. If one party to the conflict entered false reports—perhaps to discredit the project or misdirect a potential response—could it be detected?

  • Sockpuppets
    Beyond the crowdmapping context, can you detect opposition personas that post false reports in social media? It's a standard tactic in the government/political arena, but it could hit you in business, too. All you need is a motivated opponent.
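To see how a misspelling count like Dana's gets so large, multiply the options at each position. The substitution and separator sets below are small, invented examples; with richer substitutions, insertions, and spacing tricks, the count climbs toward figures like the quoted 6 × 10²⁰.

```python
from math import prod

def count_spellings(letter_options, separator_options):
    """Number of variant spellings: choose one glyph per letter position,
    and one separator (possibly empty) for each gap between letters."""
    gaps = len(letter_options) - 1
    return prod(len(opts) for opts in letter_options) * len(separator_options) ** gaps

# One option list per letter of "viagra" (invented, far from exhaustive)
VIAGRA = [
    ["v", "\\/"],          # v
    ["i", "1", "!"],       # i
    ["a", "@", "4"],       # a
    ["g", "9"],            # g
    ["r"],                 # r
    ["a", "@", "4"],       # a
]
SEPARATORS = ["", ".", " ", "-"]  # what may appear between letters
```

Even these tiny option lists yield 2·3·3·2·1·3 = 108 letter combinations times 4⁵ = 1,024 separator patterns, or 110,592 spellings, which is why a blocklist of literal strings can't keep up.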
It's a little farther afield, but read Will Critchlow's post on online dirty tricks for more ideas on how our tools can (will) be used against us. If you work with political clients, you'll want to understand how these tricks work. For everyone else, it's another lesson toward being an informed voter.

Next: ideas for detecting deception
I don't mean to be all problem and no solution, but this post is already a long one. I'll share some ideas on how we might detect deception in social media in my next post. For now, I'll end with a happier observation: Sometimes, people lie in real life and get caught when they reveal the truth in social media.

Update: Part 2 is now up: Detecting Deception in Social Media

Photo by John Cooper.

Twitter metadataDo you put social media data on a map? Location is a handy dimension for slicing, dicing, and visualizing your data. The question is, which location are you visualizing? Even a single tweet—in under 140 characters—can have four different locations.

I've taken a real interest in applying geospatial analysis to social media over the past year. It's been especially appropriate in emergency management and some other discussions with government types. Mostly, though, it's just another lens to apply to social media data, another way to find some value in the data we have now.

So, you want to put social media activity on a map. It's worth thinking about what that location really represents. One little statement can have four distinct locations, depending on how you look at it:

  1. Location of the service/server
    Internet-based communications happen in this virtual space where physical location is largely irrelevant, but everything runs on a computer somewhere—even in the cloud.

    You could even separate this one into two (or more) locations—the locations of the server and of the company that owns it—but for most of us, these are the least relevant locations. A few specialists need to know the physical or logical location of a server, but for the rest of us, there's nothing to see here.

  2. Location of the account
    Look at an account on Twitter, Facebook, or other social network. Most of them have a place for users to provide their location. Its accuracy depends on the account owner, which is why you see so many Twitter accounts located in "Earth" or something similarly uninformative. During the pro-democracy protests in Iran, a lot of people set their Twitter locations to Tehran in sympathy with the protesters.

    At its most useful, the location associated with an account tells you a default location for a user—home base.

  3. Location of the post
    Social and mobile are increasingly two aspects of the same technology-adoption trend, as more people take their social media through mobile devices. With geolocation tagging and location-based services, they're sharing their immediate location: "I am here, now." This is the location you're most likely to see represented on a map.

  4. Location of the described event
    This last location won't be encoded in an API, because it's found in the content people share. When they talk about events in the real world, they mention places, possibly indirectly. You'll need a text analytics tool that recognizes locations to extract those. When they post pictures, the photos may include location metadata from the camera.
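The four locations map onto different parts of a tweet. The sketch below uses field names in the style of the classic Twitter REST API status object (user.location, coordinates, place), with a toy gazetteer standing in for a real text analytics tool; treat the field paths as assumptions, since payload formats vary by API version.

```python
# Toy gazetteer standing in for a real location-extraction tool
GAZETTEER = {"cairo", "egypt", "washington", "raleigh"}

def tweet_locations(tweet):
    """Pull the three per-tweet locations (account, post, described event)
    out of a status-object-style dict; the server location isn't in the payload."""
    coords = tweet.get("coordinates") or {}
    place = tweet.get("place") or {}
    text = tweet.get("text", "")
    mentioned = [w for w in GAZETTEER if w in text.lower()]
    return {
        "account": (tweet.get("user") or {}).get("location"),  # 2: self-reported
        "post_point": coords.get("coordinates"),               # 3: lon/lat pair
        "post_place": place.get("full_name"),                  # 3: coarser place
        "mentioned": sorted(mentioned),                        # 4: event location
    }
```

When the account location, the geotag, and the place mentioned in the text all agree, that agreement is itself a useful credibility signal.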
Let's put them all together with a couple of hypothetical examples. We'll ignore the location of the server, because it's not relevant for most uses.

  • Let's say that I tweet about an event in Egypt (4) during a break at a conference in Washington (3). My account location (2) is in North Carolina. How does that compare with a geotagged photo (4) of the same event sent from Cairo (3) by an account that says it's located in Cairo (2)?

  • It's another stormy day in the middle of America, and someone posts a picture of a damaged building (4) on Facebook. The account location (2) and post location (3) are nearly the same, and they're in the projected path of a tornado, based on National Weather Service radar data. Do you believe that a tornado hit the building?
Despite all of that muddying of the water, you're probably fine using per-post geolocation data for most purposes. When in doubt, state your question clearly, and then you can pick the right data to answer it.

Illustration: Map of a Twitter status object by Raffi Krikorian.

Why Government Monitoring Is Creepy

Eavesdrop phoneQuiz: A government agency wants to monitor social media in the course of performing its function. Is that an obvious use of public information, or further evidence of a dark conspiracy? Oh, good, I see lots of hands for both answers. Let's look at what's really going on here.

You have zero privacy anyway. Get over it.
—Scott McNealy (1999)
When people hear about social media monitoring by a government agency—such as the recent news of FBI, DHS, and CIA programs—the usual response is outrage about the perceived violation of privacy. People are living their lives online, and they don't want the government listening in.

Superficially, that's completely understandable. Most of us don't want people eavesdropping on us, even if we aren't hiding anything and don't harbor conspiracy theories. We just like our conversations to be kept within the group we think we're talking to. The usual response makes intuitive sense, even if we realize that these online conversations are, technically, public.

(By the way, I'm assuming that we're talking about governments in free, democratic countries here. Events over the last few years have clearly demonstrated the danger to people sharing information and opinions in countries with repressive regimes during times of instability. Sometimes, it's easy to decide whether the government is using or abusing people's information.)

Expectations of privacy
Where do we get this expectation of privacy in public places? Everybody knows that Twitter is public (unless you make your updates private), Facebook has public updates, YouTube is for the world, many forums are public, and blogs are a form of publishing, right?

How can we expect privacy in a public place?

Read that last sentence again, and I think we'll start to see what happened. We're not really talking about a public place—it's not a place at all. All of this Internet-based communication happens in a virtual space, which is shared by everyone. Virtual means almost, which also means not. A virtual space is not a real space; it's an artificial environment that is different from the real world in important ways.

The nature of public is one of those ways.

Public doesn't mean what it used to mean
Imagine having a conversation with a friend in a public place—a city street, maybe, next to a bus stop, or a sports stadium during a game. These are public places. We may have norms against eavesdropping, but someone standing close to you might hear your conversation. So your expectation of privacy is reduced, compared to when you have a conversation in a home or office.

The physical world imposes limits on the potential audience for conversations. Sound drops off over distance, and quickly. Other sounds in the environment block out the conversation, too. If you're talking while a bus leaves the stop or a big play happens on the field, even the person you're talking to might have trouble hearing you. A few feet away, you're inaudible. Across the street or stadium, you may as well not exist.

The Internet is different. A whisper on the other side of the world is as clear as a shout in a quiet room. A million people can talk at the same time, and we can pick out individual conversations—all of them. Say something today, and it's still there tomorrow. Time, distance and the crowd—none of them recreate the semi-privacy we experience in physical settings.

The conversation at the bus stop and the isolated tweet are both public, and yet they're entirely different. The differences come back to the difference between the Internet and the physical world. People react to the perceived violations of privacy because they learned their ideas of public and private in the physical world, and the different physics of information in the virtual world break their mental models.

A clear dichotomy
The virtual world also breaks the in-between states of semi-private and semi-public. There's no semi online. Private is uncertain, too.

Three can keep a secret, if two of them are dead.
—Benjamin Franklin
Some online venues make the attempt to be private, but it's enforced with terms of service and technical measures that can be defeated. Any notion of privacy in online communications has an element of trust, which may be backed up by contracts or law. But it's not private in the same way as a conversation in a closed room.

Public discussions, on the other hand, are really public, in a globally ubiquitous way that the physical world can't match. Those open Twitter accounts and blog posts, the groups and forums that anyone can read. Comments on newspaper sites and book reviews. Videos and pictures uploaded all over the place. Anyone can see them—milliseconds or months later.

This isn't the first time
We've run into this qualitative change in the nature of public information before. Think about public records that the government keeps, such as on property transactions. These records have always been public, but pre-Internet, realities of the physical world created barriers to access.

If you wanted to look at property records, you had to go to the clerk in the appropriate local government office. You'd probably wait in line, and when it was your turn, you made your request. If you asked for something the clerk could find, you could look at the file, and you might pay ten cents a page to get a copy.

Where's the record today? It's on the web, with a database query engine that lets you look up properties by owner or address, with wild cards in your queries. If you don't find what you want, you look again—as many times as you like. When you find something interesting, you have all the information, which you can save or print as much as you like.

On other web sites, that same public record is aggregated with many others, mashed up in a map that shows house prices everywhere. Zoom out, get the big picture. Zoom in, find out what your neighbor paid for that house. It's the same public record, but putting it on a computer and making it available on the web completely changes what it means to be public.

The world changes faster than we adapt
We're so used to the constant rush of innovations and what we can do with them. We're not so good at thinking about the implications and adjusting our mental models. People start sharing their lives in these public channels without thinking about what happens to the information. Remember the first stories of job applicants who shared the wrong pictures on Facebook?

Now, government agencies are opening up about their interest in what people have to say online, and we have this wounded sense of privacy based on expectations from the physical world. All that data is public, in the expanded sense of online public information. Did people think that officials wouldn't find it useful?

The value to government is obvious, but we need a reasoned discussion on the appropriate tradeoffs between government use and individual protection. All of which is far too much for an already long-winded blog post.


Photo by Jeff Schuler.

About Nathan Gilliatt

  • Voracious learner and explorer. Analyst tracking technologies and markets in intelligence, analytics and social media. Advisor to buyers, sellers and investors. Writing my next book.
  • Principal, Social Target
  • Profile
  • Highlights from the archive
