The Science of an Upset

By Kathryn Schiller Wurster

Donald Trump won the presidency last night, taking the Electoral College despite what appears to be a narrow Clinton win in the popular vote. The results surprised nearly everyone in the media and polling world, almost all of whom had predicted a comfortable victory for Hillary Clinton. Even Nate Silver’s blog FiveThirtyEight, which has earned a reputation for crunching numbers in exquisite fashion, gave Clinton much better odds throughout most of the race, closing at 70/30 Clinton to Trump the day before the election.

But all the number crunchers depend on polls and statistical methods that aren’t reliable and now seem remarkably old-fashioned. A Nature article examined this problem in mid-October and blamed the decline of landlines, the rise of mobile phones, “shy voter” behavior, and unreliable online polls. At one time, calling people on the phone and asking them questions may have been the best way to find out their opinions and predict their likely behavior. But this election has just proved that it doesn’t always work. The UK saw a similar upset against the pollsters’ predictions in the Brexit vote.

The problem is that what people say on the phone is driven by many other factors, especially when the candidates and poll questions are controversial. Conducting phone surveys today also relies on an increasingly outdated mode of social interaction, which likely biases the samples. Online polls have their own biases; they also rely on people answering honestly and on the sample being representative. In the end, it is clear that asking a small subset of people questions cannot be relied on to give us a real picture of what likely voters are actually going to do.

At the same time, we have more data streams about people, and more correlations to their behavior, than ever before. Advertisers can target microgroups based on incredibly detailed demographics. Each of us leaves vast trails of data everywhere we go; these trails could be mined to answer all the questions pollsters ask (and likely much more). Social network analysis should be able to tell us who the influencers are and measure their impact on the outcomes.
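
As a rough sketch of what that could look like (the network, names, and library choice below are illustrative assumptions, not anything from this piece), a centrality measure such as PageRank can rank accounts in a follower graph:

```python
# Minimal sketch (invented follower graph): score "influencers" with PageRank.
import networkx as nx

follows = [  # (follower, followed) pairs -- made up for illustration
    ("ana", "carol"), ("bob", "carol"), ("carol", "dan"),
    ("dan", "carol"), ("erin", "carol"), ("erin", "dan"),
]
graph = nx.DiGraph(follows)

# Higher PageRank = more (and better-connected) accounts pointing at you.
scores = nx.pagerank(graph)
for person, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{person}: {score:.3f}")
```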

Now we need a team of statisticians, big data analysts, and marketing gurus to look back at trends in data from a wide range of sources in the lead-up to the election. We need a forensic investigator to find the correlations and trends we missed along the way and connect the dots that led us here. The margins were narrow, so it may be that – for now – the degree of uncertainty we have to accept is still greater than the narrow margins in the actual results. But we should be able to do better than this.

SYSTEM_ERROR_505_STATS_FAIL

By Beth Russell

If data is the gold standard, then why don’t all scientists agree all the time? We like to say the devil is in the details but it is really in the analysis and (mis)application of data. Scientific errors are rarely due to bad data; misinterpretation of data and misuse of statistical methods are much more likely culprits.

All data are essentially measurements. Imagine that you are trying to figure out where your property and your neighbor’s meet. You might have a rough idea of where the boundary is, but you are going to have to take some measurements to be certain. Those measurements are data. Maybe you decide to step it off and calculate the distance based on the length of your stride. Your neighbor decides to use a laser range finder. You are both going to be pretty close, but you probably won’t end up in the exact same place. As long as his range finder is calibrated and your stride length is consistent, both methods are reliable and provide useful data. The only difference is the accuracy.
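
To make the comparison concrete (the numbers below are invented for illustration, not taken from the essay), here is what the two measurements of the same boundary might look like in a quick simulation:

```python
# Minimal sketch (invented numbers): two reliable but differently precise
# ways of measuring the same 25 m boundary.
import numpy as np

rng = np.random.default_rng(1)
true_distance_m = 25.0

# Pacing it off: consistent stride, but each attempt varies by tens of centimeters.
paced = true_distance_m + rng.normal(0.0, 0.30, size=10)
# Calibrated laser range finder: much tighter spread around the same value.
lasered = true_distance_m + rng.normal(0.0, 0.01, size=10)

print(f"pacing: mean {paced.mean():.2f} m, spread {paced.std():.2f} m")
print(f"laser:  mean {lasered.mean():.2f} m, spread {lasered.std():.2f} m")
# Both are useful data; they just answer the question with different accuracy.
```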

Are the data good or bad? It depends upon how accurate you need to be. Data are neither good nor bad as long as the measurement tool is reliable. If you have a legal dispute, your neighbor will probably win; on the other hand, if you are just trying to figure out where to mow the grass, you’re probably safe stepping it off. Neither data set is bad; they just provide different levels of accuracy.

Accuracy is a major consideration in the next source of error: analysis. Just as it is important to consider your available ingredients and tools when you decide what to make for dinner, it is vital to consider the accuracy, type, and amount of data you have when choosing a method of analysis. The primary tools science uses to determine whether the available data support a conclusion are statistical methods. These tests estimate how unlikely the observed data would be if a given assumption were true; they are not evidence that a conclusion is correct.
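
As a minimal, invented illustration of that point (not an example from the essay), a two-sample t-test returns a p-value that speaks to an assumption of “no difference,” not to the truth of the conclusion:

```python
# Minimal sketch (simulated data): a two-sample t-test. The p-value estimates
# how surprising a difference this large would be IF the two groups were truly
# identical; it does not prove the effect is real, or say how big it is.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(loc=10.0, scale=2.0, size=20)   # hypothetical measurements
treated = rng.normal(loc=11.5, scale=2.0, size=20)

t_stat, p_value = stats.ttest_ind(control, treated)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```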

Unfortunately, statistical methods are not one-size-fits-all. The validity of any method depends on properties of the data and the question being tested. Different statistical tests can lead to widely disparate conclusions. In order to provide the best available science, it is vital to choose or design the best test for a given question and data set. Even then, two equally valid statistical tests can come to different conclusions, especially if there isn’t very much data or the data have high variability.
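
A small, deliberately contrived example of that disagreement (the numbers are invented): on a tiny sample with one extreme outlier, a parametric t-test and a rank-based Mann-Whitney test can point in opposite directions.

```python
# Minimal sketch (contrived data): two valid tests, two different answers.
import numpy as np
from scipy import stats

group_a = np.array([1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 40.0])  # one big outlier
group_b = np.array([3.0, 3.2, 3.4, 3.6, 3.8, 4.0, 4.2, 4.4])

t_p = stats.ttest_ind(group_a, group_b).pvalue    # assumes roughly normal data
u_p = stats.mannwhitneyu(group_a, group_b,
                         alternative="two-sided").pvalue  # rank-based, outlier-resistant
print(f"t-test p = {t_p:.3f}, Mann-Whitney p = {u_p:.3f}")
# Here the outlier inflates the t-test's variance estimate, so it reports no
# significant difference, while the rank-based test flags one.
```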

Here’s the rub… even scientists don’t always understand the analysis methods they choose. Statistics is a science in itself, and few biologists, chemists, or even physicists are expert statisticians. As the quantity and complexity of data grow, evaluating which analysis method(s) should be used becomes more and more important. Many times a method is chosen for historical reasons – “We’ve always used this method for this type of data because someone did that before.” Errors made by choosing a poor method for the data are sloppy, lazy, bad science.

Better education in statistics will reduce this type of analysis-based error, and open science will make it easier to detect. Another thing we can do is support more team science. If a team includes a statistics expert, it is much less likely to make these kinds of errors. Finally, we need more statistics-literate editors and reviewers. These positions exist to catch errors in the science, and they need to treat the statistics as part of the experiment, not as the final arbiter of success or failure. High-quality peer review, collaboration, and the transparency created by open data are our best defenses against bad science. We need to strengthen them and put a greater emphasis on justifying analysis methodology choices in scientific discovery.

“Do no harm” is NOT Enough

By Beth Russell

“If it ain’t broke, don’t fix it, right,” my granddad used to say, right before he would wink at me, chuckle, and say “let’s see if we can figure out how to make it better.” This type of ingenuity is at the root of American innovation, invention, and process evolution. Observation, experimentation, and a national drive for optimization are part of our culture. As we have moved from the 20th century into the 21st, there has been a fundamental shift from “one size fits all” solutions, towards more personalized solutions.

The Precision Medicine Initiative is one of the great goals of our time. However, most of our medical treatment is still geared toward the treatment that will usually work, rather than the treatment that is the best for the individual patient. What would the world look like if we could change that in years rather than decades? What if we could do it cheaply, and easily, with information that already exists?

We can. To start the process, we need only do one thing – share. Buried within our medical records, our genetics, and our health data is the information we need to make our medical treatments better. Our diverse population, varied hospital and practitioner policies, and personal health decisions together compose an enormous health data set. If we are willing to share our data with researchers and to insist that insurers, hospitals, and practitioners make that data interoperable, we will be well on our way.

We often have widely held medical practices that are not actually supported by scientific data. This is illustrated by a recent decision by the Department of Agriculture and the Department of Health and Human Services to remove daily flossing from their guidelines: apparently, there was no actual scientific data behind it. Such practices are often low-risk procedures or treatments that do not warrant the expense of a clinical trial. Many of them will probably turn out to be right for most people, but not necessarily for everyone. I for one don’t plan to stop flossing anytime soon.

These sorts of medical practices are typically adopted based upon observation and consensus. This approach is cheap, but it relies on practitioners detecting a pattern of good or bad results, is highly subject to human bias, and is geared much more toward safety than efficacy. There will always be room for common sense and human observation in the medical process, but they will miss both small and rare effects.

For over a century, medicine has been shifting away from simple observation toward data-based decision making. Large observational studies like the Framingham Heart Study and the Nurses’ Health Study have had outsize impacts on medical practice, but they are still too small. Only with many observations from numerous patients can we detect the variations in efficacy and safety that precision medicine requires.

Today, clinical trials are the gold standard for evaluating medical treatments. These experiments are expensive, time-consuming, and often suffer from low subject numbers and a lack of diversity. They can also run into ethical issues, especially with vulnerable populations. Even when the results of clinical trials are excellent, they aren’t always adopted by practitioners at first; medicine tends to be slow to adopt change. Data sharing will allow scientific analysis to extend beyond the length of time and number of subjects used in any “trial” and will allow us to better evaluate drugs and treatments after they go to market, not just before they are approved.

Data sharing is also important for areas of medicine for which traditional clinical trials are difficult or impossible to run. One of these areas is surgery. Most surgeries are not subjected to clinical trials and there is great variation in the methods for even relatively common surgeries from hospital to hospital. How does a patient decide where to get a life-saving surgery? Recommendations from friends and family are the number one method for choosing a doctor. There is no place to look to find out whose favorite method is the best one overall, nor the best for the individual patient. This needs to change. Sharing our medical data will make this possible.

Medical practice is poised for a revolution. We are beginning to move from treating the symptoms to treating the person. This can only happen if enough of us are willing to share. So let’s practice our earliest kindergarten lesson already.

The Right to Erase Data

The Internet has become a platform of societal intercourse: an information repository, communication tool, commercial space, and a location for self-brand promotion. Yet unlike in the past, the traces of that intercourse are no longer ephemeral; the digital ones and zeros produced by these interactions are permanent, creating a digital fingerprint of each individual user in cyberspace. On their own, personalized bits of data are not particularly useful and appear to provide only relatively esoteric indicators about a particular individual. Big data analytics, however, correlates these flows of data and provides insights derived from behavioral science. The information generated about individuals allows corporations and government entities to predict and model human behavior.

Personal big data can be a societal boon, helping to facilitate healthier living, smarter cities, and simpler, more personalized web experiences. However, there is a darker underbelly to the accumulation of this information. Personal data (clicks, keystrokes, purchases, etc.) are being used to create hundreds of inaccessible consumer scores that rank individuals by perceived health risk, occupational merit, and potential propensity to commit fraud. Moreover, as recent leaks of celebrity photos illustrate, Internet privacy is no longer a guarantee. Information that is meant to remain in the private sphere is slowly leaking into the public sphere, challenging previously conceived notions of civil liberty. In order to curb the tide of cyber intrusions, the individual right to erase data must be enacted.

The European Court of Justice ruled in 2014 that citizens have the “right to be forgotten” — a ruling in favor of citizens’ right to privacy. As today is Data Privacy Day, perhaps it is time for the US to stand up and create its own variant of this law, a uniquely American law that gives American citizens the right to erase data — the right to ensure their privacy.

CReST Proposed Language: 

“Any person has the right to erase personal data that they identify as a breach of their privacy. Data erasure requests may be submitted to and arbitrated by the search engine that publishes the data online. If erasure is justified, then the search engine must erase any links to or copies of that personal data in a timely manner. The search engine is responsible for the removal of authorized 3rd party publication of said data.”

The 28th Amendment: Part 1 – The World is Watching You

By Charles Mueller

In the very near future, everything you do, everything you say, and everything you think will be monitored, studied, and analyzed in order to understand what makes you ‘tick’.

Look no further than the infamous story about how Target figured out a father’s teenage daughter was pregnant before he did. By closely monitoring the young woman’s spending habits, Target was able to predict she was pregnant and send her coupons for diapers. This ability is not malicious in any way; it is just a new kind of creepy. It becomes a little creepier when you realize there currently are no rules or laws in place to protect us from someone using this same type of personal data mining to do something like raise your health insurance premium because your shopping habits suggest you are eating unhealthily. How exactly are we going to ensure that these capabilities are used to enhance our society rather than take advantage of it? Our leaders today are not taking this issue seriously, which means we need to take matters into our own hands. A good start may be to push for a constitutional right to own the digital information we produce (our data) when we engage with the world through the Internet.

The Internet has revolutionized how marketers and advertisers communicate their messages to individuals and consumers. This has been enabled by the exponential increase in data produced by individuals using digital technologies like smartphones. Our every move in the digital world is tracked, and the data are collected by what the marketing industry calls a third-party data company, such as Acxiom – essentially a big-data-crunching machine that finds patterns to help marketers and advertisers understand what makes us do the things we do. Many of us have no idea we are opting in to this type of profiling, nor do we care, because it is often used to sell us things we believe we want.

The digital technologies that make it all possible will continue to evolve, and this type of individual targeting will become easier as more users wear their devices. Big data is no longer just assisting marketers; it is defining how they approach their jobs. How will the world change when big data can be used to create targeted, personalized digital content in real time? How far away is a future where my commute to work is so well analyzed by big data companies that they can generate and deliver messages at the most opportune times to get me to buy Starbucks coffee instead of Dunkin’ Donuts?

It will be an incredible power to be able to deliver an optimized message that makes an individual “act” in response to receiving that message. Who decides what messages are sent through all the various digital platforms that are becoming more ubiquitous in our lives? We have already seen the influence big data and social media can have on a presidential election. Will future presidents be elected because they literally raised the most money? Will access to my thoughts simply be granted to the highest bidder? Who is making sure those watching and studying my digital life are using that information for things that are in my best interest?

The role of the government is and has always been to protect its citizens’ rights. In the digital future, the most precious trait of the citizen may be their data. Ensuring that individuals have a constitutional right to own their data could be a way to protect consumers from potential practices of malicious real-time big data analysis. Data ownership will only make it easier to take advantage of the current methods that allow users to opt in or opt out of the powerful targeting mechanisms continually being developed. Having the power to share your data with certain companies could become a type of voting system; you share your data with companies who use it to enhance your experiences and deny it to those who do not.

The digital age is rapidly evolving, and the agencies that historically advise Congress on issues of consumer protection still have not figured out how to respond properly. We need leaders who understand this and are willing to create policies that protect us. Establishing citizens’ ownership of their digital data is a potential step in the right direction. The 28th Amendment to the Constitution should state that we have the right to own our data.

If you want to understand 21st Century ‘Electioneering’, look to Cicero

By Jennifer McArdle

In the first century BC, Marcus Tullius Cicero ran for consul, the highest office in the Roman Republic. His younger brother, Quintus, sought to advise him on how to effectively ‘social engineer’ the electorate. In Quintus Tullius Cicero’s Commentariolum Petitionis, Quintus directs Marcus to wage a campaign based on micro-targeting: delivering targeted campaign messages (which often contradicted one another) to various segments of the Roman populace in order to gain their support. Quintus’ campaign strategy delivered Marcus victory, demonstrating the power of tailored messaging.

The use of behavioral science and big data by campaigns to model voter behavior is adding new relevance to Cicero’s 2,000-year-old campaign strategy—micro-targeting is once again in vogue.

The 21st century has witnessed the emergence of ‘data-driven campaigns.’ Campaigns are combining big data with behavioral science and emergent computational methods to model individual voter behavior. By combining the information in public databases – party registration, voting history, political donations, vehicle registration, and real estate records – with commercial databases, campaigns have been able to target individuals effectively. This micro-targeting extends beyond identifying which voters to contact to shaping the content of the message as well. Philip N. Howard, in his book New Media Campaigns and the Managed Citizen, notes that in the weeks prior to the 2000 presidential election, two middle-aged, conservative, female voters logged on to the same Republican website from different parts of the country. The first, a voter from Clemson, South Carolina, saw headlines about the Republican commitment to 2nd Amendment protections and the party’s pro-life stance. The second, based in Manhattan, was never shown those headlines. The website’s statistical model suggested that the former would respond positively to those headlines, while the latter likely supported some measure of gun control and a woman’s right to choose.
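
As a rough illustration only (the rule, names, and fields below are invented; this is not Howard’s data or any campaign’s actual model), a site could pick which headline to serve by scoring a visitor profile:

```python
# Minimal sketch (hypothetical data and scoring rule): serve the headline a
# simple model predicts this visitor will respond to best.
from dataclasses import dataclass

@dataclass
class Visitor:
    region: str        # e.g. "rural_south", "urban_northeast" (invented labels)
    gun_owner: bool    # inferred from commercial and public records
    party_donor: bool

def headline_for(v: Visitor) -> str:
    # Toy stand-in for the campaign's statistical model.
    score = (
        (2 if v.gun_owner else -1)
        + (1 if v.region == "rural_south" else -1)
        + (1 if v.party_donor else 0)
    )
    return ("Protecting your Second Amendment rights"
            if score > 0
            else "Our plan for safer, stronger communities")

print(headline_for(Visitor("rural_south", True, True)))        # tailored pitch
print(headline_for(Visitor("urban_northeast", False, False)))  # different pitch
```
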
While micro-targeting in Rome arguably made the process more democratic—Marcus was not a member of the nobility and would typically have been eliminated from candidacy—today’s use of micro-targeting has the potential to erode democracy. These computational models allow parties to acquire information about voters without ever asking those voters a question. With this information in hand, campaigns can opaquely micro-target individuals, selectively providing information that fits an individual’s partisan and campaign-issue biases while withholding positions that may not align with their interests. Essentially, campaigns are able to generate filter bubbles that reinforce individual viewpoints while removing differing ideas or philosophies from their search results. Voters are not even aware that micro-targeting has occurred.

While it is unlikely that micro-targeting can be removed completely from politics, there may be a mechanism to protect the integrity of the democratic process. Though difficult, given the opaque nature of micro-targeting, creating a ‘sunshine movement’ during campaigns – non-partisan sites that highlight each candidate’s full platform – could help ensure that voters know each candidate’s true views. ‘Data-driven campaigns’ need not erode democracy, but should they remain as they are, they may do just that.