How can pollsters predict the voting pattern of over 100 million voters by surveying just 1,000 of them? It is mind-boggling, yet, in most instances, pollsters do it within the margin of error.
Unfortunately, in the 2016 general election, pollsters screwed up big time.
There is also the famous flop of the 1936 Literary Digest poll.
This is where the challenge lies: how to ensure that the sample is representative of the 18+ voting-age population.
Post-stratification data weighting ensures that the survey sample we collect mirrors the underlying target population. For example, when we conduct an election poll, we would like our sample of 1,000 survey respondents to represent the characteristics of the US 18+ voting-age population, usually in race, gender, age, income, and level of education (the basic demographics on which people tend to differ) and, perhaps, party affiliation. If our sample looks like the target population, then we can presume it behaves and thinks like the target population.
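The cell-by-cell idea can be sketched in a few lines of Python. Everything below is illustrative: a single weighting variable (education) with made-up population shares, not real Census figures.

```python
from collections import Counter

# Hypothetical population shares for one demographic (education).
# Illustrative numbers only, not actual Census data.
population_share = {"hs_or_less": 0.40, "some_college": 0.30, "college_plus": 0.30}

# Education level of each respondent in a tiny, made-up sample of 100.
sample = ["college_plus"] * 50 + ["some_college"] * 30 + ["hs_or_less"] * 20

n = len(sample)
sample_share = {cell: count / n for cell, count in Counter(sample).items()}

# Post-stratification weight for each cell: population share / sample share.
weights = {cell: population_share[cell] / sample_share[cell]
           for cell in population_share}

print(weights)
# College-educated respondents are over-represented here (50% vs 30%),
# so their weight falls below 1; under-represented cells get weights above 1.
```

Each respondent then counts as their cell's weight in every tabulation, so the weighted sample's education mix matches the target population exactly.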
However, there is a certain stigma attached to data weighting. No matter how theoretically sound it is, most folks cringe when we say we weighted the data; it invites the suspicion that we may have manipulated it. It makes me wonder when people suggest we report the data as is, without any massaging. There is something called non-response bias: in election polling it is very hard to get young, affluent, and minority voters to participate, and samples also tend to lean less Republican. Unless extra effort is taken in the field to fill these non-responsive cells using quotas, the bias needs to be treated. My favorite survey methodology is stratified sampling followed by post-stratification weighting, which is the methodology most pollsters deploy.
There are several ways to weight data. In election polling at the Center for Survey Research and Methodology (CSRA), we weighted the data by race, gender, age, income, and education. Some pollsters weight data by party affiliation. Almost all pollsters weight data, since it is impossible to get a good representation of the voting-age population in the limited time of an election poll. Even though the Random Digit Dialing (RDD) technique is used through systems like Computer-Aided Telephone Interviewing (CATI), not all respondents are willing to participate in a survey.
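When weighting on several variables at once, pollsters often know only the population margins for each variable (the gender split, the age split) rather than the full cross-tabulation. One common tool for that situation, though not necessarily the one any particular shop uses, is raking (iterative proportional fitting). A minimal sketch with made-up respondents and made-up target shares:

```python
# Raking (iterative proportional fitting): repeatedly rescale weights so the
# sample matches known population margins on each variable in turn.
# All respondents and target shares below are invented for illustration.

respondents = [
    {"gender": "female", "age": "18-44"},
    {"gender": "female", "age": "45+"},
    {"gender": "male",   "age": "18-44"},
    {"gender": "male",   "age": "18-44"},
    {"gender": "male",   "age": "45+"},
]

targets = {
    "gender": {"female": 0.52, "male": 0.48},
    "age":    {"18-44": 0.45, "45+": 0.55},
}

weights = [1.0] * len(respondents)

for _ in range(50):  # iterate until the weighted margins converge
    for var, shares in targets.items():
        total = sum(weights)
        for category, target_share in shares.items():
            idx = [i for i, r in enumerate(respondents) if r[var] == category]
            current_share = sum(weights[i] for i in idx) / total
            ratio = target_share / current_share
            for i in idx:
                weights[i] *= ratio

# After raking, the weighted margins match the targets on both variables.
total = sum(weights)
female_share = sum(w for w, r in zip(weights, respondents)
                   if r["gender"] == "female") / total
print(round(female_share, 3))  # ~0.52
```

Adjusting one variable disturbs the margins of the others slightly, which is why the procedure cycles through the variables until the weights settle down.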
Achieving even a 15% response rate in a phone survey is challenging, and response rates in online surveys are dismal, often below 5%. Multi-mode surveys (online, text, social media, and telephone) can improve response rates. The vast majority who do not respond to the survey could flip the results. How do we know about them? Knowledge about the non-respondents is critical in a survey, and precisely because they are not responding, unraveling their mysteries is a challenge.
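These response-rate figures translate directly into field effort. A back-of-the-envelope sketch, using the rates quoted above:

```python
# How many contacts are needed to reach a target number of completed
# interviews, given a response rate? Rates are the ones cited in the text.
target_completes = 1000
phone_rate = 0.15   # a challenging-but-achievable phone response rate
online_rate = 0.05  # a typical online response rate

print(target_completes / phone_rate)   # roughly 6,667 phone numbers to dial
print(target_completes / online_rate)  # roughly 20,000 online invitations
```

And this is before accounting for the non-working numbers, bad addresses, and ineligible respondents that inflate the contact list further in practice.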