Researchers at the University of Wisconsin are analyzing tweets from around the world containing words like “haze,” “sunny” and “cold” to identify areas of high air pollution.
The computerized prediction system was created to analyze social media posts coming from a specific city to arrive at an estimated air quality index for that city.
The system was created because impoverished areas of China have no air quality stations to estimate air pollution, Shike Mei, graduate student and member of the research team, said.
“What’s interesting about our approach [is that] the ultimate goal is not simply to predict air pollution for China, but rather to design a machine learning approach that can do this or other related tasks,” Jerry Zhu, associate professor in the department of computer sciences, said.
The system relies on computer science techniques called natural language processing and machine learning, Zhu said.
At an intuitive level, the computer scans all social media posts and counts how many times every word is used on a particular day in a given city.
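The counting step described above can be sketched in a few lines of Python. This is only an illustration, not the researchers' code: the function name `count_words`, the sample posts, the city names and the date are all made up, and splitting on whitespace is a simplifying assumption (posts written in Chinese would need a proper word segmenter).

```python
from collections import Counter, defaultdict

def count_words(posts):
    """Tally word occurrences per (city, day).

    `posts` is an iterable of (city, day, text) tuples; this input
    format is an assumption made for the sketch.
    """
    counts = defaultdict(Counter)
    for city, day, text in posts:
        # Naive tokenization: lowercase, then split on whitespace.
        counts[(city, day)].update(text.lower().split())
    return counts

# Hypothetical sample posts, invented for illustration.
posts = [
    ("City A", "2014-03-01", "heavy haze again today"),
    ("City A", "2014-03-01", "staying indoors because of the haze"),
    ("City B", "2014-03-01", "sunny and cold this morning"),
]
word_counts = count_words(posts)
```

Each entry in `word_counts` is then one day's word tally for one city, which is the raw material the rest of the system works from.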
“The intuition is that there will always be words that are heavily used,” Zhu said. “What is interesting is the program needs to figure out which of the words are actually related to air quality. We want it to figure that out all by itself.”
UW developers did this by collecting data on all social media posts, as well as the actual air quality index, from two cities on the same day, Zhu said.
Because the computer knows the actual air quality difference between the cities and the word count differences from the social media posts, it conceptually just looks at whether certain words are used more often in one city than in the other, Zhu said.
“There could be many different reasons certain words are used more often in City A than in City B,” Zhu said. “But if you collect more data from multiple cities from multiple days, always when the air quality index is known, pretty soon you can correlate those word count differences with air quality.”
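One simple way to picture that correlation step is below. The article does not say which statistical model the team uses, so this sketch substitutes a plain Pearson correlation between each word's count and the known air quality index; the function names and all the numbers are invented for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def word_aqi_correlations(samples):
    """Correlate every word's count with the air quality index.

    `samples` is a list of (word_counts, aqi) pairs, one per city-day.
    """
    vocab = set()
    for counts, _ in samples:
        vocab.update(counts)
    aqis = [aqi for _, aqi in samples]
    return {
        word: pearson([counts.get(word, 0) for counts, _ in samples], aqis)
        for word in sorted(vocab)
    }

# Hypothetical city-days: word counts paired with a known index
# (a higher index means worse air). All values are made up.
samples = [
    ({"haze": 9, "sunny": 1, "lunch": 5}, 210),
    ({"haze": 6, "sunny": 2, "lunch": 4}, 160),
    ({"haze": 2, "sunny": 7, "lunch": 5}, 70),
    ({"haze": 0, "sunny": 9, "lunch": 4}, 40),
]
corr = word_aqi_correlations(samples)
```

In this toy data, “haze” comes out strongly positive, “sunny” strongly negative, and an unrelated word like “lunch” stays comparatively weak, which is the kind of separation that more city-days are meant to sharpen.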
Words like “haze, pollution, indoors and heavy” are indicative of bad air quality, whereas words like “sunshine, sunny and cold” are indicative of good air quality, Zhu said.
Previously, measuring air quality required setting up a physical monitoring station with devices to measure pollutants, Zhu said.
“That approach is much more accurate,” Zhu said. “But, as you can imagine, it is also limited by where you can set it up and how many places you can set it up. On the contrary, our approach is less accurate, but all it requires is to monitor social media posts. …”
This gives cities that lack the means to set up air quality stations a way to measure air pollution, Mei said.
Currently, the computer only reads the text portion of social media posts. UW researchers plan to improve the accuracy of the prediction by including any photos posted.
“We want computers to be able to identify interesting portions of the photo and use that automatically,” Zhu said. “That pretty much means we want the computer to look at outdoor photos and beyond somebody’s shoulder into the background to see how hazy the sky is.”