Birth of a Revolution Pt. II — A global model for forecasting political instability
By Abhi Nayar | an428 | firstname.lastname@example.org
- The App : abhinayar.github.io/BirthOfARevolution
- Front-End Code : github.com/abhinayar/BirthOfARevolution
- Back-End Code : github.com/abhinayar/BirthOfARevolution_Backend
Birth of a Revolution is an attempt to quantify social unrest within countries by combining historical socioeconomic data analysis with real-time social media sentiment analysis.
Historical data and static models built upon it are well equipped to deliver insights into instability on a macro level. However, these models do not adjust well to (a) rapid advances in technology and evolving methods of communication/unrest contagion and (b) real-time shifts in public sentiment on a granular level (since reliable data usually lags the current date by about two years).
At the same time, unrest models that rely purely on real-time data, whether social media or otherwise, do a poor job taking the macroeconomic views into account. These models cannot, for example, take immediate outbursts in negative public sentiment and combine them with larger macro events such as troughs in the business cycle, contractionary economies or unrest in neighboring countries (which has been observed to result in a viral contagion-like effect for instability).
Birth of a Revolution Pt. II aims to combine both types of model, historical regression and real-time sentiment, in order to better predict levels of social unrest in a given country at any time. Birth of a Revolution Pt. I was a project I completed in my junior year of high school, which went on to win 4th place at the Intel International Science & Engineering Fair (ISEF) in 2013 [https://sspcdn.blob.core.windows.net/files/Documents/SEP/ISEF/2013/Press-Releases/Grand-Awards.pdf].
That project, Pt. I, focused solely on historical data analysis. The motivation for this project comes from gaps in the analytical power of the original predictor, which arose from the model's reliance on long-term socioeconomic data and its inability to adapt to real-time shifts in sentiment.
What follows is a description of the methodology followed for creating both models, the relevant research, and the outcomes and final deliverables. Each step and sub-step is described in detail, and together they present a holistic view of the research, thoughts and processes involved in creating the final unrest predictor.
A large amount of background research was done prior to beginning this project. The research fell into four broad categories:
- Past historical analysis studies
- Past real-time analysis studies
- Historical ANOVA/Regression analysis research
- Implementation best practices
See the section labelled Research Links for a non-comprehensive list of key links.
The key finding from my research was that few studies combined multimodal prediction methods to create a more robust model. Most studies focused on a single source: historical data, socioeconomic data, social media, news data, blog data, etc.
Perhaps due to the difficulty in combining various modalities, authors had instead (in most cases) explored the extent to which a single modality could act as a predictor for social/political instability.
These papers and readings gave an insight into the methodologies behind several unimodal unrest models, and a few (4, 6, 7) discussed combining modalities as well.
On the historical side I looked into different methods of statistical analysis, regressions and ANOVAs. On the real-time side I primarily analyzed which mediums to pull data from as well as the best practices on pulling data.
One of the most interesting findings in this research was that the words used to label unrest really mattered. Several major political science institutions, such as the MacMillan Center at Yale, had put together lists of words that were commonly associated with impending instability and unrest. The simple choice of language made a considerable difference, especially during any real-time analysis process.
I also spent a significant portion of time on the technical side, analyzing the best ways to present this data from a design and front-end perspective. There were also significant back-end challenges around data ingestion that I had to solve. Thankfully, many leading technology companies (including Twitter and Tumblr) have published a lot of literature on technology usage and best practices, which I referred to extensively.
I will now describe the methodologies involved in building each part of the model as well as the final result.
The model created for Pt. I in 2013 lacked robustness as well as dataset size. Given the significant advances in my understanding of statistics and econometrics I decided the best course of action would be to recreate the dataset and re-run all the analysis.
The data was gathered from a variety of reputable sources. Each category of information (social, economic, etc.) was provided by an overlapping set of verified sources, whether international organizations or established information/data aggregators.
- The World Bank [http://www.worldbank.org/]
- African Development Bank [https://www.afdb.org/en/]
- Asian Development Bank [https://www.adb.org/]
- Inter-American Development Bank [https://www.iadb.org/en]
- International Monetary Fund [http://www.imf.org/external/index.htm]
- U.N. Development [http://www.undp.org/content/undp/en/home.html]
- The African Union [https://au.int/]
- The European Union [https://europa.eu/european-union/index_en]
- International Labor Organization [http://www.ilo.org]
- OPEC [http://www.opec.org/opec_web/en/]
- World Trade Organization [https://www.wto.org/]
- J.W. Coons Advisors [http://www.jwcoonsadvisors.com/]
- Trading Economics [https://tradingeconomics.com/forecast/currency]
- Focus Economics [https://www.focus-economics.com/]
- Al Jazeera [https://www.aljazeera.com/]
- New York Times [https://www.nytimes.com/]
- Financial Times [https://www.ft.com/]
- The Economist [https://www.economist.com/]
- Economist Intelligence Unit [https://www.eiu.com/home.aspx]
- World Economic Forum [https://www.weforum.org/]
The full dataset can be viewed here:
Building the model
As in Pt. I, the concept was to build an ANOVA/linear regression based predictor that best fit existing unrest estimates. To build this model, I first had to collect a standard of comparison.
This standard of comparison would serve as the dependent variable in my model (the quantity the regression tries to predict) and needed to be reputable, third-party validated and robust. The dependent variable I chose was the Political Instability Index, a measure put out by the Economist Intelligence Unit.
The EIU has put together detailed reports for each relevant country dating back to the early 1990s. By acquiring these reports and sifting through them (thousands of them!) I was able to put together an entire column of data corresponding to the unrest levels as stated by the Economist.
Once the dependent variable had been created, I ran regression analysis in Excel on several subsets of the data. I selectively excluded hundreds of rows, both because of incomplete data and in order to hold out a test set to run the model on once it had been created.
The regression analysis returned p-values for all my indicators, allowing me to exclude several insignificant ones at an alpha level of 0.05. The alpha level is the significance threshold: an indicator is kept only if its p-value falls below it, so the lower the threshold, the stronger the evidence required that the indicator helps explain variance in the dependent variable. The indicators that remained significant were:
- Youth Unemployment Rate
- Percent Urban Population
- Overall Globalization Rate
- Human Development Index
- Civil Liberties Index
- Corruption Perception Index
- Mobile Phone Use Per 100
- Literacy Rate
- Voice and Accountability Index
- Rule Of Law Index
By further restricting the alpha level to 0.01, the predictors could be narrowed down to:
- Youth Unemployment Rate
- Percent Urban Population
- Human Development Index
- Corruption Perception Index
- Mobile phone use per 100
These factors resulted in a model with an R^2 value of 0.84. R^2 is a measure of how well the model fits the data. It ranges from 0 to 1 and gives the share of variance in the outcome that the model accounts for, so in this case the model explains roughly 84% of the variance in the instability index.
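For readers unfamiliar with the statistic, here is a minimal sketch of how R^2 is computed from observed and fitted values. The original analysis was run in Excel; the function name and the sample numbers below are hypothetical and only illustrate the formula.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_actual = sum(actual) / len(actual)
    # Total variation of the observed values around their mean.
    ss_total = sum((y - mean_actual) ** 2 for y in actual)
    # Variation left unexplained by the model's predictions.
    ss_residual = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
    if ss_total == 0:
        raise ValueError("actual values have zero variance")
    return 1.0 - ss_residual / ss_total


# Hypothetical observed and fitted instability scores, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
fitted = [1.1, 1.9, 3.2, 3.8]
print(round(r_squared(observed, fitted), 2))
```

A value near 1 means the fitted values track the observed ones closely; a value near 0 means the model does little better than predicting the mean.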
One last step was done from a historical model perspective, namely trying to take contagion into account. Once the base model had been created, I re-ran the regression, this time including an indicator variable that was a 1 if the sub-region of the country had experienced unrest and 0 if it had not.
With the inclusion of this indicator variable, the model had an R^2 value of 0.88. This is consistent with my hypothesis that a contagion-like effect impacts unrest scores, although R^2 cannot decrease when a variable is added, so the increase alone is not conclusive.
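The contagion indicator can be sketched as a simple lookup. This is an illustration under stated assumptions: the country codes and sub-region mapping are hypothetical placeholders, and excluding the country's own unrest from the check is my assumption rather than something the original analysis specifies.

```python
def contagion_indicator(country, sub_region_of, unrest_countries):
    """Return 1 if another country in the same sub-region had unrest, else 0.

    sub_region_of: dict mapping country -> sub-region name.
    unrest_countries: set of countries that experienced unrest in the period.
    """
    region = sub_region_of[country]
    for other in unrest_countries:
        # Assumption: a country's own unrest does not count as contagion.
        if other != country and sub_region_of.get(other) == region:
            return 1
    return 0


# Hypothetical placeholder data, for illustration only.
regions = {"A": "north", "B": "north", "C": "south"}
unrest = {"B"}
print(contagion_indicator("A", regions, unrest))
print(contagion_indicator("C", regions, unrest))
```

The resulting 0/1 column is then added to the regression alongside the other indicators.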
Once the model had been created and R^2 scores had been computed I then ran it on all excluded rows which had the factors filled. For the rows that did not, I removed the missing factors from the model and adjusted the variable weights. I then re-ran the estimator for these rows to get unrest estimates for all rows in the dataset.
This left me with historical unrest estimates for every country in the dataset dating back 20 years. The next step was to create the real time predictor.
At the outset the real-time predictor sought to combine publicly available data from two microblogging sites — Twitter & Tumblr. Both have been used in the past as a measure of unrest and sentiment prediction.
Indeed, one of Twitter’s up-sold premium offerings is an analytics tool that allows businesses to track trends and consumer sentiment over time. Tumblr, while not offering a similar service, does offer access to a significant portion of its total data stream.
During the development process, however, it became apparent that Tumblr could not provide data at the same speed and accuracy as Twitter. This is due to its rate limits as well as the way it makes content indexable: Tumblr content is indexed only on the hashtags people use in a post, as opposed to the wording of the post itself. This caused data to come in at a much slower pace and added little to the analysis beyond increased development complexity.
Thus I decided to proceed using Twitter data alone. The real-time analysis is done by combining three data streams:
- 7-day search based unrest estimation
- Real-time tweet stream sentiment aggregation
- GDELT realtime instability index
7-day Search Based Unrest Estimation
At the start of the project, I was able to access a Twitter firehose data pipe. This was through a previous employer, and gave me access to 100% of Twitter data, dating back to the platform's launch in 2006.
As I began working, however, the employer contacted me and informed me that, as I was no longer employed there, I would have to give up my access. Thus other avenues for aggregating Twitter data had to be found.
I realized through my research that Twitter provided free access to up to 7 days' worth of tweet data. This could be accessed by searching on certain keywords, and all tweets containing those keywords over the past week would be returned.
I took this mass of data and filtered out certain unsatisfactory items such as retweets. These lacked critical components needed for sentiment analysis and mapping, namely geographic data, and their additional @ symbols confused the NLP API.
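A minimal sketch of the retweet filter, assuming tweets arrive as dictionaries shaped like Twitter's v1.1 tweet objects (where a retweet carries a `retweeted_status` field). The project's actual back-end is not written in Python, and the helper names here are hypothetical.

```python
def is_retweet(tweet):
    """Treat a tweet as a retweet if it embeds the original tweet under
    `retweeted_status` or its text starts with the conventional "RT @" prefix."""
    return "retweeted_status" in tweet or tweet.get("text", "").startswith("RT @")


def drop_retweets(tweets):
    """Keep only original tweets."""
    return [t for t in tweets if not is_retweet(t)]


# Hypothetical sample payloads, for illustration only.
sample = [
    {"text": "Crowds gathering downtown"},
    {"text": "RT @someone: Crowds gathering downtown"},
    {"text": "Shared", "retweeted_status": {"text": "Original"}},
]
print(len(drop_retweets(sample)))
```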
I then passed the data through the Google Natural Language API. This API tokenized and analyzed the tweets and returned a sentiment score and a magnitude. The score ranges between -1 and 1 (negative to positive), while the magnitude is a non-negative number that reflects the overall strength of emotion in the text rather than a confidence level.
By comparing the number of unrest-related tweets to the number of total tweets and the aggregate sentiment of the unrest-tweets I was able to create an index that attempted to measure the 7-day unrest level.
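The exact formula for the index is not given here, so the following is only one plausible sketch, not the method actually used. It assumes the index is a weighted sum of the share of unrest-related tweets and their average negativity; the function name, the `ratio_weight` parameter and the rescaling of sentiment are all my assumptions.

```python
def seven_day_unrest_index(unrest_scores, total_tweets, ratio_weight=0.5):
    """Combine tweet share and sentiment into a single value in [0, 1].

    unrest_scores: sentiment scores in [-1, 1] for unrest-related tweets.
    total_tweets: number of all tweets sampled over the same 7 days.
    ratio_weight: hypothetical weight given to the share of unrest tweets.
    """
    if total_tweets <= 0:
        raise ValueError("total_tweets must be positive")
    if not unrest_scores:
        return 0.0
    ratio = min(len(unrest_scores) / total_tweets, 1.0)
    mean_sentiment = sum(unrest_scores) / len(unrest_scores)
    # Map sentiment from [-1, 1] to negativity in [0, 1]; -1 becomes 1.
    negativity = (1.0 - mean_sentiment) / 2.0
    return ratio_weight * ratio + (1.0 - ratio_weight) * negativity


print(seven_day_unrest_index([-1.0, -1.0], total_tweets=4))
```

Tuning `ratio_weight` against an external benchmark is one way the balance between the two components could be adjusted.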
As previously mentioned, both the Twitter search and stream APIs operate based on the keywords passed to them. These keywords determine which tweets are filtered and delivered to the microservice that serves them to the NLP application.
The keywords were carefully selected by analyzing a variety of literature from the MacMillan Center at Yale and several other leading political science institutes. These institutions had published words that were most closely associated with political instability, and data was collected on these words in both English and Spanish.
In order to validate the robustness of this predictor I compared the data output by the 7-day unrest index against an instability index published by GDELT.
GDELT describes itself as the largest, most comprehensive and highest resolution open database of societal trends in the world. It constantly analyzes news streams in real time in over 200 countries to get a sense of the levels of different indicators in a given country, one of these indicators being stability.
GDELT is one of the world’s leading data providers for short-term unrest prediction. By comparing my model's results to an aggregate week-by-week prediction from GDELT data, I was able to tune how my model weighted unrest tweets against total tweets so that the predictor more closely tracked the GDELT estimates.
Real-time Tweet Stream Sentiment Aggregation
To create a real-time predictor, it was not enough to analyze 7 days of historical tweet data. A cron job that runs every 24 hours cannot adapt to real-time shifts in sentiment. Thus I decided to augment the 7-day historical search data with a true real-time estimator.
This estimator is powered by the Twitter Stream API. Using this API it is possible to receive all tweets that satisfy the passed conditions that are posted to the Twitter platform in real-time. While this service will not provide any historical data, it is a really great way to attempt to analyze the “pulse”, so to speak, of the Twitter-verse on certain topics.
An added advantage of this API was that you could bound the incoming tweets by lat-long coordinates. This meant I was able to construct an operator that took a dataset of countries mapped to their lat-long boundaries (polygons) and only receive and analyze tweets emanating from within that country.
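As a rough illustration of the geographic bounding, the sketch below reduces a country polygon to a bounding box and formats it the way I understand the streaming API's `locations` filter expects: longitude,latitude pairs with the southwest corner first. The function names and the sample polygon are hypothetical, and a box that crosses the antimeridian is not handled.

```python
def bounding_box(polygon):
    """polygon: list of (longitude, latitude) points.
    Returns (west, south, east, north)."""
    lons = [point[0] for point in polygon]
    lats = [point[1] for point in polygon]
    return (min(lons), min(lats), max(lons), max(lats))


def locations_param(box):
    """Format a box as 'west,south,east,north' (southwest corner first)."""
    return ",".join(str(value) for value in box)


def box_contains(box, lon, lat):
    """True if the point lies inside the box (edges included)."""
    west, south, east, north = box
    return west <= lon <= east and south <= lat <= north


# Hypothetical rectangular region, for illustration only.
region = [(-74.0, 40.5), (-73.5, 40.5), (-73.5, 41.0), (-74.0, 41.0)]
box = bounding_box(region)
print(locations_param(box))
```

Because a bounding box is larger than the country outline it encloses, a stricter point-in-polygon check would still be needed to discard tweets from neighboring territory.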
Thus I was able to build a country-by-country real-time Twitter stream analysis, which is what powers the visualization under the tab “By Country”. The Twitter stream data was then sent to the same Google Natural Language API, which analyzed the text and returned the sentiment score and magnitude.
It was then ported, via a Socket.IO connection, to the client which would proceed to map/chart the data on a globe or country, giving the user a complete view of the data coming out of one country at any given time.
Validation of this model was more difficult than validation of the 7-day search results. To my knowledge no comparable real-time predictor of unrest exists, so I chose to log the tweet-generated sentiment data and compare it to the 3-day and 7-day GDELT results once that amount of time had passed.
In order to do this, an hourly unrest estimate was logged to a Firebase database, and later ported to a BigQuery table. By downloading a log of all GDELT estimates I was able to see how close the real-time predictor was to the GDELT index. My finding was that the Twitter data showed a significantly higher level of unrest and/or negative sentiment than GDELT.
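One simple way to quantify such a gap is the mean signed difference between the two logs over the periods they share. This is a sketch, assuming both series have already been rescaled to a common range and keyed by the same period labels; the function name and the sample values are hypothetical.

```python
def mean_difference(twitter_estimates, gdelt_estimates):
    """Mean of (twitter - gdelt) over the periods present in both logs.

    A positive result means the Twitter-based estimate runs higher on average.
    """
    shared = sorted(set(twitter_estimates) & set(gdelt_estimates))
    if not shared:
        raise ValueError("the two logs share no periods")
    total = sum(twitter_estimates[p] - gdelt_estimates[p] for p in shared)
    return total / len(shared)


# Hypothetical values, for illustration only.
twitter_log = {"day1": 0.5, "day2": 0.75}
gdelt_log = {"day1": 0.25, "day2": 0.25, "day3": 0.1}
print(mean_difference(twitter_log, gdelt_log))
```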
I attribute a lot of this discrepancy to the way individuals are inclined to behave on social media vs. the behavior of press (which is the backbone of the GDELT analysis). Individuals on Twitter are much more expressive, and liberal in their use of expletives and strong language than the press, and this type of language is not something I filtered for during the indexing process.
GDELT Realtime Instability Index
The GDELT event database is one of the world’s largest collections of event-based data. The organization is partnered with tech giants like Google, and collects information from news sources and governments around the world on a 15-minute interval.
This is one of the most robust and up to date sources for producing a real-time view of instability as predicted by the actual events and news coming out of a country.
One of the biggest differences between this project and previous attempts at quantifying unrest is the combination of historical predictive modeling with real-time social media sentiment analysis.
While many papers and projects have analyzed each type of data in isolation, there are relatively few that have attempted to combine them. Even when combined, many of the projects simply analyzed them side-by-side, as opposed to creating a universal predictor that assigns weightage to either predictor in a manner that enhances the efficacy of the model as a whole.
Given the relatively sparse research in this field, combined with my lack of expertise building robust statistical models, what follows is a best attempt at combining the two predictors that weighs the unrest estimates from both the historical and the real-time models to create a holistic representation of unrest.
I began by weighting the models equally: 50% historical and 50% real-time. This, however, proved problematic, as the real-time model is not close to the historical model in efficacy.
This is because I had a very hard time filtering out irrelevant tweets and other pieces of real-time data, so parsing the real-time streams for truly accurate measures of unrest was difficult.
Because of this I chose to weight the historical model at 70% and the real-time model at 30%. This consistently gave some of the best unrest estimates, as you could clearly see the real-time aspect shifting instability levels, but not to the extent that a 50% weighting would.
There was also a noticeable difference in the impact on the overall instability score during periods when large volumes of irrelevant material or spam were being posted by Twitter users.
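The weighting scheme itself is a convex combination of the two estimates. A minimal sketch, assuming both estimates are expressed on the same scale; the function and parameter names are hypothetical.

```python
def combined_unrest(historical, realtime, historical_weight=0.7):
    """Blend the two estimates; the real-time share is 1 - historical_weight."""
    if not 0.0 <= historical_weight <= 1.0:
        raise ValueError("historical_weight must be between 0 and 1")
    return historical_weight * historical + (1.0 - historical_weight) * realtime


# Same hypothetical inputs under the 70/30 and 50/50 weightings.
print(combined_unrest(0.4, 0.9))
print(combined_unrest(0.4, 0.9, historical_weight=0.5))
```

With a spike in the real-time estimate, the 70/30 blend moves the combined score less than the 50/50 blend does, which matches the behavior described above.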
Future work would involve tweaking this weightage as well as figuring out a better way of parsing relevant vs irrelevant tweets and choosing only the most relevant ones for inclusion in the final product.
The end aim of this work was to create a visualization that would allow anyone, from experienced statistician to layman, to look holistically at levels of unrest and political instability around the world.
To that end, my front-end development work centered on creating a beautiful, interactive visualization that would allow individuals to easily get a picture of global instability. I therefore set out to build an interactive visualization with three components:
- Streaming Mode
- Search By Key Term
- Search By Country
In the streaming mode of the visualization, all tweets that matched the list of keywords related to unrest, from anywhere in the world, were aggregated in my backend service. These tweets would come in in real-time via the Twitter Stream API which allows you to pipe in tweet data based on both keywords and location.
The tweets were aggregated on the backend and, using asynchronous functions, sent to the Google Natural Language API for processing. The NLP service looked at the tweet text, any emoji use, etc. and returned two pieces of information: the sentiment score and the magnitude (the overall strength of the sentiment).
I then waited until a client connected to the backend and piped the tweet data, sentiment score and any relevant metadata to the client via a Socket.IO connection.
The client is a web-based interface that upon first load shows a 3D globe with country and sentiment data on either side. The globe is created using ThreeJS, a library for creating 3D visualizations on top of WebGL, a JavaScript API for GPU-accelerated graphics in the browser.
The client would listen on a Socket.IO stream and upon receipt of data from the backend do a number of things:
- Map the tweet on the globe with a marker indicating sentiment
- Post the tweet on the right sidebar
- Make a call to a REST API to receive country data and display it on the left side (for the country the tweet was from)
- Get the instability measure from our Firebase Real-time Database
In order to achieve these things, from a development standpoint quite a few problems had to be solved. The first few surrounded mapping.
Since the globe is simply a 3D sphere with custom materials and shaders giving it the glowing appearance, mapping latitudes and longitudes to it was not a trivial matter. The problem was solved by mapping lat-long coordinates to a point on the surface of the sphere, applying the globe texture and then rotating the sphere until the coordinate matched the physical location.
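The coordinate mapping can be sketched with the usual spherical-to-Cartesian conversion. The actual client does this in JavaScript with ThreeJS; the Python below only illustrates the math, and the axis convention (y pointing up through the north pole, with a 180 degree longitude offset) is an assumption that in practice depends on how the texture is wrapped.

```python
import math


def lat_lon_to_xyz(lat_deg, lon_deg, radius=1.0):
    """Map latitude/longitude in degrees to a point on a sphere."""
    phi = math.radians(90.0 - lat_deg)     # polar angle measured from the north pole
    theta = math.radians(lon_deg + 180.0)  # azimuth around the vertical axis
    x = -radius * math.sin(phi) * math.cos(theta)
    y = radius * math.cos(phi)
    z = radius * math.sin(phi) * math.sin(theta)
    return (x, y, z)


north_pole = lat_lon_to_xyz(90.0, 0.0)
origin = lat_lon_to_xyz(0.0, 0.0)
print(north_pole, origin)
```

Every returned point lies at distance `radius` from the center, so markers sit on the surface of the globe.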
Once that step was done, mapping the actual tweet was a process as well. Tweets came in with geographic data in the form of a bounding box: 4 coordinates indicating the corners of a polygon within which the tweet originated. To estimate the location of the tweet, the four corners of this bounding box were averaged. This approximated the tweet's location as the center of the bounding box, which is where the tweet was plotted.
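The averaging step is small enough to show in full. A minimal sketch, assuming the corners arrive as [longitude, latitude] pairs; the function name and the sample box are hypothetical.

```python
def bounding_box_center(corners):
    """Average the corner coordinates to approximate a tweet's location.

    corners: list of [longitude, latitude] pairs (four for a bounding box).
    Returns (longitude, latitude).
    """
    if not corners:
        raise ValueError("corners must not be empty")
    lon = sum(corner[0] for corner in corners) / len(corners)
    lat = sum(corner[1] for corner in corners) / len(corners)
    return (lon, lat)


# Hypothetical bounding box, for illustration only.
sample_box = [[-74.0, 40.0], [-72.0, 40.0], [-72.0, 42.0], [-74.0, 42.0]]
print(bounding_box_center(sample_box))
```

For large bounding boxes (for example a whole country) the center can be far from where the tweet was actually written, so the plotted marker is only an approximation.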
The other major challenge was to list the tweet on the right sidebar. This involved creating an aggregate sentiment score which reflected the aggregated sentiment of tweets received since the visualization had begun receiving tweets.
In order to calculate this aggregate score, the average sentiment of all tweets had to be recalculated each time a new one was received. These values were then stored in a DOM node as well as an internal state variable for easy access and computation.
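Recomputing the average does not require keeping every past score. The sketch below uses an incremental mean update, which gives the same result as averaging the full history; this is a swapped-in illustration of the idea rather than the client's actual code, and the class name is hypothetical.

```python
class RunningSentiment:
    """Maintain the average sentiment of all tweets seen so far."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def add(self, score):
        """Fold one new sentiment score into the running average."""
        self.count += 1
        # Shift the mean toward the new score by 1/count of the gap.
        self.mean += (score - self.mean) / self.count
        return self.mean


tracker = RunningSentiment()
for score in [1.0, 0.0, -1.0]:
    tracker.add(score)
print(tracker.count, tracker.mean)
```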
Search By Key Term
I wanted to give users an option to sort the incoming result by the keyword that had triggered its inclusion as well as allow users to search for a specific key word.
I ran into an issue with the latter aspect, namely allowing individuals to search on their choice of key term. This is due to the way Twitter handles streams of data: when you open a stream you have to preset the key terms being searched on, and these terms cannot be changed once streaming begins.
Thus allowing each user to search by an individual word would have involved opening too many streams for my rate-limited API access to handle. The alternative was to take in all tweet data, parse each tweet via a regular expression, and only serve the tweets that matched the pattern. This solution, however, also ran into rate limits and the fact that Twitter will not allow an unlimited data pipe without firehose access.
I thus had to come up with a more limited solution. I decided to allow users to filter by a limited subset of words and only view the incoming data for their selection of words. I was able to do this by reading in all tweets that matched the subset of words and then determining on the client side whether each tweet fit the user-specified criteria.
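The client-side filtering step can be sketched as a case-insensitive substring check against the user's selected words. The real client runs in the browser; this Python version only illustrates the logic, and the function name and sample keywords are hypothetical.

```python
def matches_selection(tweet_text, selected_keywords):
    """True if the tweet contains at least one of the user's selected keywords."""
    text = tweet_text.lower()
    return any(keyword.lower() in text for keyword in selected_keywords)


def filter_for_user(tweets, selected_keywords):
    """Keep only the streamed tweets that match the user's selection."""
    return [t for t in tweets if matches_selection(t, selected_keywords)]


# Hypothetical incoming tweets and selection, for illustration only.
incoming = ["Protest planned for Friday", "Riot police deployed", "Lovely weather"]
print(filter_for_user(incoming, ["protest"]))
```

Because the stream is opened once with the full keyword subset, each user's selection only changes what is displayed, not what is requested from Twitter.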
Search By Country
I also wanted to let users browse the historical data I had collected on countries, and view the unrest/incoming tweets that only emanated from within certain geographical regions.
I was able to do this by first importing my data set from Microsoft Excel into Firebase, Google’s cloud database. Then I created a service that would create a bounding box on the Twitter stream on the fly, something that was actually in the realm of possibility, unlike changing the keywords after the stream was started.
Once the incoming tweets were bounded to a locale, I had the globe center on the country, highlighting it for the user. In addition I polled the Firebase instance for all relevant facts from my data set for the chosen country, as well as a full list of instability measures across every year available in the dataset.
With this I plotted a bar chart of various unrest measures, allowing the user to click on a bar in the graph to view the data for that respective year. In addition I polled a REST API that served Wikipedia entries relating to social unrest, limiting it to results specific to the selected country during the chosen year.
Together this gave the user both historical insight into what turmoil the country had been experiencing and a view of trends, as you could clearly see instability rising gradually and peaking when Wikipedia indicated there had been a large-scale unrest event.
While this visualization satisfies my initial vision, there is a lot more that I would have liked to do/am capable of doing given the time, tools and skill set. Below I will detail future work related to this project that I would like to continue as well as some limitations on the current deliverable.
This work has only begun to scratch the surface of attempting to predict human behavior such as unrest. I will detail several extensions and limitations of the current project as well as what I would like to do going forward.
On the data collection side, the more data that can be collected, the better the predictor that can be built. One of the biggest limitations of the current work is dataset size: while the set is relatively large in comparison to the dataset for BoR Part I, it pales in comparison to other projects.
There are several datasets that I would love to get my hands on, one of them being the CNTS (Cross-National Time-Series) dataset, which records socioeconomic factors for countries dating as far back as 1880.
I also did not have time to collect data by hand on unrest events that had occurred in each country and index them by type of event, etc.
All this could have contributed to a much more robust and extendable prediction model. Unfortunately the current model is suited only for analyzing the impact of socioeconomics and third party measurement indexes on unrest levels in a country.
I also have my doubts as to whether the Economist's unrest measure is the best one that could have been used. I feel that deeper research into the various variables would result in a prediction model with much more efficacy.
Unfortunately due to time constraints I had to collect the data as fast as possible and this could have well led to a less than satisfactory model.
On the real-time data there are several things that could be done better or in a more robust manner. They primarily revolve around the prediction methodology and the data analysis used.
On the methodology front, I have my personal doubts about the way I am calculating the level of unrest. For one thing, the tweets are not filtered for spam or other irrelevant content. Building some sort of sorting mechanism or intelligent filter, using AI/NLP or otherwise, would go a long way towards improving the efficacy of the model.
In addition the pure unrest related tweets / total tweets might not be the best way of measuring unrest sentiment. I did experiment with weighting the measure based on relative sentiment scores as well as magnitude/confidence intervals returned by Google but was not able to build a robust model in time.
I also would, going forward, implement a smarter analysis mechanism. One idea is to load all the tweets into BigTable or another data engine, and to process them in batches for more than pure sentiment, leveraging a lot of the other metadata that is produced in a tweet.
Overall, for the time constraints I was working within, I am quite happy with the real-time result. However, given more time I would also choose to improve the visualization, making its insights even more accessible to non-data scientists. It’s questionable how much the layman cares about the various factor levels versus immediately seeing, in a given year, which factor most contributed to the unrest and what event it precipitated.
Lastly, given access to more data one can almost always build a better predictor. Due to my Twitter firehose access being revoked I was limited as to which tweets I could see and which ones I could analyze. Doing this at scale would involve some sort of partnership with Twitter that gives access to a much larger dataset from Twitter both for search and streaming purposes.
In this report, I detail the thinking and process behind the project — Birth Of A Revolution Pt. II (BoR). BoR seeks to combine historical data analysis with real time sentiment analysis in order to create a more robust social unrest prediction model than a unimodal approach that is the standard in academic circles.
I detail the build process of a model that solely uses historical data for unrest prediction. I then detail a real-time model creation as well as one that relies on 7-day tweet data.
Lastly I detail the building of a visualization that allows anyone to view the unrest data, interact with it, and glean new insights by looking at time series data in tandem with real-time streaming information.
While this project begins an attempt at unrest prediction, social unrest like all human behavior involves incredibly complex mechanisms and is inherently difficult to predict. I think there is a lot of work left to do, and a long way to go but I believe BoR begins to scratch the surface. I plan to continue working on this project into the summer and introducing several new streams of data into the analysis such as Tumblr, Facebook and Instagram.
In closing, I would like to thank my advisor Guy Wolf, for his unending patience, and my roommates at Yale for putting up with me constantly showing them the visualization and asking for their feedback on it. Thank you all.
CS + Econ, Yale 2018