Indicator 4a, developed in the context of the BAPS Logframe, measures references in global summits to statistical development and/or data gaps.
This indicator measures progress towards the BAPS action area that aims to “ensure that outcomes of global summits and high-level forums specifically recognise the need for statistical capacity development”. See the BAPS document.
Definition of Global Summits
A first step towards identifying the data source for this indicator is the definition of global summit. A summit is a meeting between heads of governments (The Oxford Dictionary). For a summit to be global, it must bring together governments from around the world.
Following this definition, global summits can be characterised by exchanges between the participating governments. The events closest to this definition are the G20 summits that bring together governments from the 20 major economies. The G20, while closely fitting the concept of a global summit, can be criticised for not being truly global as discussions are driven by a rather select group of governments.
The approach proposed here is therefore to focus on intergovernmental organisations whose membership comprises the widest possible set of sovereign states. At the global level, these are the United Nations and, for each sectoral focus, the FAO, IFAD, ILO, IMF, IMO, ITU, UNDP, UNESCO, UNIDO, WHO, World Bank, WMO, and WTO. These agencies bring together all their member states to discuss and set norms and policies.
2. Data sources
A. USE INFORMATION FROM UN AGENCIES TO SCREEN FOR GLOBAL SUMMITS
List of global summits
A natural second step is the identification of suitable outcome documents from UN specialised agencies that cover global summits and related discussions. One option is to draw up an exhaustive list of global summits. Immediate concerns with this approach are that the list would change annually and that some of the largest summits are only organised every two years, which would introduce a lot of undesirable variation into such a measure.
Output documents from UN Agencies
The method proposed here is therefore to consider more frequent output documents from UN Specialised Agencies.
Use of direct output from UN agencies has two main advantages:
- First, it covers all relevant sectors at a global level.
- Second, and more crucially, this approach does not require the Secretariat to identify such events itself.
- Instead, the Secretariat can assume that agencies follow and report on topics discussed in the most relevant events in their sector.
B. COLLECT HIGH FREQUENCY TWITTER DATA
Another option is the agencies' website content, which can be captured through RSS feeds, for example. However, not every site has an active RSS feed, and some (e.g. the World Bank) have dozens of feeds to subscribe to.
- It is thus proposed to use a standardised and frequent source of outcome documents: the official Twitter accounts of the agencies and their daily tweets, endorsements and retweets [1].
- Tweets are particularly suitable because they cover the most relevant sectors and are published in an easily accessible format on a daily basis.
- The proposed indicator thus uses data extracted from the Twitter timelines of these UN specialised agencies to measure the extent to which global summits include references to statistical capacity development and data gaps.
- These tweets are currently restricted to 280 characters, making them easy to analyse.
The methodology proceeds in four steps, followed by a final trend analysis.
- In a first step, it extracts hashtags for the current year for the 15 UN Specialised Agencies as well as for 20 Statistical Agencies. The latter are taken from the website of the Global Partnership for Sustainable Development Data (GPSDD) and include PARIS21, ODW, World Bank Data, FAO Stats, etc.
- In a second step, the methodology creates a list of hashtags related to statistical development and data gaps. These are hashtags that occur at least three times [2] more often in tweets of Statistical Agencies than they do for UN Specialised Agencies.
- Next, we take the most frequent hashtags (the top 75%) and calculate the relative frequency of their occurrence for each of the 15 UN Specialised Agencies.
- In the fourth step, we calculate the non-weighted average over these relative frequencies for the 15 agencies.
- Finally, we show the trending history of top 6 hashtags used by those agencies.
A. EXTRACT HASHTAGS
First, the Twitter timelines of the 15 UN Specialised Agencies are downloaded to identify hashtags related to statistical development that were used during global summits in the current year.
Then, the same process is applied to 20 accounts that tweet exclusively about issues related to statistical capacity development, i.e. the accounts of Statistics Think Tanks.
General agency accounts such as WorldBank or FAO are not included in this group.
Hashtags used by Stats Think Tanks are for instance: #SDGs, #SDG, #Data, #BigData, #DataRev, #DataRevolution, #OpenData, #Statistics, #PPPs, #GlobalGoals, #Post2015, #Development, #MDGs, #data4sdgs
Major Stats Think Tanks are catalogued based on the Global Partnership for Sustainable Development Data’s website. The database includes tweets from 15 UN agencies and 20 Stats Think Tanks. Twitter limits retrieval to the 3200 most recent tweets per account. We therefore update the timeline dataset of each agency every month to capture the activity of its account.
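Because each download returns only the most recent tweets of an account, the monthly updates have to be merged with what is already stored, without counting any tweet twice. The sketch below illustrates one way to do this. It assumes, purely for illustration, that each tweet is a dictionary with an `id` key; the function name `merge_snapshots` and the sample data are hypothetical and not taken from the actual collection code.

```python
def merge_snapshots(stored, new_snapshot):
    """Merge a fresh timeline snapshot into the stored tweets, keyed by tweet id.

    Both arguments are lists of dicts with an 'id' key (an assumed layout).
    Tweets already stored are kept; tweets seen for the first time are added.
    """
    by_id = {tweet["id"]: tweet for tweet in stored}
    for tweet in new_snapshot:
        by_id.setdefault(tweet["id"], tweet)
    # Sort by id so that the result is deterministic.
    return sorted(by_id.values(), key=lambda tweet: tweet["id"])


# Hypothetical snapshots from two consecutive monthly updates.
january = [{"id": 1, "text": "a"}, {"id": 2, "text": "b"}]
february = [{"id": 2, "text": "b"}, {"id": 3, "text": "c"}]
merged = merge_snapshots(january, february)
print([tweet["id"] for tweet in merged])
```

Keying on the tweet id means that a tweet appearing in two overlapping snapshots is stored only once.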
Identification of hashtags referring to statistical development or data gaps
After loading the timelines of the United Nations Specialised Agencies, we extract the hashtags, id and creation date of each tweet and create a data frame with these observations. All the information extracted from the UN agencies’ tweets is compiled in a single data frame. This exercise is repeated for stats think tanks.
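The extraction step can be sketched as follows. This is a minimal illustration rather than the code actually used: it assumes each tweet is a dictionary with hypothetical keys `id`, `created_at` and `text`, finds hashtags with a simple regular expression, and returns one row per hashtag instead of a full data frame.

```python
import re
from datetime import datetime

# A hashtag is taken to be '#' followed by word characters (a simplification).
HASHTAG_RE = re.compile(r"#\w+")


def extract_hashtags(tweets, account):
    """Return one row per hashtag with the account, tweet id and creation date."""
    rows = []
    for tweet in tweets:
        for tag in HASHTAG_RE.findall(tweet["text"]):
            rows.append({
                "account": account,
                "id": tweet["id"],
                "created_at": tweet["created_at"],
                "hashtag": tag.lower(),  # lower-cased so #Data and #data match
            })
    return rows


# Hypothetical tweets, for illustration only.
sample = [
    {"id": 101, "created_at": datetime(2016, 7, 11),
     "text": "New #Data for the #SDGs at #HLPF2016"},
    {"id": 102, "created_at": datetime(2016, 7, 12), "text": "No tags here"},
]
rows = extract_hashtags(sample, "UNESCO")
print([row["hashtag"] for row in rows])
```

Running the same function over the stats think tank timelines gives the second set of rows used in the comparison.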
B. SPOT REFERENCES TO STATISTICAL DEVELOPMENT USING HASHTAGS SEQUENCES
We thus consider a tweet to make reference to statistical development or data gaps if it contains hashtags from this keyword list related to statistical capacity building.
This keyword list was compiled and extended by comparing the odds ratios of hashtags used in the timelines of (A) stats think tanks with those of (B) other international organisations:
##               hash freq      odds
##  96          #data   63 16.180250
## 112          #sdg4   58  8.104335
## 119 #education2030   54  3.972765
## 199      #hlpf2016   37  4.025425
## 198     #globaldev   37  3.028301
## 213           #sdg   35  5.895163
## 220       #bigdata   34  8.560273
## 252       #science   31  4.231521
## 301      #opendata   26 99.749289
## 407          #bees   18  3.871548
## 435       #dataviz   17 37.456220
## 458     #data4sdgs   16 28.780403
## 475        #wd2016   16 16.482545
## 499       #sofia16   15  4.281476
## 531          #rice   14  3.220867
## 564           #ods   13  4.414616
## 583        #canada   12  4.099286
## 636      #learning   11 14.782273
## 621       #culture   11  4.099286
Stats hashtags are identified in a two-stage process:
- First, the frequency of each hashtag used by UN agencies is computed. This exercise is repeated for stats think tanks.
- Secondly, a list of hashtags related to statistics is determined by comparing the odds ratios of hashtags.
- The odds ratio of a hashtag is its relative frequency among all hashtags used by stats think tanks divided by its relative frequency in the Twitter timelines of UN agencies.
- Mathematically, for a hashtag $h$ this means:

$$\text{odds}(h) = \frac{n_{\text{stats}}(h) / N_{\text{stats}}}{n_{\text{UN}}(h) / N_{\text{UN}}}$$

where $n_{\text{stats}}(h)$ and $n_{\text{UN}}(h)$ are the number of times $h$ occurs in the timelines of stats think tanks and UN agencies respectively, and $N_{\text{stats}}$ and $N_{\text{UN}}$ are the total numbers of hashtags in each group.
- Odds ratios are cut off at 3, so that we only keep the relevant hashtags in the UN agencies' timelines that are in the 25th percentile of the stats agencies' odds ratios. Hashtags that occur at least three times more often in tweets of stats agencies than they do for UN agencies are considered related to statistical capacity development.
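The two-stage process above can be sketched in a few lines. This is an illustration under stated assumptions, not the code behind the results: the function names and the sample hashtags are hypothetical, each group is given as a flat list of hashtags, and only hashtags that occur in the UN agencies' timelines are scored, since the ratio is undefined for hashtags the UN agencies never use.

```python
from collections import Counter


def odds_ratios(stats_tags, un_tags):
    """Relative frequency among stats think tanks over relative frequency among UN agencies."""
    stats_counts = Counter(stats_tags)
    un_counts = Counter(un_tags)
    n_stats, n_un = len(stats_tags), len(un_tags)
    # (stats_count / n_stats) / (un_count / n_un), rearranged into a single division.
    return {
        tag: (stats_counts[tag] * n_un) / (un_count * n_stats)
        for tag, un_count in un_counts.items()
    }


def select_stats_hashtags(stats_tags, un_tags, cutoff=3.0):
    """Keep the hashtags whose odds ratio is at least the cut-off (3 in this note)."""
    ratios = odds_ratios(stats_tags, un_tags)
    return {tag: ratio for tag, ratio in ratios.items() if ratio >= cutoff}


# Hypothetical hashtag lists, for illustration only.
stats_tags = ["#data"] * 6 + ["#sdgs"] * 2 + ["#opendata"] * 2
un_tags = ["#data"] * 1 + ["#sdgs"] * 2 + ["#health"] * 7

ratios = odds_ratios(stats_tags, un_tags)
selected = select_stats_hashtags(stats_tags, un_tags)
print(selected)
```

In this toy example #data is six times as frequent among stats think tanks as among UN agencies and is kept, while #sdgs (ratio 1) and #health (ratio 0) fall below the cut-off.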
C. UN AGENCIES TWEETS: TOP HASHTAGS RELATED TO STATISTICS
Then, we take the most frequent hashtags (the top 75%) used by the 15 UN Specialised Agencies.
The above word cloud shows that #sdgs, #bigdata and #machinelearning are the three most frequent hashtags, which supports the assumption that the tweets of UN specialised agencies carry information on statistical development.
Relative frequency of hashtags related to statistics used per UN agency
Then, the relative frequency of hashtags related to statistics is computed for each UN agency, based on the hashtags identified during the first phase using odds ratios.
The relative frequency of hashtags related to statistical development for each agency is simply the number of statistics-related hashtags used by that agency divided by the total number of hashtags it used.
This relative frequency of hashtags related to statistics is fairly low among all United Nations Specialised Agencies.
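A minimal sketch of this per-agency calculation is given below. It assumes, for illustration only, that the hashtag data are available as `(account, hashtag)` pairs and that the statistics-related hashtags are held in a set; the function name and the sample values are hypothetical.

```python
from collections import defaultdict


def share_of_stats_hashtags(rows, stats_tags):
    """For each account, the share of its hashtags that are in stats_tags."""
    total = defaultdict(int)
    related = defaultdict(int)
    for account, tag in rows:
        total[account] += 1
        if tag in stats_tags:
            related[account] += 1
    return {account: related[account] / total[account] for account in total}


# Hypothetical (account, hashtag) pairs, for illustration only.
rows = [
    ("WHO", "#health"), ("WHO", "#data"), ("WHO", "#ebola"), ("WHO", "#zika"),
    ("FAO", "#rice"), ("FAO", "#food"),
]
stats_tags = {"#data", "#opendata"}
shares = share_of_stats_hashtags(rows, stats_tags)
print(shares)
```

Here one of the four hashtags of the first account is statistics-related, giving a share of 0.25, while the second account uses none and gets a share of 0.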
Comparison UN agencies and stats think tanks: top 10 hashtags related to stats
D. Final Indicator and initial results
An initial analysis was undertaken based on 112 000 unique tweets containing a total of over 160 000 hashtags. The final indicator is the average over these relative frequencies for the 15 UN agencies. The relative frequency of hashtags related to statistical capacity development is currently about 1.13%. This means that, on average, around 1% of the hashtags used by a UN agency are related to statistical development and/or data gaps.
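The final aggregation is a plain, non-weighted mean of the per-agency shares, so every agency counts equally regardless of how much it tweets. The sketch below shows the arithmetic with three hypothetical agencies and made-up shares; it does not reproduce the 1.13% reported above.

```python
from statistics import mean


def indicator(shares):
    """Non-weighted average of the per-agency relative frequencies."""
    return mean(shares.values())


# Hypothetical per-agency shares of statistics-related hashtags.
shares = {"AgencyA": 0.02, "AgencyB": 0.01, "AgencyC": 0.00}
value = indicator(shares)
print(f"{value:.2%}")
```

A design consequence of the non-weighted mean is that an agency with few tweets influences the indicator as much as a very active one; a pooled share over all hashtags would weight agencies by volume instead.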
E. Trending of the top hashtags over time
The top 6 hashtags used by the statistical agencies are #genderdata, #data, #opendata, #sdgs, #sdg4 and #undataforum. Plotting their trending history shows that different hashtags evolve differently over time. The frequencies of some hashtags increase sharply during global summits and decrease afterwards. The frequencies of other hashtags, on the other hand, increase steadily over time.
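The trending history of a hashtag can be approximated by counting its occurrences per month. The sketch below is one possible way to do this, assuming the data are available as `(creation date, hashtag)` pairs; the function name, the monthly granularity and the sample dates are assumptions made for illustration.

```python
from collections import Counter
from datetime import date


def monthly_counts(rows, hashtag):
    """Number of occurrences of one hashtag per (year, month), in chronological order."""
    counts = Counter(
        (created.year, created.month)
        for created, tag in rows
        if tag == hashtag
    )
    return dict(sorted(counts.items()))


# Hypothetical (creation date, hashtag) pairs, for illustration only.
rows = [
    (date(2016, 6, 30), "#data"),
    (date(2016, 7, 11), "#data"),
    (date(2016, 7, 12), "#data"),
    (date(2016, 7, 12), "#sdgs"),
    (date(2016, 8, 1), "#data"),
]
trend = monthly_counts(rows, "#data")
print(trend)
```

A spike in a single month followed by a drop would correspond to the summit-driven pattern described above, whereas steadily rising counts indicate a longer-term trend.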
This methodology has several limitations.
- Tweets are a non-representative data source made available by a private company. The methodology is therefore subject to biases and depends on continued access to the Twitter API, which may not be sustainable in the long term.
- Difference between UN and country perspective: Although UN Specialised Agencies will by and large reflect the priorities of their member countries, they may not give a balanced view of the positions of countries’ heads of state.
- Difference between frequency and impact of topics: Some topics may trend on Twitter and even induce herding behaviour among the agencies without making it onto the agenda of high-level events.
- Weighting of summits: Some summits produce no real output, while others result in clear decisions. A future version of the methodology could aim to differentiate between these and reflect the difference in the form of a weighting.
5. Next Steps and Conclusion
Indicator 4a provides a unique measure of references in global summits to statistical development and/or data gaps. It identifies hashtags related to statistical development and computes the relative frequency of these stats hashtags per UN agency. It therefore makes it possible to observe the relative weight of each United Nations Specialised Agency in the total number of references to statistical development in global summits.
If approved by the Board, the Secretariat will continue the fully automated data collection for the indicator using the Twitter API and, by the 2017 Board Meeting, will have established a baseline and determined a target for 2018.
As part of the implementation, the analysis will feed into the development of a freely accessible, online-based results monitoring instrument that allows users to track and browse outcomes of all global summits and high level forums.
[1] A retweet is a forward of individual tweets by other users to their own feed.
[2] This odds ratio of three turned out to be a good cut-off value.