Behind the scenes: Investigating the Investigators

The hottest topic on Australian researchers’ lips of late has been the recent NHMRC Investigator Grant announcement. I recently set about flexing my data-science muscles to see what the outcomes of the scheme were overall, and what a successful application might look like for the next round. If you haven’t come across it yet, you can read more here.

Below is a behind-the-scenes look at what went into the data analysis for this post, and what I learnt along the way. I have also released all the code and results as a GitHub repository for those who like the real nitty-gritty details or want to extend the analysis.

Raw data

The raw data used in this analysis came from three main sources.

The first was the NHMRC grant outcomes website, which offers spreadsheet summaries for grants released since 2013. I chose to use only the data from 2015 onwards, for a few reasons: (1) the structure of the Fellowship system appears to have changed in 2014 to the ECF, CDF and RF layout which remained in place until 2018, meaning the 2013 data correlated poorly with the more recent datasets; (2) the 2014 dataset did not break down gender, age and state in enough detail to be compared easily with the following years; and (3) five years seemed like a nice time period to work with!

The second source of data was the Field of Research (FoR) codes used to classify research. You can find the complete list at the Australian Bureau of Statistics. I struggled to find an easily downloadable version, and instead copied them from the University of Melbourne intranet. With a little post-processing (sketched below), I had a fully functional list of each level of classification, which I could then use to understand which types of research were popular for funding.
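
For the curious, that post-processing boiled down to something like the following sketch. The file name and layout here are stand-ins for my copied list, not a real download:

```python
import pandas as pd

# Load the copied list of FoR codes. "for_codes.txt" is a stand-in for
# the file I assembled, with one "code<TAB>name" pair per line.
codes = pd.read_csv("for_codes.txt", sep="\t", names=["code", "name"], dtype={"code": str})

# FoR codes encode their hierarchy level in the code length:
# 2 digits = division, 4 digits = group, 6 digits = field.
codes["level"] = codes["code"].str.len().map({2: "division", 4: "group", 6: "field"})

# Tag every group and field with its parent division for easy aggregation.
codes["division"] = codes["code"].str[:2]
```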

The last source of data was SciVal, which I used to gather the number of research publications and average Field-Weighted Citation Impact (FWCI) for each awardee in the ten years prior to their year of award. This was somewhat of a manual process, and I used the ‘best match’ profile for each awardee imported into SciVal. Overall, 88% of the awardees were matched accurately (and this could be increased with a little manual curation). I also did a little digging around in the PubMed Central API using a Python package (see the resources list below for more information) to batch-query the author names and collect their publication histories, to compare with the matches generated by SciVal.
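
To give a flavour of the batch-querying step, here is a minimal sketch using Biopython’s Entrez module – one option among several, and the author-name format and date window are illustrative assumptions rather than my exact queries:

```python
from Bio import Entrez  # Biopython's wrapper around the NCBI E-utilities

Entrez.email = "you@example.com"  # NCBI asks for a contact address

def publication_count(author, start_year=2009, end_year=2018):
    """Count PubMed records for an author over a ten-year window.

    Assumes `author` is in PubMed's "Lastname Initials" format,
    e.g. "Smith JA" - name formatting is exactly where this gets messy.
    """
    term = f'{author}[Author] AND ("{start_year}"[PDAT] : "{end_year}"[PDAT])'
    handle = Entrez.esearch(db="pubmed", term=term, retmax=0)
    result = Entrez.read(handle)
    handle.close()
    return int(result["Count"])

print(publication_count("Smith JA"))  # illustrative name only
```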

Processing and analysis

After initial cleaning of the raw data, I equated the new and old schemes by matching their tiers. Although the correspondence is imperfect (due to changes in eligibility between the old and new schemes), this resulted in Early Career Fellowships mapping to Emerging Leader 1, Career Development Fellowships to Emerging Leader 2, and Research Fellowships to Leadership Fellowships.
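
In code, this equivalence is just a dictionary lookup – something along these lines, where the column names are assumptions about the cleaned data:

```python
import pandas as pd

# Illustrative stand-in for the cleaned outcomes table.
grants = pd.DataFrame({"grant_type": ["Early Career Fellowship", "Emerging Leader 1"]})

# Map old-scheme fellowship tiers onto their new-scheme equivalents.
tier_map = {
    "Early Career Fellowship": "Emerging Leader 1",
    "Career Development Fellowship": "Emerging Leader 2",
    "Research Fellowship": "Leadership",
}

# New-scheme rows already carry their tier name, so they pass through unchanged.
grants["tier"] = grants["grant_type"].map(tier_map).fillna(grants["grant_type"])
```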

Lessons learnt

Data is messy. This was evident in all of the raw data I collected – naively, I expected simple-to-use spreadsheets from the NHMRC. At the very least, I was anticipating consistent formats across the most recent five years. What I was greeted with instead was a complicated series of tables designed for visual interpretation by human eyes, not for easy programmatic scraping. The initial data cleanup took more than half of the total analysis time.
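
To give a sense of what that cleanup involved, here is a hedged sketch of the kind of pandas incantations required. The file name, layout and column names are stand-ins, not the real spreadsheets:

```python
import pandas as pd

# Spreadsheets laid out for human eyes tend to bury the real table
# under title rows, merged headers and trailing footnotes.
raw = pd.read_excel("nhmrc_outcomes_2018.xlsx", skiprows=4)  # skip title/notes rows

raw = raw.dropna(how="all")                          # remove blank spacer rows
raw.columns = [str(c).strip() for c in raw.columns]  # trim whitespace from headers
raw = raw[raw["State"].notna()]                      # drop footnote rows (column name assumed)
```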

Through this process, it became clear to me that, as a general rule, people do not handle, label or store data well – even those working in science-oriented roles. In the increasingly data-driven world we live in, we would all benefit from improving our data hygiene.

Lastly, while the trends I saw and commented on are indeed interesting, they should be interpreted cautiously. The data that the NHMRC provides is somewhat fragmented (to protect the privacy of successful applicants). More importantly, the data it does provide focuses on successful applications. There are lots of important details about the makeup of the broader applicant pool that we don’t see, and this is important – albeit missing – context for interpreting the trends I highlighted.

Similarly, publication-history searches are tricky. Pay-walled publication information is a nightmare to access, and although PubMed searches are a reasonable free option, they are limited. Getting standard metrics such as h-indices and field-weighted citation impacts requires access to subscription services. Moreover, people’s names are difficult! Even once you have access to the databases, it can be hard to know whether or how to split a given name into first/last parts, and there is almost no chance of avoiding manual curation if you want a complete dataset.

Tricks and tools of the trade

As this was my first dedicated data-science-style project, I quickly ran into a few questions.

When I wanted to plot the per-state distribution of applications and successful awardees, the most obvious visualisation was a map. I’d never plotted a map before, and after a quick Google search I found myself asking: what on earth is a choropleth, and where do I find a shapefile? It turned out a choropleth is what I wanted to make – a thematic map in which areas are shaded or patterned in proportion to the measurement variable being displayed. And to do this, you need a shapefile – a vector data format for storing the location, shape and attributes of geographic features. Luckily there are a few great tutorials, and eventually I found the geospatial data I was looking for. Amazingly, using geopandas meant dealing with this type of data relied on many of the skills I already had, and before I knew it – voilà! – one map of Australia, complete with colour-mapped and labelled data.
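
If you want to try it yourself, a minimal geopandas choropleth looks something like this. The shapefile name and its `STE_NAME16` column are my assumptions about the ABS 2016 boundaries, and the award counts are illustrative only:

```python
import geopandas as gpd
import matplotlib.pyplot as plt

# Load the ABS state/territory boundaries shapefile (file name assumed).
states = gpd.read_file("STE_2016_AUST.shp")

# Illustrative counts only - not the real award numbers.
awards = {"New South Wales": 120, "Victoria": 110, "Queensland": 45}
states["awards"] = states["STE_NAME16"].map(awards)

# Shade each state in proportion to its count: the choropleth itself.
ax = states.plot(column="awards", cmap="viridis", legend=True, figsize=(8, 6))
ax.set_axis_off()
ax.set_title("Successful awardees by state")
plt.show()
```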

While handling the somewhat messy task of linking successful awardees to their publication track records, I came across the problem of slight variations between naming formats. How do you match text that is fuzzy, and how does Levenshtein help? It turns out that Python has a whole range of functions, via the FuzzyWuzzy package, for comparing strings that refer to the same thing but are written slightly differently. The simplest of these makes use of the Levenshtein distance, named after Vladimir Levenshtein, who originally considered the problem in 1965. This metric measures how far apart two strings are according to the minimum number of single-character edits – insertions, deletions or substitutions – needed to change one into the other. One detailed tutorial later, and I was matching fellowship awardees to SciVal authors in no time.
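
A minimal version of that matching step looks something like this, where the names are stand-ins for the real data:

```python
from fuzzywuzzy import fuzz, process

# Illustrative stand-ins for the awardee and SciVal author lists.
awardees = ["Smith, Jane A", "Nguyen, T"]
scival_authors = ["Jane Smith", "Thanh Nguyen", "John Smythe"]

for name in awardees:
    # token_sort_ratio ignores word order, which helps with
    # "Lastname, First" versus "First Lastname" variants.
    match, score = process.extractOne(name, scival_authors, scorer=fuzz.token_sort_ratio)
    print(f"{name} -> {match} (score {score})")
```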

Finally, throughout the whole plotting process, I was conscious of my colour palette. I am a firm believer that good dataviz should be both functional and beautiful. With plenty of gender-based comparisons, you might wonder how I could go past the tried-and-true colour combination. So why didn’t I choose pink and blue? It turns out there are a few reasons why pink and blue for gender data is an ‘unawesome choice’ rooted in gender stereotypes (pink = girls = weak, cute; blue = boys = strong, bold). Luckily, there are plenty of good colour combinations that circumvent these issues.
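
Swapping in a stereotype-free palette is a one-liner in matplotlib – something like this, where the hex pairing and counts are just one example rather than necessarily what I used:

```python
import matplotlib.pyplot as plt

# A purple/green pairing carries none of the pink/blue baggage
# (hex values are just one example of a balanced pair).
palette = {"Female": "#7b3294", "Male": "#008837"}

counts = {"Female": 48, "Male": 52}  # illustrative percentages only
plt.bar(list(counts), list(counts.values()), color=[palette[k] for k in counts])
plt.ylabel("Successful applicants (%)")
plt.show()
```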

With these tools in hand, I had everything necessary to analyse and visualise the datasets at my disposal. To see these tricks and tools in action, don’t forget to check out the GitHub repository or head along to some of the resources listed below.

Resources

Wishlist: I haven’t had a chance to implement this functionality yet, but these dataviz tools are high on my to-try list to extend the accessibility of this dataset!


Disclaimer

The original analysis was intended to inform my personal decision of whether to apply for an Investigator Grant in the 2019 round. Any action you take as a result of this information is done at your own peril. If you do decide to act on this information, I wish you the best of luck whichever path you may choose. May the odds be ever in your favour.

That being said, I have of course aimed to be as unbiased and informative as possible. This is also my first foray into data-science-for-public-consumption, so if you notice any overt errors or bugs, feel free to raise an issue on GitHub or get in touch via Twitter and I will check it out as soon as possible.

Image credits: cadop via Unsplash