Project description
This week’s assignment involves the sort of data collection and manipulation that
you might have to do for a project of your own. It seems straightforward, but the
devil is in the details: there are a lot of land mines even the simplest data-gathering
exercise. The assignment is designed to point them out and help you walk around
them.
The Correlates of War Project (COW) produces one of the most widely-used
datasets in all of international relations: the National Material Capabilities Index
(NMCI). This index consists of annual values for total population, urban population,
iron and steel production, energy consumption, military personnel, and military
expenditure of all countries in the international system, from 1816-2007. In order
to calculate a country’s overall capabilities, COW takes the average of each country’s
share of system-wide resources in each of these six categories.
In this assignment, you will reproduce a more limited version of the NMCI
using present-day data.
1. Go to the Wikipedia pages for world population, GDP, and military expenditures.
?? For each page, extract the relevant data into a separate Excel spreadsheet.
(For population, use United Nations estimates.)
2. Get rid of all columns except country name and the quantity of interest (GDP,
population, and military expenditures, respectively). Add a first row and use
it for column names (“Country” and “GDP,” for example.)
3. Tidy up the data. Get rid of any extra characters. Select the key numeric
variables, go to the Format menu, choose “Cells…”, and convert them to
General—this gets rid of commas, dollar signs, percentages, etc., which are a
real pain when saved to a .csv file. When working on this step, beware of the
following problems:
??Wikipedia is convenient, but the data quality sucks. Never forget this. If this were more than an
exercise, you’d take the time to find better sources.
1
(a) Unless you used Outwit Hub in step 1, some of the country names will
have spaces in front of them. These are a pain to get rid of. The simplest
way is to copy them from the Excel spreadsheet into a text editor, use a
Find-and-replace command to get rid of them, and copy and paste them
back into Excel.
(b) Some names and numbers have Wikipedia footnotes (“b”, “[20]”) stuck
on the end. Outwit won’t save you from these. Deleting them by hand
in Excel is probably the fastest way to get rid of them.
(c) Wikipedia authors don’t always use the same names for the same countries,
so you need to make sure they’re consistent before merging them.
In particular, watch out for “China” and “People’s Republic of China,”
“Gambia” and “The Gambia,” and various combinations with “Congo”
and “Macedonia.”
4. Once you’re done tidying, save each dataset to a separate .csv file.
5. Create an R command file that does the following:
(a) Read each of the three .csv files into appropriately-named data frames
(for example, Pop, GDP, and MilEx).
(b) Summarize each of these data frames.
(c) In each data frame, create a new variable that reflects each country’s
percentage of the world total. (That is, divide the data column by the
world total.) Make sure that this new variable is part of the same data frame as
the original data column.
(d) Merge the three data frames. Match the data using country name.
(e) Create your version of the NMCI by calculating each country’s average
share of each of these three resources. If a country has 2% of the world’s
population, 1% of world GDP, and 4% of world military expenditures,
for example, its average share should be (2+1+4)/3 = 2.33%.
Your final dataset should have 8 columns: the name of the country, GDP (total
and percent of world), population (total and percent of world), military expenditures
(total and percent of world), and NMCI.
Due to Assignment 4 Drop-box: three .csv files and an R command file that creates
the data set described in 5., above.
This week’s assignment involves the sort of data collection and manipulation that
you might have to do for a project of your own. It seems straightforward, but the
devil is in the details: there are a lot of land mines even the simplest data-gathering
exercise. The assignment is designed to point them out and help you walk around
them.
The Correlates of War Project (COW) produces one of the most widely-used
datasets in all of international relations: the National Material Capabilities Index
(NMCI). This index consists of annual values for total population, urban population,
iron and steel production, energy consumption, military personnel, and military
expenditure of all countries in the international system, from 1816-2007. In order
to calculate a country’s overall capabilities, COW takes the average of each country’s
share of system-wide resources in each of these six categories.
In this assignment, you will reproduce a more limited version of the NMCI
using present-day data.
1. Go to the Wikipedia pages for world population, GDP, and military expenditures.
?? For each page, extract the relevant data into a separate Excel spreadsheet.
(For population, use United Nations estimates.)
2. Get rid of all columns except country name and the quantity of interest (GDP,
population, and military expenditures, respectively). Add a first row and use
it for column names (“Country” and “GDP,” for example.)
3. Tidy up the data. Get rid of any extra characters. Select the key numeric
variables, go to the Format menu, choose “Cells…”, and convert them to
General—this gets rid of commas, dollar signs, percentages, etc., which are a
real pain when saved to a .csv file. When working on this step, beware of the
following problems:
??Wikipedia is convenient, but the data quality sucks. Never forget this. If this were more than an
exercise, you’d take the time to find better sources.
1
(a) Unless you used Outwit Hub in step 1, some of the country names will
have spaces in front of them. These are a pain to get rid of. The simplest
way is to copy them from the Excel spreadsheet into a text editor, use a
Find-and-replace command to get rid of them, and copy and paste them
back into Excel.
(b) Some names and numbers have Wikipedia footnotes (“b”, “[20]”) stuck
on the end. Outwit won’t save you from these. Deleting them by hand
in Excel is probably the fastest way to get rid of them.
(c) Wikipedia authors don’t always use the same names for the same countries,
so you need to make sure they’re consistent before merging them.
In particular, watch out for “China” and “People’s Republic of China,”
“Gambia” and “The Gambia,” and various combinations with “Congo”
and “Macedonia.”
4. Once you’re done tidying, save each dataset to a separate .csv file.
5. Create an R command file that does the following:
(a) Read each of the three .csv files into appropriately-named data frames
(for example, Pop, GDP, and MilEx).
(b) Summarize each of these data frames.
(c) In each data frame, create a new variable that reflects each country’s
percentage of the world total. (That is, divide the data column by the
world total.) Make sure that this new variable is part of the same data frame as
the original data column.
(d) Merge the three data frames. Match the data using country name.
(e) Create your version of the NMCI by calculating each country’s average
share of each of these three resources. If a country has 2% of the world’s
population, 1% of world GDP, and 4% of world military expenditures,
for example, its average share should be (2+1+4)/3 = 2.33%.
Your final dataset should have 8 columns: the name of the country, GDP (total
and percent of world), population (total and percent of world), military expenditures
(total and percent of world), and NMCI.
Due to Assignment 4 Drop-box: three .csv files and an R command file that creates
the data set described in 5., above.