Data
Download this data file to your computer: Messy data
About the data
The data for this lesson is part of a draft Data Carpentry Digital Humanities workshop.
Twitter data spanning a period of three and a half months was downloaded. Each Tweet containing the hashtag #digitalhumanities during the period 1 June 2014 - 15 September 2014 was captured in a spreadsheet. The original dataset is available from Figshare and was created by Ernesto Priego using software that downloaded the Tweets through the Twitter API. The spreadsheet contains information about the handle of the Tweeter, the Tweet text, the time and date of the Tweet, the language, the platform from which the Tweet was sent, and more. Because the original dataset was generated by software, it is already in a relatively clean, machine-readable format.
To illustrate some of the common formatting and other mistakes that researchers can make when collecting data in spreadsheets, we have created a messy version of the original dataset. Our story is that we had two student assistants who collected the data at separate times and in slightly different formats in their spreadsheets.
The data in this lesson is a subset of the original dataset that has been intentionally ‘messed up’.
The data for this lesson and the workshop is available for reuse under a CC-BY license.
Software
To interact with spreadsheets, we can use LibreOffice, Microsoft Excel, Gnumeric, OpenOffice.org, or other programs. Commands may differ a bit between programs, but the general ideas for thinking about spreadsheets are the same.
For this lesson, if you don’t have a spreadsheet program already, you can use LibreOffice. It’s a free, open-source office suite that includes a spreadsheet program, Calc.