A Brief-ish Introduction to Reproducible Research
What is reproducible research?
Research is reproducible when published results can be replicated using data, code, and instructions provided by the authors of a scientific analysis without any additional information.
Why do reproducible research?
(1) It helps you! Need to explain what you did to an advisor or collaborator? Reproducible research helps you remember what you did and helps others figure out what you did. Need to reproduce or alter a figure or analysis because an advisor, collaborator, or reviewer requested it? Reproducible research allows you to do that simply and easily. Need to start a new project that involves tasks identical or similar to ones you've already done? Reproducible research makes this much easier. Accused of malpractice because of a mistake in your analyses? Reproducible research helps protect you from such accusations. Want your work to be cited more? Reproducible research has been associated with higher paper citation rates and allows other researchers to cite your code and data.
(2) It helps others! Others can learn from your work. Science has a steep learning curve. Allowing others to access your data and code gives them a head start on performing similar analyses. Others can reproduce your work. Science is an iterative process. Allowing others to access your data and code makes it easier to perform subsequent studies that provide rigorous evidence for a phenomenon. Others can protect themselves from your mistakes. Mistakes happen in science. Allowing others to access your data and code gives them a chance to catch mistakes and provides protection for workplaces, journals, funding bodies, and others who may be affected when mistakes happen. Others can build on your work. Science is a team effort. Allowing others to access your data and code gives them more tools to advance science, whether they are collaborators waiting for you to finish an analysis or colleagues hoping to address unanswered questions in your work.
Why isn’t all research reproducible?
(1) Complexity. Science is hard, and it takes specialized (and often proprietary) knowledge and tools (both software and hardware) that may not be available to everyone. For example, genomic analyses require lots of arcane knowledge about the molecular architecture of DNA; analyses that rely on high-performance computing clusters may rely on different programming languages, hardware configurations, or software packages; and analyses performed in SAS require that users have thousands of dollars to pay for a license.
(2) Technological Change. Hardware and software both change over time, and they often change quickly. When old tools become obsolete, research that depended on them becomes less reproducible. For example, even though we have much greater software and hardware capabilities today, reproducing physics research performed in the 1960s would require a completely new set of tools that would take time to set up.
(3) Human error. Science is performed by fallible humans. People gradually forget how they did things over time, they speed up analyses because they just want to be done with a project that feels like it is taking forever to complete, they want to avoid getting "scooped" or having their data exploited by other people, and some (though very few!) even operate in bad faith and want to hide their research so no one finds out that their fancy paper is fraudulent.
How to do reproducible research
Reproducibility starts in the planning stage. It’s not as simple as posting data and code online after a project is done. After all, the person who will benefit most from reproducible research is you! (Note: This document focuses specifically on small-ish data--data sets that contain anywhere from a few to a few million observations. Big data is inherently less reproducible due to its complexity and is probably outside the scope of most people reading this. This document also focuses on tips and tools associated with R rather than other statistical programs. However, many of these tips and tools will also be useful to people who work with big data or use other programming languages.)
Step 1: Data Management
Reproducible research starts with sound data management practices. After all, it's really difficult to reproduce research if your data is a mess or disappears mysteriously. A few best practices are listed below, grouped by theme.
i: Data Storage and Format - (a) Store data files in useful, flexible, non-proprietary formats. Paper copies of data sheets, while useful as a last resort, take a lot of work to whip into shape. Likewise, storing data in a proprietary computer program may create a bunch of work for you when that program changes or its company goes out of business. Storing data in .csv or .txt files is almost always the best way to go. (b) Make multiple copies of data, even before you clean it up. Store the copies in different places and on different storage media (e.g., an external hard drive + cloud storage). Computers break, hard drives get stolen, and servers get hacked--don't leave yourself vulnerable to those events. Bonus tip: store copies of raw data before cleaning it up. It's entirely possible to irrevocably compromise or alter your data while cleaning it--you don't want to be in that situation either. (c) After you store your raw data, clean it up and store multiple copies of the tidy data set. Tidy data is in long format (i.e., variables in columns, observations in rows), has a consistent data structure (e.g., doesn't mix character data with numeric data for a single variable), and has appropriately formatted and informative headers (e.g., reasonably short variable names that do not include problematic characters like spaces, commas, and parentheses).
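The raw-then-tidy workflow described above can be sketched in base R. This is only an illustration under stated assumptions: the data set, the column names, and the file names (counts_raw.csv, counts_tidy.csv) are hypothetical, and a temporary folder stands in for real storage locations.

```r
# Hypothetical wide-format raw data: one row per site, one column per year.
raw <- data.frame(
  site = c("A", "B"),
  count_2019 = c(12, 7),
  count_2020 = c(15, 9)
)

# A temporary folder stands in for a real data folder in this sketch.
data_dir <- file.path(tempdir(), "data_example")
dir.create(data_dir, showWarnings = FALSE)

# 1. Save an untouched copy of the raw data before any cleaning.
raw_path <- file.path(data_dir, "counts_raw.csv")
write.csv(raw, raw_path, row.names = FALSE)

# 2. Reshape to long format: one row per site-year observation.
tidy <- reshape(
  raw,
  direction = "long",
  varying = c("count_2019", "count_2020"),
  v.names = "count",
  timevar = "year",
  times = c(2019, 2020),
  idvar = "site"
)
rownames(tidy) <- NULL

# 3. Save the tidy copy under a different name so the raw file is never overwritten.
tidy_path <- file.path(data_dir, "counts_tidy.csv")
write.csv(tidy, tidy_path, row.names = FALSE)
```

Keeping the raw and tidy files separate means the cleaning step can always be rerun from the original data if something goes wrong.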
ii: Metadata – (a) Include informative metadata that explains how and why data was collected, what variable names mean, what confusing cell values mean, and any other helpful information. Data is useless if it's not clear what it actually means. (b) Locate metadata in a sensible location. A few rows of metadata above the data may work--it's easy to trim those off when data is read into R. A paired text file can also be a helpful way to store metadata. And remember, metadata includes manuscripts, reports, and lab notebooks--it's always good to keep these organized well so you can refer to them at a later date. (c) Make sure files have useful, informative names. It should be easy to tell what's in a file from its name, and a consistent naming protocol lets you embed useful information (like date created or version number) that helps even more when you're searching through files.
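As a rough sketch of the "metadata rows above the data" approach, the example below writes a small file whose metadata lines start with "#" and then reads it back into R. The file name, variables, and metadata text are all made up for illustration; marking metadata lines with "#" is one possible convention, not the only one.

```r
# Write a small example file: metadata lines start with "#", then the data.
csv_path <- file.path(tempdir(), "plot_survey_2020-06-01.csv")
writeLines(
  c(
    "# Collected by: field crew during the summer survey",
    "# height_cm: plant height in centimeters; NA means the plant was not found",
    "plot_id,height_cm",
    "P1,10.5",
    "P2,NA",
    "P3,8.2"
  ),
  csv_path
)

# Keep the metadata lines so they can be consulted or printed later.
all_lines <- readLines(csv_path)
metadata <- all_lines[startsWith(all_lines, "#")]

# read.csv() does not treat "#" as a comment by default, so enable it explicitly.
survey <- read.csv(csv_path, comment.char = "#")
```

The same idea works with a fixed number of metadata rows and the skip argument of read.csv(), as long as the number of rows is documented.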
iii: File Organization – (a) Organize your files in a sensible, user-friendly structure. Files are not useful when you can't find them. Bonus tip: organize files in small blocks of similar files. It's fine to have a master list somewhere, but keep all files used for a given project in a project folder that's easily located in the future. Also, make sure that files don't grow too large--unnecessarily huge files are harder to handle, and files with several different things inside them are harder to organize.
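One possible way to set up a project folder from within R is sketched below. The project name and the subfolder names are assumptions chosen for illustration; any layout works as long as it is consistent across projects.

```r
# Hypothetical project root; tempdir() keeps this sketch from touching real files.
project_dir <- file.path(tempdir(), "plant_survey_project")

# One subfolder per kind of file.
subfolders <- c("data_raw", "data_clean", "code", "figures", "docs")
for (folder in subfolders) {
  dir.create(
    file.path(project_dir, folder),
    recursive = TRUE,
    showWarnings = FALSE
  )
}

# Build paths from the project root with file.path() instead of
# hard-coding absolute paths that exist on only one computer.
clean_data_path <- file.path(project_dir, "data_clean", "counts_tidy.csv")

# List the folders that were created.
list.files(project_dir)
```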
Step 2: Readable Code
After your data has been adequately wrangled, it's time to craft some tidy code. Software code does two things—it performs operations that you want it to do, and it serves as a log of what you did. Code is therefore inherently reproducible. Point-and-click programs, not so much.
i: Predictable Templates – (a) Code is most useful when it comes in a predictable template. A file that contains R code should include a description of what it does and who wrote it at the top, followed by small blocks that import data, packages, and external functions. Analytical code should follow those sections, and sections should be demarcated using a consistent protocol.
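A script laid out along these lines might look like the sketch below. The header fields, section markers, and data are placeholders rather than a required standard; the point is only that every script in a project follows the same order.

```r
# ------------------------------------------------------------------
# Title:   Summarize plant heights by plot (hypothetical example)
# Author:  Your Name
# Date:    YYYY-MM-DD
# Purpose: Import survey data and compute the mean height per plot.
# ------------------------------------------------------------------

# ---- Packages ----
library(stats)  # ships with R; listed only to show where packages go

# ---- External functions ----
# source("code/helper_functions.R")  # hypothetical file, left commented out

# ---- Data import ----
# A real script would read a file here, e.g. read.csv("data_clean/heights.csv").
heights <- data.frame(
  plot_id = c("P1", "P1", "P2", "P2"),
  height_cm = c(10, 12, 8, 9)
)

# ---- Analysis ----
mean_heights <- aggregate(height_cm ~ plot_id, data = heights, FUN = mean)
```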
ii: Code Comments – (a) Follow the Goldilocks Principle: comment your code thoroughly, but avoid redundant comments. Comments should contain enough information that it's easy for a stranger with adequate knowledge to understand what the code does, but not so much that it's a chore to sort through comments.
iii: Style – Following a clean, consistent coding style makes code easier to read and reduces the need for comments. (a) Use a consistent naming convention to name objects. Camel case (e.g., camelCase) and snake case (e.g., snake_case) are popular options. (b) Embed meaningful information in object names. For example, if you're working with a data set in matrix form, attach "_mat" to the object name. This will serve as a visual reminder of an important property of the object. (c) Use indentation, but keep it consistent. This will also convey important information (e.g., nestedness) without requiring comments. (d) Write code in relatively short lines. Our brains process narrow columns of data much more easily than longer ones. (e) Group code in blocks, keeping related tasks together but separating new ones. These function like paragraphs to make code more comprehensible. (f) Avoid long blocks of code. It's more difficult to understand what's going on when code stretches on too long without interruptions. Avoid excessive nesting for the same reason. (g) Bonus tip: Many well-known organizations offer code style guidelines that were developed by many expert coders. Take advantage of these, but keep in mind that all style guides are subjective to some extent. Develop a style that works for you.
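Several of these style points can be seen together in a short sketch. The object names and the threshold of 10 are invented for the example, and the style shown is just one reasonable choice.

```r
# Snake case throughout; the "_mat" suffix flags that the object is a matrix.
count_mat <- matrix(c(3, 0, 5, 2, 4, 1), nrow = 2)

# Block 1: summarize each row of the matrix.
row_totals <- rowSums(count_mat)

# Block 2: label each row. Indentation shows what sits inside the loop
# and what sits inside the if/else.
row_labels <- character(length(row_totals))
for (i in seq_along(row_totals)) {
  if (row_totals[i] > 10) {
    row_labels[i] <- "high"
  } else {
    row_labels[i] <- "low"
  }
}
```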
iv: Ease of Use - (a) Automate repetitive tasks. For example, if you find yourself using a custom function a lot, save the function as an external file and load it at the top of your code file. On the same note, use loops to make code more efficient. Both of these will prevent mistakes, since anyone using the code won't be editing so many commands. (b) Remove temporary objects as you go. Objects in R are stored in memory while the session is live, so having too many active objects can take up lots of memory. Code users running a script on a computer with less memory than the code author may therefore run into problems reproducing code. (c) When possible, use popular and well-maintained packages. These are much more likely to be kept current in the future and to have a bounty of helpful documentation.
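The sketch below illustrates points (a) and (b) under a few assumptions: the function name standard_error and its file name are hypothetical, and the file is written to a temporary folder only so the example is self-contained. In a real project the function file would already exist in the project folder.

```r
# Stand-in for a function file that would normally live in the project,
# for example a hypothetical "code/standard_error.R".
function_path <- file.path(tempdir(), "standard_error.R")
writeLines(
  "standard_error <- function(x) sd(x, na.rm = TRUE) / sqrt(sum(!is.na(x)))",
  function_path
)

# Load the external function at the top of the script.
source(function_path)

# Loop over columns instead of copying and editing the same command.
measurements <- data.frame(
  height_cm = c(10, 12, 14),
  width_cm = c(4, 6, 8)
)
se_values <- numeric(ncol(measurements))
names(se_values) <- names(measurements)
for (column_name in names(measurements)) {
  se_values[column_name] <- standard_error(measurements[[column_name]])
}

# Remove temporary objects that are no longer needed.
rm(column_name, function_path)
```

Because the calculation lives in one function, a correction made there applies to every column automatically.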
v: Version Control - (a) Practice version control. Version control allows coders to keep multiple versions of code so that they can revisit older versions later. This can be extremely helpful when code is accidentally altered, when packages or functions are updated, or when other users are running older versions of software. It also has many of the same benefits as the data management practices listed above.
Step 3: Sharing Research
Now comes the fun part! Or at least the part most people associate with reproducible research--sharing research with others. However, as should be clear by now, sharing the data and code is far from the only component of reproducible research, and once Steps 1 and 2 above are followed, it's also the easiest step. There are many ways to do this, several of which are described below.
i: Scientific Papers - Scientific papers are the original tool for reproducible research. After all, they're supposed to be thorough records of exactly what researchers did and exactly what they found. They're essentially wordy metadata. And now that articles are deposited and primarily read online, it's fairly easy to submit R code and data with your manuscripts. These days, code and data that can be used to replicate research are often found in the supplementary material of journal articles. Some journals (like eLife) are even experimenting with embedding data and code in articles themselves. A couple of things to keep in mind, though: (a) Supplementary materials can be lost if a journal switches publishers or when a publisher changes its website. Journal articles are not totally reliable repositories for code and data, because publishers are focused on the articles, not supplementary materials. (b) Research is only reproducible if it can be accessed, and many journals are locked behind paywalls that make them less accessible. To make research accessible to everyone, you may have to use other tools like preprints and data/code repositories.
ii: Pre-print/Post-print Repositories - A pre-print is a scientific paper that is archived in a publicly available repository before peer review. A post-print is a scientific paper that is archived in a publicly available repository after peer review. Both are usually restricted by journal policies (for example, many journals do not allow the final formatted PDF to be deposited), but both allow journal articles to be freed from paywalls. Data and code can be archived in the supplementary materials in these repositories as well, although their capabilities are often more limited than those of journals.
iii: Personal Websites - Personal websites are increasingly used by scientists to share their work. They're also relatively easy to create and often free (as long as you're willing to put up with a sponsored URL and annoying ads). Data and code can be easily archived and made publicly accessible, and this option gives you complete control over what code and data are made available and how it's presented. One downside: code and data are only available as long as you maintain your website, and are difficult to transfer if you want to move to another domain or website provider.
vi: Figshare - Figshare is yet another digital data repository. It offers essentially the same services as Zenodo, but users can pay for extra privileges and options. It is well integrated with several scientific publishers (including PLOS).
Step 4: Advanced Tools
If you want to go really wild with reproducible research, there are many more advanced tools. I have listed a few of these below with brief descriptions.
i. Make - Make is a software tool that can be used to automate analyses, even when files within those analyses have changed. It's super complicated to explain thoroughly, but it essentially enables users to turn a set of related analyses into one cohesive package or program.
ii. LaTeX - LaTeX is a document preparation system that can be used to integrate analyses into scientific documents. It can automate updates in manuscript drafts and can be configured to allow document readers to see the code used to perform analyses and create figures. Overleaf is an online software program that is commonly used to create and edit LaTeX documents and allows collaborative editing of those documents.
iii. Open Science Framework - Open Science Framework is a project management repository that combines the repository features of Dryad/Zenodo/Figshare with collaborative tools. It is integrated with many reproducible research programs, including widely used pre-print servers, version control software, and publishers.
iv. GitHub - GitHub is a software development platform built around Git, a widely used version control system. It is a standard tool in software development and computational research.