OSKB Collection Resources

Unrestricted Use

CC BY

The Biology Semester-long Course was developed and piloted at the University of Florida in Fall 2015. Course materials include readings, lectures, exercises, and assignments that expand on the material presented at workshops focusing on SQL and R.

Subject:: Applied Science; Biology; Computer Science; Information Science; Life Science; Mathematics; Measurement and Data
Material Type:: Module
Provider:: The Carpentries
Author:: Ethan White; Zachary Brym
Date Added:: 08/07/2020

Data Cleaning and Management Using OpenRefine

Conditional Remix & Share Permitted

CC BY-NC

Course materials on using OpenRefine, a powerful tool for cleaning and transforming tabular data.

Subject:: Applied Science; Life Science; Physical Science; Social Science
Material Type:: Activity/Lab
Provider:: New York University
Author:: Nick Wolf; Vicky Steeves
Date Added:: 02/12/2019

Data Cleaning with OpenRefine for Ecologists

Unrestricted Use

CC BY

A part of the data workflow is preparing the data for analysis. Some of this involves data cleaning, where errors in the data are identified and corrected or formatting made consistent. This step must be taken with the same care and attention to reproducibility as the analysis. OpenRefine (formerly Google Refine) is a powerful free and open source tool for working with messy data: cleaning it and transforming it from one format into another. This lesson will teach you to use OpenRefine to effectively clean and format data and automatically track any changes that you make. Many people comment that this tool saves them literally months of work trying to make these edits by hand.

Subject:: Applied Science; Computer Science; Information Science; Mathematics; Measurement and Data
Material Type:: Module
Provider:: The Carpentries
Author:: Cam Macdonell; Deborah Paul; Phillip Doehle; Rachel Lombardi
Date Added:: 03/20/2017

Unrestricted Use

CC BY

Data Intro for Archivists

This Library Carpentry lesson introduces archivists to working with data. At the conclusion of the lesson you will: be able to explain terms, phrases, and concepts in code or software development; identify and use best practice in data structures; use regular expressions in searches.

Subject:: Applied Science; Information Science; Mathematics; Measurement and Data
Material Type:: Module
Provider:: The Carpentries
Author:: James Baker; Jeanine Finn; Jenny Bunn; Katherine Koziar; Noah Geraci; Scott Peterson
Date Added:: 08/07/2020

Data Is Present: Open Workshops and Hackathons

Unrestricted Use

CC BY

Original data has become more accessible thanks to cultural and technological advances. On the internet, we can find innumerable data sets from sources such as scientific journals and repositories, local and national governments, and non-governmental organisations. Often, these data may be presented in novel ways, by creating new tables or plots, or by integrating additional data. Free, open-source software has become a great companion for open data. This open scholarship project offers free workshops and coding meet-ups (hackathons) to learn and practise data presentation, across the UK. It is made possible by a fellowship of the Software Sustainability Institute.

Subject:: Applied Science; Life Science; Physical Science; Social Science
Material Type:: Activity/Lab
Author:: Pablo Bernabeu
Date Added:: 01/27/2020

Conditional Remix & Share Permitted

CC BY-NC

Data Management & Reproducibility

A presentation introducing data management and reproducibility for researchers.

Subject:: Applied Science; Life Science; Physical Science; Social Science
Material Type:: Lesson
Provider:: New York University
Author:: Vicky Steeves
Date Added:: 04/04/2019

Unrestricted Use

CC BY

Data Management with SQL for Ecologists

Databases are useful for both storing and using data effectively. Using a relational database serves several purposes. It keeps your data separate from your analysis. This means there’s no risk of accidentally changing data when you analyze it. If we get new data, we can rerun a query to find all the data that meets certain criteria. It’s fast, even for large amounts of data. It improves quality control of data entry (type constraints and use of forms in Access, Filemaker, etc.). The concepts of relational database querying are core to understanding how to do similar things using programming languages such as R or Python. This lesson will teach you what relational databases are, how you can load data into them, and how you can query databases to extract just the information that you need.

Subject:: Applied Science; Computer Science; Information Science; Mathematics; Measurement and Data
Material Type:: Module
Provider:: The Carpentries
Author:: Christina Koch; Donal Heidenblad; Katy Felkner; Rémi Rampin; Timothée Poisot
Date Added:: 03/20/2017

Data Management with SQL for Social Scientists

Unrestricted Use

CC BY

This is an alpha lesson to teach Data Management with SQL for Social Scientists. We welcome any criticism or error reports, and will take your feedback into account to improve both the presentation and the content. Databases are useful for both storing and using data effectively. Using a relational database serves several purposes. It keeps your data separate from your analysis. This means there’s no risk of accidentally changing data when you analyze it. If we get new data, we can rerun a query to find all the data that meets certain criteria. It’s fast, even for large amounts of data. It improves quality control of data entry (type constraints and use of forms in Access, Filemaker, etc.). The concepts of relational database querying are core to understanding how to do similar things using programming languages such as R or Python. This lesson will teach you what relational databases are, how you can load data into them, and how you can query databases to extract just the information that you need.

Subject:: Applied Science; Computer Science; Information Science; Mathematics; Measurement and Data; Social Science
Material Type:: Module
Provider:: The Carpentries
Author:: Peter Smyth
Date Added:: 08/07/2020

Data Organization in Spreadsheets for Ecologists

Unrestricted Use

CC BY

Good data organization is the foundation of any research project. Most researchers have data in spreadsheets, so it’s the place that many research projects start. We organize data in spreadsheets in the ways that we as humans want to work with the data, but computers require that data be organized in particular ways. In order to use tools that make computation more efficient, such as programming languages like R or Python, we need to structure our data the way that computers need the data. Since this is where most research projects start, this is where we want to start too! In this lesson, you will learn: good data entry practices (formatting data tables in spreadsheets); how to avoid common formatting mistakes; approaches for handling dates in spreadsheets; basic quality control and data manipulation in spreadsheets; and exporting data from spreadsheets. In this lesson, however, you will not learn about data analysis with spreadsheets. Much of your time as a researcher will be spent in the initial ‘data wrangling’ stage, where you need to organize the data to perform a proper analysis later. It’s not the most fun, but it is necessary. In this lesson you will learn how to think about data organization and some practices for more effective data wrangling. With this approach you can better format current data and plan new data collection so less data wrangling is needed.

Subject:: Applied Science; Computer Science; Information Science; Mathematics; Measurement and Data
Material Type:: Module
Provider:: The Carpentries
Author:: Christie Bahlai; Peter R. Hoyt; Tracy Teal
Date Added:: 03/20/2017

Data Organization in Spreadsheets for Social Scientists

Unrestricted Use

CC BY

Lesson on spreadsheets for social scientists. Good data organization is the foundation of any research project. Most researchers have data in spreadsheets, so it’s the place that many research projects start. Typically we organize data in spreadsheets in ways that we as humans want to work with the data. However, computers require data to be organized in particular ways. In order to use tools that make computation more efficient, such as programming languages like R or Python, we need to structure our data the way that computers need the data. Since this is where most research projects start, this is where we want to start too! In this lesson, you will learn: good data entry practices (formatting data tables in spreadsheets); how to avoid common formatting mistakes; approaches for handling dates in spreadsheets; basic quality control and data manipulation in spreadsheets; and exporting data from spreadsheets. In this lesson, however, you will not learn about data analysis with spreadsheets. Much of your time as a researcher will be spent in the initial ‘data wrangling’ stage, where you need to organize the data to perform a proper analysis later. It’s not the most fun, but it is necessary. In this lesson you will learn how to think about data organization and some practices for more effective data wrangling. With this approach you can better format current data and plan new data collection so less data wrangling is needed.

Subject:: Applied Science; Information Science; Mathematics; Measurement and Data; Social Science
Material Type:: Module
Provider:: The Carpentries
Author:: David Mawdsley; Erin Becker; François Michonneau; Karen Word; Lachlan Deer; Peter Smyth
Date Added:: 08/07/2020

Data Sharing by Scientists: Practices and Perceptions

Unrestricted Use

CC BY

Background: Scientific research in the 21st century is more data intensive and collaborative than in the past. It is important to study the data practices of researchers – data accessibility, discovery, re-use, preservation and, particularly, data sharing. Data sharing is a valuable part of the scientific method allowing for verification of results and extending research from prior results. Methodology/Principal Findings: A total of 1329 scientists participated in this survey exploring current data sharing practices and perceptions of the barriers and enablers of data sharing. Scientists do not make their data electronically available to others for various reasons, including insufficient time and lack of funding. Most respondents are satisfied with their current processes for the initial and short-term parts of the data or research lifecycle (collecting their research data; searching for, describing or cataloging, analyzing, and short-term storage of their data) but are not satisfied with long-term data preservation. Many organizations do not provide support to their researchers for data management both in the short- and long-term. If certain conditions are met (such as formal citation and sharing reprints) respondents agree they are willing to share their data. There are also significant differences and approaches in data management practices based on primary funding agency, subject discipline, age, work focus, and world region. Conclusions/Significance: Barriers to effective data sharing and preservation are deeply rooted in the practices and culture of the research process as well as the researchers themselves. New mandates for data management plans from NSF and other federal agencies and world-wide attention to the need to share and preserve data could lead to changes. Large scale programs, such as the NSF-sponsored DataNET (including projects like DataONE), will both bring attention and resources to the issue and make it easier for scientists to apply sound data management principles.

Subject:: Ecology; Life Science; Social Science
Material Type:: Reading
Provider:: PLOS ONE
Author:: Arsev Umur Aydinoglu; Carol Tenopir; Eleanor Read; Kimberly Douglass; Lei Wu; Maribeth Manoff; Mike Frame; Suzie Allard
Date Added:: 08/07/2020

Data Wrangling and Processing for Genomics

Unrestricted Use

CC BY

Data Carpentry lesson to learn how to use command-line tools to perform quality control, align reads to a reference genome, and identify and visualize between-sample variation. A lot of genomics analysis is done using command-line tools for three reasons: 1) you will often be working with a large number of files, and working through the command-line rather than through a graphical user interface (GUI) allows you to automate repetitive tasks, 2) you will often need more compute power than is available on your personal computer, and connecting to and interacting with remote computers requires a command-line interface, and 3) you will often need to customize your analyses, and command-line tools often enable more customization than the corresponding GUI tools (if in fact a GUI tool even exists). In a previous lesson, you learned how to use the bash shell to interact with your computer through a command line interface. In this lesson, you will be applying this new knowledge to carry out a common genomics workflow - identifying variants among sequencing samples taken from multiple individuals within a population. We will be starting with a set of sequenced reads (.fastq files), performing some quality control steps, aligning those reads to a reference genome, and ending by identifying and visualizing variations among these samples. As you progress through this lesson, keep in mind that, even if you aren’t going to be doing this same workflow in your research, you will be learning some very important lessons about using command-line bioinformatic tools. What you learn here will enable you to use a variety of bioinformatic tools with confidence and greatly enhance your research efficiency and productivity.

Subject:: Applied Science; Computer Science; Genetics; Information Science; Life Science; Mathematics; Measurement and Data
Material Type:: Module
Provider:: The Carpentries
Author:: Adam Thomas; Ahmed R. Hasan; Aniello Infante; Anita Schürch; Dev Paudel; Erin Alison Becker; Fotis Psomopoulos; François Michonneau; Gaius Augustus; Gregg TeHennepe; Jason Williams; Jessica Elizabeth Mizzi; Karen Cranston; Kari L Jordan; Kate Crosby; Kevin Weitemier; Lex Nederbragt; Luis Avila; Peter R. Hoyt; Rayna Michelle Harris; Ryan Peek; Sheldon John McKay; Sheldon McKay; Taylor Reiter; Tessa Pierce; Toby Hodges; Tracy Teal; Vasilis Lenis; Winni Kretzschmar; dbmarchant
Date Added:: 08/07/2020

Data availability, reusability, and analytic reproducibility: evaluating the impact of a mandatory open data policy at the journal Cognition

Unrestricted Use

CC BY

Access to data is a critical feature of an efficient, progressive and ultimately self-correcting scientific ecosystem. But the extent to which in-principle benefits of data sharing are realized in practice is unclear. Crucially, it is largely unknown whether published findings can be reproduced by repeating reported analyses upon shared data (‘analytic reproducibility’). To investigate this, we conducted an observational evaluation of a mandatory open data policy introduced at the journal Cognition. Interrupted time-series analyses indicated a substantial post-policy increase in data available statements (104/417, 25% pre-policy to 136/174, 78% post-policy), although not all data appeared reusable (23/104, 22% pre-policy to 85/136, 62%, post-policy). For 35 of the articles determined to have reusable data, we attempted to reproduce 1324 target values. Ultimately, 64 values could not be reproduced within a 10% margin of error. For 22 articles all target values were reproduced, but 11 of these required author assistance. For 13 articles at least one value could not be reproduced despite author assistance. Importantly, there were no clear indications that original conclusions were seriously impacted. Mandatory open data policies can increase the frequency and quality of data sharing. However, suboptimal data curation, unclear analysis specification and reporting errors can impede analytic reproducibility, undermining the utility of data sharing and the credibility of scientific findings.

Subject:: Applied Science; Information Science
Material Type:: Reading
Provider:: Royal Society Open Science
Author:: Alicia Hofelich Mohr; Bria Long; Elizabeth Clayton; Erica J. Yoon; George C. Banks; Gustav Nilsonne; Kyle MacDonald; Mallory C. Kidwell; Maya B. Mathur; Michael C. Frank; Michael Henry Tessler; Richie L. Lenne; Sara Altman; Tom E. Hardwicke
Date Added:: 08/07/2020

Unrestricted Use

CC BY

Databases and SQL

Software Carpentry lesson that teaches how to use databases and SQL. In the late 1920s and early 1930s, William Dyer, Frank Pabodie, and Valentina Roerich led expeditions to the Pole of Inaccessibility in the South Pacific, and then onward to Antarctica. Two years ago, their expeditions were found in a storage locker at Miskatonic University. We have scanned and OCR'd the data they contain, and we now want to store that information in a way that will make search and analysis easy. Three common options for storage are text files, spreadsheets, and databases. Text files are easiest to create, and work well with version control, but then we would have to build search and analysis tools ourselves. Spreadsheets are good for doing simple analyses, but they don’t handle large or complex data sets well. Databases, however, include powerful tools for search and analysis, and can handle large, complex data sets. These lessons will show how to use a database to explore the expeditions’ data.

Subject:: Applied Science; Computer Science; Information Science; Mathematics; Measurement and Data
Material Type:: Module
Provider:: The Carpentries
Author:: Amy Brown; Andrew Boughton; Andrew Kubiak; Avishek Kumar; Ben Waugh; Bill Mills; Brian Ballsun-Stanton; Chris Tomlinson; Colleen Fallaw; Dan Michael Heggø; Daniel Suess; Dave Welch; David W Wright; Deborah Gertrude Digges; Donny Winston; Doug Latornell; Erin Alison Becker; Ethan Nelson; Ethan P White; François Michonneau; George Graham; Gerard Capes; Gideon Juve; Greg Wilson; Ioan Vancea; Jake Lever; James Mickley; John Blischak; JohnRMoreau@gmail.com; Jonah Duckles; Jonathan Guyer; Joshua Nahum; Kate Hertweck; Kevin Dyke; Louis Vernon; Luc Small; Luke William Johnston; Maneesha Sane; Mark Stacy; Matthew Collins; Matty Jones; Mike Jackson; Morgan Taschuk; Patrick McCann; Paula Andrea Martinez; Pauline Barmby; Piotr Banaszkiewicz; Raniere Silva; Ray Bell; Rayna Michelle Harris; Rémi Emonet; Rémi Rampin; Seda Arat; Sheldon John McKay; Sheldon McKay; Stephen Davison; Thomas Guignard; Trevor Bekolay; lorra; slimlime
Date Added:: 03/20/2017

Data policies of highly-ranked social science journals

Unrestricted Use

CC BY

By encouraging and requiring that authors share their data in order to publish articles, scholarly journals have become an important actor in the movement to improve the openness of data and the reproducibility of research. But how many social science journals encourage or mandate that authors share the data supporting their research findings? How does the share of journal data policies vary by discipline? What influences these journals’ decisions to adopt such policies and instructions? And what do those policies and instructions look like? We discuss the results of our analysis of the instructions and policies of 291 highly-ranked journals publishing social science research, where we studied the contents of journal data policies and instructions across 14 variables, such as when and how authors are asked to share their data, and what role journal ranking and age play in the existence and quality of data policies and instructions. We also compare our results to the results of other studies that have analyzed the policies of social science journals, although differences in the journals chosen and how each study defines what constitutes a data policy limit this comparison. We conclude that a little more than half of the journals in our study have data policies. A greater share of the economics journals have data policies and mandate sharing, followed by political science/international relations and psychology journals. Finally, we use our findings to make several recommendations: Policies should include the terms “data,” “dataset” or more specific terms that make it clear what to make available; policies should include the benefits of data sharing; journals, publishers, and associations need to collaborate more to clarify data policies; and policies should explicitly ask for qualitative data.

Subject:: Psychology; Social Science
Material Type:: Reading
Author:: Abigail Schwartz; Dessi Kirilova; Gerard Otalora; Julian Gautier; Mercè Crosas; Sebastian Karcher
Date Added:: 08/07/2020

Data reuse and the open data citation advantage

Unrestricted Use

CC BY

Background. Attribution to the original contributor upon reuse of published data is important both as a reward for data creators and to document the provenance of research findings. Previous studies have found that papers with publicly available datasets receive a higher number of citations than similar studies without available data. However, few previous analyses have had the statistical power to control for the many variables known to predict citation rate, which has led to uncertain estimates of the “citation benefit”. Furthermore, little is known about patterns in data reuse over time and across datasets. Method and Results. Here, we look at citation rates while controlling for many known citation predictors and investigate the variability of data reuse. In a multivariate regression on 10,555 studies that created gene expression microarray data, we found that studies that made data available in a public repository received 9% (95% confidence interval: 5% to 13%) more citations than similar studies for which the data was not made available. Date of publication, journal impact factor, open access status, number of authors, first and last author publication history, corresponding author country, institution citation history, and study topic were included as covariates. The citation benefit varied with date of dataset deposition: a citation benefit was most clear for papers published in 2004 and 2005, at about 30%. Authors published most papers using their own datasets within two years of their first publication on the dataset, whereas data reuse papers published by third-party investigators continued to accumulate for at least six years. To study patterns of data reuse directly, we compiled 9,724 instances of third party data reuse via mention of GEO or ArrayExpress accession numbers in the full text of papers. 
The level of third-party data use was high: for 100 datasets deposited in year 0, we estimated that 40 papers in PubMed reused a dataset by year 2, 100 by year 4, and more than 150 data reuse papers had been published by year 5. Data reuse was distributed across a broad base of datasets: a very conservative estimate found that 20% of the datasets deposited between 2003 and 2007 had been reused at least once by third parties. Conclusion. After accounting for other factors affecting citation rate, we find a robust citation benefit from open data, although a smaller one than previously reported. We conclude there is a direct effect of third-party data reuse that persists for years beyond the time when researchers have published most of the papers reusing their own data. Other factors that may also contribute to the citation benefit are considered. We further conclude that, at least for gene expression microarray data, a substantial fraction of archived datasets are reused, and that the intensity of dataset reuse has been steadily increasing since 2003.

Subject:: Applied Science; Information Science; Life Science; Social Science
Material Type:: Reading
Provider:: PeerJ
Author:: Heather A. Piwowar; Todd J. Vision
Date Added:: 08/07/2020

Data sharing in PLOS ONE: An analysis of Data Availability Statements

Unrestricted Use

CC BY

A number of publishers and funders, including PLOS, have recently adopted policies requiring researchers to share the data underlying their results and publications. Such policies help increase the reproducibility of the published literature, as well as make a larger body of data available for reuse and re-analysis. In this study, we evaluate the extent to which authors have complied with this policy by analyzing Data Availability Statements from 47,593 papers published in PLOS ONE between March 2014 (when the policy went into effect) and May 2016. Our analysis shows that compliance with the policy has increased, with a significant decline over time in papers that did not include a Data Availability Statement. However, only about 20% of statements indicate that data are deposited in a repository, which the PLOS policy states is the preferred method. More commonly, authors state that their data are in the paper itself or in the supplemental information, though it is unclear whether these data meet the level of sharing required in the PLOS policy. These findings suggest that additional review of Data Availability Statements or more stringent policies may be needed to increase data sharing.

Subject:: Applied Science; Computer Science; Health, Medicine and Nursing; Information Science; Social Science
Material Type:: Reading
Provider:: PLOS ONE
Author:: Alicia Livinski; Christopher W. Belter; Douglas J. Joubert; Holly Thompson; Lisa M. Federer; Lissa N. Snyders; Ya-Ling Lu
Date Added:: 08/07/2020

Deep Dive into Open Scholarship: Collaboration and Replication

Unrestricted Use

CC BY

This deep dive session on replications and large-scale collaborations introduces a glossary of relevant terms, the problems these initiatives address, and some tools to get started. Panelists start with content knowledge transfer, then switch to a more interactive format for Q&A and conversation.

Subject:: Education
Material Type:: Lesson
Author:: Erin Miller; Jay Carter; Scott Peters; Matt Makel
Date Added:: 03/15/2021

Deep Dive into Open Scholarship: Data, Materials, and Code Transparency

Unrestricted Use

CC BY

In this deep dive session, Dr. Willa van Dijk discusses how transparency with data, materials, and code is beneficial for educational research and education researchers. She illustrates these points by sharing experiences with transparency that were crucial to her success. She then shifts gears to provide tips and tricks for planning a new research project with transparency in mind, including attention to potential pitfalls, and also discusses adapting materials from previous projects to share.

Subject:: Education
Material Type:: Lesson
Author:: Willa van Dijk
Date Added:: 03/15/2021

Deep Dive into Open Scholarship: Preprints and OA

Unrestricted Use

CC BY

In this deep dive session, we discuss the current model of scholarly publishing, and highlight the challenges and limitations of this model of research dissemination. We then focus on the value of open access and elaborate on different open access levels (Gold, Bronze, and Green), before discussing how preprints/postprints may be leveraged to promote open access.

Subject:: Education
Material Type:: Lesson
Author:: Stacy Shaw; Bryan Cook
Date Added:: 03/15/2021
