Python for Statistics, Data Analysis, and Data Science

This guide, intended for beginners and more experienced programmers alike, will help you find a Python setup and build a research workflow that works for you.

For beginner to intermediate programmers, this guide will get you started with a powerful but beginner-friendly Python setup that will support both Python learning and many research needs. If you are a beginner, we recommend that you:

  1. Read the general information on this page to learn what Python is, who this guide is for, and whether it is right for you.

  2. Continue to the Which Setup is Right for Me? page to learn more about Google Colab, our recommendation for beginner users.

  3. Proceed to our guide for getting set up in Google Colab.

  4. (Optional, as needed) See the additional resources section for information on free access (through Tufts) to the online learning platform Udemy for beginner-level tutorials and resources on Python programming. Google Colab can be used for many Jupyter Notebook-based introductory courses/tutorials or as your first setup after completing an introductory course.

For intermediate to advanced programmers: if you intend to use Python for statistics, data analysis, or data science (see Who is this Guide For? below) and you think you may be ready for a more advanced setup, you can skip directly to our setup guide at Which Setup is Right for Me? There you can compare our beginner-to-intermediate recommendation (Google Colab) to our more advanced option (Microsoft VS Code with Miniforge) to weigh the pros and cons of each setup, and then follow the links to proceed to the setup instructions for whichever option you choose.

For those using the Tufts High-Performance Computing (HPC) Cluster, the Which Setup is Right for Me? page also provides recommendations and links to information on how to use Python most effectively on the HPC.

There are many useful textbooks and tutorials on Python, but we find that it can still be daunting to sort through the many setups and find one that works for you and your research. This guide aims to fill that gap.

Read on to learn more about Python and whether it might be right for you.

What is Python?

Python is a powerful, open-source programming language that can be used for statistics, machine learning, data visualization, and many other research tasks. It is a particularly strong choice for researchers who want access to cutting-edge tools for advanced data science, including packages for machine learning, artificial intelligence, and natural language processing. Best of all, it’s free, and can be used by anyone anywhere without worrying about licensing agreements or use restrictions.
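As a small illustration of the kind of statistics task mentioned above, the following sketch computes a few summary statistics using only Python's built-in `statistics` module. The data values and variable names are made up for illustration and are not part of this guide's setup instructions.

```python
import statistics

# Hypothetical sample data (made up for illustration): seven exam scores
scores = [72, 85, 90, 66, 85, 78, 94]

# Basic descriptive statistics from the standard library
mean_score = statistics.mean(scores)
median_score = statistics.median(scores)
stdev_score = statistics.stdev(scores)  # sample standard deviation

print(f"Mean: {mean_score:.2f}")
print(f"Median: {median_score}")
print(f"Sample standard deviation: {stdev_score:.2f}")
```

For larger datasets and more advanced methods, researchers typically turn to third-party packages, but a short script like this one is often enough to get a feel for the language.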

Who is this Guide For?

Python is a general-purpose programming language, which means it’s useful not just for academic researchers and data scientists, but also for software developers and other professionals. This guide, however, is intended specifically for those planning to use Python for:

  • Statistics

  • Data Analysis

  • Data Visualization

  • Data Science

  • Machine Learning

  • AI

  • Natural Language Processing

Those who want to use Python for other purposes (such as software or application development) may wish to consult other sources.

Is Python Right for Me?

If you are planning to use Python, be advised that it has a learning curve. Be prepared to invest more time learning how to set it up and use it than you would with point-and-click statistical software.

A programming language (like R or Python) is the most powerful and flexible way to work with data, but not everyone needs to learn to program. If you plan to do only basic statistics and do not need the full power of a programming language, you may find that there are simpler, more beginner-friendly solutions that meet your needs. For example, Jamovi and PSPP are two free software options that, while less powerful than programming languages, offer intuitive graphical interfaces and can handle a wide variety of common statistical methods.

Proprietary statistical suites such as Stata, SAS, and SPSS are also popular among researchers in some fields. They offer a mix of intuitive graphical user interfaces and scripting/programming tools, depending on the user’s preferred workflow. Note that these require paid subscriptions, which can limit access for you or your collaborators or limit the transferability of your skills if you change employers. We also find that, once your needs go beyond basic statistics, these options may not be significantly easier to use than programming languages like R or Python, and they do not offer the full power and flexibility of the latter. For this reason, R or Python may be the better option for you unless one of these proprietary packages is strongly preferred in your field and using it will better position you to work with collaborators.

R or Python?

If you are ready to learn a programming language and plan to do statistics, data analysis, and data science, you typically face a choice between two options: R and Python. If you’re new to programming and you plan to do mostly data analysis, statistics, and data visualization, you may find that R is easier to learn and use, while still being free and powerful enough to handle even advanced research needs. Python has a steeper learning curve than R and requires more active management of the coding environment, but its popularity among data scientists for machine learning, AI, and natural language processing makes it particularly powerful for researchers working in those domains. Both R and Python are powerful, and your choice may depend on personal preference, what’s common in your field, and what your collaborators use.

Next Steps

If you are ready to learn Python and want help getting set up to program, continue to Which Setup is Right for Me?

Additional Resources

Beginner-Level Python Tutorials

If you are a Tufts student, staff member, or faculty member, you can get free access to Udemy, an online learning portal with video courses on a wide variety of topics. You can access it at tufts.udemy.com and sign in with your Tufts credentials. Look for a course that teaches Python specifically for data analysis, data science, or statistics.