Software and tools
Python
Assignments and projects in this course will be based on Python 3. We will be using the following packages throughout this course:
NumPy and SciPy: The industry-standard Python libraries for working with vectors, matrices and general numerical computing. My notes from neural networks introduce NumPy and review basic linear algebra. Note that while NumPy will be used, linear algebra is not a prerequisite for this class.
Pandas: The most widely used Python library for data frames and general data analysis.
Altair: A Python library for visualization based on the Vega and Vega-Lite visualization grammars. Altair allows for concise, declarative plotting and is particularly well-suited for creating complex visualizations with minimal code.
SciKit-Learn: A popular library for basic machine learning and statistical modeling.
Javascript
D3.js: A low-level Javascript library for data visualization. The current standard for complex web-based visualization.
ObservableJS: A Javascript-based framework for interactive data analysis and visualization. Includes its own high-quality data visualization library Observable Plot, which itself is built on D3.
Quarto
Quarto is an open-source publishing system for scientific and technical work. I strongly recommend using Quarto for homeworks and the final project. Similar to Jupyter, Quarto allows for Python code, code outputs and markdown-formatted text to be mixed within a single document; however, Quarto provides significantly more flexibility when it comes to formatting and rendering. For homeworks, you will typically render documents to HTML or PDF for upload to Canvas, but Quarto can render to a variety of formats. In fact, this entire website was built with Quarto!
The disadvantage of Quarto compared to Jupyter is that Quarto documents must be compiled and run all at once, rather than interactively. However, Quarto can convert Jupyter notebooks to Quarto files for further formatting or even render notebooks directly. See this guide for details.
To get started with Quarto in VSCode, see this guide.
Latex (style) equations
For homework assignments you may occasionally want to typeset mathematical expressions or derivations as Latex-style equations. Latex equations are supported directly within Jupyter and Quarto. To write an equation in a text/markdown cell, simply surround the equation with $ symbols as: $y = x^2 + 1$, which produces the output: \(y=x^2 +1\). You can write a block equation using double dollar signs as $$y = x^2 + 1$$, which puts the equation on its own centered line.
An extensive reference for Latex equations is available here.
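As a sketch of a slightly longer case, a multi-line derivation can be written as a single block equation using the aligned environment, which should render in both Jupyter and Quarto. The particular expression below is only an illustration and is not taken from any assignment:

```latex
$$
\begin{aligned}
y &= (x + 1)^2 - 2x \\
  &= x^2 + 2x + 1 - 2x \\
  &= x^2 + 1
\end{aligned}
$$
```

Each & marks the alignment point (here, the equals sign) and each \\ ends a line of the derivation.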
VSCode (Optional)
Visual Studio Code is a free development environment developed by Microsoft. It is available for Mac, Windows and Linux, and provides convenient tools for working with Python, Git, Jupyter and Quarto. It is what I use to develop the materials for this course, and it is what I would recommend using for homework assignments. This is completely optional, however: you are welcome to use whatever environment you feel most comfortable with.
Here are resources for getting started:
Recommended extensions for data science and working with Jupyter notebooks are listed here.
Instructions for setting up Python in VSCode are here.
Instructions for working with Jupyter notebooks in VSCode are here.
Instructions for setting up Quarto in VSCode are here.
Instructions for setting up Git in VSCode are here.
Alternative tools
While course materials will use the Python/Pandas/Altair stack, other languages and libraries offer similar functionality with different trade-offs. As the homework assignments, quizzes and projects in this class are largely language- and library-agnostic, you may consider experimenting with some of the following alternatives.
If you are interested in learning to use these alternatives and would like to be involved in helping to translate course materials, let me know!
Python alternatives
R: The most popular open-source language built specifically for statistics and data analysis. It is widely used in the statistics and data science community. It includes native support for data frames and many high-quality libraries such as ggplot2.
Julia: Another popular open-source alternative to Python and R.
Pandas alternatives
Polars: A relatively new library that is rapidly gaining popularity as an alternative to Pandas. It aims to accomplish the same goals with improved performance and updated syntax.
SQLite: A lightweight database engine included with Python and accessible through the sqlite3 module. While SQL databases are less suited to general data science than Pandas, they offer many of the same operations with potential for greater scalability. Both Pandas and Polars offer SQL interoperability.
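To give a sense of how SQL covers the same ground as a data frame library, here is a minimal sketch using only the sqlite3 module from the Python standard library. The table name, column names and values are made up for illustration. The query computes a per-group average, roughly what a group-by followed by a mean would do in Pandas.

```python
import sqlite3

# Use an in-memory database so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table of homework scores (names and values are illustrative).
conn.execute("CREATE TABLE scores (student TEXT, assignment TEXT, points REAL)")
rows = [
    ("ana", "hw1", 9.0),
    ("ana", "hw2", 7.0),
    ("ben", "hw1", 8.0),
    ("ben", "hw2", 10.0),
]
conn.executemany("INSERT INTO scores VALUES (?, ?, ?)", rows)

# Average points per student, sorted by student name.
query = """
    SELECT student, AVG(points) AS mean_points
    FROM scores
    GROUP BY student
    ORDER BY student
"""
means = conn.execute(query).fetchall()
conn.close()

print(means)
```

The result comes back as a list of plain tuples, one per student, rather than as a data frame.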
Altair alternatives
Lets-plot/Plotnine: Visualization libraries based on the ggplot2 grammar of graphics. Lets-plot is designed for Kotlin and Python, while Plotnine is a Python implementation.
Seaborn: Another popular visualization library, closely integrated with Pandas. Often simpler for basic plots, but less flexible than other libraries.
Plotly: A cross-platform library with a Python interface and support for 3-D visualizations and animation. Has many cool features, but a less elegant and flexible interface than alternatives.
Matplotlib: The de-facto standard for statistical and scientific visualization in Python. Base Matplotlib is losing favor due to its outdated interface and steep learning curve, but many libraries, such as Seaborn and Plotnine, are built on top of Matplotlib’s powerful rendering engine.
D3 & Observable alternatives
Vega-Lite: A high-level JSON-based visualization grammar and plotting library for the web.