Python Expectations and Resources

The expectation is that prior to starting the class you will know how to:

  • Set up the Python Anaconda environment on your computer (<1 hour)
  • Write simple Python programs and run them (2-3 hours)
  • Use Jupyter Notebook (30 minutes)

Even if you have no knowledge of Python, you can learn this minimum in as little as 4 hours using the tutorials below (if necessary, you can do so during the first two weeks of the course). You will need other Python skills, which you can acquire during the course, but if you have time, learning them before the class starts may leave you less rushed. Much of what you will need is best acquired by looking at examples and adapting them, and with this background knowledge that process will go more smoothly.
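
As a rough illustration of the level meant by "simple Python programs" above, here is a minimal sketch that uses a function, a dictionary, a loop, and formatted output. The function name word_counts and the sample sentence are made up for this example and are not taken from any course material.

```python
def word_counts(text):
    """Count how often each word occurs, ignoring case."""
    counts = {}
    for word in text.lower().split():
        # dict.get returns 0 the first time a word is seen
        counts[word] = counts.get(word, 0) + 1
    return counts


# Hypothetical sample text, used only for illustration
sample = "the quick brown fox jumps over the lazy dog the end"
counts = word_counts(sample)

# Sort by descending count and print the three most frequent words
top_three = sorted(counts.items(), key=lambda item: item[1], reverse=True)[:3]
for word, n in top_three:
    print(f"{word:>6}: {n}")
```

If you can read this and predict roughly what it prints, you are probably at the expected starting level. You can run it by pasting it into Qt Console or a Jupyter Notebook cell.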

Here are additional skills that you will need:

  • Basic knowledge of Pandas, a Python data analysis library for reading and processing data (1-2 hours).
  • Ability to use Matplotlib to generate charts (1 hour).
  • Ability to use scikit-learn, the main Python library you will use for the labs and project, which provides the data mining/machine learning algorithms we will use. You will learn this as the course progresses, but you should start with a basic overview/tutorial, which I will assign in class (2 hours).
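
To give a feel for the Pandas and Matplotlib skills listed above, here is a minimal sketch that builds a small table, filters and summarizes it, and draws a bar chart. The data, column names, and output file name are invented for illustration. The matplotlib.use("Agg") line is an assumption that lets the code run as a plain script without a display; inside Jupyter Notebook you would normally leave it out.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: non-interactive backend; omit inside Jupyter Notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data for illustration only
df = pd.DataFrame({
    "student": ["Ann", "Bob", "Cara", "Dev"],
    "section": ["A", "B", "A", "B"],
    "score": [88, 72, 95, 81],
})

# Filter rows with a boolean condition
high = df[df["score"] >= 80]

# Group rows by section and compute the mean score of each group
mean_by_section = df.groupby("section")["score"].mean()
print(mean_by_section)

# Draw one bar per section
fig, ax = plt.subplots()
ax.bar(list(mean_by_section.index), list(mean_by_section.values))
ax.set_xlabel("Section")
ax.set_ylabel("Mean score")
ax.set_title("Mean score by section")
fig.savefig("mean_score_by_section.png")  # hypothetical file name
```

In the labs you would typically read the table from a file rather than typing it in, but the filtering, grouping, and plotting steps have the same shape.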

PYTHON RESOURCES

Feel free to suggest others for me to add, to benefit your classmates and future students.

Key references

Tutorials

The tutorial topics (1-6) are organized in a logical sequence. Two supplemental tutorials follow them (listed under Supplemental Tutorials); these cut across several of the topics and can get you started more quickly, but provide only selective coverage. In my opinion you are better off with the main sequence (1-6), but you can use the other ones for review or as supplemental resources.

  1. Download Anaconda.

    This will include most of the libraries you will need (e.g., Pandas), as well as Qt Console for running IPython code and Jupyter Notebook. Starting Anaconda Navigator will show you the various interfaces. Any missing libraries can be installed later.

  2. Go through parts of the Python tutorial.

    You should complete Sections 1-3, most of Section 4 (Control Flow), most of Section 5 (Data Structures), a bit of Section 6 (Modules), and most of Section 7 (Input and Output). You can skip the rest. Start with Qt Console, but at some point shift to Jupyter Notebook (next item).

  3. Learn to use Jupyter Notebook via a 20-minute YouTube video or a short tutorial.

    There is also a "User Interface Tour" option under the "Help" menu in Jupyter Notebook, but that may not be sufficient on its own. You should also check out "Keyboard Shortcuts" under the "Help" menu as you become more familiar with Notebook. You can submit your notebooks for your labs, most likely as an exported PDF file.

  4. Learn the basics of Pandas using this 1-hour YouTube video.

    While I recommend the video, you can instead use web-based tutorials: 1) in Jupyter Notebook, open the "Help" menu, select "pandas reference", and then go to the "get started guides" (the user guide and references may be useful too), or 2) consider the Kaggle 4-hour tutorial.

  5. Learn the basics of Matplotlib by going to the Matplotlib website and viewing the quickstart guides and some of the examples, and/or by watching the 30-minute YouTube video.

  6. Learn the basics of scikit-learn with this 1 hour 40 minute YouTube video.

    Visit the official scikit-learn website and browse it a bit. Click on classification and then select "Decision Trees" (currently 1.10), and then do 1.10.1 in Jupyter Notebook, which will have you build a decision tree classifier for the iris data set. This is how you will probably construct most of your labs: by finding similar examples on this site and then adapting them. In fact, if you search for a specific item, the search will likely turn it up on this site, with reference information and, much more importantly, specific examples that you can adapt. It is okay if you do not understand the content until we cover the topic in class.
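
To show what "finding an example and adapting it" can look like, here is a minimal sketch of a decision tree classifier for the iris data set. It is not a copy of the example on the scikit-learn site: the train/test split, the test_size of 0.3, and the random_state values are assumptions made for this illustration, added so that accuracy is measured on data the tree has not seen.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the iris measurements (X) and species labels (y)
X, y = load_iris(return_X_y=True)

# Hold out 30% of the rows for testing; stratify keeps the class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Fit a decision tree on the training rows only
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

# Fraction of held-out rows that the tree classifies correctly
accuracy = clf.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

Most scikit-learn estimators follow this same create, fit, then score or predict pattern, so adapting an example often comes down to swapping the estimator class and the data set.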

Supplemental Tutorials

  1. An Introduction to Python for Data Science Applications: Covers Python data structures and a very quick look at NumPy/SciPy, Matplotlib, and Pandas.
  2. A set of Jupyter Notebook tutorial examples from our textbook authors that covers an intro to Python, NumPy and Pandas, Data Exploration and Preprocessing, Regression, Classification, etc.