The Data Mine’s Bookshelf

This page is still under construction. For now, some books are listed only by shorthand names. We are working on adding Purdue library links for every title. If you look up any of these books through Purdue’s library (which anyone can do, including non-Purdue students), you will almost certainly find it.

While most of these books are scattered throughout the Starter Guides on their respective topics, they are also listed here under their approximate content domain. All of these books come highly recommended. For Purdue students, most if not all of these books are free at the Purdue library link; for non-Purdue students, a good chunk of them should be free.

Data Analysis

Visualization

Analysis Techniques

Spatial Data Analysis

Computer Vision

Machine Learning

NLP

  • Natural Language Processing with Transformers by Lewis Tunstall, Leandro von Werra, and Thomas Wolf (O’Reilly, 2022)

  • Practical Natural Language Processing by Sowmya Vajjala, Bodhisattwa Majumder, Anuj Gupta, and Harshit Surana (O’Reilly, 2020)

  • Natural Language Processing with PyTorch by Delip Rao and Brian McMahan (O’Reilly, 2019)

  • GPT-3 by Sandra Kublik and Shubham Saboo (O’Reilly, 2022)

  • Natural Language Processing with Spark NLP by Alex Thomas (O’Reilly, 2020)

GAMs: Generalized Additive Models

Optimization

Specific Subject Analysis

Sports

  • Baseball Hacks

  • Sport Business Analytics

Biology, Bioinformatics, Forestry

  • Statistical Methods in Bioinformatics

  • Developing Bioinformatics Computer Skills

  • Bioinformatics Data Skills

  • BLAST

  • Modern Statistics for Modern Biology

  • Deep Learning for Life Sciences

  • Forest Analytics with R

Gathering Data

Data Mining

  • Programming Collective Intelligence

  • Mining the Social Web

Data Engineering

General

  • 97 Things Every Cloud Engineer Should Know

  • 97 Things Every Data Engineer Should Know

  • Foundations for Architecting Data Solutions

  • Building Secure and Reliable Systems

  • Designing Data-Intensive Applications

  • 97 Things Every Engineering Manager Should Know

  • The Enterprise Big Data Lake

Platforms

Spark

  • Spark: The Definitive Guide

  • High Performance Spark

  • Stream Processing with Apache Spark

  • Advanced Analytics with Spark

  • Learning Spark

Azure

  • Mastering Azure Analytics

Hive

  • Programming Hive

Hadoop

  • Hadoop: The Definitive Guide

  • Hadoop Application Architectures

  • Hadoop in Practice

  • Data Analytics with Hadoop

AWS

  • AWS Cookbook

  • Migrating to AWS: A Manager’s Guide

  • Data Science on AWS

MapReduce

  • MapReduce Design Patterns

Kafka

  • Mastering Kafka Streams

  • Architecting Modern Data Platforms

  • Kafka: The Definitive Guide

Containers

Kubernetes

Productivity

Methodology

DevOps

  • Intro to DevOps with Chocolate, LEGO

Incorporating Diverse Backgrounds

  • Asked and Answered by Pamela E. Harris and Aris Winger (2020)

  • Practices and Policies by Pamela E. Harris and Aris Winger (2021)

  • Read and Rectify by Pamela E. Harris and Aris Winger (2022)

  • Testimonios by Pamela E. Harris, Alicia Prieto-Langarica, Vanessa Rivera Quiñones, Luis Sordo Vieira, Rosaura Uscanga, and Andrés R. Vindas Meléndez

  • Unleash Different by Rich Donovan (2018)

  • Why, How, and What of Data Science for Social Impact

Version Control

SVN/Subversion

  • Version Control with Subversion

Git/GitHub

  • Learn Git in a Month of Lunches

  • Building Tools with GitHub

  • Git for Teams

  • Version Control with Git

Miscellaneous Tools

Raspberry Pi

  • Raspberry Pi Cookbook

Open Source

  • Data Analysis with Open Source Tools

Command Line

  • Data Science at the Command Line

Unix

GNU

  • Learning GNU Emacs

Tools

  • Flex and Bison