For all the hype around data and data-hyphenated terms (like “data-driven”), it is important to remember that data is a raw resource: it has no realized value until it is integrated into a product that uses it to generate a meaningful output. Though the specific roles and responsibilities of data scientists vary from organization to organization (and even within organizations), data scientists are generally the ones responsible for transforming data from a raw resource into something of value for product users. Indeed, data science is not so much the scientific study of data itself as it…
Election polling has again come under scrutiny after several discrepancies between polling predictions and outcomes in the 2020 election. First, the presidential race in several battleground states, such as Michigan and Wisconsin, turned out to be tighter than polling indicated. Second, in Florida and North Carolina, pre-election polling projected a Biden win, but he ended up losing. Third, Trump won by convincing vote-share margins in states such as Ohio and Texas, which polling had suggested would be closer races (see Silver, 2020).
Nate Silver argues that while there have been divergences between projected presidential…
Many problems that data scientists, statisticians, and other data practitioners encounter require determining whether the observations of interest are likely to belong to one category or another on some outcome. Examples include assessments of creditworthiness (e.g., will a potential borrower default on their debt?), the flagging of credit card purchases that may be fraudulent, and object classification (e.g., is this plant an Iris setosa or not?). While it is technically possible to calculate a probability of belonging to one category versus another using a linear regression model, a more appropriate regression technique is logistic regression. …
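To make the contrast concrete, here is a minimal sketch using only the Python standard library. The data (`debt_ratio`, `defaulted`) are invented for illustration, and the helpers `fit_linear` and `fit_logistic` are hypothetical names, not part of any package. The sketch fits a straight line and a logistic curve to the same binary outcome; the linear predictions can fall below 0 or above 1, while the logistic predictions always stay strictly between the two.

```python
import math

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_linear(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x (closed form)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1

def fit_logistic(xs, ys, lr=0.5, epochs=5000):
    """Fit P(y=1 | x) = sigmoid(b0 + b1 * x) by batch gradient descent on log loss."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad0 = grad1 = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(b0 + b1 * x) - y  # gradient of log loss w.r.t. the linear term
            grad0 += error
            grad1 += error * x
        b0 -= lr * grad0 / n
        b1 -= lr * grad1 / n
    return b0, b1

# Invented toy data: a borrower's debt ratio and whether they defaulted (1) or not (0).
debt_ratio = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
defaulted = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

lin_b0, lin_b1 = fit_linear(debt_ratio, defaulted)
log_b0, log_b1 = fit_logistic(debt_ratio, defaulted)

linear_preds = [lin_b0 + lin_b1 * x for x in debt_ratio]
logistic_preds = [sigmoid(log_b0 + log_b1 * x) for x in debt_ratio]

for x, lin, log in zip(debt_ratio, linear_preds, logistic_preds):
    print(f"x={x:.1f}  linear={lin:+.3f}  logistic={log:.3f}")
```

In practice one would usually reach for a library implementation rather than hand-rolled gradient descent; the point of the sketch is only that the sigmoid keeps every prediction interpretable as a probability.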
A relational database is a collection of data tables (sets of rows and columns that store individual pieces of data) that can be connected to each other. In this way, a relational database is not unlike an Excel workbook with related datasets stored across multiple worksheets. With that thought in mind, this post moves through an example using Python to transform an Excel spreadsheet into a database that can be queried using Structured Query Language (SQL).
The example in this post uses data from the Superstore-Sales dataset, which can be found here. This dataset is stored…
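As a rough illustration of the worksheet-to-table idea, the sketch below uses Python's built-in `sqlite3` module. It does not use the Superstore-Sales data: the two lists stand in for two worksheets, and every table name, column name, and value is invented for the example. With a real workbook, one common route is to load each worksheet with `pandas.read_excel` and write it out with `DataFrame.to_sql`; that step is assumed here rather than shown.

```python
import sqlite3

# Hypothetical rows standing in for two worksheets of one workbook.
# Column names and values are invented for illustration.
customers_sheet = [
    ("C-100", "Ada", "West"),
    ("C-200", "Grace", "East"),
]
orders_sheet = [
    (1, "C-100", "2020-01-03", 250.0),
    (2, "C-200", "2020-01-04", 80.5),
    (3, "C-100", "2020-01-09", 19.5),
]

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")

# Each worksheet becomes a table; customer_id is the column that relates them.
conn.execute(
    "CREATE TABLE customers (customer_id TEXT PRIMARY KEY, name TEXT, region TEXT)"
)
conn.execute(
    """
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id TEXT REFERENCES customers(customer_id),
        order_date  TEXT,
        sales       REAL
    )
    """
)
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", customers_sheet)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", orders_sheet)
conn.commit()

# Query across both tables with SQL: total sales per region.
sales_by_region = conn.execute(
    """
    SELECT c.region, SUM(o.sales) AS total_sales
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    GROUP BY c.region
    ORDER BY c.region
    """
).fetchall()

print(sales_by_region)
conn.close()
```

The join on `customer_id` is what makes the database "relational": the same lookup that would need a VLOOKUP across worksheets in Excel becomes a single SQL query.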
The first time I explored regression in Python, I dove headfirst into scikit-learn, a package that provides a number of useful tools for developing predictive models. I ran a simple linear regression model and output my intercept, coefficients, and model fit metrics. As a newcomer to Python with a background heavily focused on statistical inference, and not yet fully grasping the differences between statistics and data science, I then spent a good amount of time looking for ways to output the standard errors, confidence intervals, and p-values of the regression weights.
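For readers wondering what those inferential quantities involve, here is a minimal sketch that computes the standard errors and t-statistics of a simple linear regression by hand, using only the standard library. The function name `ols_with_standard_errors` and the data are invented for illustration; this is not how any particular package does it internally. Turning the t-statistics into p-values or confidence intervals additionally requires the t distribution, which inference-oriented packages such as statsmodels report in their regression summaries.

```python
import math

def ols_with_standard_errors(xs, ys):
    """Simple linear regression y = intercept + slope * x with classical standard errors."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residual variance uses n - 2 degrees of freedom (two estimated parameters).
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sigma2 = sum(r * r for r in residuals) / (n - 2)

    se_slope = math.sqrt(sigma2 / sxx)
    se_intercept = math.sqrt(sigma2 * (1.0 / n + mean_x ** 2 / sxx))

    return {
        "intercept": intercept,
        "slope": slope,
        "se_intercept": se_intercept,
        "se_slope": se_slope,
        "t_intercept": intercept / se_intercept,
        "t_slope": slope / se_slope,
    }

# Invented toy data, roughly following y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

result = ols_with_standard_errors(xs, ys)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```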