## Week 1 Assignment – Statistics and Standard Deviation: collect data from rent.com

Learning Objectives Covered

- LO 01.02 – Identify the principles of statistical reasoning and their use in understanding, analyzing, and developing formal arguments
- LO 01.03 – Outline the overall process and particular steps in designing studies, collecting and analyzing data, interpreting and presenting results
- LO 02.01 – Identify and appropriately apply the elements of statistical thinking
- LO 02.02 – Determine sampling methods and the benefits of each
- LO 04.02 – Calculate standard deviation and determine the meaning of data

Career Relevancy

Knowing how to collect, read, and present data benefits not just you but others as well. Suppose you believed your department did not have enough employees to meet its primary business goals. How would you go about proving it? Once you have collected the data and completed your analysis, your conclusions can benefit not just your department but the company as a whole.

Background

What happens after you have collected data but before you show the results? You have to take all those numbers and other information and make sense of them. This is the foundational premise and purpose of statistics: to collect, organize, present, analyze, and interpret data to make better decisions.

**Statistical Thinking and Investigation**

There are key components to a statistical mindset, and at times, the numbers are only a small part of the overall process.

- Planning the study. What do you want to know? What is your testable research question?
- Examining the data. What graphs and descriptive statistics are appropriate for summarizing your data, and what do they reveal?
- Inferring from the data. What can you validly infer beyond the data you collected?
- Drawing conclusions. Based on what you learned from the data, what conclusions can you draw? (Chance & Rossman, n.d.)

Statistics are powerful tools that use mathematics to perform technical analysis of data. Using statistics means we can gain deeper insight into how data is structured and how to apply data science techniques to get even more information (Seif, 2018). There are five basic statistical concepts that we will cover briefly.

- Statistical features. These are among the first techniques you would apply when exploring a dataset, and also the most common. They include bias, variance, mean, median, and percentiles (Seif, 2018).
- Probability distributions. A probability distribution is a function that represents the probabilities of all possible values in an experiment. Probability is the percent chance that some event will occur (Seif, 2018).
- Dimensionality reduction. This term means we have a dataset and want to reduce the number of dimensions it has (Seif, 2018).
- Over- and undersampling. These techniques are used in classification problems where one class has far more examples than another. Undersampling means we select only some of the data from the majority class, using just as many examples as the minority class has, while oversampling means we create copies of minority-class examples until the minority class has the same number as the majority class (Seif, 2018).
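- Bayesian statistics. Frequency statistics applies math to analyze the probability of some event occurring using only prior data. Bayesian statistics goes further by also taking new evidence into account and updating the probability accordingly (Seif, 2018).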
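
As an illustration of the over- and undersampling idea in the list above, here is a minimal Python sketch. The function names and the tiny `majority` and `minority` lists are hypothetical and made up for this example; they do not come from Seif (2018), and real projects often use more refined resampling methods.

```python
import random

def undersample(majority, minority, seed=0):
    """Keep only as many majority examples as there are minority examples."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), list(minority)

def oversample(majority, minority, seed=0):
    """Add randomly chosen copies of minority examples until both classes match in size."""
    rng = random.Random(seed)
    shortfall = len(majority) - len(minority)
    copies = [rng.choice(minority) for _ in range(shortfall)]
    return list(majority), list(minority) + copies

# Hypothetical, made-up examples: 8 in the majority class, 3 in the minority class.
majority = [f"A{i}" for i in range(8)]
minority = ["B0", "B1", "B2"]

under_major, under_minor = undersample(majority, minority)
over_major, over_minor = oversample(majority, minority)

print(len(under_major), len(under_minor))  # 3 3
print(len(over_major), len(over_minor))    # 8 8
```

Undersampling discards data from the larger class, while oversampling repeats data from the smaller class; which trade-off is acceptable depends on the problem.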

**Standard Deviation**

Understanding standard deviation is part of comprehending your data and making it useful. Standard deviation is a measure of how spread out the numbers are and is denoted by the Greek letter sigma (σ). The basic terms needed to calculate it are:

Mean: the average of the numbers

Variance: the average of the squared differences from the mean

Standard deviation: the square root of the variance
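
Written in symbols, the three definitions above look like this for a data set of N values. The notation (N for the number of values, x_i for the i-th value, and μ for the mean) is introduced here for illustration, and the formulas use the population form that divides by N, to match "the average of the squared differences."

```latex
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i,
\qquad
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2,
\qquad
\sigma = \sqrt{\sigma^2}
```

When the data are a sample rather than the whole population, the variance is commonly computed by dividing by N − 1 instead of N.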

**Sampling Methods**

We will review six statistical sampling methods that can be applied when a population contains so many items that analyzing all of them would be time consuming and costly (Westfall, n.d.).

- Random sampling. In this method, each item in the population has the same probability of being selected as part of the sample as any other item (Westfall, n.d.). The benefits of random sampling include simplicity and lack of bias.
- Systematic sampling. In this method, every nth element of the list is selected for the sample, starting from an element chosen at random near the beginning of the list (Westfall, n.d.). Benefits include being easy to execute and understand.
- Stratified sampling. This method is used when each subgroup within the population needs to be represented in the sample. Benefits include producing a sample whose characteristics are proportional to those of the overall population. Stratified sampling also gives smaller error estimates and greater precision.
- Cluster sampling. In cluster sampling, the population that is being sampled is divided into groups called clusters (Westfall, n.d.). Typically, clustering is less expensive and allows larger samples.
- Haphazard sampling. In this method, samples are selected based on convenience but preferably should still be chosen as randomly as possible. This method does not always yield a representative sample, but it can be advantageous when there are technical limits.
- Judgmental sampling. In judgmental sampling, the person doing the sampling uses their knowledge or experience to select the items to be sampled (Westfall, n.d.). Benefits include near-immediate results with minimal execution time.
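
To make three of the methods above concrete, here is a minimal Python sketch of random, systematic, and stratified sampling. The function names, the `population` of item IDs, and the odd/even subgroups are hypothetical choices made for this illustration; they are not taken from Westfall (n.d.). The systematic version assumes the starting point is picked at random from the first n items.

```python
import random

def simple_random_sample(items, size, seed=0):
    """Random sampling: every item has the same chance of being selected."""
    return random.Random(seed).sample(items, size)

def systematic_sample(items, n, seed=0):
    """Systematic sampling: take every nth item after a random start among the first n."""
    start = random.Random(seed).randrange(n)
    return items[start::n]

def stratified_sample(items, stratum_of, fraction, seed=0):
    """Stratified sampling: draw the same fraction from each subgroup (stratum)."""
    rng = random.Random(seed)
    strata = {}
    for item in items:
        strata.setdefault(stratum_of(item), []).append(item)
    sample = []
    for members in strata.values():
        count = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, count))
    return sample

def parity(item_id):
    """Hypothetical subgroup label used only for this example."""
    return "even" if item_id % 2 == 0 else "odd"

population = list(range(1, 21))  # hypothetical item IDs 1 through 20

print(simple_random_sample(population, 5))
print(systematic_sample(population, 4))
print(stratified_sample(population, parity, 0.3))
```

The fixed `seed` only makes the example repeatable; in a real study you would not reuse the same seed for every draw.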

To find the variance and standard deviation of a data set:

- Calculate the mean.
- For each number, subtract the mean and square the result.
- Average the squared differences. This average is the variance.
- Take the square root of the variance. The result is the standard deviation.
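
The steps above can be sketched in Python as follows. The function name and the eight numbers in `data` are made up for illustration. The sketch divides by the number of values, as the steps describe, so it gives the population variance.

```python
import math

def population_variance_and_sd(values):
    """Follow the listed steps: mean, squared differences, their average, square root."""
    mean = sum(values) / len(values)                      # calculate the mean
    squared_diffs = [(x - mean) ** 2 for x in values]     # subtract the mean, then square
    variance = sum(squared_diffs) / len(squared_diffs)    # average the squared differences
    standard_deviation = math.sqrt(variance)              # square root of the variance
    return mean, variance, standard_deviation

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative numbers only
mean, variance, sd = population_variance_and_sd(data)
print(mean, variance, sd)  # 5.0 4.0 2.0
```

Python's standard library offers the same calculation as `statistics.pvariance` and `statistics.pstdev`. Note that `statistics.variance` and `statistics.stdev` divide by one less than the number of values, which is the usual convention when the data are a sample.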

**Using Your Tools**

There was a time when the only tools for calculating standard deviation, variance, and mean were paper and pencil. Excel now includes an Analysis ToolPak to help perform data analysis. If you don’t already have the ToolPak installed, now is the perfect time to install it. Follow the Analysis ToolPak link and walk through the installation steps. After you have installed it, you should see a Data Analysis option on the far right-hand side of the Data tab. Clicking Data Analysis opens a separate dialog box with options.

**Resources and References**

Chance, B., & Rossman, A. (n.d.). Statistical thinking. *NOBA*. Retrieved February 17, 2020, from https://nobaproject.com/modules/statistical-thinking

Math is Fun. (n.d.). Standard deviation and variance. Retrieved from https://www.mathsisfun.com/data/standard-deviation.html

Seif, G. (2018, October 21). The 5 basic statistics concepts data scientists need to know. *Towards Data Science*. Retrieved February 17, 2020, from https://towardsdatascience.com/the-5-basic-statistics-concepts-data-scientists-need-to-know-2c96740377ae

Westfall, L. (n.d.). Sampling methods. *The Certified Software Quality Engineer Handbook*. Retrieved February 17, 2020, from http://westfallteam.com/Papers/Sampling%20Methods.pdf

## Prompt

For this assignment, collect data from rent.com to create your own data set. The data set should be a systematic sample of homes and/or apartments listed for rent in your zip code. Choose a value of n between 2 and 4, and restrict your sample to homes or apartments only (no land rentals, etc.). Then select every nth listing in your zip code until you have a data set of 10 data points. If your zip code does not have enough rentals, choose a nearby zip code.

Your data set should include the following information for each listing: total cost, square footage, number of bedrooms, and number of bathrooms. Then do the same for a zip code of your choosing that is NOT in the same state.

Use descriptive statistics to compare the two data sets in terms of what is typical and how much variation there is. Use the following template to present your findings. Determine which market has the best value for homes. In your findings, be sure to include 2 citations and references. Your comparison document should be a minimum of 350 words.
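
If you would like to check your spreadsheet results, the following Python sketch shows one possible way to run the comparison. All of the numbers are made-up placeholders, not real listings, and each market has only five listings for brevity (the assignment calls for 10). Using rent per square foot as a measure of "value" is an assumption made for this example, not a requirement of the assignment.

```python
import statistics

# Made-up placeholder listings: (monthly rent in dollars, square feet).
market_a = [(1200, 800), (1350, 900), (1500, 1000), (1100, 750), (1650, 1100)]
market_b = [(2000, 850), (2400, 950), (1800, 700), (2600, 1200), (2200, 900)]

def describe(values):
    """Summarize what is typical (mean, median) and the variation (standard deviation)."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation (divides by n - 1)
    }

def rent_per_sqft(listings):
    """Hypothetical 'value' measure: dollars of rent per square foot."""
    return [rent / sqft for rent, sqft in listings]

for name, listings in (("Market A", market_a), ("Market B", market_b)):
    rents = [rent for rent, _ in listings]
    rent_stats = describe(rents)
    value_stats = describe(rent_per_sqft(listings))
    print(
        f"{name}: mean rent ${rent_stats['mean']:.0f}, "
        f"median rent ${rent_stats['median']:.0f}, "
        f"rent SD ${rent_stats['stdev']:.0f}, "
        f"mean rent per sq ft ${value_stats['mean']:.2f}"
    )
```

A lower mean rent per square foot suggests more space for the money, while a smaller standard deviation suggests that rents in that market are more consistent from listing to listing.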