Get started with PySpark and Resilient Distributed Datasets (RDDs)

Diagram of how data gets partitioned among workers in Spark. It starts with an RDD, each partition is stored as an original and a copy, and these are then distributed to a cluster of nodes.

Apache Spark is a framework that allows us to work with big data. But how do we tell Spark what to do with our data? In this lesson, we'll get familiar with using PySpark (the Python API for Spark) to load and transform our data in the form of **RDDs** &mdash; **resilient distributed datasets**. 

RDDs are the foundational data structures of Spark. Newer Spark structures like **DataFrames** are built on top of RDDs. While DataFrames are more commonly used in industry, RDDs are not deprecated and are still called for in certain circumstances. For example, RDDs are useful for processing unstructured data, such as text or images, that don't fit nicely in the tabular structure of a DataFrame.

So what exactly is an RDD? According to our friends at Apache, the formal definition of an RDD is “a fault-tolerant collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.” Those are some complicated words! Let’s break down the three key properties of RDDs together:

- **Fault-tolerant** or **resilient**: data is copied and recoverable in the event of failure
- **Partitioned** or **distributed**: datasets are split up across the nodes in a cluster
- **Operated on in parallel**: tasks are executed on all the chunks of data at the same time

Now that we have a bit more context as to what RDDs are, let’s learn how to create one with PySpark in the next exercise!

Resilient Distributed Datasets (RDDs)

The entry point to Spark is called a **SparkSession**. There are many possible configurations for a SparkSession, but for now, we will simply start a new session and save it as `spark`:

```py
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate() 
```

We can use Spark with data stored on a distributed file system or just on our local machine. Without additional configurations, Spark defaults to local with the number of partitions set to the number of CPU cores on our local machine (often, this is four). 

The `sparkContext` within a SparkSession is the connection to the cluster and gives us the ability to create and transform RDDs. We can create an RDD from data saved locally using the `parallelize()` method. We can add a second argument to specify the number of partitions; the Spark documentation suggests 2-4 partitions for each CPU in the cluster. Otherwise, Spark defaults to the total number of CPU cores.
```py
# default setting
rdd_par = spark.sparkContext.parallelize(dataset_name)
```
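
To build intuition for what a partition count means, here is a minimal sketch in plain Python (no Spark required) that splits a list into contiguous, near-equal chunks. The helper `split_into_partitions` is hypothetical and only illustrates the idea; it is not a claim about how Spark assigns elements internally.

```python
def split_into_partitions(data, num_partitions):
    """Split a list into num_partitions contiguous, near-equal chunks (illustrative only)."""
    n = len(data)
    return [
        data[(i * n) // num_partitions:((i + 1) * n) // num_partitions]
        for i in range(num_partitions)
    ]

numbers = list(range(1, 11))                 # the values 1 through 10
partitions = split_into_partitions(numbers, 4)
print(partitions)                            # [[1, 2], [3, 4, 5], [6, 7], [8, 9, 10]]
```

In PySpark, the second argument to `parallelize()` requests the number of partitions, and `rdd.glom().collect()` returns the elements grouped by partition, so the real layout can be inspected directly.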

If we are working with an external dataset, or possibly a large dataset stored on a distributed file system, we can use `textFile()` to create an RDD. Spark's default is to create one partition for each block of the file (blocks are 128 MB by default in HDFS), but we can also pass a second argument to request a minimum number of partitions.
```py
# with partition argument of 10
rdd_txt = spark.sparkContext.textFile("file_name.txt", 10)
```

We can verify the number of partitions in `rdd_txt` using the following line:
```py
rdd_txt.getNumPartitions()
# output: 10
```

Finally, we need to know how to end our SparkSession when we are finished with our work:
```py
spark.stop()
```

Now that we know how to get started with PySpark, let’s introduce the dataset we’ll be working with throughout this lesson and set it up as an RDD!


Start Coding with PySpark

Many of the Spark functions we use on RDDs are similar to those we regularly use in Python. We can also use lambda expressions within RDD functions. Lambdas allow us to apply a simple operation to an object in a single line without defining it as a function. Check out the following example of a lambda expression that adds the number 1 to its input.

```py
add_one = lambda x: x+1 # apply x+1 to x
print(add_one(10)) # this will output 11
```

Let's introduce a couple of PySpark functions that we may already be familiar with:

`map()` applies an operation to each element of the RDD, so it's often constructed with a lambda expression. This map example adds 1 to each element in our RDD:

```py
rdd = spark.sparkContext.parallelize([1,2,3,4,5])
rdd.map(lambda x: x+1)
# output RDD [2,3,4,5,6]
```

If our RDD contains tuples, we can map the lambda expression to the elements with a specific index value. The following code maps the lambda expression to just the first element of each tuple but keeps the others in the output:

```py
# input RDD [(1,2,3),(4,5,6),(7,8,9)]
rdd.map(lambda x: (x[0]+1, x[1], x[2]))
# output RDD [(2,2,3),(5,5,6),(8,8,9)]
```

`filter()` allows us to remove or keep data conditionally. If we want to remove all missing (`None`) values in the following RDD, we can use a lambda expression in our filter:

```py
# input RDD [1,2,None,4,5]
rdd.filter(lambda x: x is not None)
# output RDD [1,2,4,5]
```

You may have noticed that each function took an RDD as input and returned an RDD as output. In Spark, functions with this behavior are called **transformations**. You can find more transformations in [the official Spark documentation](https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations).

We have one final note about transformations: we can only view the contents of an RDD by using a special function like `collect()`, which returns the data stored in the RDD as a Python list. So to view the new RDD in the previous example, we would run the following:

```py
rdd.filter(lambda x: x is not None).collect()
```
```
[1,2,4,5]
```
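
The resemblance to everyday Python can be made concrete with the built-in `filter()` and `map()`. The sketch below is plain Python, not Spark: it runs on a single machine and uses `list()` where the RDD version would use `collect()`. The variable names are made up for illustration.

```python
values = [1, 2, None, 4, 5]

# keep only the elements that are not None, then add 1 to each survivor
not_none = filter(lambda x: x is not None, values)
plus_one = map(lambda x: x + 1, not_none)

# list() gathers the results, much as collect() does for an RDD
result = list(plus_one)
print(result)  # [2, 3, 5, 6]
```

The lambdas are identical to the ones we would pass to `rdd.filter()` and `rdd.map()`; only the object they are applied to changes.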

Let's try working with some transformations!

Transformations

You may have noticed that transformations execute rather quickly! That’s because they didn’t execute at all. Spark executes transformations only when an **action** is called to return a value. This delay is why we call Spark transformations **lazy**. We call the transformations we do in pandas **eager** because they execute immediately.

So, why are Spark transformations lazy? Spark will queue up the transformations to optimize and reduce overhead once an action is called. Let’s say that we wanted to apply a map and filter to our RDD:

```py
rdd = spark.sparkContext.parallelize([1,2,3,4,5])
rdd.map(lambda x: x+1).filter(lambda x: x>3)
```

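Because nothing runs until an action is called, Spark can look at the whole chain of transformations at once. Instead of building a complete intermediate RDD after the map and then scanning it again for the filter, Spark pipelines the two steps so that each element passes through the map and then the filter in a single pass over its partition. This saves memory and time because the intermediate results never have to be stored in full.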
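
Deferred execution can be observed without a cluster, because Python's built-in `map()` and `filter()` also do no work until their results are consumed. The sketch below is plain Python, not Spark; the `log` list and the helper functions `add_one` and `greater_than_three` are hypothetical names used only to record when each step runs.

```python
log = []

def add_one(x):
    log.append(f"map({x})")       # record every call to the map step
    return x + 1

def greater_than_three(x):
    log.append(f"filter({x})")    # record every call to the filter step
    return x > 3

# building the pipeline runs nothing yet
pipeline = filter(greater_than_three, map(add_one, [1, 2, 3, 4, 5]))
calls_before = len(log)

# consuming the pipeline plays the role of an action
result = list(pipeline)

print(calls_before)  # 0
print(result)        # [4, 5, 6]
print(log[:4])       # ['map(1)', 'filter(2)', 'map(2)', 'filter(3)']
```

The log shows each element flowing through both steps before the next element is touched, so no full intermediate list is ever built.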

In the last exercise, Spark executed our transformations only when the action `collect()` was called to return the entire contents of the new RDD as a list. We generally don't want to use `collect()` to pull large amounts of data into memory, so we can use `take(n)` to view the first **_n_** elements of a large RDD.

```py
# input RDD [1,2,3,4,5]
rdd.take(3)
```
```
[1, 2, 3]
```

We can use the action `reduce()` to aggregate the elements of our RDD into a single value by repeatedly applying a two-argument function. For example, say we want to add up all the values in the RDD. We can use `reduce()` with a lambda that adds two elements at a time until only the total remains.

```py
# input RDD [1,2,3,4,5]
rdd.reduce(lambda x,y: x+y)
```
```
15
```

`reduce()` is powerful because it allows us to apply many arbitrary operations to an RDD &mdash; it unbinds us from searching for library functions that might not exist. However, it certainly has limitations, which we’ll dive into in the next exercise.
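
To illustrate how one reducing pattern covers several aggregations, here is a minimal sketch using `functools.reduce` from the Python standard library. It runs on a plain list rather than an RDD, so it shows only the shape of the lambdas, which would be passed to `rdd.reduce()` in the same form.

```python
from functools import reduce

values = [1, 2, 3, 4, 5]

total = reduce(lambda x, y: x + y, values)                 # sum of all elements
product = reduce(lambda x, y: x * y, values)               # product of all elements
largest = reduce(lambda x, y: x if x > y else y, values)   # maximum element

print(total, product, largest)  # 15 120 5
```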

The key thing about actions is that, like transformations, they take an RDD as input, but they will always output a value instead of a new RDD.

Actions

The `reduce()` function we used previously is a powerful aggregation tool, but there are limitations to the operations it can apply to RDDs. Namely, the function we pass to `reduce()` must be **commutative** and **associative** due to the nature of parallelized computation.

You’ve probably heard of both those terms back in elementary math class, and they probably make sense to you in that context. However, what do they mean in Spark?

Well, it all ties back to the fact that Spark operates in parallel &mdash; tasks that have commutative and associative properties allow for parallelization. 
* The commutative property allows for all parallel tasks to execute and conclude without waiting for another task to complete. 
* The associative property allows Spark to partition and distribute our data to multiple nodes because the result will stay the same no matter how tasks are grouped. 

Let’s try to break that down a bit further with math!
No matter how you switch up or break down summations, they’ll always have the same result thanks to the commutative and associative properties:
```tex
1+2+3+4+5 = (3+4+5)+(1+2) = (4)+(2+5+1)+(3) = 15
```
However, this is not the case with division:
```tex
1\div2\div3\div4\div5 \neq (3\div4\div5)\div(1\div2) \neq 4\div(2\div5\div1)\div3
```
The flowchart represents one of the possible ways that our list was partitioned into three nodes and ultimately summed. No matter how our data was partitioned or which summations were completed first, the answer will be 15.

![Flowchart showing a list partitioned across three nodes and summed to a total of 15](https://static-assets.codecademy.com/Courses/big-data-pyspark/rdd-add.svg)

This shows that the commutative and associative properties enable parallel processing because they give us two very important guarantees: the output doesn’t depend on the order in which tasks complete (commutative), nor does it depend on how the data is grouped (associative).
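
The effect of partitioning on a reduce can be simulated in plain Python. In the sketch below, `partitioned_reduce` is a hypothetical helper that reduces each partition separately and then combines the partial results, which is roughly the shape of a parallel reduce; it is an illustration under that assumption, not Spark's implementation. `Fraction` is used so the division results are exact.

```python
from fractions import Fraction
from functools import reduce

def partitioned_reduce(func, partitions):
    """Reduce each partition on its own, then reduce the partial results."""
    partials = [reduce(func, part) for part in partitions]
    return reduce(func, partials)

add = lambda x, y: x + y
divide = lambda x, y: Fraction(x) / Fraction(y)

# the same five numbers, split and ordered in two different ways
layout_a = [[1, 2], [3, 4, 5]]
layout_b = [[3, 4, 5], [1, 2]]

sum_a = partitioned_reduce(add, layout_a)
sum_b = partitioned_reduce(add, layout_b)
div_a = partitioned_reduce(divide, layout_a)
div_b = partitioned_reduce(divide, layout_b)

print(sum_a, sum_b)  # 15 15
print(div_a, div_b)  # 10/3 3/10
```

Addition gives 15 for both layouts, while division gives a different answer for each, which is why a non-commutative or non-associative function can produce unpredictable results when passed to `reduce()`.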

Associative and Commutative Properties

Imagine having an RDD containing two-letter state abbreviations. 

```py
# list of states
states = ['FL', 'NY', 'TX', 'CA', 'NY', 'NY', 'FL', 'TX']
# convert to RDD
states_rdd = spark.sparkContext.parallelize(states)
```

However, we want the region instead of the state. Regions are groupings of states based on their geographic location, such as "East" or "South". Currently, our RDD is partitioned in the Spark cluster, and we don’t know which nodes contain data on which states. 

In this situation, we need to send the conversion information to all nodes because it’s very likely that each node will contain multiple distinct states. We can provide each node with information on which states belong in each region. This information that is made available to all nodes is what Spark calls **broadcast variables**. Let’s see how we can implement them to convert the abbreviations! 

Let’s start off by creating a conversion dictionary called `region` that matches each state to its region:

```py
# dictionary of regions
region = {"NY":"East", "CA":"West", "TX":"South", "FL":"South"}
```

We can then broadcast our `region` dictionary and apply the conversion to each element in the RDD with our map function:

```py
# broadcast region dictionary to nodes
broadcast_var = spark.sparkContext.broadcast(region)
# map regions to states
result = states_rdd.map(lambda x: broadcast_var.value[x])
# view first four results
result.take(4)
# output: ['South', 'East', 'South', 'West']
```

This is Spark’s efficient method of sharing variables amongst its nodes (also known as **shared variables**). They ultimately improve performance by decreasing the amount of data transfer overhead because each node already has a cached copy of the required object. However, it should be noted that we would never want to broadcast large amounts of data because the size would be too much to serialize and send through the network.
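
One practical detail: a direct dictionary lookup raises a `KeyError` if the RDD contains a state that is missing from the dictionary. The sketch below shows a defensive lookup with `dict.get()` in plain Python, with no Spark involved; the extra state `'WA'` and the fallback label `'Unknown'` are assumptions added for illustration.

```python
region = {"NY": "East", "CA": "West", "TX": "South", "FL": "South"}

# 'WA' is deliberately absent from the dictionary
states = ["FL", "NY", "TX", "CA", "WA"]

# dict.get() returns the fallback instead of raising KeyError
result = [region.get(state, "Unknown") for state in states]
print(result)  # ['South', 'East', 'South', 'West', 'Unknown']
```

Because the `.value` of a broadcast dictionary is an ordinary Python `dict`, the same guard could be written inside the map lambda as `broadcast_var.value.get(x, "Unknown")`.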

Broadcast Variables

We’ve broadcast a dictionary over to our nodes, and everything went well! We’re now curious as to how many "East" versus "West" entries there are. We could attempt to create a couple of variables to keep track of the counts, but we might run into serialization and overhead issues when datasets get really big. Thankfully, Spark has another type of shared variable that solves this issue: **accumulator variables**.

Accumulator variables can be updated and are primarily used as counters or sums. Conceptually, they’re similar to the sum and count functions in NumPy.

Let’s see how we can implement accumulator variables by counting the number of 'East' and 'West' entries. Since this will be a new dataset, let’s create an RDD first:
```py
region = ['East', 'East', 'West', 'South', 'West', 'East', 'East', 'West', 'North']
rdd = spark.sparkContext.parallelize(region)
```

We’ll start off by initializing the accumulator variables at zero:
```py
east = spark.sparkContext.accumulator(0)
west = spark.sparkContext.accumulator(0)
```

Let’s create a function to increment each accumulator by one whenever Spark encounters 'East' or 'West':
```py
def countCoasts(r):
    # increment the matching accumulator; other regions are ignored
    if r == 'East':
        east.add(1)
    elif r == 'West':
        west.add(1)
```

We’ll take the function we created and run it against each element in the RDD.
```py
rdd.foreach(countCoasts)
print(east.value) # output: 4
print(west.value) # output: 3
```
This seems like a simple concept, but accumulator variables can be very powerful in the right situation. They can keep track of the inputs and outputs of each Spark task by aggregating the size of each subsequent transformation. Instead of counting the number of east or west coast states, we could count the number of NULL values or the resulting size of each transformation. This is important to monitor for data loss. 

This doesn’t mean you should add accumulator variables to everything though. It’s best to avoid using accumulators in _transformations_. If a task or stage is re-executed, for example after a failure, its updates may be applied more than once, which would incorrectly increment the accumulator. For updates made inside _actions_, however, Spark guarantees that each task’s update is applied only once.

Accumulators can be great as debugging or summary tools, but they’re not infallible when used in transformations.
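
The over-counting risk can be sketched with a toy model in plain Python. Here a dictionary stands in for an accumulator, and a function that rebuilds a lazy `map()` stands in for a transformation that gets re-executed; all names (`counter`, `tag_missing`, `lazy_mapped`) are hypothetical, and this illustrates the effect rather than Spark's internals.

```python
counter = {"none_seen": 0}   # stands in for an accumulator

def tag_missing(x):
    # side effect inside the "transformation": count missing values
    if x is None:
        counter["none_seen"] += 1
    return x

data = [1, None, 3, None]

def lazy_mapped():
    # rebuilt on every use, like a transformation that is executed again
    return map(tag_missing, data)

first_pass = list(lazy_mapped())             # first "action" runs the map
second_pass = sum(1 for _ in lazy_mapped())  # second "action" runs it again

print(counter["none_seen"])  # 4, although data holds only 2 missing values
```

Each extra execution of the transformation repeats the side effect, so the counter ends at twice the true value.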

Accumulator Variables

Congratulations! You've just finished your first coding adventure with PySpark! In this lesson, we learned that:
* RDDs are the foundational data structure of Spark
* RDDs are fault-tolerant, partitioned, and operated on in parallel
* Transformations are lazy and do not execute until an action is called

We also learned how to:
* Transform and summarize RDDs with transformations and actions
* Send information to all nodes with broadcast variables
* Debug work with accumulator variables

Review

RDDs with PySpark

Learn one way that Spark handles big data -- through Resilient Distributed Datasets (RDDs).

Spark RDDs with PySpark


Introduction to Big Data

## Big Data Challenges
Every single day, over 2.5 quintillion bytes of data are created. That's 2.5 × 10^18 bytes, or 25 followed by 17 zeroes! From transactional sales data to Internet of Things (IoT) devices, data sources grow in both size and velocity at a rapid rate. When thinking about the massive scale of data, we might wonder: where are all of these data stored? And how do we get enough computing power to process them?

Traditionally, we can view a basic dataset as a table in Excel or an equivalent application. These standard solutions require that we pull an entire dataset into memory on a single processing machine. When a data table becomes very large, it will exceed the random access memory (RAM) available for computation and either crash or take too long to process, making analysis impossible. Thus, we need to find alternative ways to store and process this big data!

## Big Data Storage
A popular solution for big datasets is a distributed file system on a network of hardware called a cluster. A **cluster** is a group of several machines called **nodes**, with a **cluster manager** node and multiple **worker** nodes.

![Illustration showing the structure of a cluster. The cluster manager has computing power and sends commands to three worker nodes. The worker nodes have both storage and computing power.](https://static-assets.codecademy.com/Courses/big-data-pyspark/storage-computing-cluster.svg)

The cluster manager manages resources and sends commands to the worker nodes that store the data. Data saved on worker nodes are replicated multiple times for fault tolerance. This allows access to the complete dataset even in the event that one of the worker nodes goes offline. This type of file storage system also scales easily, as additional worker nodes can be added when more capacity is needed.

### Hadoop Distributed File System

One commonly used framework for a cluster system is called **Hadoop Distributed File System (HDFS)**, which is part of a set of tools distributed by Apache. HDFS was designed to store vast amounts of data to be processed using another framework called MapReduce. However, implementing a distributed file system like this requires a specific hardware configuration that can be a costly barrier to entry for many companies. For this reason, cloud-hosted HDFS is a popular fix. Microsoft Azure and Amazon Web Services (AWS) offer cloud-based HDFS solutions, allowing companies to outsource a system's setup and hardware management for a fixed monthly cost. 

Because HDFS solutions both store and process data on each worker node, they ensure that we have enough computing power to tackle our data problems. When data grow in size, our number of nodes may be increased to add more storage and computing power. This is advantageous for scaling but can become expensive as the number of nodes increases.


### Object Storage

Another type of distributed file system is growing quickly in popularity because it separates storage from computing power. **Object storage** is a framework that is only for storage so that we can use any kind of computing power or framework on top of our data. Cloud providers like Microsoft Azure, Amazon Web Services (AWS), and Google Cloud host object storage layers, where we can store any kind of file and dataset. 

These storage layers have an advantage over HDFS in that they have a low barrier to entry and are very flexible. Users can store any kind of file in a variety of formats, from CSVs and Parquet to other open-source formats that provide better performance and reliability such as Delta and Iceberg. This separation also allows us to grow either storage or computing power independently of the other to meet needs more efficiently.

<Assessment id="f0ad455bab1c45429ccd1baf6531b0d5" />

## Big Data Computing
Clusters can be composed of any kind of computing power resource. Traditionally, clusters would be a collection of server racks that a person or organization could connect to directly. In modern systems, these machines are typically virtual machines (VMs) that are "spun up" in one of the cloud providers' environments. This kind of approach has many benefits, namely that it is much more cost-efficient and more scalable than dealing with physical machines.

So how exactly does a big data computing system work? The first widely used method by which data are collected and analyzed across each node is called **MapReduce**. MapReduce is a framework composed of two steps: map and reduce. The map function runs on each node and emits the specifically defined elements of its data as key-value pairs. The pairs are then grouped by key, and the reduce function aggregates the values for each key into a result that is returned as output.

Check out the following interactive diagram to see a representation of MapReduce in action. Select each function sequentially to split the data, map the count of each shape, shuffle the counts by shape, reduce to just three counts, and return the results. MapReduce speeds up the processing of big data by having each worker node perform these operations on its own chunk of the dataset so that all workers are engaged and not waiting on another process to finish.

<iframe src="https://static-assets.codecademy.com/Courses/big-data-pyspark/sc-mapreduce-int/index.html" title="Interactive diagram illustrating the MapReduce process" height='700px' frameBorder="0"></iframe>
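
The split, map, shuffle, and reduce steps can also be sketched in a few lines of plain Python. This is a single-machine illustration of the pattern, not the Hadoop MapReduce API, and the shape data in `chunks` is invented for the example.

```python
from collections import defaultdict

# split: each inner list stands in for the chunk held by one worker node
chunks = [
    ["circle", "square", "circle"],
    ["triangle", "square"],
    ["circle", "triangle", "triangle"],
]

# map: every worker emits a (key, 1) pair for each shape in its chunk
mapped = [[(shape, 1) for shape in chunk] for chunk in chunks]

# shuffle: group the emitted values by key across all workers
grouped = defaultdict(list)
for pairs in mapped:
    for shape, count in pairs:
        grouped[shape].append(count)

# reduce: collapse each key's values into a single count
counts = {shape: sum(values) for shape, values in grouped.items()}
print(counts)  # {'circle': 3, 'square': 2, 'triangle': 3}
```

In a real cluster the map step runs on all chunks at the same time, and only the small key-value pairs travel across the network during the shuffle.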

MapReduce was the standard for big data processing for a while, but over time it could not keep up with the rate at which data were growing and changing. With time, Apache Spark emerged as a better alternative for processing. Spark's main benefit was the ability to process data in the node's memory instead of processing on disk as MapReduce does. This provided much better performance and unlocked new capabilities for working with big data.

In order to get value out of big data, we need to utilize the best strategies for storing our data and providing computing power for our analysis. With these in place, we can be ready to scale and grow our analyses!

<FreeResponseQuestion id="cb6bf6d7910840fc9755d1676462db07" />

<Assessment id="54c6b8c44cb544c2ab790ebf426fbb76" />

Learn about the challenges of storing and analyzing big data

Big Data Storage and Computing

[algorithms]: https://www.codecademy.com/resources/docs/general/algorithm
[google search]: https://www.google.com/search?q=cases+where+drivers+drove+into+lakes+and+rivers+because+their+GPS+instructed+them+to&rlz=1C5CHFA_enUS742US742&oq=cases+where+drivers+drove+into+lakes+and+rivers+because+their+GPS+instructed+them+to&aqs=chrome..69i57.233j0j7&sourceid=chrome&ie=UTF-8
[A Reuters article from 2018]: https://www.reuters.com/article/us-amazon-com-jobs-automation-insight/amazon-scraps-secret-ai-recruiting-tool-that-showed-bias-against-women-idUSKCN1MK08G
[The Gender Shades project]: http://gendershades.org/index.html
[Buolamwini et al. 2018, _Proceedings of Machine Learning Research_]: http://proceedings.mlr.press/v81/buolamwini18a/buolamwini18a.pdf

[Image of bus driving into a lake]: https://static-assets.codecademy.com/Courses/data-literacy/analyses/1683.svg
[Bar plot showing compositions of the three benchmarking datasets. Over 80% of the Adience dataset is equal parts lighter males and females. The IJB-A dataset is about 60% lighter male and 20% lighter female, with less than 10% darker female. The new PPB dataset is nearly equal quarters of lighter male, lighter female, darker male, and darker female.]: https://static-assets.codecademy.com/Courses/data-literacy/analyses/FacialBenchmarks.svg

Would you follow your GPS anywhere? Even into a lake? That may sound ridiculous, but a quick [google search] brings up dozens of cases where drivers drove into lakes and rivers because their GPS instructed them to. Following GPS instructions against your better judgment is one example of **automation bias**. 

![Image of bus driving into a lake]

As humans, we have many biases, both implicit and explicit. Biases are systematic errors in thinking influenced by cultural and personal experiences. Biases distort our perception and cause us to make incorrect decisions. One bias that many humans share is automation bias. Automation bias stems from the idea that computers or machines are more trustworthy than humans because they are more objective. Automation bias is at the root of why people follow their GPS into trouble, even when contradictory information is available.

Computers, data, and [algorithms] are not actually completely objective. It is true that data analysis can help us make better decisions, but it is not immune to bias. Humans create technologies and algorithms. As a result, they often have human biases encoded into them. It's clear that we need to pay attention to other information streams (our eyes and ears) when we drive with GPS. Similarly, we need to look at more information sources when we evaluate data analysis results or reports.

If we want to be responsible when we use data and algorithms, we need to understand the different types of bias that show up at each stage of analyzing data. Let’s take a closer look at some types of bias that impact data analysis and data-driven decision-making.

## Bias in data collection

Before we can analyze data or use machine learning algorithms, we need to collect data. Data collection is subject to **selection bias** (also called sample bias). Selection bias occurs when study subjects (_i.e._, the sample) are not representative of the population. Selection bias can be due to poor study design if the sample is too small or is not randomized. Selection bias can also crop up when the only data available is influenced by **historical bias** &mdash; systematic influence based on historic social and cultural beliefs. 

[A Reuters article from 2018] highlights how the company Amazon produced a machine-learning algorithm that suffered from such a selection bias. The company designed the algorithm to help recruiters hire top talent. The model was trained on thousands of resumes from people who were or were not hired by Amazon. It learned 50,000 phrases associated with resumes and began to ignore common phrases, such as the names of programming languages. However, the algorithm also learned to downgrade resumes that contained the word "women’s." This included resumes that referenced women’s colleges, teams, or committees. 

This is an example of selection bias because the data used to train the algorithm were not representative of the modern applicant pool. The majority of Amazon's past applicants and employees were male. This means a larger proportion of the successful resumes in the training data came from male applicants. Amazon did not explicitly train the algorithm to use gender. Yet, the algorithm still found and used gender-associated terms to weed out women candidates. 

We can do our best to avoid selection bias by doing everything possible to have a representative sample, not just a convenient one. For example, it’s a good idea to include data inputs from multiple sources to diversify data. This is easier said than done, however, and we need to acknowledge and address historical bias in data sources and work towards building frameworks to increase inclusivity. 

## Bias in building and optimizing algorithms

**Algorithmic bias** arises when an algorithm produces systematic and repeatable errors that lead to unfair outcomes, such as privileging one group over another. Algorithmic bias can be initiated through selection bias and then reinforced and perpetuated by other bias types.

Facial recognition software is an area where algorithmic bias can do a lot of harm. This software is sold to police departments and used to recognize criminals in surveillance footage. If the software systematically makes more mistakes depending on race or gender, people in some groups will be incorrectly pursued more often, which has serious, negative outcomes for individuals.
 
[The Gender Shades project] tested commercial facial recognition software for these kinds of biases. IBM, Microsoft, and Face++ are three companies that offer facial recognition software with a binary gender classifier feature. Researchers assessed the accuracy of these algorithms and discovered that they suffered from algorithmic bias. The algorithms were good at identifying lighter males, okay at identifying darker males and lighter females, and very bad at identifying darker females. 

Each software product used proprietary algorithms, and the companies did not report performance results with benchmarking datasets. However, the developers probably tested the software on one of two commonly-used benchmarking datasets: Adience or IJB-A. These datasets include few dark-skinned people and especially low proportions of dark-skinned females. Testing an algorithm with a non-representative dataset leads to **evaluation bias**. Testing with a non-representative benchmarking dataset would give high overall accuracy scores, even if the algorithms were inaccurate for certain groups.
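
A quick calculation shows how a skewed benchmark can hide poor performance. The counts below are hypothetical, invented purely for illustration, and are not results from the Gender Shades project or any real software.

```python
# Hypothetical benchmark counts, invented purely for illustration
benchmark = {
    "lighter-skinned": {"total": 900, "correct": 891},
    "darker-skinned": {"total": 100, "correct": 65},
}

total = sum(group["total"] for group in benchmark.values())
correct = sum(group["correct"] for group in benchmark.values())
overall_accuracy = correct / total

# accuracy computed separately for each group
group_accuracy = {
    name: group["correct"] / group["total"] for name, group in benchmark.items()
}

print(f"overall: {overall_accuracy:.1%}")   # overall: 95.6%
for name, accuracy in group_accuracy.items():
    print(f"{name}: {accuracy:.1%}")        # 99.0% and 65.0%
```

Because one group makes up 90% of this made-up benchmark, the overall score stays above 95% even though the other group is misclassified more than a third of the time. Reporting accuracy per group makes the gap visible.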
 
Another key point when it comes to algorithmic bias in facial recognition software is that the algorithms are proprietary, making them “black boxes”. In addition to not knowing what data were used to train and test the algorithm, we can’t know how it was designed or how it works. As a result, it’s impossible to evaluate the algorithms themselves.
 
Avoiding algorithmic bias relies on transparency, especially concerning data used for training and testing an algorithm. In response to the poor performance of facial recognition with darker females, a new benchmarking dataset was developed (PPB) that is more representative of the full spectrum of humanity. This is a big step forward, as long as the new dataset is actually used by companies making and selling facial recognition software.

![Bar plot showing compositions of the three benchmarking datasets. Over 80% of the Adience dataset is equal parts lighter males and females. The IJB-A dataset is about 60% lighter male and 20% lighter female, with less than 10% darker female. The new PPB dataset is nearly equal quarters of lighter male, lighter female, darker male, and darker female.]

Data for this plot came from [Buolamwini et al. 2018, _Proceedings of Machine Learning Research_].

## Bias in interpreting results and drawing conclusions 

Bias also influences the final stages of data analysis: interpreting results and drawing conclusions. The following bias types are ones we should watch out for when evaluating or generating data reports:

- **Confirmation bias** is our tendency to seek out information that supports our views. Confirmation bias influences data analysis when we consciously or unconsciously interpret results in a way that supports our original hypothesis. To limit confirmation bias, clearly state hypotheses and goals before starting an analysis, and then honestly evaluate how they influenced our interpretation and reporting of results.

- **Overgeneralization bias** is inappropriately extending observations made with one dataset to other datasets, leading to overinterpreting results and unjustified extrapolation. To limit overgeneralization bias, be thoughtful when interpreting data, only extend results beyond the dataset used to generate them when it is justified, and only extend results to the proper population. 

- **Reporting bias** is the human tendency to only report or share results that affirm our beliefs or hypotheses, also known as “positive” results. Editors, publishers, and readers are also subject to reporting bias as positive results are published, read, and cited more often. To limit reporting bias, report negative results and cite others who do, too. 

## Conclusions

Data and machine learning algorithms are now ubiquitous. They influence decisions about who is hired or fired, accepted into schools, or allowed to rent houses. They even influence which neighborhoods are more heavily policed and who is granted parole. Therefore, we must recognize that data and algorithms can be biased, just like the humans who create and train them. Learning more about the types of bias that influence how algorithms function will improve our ability to perform and interpret data analyses and will help us make more informed decisions.

Bias is everywhere in data. The key to combatting bias is knowing what to look out for.

Bias in Data

## Data is Everywhere

We generate data from all kinds of activities, whether a transaction at a local store, a website we visit, or even the location of our cell phone. At the time of this writing, an average of 500 million tweets are written on Twitter each day. Imagine being the person at Twitter who has to analyze this data! Data of this size is often referred to as _big data_.

What exactly is big data? In general, big data is any data that is too big for a typical modern computer to process and analyze. This means, however, that the definition of big data is relative to the amount of computing power we have available. For example:
* Most current personal computers have between 8 and 32 GB of random access memory (RAM) available for data processing. That means, from the perspective of a personal computer, any dataset larger than 10-20 GB might be too large to process.
* A large enterprise can take advantage of larger computing resources (e.g., a warehouse of servers or the cloud), so 100+ GB might be the upper limit for the size of data the enterprise can handle.
* Data measured in terabytes is perhaps the largest amount of data being worked with at this time (1 TB = 1000 GB).

_**In the following applet, try adjusting the slider to see what qualifies as big data as we increase our modern computing power**_

<iframe src="https://static-assets.codecademy.com/Courses/big-data-pyspark/big-data-slider/index.html" title="Interactive illustration of big data getting larger as computing power becomes greater" height='600px' frameBorder="0"></iframe>

Big data hasn’t always been a concept with respect to data analysis. For most of history, we have been able to handle the amount of data we collect. Before computers, scientists would perform calculations on handwritten data for a research sample. With the invention of computers, we were able to process data more quickly and were generally able to keep up with the amount of data we had available. 

In more recent history, however, sources of data have continued to grow and are outpacing the growth of computing power. In the mid-2000s, after the massive growth of the internet, many analysts in the industry were struggling to handle their own data. Roger Mougalas coined the term “big data” when referring to a dataset that was unmanageable with current business intelligence tools.

<Assessment id="4f6eac2481684fcebe4fc20dbcbdf6a9" />


## The 3 Vs

Big data is a relative concept that can be difficult to grasp. It may be easier to define big data using the features that make it hard to handle in the first place. We can generally categorize big data by what are known as the three Vs: volume, velocity, and variety. Depending on where you look (or who you ask), there may be a different number of Vs. Some will say that there are 4, 5, 10, or even 17 Vs of big data! The three Vs we talk about here are the core of most definitions and give a complete picture of big data's features.

![Image illustrating the 3 Vs of big data: volume, velocity, and variety. The volume image shows four database icons. The velocity image shows a bar graph with bars of increasing size. The variety image shows icons of music notes, a play button, a document, an image of mountains, and a map marker.](https://static-assets.codecademy.com/Courses/big-data-pyspark/big-data-3Vs.svg)

### Volume
Big data is “big”. While this may seem obvious, it’s an important concept to cover. As previously mentioned, the definition of “big” is that the data is bigger than the amount of available computing power. Currently, zettabytes of data are created every year (for reference, a zettabyte is 1 billion terabytes).
### Velocity
Big data has velocity, meaning that it is growing quickly. If data were simply large, but slow-changing, then over time our computing power would eventually catch up to the size of the data. Through means like apps and sensors, data becomes faster, cheaper, and easier to collect automatically and continuously.
### Variety
Big data also has variety, meaning that it comes in different, and sometimes complex, forms. In today’s data ecosystem, data comes in many more formats than the data tables of old. Data can be categorized as structured (data tables with rows and columns), semi-structured (think JSON files with nested data), and unstructured (audio, image, and video data). Each of these data formats presents different challenges in processing.
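To make the three categories concrete, here is a minimal sketch in plain Python. The field names and values are made up for illustration and do not come from any real dataset.

```python
import json

# Structured: every record has the same flat set of columns
structured_row = {"user_id": 1, "city": "Lisbon", "purchases": 3}

# Semi-structured: JSON records can nest and vary in shape
semi_structured = json.loads(
    '{"user_id": 1, "devices": [{"type": "phone", "os": "android"}], "tags": ["new"]}'
)

# Unstructured: raw bytes with no inherent fields (for example, audio or image data)
unstructured = bytes([137, 80, 78, 71])

# Reaching a nested value takes more work than reading a flat column
print(semi_structured["devices"][0]["type"])  # phone
```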

<Assessment id="90e3f33edc1e4a519a2f434c5077dad3" />

## Big Data Applications

What can we do with big data? When do we run into big data in the real world? Let's explore a few examples of big data applications across various industries.

### Social Media
With an average of 500 million tweets per day, Twitter data would definitely qualify as big data. Despite the massive amount of data at its disposal, [Twitter provides analytics for each user](https://analytics.twitter.com/about) with the ability to dive into historical Tweet activity and identify trends. In order to provide this for each Twitter user, they must be using some kind of big data toolkit to store and analyze the data.

### Healthcare
If we look at the healthcare industry, a rising trend that many providers want to enable is known as evidence-based medicine. Healthcare providers want to combine data from several sources, including cell phone apps, diagnostic tests, and previous medical records, to give recommendations for each patient. Providers hope this will avoid expensive and unnecessary tests and improve patient outcomes. In this case, there is a lot of data from many different sources and formats, but the efforts could provide a huge impact for their patients.

### Finance
In the financial industry, credit card companies aim to reduce the number of fraudulent transactions, as these cost money and cause hardship for their customers. Using different tools, credit card companies are able to analyze every single credit card transaction and use machine learning models to identify transactions that could be fraudulent. This saves a massive amount of money for the companies and for their customers!

## Final Thoughts
No matter which industry we look at, we will find numerous examples of big data everywhere, and there are more every day. However, we need to be aware that big data often comes with both big challenges and big effects. With the right tools and techniques, we can begin to extract value from big data while being mindful of both its limitations and impacts.


Learn about applications of big data and how we can describe it with the 3 Vs

What is Big Data?

Most data analysis solutions require that data be accessed and then pulled into memory before data processing tasks can be performed. When the dataset being analyzed is so large that it exceeds the random access memory (RAM) limitations of the computer being used, the analysis cannot be run on that machine.

For example, most standard laptop computers today come with 16 GB or 32 GB of RAM with 2-5 GB allocated to the operating system and background processes. If there is a 100 GB dataset that needs to be analyzed, then pulling the dataset into a pandas DataFrame, for example, is not possible.

![GIF showing a 100 GB dataset being put into an average computer. The computer returns the message "Memory ERROR" with a large red X.](https://static-assets.codecademy.com/Courses/big-data-pyspark/spark-error.gif)
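The arithmetic behind this example can be sketched in a few lines of plain Python. The function name `fits_in_memory` and the default overhead of 4 GB (a value inside the 2-5 GB range mentioned above) are illustrative assumptions. Real memory use also depends on how a library represents the data once it is loaded.

```python
def fits_in_memory(dataset_gb, ram_gb, overhead_gb=4):
    """Return True if the dataset fits in the RAM left over after OS overhead."""
    available_gb = ram_gb - overhead_gb
    return dataset_gb <= available_gb

print(fits_in_memory(100, 32))  # False
print(fits_in_memory(10, 32))   # True
```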

To solve the problem of storing big data, we can split up a dataset and store it across multiple computers. This framework is known as a distributed file system. A popular choice is the Hadoop Distributed File System (HDFS), which splits up a dataset and stores it across multiple worker nodes in a cluster. 

MapReduce is a computing framework for analyzing datasets housed on a distributed file system like HDFS. MapReduce is a disk-oriented method, which means it writes data to disk in intermediate steps of analysis. While this method allows us to process data stored on HDFS, it can still be a slow process for analyzing larger datasets.
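To give a feel for the programming model, here is a minimal single-process sketch of a MapReduce-style word count in plain Python. It is a toy under stated assumptions: the function names are made up for illustration, everything runs in memory on one machine, and a real MapReduce job would run the phases on many nodes and write intermediate results to disk.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word, 1) for word in chunk.split()]

def shuffle_phase(mapped_chunks):
    """Shuffle: group all emitted values by key across chunks."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for word, count in pairs:
            grouped[word].append(count)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine each key's values into a single result."""
    return {word: sum(counts) for word, counts in grouped.items()}

# Each string stands in for a block of a file stored on a different worker node
chunks = ["big data is big", "spark handles big data"]
word_counts = reduce_phase(shuffle_phase([map_phase(c) for c in chunks]))
print(word_counts["big"])  # 3
```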

## Apache Spark and PySpark

As big data processing needs have grown, new technology has been developed. Spark is an analytics engine originally developed at UC Berkeley and later donated to the Apache Software Foundation, which maintains it as an open-source project. Spark was designed as a solution for processing big datasets and was specifically developed to build data pipelines for machine learning applications. Like MapReduce, Spark does not have its own file storage system and is designed to be used with distributed file systems like HDFS. However, Spark can also be run on a single node (a single computer) in local mode with a non-distributed dataset.

![Diagram illustrating how Spark works with a computing cluster to distribute a 100 GB dataset and process it using the RAM of 5 worker nodes. 20 GB of data are given to each worker. Spark sends a task to the cluster manager. The cluster manager sends the task to be processed by all 5 workers at once, each of which has 32 GB of RAM for processing.](https://static-assets.codecademy.com/Courses/big-data-pyspark/spark-gb-ram.svg)

Spark uses the RAM of each cluster node in unison, harnessing the power of multiple computers. Spark applications execute analyses up to 100 times faster than MapReduce because Spark caches data and intermediate tables in RAM rather than writing them to disk. However, as datasets become larger, the advantage of using RAM decreases and can disappear altogether.

### What is PySpark?

Spark was originally developed in Scala (an object-oriented and functional programming language). This presented users with the additional hurdle of learning to code in Scala to work with Spark. PySpark is an API developed to minimize this learning obstacle by allowing programmers to write Python syntax to build Spark applications. There are also APIs for Java and R.

## How Spark Works

Now that we have a general understanding of Spark, let's explore how a Spark application works in more detail. The Spark driver is the entry point of a Spark application and is used to create a Spark session. The driver program communicates with the cluster manager to create resilient distributed datasets (RDDs). To create an RDD, the data is divided up and distributed across worker nodes in a cluster. Copies of the RDD across the nodes ensure that RDDs are *fault-tolerant*, so information is recoverable in the event of a failure. Two types of operations can be performed on RDDs:
1. **Transformations** manipulate RDDs on the cluster. 
2. **Actions** return the result of a computation back to the main driver program.
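The difference between the two kinds of operations can be modeled with a toy class in plain Python. `ToyRDD` is a hypothetical single-machine stand-in, not part of PySpark. It only mimics the idea that a transformation such as `map` is recorded for later, while an action such as `collect` triggers the work and hands the result back to the caller.

```python
class ToyRDD:
    """A toy, single-machine stand-in for an RDD (not part of PySpark)."""

    def __init__(self, data, steps=None):
        self.data = list(data)
        self.steps = steps or []  # transformations recorded but not yet run

    def map(self, func):
        # Transformation: return a new ToyRDD with one more recorded step
        return ToyRDD(self.data, self.steps + [func])

    def collect(self):
        # Action: run every recorded step and return the results to the caller
        results = self.data
        for func in self.steps:
            results = [func(x) for x in results]
        return results

calls = []

def double(x):
    calls.append(x)  # track when the function actually runs
    return x * 2

doubled = ToyRDD([1, 2, 3]).map(double)
print(calls)              # [] (nothing has run yet)
print(doubled.collect())  # [2, 4, 6]
```

In PySpark itself, the corresponding calls on a real RDD are `rdd.map(...)` and `rdd.collect()`.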

![GIF showing the Spark driver program communicating with the cluster manager to tell worker nodes to create and manipulate RDDs. The worker nodes cache the data and execute the tasks, sending the results of any actions performed back to the Spark driver program.](https://static-assets.codecademy.com/Courses/big-data-pyspark/spark-cluster.gif)

The cluster manager determines the resources that the Spark application requires and assigns a specific number of worker nodes as executors to handle the processing of RDDs. Spark can be run on top of several cluster managers, including its built-in standalone manager, Hadoop’s YARN, Apache Mesos, and Kubernetes.

## Spark Modules

The driver program is the core of the Spark application, but there are also modules that have been developed to enhance the utility of Spark. These modules include:
* **Spark SQL**: an API that converts SQL queries and actions into Spark tasks to be distributed by the cluster manager. This allows for the integration of existing SQL pipelines without redevelopment of code and subsequent testing required for quality control. 
* **Spark Streaming**: a solution for processing live data streams that creates a discretized stream (DStream) of RDD batches.
* **MLlib and ML**: machine learning modules for designing pipelines used for feature engineering and algorithm training. ML is the DataFrame-based improvement on the original MLlib module.
* **GraphX**: a robust graphing solution for Spark. More than just visualizing data, this API converts RDDs to resilient distributed property graphs (RDPGs) which utilize vertex and edge properties for relational data analysis.

## Spark Limitations

Spark is a powerful tool for working with big data, but it does come with some limitations:
1. **Expensive hardware requirements**: Spark provides a solution for more time-efficient analyses of large distributed datasets, but those analyses can be considerably more expensive to run. The costs associated with Spark come from the need for a lot of RAM built into the worker nodes of a cluster, and RAM is much more expensive than disk storage.
2. **Real-time processing is not possible**: Spark Streaming offers near real-time data processing, but true real-time data analysis is not supported.
3. **Manual optimization is required**: The benefits and power of Spark must be optimized by the developer, which requires an advanced understanding of the program and backend, creating a technical hurdle for developers.

## Spark Use Cases

TripAdvisor is an online travel site that sources, generates, and analyzes massive amounts of data per day. All of the data processing for TripAdvisor is done using Spark. Natural language processing of reviews is [an example shared by TripAdvisor in an article on their site.](https://www.tripadvisor.com/engineering/using-apache-spark-for-massively-parallel-nlp/)

MyFitnessPal is a popular application for smartwatches and smartphones, owned by Under Armour, that tracks the diet and exercise of its users. The app uses Spark to analyze user data, both for its users and for internal marketing and demographic classification. You can [read more about how MyFitnessPal uses Spark in this article by the Wall Street Journal.](https://www.wsj.com/articles/BL-CIOB-7254)

We can find many other examples of Spark use cases for commercial business and research projects in recent years. There are also many other big data analysis solutions that are built using Spark code or using similar concepts. Some examples include:
* Delta Lake
* Apache Mesos
* Rumble (Apache)
* Databricks

For all of these reasons, understanding Spark is fundamental for anyone who works with big data.

Learn about Apache Spark and its application for big data analysis

What is Spark?

Transformations in Pandas are "eager", whereas transformations in Spark are "lazy".

Transformations will always run faster on Spark than Pandas.

Transformations in Pandas are "lazy", whereas transformations in Spark are "eager".

```py
sc.parallelize([1,2,3,4,5], 3).reduce(lambda x,y: x/y)
```


```py
sc.parallelize([1,2,3,4,5], 3).reduce(lambda x,y: x*y)
```
  


```py
sc.parallelize([1,2,3,4,5], 3).reduce(lambda x,y: x+y)
```

The code takes a count of the color green by running through each element in `rdd` and adding 1 to `greens` if the element is `green`.

The code runs through the RDD and sets the number of `green` elements in the RDD to 1.

The code runs through the RDD and sets the number of `green` elements in the RDD to 0.

The code creates a new RDD with three elements that are all `green`.

```py
color = ['purple', 'green', 'green', 'purple', 'blue', 'yellow', 'blue', 'green', 'blue']
rdd = spark.sparkContext.parallelize(color)

greens = spark.sparkContext.accumulator(0)

def count_green(x):
    if x == 'green': greens.add(1)

rdd.foreach(lambda x: count_green(x))
```

```py
double = lambda x: x * 2
double(rdd)
```

```py
data = [1,2,3,4,5]
rdd = spark.sparkContext.parallelize(data)
```

Introduction to PySpark RDDs

Use PySpark SQL and DataFrames to analyze Wikipedia clickstream data

Before we can load or query this data, we'll need to create a `SparkSession`. Create a new `SparkSession` and assign it to a variable named `spark`.

Use the `SparkSession` you've just created and the sample data from below to create an RDD. This data is a rough estimate of Clickstream counts for the article "Hanging_Gardens_of_Babylon". Later in the project we'll verify these counts with an analysis of the full data.

```python
sample_clickstream_counts = [
    ["other-search", "Hanging_Gardens_of_Babylon", "external", 47000],
    ["other-empty", "Hanging_Gardens_of_Babylon", "external", 34600],
    ["Wonders_of_the_World", "Hanging_Gardens_of_Babylon", "link", 14600],
    ["Babylon", "Hanging_Gardens_of_Babylon", "link", 2500]
]
```


Using the RDD you created in the previous step, create a DataFrame named `clickstream_sample_df` with the following column names:

- source_page
- target_page
- link_category
- link_count


Read the files in `./cleaned/clickstream/` into a new Spark DataFrame named `clickstream`. The raw clickstream data uses tab (`\t`) as a delimiter and has a header with column names. Once you've loaded the data, display the first few rows of the DataFrame in the notebook.


Print the schema of the DataFrame in the notebook. Do all the data types look correct?


Because we're only analyzing data from the English language Wikipedia, we can remove the language code column. Drop this column from the DataFrame and display the new schema in the notebook.


The column names `referrer` and `resource` can be a bit confusing to someone not familiar with this dataset. Rename them to `source_page` and `target_page`, respectively, and display the first few rows and DataFrame schema in the notebook.


In the following exercises, we're going to perform a few queries against the `clickstream` DataFrame using PySpark DataFrame methods and SQL. Add the `clickstream` DataFrame as a temporary view named `clickstream` so we can use SQL to perform analysis against it.


Earlier in the project, we created an RDD using approximated data. Let's check the real data to see how close that estimate was. Filter the dataset to entries with `Hanging_Gardens_of_Babylon` as the `target_page` and order the result by `click_count` using PySpark DataFrame methods.

Perform the same analysis as the previous exercise using a SQL query. Display your results in the notebook to confirm both methods yielded the same results.

Are more readers directed to articles through intra-article links, external searches, or other methods? We can answer this question by calculating the sum of `click_count` and grouping by `link_category` across the entire dataset. First, try to get to a solution using PySpark DataFrame methods.

Let's create a new DataFrame named `internal_clickstream` that only contains article pairs where `link_category` is `link`. Because we're filtering on that column, we should also limit the new DataFrame's columns to just "source_page", "target_page", and "click_count" columns. Display the first few rows and DataFrame schema in the notebook.


Save the `internal_clickstream` DataFrame as CSV files in a directory called `./results/article_to_article_csv/`. If you read this CSV file back from disk without specifying any options to the reader, will it be identical to the DataFrame you just saved?

Save the `internal_clickstream` DataFrame as parquet files in a directory called `./results/article_to_article_pq/`. Parquet is Spark's default file format and offers efficient data compression, is faster to perform analysis with than CSV, and preserves information about a dataset's schema on disk.

Now that we've saved the results of our work, we should close the `SparkSession` and underlying `sparkContext`. What would happen if you were to try to call `clickstream.show()` after closing the `SparkSession`?

Analyzing Wikipedia Clickstreams with PySpark

Leverage your SQL skills in PySpark to handle big data.

While we can directly analyze data using Spark's Resilient Distributed Datasets (RDDs), we may not always want to perform complicated analysis directly on RDDs. Luckily, Spark offers a module called Spark SQL that can make common data analysis tasks simpler and faster. In this lesson, we'll introduce Spark SQL and demonstrate how it can be a powerful tool for accelerating the analysis of distributed datasets.

The name Spark SQL is an umbrella term, as there are several ways to interact with data when using this module. We'll cover two of these methods using the PySpark API:

* First, we'll learn the basics of inspecting and querying data in a Spark DataFrame.

* Then, we'll perform these same operations using standard SQL directly in our PySpark code. 

Before using either method, we must start a `SparkSession`, the entry point to Spark SQL. The session is a wrapper around a `SparkContext` and contains all the metadata required to start working with distributed data.

The code below uses `SparkSession.builder` to set configuration parameters and create a new session. In the following example, we set one configuration parameter (`spark.app.name`) and call the `.getOrCreate()` method to initialize the new `SparkSession`.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder\
    .config('spark.app.name', 'learning_spark_sql')\
    .getOrCreate()
```

We can access the `SparkContext` for a session with `SparkSession.sparkContext`.
    
```python
print(spark.sparkContext) 
# <SparkContext master=local[*] appName=learning_spark_sql>
```

From here, we can use the `SparkSession` to create DataFrames, read external files, register tables, and run SQL queries over saved data. When we're done with our analysis, we can clear the Spark cache and terminate the session with `SparkSession.stop()`. Now that we're familiar with the basics of `SparkSession`, the next step is to begin using Spark SQL to interact with data!

---

_How to Use Your Jupyter Notebook:_
* _You can run a cell in the Notebook to the right by placing your cursor in the cell and clicking the `Run` button or the `Shift`+`Enter/Return` keys._
* _When you are ready to evaluate the code in your Notebook, press the `Save` button at the top of the Notebook or `command`+`s` keys before clicking the `Test Work` button at the bottom. **Be sure to save your solution code in the cell marked `## YOUR SOLUTION HERE ##` or it will not be evaluated.**_
* _When you are ready to move on, click `Next`._

![screenshot of the buttons at the top of the Jupyter Notebook interface with Save and Run highlighted](https://static-assets.codecademy.com/Courses/big-data-pyspark/Jupyter-buttons.png)

Introducing PySpark SQL

A PySpark SQL DataFrame is a distributed collection of data with a specific row and column structure. Under the hood, DataFrames are built on top of RDDs. Like pandas, PySpark SQL DataFrames allow a developer to analyze data more easily than by writing functions directly on underlying data.

DataFrames can be created manually from RDDs using `rdd.toDF(["names", "of", "columns"])`. In the example below, we create a DataFrame from a manually constructed RDD and name its columns `article_title` and `view_count`.
    
```python
# Create an RDD from a list
hrly_views_rdd  = spark.sparkContext.parallelize([
    ["Betty_White" , 288886],
    ["Main_Page", 139564],
    ["New_Year's_Day", 7892],
    ["ABBA", 8154]
])

# Convert RDD to DataFrame
hrly_views_df = hrly_views_rdd\
    .toDF(["article_title", "view_count"])
```

Let's take a look at our new DataFrame. We can use the `DataFrame.show(n_rows)` method to print the first `n_rows` of a Spark DataFrame. It can also be helpful to pass `truncate=False` so that long values are displayed in full rather than cut off.

```python
hrly_views_df.show(4, truncate=False)
```

```markdown
+--------------+----------+
|article_title |view_count|
+--------------+----------+
|Betty_White   |288886    |
|Main_Page     |139564    |
|New_Year's_Day|7892      |
|ABBA          |8154      |
+--------------+----------+
```

Great! Now that this data is loaded in as a DataFrame, we can access the underlying RDD with `DataFrame.rdd`. You likely won't need the underlying data often, but it can be helpful to keep in mind that a DataFrame is a structure built on top of an RDD. When we check the type of `hrly_views_df_rdd`, we can see that it's an RDD!

```python
# Access DataFrame's underlying RDD
hrly_views_df_rdd = hrly_views_df.rdd

# Check object type
print(type(hrly_views_df_rdd)) 
# <class 'pyspark.rdd.RDD'>
```



Creating Spark DataFrames

In this exercise, we'll learn how to pull in larger datasets from external sources. To start, we'll be using a dataset from Wikipedia that counts views of all articles by hour. For demonstration's sake, we'll use the first hour of 2022. Let's take a look at the code we might use to read a CSV of this data from a location on disk.

```python
print(type(spark.read)) 
# <class 'pyspark.sql.readwriter.DataFrameReader'>

# Read CSV to DataFrame
hrly_views_df = spark.read\
    .option('header', True)\
    .option('delimiter', ' ')\
    .option('inferSchema', True)\
    .csv('views_2022_01_01_000000.csv')
```
There are a few things going on in this code. Let's go through them one at a time:

This code uses the `SparkSession.read` property to create a new `DataFrameReader`.

The `DataFrameReader` has an `.option('option_name', 'option_value')` method that can be used to instruct Spark how exactly to read a file. In this case, we used the following options: 

- `.option('header', True)`: Indicates the file already contains a header row. By default, Spark assumes there is no header.

- `.option('delimiter', ' ')`: Indicates each column is separated by a space (' '). By default, Spark assumes CSV columns are separated by commas.

- `.option('inferSchema', True)`: Instructs Spark to make an extra pass over the data to determine each column's type. By default, Spark will treat all CSV columns as strings.

The `DataFrameReader` also has a `.csv('path')` method which loads a CSV file and returns the result as a DataFrame. There are a few quick ways of checking that our data has been read in properly. The most direct way is checking `DataFrame.show()`.

```python
# Display first 5 rows of DataFrame
hrly_views_df.show(5, truncate=False)
```

```
+-------------+---------------------------+------------+-------------+
|language_code|article_title              |hourly_count|monthly_count|
+-------------+---------------------------+------------+-------------+
|en           |Cividade_de_Terroso        |2           |0            |
|en           |Peel_Session_(Autechre_EP) |2           |0            |
|en           |Young_Street_Bridge        |1           |0            |
|en           |Troy,_Alabama              |1           |0            |
|en           |Charlotte_Johnson_Wahl     |10          |0            |
+-------------+---------------------------+------------+-------------+
```

Looks good! In this exercise, we used a `DataFrameReader` to pull a CSV from disk into our local Spark environment. However, Spark can read a wide variety of file formats. You can refer to the [PySpark documentation](https://spark.apache.org/docs/latest/api/python/search.html?q=DataFrameReader) to explore all available `DataFrameReader` options and file formats. In the following exercise, we'll start to analyze the contents of this file.


Spark DataFrames from External Sources

In this exercise, we're going to start to analyze our pageview data and learn how Spark can help with data exploration. Like pandas, Spark DataFrames offer a series of operations for cleaning, inspecting, and transforming data. Every DataFrame has a schema that defines its structure, columns, and datatypes. We can use `DataFrame.printSchema()` to show a DataFrame's schema.

```python
# Display DataFrame schema
hrly_views_df.printSchema()
```
```
root
|-- language_code: string (nullable = true)
|-- article_title: string (nullable = true)
|-- hourly_count: integer (nullable = true)
|-- monthly_count: integer (nullable = true)
```

We can then use `DataFrame.describe()` to see a high-level summary of the data by column. The result of `DataFrame.describe()` is a DataFrame in itself, so we append `.show()` to get it to display in our notebook. 

```python
hrly_views_df_desc = hrly_views_df.describe()
hrly_views_df_desc.show(truncate=False)
```

```
+-------+-------------+-------------+------------+-------------+
|summary|language_code|article_title|hourly_count|monthly_count|
+-------+-------------+-------------+------------+-------------+
|  count|      4654091|      4654091|     4654091|      4654091|
|   mean|         null|         null|     4.52417|          0.0|
| stddev|         null|         null|   182.92502|          0.0|
|    min|           aa|            -|           1|            0|
|    max|       zu.m.d|            -|      288886|            0|
+-------+-------------+-------------+------------+-------------+
```

From this summary, we can see a few interesting facts.
- About 4.65 million unique pages were visited this hour.
- The most visited page had almost 289,000 views, while the average page had just over 4.5 views.

Because this data was taken from the first hour of the month, it looks like the column `monthly_count` only contains zeros. Because it contains no meaningful information, we can drop this field with `DataFrame.drop("columns", "to", "drop")`.

```python
# Drop `monthly_count` and display new DataFrame
hrly_views_df = hrly_views_df.drop('monthly_count')
hrly_views_df.show(5)    
```

```
+-------------+---------------------------+------------+
|language_code|article_title              |hourly_count|
+-------------+---------------------------+------------+
|en           |Cividade_de_Terroso        |           2|
|en           |Peel_Session_(Autechre_EP) |           2|
|en           |Young_Street_Bridge        |           1|
|en           |Troy,_Alabama              |           1|
|en           |Charlotte_Johnson_Wahl     |          10|
+-------------+---------------------------+------------+
```

The data is starting to look pretty good, but let's make one more adjustment. The column `article_title` is a bit misleading: it seems this data contains articles, files, image pages, and Wikipedia metadata pages. We can replace this misleading header with a better name using `DataFrame.withColumnRenamed()`.

```python
hrly_views_df = hrly_views_df\
.withColumnRenamed('article_title', 'page_title')
``` 

Now when we call `.printSchema()` we see that the schema reflects the updates we've made to the DataFrame.

```
root
|-- language_code: string (nullable = true)
|-- page_title: string (nullable = true)
|-- hourly_count: integer (nullable = true)
```

You may have noticed that Spark assigned all columns `nullable = true`. Intuitively, we know that `page_title` shouldn't be null, but when the `DataFrameReader` reads a CSV, it assigns `nullable = true` to all columns. This is fine for now, but in some scenarios, you may wish to explicitly define a file's schema. If interested, you can refer to [PySpark's documentation on defining a file's schema.](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.schema.html)


Inspecting and Cleaning Data With PySpark

It's time to start performing some analysis–this is where PySpark SQL really shines. PySpark SQL DataFrames have a variety of built-in methods that can help with analyzing data. Let's get into a few examples!

Imagine we'd like to filter our data to pages from a specific Wikipedia `language_code` (e.g., `"kw.m"`). This site is not very active, so it's easy to use all of this hour's data for demonstration purposes. We can display this result with the code below:
    
```python
hrly_views_df\
    .filter(hrly_views_df.language_code == "kw.m")\
    .show(truncate=False)
```

```
+-------------+-----------------------+------------+-------------+
|language_code|article_title          |hourly_count|monthly_count|
+-------------+-----------------------+------------+-------------+
|kw.m         |Bresel_Diabarth_Spayn  |1           |0            |
|kw.m         |Can_an_Pescador_Kernûak|1           |0            |
|kw.m         |Ferdinand_Magellan     |1           |0            |
|kw.m         |Justė_Arlauskaitė      |16          |0            |
|kw.m         |Lithouani              |2           |0            |
|kw.m         |Nolwenn_Leroy          |1           |0            |
|kw.m         |Ohio                   |1           |0            |
|kw.m         |Taywan                 |1           |0            |
+-------------+-----------------------+------------+-------------+

This code uses the `DataFrame.filter()` method to select relevant rows. This is analogous to a SQL "WHERE" clause. In this case, our condition checks the column `language_code` for the value `"kw.m"`. What if we want to remove the `monthly_count` column and display the data ordered by the `hourly_count`? To do so, we could use the following: 

```python
hrly_views_df\
    .filter(hrly_views_df.language_code == "kw.m")\
    .select(['language_code', 'article_title', 'hourly_count'])\
    .orderBy('hourly_count', ascending=False)\
    .show(5, truncate=False)
```

```markdown
+-------------+-----------------------+------------+
|language_code|article_title          |hourly_count|
+-------------+-----------------------+------------+
|kw.m         |Justė_Arlauskaitė      |16          |
|kw.m         |Lithouani              |2           |
|kw.m         |Bresel_Diabarth_Spayn  |1           |
|kw.m         |Can_an_Pescador_Kernûak|1           |
|kw.m         |Nolwenn_Leroy          |1           |
+-------------+-----------------------+------------+
```

- `DataFrame.select()` is used to choose which columns to return in our result. You can think of `DataFrame.select(["A", "B", "C"])` as analogous to `SELECT A, B, C FROM DataFrame` in SQL.

- `DataFrame.orderBy()` is analogous to SQL's `ORDER BY`. We use `.orderBy('hourly_count', ascending=False)` to specify the sort column and order logic. This would be analogous to `ORDER BY hourly_count DESC` in SQL.
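
To make these analogies concrete, the sketch below mimics the filter, select, and order-by steps in plain Python, using the eight `kw.m` rows from the first output above. It only illustrates what each step computes, not how Spark executes it, and the variable names are chosen for this example.

```python
# Plain-Python illustration of filter -> select -> orderBy (this is not Spark code).
# The rows are copied from the `kw.m` output shown earlier.
all_columns = ["language_code", "article_title", "hourly_count", "monthly_count"]
raw_rows = [
    ("kw.m", "Bresel_Diabarth_Spayn", 1, 0),
    ("kw.m", "Can_an_Pescador_Kernûak", 1, 0),
    ("kw.m", "Ferdinand_Magellan", 1, 0),
    ("kw.m", "Justė_Arlauskaitė", 16, 0),
    ("kw.m", "Lithouani", 2, 0),
    ("kw.m", "Nolwenn_Leroy", 1, 0),
    ("kw.m", "Ohio", 1, 0),
    ("kw.m", "Taywan", 1, 0),
]
rows = [dict(zip(all_columns, values)) for values in raw_rows]

# .filter(...): keep only the rows that match the condition (SQL WHERE)
filtered = [r for r in rows if r["language_code"] == "kw.m"]

# .select([...]): keep only the named columns (SQL SELECT A, B, C)
keep = ["language_code", "article_title", "hourly_count"]
selected = [{c: r[c] for c in keep} for r in filtered]

# .orderBy('hourly_count', ascending=False): sort descending (SQL ORDER BY ... DESC)
ordered = sorted(selected, key=lambda r: r["hourly_count"], reverse=True)

# .show(5): display only the first five rows
top5 = ordered[:5]
for r in top5:
    print(r)
```

Python's `sorted()` keeps tied rows in their input order, whereas Spark makes no promise about the order of rows that tie on the sort column. That is why the Spark output above can list `Nolwenn_Leroy` where this sketch lists `Ferdinand_Magellan`.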

What if we'd like to select the sum of `hourly_count` by `language_code`? This could help us answer questions like "Which sites were most active this hour?" We can do that with the following:

```python
hrly_views_df\
    .select(['language_code', 'hourly_count'])\
    .groupBy('language_code')\
    .sum() \
    .orderBy('sum(hourly_count)', ascending=False)\
    .show(5, truncate=False)
```

```markdown
+-------------+-----------------+
|language_code|sum(hourly_count)|
+-------------+-----------------+
|en.m         |8095763          |
|en           |2693185          |
|de.m         |1313505          |
|es.m         |963835           |
|ru.m         |927583           |
+-------------+-----------------+
```

This code uses `DataFrame.groupBy('language_code').sum()` to calculate the sum of all numeric columns grouped by `language_code` (here, only `hourly_count`). `.groupBy(field)` and `.sum()` are analogous to SQL's `GROUP BY` clause and `SUM` function, respectively. This code also orders our results with `.orderBy()`, using the name of the constructed column, `'sum(hourly_count)'`.

There are many ways to use the DataFrame methods to query our data. However, if you're familiar with SQL, you may prefer to use standard SQL statements. In the next section, we'll explore how you can use standard SQL to explore data with PySpark.


Querying PySpark DataFrames

PySpark DataFrame's query methods are an improvement on performing analysis directly on RDDs. However, working with DataFrame methods still requires some practice, and the code can become quite verbose. Luckily, we can analyze data in Spark with standard SQL through the `SparkSession.sql()` method. This exercise will closely mirror the previous one, and we'll answer the same questions from that exercise using standard SQL. 

Before we can query a DataFrame with SQL in Spark, we must register it in the SparkSession's catalog. The following code registers the DataFrame as a local temporary view, a named reference that exists only for the lifetime of the current session. As long as the current `SparkSession` is active, we can use `SparkSession.sql()` to query it.

```python
hrly_views_df.createOrReplaceTempView('hourly_counts')
```

Each of the three sections of SQL below performs the same function as the DataFrame query methods described in the previous exercise. With the query below, we can filter our data to pages from a specific Wikipedia `language_code` (e.g., `"kw.m"`) using a `WHERE` clause.

```python
query = """SELECT * FROM hourly_counts WHERE language_code = 'kw.m'"""
spark.sql(query).show(truncate=False)
```
```
+-------------+-----------------------+------------+-------------+
|language_code|article_title          |hourly_count|monthly_count|
+-------------+-----------------------+------------+-------------+
|kw.m         |Bresel_Diabarth_Spayn  |           1|            0|
|kw.m         |Can_an_Pescador_Kernûak|           1|            0|
|kw.m         |Ferdinand_Magellan     |           1|            0|
|kw.m         |Justė_Arlauskaitė      |          16|            0|
|kw.m         |Lithouani              |           2|            0|
|kw.m         |Nolwenn_Leroy          |           1|            0|
|kw.m         |Ohio                   |           1|            0|
|kw.m         |Taywan                 |           1|            0|
+-------------+-----------------------+------------+-------------+
```

In the query below, we select the pages with `"kw.m"` as their `language_code`, keep three columns, and order the rows by `hourly_count` using an `ORDER BY` clause. As before, we display only the top five rows.

```python
query = """SELECT language_code, article_title, hourly_count
    FROM hourly_counts
    WHERE language_code = 'kw.m'
    ORDER BY hourly_count DESC"""

spark.sql(query).show(5, truncate=False)
```

```
+-------------+-----------------------+------------+
|language_code|article_title          |hourly_count|
+-------------+-----------------------+------------+
|kw.m         |Justė_Arlauskaitė      |          16|
|kw.m         |Lithouani              |           2|
|kw.m         |Bresel_Diabarth_Spayn  |           1|
|kw.m         |Can_an_Pescador_Kernûak|           1|
|kw.m         |Nolwenn_Leroy          |           1|
+-------------+-----------------------+------------+
```

Finally, we select the sum of `hourly_count` by `language_code` over the entire DataFrame using a SQL statement with `GROUP BY`, `SUM`, and  `ORDER BY`.

```python
query = """SELECT language_code, SUM(hourly_count) as sum_hourly_count
    FROM hourly_counts
    GROUP BY language_code
    ORDER BY sum_hourly_count DESC"""

spark.sql(query).show(5, truncate=False)
```

```
+-------------+----------------+
|language_code|sum_hourly_count|
+-------------+----------------+
|en.m         |8095763         |
|en           |2693185         |
|de.m         |1313505         |
|es.m         |963835          |
|ru.m         |927583          |
+-------------+----------------+
```
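
The query uses only standard SQL clauses, so its logic can be checked without a Spark session. The sketch below runs the same statement against an in-memory SQLite table with Python's built-in `sqlite3` module. The rows are made up for illustration, and using SQLite as a stand-in for Spark SQL is an assumption of this example; the two engines differ in many other respects.

```python
import sqlite3

# Toy stand-in for the `hourly_counts` view; the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hourly_counts "
    "(language_code TEXT, article_title TEXT, hourly_count INTEGER)"
)
conn.executemany(
    "INSERT INTO hourly_counts VALUES (?, ?, ?)",
    [
        ("en", "Page_A", 10),
        ("en", "Page_B", 5),
        ("kw.m", "Page_C", 2),
        ("en.m", "Page_D", 40),
    ],
)

# Same statement as in the lesson
query = """SELECT language_code, SUM(hourly_count) as sum_hourly_count
    FROM hourly_counts
    GROUP BY language_code
    ORDER BY sum_hourly_count DESC"""

cursor = conn.execute(query)
column_names = [d[0] for d in cursor.description]  # result column names
totals = cursor.fetchall()                         # one row per language_code
conn.close()

print(column_names)
print(totals)
```

The alias after `as` becomes the name of the aggregated column in the result, which is what lets the `ORDER BY` clause refer to it by name.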

Although querying data with SQL and DataFrame methods may look quite different, behind the scenes, Spark SQL's optimizer translates both into the same kind of execution plan. This means that as a developer, you can focus more on writing code for analysis in your preferred style rather than low-level execution details.


Querying PySpark with SQL

Once you've done some analysis, the next step is often saving the transformed data back to disk for others to use. In this final topic, we're going to cover how to efficiently save PySpark DataFrames.

Similar to the `SparkSession.read` property we use to load data, every DataFrame has a `DataFrame.write` property for saving data. Let's perform a slight modification to our original Wikipedia views dataset and save it to disk. The code below uses `.select()` to keep all columns except the `monthly_count` column (recall that earlier we discovered this column only contains zeros).

Because Spark runs all operations in parallel, it's typical to write DataFrames to a directory of files rather than a single CSV file. In the example below, Spark will split the underlying dataset and write multiple CSV files to `cleaned/csv/views_2022_01_01_000000/`. We can also use the `mode` argument of the `.csv()` method to overwrite any existing data in the target directory.

```python
hrly_views_df\
    .select(['language_code', 'article_title', 'hourly_count'])\
    .write.csv('cleaned/csv/views_2022_01_01_000000/', mode="overwrite")
```

Using `spark.read`, we can read the data from disk and check whether it looks the same as the DataFrame we saved.

```python
# Read DataFrame back from disk
hrly_views_df_restored = spark.read\
    .csv('cleaned/csv/views_2022_01_01_000000/')
hrly_views_df_restored.printSchema()
```
```
root
|-- _c0: string (nullable = true)
|-- _c1: string (nullable = true)
|-- _c2: string (nullable = true)
```

Close, but not quite! The restored DataFrame has lost its column names and datatypes. A CSV file has no way to store datatypes, and we didn't write a header row, so each time we read this data, we'll need to tell Spark exactly how it must be processed.
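
Losing type information is a property of the CSV format itself, not of Spark. As a small illustration using only Python's standard `csv` module (the two sample rows are taken from the `kw.m` output earlier in the lesson), integers written to CSV come back as strings:

```python
import csv
import io

rows = [("kw.m", "Lithouani", 2), ("kw.m", "Ohio", 1)]

# Write the rows to an in-memory CSV "file" with no header row
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)

# Read them back: every value is now plain text
buffer.seek(0)
restored = [tuple(record) for record in csv.reader(buffer)]

print(restored)
print(type(rows[0][2]).__name__, "->", type(restored[0][2]).__name__)
```

When reading a headerless CSV back into Spark, the column names and types have to be supplied again, for example through the `schema` argument of `DataFrameReader.csv()`.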

Luckily, there is a file format called "Parquet" that's specially designed for big data and solves this problem among many others. Parquet offers efficient data compression, is faster to perform analysis on than CSV, and preserves information about a dataset's schema. Let's try saving and re-reading this file to and from Parquet instead.

```python
# Write DataFrame to Parquet
hrly_views_df\
    .select(['language_code', 'article_title', 'hourly_count'])\
    .write.parquet('cleaned/parquet/views_2022_01_01_000000/', mode="overwrite")

# Read Parquet as DataFrame
hrly_views_df_restored = spark.read\
    .parquet('cleaned/parquet/views_2022_01_01_000000/')

# Check DataFrame's schema
hrly_views_df_restored.printSchema()
```
```
root
|-- language_code: string (nullable = true)
|-- article_title: string (nullable = true)
|-- hourly_count: integer (nullable = true)
```

Great, now anyone who wants to query this data can do so with the much more efficient Parquet data format!

Saving PySpark DataFrames

The Spark ecosystem can be quite expansive, but the skills you've gained from this lesson should help you as you begin to branch out and run your own analyses. In this lesson you've learned:

- How to construct Spark DataFrames from raw data in Python and Spark RDDs.

- How to read and write data from disk into Spark DataFrames, including an introduction to file formats optimized for big-data workloads.

- How to perform data exploration and cleaning on distributed data.

- How the PySpark SQL API can allow you to perform analysis on distributed data more easily than working directly with RDDs by using DataFrames.

- How to use the PySpark SQL API to query your datasets with standard SQL.


Concluding Remarks

PySpark SQL

Analyze website domain data using DataFrames, SQL, and RDDs with the big data tool PySpark.


One of your colleagues has made good progress analyzing this dataset using only PySpark RDDs, but has asked you to continue work on this project with SparkSQL. To get familiar with the dataset, you should run their analysis. 

Run the notebook cell that initializes a new `sparkContext` and reads the domain graph CSV file as an RDD. Display the first 10 entries of this file in the notebook.


Your colleague has written a function called `fmt_domain_graph_entry` that formats an entry in the domain dataset. Apply this function to `common_crawl_domain_counts` and save it as a new RDD called `formatted_host_counts`. Display the first 10 entries in the notebook.

Your colleague has written another function called `extract_domain_graph_host_count`. Apply this function to `common_crawl_domain_counts` and save the result to an RDD named `host_counts`. Study the function. What do you think the result will look like? Display the first 10 entries of this RDD in the notebook.


Using `host_counts`, calculate the total number of subdomains in the dataset, and save the result to a variable named `total_host_counts`.

We can do a bit more analysis more easily with PySpark SQL. Stop the current SparkSession and SparkContext before starting on our SparkSQL analysis.


Before we can load or query this data for ourselves, we'll need to create a `SparkSession`. Create a new `SparkSession` and assign it to a variable named `spark`.


Read `./crawl/cc-main-limited-domains.csv` into a new Spark DataFrame named `common_crawl`. This dataset doesn't have headers, so we can use Spark's auto-generated column names for now. Once you've loaded the data, display the first few rows of the DataFrame in the notebook.

Because this dataset doesn't have headers, we'll have to set them ourselves. Let's rename the autogenerated columns to the following: 

- site_id
- domain
- top_level_domain
- num_subdomains


Before moving on to analyzing this dataset, let's save it as parquet files. This will help our other colleagues work with it more easily. Save the `common_crawl` DataFrame as parquet files in a directory called `./results/common_crawl/`. 


That's a much better format for this dataset. Read `./results/common_crawl/` into a new DataFrame to confirm our DataFrame was saved properly. Display the first few rows of the new DataFrame and the schema in the notebook.


Answer the next few questions using either DataFrame methods or PySpark SQL. If you'd like extra practice, you can try using both methods. If you'd like to use SQL, initialize a temporary view from the `common_crawl_domains` DataFrame named `crawl`. 

Calculate the total number of subdomains for each top-level domain in the dataset, and order your result from highest to lowest total subdomain count.


From this dataset, we can also determine which top-level domains contain the most subdomains. Calculate the total number of subdomains for each top-level domain in the dataset and order your result from highest to lowest total subdomain count.

Let's say our analysis is particularly interested in the number of subdomains maintained by different government agencies. Filter the dataset to the website of the United States National Park Service (domain: `nps`, top-level domain: `gov`) and display the columns `top_level_domain`, `domain`, and `num_subdomains` in your result.


Now that we've finished our exploration, we should close the `SparkSession` and underlying `sparkContext`. 

Analyze Common Crawl Data with PySpark

### Why Big Data with PySpark? 

This course is an introduction to the underlying concepts behind big data with a practical and hands-on approach with PySpark. Big data is everywhere, and touches data science, data engineering, and machine learning. It is becoming central to marketing, strategy, and research. This course covers the applications and implications of big data on finance, social media, health, and medicine. PySpark makes it easy to start analyzing big data, making the potential of big data accessible to anyone who knows Python. 


### Take-Away Skills 
In this course, you will learn how to handle big data with PySpark. In addition to learning how to manage the data, you will also be exposed to the conceptual underpinnings that make working with big data possible.  


Learn about how we define big data, how big data is stored and processed, and what ethical considerations we need to keep in mind.

Learn about how PySpark lets you do SQL-like queries on big data datasets.

Spark DataFrames with PySpark SQL

Combine everything you've learned so far about PySpark to work with a big data dataset!

Putting it all together

See how big data is used across different industries and learn how to work with big data using PySpark!

Introduction to Big Data with PySpark
