diff --git a/_episodes/03-starting-with-data.md b/_episodes/03-starting-with-data.md
index 358e86398..1fe993bd2 100644
--- a/_episodes/03-starting-with-data.md
+++ b/_episodes/03-starting-with-data.md
@@ -47,13 +47,13 @@ directory structure, however, that is not our focus today.

We are studying ocean waves and temperature in the seas around the UK.

-For this lesson we will be using a subset of data from Centre for Environment Fisheries and Aquaculture Science (Cefas). 
-WaveNet, Cefas’ strategic wave monitoring network for the United Kingdom, provides a single source of real-time wave data from a network of wave buoys located in areas at risk from flooding. https://wavenet.cefas.co.uk/
+For this lesson we will be using a subset of data from the Centre for Environment, Fisheries and Aquaculture Science (Cefas).
+WaveNet, Cefas’ strategic wave monitoring network for the United Kingdom, provides a single source of real-time wave data from a network of wave buoys located in areas at risk from flooding. For more information, see the [Cefas WaveNet website](https://wavenet.cefas.co.uk/).

If we look out to sea, we notice that waves on the sea surface are not simple sinusoids. The surface appears to be composed of random waves of various lengths and periods. How can we describe this complex surface?
- 
+
By making some simplifications and assumptions, we fit an idealised 'spectrum' to describe all the energy held in different wave frequencies. This describes the wave energy at a point, covering the energy in small ripples (high frequency) to long period (low frequency) swell waves. This figure shows an example idealised spectrum, with the highest energy around wave periods of 11 seconds.
- 
+
![An idealised wave spectrum for a wave period of 11 seconds](../fig/wave_spectra.png)

We can go a step further, and also associate a wave direction with the amount of energy. These simplifications lead to a 2D wave spectrum at any point in the sea, with dimensions frequency and direction. Directional spreading is a measure of how wave energy for a given sea state is spread as a function of direction of propagation. For example, the wave data on the left have a small directional spread; as the waves travel, this can fan out over a wider range of directions.
@@ -62,8 +62,8 @@ We can go a step further, and also associate a wave direction with the amount of
When it is very windy or storms pass over large sea areas, surface waves grow from short choppy wind-sea waves into powerful swell waves. The height and energy of the waves is larger in winter time, when there are more storms. Wind-sea waves have short wavelengths / wave periods (like ripples) while swell waves have longer periods (at a lower frequency).

-The example file contains a obervations of sea temperatures, and waves properties at different buoys around the UK. 
-
+The example file contains observations of sea temperatures and wave properties at different buoys around the UK.
+
The dataset is stored as a `.csv` file: each row holds information for a
single wave buoy, and the columns represent:

@@ -98,7 +98,7 @@ record_id,buoy_id,Name,Date,Tz,Peak Direction,Tpeak,Wave,Height,Temperature,Spre
~~~
{: .output}

---- 
+---

## About Libraries
A library in Python contains a set of tools (called functions) that perform
@@ -312,7 +312,7 @@ Let's look at the data using these.
>
> > ## Solution
> >
-> > 1. 
+> > 1.
> >
> > ~~~
> > Index(['record_id', 'buoy_id', 'Name', 'Date', 'Tz', 'Peak Direction', 'Tpeak',
@@ -347,7 +347,7 @@ Let's look at the data using these.
> > So, `waves_df.head()` returns the first 5 rows of the `waves_df` dataframe. (Your Jupyter Notebook might show all columns). `waves_df.head(15)` returns the first 15 rows; i.e. the _default_ value (recall the functions lesson) is 5, but we can change this via an argument to the function
> >
> > 4.
-> > 
+> >
> > ~~~
> > record_id buoy_id Name ... Operations Seastate Quadrant
> > 2068 2069 16 west of Hebrides ... crew swell north
@@ -359,7 +359,7 @@ Let's look at the data using these.
> > [5 rows x 13 columns]
> > ~~~
> > {: .output}
-> > 
+> >
> > So, `waves_df.tail()` returns the final 5 rows of the dataframe. We can also control the output by adding an argument, like with `head()`
> {: .solution}
{: .challenge}
@@ -417,9 +417,9 @@ array(['SW Isles of Scilly WaveNet Site', 'Hayling Island Waverider',
> 2. What is the difference between using `len(buoy_ids)` and `waves_df['buoy_id'].nunique()`?
> in this case the result is the same, but when might the difference be important?
-> 
+>
> > ## Solution
-> > 
+> >
> > 1.
> >
> > ~~~
@@ -432,9 +432,9 @@ array(['SW Isles of Scilly WaveNet Site', 'Hayling Island Waverider',
> > [14 7 5 3 10 9 2 11 6 16]
> > ~~~
> > {: .output}
-> > 
+> >
> > 2.
-> > 
+> >
> > We could count the number of elements of the list, or we might think about using either the `len()` or `nunique()` functions, and we get 10.
> >
> > We can see the difference between `len()` and `nunique()` if we create a DataFrame with a `None` value:
@@ -445,7 +445,7 @@ array(['SW Isles of Scilly WaveNet Site', 'Hayling Island Waverider',
> > print(length_test.nunique())
> > ~~~
> > {: .language-python}
-> > 
+> >
> > We can see that `len()` returns 4, while `nunique()` returns 3 - this is because `nunique()` ignores any `Null` value
> {: .solution}
{: .challenge}
@@ -484,7 +484,7 @@ Name: Temperature, dtype: float64
> statistical methods in Pandas ignore NaN ("not a number") values. We can count the total number
> of NaNs using `waves_df["Temperature"].isna().sum()`, which returns 876. 876 + 1197 is 2073, which _is_
> the total number of rows in the DataFrame
-{: .callout} 
+{: .callout}
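+
+As a quick sketch of that arithmetic (assuming `waves_df` is loaded; `count()` returns the number of non-NaN values):
+
+~~~
+waves_df["Temperature"].isna().sum() + waves_df["Temperature"].count()  # 876 + 1197 == 2073
+~~~
+{: .language-python}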

We can also extract one specific metric if we wish:
@@ -535,7 +535,7 @@ windsea,326.0,1128.500000,188.099299,3.0,1036.25,1121.5,1273.5,1355.0,326.0,7.07

The `groupby` command is powerful in that it allows us to quickly generate summary stats.

-This example shows that the wave height associated with water described as 'swell' 
+This example shows that the wave height associated with water described as 'swell'
is much larger than the wave heights classified as 'windsea'.

> ## Challenge - Summary Data
>
> 2. What happens when you group by two columns using the following syntax and
> then calculate mean values?
> - `grouped_data2 = waves_df.groupby(['Seastate', 'Quadrant'])`
-> - `grouped_data2.mean()`
-> 3. Summarize Temperature values for swell and windsea states in your data. 
+> - `grouped_data2.mean(numeric_only=True)`
+> 3. Summarize Temperature values for swell and windsea states in your data.
>
>> ## Solution
->> 1. The most complete answer is `waves_df.groupby("Quadrant").count()["record_id"][["north", "west"]]` - note that we could use any column that has a value in every row - but given that `record_id` is our index for the dataset it makes sense to use that 
->> 2. It groups by 2nd column _within_ the results of the 1st column, and then calculates the mean (n.b. depending on your version of python, you might need `grouped_data2.mean(numeric_only=True)`)
->> 3. 
->> 
+>> 1. The most complete answer is `waves_df.groupby("Quadrant").count()["record_id"][["north", "west"]]` - note that we could use any column that has a value in every row - but given that `record_id` is our index for the dataset it makes sense to use that
+>> 2. It groups by the 2nd column _within_ the results of the 1st column, and then calculates the mean (n.b. older versions of pandas might need `grouped_data2.mean()` without the `numeric_only=True` parameter)
+>> 3.
+>>
>> ~~~
>> waves_df.groupby(['Seastate'])["Temperature"].describe()
>> ~~~
>> {: .language-python}
@@ -561,7 +561,7 @@ is much larger than the wave heights classified as 'windsea'.
>>
>> ~~~
>> count mean std min 25% 50% 75% max
->> Seastate 
+>> Seastate
>> swell 871.0 14.703502 3.626322 5.15 12.75 17.10 17.4000 18.70
>> windsea 326.0 7.981902 3.518419 5.15 5.40 5.45 12.4875 13.35
>> ~~~
>> {: .output}
@@ -602,11 +602,11 @@ waves_df.groupby('Name')['record_id'].count()['SW Isles of Scilly WaveNet Site']

## Basic Maths Functions

If we wanted to, we could perform math on an entire column of our data. For
-example let's convert all the degrees values to radians. 
+example, let's convert all the degrees values to radians.

~~~
# convert the directions from degrees to radians
-# Sometimes people use different units for directions, for example we could describe 
+# Sometimes people use different units for directions, for example we could describe
# the directions in terms of radians (where a full circle 360 degrees = 2*pi radians)
# To do this we need to use the math library which contains the constant pi
+import math

waves_df['Peak Direction'] * math.pi / 180
~~~
{: .language-python}
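+
+NumPy (which pandas builds on) also provides this conversion directly; a sketch, assuming NumPy is installed:
+
+~~~
+import numpy as np
+np.radians(waves_df['Peak Direction'])
+~~~
+{: .language-python}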
@@ -618,7 +618,7 @@ waves_df['Peak Direction'] * math.pi / 180

> ## Constants
>
-> It is normal for code to include variables that have values that should not change, for example. 
+> It is normal for code to include variables that have values that should not change, for example,
> the mathematical value of _pi_. These are called constants. The maths library contains [three
> numerical constants](https://docs.python.org/3/library/math.html#constants): _pi_, _e_, and _tau_, but
> other built-in modules also contain constants. The `os` library (which provides a portable way of using
@@ -644,7 +644,7 @@ waves_df['Peak Direction'] * math.pi / 180

> ## Challenge - normalising values
>
-> Sometimes, we need to _normalise_ values. A common way of doing this is to scale values between 0 and 1, using 
+> Sometimes, we need to _normalise_ values. A common way of doing this is to scale values between 0 and 1, using
> `y = (x - min) / (max - min)`. Using this equation, scale the Temperature column
>
>> ## Solution
@@ -673,4 +673,3 @@ calculated from our data.

[spreadsheet-lesson5]: http://www.datacarpentry.org/spreadsheet-ecology-lesson/05-exporting-data

{% include links.md %}
-
diff --git a/_episodes/04-data-types-and-format.md b/_episodes/04-data-types-and-format.md
index 5aa587912..f7ce34896 100644
--- a/_episodes/04-data-types-and-format.md
+++ b/_episodes/04-data-types-and-format.md
@@ -321,7 +321,7 @@ This is a convenient place to highlight that the `apply` method is one way to ru
the Buoy Station Names, we can write:

~~~
-waves_df["Names"].apply(len)
+waves_df["Name"].apply(len)
~~~
{: .language-python}

@@ -374,7 +374,7 @@ dates.apply(datetime.datetime.strftime, args=("%a",))
{: .language-python}

>## Watch out for tuples!
-> _Tuples_ are data structure similar to a list, but are _immutable_. They are created using parentheses, with items separated by commas:
+> _Tuples_ are a data structure similar to a list, but are _immutable_. They are created using parentheses, with items separated by commas:
> `my_tuple = (1, 2, 3)`
> However, putting parentheses around a single object does not make it a tuple! Creating a tuple of length 1 still needs a trailing comma.
> Test these: `type(("a"))` and `type(("a",))`.
diff --git a/_episodes/05-index-slice-subset.md b/_episodes/05-index-slice-subset.md
index 9f0d71ee4..01fcb3c4b 100644
--- a/_episodes/05-index-slice-subset.md
+++ b/_episodes/05-index-slice-subset.md
@@ -182,7 +182,7 @@ a = [1, 2, 3, 4, 5]
>> 3. The error is raised because the list a has no element with index 5: it has only five entries, indexed from 0 to 4.
>> 4. `a[len(a)]` also raises an IndexError. `len(a)` returns 5, making `a[len(a)]` equivalent to `a[5]`.
>> To retrieve the final element of a list, use the index -1, e.g.
->> 
+>>
>> ~~~
>> a[-1]
>> ~~~
@@ -423,7 +423,7 @@ It is worth noting that:

_but_

-- indexing a data frame directly with labels will select columns (e.g. 
+- indexing a data frame directly with labels will select columns (e.g.
`waves_df[['buoy_id', 'Name', 'Temperature']]`), while ranges of integers will select rows (e.g.
waves_df[0:13])

@@ -447,7 +447,7 @@ waves_df.iloc[1:10, 1]

the error will also occur if index labels are used without `loc` (or column labels used with it).

-A useful rule of thumb is the following: 
+A useful rule of thumb is the following:
- integer-based slicing of rows is best done with `iloc` and will avoid errors - it is generally consistent with indexing of Numpy
arrays)
- label-based slicing of rows is done with `loc`
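+
+As a quick sketch of the contrast (assuming `waves_df` is loaded; with the default integer index, `loc` includes the end label while `iloc` excludes the end position):
+
+~~~
+waves_df.iloc[0:3]   # rows at integer positions 0, 1 and 2
+waves_df.loc[0:3]    # rows with index labels 0, 1, 2 AND 3
+~~~
+{: .language-python}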
@@ -487,7 +487,7 @@ arrays)
>> [3 rows x 13 columns]
>> ~~~
>> {: .output}
->> 
+>>
>> `waves_df[0]` results in a ‘KeyError’, since direct indexing of a row is redundant this way - `iloc` should be used instead (`waves_df[0:1]` could be used to obtain only the first row using this notation)
>>
>> `waves_df[:4]` slices from the first row to the fourth:
@@ -525,7 +525,12 @@ select all rows that have a temperature less than or equal to 10 degrees

waves_df[waves_df.Temperature <= 10]
~~~

+Or, we can select all rows that have a buoy_id of 3:
+
+~~~
+waves_df[waves_df.buoy_id == 3]
+~~~
+{: .language-python}

Which produces the following output:
@@ -537,13 +542,6 @@ Which produces the following output:
~~~
{: .language-python}

-Or, we can select all rows that have a buoy_id of 3:
-
-~~~
-waves_df[waves_df.buoy_id == 3]
-~~~
-{: .language-python}
-
We can also select all rows that do not contain values for Tpeak (listed as NaN):

@@ -628,6 +626,7 @@ Experiment with selecting various subsets of the "waves" data.
> Use the `isin` function to find all records that contain buoy ids 5 and 7
> in the "waves" DataFrame. How many records contain these values?
>
+>
> 3. Experiment with other queries. e.g. Create a query that finds all rows with a
> Tpeak greater than or equal to 10.
>
@@ -637,7 +636,7 @@ Experiment with selecting various subsets of the "waves" data.
> the "waves" data.
>
>> ## Solution
->> 
+>>
>> This is possible in one-line:
>> ~~~
>> waves_df[(pd.to_datetime(waves_df.Date, format="%d/%m/%Y %H:%M").dt.year == 2023) & (waves_df["Temperature"] <= 8)]
>> ~~~
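+>>
+>> If the one-liner is hard to read, the same logic can be split up (a sketch; `years` is just an intermediate name):
+>>
+>> ~~~
+>> years = pd.to_datetime(waves_df.Date, format="%d/%m/%Y %H:%M").dt.year
+>> waves_df[(years == 2023) & (waves_df["Temperature"] <= 8)]
+>> ~~~
+>> {: .language-python}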
@@ -694,17 +693,17 @@ Experiment with selecting various subsets of the "waves" data.
>> {: .language-python}
>>
>> ~~~
->> 5
+>> 288
>> ~~~
>> {: .output}
>>
->> 
+>>
>> ~~~
>> waves_df[waves_df['Tpeak'] >= 10]
>> ~~~
>> {: .language-python}
>>
->> 
+>>
>> ~~~
>> waves_df[~waves_df['Quadrant'].isin(['south','east'])]
>> ~~~
@@ -723,7 +722,7 @@ Experiment with selecting various subsets of the "waves" data.
>> 2070 2071 16 west of Hebrides 18/10/2022 17:00 5.6 ... 34.0 crew swell north 2022
>> 2071 2072 16 west of Hebrides 18/10/2022 17:30 5.7 ... 31.0 crew swell north 2022
>> 2072 2073 16 west of Hebrides 18/10/2022 18:00 5.7 ... 34.0 crew swell north 2022
->> 
+>>
>> [1985 rows x 14 columns]
>> ~~~
>> {: .output}
diff --git a/_episodes/06-merging-data.md b/_episodes/06-merging-data.md
index b78d3b890..a250aa6ef 100644
--- a/_episodes/06-merging-data.md
+++ b/_episodes/06-merging-data.md
@@ -40,15 +40,15 @@ buoys_df = pd.read_csv("data/buoy_data.csv",

Take note that the `read_csv` method we used can take some additional options which
we didn't use previously. Many functions in Python have a set of options that
can be set by the user if needed. In this case, we have told pandas to assign
-empty values in our CSV to NaN `keep_default_na=False, na_values=[""]`. 
+empty values in our CSV to NaN using `keep_default_na=False, na_values=[""]`.
We have explicitly requested to change empty values in the CSV to NaN,
-this is however also the default behaviour of `read_csv`. 
+this is, however, also the default behaviour of `read_csv`.
[More about all of the `read_csv` options here and their defaults.](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html#pandas.read_csv)

# Concatenating DataFrames

We can use the `concat` function in pandas to append either columns or rows from
-one DataFrame to another. `waves2020_df` contains data from the year 2020, 
+one DataFrame to another. `waves2020_df` contains data from the year 2020,
and is in the same format as our `waves_df`, so we can use it to see how this works.

~~~
@@ -120,16 +120,16 @@ new_output = pd.read_csv('data/out.csv', keep_default_na=False, na_values=[""])
>> ## Solution
>> ~~~
>> # read the files
->> waves_df = pd.read_csv("waves.csv", keep_default_na=False, na_values=[""])
->> waves2020_df = pd.read_csv("waves_2020.csv", keep_default_na=False, na_values=[""])
+>> waves_df = pd.read_csv("data/waves.csv", keep_default_na=False, na_values=[""])
+>> waves2020_df = pd.read_csv("data/waves_2020.csv", keep_default_na=False, na_values=[""])
>> # concatenate
>> combined_data = pd.concat([waves_df, waves2020_df], axis=0)
>> # group by buoy_id, and output some summary statistics
>> combined_data.groupby("buoy_id").describe()
>> # write to csv
->> combined_data.to_csv("combined_wave_data.csv", index=False)
+>> combined_data.to_csv("data/combined_wave_data.csv", index=False)
>> # read in the csv
->> cwd = pd.read_csv("combined_wave_data.csv", keep_default_na=False, na_values=[""])
+>> cwd = pd.read_csv("data/combined_wave_data.csv", keep_default_na=False, na_values=[""])
>> # check the results are the same
>> cwd.groupby("buoy_id").describe()
>> ~~~
@@ -151,8 +151,8 @@

NOTE: This process of joining tables is similar to what we do with tables in an
SQL database. For example, the `buoys_data.csv` file that we've been working
with could be considered as a "lookup"
-table. This table contains the data for 15 buoys. This new table details 
-where the buoy is (Country, Site Type, latitude and longitude), as well as water 
+table. This table contains the data for 15 buoys. This new table details
+where the buoy is (Country, Site Type, latitude and longitude), as well as water
depth and information about the observing platform (Manufacturer, Type, operator). The Name and buoy_id code are unique for each line.
These buoys are identified in our waves data as well using the buoy_id (and more memorable 'Name'). Rather than adding 8 more
@@ -163,7 +163,7 @@ of information to the waves data.

Storing data in this way has many benefits including:

-1. It ensures consistency in the spelling of buoy attributes (site name, manufacturer etc.) 
+1. It ensures consistency in the spelling of buoy attributes (site name, manufacturer etc.)
given each buoy is only entered once. Imagine the possibilities for spelling errors
when copying the data thousands of times!
2. It also makes it easy for us to make changes or add information about the buoys once
@@ -210,7 +210,7 @@ Index(['record_id', 'buoy_id', 'Name', 'Date', 'Tz', 'Peak Direction', 'Tpeak',
       'Wave Height', 'Temperature', 'Spread', 'Operations', 'Seastate',
       'Quadrant'],
      dtype='object')
- 
+
~~~
{: .language-python}

@@ -259,7 +259,7 @@ both the `wave_sub` and `buoys_df` DataFrames. In other words, if a row in
column of `buoys_data`, it will not be included in the DataFrame returned by an
inner join. Similarly, if a row in `buoys_df` has a value of `buoy_id` that does
*not* appear in the `buoy_id` column of `wave_sub`, that row will not
-be included in the DataFrame returned by an inner join. In our example, there is 
+be included in the DataFrame returned by an inner join. In our example, there is
data from the `M6 Buoy`, but this buoy (id 10) does not exist in our buoy data.

The two DataFrames that we want to join are passed to the `merge` function using
@@ -271,7 +271,7 @@ DataFrame). For inner joins, the order of the `left` and `right` arguments does
not matter.

The result `merged_inner` DataFrame contains all of the columns from
-(record id, Tz, Peak Direction, Tpeak, etc.) as well as all the columns from 
+(record id, Tz, Peak Direction, Tpeak, etc.) as well as all the columns from
`buoys_df` (buoy_id, Name, Manufacturer, Depth, Type, operator, Country, Site,
Type, latitude, and longitude).
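+
+For reference, a minimal sketch of the call being described here (assuming `wave_sub` and `buoys_df` are loaded as above; the lesson's own code may differ slightly):
+
+~~~
+merged_inner = pd.merge(left=wave_sub, right=buoys_df, left_on='buoy_id', right_on='buoy_id')
+~~~
+{: .language-python}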
@@ -321,8 +321,8 @@ merged_left[ pd.isnull(merged_left.Name_y) ]

These rows are the ones where the value of `buoy_id` from `wave_sub` (in this
case, `M6 Buoy`) does not occur in `buoys_df`. Also note that where the two
-DataFrames have columns with the same name, Pandas appends `_x` to the column 
-from the "left" dataframe, and `_y` to the column from the "right" dataframe. 
+DataFrames have columns with the same name, Pandas appends `_x` to the column
+from the "left" dataframe, and `_y` to the column from the "right" dataframe.
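+
+If you prefer clearer names than `_x` and `_y`, `merge` also accepts a `suffixes` argument; a sketch using the same join as above:
+
+~~~
+pd.merge(left=wave_sub, right=buoys_df, how='left', on="buoy_id", suffixes=("_wave", "_buoy"))
+~~~
+{: .language-python}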

## Other join types

The pandas `merge` function supports two other join types:
@@ -345,10 +345,13 @@ The pandas `merge` function supports two other join types:
> `buoys_data.csv` tables. Then calculate the mean:
>
> 1. Wave Height by Site Type
-> 2. Temperature by Seastate and by Country
+> 2. Temperature by Seastate and by Country
>
>> ## Solution
>> ~~~
+>> # read the files (if you don't already have them loaded)
+>> waves_df = pd.read_csv("data/waves.csv", keep_default_na=False, na_values=[""])
+>> buoys_df = pd.read_csv("data/buoy_data.csv", keep_default_na=False, na_values=[""])
>> # Merging the data frames
>> merged_left = pd.merge(left=waves_df,right=buoys_df, how='left', on="buoy_id")
>> # Group by Site Type, and calculate mean of Wave Height
@@ -368,7 +371,7 @@ The pandas `merge` function supports two other join types:
>> {: .output}
>>
>> ~~~
->> Seastate Country 
+>> Seastate Country
>> swell England 17.324093
>> Scotland 10.935880
>> Wales 12.491667
>> ~~~
>> {: .output}
@@ -380,13 +383,13 @@ The pandas `merge` function supports two other join types:
> {: .solution}
{: .challenge}
- 
+
> ## Challenge - filter by availability
>
> 1. In the data folder, there is an `access.csv` file that contains information about the
> data availability and access rights associated with each buoy. Use that data to summarize the number of
> observations which are reusable for research.
-> 2. Again using `access.csv` file, use that data to summarize the number of data records from operational 
+> 2. Again using the `access.csv` file, use that data to summarize the number of data records from operational
> buoys which are available in Coastal versus Ocean waters.
>
>> ## Solution
>> 1.
>> ~~~
>> # Read the access file
>> access_df = pd.read_csv("data/access.csv")
>> # Merge the dataframes
->> merged_access = pd.merge(left=waves_df,right=access, how='left', on="buoy_id")
+>> merged_access = pd.merge(left=waves_df,right=access_df, how='left', on="buoy_id")
>> # find the number available for research
>> merged_access.groupby("data availability").count()
>> # or, this also gives the same answer:
@@ -405,7 +408,7 @@
>> ~~~
>> {: .language-python}
>>
>> 2.
>> ~~~
->> buoy_access = pd.merge(left=buoys_df, right=access, how="left", on="buoy_id")
+>> buoy_access = pd.merge(left=buoys_df, right=access_df, how="left", on="buoy_id")
>> buoy_access[buoy_access["data availability"]=="operational"].groupby("Site Type")["buoy_id"].count()
>> ~~~
>> {: .language-python}
diff --git a/_episodes/07-pandas-matplotlib.md b/_episodes/07-pandas-matplotlib.md
index 09fafc121..77f380cdc 100644
--- a/_episodes/07-pandas-matplotlib.md
+++ b/_episodes/07-pandas-matplotlib.md
@@ -35,8 +35,7 @@ use any data that is relevant to your research.

The file [`bouldercreek_09_2013.txt`]({{ page.root }}/data/bouldercreek_09_2013.txt)
contains stream discharge data, summarized at 15 minute intervals (in cubic feet
per second) for a streamgage on Boulder
-Creek at North 75th Street (USGS gage06730200) for 1-30 September 2013. If you'd
-like to use this dataset, please download it and put it in your data directory.
+Creek at North 75th Street (USGS gage06730200) for 1-30 September 2013. This dataset is already available in your data directory.

## Clean up your data and open it using Python and Pandas
@@ -109,8 +108,8 @@ import matplotlib.pyplot as plt

Now, let's read the data and plot it!

~~~
-waves = pd.read_csv("data/waves.csv")
-my_plot = waves.plot("Tpeak", "Wave Height", kind="scatter")
+waves_df = pd.read_csv("data/waves.csv")
+my_plot = waves_df.plot("Tpeak", "Wave Height", kind="scatter")
plt.show() # not necessary in Jupyter Notebooks
~~~
{: .language-python}
@@ -230,7 +229,7 @@ provide, offering a consistent environment to make publication-quality visualiza

~~~
fig, ax1 = plt.subplots() # prepare a matplotlib figure
-waves.plot("Tpeak", "Wave Height", kind="scatter", ax=ax1)
+waves_df.plot("Tpeak", "Wave Height", kind="scatter", ax=ax1)

# Provide further adaptations with matplotlib:
ax1.set_xlabel("Tpeak (highest energy wave periodicity; seconds)")
@@ -258,7 +257,7 @@ p9_ax = my_plt_version.axes[0] # each subplot is an item in a list
p9_ax.set_xlabel("Hindfoot length")
p9_ax.tick_params(labelsize=16, pad=8)
p9_ax.set_title('Scatter plot of weight versus hindfoot length', fontsize=15)
-plt.show() # not necessary in Jupyter Notebooks 
+plt.show() # not necessary in Jupyter Notebooks
~~~
{: .language-python}

@@ -272,6 +271,10 @@ plt.show() # not necessary in Jupyter Notebooks

What about plotting after joining DataFrames? Let's plot the water depths at each of the buoys

~~~
+# reload the buoys data just in case we don't have it loaded still
+buoys_df = pd.read_csv("data/buoy_data.csv")
+
+
# water depth in the buoys dataframe is currently a string (it's suffixed by "m") so we need to fix that
def fix_depth_string(i, depth):
    if type(depth) == str:
@@ -280,6 +283,6 @@ def fix_depth_string(i, depth):
for i, depth in enumerate(buoys_df["Depth"]):
    fix_depth_string(i, depth)

joined = pd.merge(left=waves_df, right=buoys_df, left_on='buoy_id', right_on='buoy_id')
plt.bar(joined["Name_x"].unique(), joined["Depth"].unique())
~~~
{: .language-python}
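+
+As an aside, a vectorised sketch of the same clean-up (assuming the `Depth` column was read in as strings ending in "m") avoids the explicit loop:
+
+~~~
+buoys_df["Depth"] = buoys_df["Depth"].str.strip().str.rstrip("m").astype(float)
+~~~
+{: .language-python}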
@@ -300,7 +311,7 @@ plt.bar(depths_df["names"], depths_df["depths"])
~~~
{: .language-python}

-Note that the return type of `.unique` is a Numpy ndarray, even though the column were of type Series! 
+Note that the return type of `.unique` is a Numpy ndarray, even though the columns were of type Series!

> ## Challenge - subsetting data before plotting
> Plot Tpeak vs Wave Height from the West Hebrides site. Can you add appropriate labels and a title, and
@@ -310,11 +321,11 @@ Note that the return type of `.unique` is a Numpy ndarray, even though the colum
>
> > ## Answers
> >
> > ~~~
> > fig, ax1 = plt.subplots()
-> > waves[waves["buoy_id"] == 16].plot("Tpeak", "Wave Height", kind="scatter", ax=ax1)
+> > waves_df[waves_df["buoy_id"] == 16].plot("Tpeak", "Wave Height", kind="scatter", ax=ax1)
> > ax1.set_xlabel("Highest energy wave period")
> > ax1.tick_params(labelsize=16, pad=8)
-> > ax1.set_xbound(0, waves[waves["buoy_id"] == 16].Tpeak.max()+1)
-> > ax1.set_ybound(0, waves[waves["buoy_id"] == 16]["Wave Height"].max()+1)
+> > ax1.set_xbound(0, waves_df[waves_df["buoy_id"] == 16].Tpeak.max()+1)
+> > ax1.set_ybound(0, waves_df[waves_df["buoy_id"] == 16]["Wave Height"].max()+1)
> > fig.suptitle('Scatter plot of wave height versus Tpeak for West Hebrides', fontsize=15)
> > ~~~
> > {: .language-python}
@@ -328,7 +339,7 @@ Note that the return type of `.unique` is a Numpy ndarray, even though the colum
> > ## Answers
> >
> > ~~~
-> > data = waves.groupby("buoy_id").max("Wave Height")
+> > data = waves_df.groupby("buoy_id").max("Wave Height")
> > x = data["Temperature"]
> > y = data["Wave Height"]
> > fig, plot = plt.subplots() # although we're not using the `fig` variable, subplots returns 2 objects
@@ -347,12 +358,12 @@ Note that the return type of `.unique` is a Numpy ndarray, even though the colum
> >
> > ~~~
> > fig, ax = plt.subplots()
-> > wh = waves[waves["buoy_id"] == 16]
-> > pb = waves[waves["buoy_id"] == 11]
-> > 
+> > wh = waves_df[waves_df["buoy_id"] == 16]
+> > pb = waves_df[waves_df["buoy_id"] == 11]
+> >
> > ax.scatter(wh["Tpeak"], wh["Wave Height"])
> > ax.scatter(pb["Tpeak"], pb["Wave Height"], marker="*")
-> > 
+> >
> > ax.legend(["West Hebrides", "South Pembrokeshire"], loc="best")
> > ~~~
> > {: .language-python}
diff --git a/_episodes/08-geopandas.md b/_episodes/08-geopandas.md
index 8dac5f280..b3770032d 100644
--- a/_episodes/08-geopandas.md
+++ b/_episodes/08-geopandas.md
@@ -7,18 +7,18 @@ questions:
- "How can I visualise and analyse this data?"
objectives:
- "Import the Geopandas module to analyse latitude / longitude data."
-  - "Use Geopandas and Geoplot to help with visualisation."
+  - "Use Geopandas to help with visualisation."
keypoints:
- "Geopandas is the key module to help deal with geospatial data."
-  - "Using Geopandas and Geoplot we can create publication / web-ready maps."
+  - "Using Geopandas we can create publication / web-ready maps."
---

## Geospatial Data

-Often in the Environmental Sciences, we need to deal with geospatial data. 
+Often in the Environmental Sciences, we need to deal with geospatial data.

This is normally presented as latitude and longitude (either as decimal degrees or
-as degrees/minutes/seconds), but can be presented in other formats (e.g. OSGB for UK 
+as degrees/minutes/seconds), but can be presented in other formats (e.g. OSGB for UK
Grid References).

A full discussion of geospatial vector data is beyond the scope of this
@@ -48,7 +48,6 @@ we need directly within a Notebook:

~~~
conda install geopandas -c conda-forge
-conda install geoplot -c conda-forge
~~~
{: .language-python}
These might take several minutes to run. Once they've been installed, we can import the packages we need:

~~~
import geopandas as gpd
-import geoplot as gplt
~~~
{: .language-python}

> ## Conda environments
-> We're now at the stage where you might find it useful to have different python _environments_ for specific 
+> We're now at the stage where you might find it useful to have different python _environments_ for specific
> tasks. When you open Anaconda Navigator, it will, by default, be running in your `base` environment.
> However, you can create new environments via the Environments tab in the left-hand menu. Each environment
-> can have different packages (or different versions of packages), different versions of python, etc - and 
+> can have different packages (or different versions of packages), different versions of python, etc - and
> different packages can be installed via the Environments tab. However, note that individual Notebooks are _not_
-> associated with specific environments - they are associated with the current _active_ environment. A full 
+> associated with specific environments - they are associated with the current _active_ environment. A full
> introduction to Conda environments can be found at [https://carpentries-incubator.github.io/introduction-to-conda-for-data-scientists/](https://carpentries-incubator.github.io/introduction-to-conda-for-data-scientists/)
{: .callout}
@@ -95,7 +93,7 @@ buoys_geo = gpd.GeoDataFrame(
{: .language-python}

The value we've given to the `crs` argument specifies that the data is latitude and longitude, rather
-than any other coordinate system. We can now see that the `buoys_geo` DataFrame contains a new column, `geometry`, 
+than any other coordinate system. We can now see that the `buoys_geo` DataFrame contains a new column, `geometry`,
which also has type `geometry`.

So, what can we do with this data type?
@@ -140,7 +138,7 @@ scotland.plot()
~~~
{: .language-python}

-We can see it looks like Scotland! We can look at the `shape` of the DataFrame to see that it has 32 rows - this is the number of Local Authorities in Scotland, and 5 columns. 
+We can see it looks like Scotland! We can look at the `shape` of the DataFrame to see that it has 32 rows (the number of Local Authorities in Scotland) and 5 columns.

We can find the "centroid" point of each Polygon - we can even plot this if we want an abstract map of Scotland!
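+
+As a minimal sketch (assuming the `scotland` GeoDataFrame from above; `centroid` returns a GeoSeries of points, which can be plotted directly):
+
+~~~
+scotland.centroid.plot()
+~~~
+{: .language-python}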
@@ -163,7 +161,7 @@ National Parks in Scotland, and we can plot it

~~~
# Notice this is a different file format to the geojson file we used for the Scottish Council Boundaries data
-# This is one file which makes up the Shapefile data format. At a minimum, there needs to be corresponding `shx` and `dbf` files (with the same filenames) in the same directory, but `prj`, `sbx`, `sbn`, and `shp.xml` can store additional metadata 
+# This is one of the files which make up the Shapefile data format. At a minimum, there need to be corresponding `shx` and `dbf` files (with the same filenames) in the same directory, but `prj`, `sbx`, `sbn`, and `shp.xml` can store additional metadata
cairngorms = gpd.read_file("data/cairngorms_boundary.shp")
cairngorms.plot()
~~~
@@ -196,7 +194,7 @@ scotland.overlaps(cairngorms.iloc[0].geometry)

> ## Challenge: overlaps
> 1. Subset the Scotland dataset to show only the rows which overlap with the Cairngorms. Can you display only the names?
> 2. Look in the Geopandas documentation (https://geopandas.org/en/stable/index.html) for
-> the `disjoint` method. What do you think it will return when you run it in the way that we 
+> the `disjoint` method. What do you think it will return when you run it in the way that we
> ran `overlap`? Try it - did you get the expected result? Can you plot this?
>
>> ## Solution
@@ -204,8 +202,10 @@ scotland.overlaps(cairngorms.iloc[0].geometry)
>> overlaps = scotland.overlaps(cairngorms.iloc[0].geometry)
>> # get a Series of only the overlaps
>> overlaps = overlaps.where(overlaps == True).dropna().index
->> OR, more concisely
->> overlaps = overlaps.index[overlaps]
+>>
+>> # OR, more concisely
+>> # overlaps = overlaps.index[overlaps]
+>>
>> # use this to subset the Scotland dataframe
>> scotland.loc[overlaps]
>> # ...and get the names
@@ -252,27 +252,39 @@ We can even display the Cairngorms data directly over the Scotland plot, which v

~~~
scotland_plot = scotland.explore()
-cairngorms.explore(map=scotland_plot, style_kwds={"fillColor":"lime"})
+cairngorms.explore(m=scotland_plot, style_kwds={"fillColor":"lime"})
~~~
{: .language-python}

-## Back to our buoy data 
+## Back to our buoy data

-Our buoy data is based around the UK. Geopandas includes some very low resolution maps which we can use to plot our geospatial data on
+Our buoy data is based around the UK. We can use Cartopy's low resolution maps to plot our geospatial data on:

~~~
-world = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))
-ax = world.clip([-15, 50, 9, 61]).plot(color="white", edgecolor="black")
-buoys_geo.plot(ax=ax, color="blue")
+# Import matplotlib and cartopy
+import matplotlib.pyplot as plt
+import cartopy.crs as ccrs
+import cartopy.feature as cfeature
+
+# Create the figure and axis with a PlateCarree projection
+fig, ax = plt.subplots(figsize=(10, 6), subplot_kw={'projection': ccrs.PlateCarree()})
+
+# Add features to the map
+ax.add_feature(cfeature.LAND)
+ax.add_feature(cfeature.OCEAN)
+ax.add_feature(cfeature.BORDERS, linestyle=':')
+ax.coastlines()
+ax.set_extent([-15, 9, 50, 61], crs=ccrs.PlateCarree())
+
+# Plot the buoy points - the `.plot()` method can take an `ax` with a projection
+buoys_geo.plot(ax=ax, color='blue')
~~~
{: .language-python}

-We use the `clip()` function to limit the bounds of the map to the most useful area for our needs.
+We use cartopy to create a map with a PlateCarree projection, which is suitable for plotting latitude and longitude data.

What about if we want a higher quality map? There are several ways of achieving this. We've already seen
-that the `explore()` function gives us a way of generating an interactive map, but we can also use the Geoplot package,
-or Geopandas directly, to create maps. We'll just use Geopandas, but Geoplot can give some more fine-grained control if you
-require it.
+that the `explore()` function gives us a way of generating an interactive map, but we can also use Geopandas directly to create maps. First, we need to load a basemap to plot the buoy points onto.
@@ -287,7 +299,7 @@ north_atlantic = gpd.read_file("data/north_atlantic.geojson") > European data is the EU ([https://ec.europa.eu/eurostat/web/gisco/geodata/reference-data/administrative-units-statistical-units/nuts#nuts21](https://ec.europa.eu/eurostat/web/gisco/geodata/reference-data/administrative-units-statistical-units/nuts#nuts21)), > while the Cairngorms data we looked at earlier came from the UK Government geospatial data catalogue > ([https://www.data.gov.uk/dataset/8a00dbd7-e8f2-40e0-bcba-da2067d1e386/cairngorms-national-park-designated-boundary](https://www.data.gov.uk/dataset/8a00dbd7-e8f2-40e0-bcba-da2067d1e386/cairngorms-national-park-designated-boundary)), and the -> Scottish data came from the Scottish Government ([https://data.spatialhub.scot/dataset/local_authority_boundaries-is/resource/d24c5735-0f1c-4819-a6bd-dbfeb93bd8e4](https://data.spatialhub.scot/dataset/local_authority_boundaries-is/resource/d24c5735-0f1c-4819-a6bd-dbfeb93bd8e4)) +> Scottish data came from the Scottish Government ([https://data.spatialhub.scot/dataset/local_authority_boundaries-is/resource/d24c5735-0f1c-4819-a6bd-dbfeb93bd8e4](https://data.spatialhub.scot/dataset/local_authority_boundaries-is/resource/d24c5735-0f1c-4819-a6bd-dbfeb93bd8e4)) {: .callout} We can then plot the location of the buoys, and save the figure as we saw earlier. Although we could use the same technique as in the previous example (where we set the map as the axis and plotted the buoy positions on this object), here we're showing we can also use Matplotlib subplots. This will allow us more control over the subsequent plot. However, subplots aren't suppoorted directly via Pandas or Geopandas, so we now need to import Matplotlib @@ -309,17 +321,17 @@ plt.savefig("b.png") >> >> ~~~ >> import matplotlib.pyplot as plt ->> +>> >> fig, ax = plt.subplots() >> north_atlantic.plot(ax=ax) >> buoys_geo.plot(ax=ax, color="red") ->> +>> >> for buoy in buoys_geo.iterfeatures(): >> ax.annotate(buoy["properties"]["Name"], xy=(buoy["properties"]["longitude"], buoy["properties"]["latitude"])) >> ~~~ >> {: .language-python} >> ->> The text is a little cramped! The next challenge will help fix this +>> The text is a little cramped! The next challenge will help fix this > {: .solution} {: .challenge} @@ -330,15 +342,15 @@ plt.savefig("b.png") > - if you have time, investigate how you might customise the plot > >> ## Solution ->> +>> >> ~~~ ->> # the overlap function won't work, because it works on a 1-to-1 row-wise basis, whereas we ->> want to find all the points which overlap with any of the areas +>> # the overlap function won't work, because it works on a 1-to-1 row-wise basis, whereas we +>> # want to find all the points which overlap with any of the areas >> buoy_areas = north_atlantic.geometry.apply(lambda x: buoys_geo.within(x).any()) >> north_atlantic[buoy_areas] >> >> # We can see that one of the areas is the "North Atlantic Ocean" - so this won't help fix the extent of the map! 
->> # We can use a different way 
+>> # We can use a different way to set the bounds
>>
>> bounds = buoys_geo.total_bounds # The output of total_bounds is an array of minx,miny,maxx,maxy
>>
@@ -346,9 +358,9 @@ plt.savefig("b.png")
>> ax.set_ylim([bounds[1]-0.5,bounds[3]+0.5])
>> ax.set_xlim([bounds[0]-0.5,bounds[2]+0.5])
>> north_atlantic.plot(ax=ax)
->> 
+>>
>> buoys_geo.plot(ax=ax, color="red")
->> 
+>>
>> for buoy in buoys_geo.iterfeatures():
>>     ax.annotate(buoy["properties"]["Name"], xy=(buoy["properties"]["longitude"], buoy["properties"]["latitude"]))
>> ~~~
>> {: .language-python}
@@ -370,10 +382,10 @@ plt.savefig("b.png")
>> buoys_geo.plot(ax=ax, color="red")
>>
>> axis_labels = []
->> 
+>>
>> for buoy in buoys_geo.iterfeatures():
>>     ax.annotate(int(buoy["id"])+1, xy=(buoy["properties"]["longitude"], buoy["properties"]["latitude"]))
->>     axis_labels.append(f"{int(buoy['id'])+1}: {buoy['properties']['Name']}") 
+>>     axis_labels.append(f"{int(buoy['id'])+1}: {buoy['properties']['Name']}")
>>
>> labels = AnchoredText("\n".join(axis_labels), loc='lower left', prop=dict(size=8), frameon=True,
@@ -381,7 +393,7 @@ plt.savefig("b.png")
>> bbox_transform=ax.transAxes
>> )
>> labels.patch.set_boxstyle("round,pad=0.,rounding_size=0.2")
->> ax.add_artist(labels) 
+>> ax.add_artist(labels)
>> fig.tight_layout()
>> ~~~
>> {: .language-python}