Python Read Maximum Value From a Dataset
Often when faced with a large amount of data, a first step is to compute summary statistics for the data in question. Perhaps the most common summary statistics are the mean and standard deviation, which allow you to summarize the "typical" values in a dataset, but other aggregates are useful as well (the sum, product, median, minimum and maximum, quantiles, etc.).
NumPy has fast built-in aggregation functions for working on arrays; we'll discuss and demonstrate some of them here.
Summing the Values in an Array
As a quick example, consider calculating the sum of all values in an array. Python itself can do this using the built-in sum function:
In [2]:
L = np.random.random(100)
sum(L)
The syntax is quite similar to that of NumPy's sum function, and the result is the same in the simplest case:

In [3]:
np.sum(L)
However, because it executes the operation in compiled code, NumPy's version of the operation is computed much more quickly:
In [4]:
big_array = np.random.rand(1000000)
%timeit sum(big_array)
%timeit np.sum(big_array)
10 loops, best of 3: 104 ms per loop
1000 loops, best of 3: 442 µs per loop
Be careful, though: the sum function and the np.sum function are not identical, which can sometimes lead to confusion! In particular, their optional arguments have different meanings, and np.sum is aware of multiple array dimensions, as we will see in the following section.
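To see one such difference concretely, note that the second positional argument of the built-in sum is a starting value added to the total, while for np.sum it is the axis to aggregate along. A minimal sketch (the particular arrays here are illustrative, not from the text):

```python
import numpy as np

x = np.arange(1, 4)  # array([1, 2, 3])

# Built-in sum: the second argument is a start value added to the total
print(sum(x, 10))      # 16

# np.sum: the second argument is the axis along which to sum
M = np.ones((3, 4))
print(np.sum(M, 0))    # one sum per column: [3. 3. 3. 3.]
```

Passing the same second argument to both functions thus produces very different results, which is why mixing them up can be a subtle source of bugs.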
Minimum and Maximum
Similarly, Python has built-in min and max functions, used to find the minimum value and maximum value of any given array:
In [5]:
min(big_array), max(big_array)
Out[5]:
NumPy's corresponding functions have similar syntax, and again operate much more quickly:
In [6]:
np.min(big_array), np.max(big_array)
Out[6]:
In [7]:
%timeit min(big_array)
%timeit np.min(big_array)
10 loops, best of 3: 82.3 ms per loop
1000 loops, best of 3: 497 µs per loop
For min, max, sum, and several other NumPy aggregates, a shorter syntax is to use methods of the array object itself:
In [8]:
print(big_array.min(), big_array.max(), big_array.sum())
1.17171281366e-06 0.999997678497 499911.628197
Whenever possible, make sure that you are using the NumPy version of these aggregates when operating on NumPy arrays!
Multi dimensional aggregates
One common type of aggregation operation is an aggregate along a row or column. Say you have some data stored in a two-dimensional array:
In [9]:
M = np.random.random((3, 4))
print(M)
[[ 0.8967576 0.03783739 0.75952519 0.06682827] [ 0.8354065 0.99196818 0.19544769 0.43447084] [ 0.66859307 0.15038721 0.37911423 0.6687194 ]]
By default, each NumPy aggregation function will return the aggregate over the entire array:

In [10]:
M.sum()
Aggregation functions take an additional argument specifying the axis along which the aggregate is computed. For example, we can find the minimum value within each column by specifying axis=0:

In [11]:
M.min(axis=0)
Out[11]:
array([ 0.66859307,  0.03783739,  0.19544769,  0.06682827])
The function returns four values, corresponding to the four columns of numbers.
Similarly, we can find the maximum value within each row:

In [12]:
M.max(axis=1)
Out[12]:
array([ 0.8967576 ,  0.99196818,  0.6687194 ])
The way the axis is specified here can be confusing to users coming from other languages. The axis keyword specifies the dimension of the array that will be collapsed, rather than the dimension that will be returned. So specifying axis=0 means that the first axis will be collapsed: for two-dimensional arrays, this means that values within each column will be aggregated.
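One way to keep this straight is to inspect the shape of the result: the collapsed axis simply disappears. A quick sketch (using a fresh random array rather than the M printed above):

```python
import numpy as np

M = np.random.random((3, 4))

print(M.min(axis=0).shape)  # (4,) -- axis 0 collapsed: one minimum per column
print(M.max(axis=1).shape)  # (3,) -- axis 1 collapsed: one maximum per row
```

The (3, 4) array loses its first dimension under axis=0 and its second under axis=1, matching the column-wise and row-wise results shown above.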
Other aggregation functions
NumPy provides many other aggregation functions, but we won't discuss them in detail here. Additionally, most aggregates have a NaN-safe counterpart that computes the result while ignoring missing values, which are marked by the special IEEE floating-point NaN value (for a fuller discussion of missing data, see Handling Missing Data). Some of these NaN-safe functions were not added until NumPy 1.8, so they will not be available in older NumPy versions.
The following table provides a listing of useful aggregation functions available in NumPy:
| Function Name | NaN-safe Version | Description |
|---|---|---|
| np.sum | np.nansum | Compute sum of elements |
| np.prod | np.nanprod | Compute product of elements |
| np.mean | np.nanmean | Compute mean of elements |
| np.std | np.nanstd | Compute standard deviation |
| np.var | np.nanvar | Compute variance |
| np.min | np.nanmin | Find minimum value |
| np.max | np.nanmax | Find maximum value |
| np.argmin | np.nanargmin | Find index of minimum value |
| np.argmax | np.nanargmax | Find index of maximum value |
| np.median | np.nanmedian | Compute median of elements |
| np.percentile | np.nanpercentile | Compute rank-based statistics of elements |
| np.any | N/A | Evaluate whether any elements are true |
| np.all | N/A | Evaluate whether all elements are true |
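As a minimal sketch of the NaN-safe behavior described above, compare np.sum and np.nansum on a small array containing a missing value:

```python
import numpy as np

x = np.array([1.0, 2.0, np.nan])

print(np.sum(x))     # nan -- the NaN propagates through the ordinary aggregate
print(np.nansum(x))  # 3.0 -- the NaN-safe version ignores the missing value
```

The same pattern holds for the other pairs in the table, such as np.mean/np.nanmean and np.min/np.nanmin.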
We will see these aggregates often throughout the rest of the book.
Example: What is the Average Height of US Presidents?
Aggregates available in NumPy can be extremely useful for summarizing a set of values. As a simple example, let's consider the heights of all US presidents. This data is available in the file president_heights.csv, which is a simple comma-separated list of labels and values:
In [13]:
!head -4 data/president_heights.csv
order,name,height(cm)
1,George Washington,189
2,John Adams,170
3,Thomas Jefferson,189
We'll use the Pandas package, which we'll explore more fully in Chapter 3, to read the file and extract this information (note that the heights are measured in centimeters).
In [14]:
import pandas as pd
data = pd.read_csv('data/president_heights.csv')
heights = np.array(data['height(cm)'])
print(heights)
[189 170 189 163 183 171 185 168 173 183 173 173 175 178 183 193 178 173 174 183 183 168 170 178 182 180 183 178 182 188 175 179 183 193 182 183 177 185 188 188 182 185]
Now that we have this data array, we can compute a variety of summary statistics:
In [15]:
print("Mean height:       ", heights.mean())
print("Standard deviation:", heights.std())
print("Minimum height:    ", heights.min())
print("Maximum height:    ", heights.max())
Mean height:        179.738095238
Standard deviation: 6.93184344275
Minimum height:     163
Maximum height:     193
Note that in each case, the aggregation operation reduced the entire array to a single summarizing value, which gives us information about the distribution of values. We may also wish to compute quantiles:
In [16]:
print("25th percentile:", np.percentile(heights, 25))
print("Median:         ", np.median(heights))
print("75th percentile:", np.percentile(heights, 75))
25th percentile: 174.25 Median: 182.0 75th percentile: 183.0
We see that the median height of US presidents is 182 cm, or just shy of six feet.
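As a quick arithmetic check on the "just shy of six feet" figure (using 2.54 cm per inch and 12 inches per foot):

```python
median_cm = 182.0
feet = median_cm / 2.54 / 12   # cm -> inches -> feet
print(round(feet, 2))          # 5.97
```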
Of course, sometimes it's more useful to see a visual representation of this data, which we can achieve using tools in Matplotlib (we'll discuss Matplotlib more fully in Chapter 4). For example, this code generates the following chart:
In [17]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn; seaborn.set()  # set plot style
In [18]:
plt.hist(heights)
plt.title('Height Distribution of US Presidents')
plt.xlabel('height (cm)')
plt.ylabel('number');
These aggregates are some of the fundamental pieces of exploratory data analysis that we'll explore in more depth in later chapters of the book.
Source: https://jakevdp.github.io/PythonDataScienceHandbook/02.04-computation-on-arrays-aggregates.html