Python Read Maximum Value From a Dataset

Often when faced with a large amount of data, a first step is to compute summary statistics for the data in question. Perhaps the most common summary statistics are the mean and standard deviation, which allow you to summarize the "typical" values in a dataset, but other aggregates are useful as well (the sum, product, median, minimum and maximum, quantiles, etc.).

NumPy has fast built-in aggregation functions for working on arrays; we'll discuss and demonstrate some of them here.

Summing the Values in an Array

As a quick example, consider calculating the sum of all values in an array. Python itself can do this using the built-in sum function:

In [2]:

    import numpy as np
    L = np.random.random(100)
    sum(L)

The syntax is quite similar to that of NumPy's sum function, and the result is the same in the simplest case:
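The comparison described here can be sketched as follows. This is a minimal illustration rather than the original notebook cell; the seeded generator and the name `values` are assumptions chosen so the run is reproducible.

```python
import numpy as np

# Assumption: a seeded generator so the example gives the same numbers each run.
rng = np.random.default_rng(0)
values = rng.random(100)

builtin_total = sum(values)     # Python's built-in sum, iterating element by element
numpy_total = np.sum(values)    # NumPy's compiled aggregate

# The two totals agree up to floating-point rounding.
print(np.isclose(builtin_total, numpy_total))
```

The two functions may add the elements in a different order, so comparing with `np.isclose` is safer than testing for exact equality.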

However, because it executes the operation in compiled code, NumPy's version of the operation is computed much more quickly:

In [4]:

    big_array = np.random.rand(1000000)
    %timeit sum(big_array)
    %timeit np.sum(big_array)
10 loops, best of 3: 104 ms per loop
1000 loops, best of 3: 442 µs per loop

Be careful, though: the sum function and the np.sum function are not identical, which can sometimes lead to confusion! In particular, their optional arguments have different meanings, and np.sum is aware of multiple array dimensions, as we will see in the following section.
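A small sketch of how the optional arguments differ: the second positional argument of the built-in sum is a start value, while for np.sum it is the axis to aggregate over. The array `a` below is an assumption made up for illustration.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]

# Built-in sum iterates over the first axis, so it adds the two rows together.
rows_added = sum(a)              # array([3, 5, 7])

# Its second argument is a *start* value that is added to the total.
with_start = sum(a, 1)           # array([4, 6, 8])

# np.sum aggregates over every element by default...
total = np.sum(a)                # 15

# ...and its second argument is the *axis* to collapse.
per_row = np.sum(a, 1)           # array([ 3, 12])

print(rows_added, with_start, total, per_row)
```

The same call shape, `sum(a, 1)` versus `np.sum(a, 1)`, therefore returns arrays of different lengths and with different meanings.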

Minimum and Maximum

Similarly, Python has built-in min and max functions, used to find the minimum value and maximum value of any given array:

In [5]:

    min(big_array), max(big_array)

Out[5]:

(1.1717128136634614e-06, 0.9999976784968716)

NumPy's corresponding functions have similar syntax, and again operate much more quickly:

In [6]:

    np.min(big_array), np.max(big_array)

Out[6]:

(1.1717128136634614e-06, 0.9999976784968716)

In [7]:

    %timeit min(big_array)
    %timeit np.min(big_array)
10 loops, best of 3: 82.3 ms per loop
1000 loops, best of 3: 497 µs per loop
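The %timeit command is an IPython magic and is not available in a plain Python script. A rough equivalent using the standard-library timeit module might look like the sketch below; the array size, the repetition count, and the name `big_sample` are assumptions chosen to keep the run short, and the measured times will differ from machine to machine.

```python
import timeit
import numpy as np

# Assumption: a smaller array than in the text so the timing finishes quickly.
rng = np.random.default_rng(0)
big_sample = rng.random(100_000)

# Total time in seconds for five repetitions of each call.
builtin_seconds = timeit.timeit(lambda: min(big_sample), number=5)
numpy_seconds = timeit.timeit(lambda: np.min(big_sample), number=5)

print(f"built-in min: {builtin_seconds:.4f} s")
print(f"np.min:       {numpy_seconds:.4f} s")
```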

For min, max, sum, and several other NumPy aggregates, a shorter syntax is to use methods of the array object itself:

In [8]:

    print(big_array.min(), big_array.max(), big_array.sum())
1.17171281366e-06 0.999997678497 499911.628197

Whenever possible, make sure that you are using the NumPy version of these aggregates when operating on NumPy arrays!

Multi dimensional aggregates

One common type of aggregation operation is an aggregate along a row or column. Say you have some data stored in a two-dimensional array:

In [9]:

    M = np.random.random((3, 4))
    print(M)
[[ 0.8967576   0.03783739  0.75952519  0.06682827]
 [ 0.8354065   0.99196818  0.19544769  0.43447084]
 [ 0.66859307  0.15038721  0.37911423  0.6687194 ]]

By default, each NumPy aggregation function will return the aggregate over the entire array:
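The original cell for this step is not shown, so here is a minimal sketch of the default behaviour. The array `grid` is an assumption: a small fixed array whose aggregates can be checked by hand.

```python
import numpy as np

# Assumption: a small fixed 3x4 array instead of random data.
grid = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8],
                 [9, 10, 11, 12]])

# With no axis argument, every element takes part in the aggregate.
grid_sum = grid.sum()   # 78
grid_min = grid.min()   # 1
grid_max = grid.max()   # 12

print(grid_sum, grid_min, grid_max)
```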

Aggregation functions take an additional argument specifying the axis along which the aggregate is computed. For example, we can find the minimum value within each column by specifying axis=0:

In [11]:

    M.min(axis=0)

Out[11]:

array([ 0.66859307,  0.03783739,  0.19544769,  0.06682827])

The function returns four values, corresponding to the four columns of numbers.

Similarly, we can find the maximum value within each row:

In [12]:

    M.max(axis=1)

Out[12]:

array([ 0.8967576 ,  0.99196818,  0.6687194 ])

The way the axis is specified here can be confusing to users coming from other languages. The axis keyword specifies the dimension of the array that will be collapsed, rather than the dimension that will be returned. So specifying axis=0 means that the first axis will be collapsed: for two-dimensional arrays, this means that values within each column will be aggregated.
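One way to see which axis is collapsed is to compare array shapes before and after the aggregate. The sketch below uses a small fixed array named `grid`, which is an assumption made for illustration.

```python
import numpy as np

grid = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8],
                 [9, 10, 11, 12]])   # shape (3, 4): 3 rows, 4 columns

# axis=0 collapses the row axis, leaving one value per column.
col_min = grid.min(axis=0)   # array([1, 2, 3, 4]), shape (4,)

# axis=1 collapses the column axis, leaving one value per row.
row_max = grid.max(axis=1)   # array([ 4,  8, 12]), shape (3,)

print(grid.shape, col_min.shape, row_max.shape)
```

In both cases the axis named in the call is the one that disappears from the shape of the result.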

Other aggregation functions

NumPy provides many other aggregation functions, but we won't discuss them in detail here. Additionally, most aggregates have a NaN-safe counterpart that computes the result while ignoring missing values, which are marked by the special IEEE floating-point NaN value (for a fuller discussion of missing data, see Handling Missing Data). Some of these NaN-safe functions were not added until NumPy 1.8, so they will not be available in older NumPy versions.
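As a brief sketch of the difference between a regular aggregate and its NaN-safe counterpart, consider an array with one missing value. The array `readings` is an assumption invented for this example.

```python
import numpy as np

# Assumption: a tiny array with one missing value marked as NaN.
readings = np.array([1.0, 2.0, np.nan, 4.0])

plain_mean = np.mean(readings)     # NaN propagates through the regular aggregate
safe_mean = np.nanmean(readings)   # NaN is ignored: (1 + 2 + 4) / 3
safe_max = np.nanmax(readings)     # 4.0

print(plain_mean, safe_mean, safe_max)
```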

The following table provides a listing of useful aggregation functions available in NumPy:

Function Name    NaN-safe Version    Description
np.sum           np.nansum           Compute sum of elements
np.prod          np.nanprod          Compute product of elements
np.mean          np.nanmean          Compute mean of elements
np.std           np.nanstd           Compute standard deviation
np.var           np.nanvar           Compute variance
np.min           np.nanmin           Find minimum value
np.max           np.nanmax           Find maximum value
np.argmin        np.nanargmin        Find index of minimum value
np.argmax        np.nanargmax        Find index of maximum value
np.median        np.nanmedian        Compute median of elements
np.percentile    np.nanpercentile    Compute rank-based statistics of elements
np.any           N/A                 Evaluate whether any elements are true
np.all           N/A                 Evaluate whether all elements are true
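A few of the functions in the table return something other than a summary of the values themselves: np.argmin and np.argmax return positions, and np.any and np.all return Booleans. A minimal sketch, using a made-up array named `scores` (an assumption for illustration):

```python
import numpy as np

scores = np.array([3, 9, 1, 7])

lowest_at = np.argmin(scores)        # 2: index of the smallest value (1)
highest_at = np.argmax(scores)       # 1: index of the largest value (9)
any_large = np.any(scores > 8)       # True: at least one element exceeds 8
all_positive = np.all(scores > 0)    # True: every element is positive

print(lowest_at, highest_at, any_large, all_positive)
```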

We will see these aggregates often throughout the rest of the book.

Example: What is the Average Height of US Presidents?

Aggregates available in NumPy can be extremely useful for summarizing a set of values. As a simple example, let's consider the heights of all US presidents. This data is available in the file president_heights.csv, which is a simple comma-separated list of labels and values:

In [13]:

    !head -4 data/president_heights.csv
order,name,height(cm)
1,George Washington,189
2,John Adams,170
3,Thomas Jefferson,189

We'll use the Pandas package, which we'll explore more fully in Chapter 3, to read the file and extract this information (note that the heights are measured in centimeters).

In [14]:

    import pandas as pd
    data = pd.read_csv('data/president_heights.csv')
    heights = np.array(data['height(cm)'])
    print(heights)
[189 170 189 163 183 171 185 168 173 183 173 173 175 178 183 193 178 173
 174 183 183 168 170 178 182 180 183 178 182 188 175 179 183 193 182 183
 177 185 188 188 182 185]

Now that we have this data array, we can compute a variety of summary statistics:

In [15]:

    print("Mean height:       ", heights.mean())
    print("Standard deviation:", heights.std())
    print("Minimum height:    ", heights.min())
    print("Maximum height:    ", heights.max())
Mean height:        179.738095238
Standard deviation: 6.93184344275
Minimum height:     163
Maximum height:     193

Note that in each case, the aggregation operation reduced the entire array to a single summarizing value, which gives us information about the distribution of values. We may also wish to compute quantiles:

In [16]:

    print("25th percentile:   ", np.percentile(heights, 25))
    print("Median:            ", np.median(heights))
    print("75th percentile:   ", np.percentile(heights, 75))
25th percentile:    174.25
Median:             182.0
75th percentile:    183.0

We see that the median height of US presidents is 182 cm, or just shy of six feet.
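For readers without the CSV file, the figures above can be checked with the sketch below, which types in the heights directly from the printed array. The conversion factor of 30.48 cm per foot and the variable names are the only additions.

```python
import numpy as np

# Heights in centimeters, copied from the array printed earlier in this example.
heights = np.array([
    189, 170, 189, 163, 183, 171, 185, 168, 173, 183, 173, 173, 175, 178,
    183, 193, 178, 173, 174, 183, 183, 168, 170, 178, 182, 180, 183, 178,
    182, 188, 175, 179, 183, 193, 182, 183, 177, 185, 188, 188, 182, 185,
])

median_cm = np.median(heights)
median_ft = median_cm / 30.48   # 1 foot = 30.48 cm

print("Median height (cm):", median_cm)
print("Median height (ft):", round(median_ft, 2))
```

Dividing 182 cm by 30.48 gives a little under 6, which is where "just shy of six feet" comes from.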

Of course, sometimes it's more useful to see a visual representation of this data, which we can achieve using tools in Matplotlib (we'll discuss Matplotlib more fully in Chapter 4). For example, this code generates the following chart:

In [17]:

    %matplotlib inline
    import matplotlib.pyplot as plt
    import seaborn; seaborn.set()  # set plot style

In [18]:

    plt.hist(heights)
    plt.title('Height Distribution of US Presidents')
    plt.xlabel('height (cm)')
    plt.ylabel('number');

These aggregates are some of the fundamental pieces of exploratory data analysis that we'll explore in more depth in later chapters of the book.


Source: https://jakevdp.github.io/PythonDataScienceHandbook/02.04-computation-on-arrays-aggregates.html
