I’m only aware of the describe() function. Are there any other functions similar to str(), summary(), and head()?
Answers:
Method 1
In pandas, the info() method produces output very similar to R’s str():
> str(train)
'data.frame': 891 obs. of 13 variables:
 $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
 $ Survived   : int 0 1 1 1 0 0 0 0 1 1 ...
 $ Pclass     : int 3 1 3 1 3 3 1 3 3 2 ...
 $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
 $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
 $ Age        : num 22 38 26 35 35 NA 54 2 27 14 ...
 $ SibSp      : int 1 1 0 1 0 0 0 3 0 1 ...
 $ Parch      : int 0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
 $ Fare       : num 7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin      : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
 $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
 $ Child      : num 0 0 0 0 0 NA 0 1 0 1 ...

train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
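As a small self-contained illustration of info(), the sketch below builds a toy frame (the column names and values are invented for this example, they are not the Titanic data) and captures the report as a string through the buf parameter, which can be handy for logging.

```python
import io

import pandas as pd

# Toy data invented for this sketch; "age" has one missing value.
toy = pd.DataFrame(
    {
        "name": ["Ann", "Bob", "Cy"],
        "age": [22.0, None, 35.0],
        "survived": [0, 1, 1],
    }
)

# info() prints to stdout by default; buf redirects the report to a buffer.
buffer = io.StringIO()
toy.info(buf=buffer)
report = buffer.getvalue()
print(report)
```

The exact layout of the report differs between pandas versions, so it is safer to read it than to parse it.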
Method 2
This provides output similar to R’s str(). It presents unique values instead of initial values.
def rstr(df): return df.shape, df.apply(lambda x: [x.unique()])

print(rstr(iris))
((150, 5),
sepal_length    [[5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.4, 4.8, 4.3,...
sepal_width     [[3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 2.9, 3.7,...
petal_length    [[1.4, 1.3, 1.5, 1.7, 1.6, 1.1, 1.2, 1.0, 1.9,...
petal_width     [[0.2, 0.4, 0.3, 0.1, 0.5, 0.6, 1.4, 1.5, 1.3,...
class            [[Iris-setosa, Iris-versicolor, Iris-virginica]]
dtype: object)
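A possible variation on the same idea, sketched here under my own assumptions rather than taken from the answer: return a small table with each column's dtype, its number of unique values, and the first few unique values. The names rstr_table and max_values are hypothetical.

```python
import pandas as pd


def rstr_table(df, max_values=5):
    """Return (shape, table), where the table has one row per column of df."""
    table = pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "n_unique": df.nunique(dropna=False),
        }
    )
    # Keep only the first few unique values so wide columns stay readable.
    table["unique_values"] = [
        list(df[col].unique()[:max_values]) for col in df.columns
    ]
    return df.shape, table


# Toy data invented for this sketch.
demo = pd.DataFrame({"species": ["a", "b", "a"], "size": [1.0, 2.0, 2.0]})
shape, table = rstr_table(demo)
print(shape)
print(table)
```

This sketch assumes the column names are unique; with duplicated names, df[col] would return a DataFrame instead of a single column.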
Method 3
summary() ~ describe()
head() ~ head()
I’m not sure about the str() equivalent.
Method 4
The pandas documentation offers an extensive "Comparison with R / R libraries" page. The most obvious difference is that R favors a functional style, while pandas is object-oriented, with the DataFrame as the key object. Another difference between R and Python is that Python indexes from 0, whereas R indexes from 1.
R               | Pandas
-------------------------------
summary(df)     | df.describe()
head(df)        | df.head()
dim(df)         | df.shape
slice(df, 1:10) | df.iloc[:10]
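To make the indexing difference concrete, here is a minimal sketch of these equivalents on a toy frame (the data is invented for this example). The point to watch is that R's 1:10 is 1-based and inclusive, while pandas slices are 0-based and exclude the end position.

```python
import pandas as pd

# Toy frame with 25 rows, invented for this sketch.
df = pd.DataFrame({"x": range(25), "y": [v * 0.5 for v in range(25)]})

summary = df.describe()   # R: summary(df)
first_rows = df.head()    # R: head(df); pandas shows 5 rows by default, R shows 6
dims = df.shape           # R: dim(df)

# R's 1:10 selects ten rows (1 through 10).
# The pandas counterpart is iloc[:10], which covers positions 0 through 9.
first_ten = df.iloc[:10]  # R (dplyr): slice(df, 1:10)

print(dims, len(first_rows), len(first_ten))
```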
Method 5
For a Python equivalent to the str() function in R, I use the dtypes attribute. It gives the data type of each column.
In [22]: df2.dtypes
Out[22]:
Survived      int64
Pclass        int64
Sex          object
Age         float64
SibSp         int64
Parch         int64
Ticket       object
Fare        float64
Cabin        object
Embarked     object
dtype: object
Method 6
I still prefer str() because it lists some example values. A confusing aspect of info() is that its behavior depends on display options such as pandas.options.display.max_info_columns.
I think the best alternative is to call info() with parameters that force a fixed behavior (null_counts was renamed to show_counts in pandas 1.2 and removed in pandas 2.0):
df.info(show_counts=True, verbose=True)
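The dependence on display options can be reproduced with a short sketch. The toy frame and the info_text helper are my own inventions for this example; it assumes pandas 1.2 or later, where the counts argument is called show_counts (older versions call it null_counts).

```python
import io

import pandas as pd

# Toy data invented for this sketch.
toy = pd.DataFrame({"a": [1, 2, 3], "b": [1.0, None, 3.0], "c": ["x", "y", "z"]})


def info_text(df, **kwargs):
    """Capture the info() report as a string (hypothetical helper)."""
    buffer = io.StringIO()
    df.info(buf=buffer, **kwargs)
    return buffer.getvalue()


# With a column limit below the number of columns, plain info() falls back
# to a truncated summary without per-column details.
with pd.option_context("display.max_info_columns", 2):
    short_text = info_text(toy)
    # Explicit arguments override the display option.
    full_text = info_text(toy, verbose=True, show_counts=True)

print(short_text)
print(full_text)
```

Passing both arguments matters: verbose controls whether each column is listed, and show_counts controls whether the non-null counts appear.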
And for your other functions:
summary(df) | df.describe()
head(df)    | df.head()
dim(df)     | df.shape
Method 7
I don’t know much about R, but here are some leads:
str => a difficult one. For functions you can use dir(), but dir() on a dataset lists all of its attributes and methods, so that may not be what you want.
summary => describe(). See its parameters to customize the results.
head => you can use head(), as you already do, or use slices. To get the first 10 rows of a dataset called ds, use ds[:10]; for the last 10 rows (the tail), use ds[-10:].
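The slicing approach can be checked with a quick sketch (ds here is toy data invented for this example). Note that the last ten rows are ds[-10:], whereas ds[:-10] returns everything except the last ten.

```python
import pandas as pd

ds = pd.DataFrame({"n": range(30)})  # toy data invented for this sketch

first_ten = ds[:10]          # same rows as ds.head(10)
last_ten = ds[-10:]          # same rows as ds.tail(10)
all_but_last_ten = ds[:-10]  # drops the final ten rows; this is not a tail

print(len(first_ten), len(last_ten), len(all_but_last_ten))
```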
Method 8
I don’t think there is a direct equivalent to the str() function (or glimpse() from dplyr) in Pandas that gives the same information. I think an equivalent function would have to display the following:
- The number of rows and columns in the data frame
- The names of all the columns
- The data type stored in each column
- A quick look at the first few values in each column
Building on @jjurach’s answer, I wrote a helper function that works as a stand-in for the R str or glimpse function to quickly get an overview of my DataFrames. Here’s the code with an example:
import pandas as pd
import random

# an example dataframe to test the helper function
example_df = pd.DataFrame({
    "var_a": [random.choice(["foo","bar"]) for i in range(20)],
    "var_b": [random.randint(0, 1) for i in range(20)],
    "var_c": [random.random() for i in range(20)]
})

# helper function for viewing pandas dataframes
def glimpse_pd(df, max_width=76):
    # find the max string lengths of the column names and dtypes for formatting
    _max_len = max([len(col) for col in df])
    _max_dtype_label_len = max([len(str(df[col].dtype)) for col in df])

    # print the dimensions of the dataframe
    print(f"{type(df)}: {df.shape[0]} rows of {df.shape[1]} columns")

    # print the name, dtype and first few values of each column
    for _column in df:
        _col_vals = df[_column].head(max_width).to_list()
        _col_type = str(df[_column].dtype)

        output_col = f"{_column}:".ljust(_max_len+1, ' ')
        output_dtype = f" {_col_type}".ljust(_max_dtype_label_len+3, ' ')
        output_combined = f"{output_col} {output_dtype} {_col_vals}"

        # trim the output if too long
        if len(output_combined) > max_width:
            output_combined = output_combined[0:(max_width-4)] + " ..."

        print(output_combined)
Running the function returns the following output:
glimpse_pd(example_df)
<class 'pandas.core.frame.DataFrame'>: 20 rows of 3 columns
var_a:  object    ['foo', 'bar', 'foo', 'foo', 'bar', 'bar', 'foo', 'bar ...
var_b:  int64     [0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, ...
var_c:  float64   [0.7346545694885085, 0.7776711488732364, 0.49558114902 ...
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.