pythcat package

Submodules

pythcat.pythcat module

pythcat.pythcat.misscat(df, threshold)

Drops rows whose proportion of missing values exceeds a threshold.

Parameters:

df (pandas.core.frame.DataFrame) – The input dataframe.
threshold (float) – The minimum proportion of missing values in a row required to drop that row.

Returns:	The dataframe after rows with too many missing values have been dropped
Return type:	pandas.core.frame.DataFrame

Examples

>>> import pandas as pd
>>> from pythcat import pythcat
>>> data = pd.DataFrame(data = {"X": [1, None, 2],
...                             "Y": [2, None, None], "Z": [1, 2, None]})
>>> pythcat.misscat(data, threshold = 0.3)
    X    Y    Z
    1.0  2.0  1.0
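
The row-dropping rule described above can be sketched with plain pandas. This is only an illustration, not pythcat's implementation: `drop_sparse_rows` is a hypothetical name, and the sketch assumes that a row is dropped when its proportion of missing cells is strictly greater than the threshold.

```python
import pandas as pd


def drop_sparse_rows(df: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """Hypothetical helper: drop rows whose share of missing cells exceeds threshold."""
    # isna() marks missing cells; the row-wise mean is the proportion missing.
    missing_share = df.isna().mean(axis=1)
    # Assumption: rows exactly at the threshold are kept, rows above it are dropped.
    return df.loc[missing_share <= threshold]


data = pd.DataFrame({"X": [1, None, 2], "Y": [2, None, None], "Z": [1, 2, None]})
result = drop_sparse_rows(data, threshold=0.3)
print(result)
```

With the sample data, the second and third rows are each missing two of three values, so only the first row is kept.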

pythcat.pythcat.repwithna(df, rmvsym=False, format=None)

Replaces uninformative strings (e.g. empty strings like '') in the dataframe with NAs so that they can be removed as missing values. By default, only empty strings are replaced. If 'rmvsym' is set to True, strings containing only symbols are also replaced. If 'format' is set to a regular expression, all strings that do not match that format are replaced with NAs.

Parameters:

df (pandas.core.frame.DataFrame) – The input dataframe.
rmvsym (bool, default=False) – If True, also replace all strings containing only symbols.
format (str, default=None) – A regular expression describing the expected format of string values in the dataframe.

Returns:	The new dataframe with uninformative strings replaced with NAs
Return type:	pandas.core.frame.DataFrame

Examples

>>> data = pd.DataFrame([['Momo', 23], ['momo', 11]],
...                     columns = ['Name', 'Age'])
>>> pythcat.repwithna(data, format='^M.+')
    Name  Age
    Momo   23
    NaN   11
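
The replacement rules described above can be approximated cell by cell with pandas and the standard `re` module. The sketch below is not pythcat's code: `blank_uninformative` and its `fmt` argument are hypothetical names, and two details are assumptions, namely that "only symbols" means a string with no letters or digits, and that a string complies with the format when `re.match` finds a match at its start.

```python
import re

import numpy as np
import pandas as pd


def blank_uninformative(df, rmvsym=False, fmt=None):
    """Hypothetical helper that mirrors the documented behaviour with plain pandas."""

    def clean(value):
        # Only string cells are candidates; numbers and existing NaN pass through.
        if not isinstance(value, str):
            return value
        if value == "":
            return np.nan
        # Assumption: "only symbols" means no letters or digits (whitespace counts too).
        if rmvsym and re.fullmatch(r"[\W_]+", value):
            return np.nan
        # Assumption: compliance is tested with re.match, anchored at the start.
        if fmt is not None and re.match(fmt, value) is None:
            return np.nan
        return value

    # Map every column cell by cell; the input dataframe is left unchanged.
    return df.apply(lambda col: col.map(clean))


data = pd.DataFrame([["Momo", 23], ["momo", 11]], columns=["Name", "Age"])
result = blank_uninformative(data, fmt="^M.+")
print(result)
```

Because non-string cells are returned unchanged, the numeric Age column is unaffected by the format check.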

pythcat.pythcat.suscat(df, columns, n=1, num='percent')

Detects suspected erroneous numeric data in user-chosen columns of a dataframe.

Parameters:

df (pandas.core.frame.DataFrame) – The input dataframe.
columns (list or array) – Indices of the columns to test for suspected erroneous data.
n (int, default=1) – The amount of suspected values to return.
num (str, default='percent') – Specifies whether n is a number of rows ('number') or a percentage of values ('percent').

Returns:

Dictionary with column indices as keys and arrays of the row indices of suspected erroneous values as values.

Examples

>>> suscat(pd.DataFrame({'Age': [2, 23, 4, 11], 'Number': [11, 99, 23, 8]}), columns = [1], n = 2, num = 'number')
{1: [1, 3]}

>>> suscat(pd.DataFrame({'Age': [2, 23, 4, 11], 'Number': [11, 99, 23, 8]}), columns = [1], n = 25, num = 'percent')
{1: [2]}
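
This page does not say how suspected values are chosen. One plausible reading is sketched below, purely as an assumption: each value is scored by its absolute distance from the column mean, and the n highest-scoring rows are reported. `flag_suspects` is a hypothetical name, and rounding a percentage up to a whole number of rows is also an assumption. The sketch reproduces the result of the first example above but may differ from pythcat in other cases.

```python
import math

import pandas as pd


def flag_suspects(df, columns, n=1, num="percent"):
    """Hypothetical helper: rank values by their distance from the column mean."""
    suspects = {}
    for col in columns:
        values = df.iloc[:, col]  # columns are addressed by position
        # Assumption: a percentage is converted to a row count, rounded up.
        count = n if num == "number" else math.ceil(len(values) * n / 100)
        # Assumption: the score is the absolute deviation from the column mean.
        distance = (values - values.mean()).abs()
        rows = distance.nlargest(count).index
        suspects[col] = sorted(rows.tolist())
    return suspects


data = pd.DataFrame({"Age": [2, 23, 4, 11], "Number": [11, 99, 23, 8]})
result = flag_suspects(data, columns=[1], n=2, num="number")
print(result)
```

A scoring rule based on the median or on quantiles would be an equally reasonable choice; the mean is used here only to keep the sketch short.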

pythcat.pythcat.topcorr(df, k='all', method='pearson')

Generates a pandas dataframe with the top k correlated pairs of features.

Parameters:

df (pandas.core.frame.DataFrame) – The input dataframe.
k (str or int, default='all') – If k is an int, the number of top correlated feature pairs to return; if 'all', all pairs of features are returned, based on absolute correlation.
method (str, default='pearson') – The correlation method: 'pearson', 'kendall', or 'spearman'.

Returns:	The dataframe of the top k correlated feature pairs
Return type:	pandas.core.frame.DataFrame

Examples

>>> data = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})
>>> pythcat.topcorr(data, 1)
    Feature 1 Feature 2  Absolute Correlation
    y         x                   1.0
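
A ranking of feature pairs like the one above can be built from `DataFrame.corr`. The sketch below is an illustration rather than pythcat's implementation: `top_pairs` is a hypothetical name, the column labels are copied from the example output, and the input is assumed to contain only numeric columns. The order of the two features within a pair may differ from what pythcat returns.

```python
from itertools import combinations

import pandas as pd


def top_pairs(df, k="all", method="pearson"):
    """Hypothetical helper: rank feature pairs by absolute correlation."""
    corr = df.corr(method=method).abs()
    # combinations() yields each unordered pair once and skips self-pairs.
    rows = [(a, b, corr.loc[a, b]) for a, b in combinations(corr.columns, 2)]
    pairs = pd.DataFrame(
        rows, columns=["Feature 1", "Feature 2", "Absolute Correlation"]
    )
    pairs = pairs.sort_values(
        "Absolute Correlation", ascending=False, ignore_index=True
    )
    return pairs if k == "all" else pairs.head(k)


data = pd.DataFrame({"x": [1, 2], "y": [3, 4]})
result = top_pairs(data, 1)
print(result)
```

Taking the absolute value before sorting means strongly negative correlations rank as high as strongly positive ones.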
