pythcat package

Submodules

pythcat.pythcat module

pythcat.pythcat.misscat(df, threshold)

Drops rows whose proportion of missing values exceeds a threshold

Parameters:
  • df (pandas.core.frame.DataFrame) – The input dataframe
  • threshold (float) – The minimum proportion of missing values a row must contain to be dropped.
Returns:

The dataframe after dropping rows with missing values

Return type:

pandas.core.frame.DataFrame

Examples

>>> data = pd.DataFrame(data = {"X": [1, None, 2],
"Y": [2, None, None], "Z": [1, 2, None]})
>>> pythcat.misscat(data, threshold = 0.3)
    X    Y    Z
    1.0  2.0  1.0
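
The row filter described above can be approximated in plain pandas. This is a minimal sketch, not pythcat's implementation: the helper name drop_sparse_rows is hypothetical, and it assumes a row is dropped when its share of missing cells is strictly greater than the threshold.

```python
import pandas as pd


def drop_sparse_rows(df, threshold):
    """Hypothetical helper: keep rows whose missing share is <= threshold."""
    # isna() marks missing cells; the row-wise mean is the missing proportion.
    missing_share = df.isna().mean(axis=1)
    # Assumption: a row is dropped only when its share exceeds the threshold.
    return df[missing_share <= threshold]


data = pd.DataFrame({"X": [1, None, 2], "Y": [2, None, None], "Z": [1, 2, None]})
kept = drop_sparse_rows(data, threshold=0.3)
```

With this data, rows 1 and 2 are each missing two of their three values, so only row 0 is kept.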
pythcat.pythcat.repwithna(df, rmvsym=False, format=None)

Replaces uninformative strings (e.g. empty strings like ‘’) in the dataframe with NAs so that they can be handled as missing values. By default, only empty strings are replaced. If ‘rmvsym’ is set to ‘True’, strings containing only symbols are also replaced. If ‘format’ is set to a regular expression, every string that does not match it is replaced with NA.

Parameters:
  • df (pandas.core.frame.DataFrame) – The input dataframe
  • rmvsym (bool, default=False) – If True, also replace strings containing only symbols
  • format (str, default=None) – A regular expression describing the format that string values in the dataframe are expected to match
Returns:

The new dataframe with uninformative strings replaced by NAs

Return type:

pandas.core.frame.DataFrame

Examples

>>> data = pd.DataFrame([['Momo', 23], ['momo', 11]],
    columns = ['Name', 'Age'])
>>> pythcat.repwithna(data, format='^M.+')
    Name  Age
    Momo   23
    NaN   11
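
The three replacement rules can be sketched with pandas and the standard re module. This is an illustration under stated assumptions, not the library's code: replace_uninformative is a hypothetical name, the parameter is called fmt to avoid shadowing the built-in format, "only symbols" is assumed to mean no letters, digits, underscores or whitespace, and the regular expression is assumed to be matched from the start of the string.

```python
import re

import numpy as np
import pandas as pd


def _clean(value, rmvsym, fmt):
    """Return NaN for an uninformative string, otherwise the value unchanged."""
    if not isinstance(value, str):
        return value  # non-string cells (numbers, existing NaN) pass through
    if value == "":
        return np.nan  # rule 1: empty strings are always replaced
    if rmvsym and re.fullmatch(r"[^\w\s]+", value):
        return np.nan  # rule 2 (assumed meaning): symbol-only strings
    if fmt is not None and re.match(fmt, value) is None:
        return np.nan  # rule 3: strings that do not match the format
    return value


def replace_uninformative(df, rmvsym=False, fmt=None):
    """Hypothetical helper mirroring the behaviour documented above."""
    out = df.copy()
    for col in out.columns:
        out[col] = out[col].map(lambda v: _clean(v, rmvsym, fmt))
    return out


data = pd.DataFrame([["Momo", 23], ["momo", 11]], columns=["Name", "Age"])
cleaned = replace_uninformative(data, fmt="^M.+")
```

Here 'momo' does not match '^M.+' and becomes NaN, while the numeric Age column is left untouched.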
pythcat.pythcat.suscat(df, columns, n=1, num='percent')

Detects suspected erroneous numeric data in user-chosen columns of a dataframe

Parameters:
  • df (pandas.core.frame.DataFrame) – The input dataframe
  • columns (list or array of int) – Indices of the columns to test for suspected erroneous data
  • n (int, default=1) – The amount of suspected values to return
  • num (str, default='percent') – Whether n is a number of rows (‘number’) or a percentage of values (‘percent’)
Returns:

A dictionary with column indices as keys and arrays of row indices of suspected erroneous values as values

Examples

>>> data = pd.DataFrame({'Age': [2, 23, 4, 11],
    'Number': [11, 99, 23, 8]})
>>> pythcat.suscat(data, columns = [1], n = 2, num = 'number')
{1: [1, 3]}
>>> pythcat.suscat(data, columns = [1], n = 25, num = 'percent')
{1: [2]}
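
This reference does not state how suspected values are ranked. Purely to illustrate how the n and num parameters could interact, the sketch below assumes one possible rule: rank values by their absolute distance from the column median. The name rank_suspects and the ranking rule are assumptions, so its results need not agree with those of pythcat.

```python
import math

import pandas as pd


def rank_suspects(df, columns, n=1, num="percent"):
    """Hypothetical helper: flag the values farthest from the column median."""
    if num == "percent":
        # Assumption: n percent of the rows, rounded up to a whole count.
        count = math.ceil(len(df) * n / 100)
    elif num == "number":
        count = n
    else:
        raise ValueError("num must be 'percent' or 'number'")

    result = {}
    for col in columns:
        values = df.iloc[:, col]  # select the column by position
        deviation = (values - values.median()).abs()
        top = deviation.sort_values(ascending=False).head(count)
        result[col] = sorted(top.index.tolist())
    return result


data = pd.DataFrame({"Age": [2, 23, 4, 11], "Number": [11, 99, 23, 8]})
suspects = rank_suspects(data, columns=[1], n=2, num="number")
```

Under the median rule, the two values of Number farthest from the median (17) are 99 and 8, at row indices 1 and 3.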

pythcat.pythcat.topcorr(df, k='all', method='pearson')

Generates a pandas dataframe with the top k correlated pairs of features

Parameters:
  • df (pandas.core.frame.DataFrame) – The input dataframe
  • k (str or int, default = 'all') – If an int, the number of most correlated feature pairs to return; if ‘all’, return all feature pairs, ordered by absolute correlation
  • method (str, default = 'pearson') – The correlation method: one of ‘pearson’, ‘kendall’, or ‘spearman’
Returns:

The dataframe of the top k correlated feature pairs

Return type:

pandas.core.frame.DataFrame

Examples

>>> data = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})
>>> pythcat.topcorr(data, 1)
    Feature 1 Feature 2  Absolute Correlation
    y         x                   1.0
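
A comparable table can be built from pandas' own DataFrame.corr. The sketch below is an approximation rather than pythcat's implementation: top_correlated_pairs is a hypothetical name, each unordered pair is listed once in column order (so the Feature 1 / Feature 2 ordering may differ from the output above), and the result column names are copied from that output.

```python
import itertools

import pandas as pd


def top_correlated_pairs(df, k="all", method="pearson"):
    """Hypothetical helper: rank feature pairs by absolute correlation."""
    corr = df.corr(method=method).abs()
    # Visit each unordered pair of columns exactly once.
    rows = [
        (a, b, corr.loc[a, b])
        for a, b in itertools.combinations(corr.columns, 2)
    ]
    pairs = pd.DataFrame(
        rows, columns=["Feature 1", "Feature 2", "Absolute Correlation"]
    )
    pairs = pairs.sort_values("Absolute Correlation", ascending=False)
    pairs = pairs.reset_index(drop=True)
    return pairs if k == "all" else pairs.head(k)


data = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 6, 9], "z": [4, 1, 3, 2]})
top = top_correlated_pairs(data, k=1)
```

For this data, x and y are almost perfectly linearly related, so they form the top pair; calling the helper with the default k returns all three pairs.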

Module contents