pythcat package
Submodules
pythcat.pythcat module
-
pythcat.pythcat.misscat(df, threshold)
Drops rows whose proportion of missing values exceeds a threshold.
Parameters: - df (pandas.core.frame.DataFrame) – The input dataframe
- threshold (float) – The proportion of missing values in a row above which the row is dropped.
Returns: The dataframe after the rows with too many missing values have been dropped
Return type: pandas.core.frame.DataFrame
Examples
>>> data = pd.DataFrame(data = {"X": [1, None, 2], "Y": [2, None, None], "Z": [1, 2, None]})
>>> pythcat.misscat(data, threshold = 0.3)
  X   Y   Z
1.0 2.0 1.0
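The row-dropping rule described above can be sketched with plain pandas. This is only an illustration of the idea, not the package's implementation: the helper name drop_sparse_rows is hypothetical, and it assumes that a row is dropped only when its share of missing values is strictly greater than the threshold.

```python
import pandas as pd


def drop_sparse_rows(df: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """Hypothetical stand-in for misscat (assumed rule: drop if share > threshold)."""
    # Fraction of missing cells in each row, between 0.0 and 1.0.
    missing_share = df.isna().mean(axis=1)
    # Keep only the rows whose missing share does not exceed the threshold.
    return df[missing_share <= threshold]


data = pd.DataFrame({"X": [1, None, 2], "Y": [2, None, None], "Z": [1, 2, None]})
result = drop_sparse_rows(data, threshold=0.3)
print(result)
```

With the sample data, rows 1 and 2 each have two of three values missing, so only the first row is kept.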
-
pythcat.pythcat.repwithna(df, rmvsym=False, format=None)
Replaces uninformative strings (e.g. empty strings like ‘’) in the data frame with NAs so that they can be removed as missing values. By default, only empty strings are replaced. If ‘rmvsym’ is set to ‘True’, strings containing only symbols are also replaced. If ‘format’ is set to a regular expression, every string that does not match it is replaced with NA.
Parameters: - df (pandas.core.frame.DataFrame) – The input dataframe
- rmvsym (boolean, default=False) – If True, also replace strings containing only symbols
- format (String, default=None) – A regular expression describing the format that string values in the data frame must follow
Returns: The new dataframe with uninformative strings replaced with NAs
Return type: pandas.core.frame.DataFrame
Examples
>>> data = pd.DataFrame([['Momo', 23], ['momo', 11]], columns = ['Name', 'Age'])
>>> pythcat.repwithna(data, format='^M.+')
Name Age
Momo  23
 NaN  11
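The three replacement rules described above could be sketched roughly as follows. The helper name replace_uninformative is hypothetical and this is not the package's own code. Two points are assumptions: a "symbol" is taken to mean any character that is neither a word character nor whitespace, and a string is treated as compliant when re.search finds the pattern in it.

```python
import re

import numpy as np
import pandas as pd


def replace_uninformative(df, rmvsym=False, format=None):
    """Hypothetical stand-in for repwithna; argument names mirror the documented signature."""

    def clean(value):
        if not isinstance(value, str):
            return value  # non-string cells are left untouched
        if value == "":
            return np.nan  # default rule: empty strings become NA
        # Assumption: "symbols" are characters that are neither word characters nor whitespace.
        if rmvsym and re.fullmatch(r"[^\w\s]+", value):
            return np.nan
        # Assumption: a string complies when the pattern is found by re.search.
        if format is not None and re.search(format, value) is None:
            return np.nan
        return value

    # Apply the cell-level rule column by column.
    return df.apply(lambda col: col.map(clean))


data = pd.DataFrame([["Momo", 23], ["momo", 11]], columns=["Name", "Age"])
cleaned = replace_uninformative(data, format="^M.+")
print(cleaned)

symbols = pd.DataFrame({"A": ["", "??", "ok"]})
print(replace_uninformative(symbols, rmvsym=True))
```

In the first call only 'momo' fails the pattern and becomes NaN; in the second, both the empty string and '??' are replaced.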
-
pythcat.pythcat.suscat(df, columns, n=1, num='percent')
Detects suspected erroneous numeric data in user-chosen columns of a dataframe.
Parameters: - df (pandas.core.frame.DataFrame) – The input dataframe
- columns (list or array) – The indices of the columns to test for suspected erroneous data
- n (int, default=1) – The amount of suspected values to return
- num (str, default='percent') – Optional; specifies whether n is a number of rows (‘number’) or a percentage of values (‘percent’)
Returns: A dictionary with column indices as keys and arrays of row indices of the suspected erroneous values as values
Return type: dict
Examples
>>> suscat(pd.DataFrame({'Age': [2, 23, 4, 11], 'Number': [11, 99, 23, 8]}), columns = [1], n = 2, num = 'number')
{1: [1, 3]}
>>> suscat(pd.DataFrame({'Age': [2, 23, 4, 11], 'Number': [11, 99, 23, 8]}), columns = [1], n = 25, num = 'percent')
{1: [2]}
-
pythcat.pythcat.topcorr(df, k='all', method='pearson')
Generates a pandas dataframe with the top k correlated pairs of features.
Parameters: - df (pandas.core.frame.DataFrame) – The input dataframe
- k (str or int, default = 'all') – If k is an int, the number of top correlated feature pairs to return; if ‘all’, return all pairs of features, ranked by absolute correlation
- method (str, default = 'pearson') – The correlation method; one of ‘pearson’, ‘kendall’, or ‘spearman’
Returns: The dataframe of the top k correlated feature pairs
Return type: pandas.core.frame.DataFrame
Examples
>>> data = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})
>>> pythcat.topcorr(data, 1)
Feature 1 Feature 2 Absolute Correlation
        y         x                  1.0
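One way to picture the ranking of feature pairs is the sketch below, built on pandas' DataFrame.corr. The helper name top_correlated_pairs is hypothetical, and the package may order the two features within a pair differently (the example above lists y before x); only the column labels are taken from the documented output.

```python
from itertools import combinations

import pandas as pd


def top_correlated_pairs(df, k="all", method="pearson"):
    """Hypothetical stand-in for topcorr: rank feature pairs by absolute correlation."""
    # Absolute pairwise correlations between all numeric columns.
    corr = df.corr(method=method).abs()
    # Visit each unordered pair of features once (no self-pairs, no mirrored duplicates).
    rows = [(a, b, corr.loc[a, b]) for a, b in combinations(corr.columns, 2)]
    pairs = pd.DataFrame(rows, columns=["Feature 1", "Feature 2", "Absolute Correlation"])
    # Strongest relationships first.
    pairs = pairs.sort_values("Absolute Correlation", ascending=False).reset_index(drop=True)
    return pairs if k == "all" else pairs.head(k)


data = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 6, 8.5], "z": [4, 1, 3, 2]})
top = top_correlated_pairs(data, k=1)
print(top)
print(top_correlated_pairs(data))
```

Here x and y are almost perfectly linearly related, so that pair ranks first, while the two pairs involving z have much weaker correlations.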