
data-analysis
by IbIFACE-Tech
Paracle is a framework for building AI-native apps and projects.
SKILL.md
---
name: data-analysis
description: Analyze and interpret data to generate meaningful insights using statistical methods and visualization. Use when working with datasets, metrics, statistics, or when insights from data are needed.
license: Apache-2.0
compatibility: Best with pandas, numpy, matplotlib. Requires file_system and code_executor tools.
metadata:
  author: paracle
  version: "1.0.0"
  category: analysis
  level: advanced
  display_name: "Data Analysis"
tags:
  - analytics
  - statistics
  - insights
  - data
  - intelligence
capabilities:
  - statistical_analysis
  - pattern_recognition
  - data_visualization
  - insight_generation
  - correlation_analysis
requirements:
  - skill_name: question-answering
    min_level: basic
allowed-tools: Read Write Bash(python:*) Bash(pandas:*) Bash(numpy:*)
---
# Data Analysis Skill

## When to use this skill

Use this skill when:
- Analyzing datasets (CSV, JSON, Excel, databases)
- Calculating statistics (mean, median, mode, standard deviation)
- Identifying patterns and trends
- Detecting anomalies or outliers
- Generating insights from data
- Creating visualizations
- Comparing groups or segments
- Performing correlation analysis
## Core capabilities

### 1. Descriptive Statistics

Calculate summary statistics to understand data distribution:
```python
import pandas as pd
import numpy as np


def analyze_dataset(data: pd.DataFrame) -> dict:
    """Generate comprehensive statistical summary.

    Args:
        data: DataFrame to analyze

    Returns:
        Dictionary with statistical metrics
    """
    stats = {
        'shape': data.shape,
        'columns': list(data.columns),
        'dtypes': data.dtypes.to_dict(),
        'missing_values': data.isnull().sum().to_dict(),
        'numeric_summary': {},
        'categorical_summary': {}
    }

    # Numeric columns analysis
    numeric_cols = data.select_dtypes(include=[np.number]).columns
    for col in numeric_cols:
        stats['numeric_summary'][col] = {
            'mean': data[col].mean(),
            'median': data[col].median(),
            'std': data[col].std(),
            'min': data[col].min(),
            'max': data[col].max(),
            'q25': data[col].quantile(0.25),
            'q75': data[col].quantile(0.75),
        }

    # Categorical columns analysis
    categorical_cols = data.select_dtypes(include=['object', 'category']).columns
    for col in categorical_cols:
        stats['categorical_summary'][col] = {
            'unique_values': data[col].nunique(),
            'most_common': data[col].mode().iloc[0] if len(data[col].mode()) > 0 else None,
            'distribution': data[col].value_counts().head(5).to_dict()
        }

    return stats


# Usage
df = pd.read_csv('sales_data.csv')
stats = analyze_dataset(df)
print(f"Dataset shape: {stats['shape']}")
print(f"Missing values: {stats['missing_values']}")
```
### 2. Pattern Recognition

Identify trends and patterns in time series or sequential data:
```python
def detect_trend(data: pd.Series) -> dict:
    """Detect trend direction and strength.

    Args:
        data: Time series data

    Returns:
        Dict with trend direction, slope, and R²
    """
    from scipy import stats as sp_stats

    x = np.arange(len(data))
    y = data.values

    # Remove NaN values
    mask = ~np.isnan(y)
    x_clean = x[mask]
    y_clean = y[mask]

    if len(x_clean) < 2:
        # Not enough points for a regression; keep the keys consistent for callers
        return {'direction': 'insufficient_data', 'slope': None,
                'r_squared': None, 'p_value': None, 'significant': False}

    # Linear regression
    slope, intercept, r_value, p_value, std_err = sp_stats.linregress(x_clean, y_clean)

    trend = {
        'direction': 'increasing' if slope > 0 else 'decreasing' if slope < 0 else 'flat',
        'slope': slope,
        'r_squared': r_value ** 2,
        'p_value': p_value,
        'significant': p_value < 0.05
    }
    return trend


# Usage
monthly_sales = pd.Series([100, 120, 115, 135, 150, 145, 170, 180])
trend = detect_trend(monthly_sales)
print(f"Trend: {trend['direction']} (R²={trend['r_squared']:.3f})")
### 3. Anomaly Detection

Find outliers and unusual data points:
```python
def detect_outliers(data: pd.Series, method: str = 'iqr') -> pd.Series:
    """Detect outliers using IQR or Z-score method.

    Args:
        data: Data series to check
        method: 'iqr' (Interquartile Range) or 'zscore'

    Returns:
        Boolean series marking outliers as True
    """
    if method == 'iqr':
        q1 = data.quantile(0.25)
        q3 = data.quantile(0.75)
        iqr = q3 - q1
        lower_bound = q1 - 1.5 * iqr
        upper_bound = q3 + 1.5 * iqr
        outliers = (data < lower_bound) | (data > upper_bound)
    elif method == 'zscore':
        z_scores = np.abs((data - data.mean()) / data.std())
        outliers = z_scores > 3
    else:
        raise ValueError(f"Unknown method: {method}")
    return outliers


# Usage
prices = pd.Series([100, 105, 102, 110, 500, 108, 103, 107])  # 500 is an outlier
outliers = detect_outliers(prices)
print(f"Outliers detected: {prices[outliers].tolist()}")
```
### 4. Correlation Analysis

Understand relationships between variables:
```python
def analyze_correlations(data: pd.DataFrame, threshold: float = 0.5) -> dict:
    """Find strong correlations between numeric columns.

    Args:
        data: DataFrame with numeric columns
        threshold: Minimum absolute correlation value

    Returns:
        Dict with correlation matrix and strong correlations
    """
    # Compute correlation matrix
    corr_matrix = data.select_dtypes(include=[np.number]).corr()

    # Find strong correlations (upper triangle only, to avoid duplicate pairs)
    strong_correlations = []
    for i in range(len(corr_matrix.columns)):
        for j in range(i + 1, len(corr_matrix.columns)):
            col1 = corr_matrix.columns[i]
            col2 = corr_matrix.columns[j]
            corr_value = corr_matrix.iloc[i, j]
            if abs(corr_value) >= threshold:
                strong_correlations.append({
                    'var1': col1,
                    'var2': col2,
                    'correlation': corr_value,
                    'strength': 'strong' if abs(corr_value) > 0.7 else 'moderate'
                })

    return {
        'correlation_matrix': corr_matrix,
        'strong_correlations': strong_correlations
    }


# Usage
df = pd.DataFrame({
    'sales': [100, 150, 200, 250, 300],
    'marketing_spend': [10, 15, 25, 30, 40],
    'temperature': [20, 22, 19, 21, 23]
})
result = analyze_correlations(df, threshold=0.5)
print(f"Strong correlations found: {len(result['strong_correlations'])}")
```
## Complete analysis workflow

### Step 1: Load and inspect data
```python
import pandas as pd

# Load data
df = pd.read_csv('data.csv')

# Basic inspection
print(f"Shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(f"\nFirst rows:\n{df.head()}")
print(f"\nData types:\n{df.dtypes}")
print(f"\nMissing values:\n{df.isnull().sum()}")
```
### Step 2: Clean data
```python
import numpy as np


def clean_dataset(df: pd.DataFrame) -> pd.DataFrame:
    """Clean dataset by handling missing values and duplicates."""
    df_clean = df.copy()

    # Remove duplicates
    df_clean = df_clean.drop_duplicates()

    # Handle missing values
    # For numeric columns: fill with the median
    numeric_cols = df_clean.select_dtypes(include=[np.number]).columns
    for col in numeric_cols:
        df_clean[col] = df_clean[col].fillna(df_clean[col].median())

    # For categorical columns: fill with the mode
    categorical_cols = df_clean.select_dtypes(include=['object']).columns
    for col in categorical_cols:
        df_clean[col] = df_clean[col].fillna(df_clean[col].mode()[0])

    return df_clean


df_clean = clean_dataset(df)
```
### Step 3: Analyze
```python
# Get summary statistics
summary = analyze_dataset(df_clean)

# Detect outliers
for col in df_clean.select_dtypes(include=[np.number]).columns:
    outliers = detect_outliers(df_clean[col])
    print(f"{col}: {outliers.sum()} outliers detected")

# Check correlations
corr_results = analyze_correlations(df_clean)
print("\nStrong correlations:")
for corr in corr_results['strong_correlations']:
    print(f"  {corr['var1']} <-> {corr['var2']}: {corr['correlation']:.3f}")
```
### Step 4: Visualize (optional)
```python
import matplotlib.pyplot as plt


def create_visualization(df: pd.DataFrame, target_col: str):
    """Create comprehensive visualization."""
    fig, axes = plt.subplots(2, 2, figsize=(12, 10))

    # Distribution plot
    axes[0, 0].hist(df[target_col], bins=30, edgecolor='black')
    axes[0, 0].set_title(f'{target_col} Distribution')
    axes[0, 0].set_xlabel(target_col)
    axes[0, 0].set_ylabel('Frequency')

    # Box plot
    axes[0, 1].boxplot(df[target_col])
    axes[0, 1].set_title(f'{target_col} Box Plot')
    axes[0, 1].set_ylabel(target_col)

    # Time series (if applicable)
    axes[1, 0].plot(df.index, df[target_col])
    axes[1, 0].set_title(f'{target_col} Over Time')
    axes[1, 0].set_xlabel('Index')
    axes[1, 0].set_ylabel(target_col)

    # Correlation heatmap
    corr = df.select_dtypes(include=[np.number]).corr()
    im = axes[1, 1].imshow(corr, cmap='coolwarm', aspect='auto')
    axes[1, 1].set_title('Correlation Matrix')
    plt.colorbar(im, ax=axes[1, 1])

    plt.tight_layout()
    plt.savefig(f'{target_col}_analysis.png')
    print(f"Visualization saved to {target_col}_analysis.png")
```
### Step 5: Generate insights
```python
def generate_insights(df: pd.DataFrame, target_col: str) -> list:
    """Generate actionable insights from analysis."""
    insights = []

    # Check data quality
    missing_pct = (df[target_col].isnull().sum() / len(df)) * 100
    if missing_pct > 10:
        insights.append(f"⚠️ High missing data rate: {missing_pct:.1f}%")

    # Check distribution
    skewness = df[target_col].skew()
    if abs(skewness) > 1:
        direction = "right" if skewness > 0 else "left"
        insights.append(f"📊 Distribution is skewed {direction} (skewness: {skewness:.2f})")

    # Check trend
    if len(df) >= 10:
        trend = detect_trend(df[target_col])
        if trend['significant']:
            insights.append(f"📈 Significant {trend['direction']} trend detected (p={trend['p_value']:.4f})")

    # Check outliers
    outliers = detect_outliers(df[target_col])
    outlier_pct = (outliers.sum() / len(df)) * 100
    if outlier_pct > 5:
        insights.append(f"🔍 Outliers detected: {outliers.sum()} ({outlier_pct:.1f}%)")

    # Check variability
    cv = (df[target_col].std() / df[target_col].mean()) * 100
    if cv > 50:
        insights.append(f"📉 High variability detected (CV: {cv:.1f}%)")

    return insights


insights = generate_insights(df_clean, 'sales')
for insight in insights:
    print(insight)
```
## Best practices

- **Always inspect data first** - Understand structure before analysis (see the quick-check sketch after this list)
- **Clean data thoroughly** - Handle missing values, duplicates, outliers
- **Document assumptions** - Note any data transformations or filters
- **Validate results** - Cross-check statistical findings
- **Consider context** - Interpret numbers in business context
- **Visualize when helpful** - Charts reveal patterns quickly
- **Check for bias** - Ensure representative sampling
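A quick-check sketch that rolls a few of these practices into one pre-analysis report. The thresholds are illustrative, not standards:

```python
def quick_quality_check(df: pd.DataFrame) -> None:
    """Print a short pre-analysis quality report (illustrative thresholds)."""
    n_rows, n_cols = df.shape
    print(f"Rows: {n_rows}, Columns: {n_cols}")
    print(f"Duplicate rows: {df.duplicated().sum()}")

    # Report missing-value rates per column
    missing_pct = df.isnull().mean() * 100
    for col, pct in missing_pct[missing_pct > 0].items():
        print(f"Missing in {col}: {pct:.1f}%")

    # Arbitrary small-sample warning
    if n_rows < 30:
        print("Warning: small sample; treat statistical results with caution")


quick_quality_check(df_clean)
```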
## Common pitfalls

- ❌ **Correlation ≠ Causation**: High correlation doesn't imply causation
- ❌ **Cherry-picking**: Don't select only favorable results
- ❌ **Ignoring outliers**: Investigate outliers, don't just remove them
- ❌ **Overfitting**: Avoid finding patterns in noise
- ❌ **Sample size**: Ensure sufficient data for statistical significance (see the sketch below)
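To make the sample-size and overfitting points concrete, a short sketch showing that a seemingly sizeable correlation between a handful of random points is usually not statistically meaningful (illustrative only; `pearsonr` comes from scipy):

```python
from scipy.stats import pearsonr
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=5)
y = rng.normal(size=5)  # unrelated to x by construction

# With only 5 points, even a sizeable |r| is rarely significant.
r, p = pearsonr(x, y)
print(f"r = {r:.2f}, p = {p:.3f}")
```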
## Related skills
- code-generation: For creating analysis scripts
- text-summarization: For summarizing findings
- api-integration: For fetching external data
## Required libraries

```bash
pip install pandas numpy scipy matplotlib seaborn
```