Reinforcement learning agents for cryptocurrency trading, built on modern deep learning techniques.
The RL system implements three advanced agents that learn optimal trading strategies through interaction with a realistic market environment:
- DQN Agent: Deep Q-Network with Rainbow improvements
- PPO Agent: Proximal Policy Optimization with GAE
- Transformer Agent: Multi-head attention for sequential decision making
The trading environment provides a realistic market simulation (see the interaction sketch after this list) with:
- Continuous action space (position sizing from -100% to +100%)
- Transaction costs (0.1%) and slippage (0.05%)
- Market impact modeling
- Risk-adjusted rewards (Sharpe ratio, drawdown penalty)
- Portfolio state tracking
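A minimal interaction sketch, assuming a Gym-style `reset`/`step` API and that `data` is an OHLCV DataFrame as in the quick start below; the exact return signature of `step` is an assumption:

```python
import numpy as np

from cryptvault.rl import TradingEnvironment

env = TradingEnvironment(data, initial_balance=100000)
state = env.reset()

done = False
while not done:
    # Actions are position sizes in [-1.0, +1.0]
    # (-100% short to +100% long of the portfolio).
    action = np.random.uniform(-1.0, 1.0)
    state, reward, done, info = env.step(action)
```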
DQN Agent (Rainbow improvements):
- Dueling architecture (separate value and advantage streams)
- Noisy layers for exploration (no epsilon-greedy needed)
- Prioritized experience replay
- Double Q-learning
- N-step returns
PPO Agent:
- Actor network with Gaussian policy
- Critic network for value estimation
- Layer normalization
- Residual connections
Transformer Agent:
- Multi-head self-attention (8 heads)
- Positional encoding
- 4 transformer blocks
- Shared actor-critic architecture
Training features:
- Curriculum learning
- Early stopping (see the sketch after this list)
- Checkpointing
- Performance tracking
- Multi-agent comparison
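The trainer handles these internally; as an illustration of the early-stopping idea, a generic patience-based monitor on a validation metric (e.g. Sharpe ratio) could look like this (names are hypothetical, not the library's API):

```python
class EarlyStopping:
    """Stop training when the monitored metric stops improving."""

    def __init__(self, patience: int = 20, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, metric: float) -> bool:
        """Feed the latest validation metric; returns True when training should stop."""
        if metric > self.best + self.min_delta:
            self.best = metric
            self.bad_epochs = 0  # improvement: reset the patience counter
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```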
Quick start:

```python
from cryptvault.rl import TradingEnvironment, RLTrainer
from cryptvault.data.fetchers import CryptoDataFetcher

# Fetch data
fetcher = CryptoDataFetcher()
data = fetcher.fetch_data("BTC", days=200)

# Create environment
env = TradingEnvironment(data, initial_balance=100000)

# Train PPO agent
trainer = RLTrainer(env, agent_type="ppo")
stats = trainer.train(num_episodes=1000)

print(f"Final Return: {stats['final_return']:.2%}")
print(f"Sharpe Ratio: {stats['final_sharpe']:.2f}")
```

Compare all agents:

```python
from cryptvault.rl import compare_agents

comparison_df = compare_agents(
    data,
    agent_types=["dqn", "ppo", "transformer"],
    num_episodes=500,
    num_eval_episodes=20,
)
print(comparison_df)
```

Train an ensemble of all agents:

```python
from cryptvault.rl import train_ensemble

ensemble = train_ensemble(data, num_episodes=500)
print(f"Ensemble size: {ensemble['num_agents']}")
```

Noisy networks (a layer sketch follows this list):
- Parametric noise for exploration
- No epsilon-greedy needed
- Better exploration in continuous spaces
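A minimal sketch of a factorized noisy linear layer in PyTorch (Fortunato et al., 2017); not the repository's implementation:

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class NoisyLinear(nn.Module):
    """Linear layer with learned, factorized Gaussian parameter noise."""

    def __init__(self, in_features: int, out_features: int, sigma0: float = 0.5):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.weight_mu = nn.Parameter(torch.empty(out_features, in_features))
        self.weight_sigma = nn.Parameter(torch.empty(out_features, in_features))
        self.bias_mu = nn.Parameter(torch.empty(out_features))
        self.bias_sigma = nn.Parameter(torch.empty(out_features))
        bound = 1.0 / math.sqrt(in_features)
        nn.init.uniform_(self.weight_mu, -bound, bound)
        nn.init.uniform_(self.bias_mu, -bound, bound)
        nn.init.constant_(self.weight_sigma, sigma0 * bound)
        nn.init.constant_(self.bias_sigma, sigma0 * bound)

    @staticmethod
    def _f(x: torch.Tensor) -> torch.Tensor:
        return x.sign() * x.abs().sqrt()  # factorized-noise transform

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Fresh noise per forward pass; exploration comes from the noise,
        # so no epsilon-greedy schedule is needed.
        eps_in = self._f(torch.randn(self.in_features, device=x.device))
        eps_out = self._f(torch.randn(self.out_features, device=x.device))
        weight = self.weight_mu + self.weight_sigma * torch.outer(eps_out, eps_in)
        bias = self.bias_mu + self.bias_sigma * eps_out
        return F.linear(x, weight, bias)
```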
Prioritized experience replay (sketch below):
- Sample important transitions more frequently
- Importance sampling weights
- Faster learning
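A sketch of proportional prioritization with importance-sampling weights; a production buffer would use a sum-tree for O(log N) sampling, this linear version only shows the math:

```python
import numpy as np


def per_sample(priorities: np.ndarray, batch_size: int,
               alpha: float = 0.6, beta: float = 0.4):
    """Sample transition indices by priority; return importance-sampling weights."""
    probs = priorities ** alpha
    probs /= probs.sum()
    idx = np.random.choice(len(priorities), batch_size, p=probs)
    # w_i = (N * P(i))^(-beta), normalized by the max weight for stability
    weights = (len(priorities) * probs[idx]) ** (-beta)
    weights /= weights.max()
    return idx, weights
```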
Transformer attention (sketch below):
- Capture temporal dependencies
- Learn market patterns
- Attention visualization
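An illustrative causal self-attention step over a window of encoded market features, using PyTorch's built-in `nn.MultiheadAttention` with the dimensions listed in the hyperparameters below (not the repository's exact module):

```python
import torch
import torch.nn as nn

seq_len, d_model, num_heads = 64, 256, 8
attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

x = torch.randn(1, seq_len, d_model)  # one window of encoded market features
# Causal mask: each step may attend only to itself and earlier steps.
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
out, attn_weights = attn(x, x, x, attn_mask=mask)
# attn_weights (averaged over heads) can be plotted for attention visualization
```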
Generalized Advantage Estimation (GAE, sketch below):
- Bias-variance tradeoff
- Smoother value estimates
- Better policy gradients
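A standard GAE computation (Schulman et al., 2015); `gamma` and `lam` match the PPO hyperparameters listed further down. A sketch, not the library's exact code:

```python
import numpy as np


def compute_gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout."""
    advantages = np.zeros(len(rewards))
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        mask = 1.0 - dones[t]  # zero the bootstrap at episode ends
        delta = rewards[t] + gamma * next_value * mask - values[t]
        gae = delta + gamma * lam * mask * gae
        advantages[t] = gae
        next_value = values[t]
    returns = advantages + values  # targets for the value function
    return advantages, returns
```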
Dueling architecture (sketch below):
- Separate value and advantage
- Better Q-value estimates
- Faster convergence
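A minimal dueling head in PyTorch, combining the two streams with the usual mean-advantage correction; illustrative rather than the repository's implementation:

```python
import torch
import torch.nn as nn


class DuelingHead(nn.Module):
    """Combine value and advantage streams into Q-values."""

    def __init__(self, hidden_dim: int, num_actions: int):
        super().__init__()
        self.value = nn.Linear(hidden_dim, 1)
        self.advantage = nn.Linear(hidden_dim, num_actions)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        v = self.value(features)      # V(s)
        a = self.advantage(features)  # A(s, a)
        # Subtract the mean advantage so V and A are identifiable.
        return v + a - a.mean(dim=-1, keepdim=True)
```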
The system tracks the following performance metrics (a computation sketch follows the list):
- Total return (%)
- Sharpe ratio
- Sortino ratio
- Maximum drawdown (%)
- Win rate (%)
- Number of trades
- Average return per trade
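A sketch of how the risk metrics can be computed from per-period portfolio returns, assuming daily bars and 365 periods per year (crypto markets trade continuously); illustrative, not the library's exact implementation:

```python
import numpy as np


def performance_metrics(returns: np.ndarray, periods_per_year: int = 365):
    """Sharpe, Sortino, and max drawdown from per-period returns."""
    ann = np.sqrt(periods_per_year)
    sharpe = ann * returns.mean() / (returns.std() + 1e-9)
    downside = returns[returns < 0].std() + 1e-9  # penalize only downside moves
    sortino = ann * returns.mean() / downside
    equity = np.cumprod(1.0 + returns)            # portfolio equity curve
    drawdown = 1.0 - equity / np.maximum.accumulate(equity)
    return {"sharpe": sharpe, "sortino": sortino, "max_drawdown": drawdown.max()}
```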
DQN hyperparameters:
- Learning rate: 1e-4
- Gamma: 0.99
- Tau (soft update): 0.005
- Buffer size: 100,000
- Batch size: 128
- N-step: 3
PPO hyperparameters:
- Learning rate: 3e-4
- Gamma: 0.99
- GAE lambda: 0.95
- Clip epsilon: 0.2
- Value coefficient: 0.5
- Entropy coefficient: 0.01
Transformer hyperparameters (the defaults for all three agents are collected into config dicts after this list):
- Model dimension (d_model): 256
- Num heads: 8
- Num layers: 4
- Learning rate: 1e-4
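For reference, the defaults above collected into plain dicts; whether `RLTrainer` accepts a config object like this is an assumption, the dicts only make the listed values concrete:

```python
DQN_CONFIG = {
    "lr": 1e-4, "gamma": 0.99, "tau": 0.005,
    "buffer_size": 100_000, "batch_size": 128, "n_step": 3,
}
PPO_CONFIG = {
    "lr": 3e-4, "gamma": 0.99, "gae_lambda": 0.95,
    "clip_epsilon": 0.2, "value_coef": 0.5, "entropy_coef": 0.01,
}
TRANSFORMER_CONFIG = {
    "d_model": 256, "num_heads": 8, "num_layers": 4, "lr": 1e-4,
}
```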
Run comprehensive tests:

```bash
python tests/rl/test_rl_system.py
```

Tests include:
- Environment functionality
- Agent training
- Multi-agent comparison
- Baseline comparison (ML predictor, buy & hold)
Target metrics (after 500-1000 episodes):
- Return: >20% (the supervised ML predictor baseline achieves 2.22% MAPE)
- Sharpe Ratio: >2.0
- Win Rate: >60%
- Max Drawdown: <15%
Dependencies:

```
torch>=2.0.0
numpy>=1.24.0
pandas>=2.0.0
```

Optional:

```
matplotlib>=3.7.0  # For plotting
```
Define a custom reward function:

```python
def custom_reward(portfolio_value, sharpe, win_rate, drawdown):
    # Weighted blend of portfolio value, risk-adjusted performance,
    # and consistency, with a penalty on drawdown.
    return portfolio_value * 0.5 + sharpe * 0.3 + win_rate * 0.2 - drawdown * 2.0

env = TradingEnvironment(data, reward_fn=custom_reward)
```

Save and load trained agents:

```python
# Save
trainer.save_agent("best_model.pt")

# Load
trainer.load_agent("best_model.pt")
```

Plot training progress:

```python
trainer.plot_training_progress(save_path="training_plot.png")
```

Agent comparison:

| Feature | DQN | PPO | Transformer |
|---|---|---|---|
| Action Space | Discrete | Continuous | Continuous |
| Memory | Replay Buffer | On-Policy | On-Policy |
| Exploration | Noisy Nets | Gaussian | Gaussian |
| Complexity | Medium | Low | High |
| Training Speed | Fast | Medium | Slow |
| Sample Efficiency | High | Medium | Low |
Best practices:
- Start with PPO: Most stable and reliable
- Use Transformer for long sequences: Better temporal modeling
- DQN for discrete actions: Fast and sample efficient
- Train for 500+ episodes: RL needs time to learn
- Monitor Sharpe ratio: Better than raw returns
- Use early stopping: Prevent overfitting
- Ensemble multiple agents: Reduce variance
Troubleshooting:

If the agent is not learning:
- Increase training episodes
- Adjust reward function
- Tune hyperparameters
- Check data quality
If training is unstable:
- Reduce learning rate
- Increase batch size
- Use ensemble
- Add regularization
If training is too slow:
- Reduce network size
- Use GPU
- Decrease batch size
- Simplify environment
Future directions:
- Hierarchical RL (multi-timeframe)
- Meta-learning (adapt to new markets)
- Multi-asset portfolio optimization
- Risk-aware RL (CVaR, VaR)
- Offline RL (learn from historical data)
- Model-based RL (world models)
References:
- Rainbow DQN: Hessel et al., 2017
- PPO: Schulman et al., 2017
- Attention is All You Need: Vaswani et al., 2017
- Noisy Networks: Fortunato et al., 2017
- GAE: Schulman et al., 2015
For questions or issues, contact: contact@meridianalgo.org